
AI is only as powerful as the data that feeds it. Large Language Models (LLMs), generative AI applications, and specialized AI agents rely not just on the volume but also on the quality, relevance, and structure of data they receive. “Garbage in, garbage out” remains as true for AI as it ever was for analytics. Yet conventional approaches to data management, often scattered and siloed, fall short of AI’s requirements for agility, comprehensiveness, and trust.
In this context, data products have emerged as a pivotal tool. Data products are curated, well-documented, and governed datasets, packaged as reusable assets with clear ownership and defined quality standards. They provide the context, transparency, and reliability that AI agents demand.
Why data products matter for AI agents
AI agents take many forms, but typically they operate as autonomous software entities capable of interacting independently with enterprise systems. Although their value is still being proven in production, they have the potential to transform how organizations operate.
To operate effectively, AI agents require contextual data. Specifically, they need data with the following characteristics.
- Accessible: Data must be easily discoverable and retrievable across distributed systems, including on-premises systems, cloud systems, and SaaS platforms.
- Contextualized: AI agents require datasets that are not just raw but enriched with metadata, lineage, and business context.
- Governed and Secure: With sensitive or regulated data, organizations must ensure compliance and manage access at a granular level.
- Trusted and High-Quality: Reliable, up-to-date, and audited data minimizes risk, especially given that AI agents often act or provide insights without real-time human oversight.
Without data products, even the best AI agents are limited. They struggle to find and integrate the right data and often risk introducing errors, compliance issues, or perpetuating outdated business logic.
Foundations: modern data architectures for AI data products
Data is the true foundation of AI. Modern data products used for AI workflows are inseparable from the data platforms and architectures underpinning them. Enterprises have found that legacy data warehouses or scattered data lakes cannot meet the access, collaboration, and governance needs of today’s AI-driven workflows. New architectural paradigms, particularly the data lakehouse, offer the best path forward.
While there are many ways to categorize data architectures, two common types include:
- Centralized architecture: Where data is consolidated in a single repository like a data warehouse.
- Distributed architecture: Where data is stored across multiple locations but managed cohesively, typically using data federation.
Modern implementations often include data lakes, data warehouses, and the newer hybrid approach of data lakehouses, which combine the flexibility of data lakes with the reliability of data warehouses.
The lakehouse and its benefits
A data lakehouse combines the flexibility and scale of data lakes with the reliability and performance of data warehouses. It is built using open, interoperable technologies, including Apache Iceberg and Trino. This architecture offers key advantages:
- Federation and Integration: Access data where it lives, whether on-premises, across clouds, or inside SaaS applications.
- Centralized or Decentralized by Choice: Centralize critical datasets for performance, but federate broadly for access and experimentation, adapting to organizational needs on demand.
- Rich Governance and Metadata: Table formats such as Iceberg provide versioning, time travel, lineage tracking, and support for role-based and attribute-based access control.
- Performance at Scale: Optimized for both analytics and AI workloads, including near real-time ingestion and querying of massive datasets.
By adopting a data lakehouse as the foundation, organizations unify AI, analytics, and data applications on a single, flexible, and future-ready platform.
Building data products for AI agents: key requirements
Building effective data products for AI agents involves a considered approach to design, access, collaboration, and governance.
1. Curate with context and trust
Data products package data alongside comprehensive metadata. This provides several advantages, including:
- Ownership: Clear data stewards are identified.
- Documentation: Usage guidelines, definitions, and data provenance.
- Lineage: Details about source systems, transformations, and downstream consumers.
- Quality metrics: Freshness, completeness, and error rates.
This documentation ensures AI agents can interpret the meaning and trustworthiness of data, critical for autonomous operation.
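As a sketch of how this metadata might travel with the data itself, the following Python dataclass bundles ownership, lineage, and quality metrics into one descriptor an agent can inspect before reading. All field names and thresholds here are illustrative assumptions, not the API of any particular catalog:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative descriptor bundling a dataset with its governance metadata."""
    name: str
    owner: str                    # accountable data steward
    description: str              # usage guidelines and definitions
    upstream_sources: list[str] = field(default_factory=list)  # lineage
    freshness_hours: float = 0.0  # hours since last refresh
    completeness: float = 1.0     # fraction of non-null required fields
    error_rate: float = 0.0      # fraction of rows failing validation

    def is_trustworthy(self, max_age_hours: float = 24.0,
                       min_completeness: float = 0.95,
                       max_error_rate: float = 0.01) -> bool:
        """An agent can gate its own reads on these quality thresholds."""
        return (self.freshness_hours <= max_age_hours
                and self.completeness >= min_completeness
                and self.error_rate <= max_error_rate)

trades = DataProduct(
    name="trades_daily",
    owner="markets-data-team",
    description="Cleared trades, T+1, one row per execution",
    upstream_sources=["oms.trades", "fix.messages"],
    freshness_hours=6.0,
    completeness=0.99,
    error_rate=0.002,
)
print(trades.is_trustworthy())  # True under the default thresholds
```

The key design point is that the quality check lives alongside the data product, so an autonomous agent can decide for itself whether a dataset is fresh and complete enough to act on.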
2. Enable universal access without compromising security
AI agents often require synthesized views of data drawn from multiple platforms. Starburst data lakehouses support this by enabling federated queries across all sources. Instead of migrating all data up front, which is often slow, costly, and potentially at odds with regulatory requirements, the lakehouse approach lets agents query or pipeline only the data they need, exactly when they need it.
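The behavior is easy to picture with a toy federation layer: each source stays where it lives, and rows are pulled and joined only at query time. This is plain Python standing in for what an engine like Trino does with real connectors; the source and column names are hypothetical:

```python
# Toy stand-in for a federated query engine: each "connector" exposes
# rows from a distinct system without copying the data up front.
warehouse = [  # on-premises warehouse: customer master data
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]
saas_crm = [  # SaaS platform: open support tickets
    {"customer_id": 1, "ticket": "billing dispute"},
    {"customer_id": 1, "ticket": "API outage"},
]

def federated_join(left, right, key):
    """Join rows from two sources at query time, fetching nothing in advance."""
    index = {row[key]: row for row in left}
    return [index[r[key]] | r for r in right if r[key] in index]

result = federated_join(warehouse, saas_crm, "customer_id")
print(result)  # two joined rows, both for Acme Corp
```

In a real lakehouse the join is pushed down to a SQL engine, but the principle is the same: the agent sees one synthesized view without any up-front migration.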
3. Govern for compliance and security
As AI agents access more sensitive and distributed data, governance becomes essential. In this context, data products need to support:
- Fine-grained access controls: Only authorized agents or personas can view or act on particular data.
- Auditability: Every AI agent’s data access and activity is logged and traceable.
- Automated compliance: Tools like intelligent tagging enable policy-based enforcement at scale.
In heavily regulated sectors like finance, healthcare, and the public sector, this enables AI initiatives to launch and scale without increasing reputational or legal risk.
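A minimal sketch of the first two requirements, assuming a hypothetical policy table and role names, shows how fine-grained access control and auditability fit together: every access attempt is checked against policy and logged, whether or not it succeeds:

```python
from datetime import datetime, timezone

# Illustrative policy table: which agent roles may read which data products.
# Role and product names are hypothetical.
POLICIES = {
    "compliance-agent": {"trades_daily", "comms_metadata"},
    "support-agent": {"tickets"},
}
audit_log: list[dict] = []

def read(agent_role: str, product: str) -> bool:
    """Check the policy, and log every attempt whether or not it succeeds."""
    allowed = product in POLICIES.get(agent_role, set())
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_role,
        "product": product,
        "allowed": allowed,
    })
    return allowed

print(read("compliance-agent", "trades_daily"))  # True
print(read("support-agent", "trades_daily"))     # False: not in policy
print(len(audit_log))                            # 2: every attempt is traceable
```

Real platforms enforce this with role-based and attribute-based policies in the catalog rather than application code, but the invariant is the same: no access path bypasses the policy check or the audit trail.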
4. Support real-time and iterative workflows
For applications such as AI-driven personalization, anomaly detection, or real-time recommendations, stale data is almost as bad as no data. Modern data products should support both file ingestion and streaming ingestion, enabling fast, iterative data flows to create responsive, self-improving AI agents.
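A toy sketch of a data product that accepts both ingestion modes makes the point concrete. Batch loads and single streamed events land in the same table, and a freshness measure lets consumers reason about staleness (the class and field names are illustrative):

```python
import time

class IngestBuffer:
    """Toy data product accepting both batch file loads and streamed events."""
    def __init__(self):
        self.rows: list[dict] = []
        self.last_updated: float | None = None

    def ingest_batch(self, rows: list[dict]) -> None:
        """E.g. a nightly file load."""
        self.rows.extend(rows)
        self.last_updated = time.time()

    def ingest_event(self, row: dict) -> None:
        """E.g. a single streamed record, visible immediately."""
        self.rows.append(row)
        self.last_updated = time.time()

    def staleness_seconds(self) -> float:
        return float("inf") if self.last_updated is None else time.time() - self.last_updated

clicks = IngestBuffer()
clicks.ingest_batch([{"user": "a", "page": "/home"}, {"user": "b", "page": "/docs"}])
clicks.ingest_event({"user": "a", "page": "/pricing"})
print(len(clicks.rows))                # 3
print(clicks.staleness_seconds() < 1)  # True: fresh enough for real-time use
```

In production this role is played by table formats and streaming ingestion into the lakehouse, but the contract is the same: one product, two ingestion paths, and a staleness signal consumers can trust.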
Example scenarios
To illustrate how these principles work in practice, let’s examine three hypothetical enterprise scenarios.
Example 1: Financial services – AI agent for regulatory compliance review
A global bank wants to deploy an AI agent to monitor trade records and communications for potential compliance issues. Traditionally, relevant data—trading logs, emails, policy documents—is scattered across on-premises data warehouses, cloud-based SaaS platforms like Salesforce or ServiceNow, and shared drives.
With a data lakehouse and curated data products, the compliance AI agent can:
- Instantly access near real-time trading data, enriched communications metadata, and historical case files packaged as governed, documented data products.
- Rely on comprehensive lineage and audit logs to demonstrate compliance to regulators.
- Operate in strict alignment with geographic or business-unit restrictions, enforced at the data product level.
- Incorporate new data sources as needed, with minimal engineering involvement, making the system agile and responsive to new regulatory requirements.
Example 2: Healthcare – personalized patient care via AI agents
A healthcare network aims to deploy AI agents that help clinicians recommend treatment plans based on electronic medical records, imaging, prescription histories, and evolving clinical guidelines.
Key data products for this use case could comprise:
- Curated datasets for patient histories, anonymized outcome statistics, genomic test results, and the latest clinical protocol documents, each tagged and access-controlled.
- Data lineage tracking to show which recommendations drew from which source, supporting transparency and auditability.
- AI agents that can assemble contextualized views for each patient, while ensuring that privacy laws such as HIPAA are never violated because access policies are embedded within each data product.
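The lineage point above can be sketched in a few lines: each recommendation carries a record of which governed data products contributed to it, so an auditor can trace any output back to its sources. The decision logic here is a deliberate placeholder, and all product names are hypothetical:

```python
# Minimal lineage record: every recommendation remembers which governed
# data products it drew from, so an auditor can trace it back.
def recommend(patient_id: str, sources: dict[str, list]) -> dict:
    """Assemble a (trivially simplified) recommendation plus its lineage."""
    used = [name for name, rows in sources.items() if rows]  # sources that contributed
    return {
        "patient": patient_id,
        "recommendation": "review with clinician",  # placeholder decision logic
        "lineage": sorted(used),
    }

rec = recommend("p-001", {
    "patient_history": [{"visit": "2024-01-10"}],
    "genomic_results": [],                # empty: did not contribute
    "clinical_protocols": [{"rev": 12}],
})
print(rec["lineage"])  # ['clinical_protocols', 'patient_history']
```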
Example 3: Manufacturing – predictive maintenance with autonomous agents
A manufacturing company uses AI agents to predict equipment failures and schedule maintenance, drawing on sensor telemetry, production logs, maintenance history, and purchase orders.
Data products for this use case would:
- Aggregate and harmonize IoT device data from disparate sources, standardize formats, and apply cleaning and transformation logic.
- Clearly document how metrics are calculated, ensuring engineers and agents align on definitions for “failure” and “downtime.”
- Allow both real-time ingestion for anomaly detection and batch processing for trend analysis, letting the AI agents operate proactively and iteratively.
- Enforce strict role-based access so only appropriate agents or maintenance staff access sensitive process data.
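The harmonization step in the first bullet can be illustrated with a small transformation that standardizes temperature telemetry arriving in mixed units from different sensor fleets. Field names, sensor IDs, and units here are assumptions for the sake of the example:

```python
# Harmonize temperature telemetry arriving in mixed units from different
# sensor fleets; field and sensor names are illustrative.
def to_celsius(reading: dict) -> dict:
    value, unit = reading["value"], reading["unit"]
    if unit == "F":
        value = (value - 32) * 5 / 9
    elif unit != "C":
        raise ValueError(f"unknown unit: {unit}")
    return {"sensor": reading["sensor"], "temp_c": round(value, 2)}

raw = [
    {"sensor": "press-01", "value": 212.0, "unit": "F"},
    {"sensor": "press-02", "value": 95.5, "unit": "C"},
]
clean = [to_celsius(r) for r in raw]
print(clean)  # [{'sensor': 'press-01', 'temp_c': 100.0}, {'sensor': 'press-02', 'temp_c': 95.5}]
```

Baking this logic into the data product, rather than into each consuming agent, is what guarantees that every engineer and every agent sees “temperature” in the same units with the same cleaning rules applied.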
From business intelligence to agentic AI: data products as the bridge
Data products are not a niche tool for AI alone. They unify the needs of analytics, business intelligence, and AI under a single strategy. The same curated, governed assets that support dashboards and regulatory reporting also fuel LLM fine-tuning, Retrieval Augmented Generation, and agent-based automation.
As AI matures and organizations wish to shift more business processes to AI agents, the benefits multiply:
- Faster time-to-insight and action, as AI agents tap into governed, ready-to-use data products.
- Reduced friction between IT, data engineering, and business units, since the data product abstraction aligns technical, operational, and compliance perspectives.
- Simplified onboarding of new data sources or AI use cases, since the architecture is based on openness and modularity.
For organizations starting their journey:
- Assess your data landscape: Inventory current data silos, sources, and stakeholders.
- Adopt a data lakehouse platform: Move towards an open architecture combining Apache Iceberg and Trino.
- Build your first data products: Start with high-value, cross-team assets—customer records, transaction logs, support tickets.
- Automate governance: Implement role-based and attribute-based access controls, audit trails, and metadata enrichment from day one.
- Enable iterative data product development: Use natural language interfaces and AI assistants to facilitate data product creation and discovery for technical and non-technical users alike.
AI agents are poised to redefine enterprise productivity and business operations, but only when fueled by high-quality, trustworthy, and context-rich data. Building data products on modern, open lakehouse architectures represents the key to unlocking this potential.



