
Data products are curated, documented, and governed datasets designed for use by a broad audience within an organization, both humans and machines. Unlike raw data, data products serve as packaged, high-quality assets that are discoverable, governed, trustworthy, and reusable. Each data product bundles metadata, lineage, access controls, and contextual information, making it an ideal input for both analytics and AI workloads.
What are data products and how do they differ from raw data?
Data products are curated, documented, and governed datasets designed for broad use across an organization, which makes them useful both for humans running queries and for AI agents executing a wide range of AI workloads. In modern data architecture, they serve several roles:
- High-quality, trustworthy data assets that are discoverable and reusable.
- The foundation for AI agents and analytics workflows.
- Governed resources with clear lineage, metadata, and access controls.
- Programmatically accessible resources via standard interfaces like SQL or APIs.
- The bridge between raw data and business value.
- The basis for AI-driven decision-making and automation.
For AI agents, data products provide:
- Quality and reliability: Curated and governed datasets reduce the risk of AI errors or hallucinations.
- Discoverability and context: Metadata and lineage help agents understand data origins, transformations, and trustworthiness.
- Access and interoperability: Standard interfaces like SQL or APIs enable programmatic access without custom engineering for each use case.
In practical terms
Instead of querying dozens of disparate databases or hunting down undocumented data sources, an AI agent can programmatically consume a data product exposed through a platform like Starburst, giving it reliable “fuel” for decision-making and insight generation.
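As a minimal sketch of what that looks like, the following uses the open-source `trino` Python client to read from a hypothetical “Customer 360” data product; the hostname, catalog, schema, and table names are illustrative assumptions, not a prescribed setup.

```python
import trino

# Hypothetical coordinator endpoint, catalog, schema, and table names.
conn = trino.dbapi.connect(
    host="starburst.example.com",
    port=443,
    http_scheme="https",
    user="insights-agent",
    catalog="lakehouse",
    schema="sales",
)

cur = conn.cursor()
# Plain SQL is all the agent needs; which source systems sit behind
# the data product is irrelevant to the consumer.
cur.execute("""
    SELECT customer_id, lifetime_value, churn_risk
    FROM customer_360
    WHERE churn_risk > 0.8
    ORDER BY lifetime_value DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```

The point is that the agent needs nothing beyond standard SQL and a driver, which is exactly what makes data products consumable without custom engineering per use case.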
Understanding the foundations of data architecture
What is data architecture and why does it matter for AI agents?
Data architecture is a framework that defines how data is collected, stored, transformed, distributed, and consumed within an organization. It encompasses the models, policies, rules, and standards that govern which data is collected and how it is stored, arranged, integrated, and used in data systems and organizations. A well-designed data architecture serves as the foundation for an organization’s data strategy, ensuring that data can be accessed, managed, and utilized effectively.
Data architecture is used in organizations to:
- Provide a blueprint for data systems and infrastructure.
- Ensure data quality, security, and compliance.
- Enable efficient data integration across disparate systems.
- Support analytics, reporting, and AI/ML initiatives.
- Facilitate data-driven decision making.
- Enable AI agents to discover and consume data products.
- Create a foundation for scalable and flexible data operations.
- Reduce data silos and redundancy.
- Establish governance frameworks for data usage.
While there are multiple ways to categorize data architectures, three common types include:
- Centralized Architecture: All data is stored and managed in a single, central repository, often a data warehouse or data lake.
- Distributed/Federated Architecture: Data is stored across multiple locations or systems, but can be accessed through a unified interface or query engine.
- Hybrid Architecture: Combines elements of both centralized and distributed approaches, often with some data centralized while other data remains in source systems, accessible through federation.
Modern approaches also include data mesh (domain-oriented, distributed) and data fabric (integrated, flexible data access across environments).
The shift from human-first to agent-first data consumption
Business users traditionally accessed data through user interfaces such as reports, dashboards, and applications, each of which hides the complexity of the underlying data platforms in its own way. With AI growth now firmly on the radar for every company, a new paradigm is emerging:
- AI agents interact directly with data products, bypassing human-centric applications.
- Instead of requiring graphical interfaces, agents consume data using programmatic access, often via natural language queries translated to SQL or API calls.
- The value shifts from applications themselves to the quality and accessibility of the underlying data.
Data products enable organizations to shift from human-first to agent-first data consumption, allowing AI systems to directly access and utilize organizational data assets in a controlled, governed manner.
Example
Consider a large financial institution with siloed business applications. One data system is used for customer relationship management, another for risk assessment, and another for compliance reporting. With AI agents, these systems no longer operate in isolation. An AI agent detecting potential fraud can consume a “Customer Transactions” data product, augmented with metadata about customer profiles and transaction history, and apply machine learning models to flag suspicious activity across sources in real time. The agent acts autonomously, using data products as trusted sources.
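As a rough sketch of the flagging step, the snippet below scores a new transaction against a customer's history with a simple z-score rule; this stands in for a real machine learning model, and all values are illustrative.

```python
# Minimal anomaly-scoring sketch; a z-score rule stands in for a real ML model.
from statistics import mean, stdev

def is_suspicious(new_amount: float, history: list[float], threshold: float = 3.0) -> bool:
    """Flag amounts that deviate strongly from this customer's past behavior."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# History as it might be read from a "Customer Transactions" data product.
past_amounts = [42.10, 39.95, 45.00, 38.50]
print(is_suspicious(4800.00, past_amounts))  # True: far outside this customer's norm
```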
For AI agents to work effectively, they need seamless access to high-quality, current, and well-governed data in formats compatible with both analytics and AI workloads. This is where the data lakehouse comes in. It provides several advantages, including:
- Unified access layer: Data lakehouses federate access to multiple sources (cloud, on-premises, SaaS), making previously siloed data available to AI agents through standardized protocols.
- Open table formats: Formats such as Iceberg, Delta Lake, and Hudi let data products be shared, versioned, and governed while avoiding vendor lock-in and enabling interoperability across compute engines and storage locations (see the sketch after this list).
- Real-time and batch support: AI agents may require streaming data for IoT device monitoring or batches for model retraining. The lakehouse manages both patterns seamlessly.
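As a sketch of what open formats buy you, the snippet below reads an Iceberg table directly with the `pyiceberg` library; the catalog endpoint and table name are hypothetical, and the same table could just as well be queried through Trino or Spark without copying the data.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog endpoint and table name.
catalog = load_catalog("lakehouse", uri="https://catalog.example.com")
table = catalog.load_table("fleet.sensor_events")

# The filter is pushed into the scan, so only matching data files are read.
overheating = table.scan(row_filter="temperature_c > 90").to_arrow()
print(overheating.num_rows, "overheating events")
```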
Example
A logistics company uses Starburst to create a “Fleet Sensor Events” data product. An AI agent handling predictive maintenance can access this product to analyze temperature, vibration, and performance metrics across all vehicles, regardless of whether the underlying data is stored in AWS, on-premises systems, or external partner feeds. When the agent detects a deviation indicating likely failure, it can trigger a maintenance work order or adjust routing autonomously.
Data products and AI agent workflows: How consumption happens
Data products are particularly well suited to AI workflows: the dataset and its accompanying metadata are combined into a discrete, curated package. To consume a data product, AI agents and other AI processes typically move through several stages.
1. Discovery
AI agents “learn” about available data products via catalogs or APIs. Starburst provides a unified data catalog describing each product with metadata, tags (e.g., “customer-sensitive”, “financial”), and access controls.
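A minimal discovery sketch, assuming a Trino-compatible endpoint: the agent walks standard `information_schema` metadata to enumerate candidate products. Richer catalogs expose descriptions, tags, and owners through dedicated APIs; all names here are illustrative.

```python
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, http_scheme="https",
    user="insights-agent", catalog="lakehouse",
)
cur = conn.cursor()
# Standard SQL metadata is enough for a first pass at discovery.
cur.execute("""
    SELECT table_schema, table_name
    FROM lakehouse.information_schema.tables
    WHERE table_schema <> 'information_schema'
""")
for schema, name in cur.fetchall():
    print(f"discovered data product candidate: {schema}.{name}")
```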
2. Access and governance
AI agents programmatically request access using secure authentication and authorization mechanisms. Data products enforce these controls, ensuring agents only consume permitted data.
Compliance example: A healthcare organization exposes a “De-identified Patient Encounters” data product. An AI agent conducting population-level analysis can access this de-identified data, but cannot join it with sensitive identifying information. Access logs track every interaction for compliance.
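A minimal sketch of governed access, assuming the platform issues the agent a JWT; `read_agent_token` and the endpoint are hypothetical. Row filters and column masks configured centrally apply automatically to whatever the agent queries.

```python
import os

import trino
from trino.auth import JWTAuthentication

def read_agent_token() -> str:
    """Hypothetical token source; here, an environment variable set for the agent."""
    return os.environ["AGENT_JWT"]

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical endpoint
    port=443,
    http_scheme="https",
    auth=JWTAuthentication(read_agent_token()),
    catalog="healthcare",
    schema="research",
)
cur = conn.cursor()
# Succeeds only if this agent's role is granted the de-identified product;
# joins against identifying tables would be rejected by policy.
cur.execute("SELECT encounter_type, count(*) FROM deidentified_encounters GROUP BY 1")
print(cur.fetchall())
```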
3. Consumption patterns
AI agents consume data products in several core ways:
- Retrieval-Augmented Generation (RAG): Large Language Models can reference data products in real time to answer user questions, grounding outputs in the latest organizational truth.
- Model training and fine-tuning: Data scientists and agents may use data products as training or validation sets for custom models.
- Event-driven actions: Agents monitoring streaming data products can act when business thresholds are crossed, such as triggering fraud alerts or supply chain interventions.
Example: An e-commerce retailer creates a data product for “Abandoned Cart Sessions.” An AI marketing agent consumes this product daily to generate personalized re-engagement email content, using fresh data while ensuring compliance with privacy settings defined in the product metadata.
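A minimal RAG sketch along these lines: fetch fresh rows from a data product, then ground the model's answer in them. `ask_llm` is a hypothetical stand-in for whatever model API the agent uses, and `cursor` is a database cursor like the ones in the earlier sketches; the query and table are illustrative.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call."""
    return "(model answer grounded in the supplied data)"

def answer_with_grounding(question: str, cursor) -> str:
    # Pull the freshest figures straight from the data product.
    cursor.execute("""
        SELECT region, sum(revenue) AS revenue
        FROM quarterly_sales
        GROUP BY region
    """)
    context = "\n".join(f"{region}: {revenue}" for region, revenue in cursor.fetchall())
    # Ground the model: it may only answer from the supplied rows.
    prompt = (
        "Answer using ONLY the figures below; say so if they are insufficient.\n\n"
        f"Data:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```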
4. Feedback and iteration
Consumption is not one-way: Advanced AI agents may also feed results back into data products, creating a feedback loop. This might include tagging, annotation, model confidence scores, or even automated creation of new products. Modern platforms allow agents to programmatically “publish” improvements, governed by workflow permissions.
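As a sketch of that feedback loop, the agent below writes an annotation to a hypothetical governed feedback table; whether the annotation is folded back into the product is decided by workflow permissions, not by the agent itself.

```python
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, http_scheme="https",
    user="fraud-agent", catalog="lakehouse", schema="governance",
)
cur = conn.cursor()
# Write the annotation to a governed feedback table; all names are illustrative.
cur.execute("""
    INSERT INTO data_product_feedback (product, record_id, label, confidence)
    VALUES ('customer_transactions', 'tx-20991', 'suspected_fraud', 0.93)
""")
```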
How do ETL processes fit into data product creation?
Extract, Transform, Load (ETL) processes are one of the cornerstones of data architecture. ETL manages the flow of data through the architecture:
- Extract: Obtaining data from source systems.
- Transform: Converting data into appropriate formats for analysis.
- Load: Storing processed data in target destinations.
In essence, data architecture defines how these processes should be designed, implemented, and governed. Modern data architectures may incorporate both batch ETL processes and real-time streaming data integration patterns to support different use cases, including AI agent workflows.
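A minimal batch ETL sketch in plain Python, with a CSV export standing in for the source system and an assumed `curated.orders` target table; real pipelines would use proper connectors and parameterized SQL.

```python
import csv

def extract(path: str):
    """Extract: read raw rows from a source system's CSV export."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: normalize types and drop incomplete records."""
    for row in rows:
        if not row.get("order_id"):
            continue  # skip records that cannot be keyed
        yield {
            "order_id": row["order_id"],
            "amount_usd": round(float(row["amount"]), 2),
            "country": row["country"].strip().upper(),
        }

def load(rows, cursor):
    """Load: append cleaned rows to the curated target table."""
    for r in rows:
        # String interpolation keeps the sketch short; use bind parameters in production.
        cursor.execute(
            f"INSERT INTO curated.orders VALUES "
            f"('{r['order_id']}', {r['amount_usd']}, '{r['country']}')"
        )

# Usage: load(transform(extract("orders_export.csv")), conn.cursor())
```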
The importance of data governance and trust
With AI agents consuming data at scale without direct human supervision, governance becomes critical. Data products are central to this governance, playing an active role in the following areas:
- Governed sharing: Role-based and attribute-based controls determine what AI agents (and by extension, which downstream users or processes) can see or do.
- Lineage and auditability: Every use of a data product by an agent can be traced for compliance, quality assurance, and risk mitigation.
- Transparency: Metadata, documentation, and versioning mean agents (including their developers and overseers) can verify data origins, recency, and usage constraints.
For highly regulated sectors such as banking, insurance, or healthcare, and for organizations operating in jurisdictions like the EU, this is not just helpful; it is legally required. A lack of governance leads to data leaks, unexplainable AI outcomes, and compliance failures.
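As an illustration of attribute-based control, the sketch below encodes a default-deny policy check in a simple in-memory structure; in practice the platform enforces this centrally, and every name here is illustrative.

```python
# Illustrative, default-deny policy table; real platforms enforce this centrally.
POLICY = {
    "customer_transactions": {
        "allowed_roles": {"fraud-agent"},
        "forbidden_tags": {"customer-sensitive"},
    },
}

def may_read(agent_role: str, product: str, column_tags: set[str]) -> bool:
    """Grant access only when the role is allowed and no forbidden tag is touched."""
    rule = POLICY.get(product)
    if rule is None:
        return False  # unknown products are denied by default
    return agent_role in rule["allowed_roles"] and not (column_tags & rule["forbidden_tags"])

assert may_read("fraud-agent", "customer_transactions", {"financial"})
assert not may_read("fraud-agent", "customer_transactions", {"customer-sensitive"})
```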
Flexibility: Bringing AI to the data
Real-world enterprises often have decentralized, hybrid, and heterogeneous data estates. Mandating data centralization is neither practical nor desirable. Modern data architecture, especially the data lakehouse, enables federated data products, making AI-ready data available wherever it lives.
This brings several advantages:
- Faster time to value: No need for lengthy migrations. Agents consume data products where the data resides by default, while centralization remains an option where it makes sense.
- Choice and scalability: As demand for new AI workflows emerges, new or existing data products can be exposed and governed for agent consumption quickly.
- No vendor lock-in: Open standards (such as Iceberg for storage and Trino for compute) allow agents to interact with data across cloud and on-premises environments, keeping future AI options open.
Example
A multinational conglomerate manages energy assets in multiple countries. Local compliance means some operational data cannot leave national borders. Federated data products allow global AI agents to consume non-sensitive summaries for company-wide optimization, while local agents act on full detail. Each data product is governed according to a set policy, enacted to ensure compliance.
The Starburst approach: AI agents empowered by data products
Starburst and its lakehouse architecture are built for this new agent-first era. Features like AI workflows natively embed the creation, discovery, governance, and consumption of data products in ways tailored for AI-driven scenarios.
Concrete workflow in action
- A Starburst AI agent interacts with a business stakeholder via natural language.
- The agent discovers relevant data products, such as “Quarterly Sales,” “Customer Churn Events,” and “Product Catalog.”
- The agent translates business questions into SQL or API calls, queries the products, combines and analyzes results, and returns actionable insights.
- All actions are logged, governed, and auditable.
- New insights can be published as a curated “AI-generated Insight” data product for other agents or human users to consume.
This model scales from small teams to global enterprises, and from analytics use cases to complex, automated AI workflows.
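A compressed sketch of the workflow above, with a canned stand-in for the natural-language-to-SQL step (`translate_to_sql`) and Python's logging module approximating the audit trail a real deployment would record:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def translate_to_sql(question: str) -> str:
    """Hypothetical NL-to-SQL step; a real agent would prompt an LLM with schema context."""
    return "SELECT region, sum(revenue) FROM quarterly_sales GROUP BY region"  # canned stand-in

def handle(question: str, cursor) -> list:
    sql = translate_to_sql(question)
    log.info("question=%r sql=%r", question, sql)  # every action leaves an audit trail
    cursor.execute(sql)                            # access is governed by the agent's role
    rows = cursor.fetchall()
    log.info("rows_returned=%d", len(rows))
    return rows
```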
Data products as the bridge between AI agents and business value
AI agents are only as powerful as the data they can access and trust. The evolution from raw, siloed data toward curated, governed data products is foundational to unlocking the true value of enterprise AI. Modern data architectures are open, federated, and centered around discoverable, governed data products, ensuring that AI agents can operate with speed, accuracy, and safety.
By recognizing that AI agents are now primary consumers of data products, organizations can future-proof their AI data strategy, ensure compliance, and enable transformative, automated intelligence across business domains.
In practical terms: Empower your AI agents with high-quality data products, and you empower your organization with the ability to adapt, innovate, and lead in the age of autonomous AI.