Why AI is only as good as the data that feeds it
Evan Smith
Technical Content Manager
Starburst Data


How do you get started with AI? It’s not always easy. When asked, most Chief Data Officers say they don’t have a detailed strategy for supporting generative AI (GenAI) use cases. Many simply don’t know how to begin breaking down the monolithic task of implementing an AI strategy for an entire enterprise while ensuring that the data feeding these models is properly optimized and governed.
This isn’t surprising. AI at this scale represents new territory for most organizations. Although this is an exciting, generational opportunity for change, its inherent newness brings its own risks.
Why AI architecture is really data architecture
Meanwhile, there is urgency at every turn. Many companies are rapidly developing AI strategies to remain competitive in this emerging space. To be successful, organizations should look to their data architectural strategy first. By viewing AI architecture through the broader lens of data architecture, organizations can build a flexible foundation for all their data, one that can grow alongside GenAI’s quickly evolving use cases. “Garbage In, Garbage Out” (GIGO) has been a truism in computing for decades. With the rise of AI, particularly agentic AI, data quality and data context matter more than ever.
In this article, I’ll examine what makes curating high-quality data for AI ingestion so challenging. Additionally, I’ll discuss how to move towards a flexible data architecture that can meet AI’s rising demands now and well into the future.
Why AI is only as good as its data
The need for large amounts of high-quality data isn’t new. For years, data analytics has used large volumes of data to power data-driven insights. Along the way, however, the growing need for data has revealed the shortcomings of existing data architectures, which struggle to keep up with the volume needed.
Seen in this way, the emergence of AI underscores the importance of the problems facing data architecture in general. Just like analytics, AI is only as good as the data that feeds it.
Why GenAI gets better with more data
The relationship between data quality, data volume, and insights is especially pronounced in GenAI systems. GenAI is built on probabilistic models, most notably large language models (LLMs). Though architectures differ, these models are all trained on large datasets, with their parameters continuously refined during training. An LLM encodes what it learns as a set of weighted values, then uses those weights to predict the most likely next token, such as a word or word fragment, in a sequence.
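To make the prediction step concrete, here is a minimal, illustrative sketch of how a model turns raw scores into next-token probabilities. The candidate tokens and scores are invented for illustration; a real LLM scores its entire vocabulary using billions of learned weights.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores for candidate next tokens after a hypothetical prompt,
# e.g., "The data feeding your model must be ..."
candidates = ["clean", "governed", "large", "purple"]
logits = [2.1, 1.9, 1.7, -3.0]

for token, p in zip(candidates, softmax(logits)):
    print(f"{token}: {p:.1%}")
```

Better, more complete training data shifts these probabilities toward correct continuations, which is exactly why data quality matters so much.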
With more data comes better predictions
More data means better predictions. Even if you’re not training an LLM directly, supplying additional context via techniques such as retrieval-augmented generation (RAG) or fine-tuning is essential for producing accurate predictions within your specific problem domain.
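To illustrate the idea, here is a minimal, self-contained RAG sketch. The corpus, the word-overlap relevance score, and the prompt template are toy stand-ins; a production system would use an embedding model and a vector store, and would send the assembled prompt to an actual LLM.

```python
# A toy corpus standing in for curated enterprise documents.
CORPUS = [
    "Acme Corp renewed its contract in March after a pricing review.",
    "Globex raised a support ticket about data residency in the EU.",
    "Initech has not been contacted since the Q2 product launch.",
]

def score(query: str, doc: str) -> int:
    """Toy relevance score: shared-word count (real systems use embeddings)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k most relevant documents from the corpus."""
    return sorted(CORPUS, key=lambda d: score(query, d), reverse=True)[:top_k]

def build_prompt(question: str) -> str:
    """Augment the user's question with retrieved context before calling an LLM."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("Which accounts raised residency concerns?"))
```

The key point: the model’s answer can only be as good as the documents the retrieval step can reach, which is where data access comes in.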
What prevents us from getting the data we need for AI use cases?
So where’s the problem? On the plus side, object storage is cheaper than ever, and the cloud makes it easier to scale data resources economically. The problem is moving high-quality data from your source systems into your AI pipelines and models.
To fix this problem, you need to look at solutions across three separate areas:
- Data access
- Data collaboration
- Data governance
Let’s look at each area in detail.
Data access
AI requires access to high-quality data from across your organization. To be successful, you’ll need a coordinated approach to data access. Specifically, you need to leverage all of your data, wherever it lives, to improve the accuracy of your AI insights. This can be hard in fragmented data architectures, where some teams’ data exists in silos.
The role of dark data
Dark data also plays a role. Many companies don’t have a firm grasp on what data assets they have, which makes it hard to ensure those assets are being used correctly. Dark data often arises when teams create their own solutions, either because:
- They weren’t given the proper toolsets or architecture to create more centrally managed—or at least centrally visible—solutions; or
- Companies insisted on centralized data solutions that some teams found too convoluted to adopt.
As of 2022, one report estimated that as much as 55% of an organization’s data may be dark. At that time, 48% of C-level and IT pros said they didn’t feel they had the tools to deal adequately with dark data.
Dark data and AI
Dark data was already a liability before AI. With the advent of GenAI, your business may be losing out further by neglecting data that could lead to more timely and more accurate responses from agentic AI systems.
Data collaboration
In many organizations, data silos also negatively impact collaboration. Even if you have access to a dataset, you may not have access to all of the context needed to make it useful. Often, this is a result of poor data collaboration.
Examples of poor data collaboration
For example, you may want your salespeople to use GenAI to research account history, identify new prospects, and automatically write outreach emails. That requires data from several systems, such as Salesforce, HubSpot, and Google Analytics.
Enabling easy access across data systems has been an issue for years. With GenAI, this issue becomes even more critical, as missing data leads to incomplete or misleading model outputs.
Data collaboration across hybrid architectures
Hybrid architectures pose another collaboration challenge. Data may be spread across multiple locations, including multiple clouds and on-premises environments. On-premises data may also carry access restrictions, such as data compliance or data residency laws that prevent data from leaving a given geographical region. All of these factors can inhibit data collaboration.
In this sense, data silos often represent the existing logical divisions within your business. They might arise between different departments like sales, marketing, engineering, and finance. In any case, overcoming them and ensuring data collaboration enables better business processes by making datasets meaningful across teams. That’s good for both your data and your business.
Data governance
Data governance is all about ensuring that the right people can access the right data in the right way. Data for AI brings the same governance challenges we encounter in regular data-driven applications.
Making data available across the company at scale for AI in a governable manner requires:
- Ensuring stakeholders can find the datasets they need while keeping sensitive data restricted (see the sketch after this list).
- Providing data in an interoperable format that’s well-documented and easy to use.
- Guaranteeing transparency around data so that consumers can identify where it’s from, how it was obtained or calculated, when it was last updated, and how it works.
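As a minimal sketch of what the first point can look like in practice, the snippet below issues table-level grants through Trino’s SQL interface using the official trino Python client. The host, catalog, table, and role names are placeholders, and whether GRANT and REVOKE are enforced depends on the access-control system configured in your deployment.

```python
import trino  # official Trino Python client: pip install trino

# Placeholder connection details for illustration.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="data-steward",
    catalog="lakehouse",
    schema="sales",
)
cur = conn.cursor()

# Make the curated dataset discoverable and queryable by analysts...
cur.execute("GRANT SELECT ON sales.opportunities TO ROLE analyst")

# ...while the raw table containing sensitive fields stays restricted.
cur.execute("REVOKE SELECT ON sales.customers_raw FROM ROLE analyst")
```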
Many businesses don’t have these processes in place today. Without them, however, your AI strategy won’t get off the ground.
How to curate high-quality data for AI
Addressing these challenges requires implementing a data architecture that supports data access, data collaboration, and data governance.
How do you get there? Two strategies stand out:
- Data products
- Icehouse architecture
Let’s look at how each of these impacts the quality of the data feeding AI models.
Data products and AI
Data products are curated datasets that enhance access, collaboration, and governance of the underlying data. Data products contain metadata about the origin, purpose, and meaning of the data inside them. This makes them easier to discover, verify, trust, and use. It also makes building AI solutions easier, because those solutions can leverage the metadata embedded in each data product.
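To make that concrete, here is a hypothetical sketch of the kind of metadata a data product carries. The fields shown are illustrative examples, not a Starburst schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataProduct:
    """Illustrative metadata wrapper around a curated dataset."""
    name: str
    description: str           # purpose and meaning of the data
    owner: str                 # accountable team or person
    source_systems: list[str]  # origin / lineage of the data
    last_updated: date
    tags: list[str] = field(default_factory=list)

accounts = DataProduct(
    name="customer_360",
    description="Deduplicated view of every customer across CRM and billing.",
    owner="sales-analytics@example.com",
    source_systems=["salesforce", "hubspot", "billing_db"],
    last_updated=date(2024, 6, 1),
    tags=["pii", "gold"],
)
```

Because origin, ownership, and freshness travel with the dataset, an AI pipeline (or the engineer building it) can verify a data product’s fitness before using it.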
When used for AI, data products provide several advantages over standard datasets, including:
- Efficiency
- Data quality
- Interoperability
- Transparency
Why the Icehouse is perfect for making your data AI-ready
A Starburst Icehouse architecture is an ideal way to ensure that the data feeding your AI models is accessed, shared, and governed in the right way. The Icehouse is an open data lakehouse architecture that consists of two components:
- Trino, an open SQL query engine that provides fast, ad hoc access to data wherever it lives.
- Iceberg, an open table format that serves as an interoperable metadata layer supported by a number of vendors.
The Icehouse is designed to preserve optionality, preventing you from getting locked into any specific vendor. While this is useful for analytics, it is doubly useful for AI architecture, preserving your choice of future direction as AI continues to evolve.
Even better, transitioning to an Icehouse architecture doesn’t require tearing down everything you’ve built in your data architecture and starting over. Using Trino, you can start by accessing your data wherever it lives in your organization and go from there. That means you can keep most of your existing data where it is, while moving the workloads that will benefit most to the Iceberg table format.
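As a rough sketch of that incremental path, the snippet below uses the trino Python client to federate a query across an existing relational catalog, then promotes the result to an Iceberg table with CREATE TABLE AS SELECT. All catalog, schema, and table names are placeholders for illustration.

```python
import trino  # official Trino Python client

conn = trino.dbapi.connect(
    host="trino.example.com",  # placeholder coordinator address
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# Step 1: query data where it already lives; here, a PostgreSQL catalog
# joined against existing lake data. No migration required.
cur.execute("""
    SELECT c.customer_id, c.region, o.order_total
    FROM postgres.crm.customers AS c
    JOIN lake.sales.orders AS o ON o.customer_id = c.customer_id
""")
print(cur.fetchmany(5))

# Step 2: promote a critical workload to the Iceberg table format
# once it proves its value.
cur.execute("""
    CREATE TABLE iceberg.curated.customer_orders AS
    SELECT c.customer_id, c.region, o.order_total
    FROM postgres.crm.customers AS c
    JOIN lake.sales.orders AS o ON o.customer_id = c.customer_id
""")
cur.fetchall()  # drive the CTAS statement to completion
```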
How Starburst supports your AI data
Based on the Icehouse, Starburst provides a flexible approach to managing data for AI and all your data-driven applications.
Using Starburst, you can eliminate data silos, moving critical workloads into Iceberg tables while also accessing existing data across your organization. Additionally, Starburst provides full support for developing data products. That means you can easily curate high-quality datasets for analytics and AI that set new standards for data access, collaboration, and governance.
To learn more about migrating to an Icehouse architecture with Starburst, contact us today.