
Data professionals often face the same dilemma: data is scattered across multiple source systems, but most analytics workflows assume it should be centralized. Data insights are only as good as the data that feeds them, and data silos can easily become a bottleneck. Today, AI recreates the same problem and the same bottleneck.
The conventional solution has been to build pipelines that copy data into a single repository, typically a data warehouse, data lake, or data lakehouse. This kind of data integration activity is as old as data management itself, but it comes with persistent problems.
Let's look at those problems in detail and explore why federated data is more important than ever, especially with the advent of real-time data ingestion and AI workflows.
The problem with centralized data
Data centralization has one big problem: it brings order, but it also brings added cost and complexity. In many instances, these complexities are so onerous that centralization projects never reach completion. This was already a problem for analytics, and the same thing is now happening with AI. Centralization is beneficial in moderation, but when forced, it creates significant problems at both the technological and project management levels.
Data federation, also known as data virtualization, offers an alternative. Rather than moving every dataset, federation enables users to query data where it lives.
Why haven’t we always been doing this?
If data federation is so good, why haven’t we always been doing it? It’s a good question, with an interesting answer.
In the past, data federation technology couldn't compete with centralization on performance, which often made federated use cases slow or impractical.
All of this has now changed. Modern data federation architecture is powered by distributed query engines like Starburst and Trino. These compute engines have addressed the issues that hindered federation in the past. Today, data federation has become a reliable, production-ready strategy. More importantly, it gives organizations the ability to choose how much centralization they actually need for all of their workflows.
That’s as true of analytics as it is of AI.
Why data centralization fails more often than it succeeds
Let’s dive a bit deeper and explore some of the problems that organizations encounter when they centralize everything by default.
On paper, centralizing data has clear benefits. It promises consistent schemas, predictable performance, and simpler governance. Yet it comes at a cost, including:
Data migration overhead
Data migration isn't easy. It often runs over budget and behind schedule, and it requires a significant amount of ETL work to succeed. Each new data source needs ingestion pipelines that must be built and maintained, and the complexity compounds with every addition. Data consolidation often requires work across both on-premises and cloud systems.
All of that is work that needs to be done before you can even get started.
Data velocity problems
Raw data changes quickly, and an organization's data is always in flux. Moving data takes time and creates delays, so copies lag behind their sources, sometimes by hours or days. This is the data velocity problem, and it directly impacts data quality. The Business Intelligence (BI) dashboards and AI insights that depend on your data pipeline inherit that lag, slowing down data-driven decisions.
Data governance headaches
Data governance is essential in every organization, and every copy of a dataset creates another point of failure as well as additional compliance exposure. In regulated industries, such as banking, compliance requirements create an even greater need for oversight of access control and auditing.
In industries where compliance rules are essential, such as finance, healthcare, and government, moving sensitive data into a central repository may not even be allowed. This is where federated queries become especially valuable, allowing organizations to remain compliant.
How Starburst does federation differently
Early federation systems struggled because they were bolted onto engines that were never designed for distributed querying. Starburst fixes this by building on Trino. From the start, our approach has been to provide fast, scalable, parallelized queries across heterogeneous sources, whether on-premises or in the cloud.
And the best part: it's all built on SQL, a language that both data engineers and data consumers understand.
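For example, a single federated query can join customer records held in PostgreSQL with clickstream events stored in a data lake, without copying either dataset first. Here is a minimal sketch in Trino SQL, assuming two configured catalogs named postgres and lake (all catalog, schema, and table names are illustrative):

```sql
-- Join live customer records in PostgreSQL with clickstream events in the lake,
-- without copying either dataset. Catalog and table names are illustrative.
SELECT
    c.customer_id,
    c.segment,
    count(*) AS page_views
FROM postgres.crm.customers AS c
JOIN lake.web.clickstream AS e
    ON c.customer_id = e.customer_id
WHERE e.event_date >= DATE '2024-01-01'
GROUP BY c.customer_id, c.segment
ORDER BY page_views DESC
LIMIT 100;
```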
Let’s look at some of the ways that Starburst helps you achieve your data federation goals.
High-performance queries across systems
Modern data federation technology, like Starburst, often rivals centralized systems on performance. Starburst was built for speed at scale, using predicate pushdown, parallel execution, and cost-based optimization to minimize bottlenecks.
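As a rough illustration, you can ask the engine to show its distributed plan with EXPLAIN. For connectors that support pushdown, such as JDBC-backed relational sources, eligible filters and aggregations can run inside the source system rather than after the data crosses the network. The catalog and table names below are illustrative:

```sql
-- Show the distributed plan for a federated query. For connectors that support
-- pushdown (such as JDBC sources), the filter and aggregation may run inside
-- the source database instead of in the query engine.
EXPLAIN
SELECT region, sum(amount) AS total_sales
FROM postgres.sales.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY region;
```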
Universal access to your data, wherever it lives
Your data architecture is varied and diverse. Data centralization never really accounted for that, but data federation does. Starburst integrates with dozens of data sources, from relational databases and SaaS tools to data lakes and data lakehouses built on Amazon S3, Google Cloud Storage (GCS), and Microsoft Azure.
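In a Trino-based deployment, each source is typically exposed as a catalog defined by a small properties file that names a connector and its connection details. A hedged sketch of what two such catalogs might look like (hostnames, credentials, and file names are placeholders):

```properties
# postgres.properties - a relational source queried in place
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/crm
connection-user=analytics
connection-password=example-secret

# lake.properties - an Iceberg lakehouse catalog on object storage
connector.name=iceberg
hive.metastore.uri=thrift://metastore.example.com:9083
```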
Apache Iceberg by default
Sometimes data centralization still makes sense. That's both normal and expected. When datasets are worth centralizing, Starburst works natively with Apache Iceberg, providing a durable, governed data lakehouse. This makes the most of a focused, judicious approach to data centralization.
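For instance, when one high-value dataset justifies materialization, a single CREATE TABLE AS statement can land it in a governed Iceberg table while everything else stays federated. A minimal sketch, with illustrative catalog and table names:

```sql
-- Centralize one high-value dataset into a governed Iceberg table, partitioned
-- by month. Everything else can stay federated and be queried in place.
CREATE TABLE lake.analytics.orders_history
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
)
AS
SELECT order_id, customer_id, order_date, amount
FROM postgres.sales.orders;
```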
The key difference between thoughtful centralization and centralization by default is that the decision is no longer all-or-nothing. With Starburst, organizations can federate where it makes sense and centralize where it matters most.
Why data federation matters for AI
There’s another critical angle to all of this. Most discussions of federation focus on analytics. But data federation is just as important for AI workflows, and that importance is growing quickly.
The reasons come down to many of the same factors that affect analytics workloads. AI models need broad access to data, often spanning multiple source systems. Moving all that data into one place can be costly, slow, or simply impossible due to compliance rules.
Federation addresses these issues by:
- Reducing the movement of sensitive data. Data remains in place, minimizing exposure and compliance risk.
- Expanding access to training data, including metadata. Models can be built on broader datasets, including those involving unstructured data, without requiring everything to be centralized.
- Supporting contextual retrieval. Retrieval-augmented generation (RAG) systems can use federation as a unified query layer across diverse knowledge sources, as sketched below.
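As a rough sketch of that last point, a RAG pipeline can treat the federated engine as its retrieval layer, gathering candidate context from several systems with one query before it is passed to a model. All catalog, schema, and table names here are illustrative:

```sql
-- One retrieval query across a document lake and an operational database,
-- returning candidate context for a RAG prompt. Names are illustrative.
SELECT doc_id, source, snippet
FROM (
    SELECT doc_id, 'knowledge_base' AS source, summary AS snippet
    FROM lake.docs.articles
    WHERE contains(tags, 'pricing')
    UNION ALL
    SELECT cast(ticket_id AS varchar) AS doc_id, 'support_tickets' AS source, subject AS snippet
    FROM postgres.helpdesk.tickets
    WHERE created_at >= current_date - INTERVAL '90' DAY
) AS candidates
LIMIT 20;
```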
As AI is adopted more widely across industries of all kinds, universal data access becomes critical. Federation allows organizations to innovate with AI while respecting data sovereignty and governance requirements.
Why having choice over your data architecture matters
What does data federation give you? In a word, choice. That choice has always been an advantage in the realm of analytics, and now it’s doubly valuable in the era of AI.
Starburst puts power back in your hands, allowing your data teams to decide:
- Which datasets should remain in their original systems and be queried in place.
- Which should be centralized into Iceberg for long-term management.
- How to evolve these choices over time, without locking into one architectural pattern.
This flexibility lets architects design for both today’s needs and tomorrow’s changes. Instead of a single strategy, Starburst makes it possible to apply the right approach to each dataset for analytics and AI.
Why Starburst data federation should be part of your AI strategy
Data federation is no longer a compromise. With Starburst, federation has become a practical foundation for both analytics and AI. By enabling queries across disparate systems and supporting centralization into Iceberg where it makes sense, Starburst ensures that data professionals can design architectures around choice, not constraint.
For modern data teams, that choice is not just convenient, it’s essential.