
Before you can use data for analytics or AI workflows, you need to ingest it. This simple, foundational truth underpins every big data engineering project, and it means every team needs to consider the data ingestion process, including both streaming ingestion and file ingestion. Across industries and technologies, managing data effectively means ingesting it efficiently.
We’re here to help make that process easy. Starburst has long simplified streaming data ingestion from Kafka. Today, we’re proud to announce that we’re expanding that capability to include file ingestion from AWS S3 as well.
File Ingestion: The next generation of Starburst data ingestion
This new feature enables seamless, no-code file ingestion of AWS S3 data, focusing on eliminating barriers in the user experience. Specifically, it allows the continuous ingestion of data files as they are uploaded to AWS S3. This addresses a common issue that data engineers face in many organizations, where data management, automation, and batch processing are more complex than they should be.
For example, for many organizations, manual batch ingest is brittle and slow, which makes it difficult to trust and operationalize managed data. At the same time, hydrating Iceberg data remains especially challenging, as competing tools often lack support for Iceberg. The result is delayed time to insight from new data generated in S3, unless teams take on additional data engineering work such as polling or external triggers.
Adding the power of Starburst Galaxy Governance
With Starburst, building and hydrating your data lakehouse with Apache Iceberg data has never been easier. It lets you replace complex pipelines involving multiple tools with a single, seamless ingestion process that takes you from raw data to integration and insights.
Once ingested, your data becomes instantly usable as Iceberg tables, updated regularly and always live. That’s one more way to help your Icehouse architecture thrive.
Want to know more? Let’s check it out.
What is data ingestion?
Data ingestion is the essential first step in any data pipeline. Although it takes many forms, in all cases, data ingestion converts raw data into a usable format for data systems. Importantly, this holds true for all data, regardless of its source or intended use. Data ingestion is just as necessary for data analytics as it is for data applications or AI workflows. And whether the data comes from a database, data warehouse, data lake, or data lakehouse, ingestion is a necessary precondition for its use in downstream systems.
Data ingestion and data lakehouses
Where do data lakehouses enter the equation? Data ingestion is the process of collecting and landing raw data from multiple sources into a central repository. For modern data architecture, this repository is typically either a data lake or data lakehouse.
Data lakehouse construction
You can think of data ingestion as the foundational step in data lakehouse construction: building an open, scalable architecture that supports both analytics and AI workloads.
Data lakehouse hydration
As data is ingested, it must be organized and made queryable. This step is known as data lakehouse hydration. Hydration transforms raw, unstructured files into structured, governed table formats, such as Apache Iceberg, allowing enterprises to access and analyze their data quickly, reliably, and at scale within the data lakehouse.
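To make hydration concrete, here is a minimal, illustrative sketch of the kind of governed Iceberg table that hydration could land raw files into. The catalog, schema, and column names are hypothetical, and the syntax is generic Trino-style SQL rather than a prescribed Starburst Galaxy workflow.

```sql
-- Illustrative only: a hydrated, governed Iceberg table that raw files
-- might be landed into. Catalog, schema, and column names are hypothetical.
CREATE TABLE iceberg_catalog.web.clickstream_events (
    event_id   VARCHAR,
    user_id    VARCHAR,
    event_type VARCHAR,
    event_ts   TIMESTAMP(6),
    payload    VARCHAR
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(event_ts)']
);
```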
There are two common ways of achieving data ingestion:
- Streaming ingestion
- File ingestion
Let’s look at both of them individually.
What is streaming ingestion?
Streaming ingestion is the process of continuously collecting and delivering real-time data from sources such as applications, devices (including IoT), or event streams into a centralized data lakehouse.
Unlike batch ingestion, which processes data at scheduled intervals, streaming ingestion captures data as it is generated, reducing latency and enabling near real-time analytics and decision-making.
This is especially critical for use cases like financial services compliance, operational monitoring, or customer personalization. By rapidly hydrating tables with fresh data, streaming ingestion ensures that the lakehouse remains up-to-date and ready for query. Effective streaming systems must balance scale, performance, reliability, and maintainability to support the evolving needs of enterprises.
What is file ingestion?
File ingestion is the process of importing datasets from data files. These files can be stored in various repositories, including AWS S3. AWS S3 file ingestion is important because it enables organizations to seamlessly import high-volume data from a widely adopted storage service directly into their lakehouse for analysis.
Typical sources for file ingestion include:
- Log data
- Clickstream data
- Daily batch exports
Once ingested, this data lands in a data lakehouse for analysis and downstream use. File ingestion is a cornerstone of data usage across multiple industries, particularly for organizations that rely on scheduled batch processes to deliver new data regularly. Batch file ingestion enables teams to quickly land these files into live data pipelines, where they can be hydrated into structured, query-ready tables.
Why fully managed file ingestion is a game changer
Traditionally, file ingestion has relied on a combination of tools, including Flink, Spark, and custom scripts, each handling different aspects of the process.
Managing data ingestion complexity is important as data scales
While these tools are powerful, managing them adds complexity to the process. Teams must coordinate systems, write and maintain code, and troubleshoot issues across multiple layers of the system. This can slow down data availability and increase the risk of pipeline failures. Simplifying ingestion into a single, no-code process reduces these challenges, making it easier for teams to deliver consistent, high-quality data to support analytics, decision-making, and AI.
For example, the image below illustrates a typical data ingestion toolchain involving multiple tools.
Starburst Galaxy data governance and file ingestion, fully managed
Starburst plays a key role in managing the complexity of your data ingestion stack. By integrating file ingestion into a governed platform, organizations can enforce role-based and attribute-based access controls from the moment data lands. This ensures secure, compliant use of data throughout its lifecycle.
Let’s check out how it works.
How it works: Let Starburst build your fully managed lakehouse instantly
Starburst simplifies the file ingestion process with a fully managed architecture that quickly and reliably brings raw data into the lakehouse.
File ingestion data architecture
The Starburst file ingestion feature enables users to ingest JSON files from AWS S3 and load them directly into Apache Iceberg tables in their native format, before applying a schema. This approach is built for the cloud-based workloads that data engineers manage every day. It enables organizations to consume data without complex tooling or code, thereby reducing complexity and simplifying the data pipeline. From there, the data can be queried using SQL and used for any downstream purpose, including analytics and business intelligence (BI) workloads as well as AI and machine learning (ML) needs.
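As a rough illustration of that last step, the query below uses generic Trino-style SQL against an ingested Iceberg table; the catalog, schema, table, and column names are hypothetical.

```sql
-- Illustrative only: querying an ingested Iceberg table with standard SQL.
-- Catalog, schema, table, and column names are hypothetical.
SELECT
    event_type,
    count(*) AS events
FROM iceberg_catalog.landing.s3_events
WHERE event_ts >= current_date - INTERVAL '7' DAY
GROUP BY event_type
ORDER BY events DESC;
```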
The image below illustrates the process. Streaming data from Kafka is ingested into a new data pipeline alongside AWS S3 file ingest data, which is stored in a raw table. From there, it is automatically transformed into Iceberg tables using schematized data transformation processes. As this process repeats, it creates multiple Iceberg transformation tables. The resulting pipeline operates as a multifaceted data ingestion tool, ingesting and transforming multiple types of data from various ecosystems at high volumes.
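For a sense of what that schematized transformation can look like, the sketch below uses generic Trino-style SQL to project fields out of a raw JSON column into a structured Iceberg table. The table names and JSON paths are hypothetical, and Starburst Galaxy automates this step, so treat this purely as an illustration of the idea.

```sql
-- Illustrative only: deriving a structured Iceberg table from a raw landing
-- table whose rows hold unparsed JSON. Table and field names are hypothetical.
CREATE TABLE iceberg_catalog.curated.orders AS
SELECT
    json_extract_scalar(raw_json, '$.order_id')    AS order_id,
    json_extract_scalar(raw_json, '$.customer_id') AS customer_id,
    CAST(json_extract_scalar(raw_json, '$.amount') AS DECIMAL(12, 2)) AS amount,
    from_iso8601_timestamp(json_extract_scalar(raw_json, '$.ordered_at')) AS ordered_at
FROM iceberg_catalog.landing.orders_raw;
```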
This approach has many benefits, including:
- Quick, easy ingestion of JSON data from AWS S3 in seconds.
- Automatic schema management, with support for both inference and manual definition.
- Built-in table maintenance, including compaction, snapshot expiration, and more.
- Support for AWS S3 as the initial file ingestion source.
Live Tables: File ingestion, made simple
Unlike traditional ingestion tools, Starburst handles the end-to-end journey of data across the data platform. We do this with a feature called Live Tables. Live Tables keep the data ingested from AWS S3 automatically optimized, governed, and ready to use.
Check out the demo below to see how this works in practice.
Additionally, because Starburst powers your entire data architecture, you can use this data for data analytics, data applications, or AI workflows. This approach ensures that not only can you create a data lakehouse in seconds, but you can also automatically govern, maintain, and scale it for the long term.
Want to see file ingestion in action? Check out the video demo below to see file ingestion in Starburst Galaxy.
Starburst: A single foundation for all your data
File ingestion is the continuation of Starburst’s long-term mission. That work began with Kafka streaming ingestion, and it now extends to file ingestion as well. At its heart, this mission is about two things:
- Giving you choice over how you access, collaborate with, and govern your data.
- Making it easy to manage your architecture across all your use cases, including analytics, data applications, and AI.
File ingestion helps deliver another piece of this puzzle, allowing you to create an Iceberg data lakehouse in seconds from any AWS S3 source. File ingestion removes the traditional complexity of building ingestion pipelines. Instead of stitching together multiple tools, you can use a single tool across all your data sources and all your data use cases.
Whether you’re onboarding batch data daily or scaling across hundreds of sources, file ingestion provides a fast and secure foundation for building and expanding your lakehouse.