What are Open Table Formats?

Why data lakehouses rely on open table formats like Iceberg

Open table formats represent a foundational shift in modern data architecture. They add a critical layer of structure and intelligence to raw data lakes, enabling features like transactions, time travel, and fine-grained updates. Most importantly, they serve as the dividing line between traditional data lakes and emerging data lakehouses.

Open table formats have transformed how we manage data in lakes. Apache Hive once served as the de facto table layer, but it lacked support for transactions and schema evolution and struggled to perform at scale. Newer formats like Apache Iceberg, Delta Lake, and Apache Hudi have replaced Hive by introducing robust metadata layers and full ACID compliance.

Because of this, open table formats are the backbone of modern data lakehouses. They bring structure, flexibility, and transactional reliability to raw object storage, making data easier to query, govern, and scale. If your architecture still relies on Hive, it’s time to make the shift. This is particularly true of Apache Iceberg, which has emerged as the leader in this space.

Let’s learn more.

What is a data lake?

A data lake stores large volumes of raw or semi-structured data, but it lacks the built-in metadata management needed to handle transactional workloads or enforce consistency. In contrast, a data lakehouse combines the flexibility of a data lake with the data management features of a data warehouse. This convergence is made possible by open table formats.

Why data lakehouses matter

The first generation of data lake table formats began with Apache Hive, which introduced a way to query tabular data over a data lake. However, Hive lacked support for transactional operations, and its performance was limited by design. Newer formats like Apache Iceberg, Delta Lake, and Apache Hudi have addressed these shortcomings by introducing scalable metadata layers, ACID compliance, and support for schema evolution and efficient record-level updates.

These innovations have allowed data lakehouses to emerge as a viable alternative to traditional data warehouses, delivering both the scale and flexibility of data lakes and the robust data management required for enterprise analytics.

In short, if your data platform uses an open table format, it’s a data lakehouse; if it doesn’t, it’s just a data lake.

In the next section, we’ll explore how these modern formats enhance the utility of your data platform—particularly when it comes to transactional data—and why they’re now considered essential for building AI-ready infrastructure.

How do open table formats work?

While each table format varies, they all extend the features of a data lake in similar ways.

These include:

Full CRUD operations

Data lakes store files in HDFS or object storage, both of which treat data as immutable and provide no easy way to update files incrementally. A database typically supports create, read, update, and delete (CRUD) operations; a plain data lake effectively supports only the first two. Modern table formats close this gap by making it possible to update and delete individual records.
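
As a minimal sketch of what this looks like in practice, the Trino SQL below assumes an Iceberg catalog named lakehouse with a schema named sales (both hypothetical names). The point is that record-level updates and deletes become ordinary SQL statements over the lake.

```sql
-- Hypothetical catalog and schema names; any Trino cluster with an
-- Iceberg catalog configured would look similar.
CREATE TABLE lakehouse.sales.orders (
    order_id   BIGINT,
    customer   VARCHAR,
    status     VARCHAR,
    order_date DATE
);

-- Create and read: what a plain data lake already supports
INSERT INTO lakehouse.sales.orders
VALUES (1, 'acme', 'open', DATE '2024-06-01');

SELECT * FROM lakehouse.sales.orders;

-- Update and delete: record-level changes the table format makes possible
UPDATE lakehouse.sales.orders SET status = 'shipped' WHERE order_id = 1;
DELETE FROM lakehouse.sales.orders WHERE status = 'cancelled';
```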

Improved performance and scalability

Data lakes grow continuously, and many become very large, so their analytic capabilities need to scale to match. Newer table formats improve on Hive by tracking data at the file level rather than at the folder level. When a query needs only a specific subset of the data, the engine can read just the files that contain it instead of scanning the whole folder, which vastly improves performance and efficiency.
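
To make the file-level pruning concrete, here is a hedged sketch using the same hypothetical lakehouse catalog: the table is partitioned by day, and Iceberg's per-file metadata lets the engine read only the files that can match the date predicate.

```sql
CREATE TABLE lakehouse.sales.events (
    event_id BIGINT,
    event_ts TIMESTAMP(6),
    payload  VARCHAR
)
WITH (partitioning = ARRAY['day(event_ts)']);

-- The planner consults Iceberg's file-level metadata (partition values and
-- column min/max statistics) and skips files that cannot match the filter,
-- instead of listing and scanning whole directories as Hive would.
SELECT count(*)
FROM lakehouse.sales.events
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00'
  AND event_ts <  TIMESTAMP '2024-06-02 00:00:00';
```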

Transactional support and ACID compliance

With ACID capabilities in table formats like Iceberg and Delta Lake, users can now achieve transactional guarantees within a data lake. This doesn't make the data lake a replacement for an OLTP system, but it does ensure that groups of updates are either committed together or rolled back if they cannot be completed. This is essential for many of today's evolving ETL pipelines.
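
For example, a pipeline can apply a staged batch of changes atomically with a single MERGE statement. The sketch below assumes the hypothetical orders table from the earlier example plus a staged order_updates table (also hypothetical); the whole statement either commits as one transaction or rolls back.

```sql
MERGE INTO lakehouse.sales.orders AS t
USING lakehouse.staging.order_updates AS s
    ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET status = s.status
WHEN NOT MATCHED THEN
    INSERT (order_id, customer, status, order_date)
    VALUES (s.order_id, s.customer, s.status, s.order_date);
```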

Three types of data lakehouse open table formats

Let’s take a closer look at the three modern open table formats: Apache Iceberg, Delta Lake, and Apache Hudi.

1. Apache Iceberg

Apache Iceberg is an open-source table format used to structure the data held in data lakes. Like the other table formats listed, it was developed to solve the challenges of performance, data modification, and CRUD operations in the data lake. It can be used with HDFS or any object storage service, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.

Iceberg also offers schema evolution, partition evolution, and time travel. This allows users to apply and update schemas, change how a table is partitioned without rewriting existing data, and roll the table back to a previous state. All of these capabilities push the data lake to a new level of functionality and open up new use cases.
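
Here is a hedged sketch of what these features look like in Trino SQL against the hypothetical orders table from earlier. Adding a column and changing the partition spec are metadata-only operations, and time travel is expressed with FOR TIMESTAMP AS OF (or FOR VERSION AS OF with a snapshot ID).

```sql
-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE lakehouse.sales.orders ADD COLUMN region VARCHAR;

-- Partition evolution: new data uses the new spec; existing files stay as they are
ALTER TABLE lakehouse.sales.orders
SET PROPERTIES partitioning = ARRAY['month(order_date)'];

-- Time travel: query the table as it existed at an earlier point in time
SELECT *
FROM lakehouse.sales.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';
```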

Demo: Iceberg and Trino

This demo explores the combination of Apache Iceberg and Trino, two tools reshaping the big data landscape. To do this, we’ll use Starburst Galaxy and compare the experience with AWS Athena.

2. Delta Lake

Delta Lake is an open-source framework developed by Databricks. Like other modern table formats, it employs file-level listings, considerably improving the speed of queries compared to Hive’s directory-level listing.

Like Iceberg, Delta Lake offers enhanced CRUD operations, including the ability to update and delete records in a data lake that would previously have been immutable. It is ACID-compliant and often used for transactional workloads. For these workloads, it makes the data lakehouse a viable alternative to a traditional data warehouse while retaining the cost and storage benefits of a data lake.

Delta Lake can be used with Starburst via the Delta Lake connector.
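
As a rough sketch, assuming a Trino catalog named delta that is configured with the Delta Lake connector (a hypothetical name), Delta tables are queried and modified with the same SQL used for Iceberg tables.

```sql
-- Read and modify a Delta table through the hypothetical "delta" catalog
SELECT customer, count(*) AS order_count
FROM delta.sales.orders
GROUP BY customer;

UPDATE delta.sales.orders SET status = 'shipped' WHERE order_id = 1;
```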

3. Apache Hudi

Apache Hudi is a third table format, used less often than Iceberg or Delta Lake. Originally developed at Uber, it addresses many of the same problems discussed above, with a particular emphasis on incremental ingestion and record-level upserts.

Open table format architecture

Metadata captures changes in state

Architecturally, modern table formats are composed of a set of hierarchically structured metadata files. These files capture changes in the state of the data in the data lake. In effect, the table format acts like a database transaction log, recording every change made to the data lake over its lifetime. This metadata is stored in a structured, readily accessible format. How does this work? Let’s explore how Iceberg uses enhanced metadata collection to deliver additional functionality.

The image below shows how metadata tracks the changes to the dataset. The files held in the Data layer are captured by the metadata files held in the Metadata layer. As the files change, the metadata files attached to them track these changes.

Record snapshots

To achieve this metadata capture, modern table formats write records that point to the individual data files in the table. These records are stored in a Manifest file, which includes metadata about those files at a given point in time and details the points at which changes are made. The Manifest files that make up a given version of the table are grouped into a Manifest list.

In the image below, changes in the Data layer have been detected in the Metadata layer. A new Manifest file and corresponding Manifest list have been created to capture these changes.

Create an up-to-date record of changes

Manifest files and Manifest lists make it possible to record an accurate, up-to-date account of the changes over time, including inserts, deletes, updates, schema migrations, and partition changes. The changes themselves are recorded in metadata files known as Snapshots. Each Snapshot is like a slice in time, allowing the dataset to be queried as it existed at any of those points or rolled back to a previous state.

The image below shows how the changes in the Data layer have created a new Snapshot file, Snapshot 1. The original Snapshot file, Snapshot 0, is also retained. This creates a series of snapshots, each tracking changes to the data and recording those changes in the Metadata layer.
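
In Trino, this metadata is directly queryable through the Iceberg connector's metadata tables, and a table can be rolled back to an earlier Snapshot. The brief sketch below again uses the hypothetical lakehouse catalog, and the snapshot ID shown is a placeholder.

```sql
-- List the Snapshots recorded in the metadata layer
SELECT snapshot_id, committed_at, operation
FROM lakehouse.sales."orders$snapshots"
ORDER BY committed_at;

-- Query the table as of a specific Snapshot (placeholder ID)
SELECT * FROM lakehouse.sales.orders FOR VERSION AS OF 123456789;

-- Roll the table back to that Snapshot
CALL lakehouse.system.rollback_to_snapshot('sales', 'orders', 123456789);
```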

Open table formats vs Open file formats

Table and file formats are different open-source elements of an open data lakehouse. Columnar open file formats like Parquet and ORC ensure data within an object gets written in ways that optimize query performance, while open table formats like Iceberg sit above the files and objects, providing a layer of rich metadata to enable analytics on the underlying data lake.
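
The two layers are chosen independently. As a small illustration (hypothetical names again), the table below is an Iceberg table whose underlying data files are written as ORC: the table format manages metadata and transactions, while the file format governs how bytes are laid out inside each data file.

```sql
CREATE TABLE lakehouse.sales.shipments (
    shipment_id BIGINT,
    order_id    BIGINT,
    shipped_at  TIMESTAMP(6)
)
WITH (format = 'ORC');  -- file format; the table format is still Iceberg
```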

| Open table format feature | Apache Iceberg | Delta Lake | Apache Hudi | Apache Hive |
| --- | --- | --- | --- | --- |
| Transaction support (ACID) | Yes | Yes | Yes | Limited |
| File format | Parquet, ORC, Avro | Parquet | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
| Schema evolution | Full | Partial | Full | Partial |
| Partition evolution | Yes | No | No | No |
| Data versioning | Yes | Yes | Yes | No |
| Time travel queries | Yes | Yes | Yes | No |
| Concurrency control | Optimistic locking | Optimistic locking | Optimistic locking | Pessimistic locking |
| Object store cost optimization | Yes | Yes | Yes | No |
| Community and ecosystem | Growing | Growing | Growing | Established |

Choosing the right open table format

The data industry has come to realize that customers should have direct, open access to the underlying files that store their data. Storing data in open formats, specifically Apache Iceberg, in an object storage lake has made this possible. This is ultimately good news for customers and helps to reduce vendor lock-in.

Starburst was founded on the idea that data should always belong to the customer. With recent announcements (Databricks acquiring Tabular, founded by the creators of Apache Iceberg, and Snowflake unveiling Polaris, an open-source implementation of the Iceberg REST catalog), there is now huge potential to improve price-performance by choosing the best engine for the job. Databricks and Snowflake have finally recognized this basic customer need.

However, Starburst believes in optionality. That’s only possible with a truly open data lakehouse, a.k.a. the Icehouse. An Icehouse architecture is based on Trino for high-performance, scalable SQL querying (read and write) and Iceberg for storage. It complements your Snowflake + Iceberg solution and can help you significantly lower your operating costs.

This cheat sheet will help you easily compare open table formats in one view, so that you can make the most informed decision on which table format is right for your business.
