
How Apache Iceberg Branching Transforms Data Management


Data lakehouse versioning and branching allow organizations to manage changes safely and efficiently. These features are popular and are a significant factor in the rise of lakehouse technology today.

To understand how Apache Iceberg branching and versioning work, it’s useful to compare them to Git branching in software development. In short, branching and versioning offer a safety net. Just as developers can experiment in a separate branch without affecting the main codebase, data engineers can create branches of datasets to test transformations or fixes without impacting production data.

Specifically, branching can be applied to several data cases, including:

  • Experimentation: Run transformations on a branch before merging changes into the main table.
  • Backfill Jobs: Isolate large data rewrites to avoid impacting production.
  • Safe Development: Allow multiple teams to work on the same dataset concurrently.
  • What-if Analysis: Test queries on branched data without affecting the production dataset.

Branching can also be combined with the Write-Audit-Publish (WAP) pattern: you stage changes in a branch, audit them, and then publish them to the main branch. This provides a robust workflow for managing complex data updates.
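The Write-Audit-Publish flow can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not Iceberg or Starburst API: the `refs` dictionary, the `write_audit_publish` function, and the `non_negative` audit rule are all hypothetical names chosen for illustration.

```python
# Toy model of Write-Audit-Publish: each branch name maps to a list of rows.
# All names are illustrative; nothing here is an Iceberg or Starburst API.

def write_audit_publish(refs, new_rows, audit):
    """Stage new_rows on an 'audit' branch, check them, then publish to main."""
    # Write: stage the change on a branch that starts from main's current state.
    refs["audit"] = refs["main"] + new_rows
    try:
        # Audit: validate the staged state before readers of main can see it.
        if not all(audit(row) for row in refs["audit"]):
            return False  # main is left untouched
        # Publish: point main at the audited state.
        refs["main"] = refs["audit"]
        return True
    finally:
        # Remove the staging branch whether or not the audit passed.
        del refs["audit"]


def non_negative(row):
    """Hypothetical audit rule: the data value must not be negative."""
    return row[0] >= 0


refs = {"main": [(10, "2025-01-01"), (20, "2025-01-02")]}
rejected = write_audit_publish(refs, [(-30, "2025-01-03")], non_negative)
accepted = write_audit_publish(refs, [(30, "2025-01-03")], non_negative)
print(rejected, accepted, len(refs["main"]))
```

The point of the sketch is the ordering: readers of main only ever see a state that has already passed the audit.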

Want to know more? Let’s look at some specific scenarios where Iceberg branching is useful for data workloads. 

What is Branching in Apache Iceberg?

In Apache Iceberg, branches are named references to a table’s state, similar to branches in Git. They allow you to isolate changes, experiment safely, and manage multiple versions of a dataset simultaneously. A branch is comparable to CLONE in Snowflake or on Databricks Delta Lake tables, but creating one does not produce a metadata copy. As a result, creating a branch completes quickly, even on large tables.

Branches, Snapshots, and Tags

Branches differ from snapshots and tags:

  • Snapshots capture the table state at a specific point in time and are immutable.
  • Tags are fixed pointers to a particular snapshot for reference.

By contrast, branches are movable references that advance as new commits are made, giving you a flexible way to manage table changes over time.
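To make the distinction concrete, here is a minimal Python sketch of the three concepts. It is a simplified model, not Iceberg's metadata format or API: the `Snapshot` and `Table` classes and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a snapshot cannot be modified once created
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    rows: tuple


class Table:
    """Hypothetical table holding snapshots plus named references to them."""

    def __init__(self, rows):
        self.snapshots = {1: Snapshot(1, None, tuple(rows))}
        self.branches = {"main": 1}  # movable references
        self.tags = {}               # fixed references

    def commit(self, branch, rows):
        """Create a new snapshot and advance the branch to it."""
        new_id = max(self.snapshots) + 1
        parent_id = self.branches[branch]
        self.snapshots[new_id] = Snapshot(new_id, parent_id, tuple(rows))
        self.branches[branch] = new_id
        return new_id

    def read(self, ref):
        """Resolve a branch or tag name to the rows of its snapshot."""
        snapshot_id = self.branches[ref] if ref in self.branches else self.tags[ref]
        return self.snapshots[snapshot_id].rows


table = Table([10, 20])
table.tags["v1"] = table.branches["main"]      # tag: stays on snapshot 1
table.branches["dev"] = table.branches["main"]  # branch: starts at snapshot 1
table.commit("dev", [10, 20, 30])               # only dev moves forward
print(table.read("v1"), table.read("main"), table.read("dev"))
```

After the commit, the tag and main still resolve to the original snapshot, while dev resolves to the new one.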

How does this work in practice? Let’s check it out. 

Working with Branches

Let’s look at an example of Iceberg branching in practice using Starburst. These capabilities are available immediately in Starburst Galaxy and on Starburst Enterprise release 476-e. Notably, the functionality works alongside existing Starburst access controls.

This example demonstrates how to overwrite an older partition using branching, which is particularly useful for backfill scenarios.

Since Starburst does not support the INSERT OVERWRITE syntax for replacing existing data in a table or partition, we previously had to rely on a MERGE statement without branching.

With the new syntax, however, we can now effectively simulate INSERT OVERWRITE in a much cleaner way by using DELETE, INSERT, and FAST FORWARD statements.

Prepare data

Let’s create a simple table with five partitions:

CREATE TABLE branching (
   data INT,
   part DATE) 
WITH (
   partitioning = ARRAY['part']
);
INSERT INTO branching VALUES 
(10,  DATE '2025-01-01'),
(20,  DATE '2025-01-02'),
(-30, DATE '2025-01-03'),
(40,  DATE '2025-01-04'),
(50,  DATE '2025-01-05');

How to create a branch

The data for 2025-01-03 appears to be incorrect. Let’s create a new branch to correct it:

CREATE BRANCH dev IN TABLE branching;
SHOW BRANCHES FROM TABLE branching;

Branch
dev
main

The new branch is listed alongside main. Next, replace the incorrect row on the dev branch and query that branch:

DELETE FROM branching @ dev WHERE part = DATE '2025-01-03';
INSERT INTO branching @ dev VALUES (30, DATE '2025-01-03');
SELECT * FROM branching FOR VERSION AS OF 'dev';

data part
10 2025-01-01
20 2025-01-02
30 2025-01-03
40 2025-01-04
50 2025-01-05

The main branch still returns the results from before the DELETE and INSERT statements were executed. Both of the following queries read from main and return the same rows:

SELECT * FROM branching;
SELECT * FROM branching FOR VERSION AS OF 'main';
data part
10 2025-01-01
20 2025-01-02
-30 2025-01-03
40 2025-01-04
50 2025-01-05

Updating and committing changes to a branch

The changes haven’t been applied to the main branch yet.

To update the main branch, we can use the ALTER BRANCH … FAST FORWARD statement. Note that this statement fails if the main branch has received new commits since the dev branch was created, because main is then no longer an ancestor of dev.

ALTER BRANCH main IN TABLE branching FAST FORWARD TO dev;

Now we can check the fix in the main branch:

SELECT * FROM branching;
data part
10 2025-01-01
20 2025-01-02
30 2025-01-03
40 2025-01-04
50 2025-01-05
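The ancestry rule behind fast-forwarding can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not how Starburst or Iceberg implement the check: `parents`, `is_ancestor`, and `fast_forward` are hypothetical names, and snapshots are reduced to integer ids with a single parent link.

```python
def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    current = descendant
    while current is not None:
        if current == ancestor:
            return True
        current = parents[current]
    return False


def fast_forward(branches, parents, target, source):
    """Move `target` to the head of `source`, but only along a straight line."""
    if not is_ancestor(parents, branches[target], branches[source]):
        raise ValueError(f"cannot fast-forward {target} to {source}: histories diverged")
    branches[target] = branches[source]


# Case 1: main has not changed since dev was created (snapshot 1 -> 2 -> 3).
parents = {1: None, 2: 1, 3: 2}  # snapshot id -> parent snapshot id
branches = {"main": 1, "dev": 3}
fast_forward(branches, parents, "main", "dev")
main_after = branches["main"]

# Case 2: main and dev each committed on top of snapshot 3, so they diverged.
parents.update({4: 3, 5: 3})
branches = {"main": 4, "dev": 5}
try:
    fast_forward(branches, parents, "main", "dev")
    diverged = False
except ValueError:
    diverged = True

print(main_after, diverged)
```

In the first case main simply catches up with dev. In the second, main's head is not in dev's history, so there is no way to move main forward without discarding its own commit, and the operation is refused.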

[Image: Iceberg branching data architecture]

Branch cleanup

Dropping stale branches is important to prevent retaining unnecessary data. You can remove a branch from a table using the DROP BRANCH statement:

DROP BRANCH dev IN TABLE branching;
SHOW BRANCHES FROM TABLE branching;
Branch
main
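Why a stale branch retains data can be shown with a small reachability sketch in Python. This is a deliberately simplified model: `retained_snapshots`, `parents`, and `refs` are hypothetical names, and the model ignores age-based expiration, which real snapshot cleanup also takes into account.

```python
def retained_snapshots(parents, refs):
    """Return the ids of snapshots reachable from any branch head via parent links."""
    keep = set()
    for head in refs.values():
        current = head
        # Walk back through history, stopping at already-visited snapshots.
        while current is not None and current not in keep:
            keep.add(current)
            current = parents[current]
    return keep


# Snapshot 2 was committed on main; snapshot 3 exists only on an unmerged dev branch.
parents = {1: None, 2: 1, 3: 1}  # snapshot id -> parent snapshot id
refs = {"main": 2, "dev": 3}

before = retained_snapshots(parents, refs)
del refs["dev"]  # analogous to dropping the branch
after = retained_snapshots(parents, refs)

print(sorted(before), sorted(after))
```

In this model, snapshot 3 stays reachable only because dev points at it. In the walkthrough above, dev was already fast-forwarded into main, so dropping it releases nothing extra. The cost arises with branches that were abandoned without being merged.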

Challenges and future work

While branching in Iceberg is already powerful, there are a few limitations to consider. Currently, features such as catalog-level branching, tagging, replacing or renaming branches, and cherry-picking commits are not yet supported in Starburst. Advanced retention policies, including setting min-snapshots-to-keep, max-snapshot-age-ms, or max-ref-age-ms, are also unavailable at this time.

Why Iceberg branching matters more than ever  

Branching in Apache Iceberg makes data lakehouses safer, more flexible, and easier to manage. By isolating changes, it enables experimentation without risk, simplifies large backfill jobs, and supports collaborative development across teams. It also lets analysts run what-if queries without touching production data.

Starburst: The best way to use Iceberg 

As part of our ongoing commitment to Iceberg, Starburst is here to help. Our best-in-class query engine is designed to make handling Iceberg workloads easy, scalable, and efficient. Iceberg branching is part of this effort, and one more reason to choose Starburst for all Iceberg workloads. 
