Exploring Iceberg transactions and metadata
Lester Martin
Developer Advocate
Starburst


Apache Iceberg is more popular than ever before. It powers the data architecture that drives data analytics, data applications, and Artificial Intelligence (AI) workloads. This article provides a behind-the-scenes look at what happens when Iceberg transactions modify metadata. I will discuss both the initial Data Definition Language (DDL) needed to create a table and a variety of Data Manipulation Language (DML) statements needed to change the table’s data.
You can follow along with the article as I go, recreating the results using your own cluster.
Prerequisites
A few things first. This article is primarily written for data engineers. The reader is assumed to have an existing conceptual understanding of the Apache Iceberg modern table format. If you are new to Iceberg, please watch the short Understanding Apache Iceberg architecture video and then review the initial Iceberg blog series, which starts with Introduction to Apache Iceberg in Trino.
Environmental setup
To perform the same steps that I do, you will need a running Trino cluster configured with the Iceberg connector set to persist tables on an object store. The best way to do this is to run the Starburst Galaxy free trial and then configure it to access AWS S3, as detailed in the Starburst Galaxy: Getting Started tutorial.
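If you prefer to manage your own Trino cluster instead of using Galaxy, a minimal Iceberg catalog configuration might look like the sketch below. The property names come from the Trino Iceberg connector, but the catalog file name, the choice of AWS Glue as the Iceberg catalog, and the region are assumptions you would replace with your own values.

# etc/catalog/mycloud.properties (hypothetical catalog named to match the examples that follow)
connector.name=iceberg
iceberg.file-format=PARQUET
# assumes AWS Glue as the Iceberg catalog; a Hive Metastore or REST catalog works as well
iceberg.catalog.type=glue
# assumes Trino's native S3 file system support; set the region to match your bucket
fs.native-s3.enabled=true
s3.region=us-east-1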
Table DDL
First, I’m going to discuss table Data Definition Language (DDL). This will allow you to see the process from the beginning of the workflow.
Step 1 – Create the table
Using the following SQL command, create an Iceberg table and verify it exists. The results should be similar to the image below.
CREATE SCHEMA mycloud.mdatablog;
USE mycloud.mdatablog;
CREATE TABLE try_it (id int, a_val varchar, b_val varchar, prt varchar)
  WITH (type='ICEBERG', format='PARQUET', partitioning = ARRAY['prt']);
DESC try_it;
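As an optional sanity check, you can also have Trino echo the table definition back to you; for Iceberg tables this typically includes the format, the partitioning, and the table's storage location on S3 (your output will differ from mine).

SHOW CREATE TABLE try_it;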
Step 2 – Review raw metadata files
In the example presented above, the mycloud catalog aligns with a specific folder inside the S3 bucket used to create the catalog. Underneath that, an mdatablog directory was created for the schema of the same name. Additionally, another folder whose name begins with try_it was created for the actual table.
Notice that a folder named metadata was created within the table root. This folder houses the initial metadata content that represents the first snapshot created when building this empty table.
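If you would rather browse from a terminal than the S3 console, a recursive listing of the schema folder shows the same layout. The command below is just a sketch; your bucket name and the random suffix on the table folder will be different.

aws s3 ls s3://MYBUCKETNAME/mygalaxy/mdatablog/ --recursive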
Step 3 – Review Iceberg metadata architecture
Now it’s time to review the Iceberg metadata architecture to ensure that you understand what’s going on. The first important thing to note is that the JSON file in the list above is referred to as a metadata file in the architectural diagram below. The AVRO file is a manifest list file. This follows the general Iceberg architectural approach noted below.
Step 4 – Download JSON metadata file
Now it’s time to download the JSON metadata file. Notice that the listing below only shows some of the key elements, and some UPPER-CASE COMMENTS were added to help explain what is present.
{
  /* IDENTIFIES THE ROOT FOLDER LOCATION THAT CONTAINS THE DATA AND METADATA FILES
     WHICH MAKE UP THE ICEBERG TABLE */
  "location" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4",

  /* schemas IS AN ARRAY OF SCHEMA DEFINITIONS AND THERE IS ONLY ONE AT THIS TIME --
     THAT SCHEMA (WITH ID OF '0') IS IDENTIFIED AS THE 'CURRENT SCHEMA' */
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "a_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "b_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "prt",
      "required" : false,
      "type" : "string"
    } ]
  } ],

  /* partition-specs IS AN ARRAY OF KNOWN PARTITIONING SPECIFICATIONS AND THERE IS ONLY ONE
     AT THIS TIME -- THAT PARTITION SPECIFICATION (WITH ID OF '0') IS IDENTIFIED AS THE
     'CURRENT PARTITIONING SPECIFICATION' */
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "prt",
      "transform" : "identity",
      "source-id" : 4,
      "field-id" : 1000
    } ]
  } ],

  /* THE FIRST LINE IDENTIFIES THE CURRENT SNAPSHOT (aka VERSION) ID -- THIS IS FOLLOWED BY
     AN ARRAY OF KNOWN SNAPSHOTS (THERE IS ONLY 1 AT THIS TIME) */
  "current-snapshot-id" : 3469773113442948971,
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 3469773113442948971,
    "timestamp-ms" : 1735237123094,
    "summary" : {
      "total-records" : "0",
      "total-files-size" : "0",
      "total-data-files" : "0",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    /* THE FOLLOWING IDENTIFIES THE LOCATION OF THE 'MANIFEST LIST' FILE AND THAT THE DATA IT
       REFERENCES LEVERAGES THE SCHEMA IDENTIFIED EARLIER IN THIS JSON */
    "manifest-list" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4/metadata/snap-3469773113442948971-1-b3a30d93-49c3-4768-b6da-438c9eb11eb5.avro",
    "schema-id" : 0
  } ]
}
Step 5 – Locate key metadata identifiers
As you can tell, the metadata file identifies three very important things about the table.
- The current version of the table: current-snapshot-id
- The structure of the table itself: current-schema-id and default-spec-id
- The location of the single manifest list file that will provide more specifics on the actual files that make up the table’s content: manifest-list
Step 6 – Unpacking logical metadata tables
At this point, it would seem logical to crack open that AVRO file above, but let’s try another approach. Iceberg supports a concept called metadata tables that allows you to use SQL to query the details from the various files in the metadata folder.
We’ll cover more advanced concepts later in the post. For now, you can run the following query and check the first row to find the current snapshot ID.
SELECT * FROM "try_it$history" ORDER BY made_current_at DESC;
Step 7 – Validate initial snapshot data
Notice that this snapshot_id column value aligns with the earlier current-snapshot-id value in the metadata file.
The DESC try_it; command already gave us the current column names and types. Run the following SQL to get information on the partitions.
SELECT * FROM "try_it$partitions";
It returned nothing because there is no data in the table to report on yet. We will look at this again as we start changing things.
Run the following to see the file name of the linked manifest list.
SELECT snapshot_id, manifest_list FROM "try_it$snapshots" ORDER BY committed_at DESC LIMIT 1;
Notice the value from the manifest_list column aligns with the manifest-list value in the metadata file.
Step 8 – Exploring manifest lists
That manifest list references 0..m manifest files, and each of those references 1..m actual data files. In this post, we won’t explore those AVRO files directly; instead, we can run the following query to determine the physical files represented in that hierarchy for the current snapshot.
SELECT * FROM "try_it$files";
Yep, you guessed it — it returns nothing as there is no actual data in the table yet.
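If you do want to peek one level deeper than the manifest list without cracking open AVRO files, Trino also exposes a $manifests metadata table. Its exact columns vary a bit by Trino version, so treat this as a sketch; like $files, it has little to report until data arrives.

SELECT * FROM "try_it$manifests";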
DML use cases
Now that we have a table set up and we understand what’s happening behind the scenes a bit more, let’s explore some CRUD (Create, Retrieve, Update, Delete) use cases as expressed in Data Manipulation Language (DML).
Txn 1: INSERT single row
Step 9 – Write an initial record to the table
Use the following SQL code to insert a single row into the table.
INSERT INTO try_it VALUES (1, 'noise', 'bogus', 'p1');
select * from try_it;
Step 10 – Verify new JSON file
A few things should have happened. Let’s start by verifying a new JSON file was created on S3 as shown in the highlighted file below.
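If you prefer to confirm this from SQL rather than the S3 console, recent Trino versions also expose a $metadata_log_entries metadata table that lists each metadata JSON file as it is written; availability and column names depend on your Trino/Galaxy version, so treat this as a sketch.

SELECT * FROM "try_it$metadata_log_entries";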
Step 11 – Verify new manifest list
After downloading this new metadata file, I determined its current-snapshot-id is identified as 4900337259264042703 and that the last bits of the file name for the manifest-list property are e8b5c8989a75.avro (of course, your values will be different if running these scenarios on your own). I verified that the new manifest list is present on S3, too.
Step 12 – Determine file created for single record
From here we can now determine the actual data file that was created to store our single record.
SELECT * FROM "try_it$files";
The value of the file_path column ended with /data/prt=p1/20241226_203553_21880_mejih-75b2a829-9205-4d68-b0f2-4ac1783e243c.parquet in my scenario. That lets us know a few things.
- The file was stored in the /data folder, instead of the /metadata folder, just under the table’s root directory
- Its partition column value further created a /prt=p1 folder
- The file itself is in Parquet format
Step 13 – Review output
Taking a peek in the Parquet file itself shows the expected contents when viewed as JSON.
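If you want to take that peek yourself, one lightweight option (an assumption on my part; any Parquet-aware tool will do) is to download the file and read it locally with DuckDB. The file name below is a placeholder for whatever you downloaded.

-- run in a local DuckDB shell against the downloaded file
SELECT * FROM read_parquet('downloaded-data-file.parquet');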
Pretty simple so far.
Txn 2: INSERT multiple rows across multiple partitions
Step 14 – Add two records with a single statement
Insert statements allow multiple rows to be added at once, and they all belong to a single ACID transaction identified by a single snapshot ID. This use case exercises that, but to make it a bit more fun, we will span more than one partition in a single transaction.
INSERT INTO try_it VALUES (2, 'noise', 'bogus', 'p2'), (3, 'noise', 'bogus', 'p3');
select * from try_it;
Step 15 – Verify files
Verify that two more files, each in their own new partition folder, are identified in the metadata tables.
SELECT file_path, partition, record_count FROM "try_it$files";
When you go directly to S3 you can find these two new folders/files and additionally verify that each of the Parquet files has a single record.
Step 16 – Verify snapshot
You can see this same information from the perspective of the partitions that make up the current snapshot.
SELECT * FROM "try_it$partitions";
Note: The data column shows all kinds of cool statistical information about each file. That is beyond the scope of this post, but this information is key to how Iceberg can perform at scale, as it doesn’t need to read the actual files when trying to determine if the data it is looking for might be in them.
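As a quick illustration, you can project a couple of those statistics columns straight from the $files metadata table. The exact column names (for example, lower_bounds and upper_bounds) can vary by Trino version, so treat this as a sketch.

SELECT file_path, record_count, lower_bounds, upper_bounds FROM "try_it$files";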
Step 17 – Review Snapshot metadata table
Take a closer look at the $snapshots metadata table to get a summary of what happened in the last transaction.
SELECT snapshot_id, summary FROM "try_it$snapshots" ORDER BY committed_at DESC LIMIT 1;
Step 18 – Review summary column
Here are the contents of the summary column.
For this use case’s transaction, here are the key confirmations of what happened.
- 2 new records were added for a total of 3 records
- 2 new partitions were created, each containing a new file
- 3 total files are present
Txn 3: UPDATE multiple rows across multiple partitions
Step 19 – Modify all records with a single statement
The fact that the underlying data lake files are immutable makes updates a bit tricky for table formats like Iceberg. In-place updates of the actual files can’t happen. Basically, Iceberg marks each record being updated as deleted and then does a net-new insert to account for what the updated record should look like at the end of the SQL statement. This section will show what that looks like behind the scenes.
To make the use case more interesting, we’ll make the update span records across multiple partitions so that we can see behavior similar to the prior use case, with a single transaction spanning the affected partitions.
Let’s start off with the SQL.
UPDATE try_it SET b_val = 'bogus2' WHERE a_val = 'noise';
select * from try_it;
Step 20 – Review summary
See what the summary column shows for the current snapshot.
SELECT snapshot_id, summary FROM "try_it$snapshots" ORDER BY committed_at DESC LIMIT 1;
Step 21 – Assess key confirmations
For this use case’s transaction, here are the key confirmations of what happened.
- 3 new delete files (files that reference records being updated) were added, with a single record in each; if all 3 records had been in the same partition, there would likely have been a single delete file referencing all 3 records
- 3 new regular data files were created, each with a single row representing an updated record (again, one per file because each record was in a different partition)
- A total of 9 files make up this snapshot: 6 data files (3 for the original inserts and 3 for the revised updates) and 3 delete files
Step 22 – Verify $files metadata table
Verify that the $files metadata table lists all 9 of these files.
SELECT file_path, partition, record_count FROM "try_it$files" ORDER BY partition;
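If you want the query itself to separate the delete files from the regular data files, the $files table also carries a content column for that purpose (0 for data files and 1 for position delete files, as I understand it; verify against your Trino version).

SELECT content, file_path, partition, record_count FROM "try_it$files" ORDER BY content, partition;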
Step 23 – Inspect file partition update files
Let’s think about it again. For each partition, there is a file that holds the original row inserted into it. Then, for the update, there is a delete file that identifies that record, since it can’t be updated directly. The third file is a total rewrite of the row with the updates plus all the values that did not need to change. Let’s check the files on S3 just for the prt=p3 partition as an example of what happened.
Step 24 – Review parquet file
First, here are the contents of the first Parquet file, ...8e8b3.parquet
Second, here is the delete file that identifies the first record in the file above that needs to be deleted.
Last, here is the third data file, which shows the full replacement with the updated column values.
As a reminder, only the value of b_val was changed.
Txn 4: UPDATE single row (leveraging partitioning)
Step 25 – Modify a row’s partitioning column
This use case calls out that we should use the partitioning column in the UPDATE statement’s predicate whenever possible. Doing so makes the work much easier by letting Iceberg look only at the folders that could possibly be affected instead of walking the full table’s contents.
UPDATE try_it SET b_val = 'bogus3' WHERE b_val = 'bogus2' AND prt = 'p2';
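One way to convince yourself that only the prt='p2' folder is in play is to look at the plan for a query using the same predicate. The plan output differs by version, but you should see the partition filter applied to the table scan rather than a full-table walk.

EXPLAIN SELECT * FROM try_it WHERE b_val = 'bogus2' AND prt = 'p2';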
Step 26 – Review snapshot column
The $snapshots.summary column now looks like this.
Step 27 – Understanding delete files in context
In a nutshell, this states that two new files were added (a delete file marking the single row being updated, plus a new data file with the rewritten row), and that only a single partition was affected. The total file count went up by 2 (one more data file for a total of 7, and an additional delete file bringing that count to 4) to land at a grand total of 11. We can verify that by querying the $files metadata table again.
Step 28 – Reviewing changes
The changes this time are in the prt=p2 partition, and you can see these 5 files in S3 as well.
Note: This concept of marking a row as needing to be deleted in one file and then adding the record back as a net-new row in another file is referred to as a merge-on-read strategy: it is fast for the write operation and defers the responsibility of pulling together the correct row representation to query time.
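Because merge-on-read defers that reconciliation work, delete files accumulate and readers pay the price over time. Trino offers a table procedure to compact small data files (and, depending on version, fold in outstanding deletes); here is a minimal sketch, assuming the optimize procedure is available in your deployment.

ALTER TABLE try_it EXECUTE optimize;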
Parting thoughts on Iceberg metadata
Data lake tables have evolved quite a bit from the very early days of the original Hive table format, and we should now expect ACID transactions as a base feature. That said, these two callouts should be stated clearly, at least at this point in early 2025.
- Iceberg transactions are single-statement / single-table (many records, of course) in nature — no BEGIN/COMMIT TRANSACTION statements
- While ideal for highly concurrent querying, Iceberg transactions are not designed for massive concurrent transactions beyond insert-only operations — not a general-purpose replacement for RDBMSs
There is PLENTY more to learn about Apache Iceberg, and I encourage you to return to the top of this post and read the additional articles in this Apache Iceberg metadata deep-dive series. For some great hands-on tutorials for Iceberg, head on over to Starburst Tutorials, as well.
Post originally published by Lester Martin as “Iceberg acid transactions with partitions (a behind the scenes perspective)” at Lester Martin (l11n) Blog.