Exploring Iceberg transactions and metadata

  • Lester Martin

    Developer Advocate

    Starburst


Apache Iceberg is more popular than ever before. It powers the data architecture that drives data analytics, data applications, and Artificial Intelligence (AI) workloads. This article provides a behind-the-scenes look at what happens when Iceberg transactions modify metadata. I will discuss both the initial Data Definition Language (DDL) needed to create a table and a variety of Data Manipulation Language (DML) statements needed to change the table’s data.

You can follow along with the article as I go, recreating the results using your own cluster.

Prerequisites

A few things first. This article is primarily written for data engineers. The reader is assumed to have an existing conceptual understanding of the Apache Iceberg modern table format. If you are new to Iceberg, please watch the short Understanding Apache Iceberg architecture video and then review the initial Iceberg blog series, which starts with Introduction to Apache Iceberg in Trino.

Environment setup

To perform the same steps that I do, you will need a running Trino cluster configured with the Iceberg connector set to persist tables on an object store. The best way to do this is to run the Starburst Galaxy free trial and then configure it to access AWS S3, as detailed in the Starburst Galaxy: Getting Started tutorial.

Table DDL

First, I’m going to discuss table Data Definition Language (DDL). This will allow you to see the process from the beginning of the workflow.

Step 1 – Create the table

Using the following SQL command, create an Iceberg table and verify it exists. The results should be similar to the image below.

CREATE SCHEMA mycloud.mdatablog;
USE mycloud.mdatablog;

CREATE TABLE try_it (id int, a_val varchar, b_val varchar, 
                     prt varchar)
WITH (type='ICEBERG', format='PARQUET', 
      partitioning = ARRAY['prt']);

DESC try_it;

Screenshot of DDL Iceberg metadata.

Step 2 – Review raw metadata files

In the example presented above, the mycloud catalog aligns with a specific folder inside the S3 bucket used to create the catalog. Underneath that, an mdatablog directory was created for the schema of the same name. Additionally, another folder whose name begins with try_it was created for the actual table.

File path location of Iceberg metadata.

Notice that a folder named metadata was created within the table root. This folder houses the initial metadata content that represents the first snapshot created when building this empty table.
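
If you would rather not dig through the bucket by hand, a quick sanity check is to ask Trino where it thinks the table lives. This is a minimal sketch using standard SQL; the output of SHOW CREATE TABLE for an Iceberg table should include a location property that points at the same table root folder shown above.

SHOW CREATE TABLE try_it;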

Step 3 – Review Iceberg metadata architecture

Now it’s time to review the Iceberg metadata architecture to ensure that you understand what’s going on. The first important thing to note is that the JSON file in the list above is referred to as a metadata file in the architectural diagram below. The AVRO file is a manifest list file. This follows the general Iceberg architectural approach noted below.

Step 4 – Download JSON metadata file

Now it’s time to download the JSON metadata file. Notice that the listing below only shows some of the key elements, and that UPPER-CASE COMMENTS were added to help explain what is present.

{
  /*
     IDENTIFIES THE ROOT FOLDER LOCATION THAT
     CONTAINS THE DATA AND METADATA FILES WHICH
     MAKE UP THE ICEBERG TABLE
  */
  "location" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4",

  /*
     schemas IS AN ARRAY OF SCHEMA DEFINITIONS
     AND THERE IS ONLY ONE AT THIS TIME -- THAT
     SCHEMA (WITH ID OF '0') IS IDENTIFIED AS 
     THE 'CURRENT SCHEMA'
  */
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "a_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "b_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "prt",
      "required" : false,
      "type" : "string"
    } ]
  } ],

  /*
     partition-specs IS AN ARRAY OF KNOWN
     PARTITIONING SPECIFICATIONS AND THERE IS
     ONLY ONE AT THIS TIME -- THAT PARTITION
     SPECIFICATION (WITH ID OF '0') IS 
     IDENTIFIED AS THE 'CURRENT PARTITIONING
     SPECIFICATION'
  */
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "prt",
      "transform" : "identity",
      "source-id" : 4,
      "field-id" : 1000
    } ]
  } ],

  /*
     THE FIRST LINE IDENTIFIES THE CURRENT 
     SNAPSHOT (aka VERSION) ID -- THIS IS 
     FOLLOWED BY AN ARRAY OF KNOWN SNAPSHOTS 
     (THERE IS ONLY 1 AT THIS TIME)
  */
  "current-snapshot-id" : 3469773113442948971,
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 3469773113442948971,
    "timestamp-ms" : 1735237123094,
    "summary" : {
      "total-records" : "0",
      "total-files-size" : "0",
      "total-data-files" : "0",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },

  /*
     THE FOLLOWING IDENTIFIES THE LOCATION OF
     THE 'MANIFEST LIST' FILE AND THAT THE
     DATA IT REFERENCES LEVERAGES THE SCHEMA
     IDENTIFIED EARLIER IN THIS JSON
  */
    "manifest-list" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4/metadata/snap-3469773113442948971-1-b3a30d93-49c3-4768-b6da-438c9eb11eb5.avro",
    "schema-id" : 0
  } ]
}

Step 5 – Locate key metadata identifiers

As you can tell, the metadata file identifies 3 very important things about the table.

  1. The current version of the table; current-snapshot-id
  2. The structure of the table itself; current-schema-id & default-spec-id
  3. The location of the single manifest list file that will provide more specifics on the actual files that make up the table’s content; manifest-list

Step 6 – Unpacking logical metadata tables

At this point, it would seem logical to crack open that AVRO file above, but let’s try another approach. Iceberg supports a concept called metadata tables that allows you to use SQL to query the details from the various files in the metadata folder.

We’ll cover more advanced concepts later in the post. For now, you can run the following query and check the first row to find the current snapshot ID.

SELECT * FROM "try_it$history"
 ORDER BY made_current_at DESC;

Step 7 – Validate initial snapshot data

Notice that this snapshot_id column value aligns with the earlier current-snapshot-id value in the metadata file.

The DESC try_it; command already gave us the current column names and types. Run the following SQL to get information on the partitions.

SELECT * FROM "try_it$partitions";

It returned nothing because there is no data in the table to report on yet. We will look at this again as we start changing things.

Run the following to see the file name of the linked manifest list.

SELECT snapshot_id, manifest_list
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC
 LIMIT 1;

Notice the value from the manifest_list column aligns with the manifest-list value in the metadata file.

Step 8 – Exploring manifest lists

That manifest list references 0..m “manifest files” and each of those will reference 1..m actual data files. In this post, we won’t explore those AVRO files directly as we can run the following query to determine the physical files that are represented in that graph for the current snapshot.

SELECT * FROM "try_it$files";

Yep, you guessed it — it returns nothing as there is no actual data in the table yet.
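
If you do want to peek at the manifest-file layer without downloading any AVRO, the $manifests metadata table surfaces it. A minimal sketch follows; the exact column set can vary a bit between Trino versions, so adjust the column list if needed.

SELECT path, added_data_files_count
  FROM "try_it$manifests";

For our brand-new table, this should come back empty as well, since the manifest list does not reference any manifest files yet.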

DML use cases

Now that we have a table set up and we understand what’s happening behind the scenes a bit more, let’s explore some CRUD (Create, Retrieve, Update, Delete) use cases as expressed in Data Manipulation Language (DML).

Txn 1: INSERT single row

Step 9 – Write an initial record to the table

Use the following SQL code to insert a single row into the table.

INSERT INTO try_it VALUES (1, 'noise', 'bogus', 'p1');
select * from try_it;

Step 10 – Verify new JSON file

A few things should have happened. Let’s start by verifying a new JSON file was created on S3 as shown in the highlighted file below.

Step 11 – Verify new manifest list

After downloading this new metadata file, I determined its current-snapshot-id is identified as 4900337259264042703 and that the last bits of the file name for the manifest-list property are e8b5c8989a75.avro (of course, your values will be different if you run these scenarios on your own). I verified that the new manifest list is present on S3, too.
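
As an alternative to downloading the JSON by hand, recent Trino versions expose the metadata files themselves through the $metadata_log_entries metadata table. A hedged sketch, assuming that table is available in your version:

SELECT file, latest_snapshot_id
  FROM "try_it$metadata_log_entries"
 ORDER BY timestamp DESC
 LIMIT 1;

The file column should point at the newest JSON metadata file, and latest_snapshot_id should match the current-snapshot-id value described above.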

Step 12 – Determine file created for single record

From here we can now determine the actual data file that was created to store our single record.

SELECT * FROM "try_it$files";

The value of the file_path column ended with /data/prt=p1/20241226_203553_21880_mejih-75b2a829-9205-4d68-b0f2-4ac1783e243c.parquet in my scenario. That tells us a few things.

  • The file was stored in the /data folder instead of the /metadata folder just under the table’s root directory
  • Its partition column value further created a /prt=p1 folder
  • The file itself is in Parquet format (which you can also confirm from the metadata, as shown below)
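
Here is a minimal sketch that confirms the format from metadata rather than from the file name; file_format is a standard column of the $files metadata table.

SELECT file_path, file_format, record_count
  FROM "try_it$files";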

Step 13 – Review output

Taking a peek in the Parquet file itself shows the expected contents when viewed as JSON.

Pretty simple so far.

Txn 2: INSERT multiple rows across multiple partitions

Step 14 – Add two records with a single statement

Insert statements allow multiple rows to be added at once, and they all belong to a single ACID transaction identified by a single snapshot ID. This use case exercises that, and to make it a bit more fun, the rows span more than one partition within the transaction.

INSERT INTO try_it VALUES
(2, 'noise', 'bogus', 'p2'),
(3, 'noise', 'bogus', 'p3');
select * from try_it;

Step 15 – Verify files

Verify that two more files, each in their own new partition folder, are identified in the metadata tables.

SELECT file_path, partition, record_count
  FROM "try_it$files";

When you go directly to S3 you can find these two new folders/files and additionally verify that each of the Parquet files has a single record.

Step 16 – Verify snapshot

You can see this same information from the perspective of the partitions that make up the current snapshot.

SELECT * FROM "try_it$partitions";

Note: The data column shows all kinds of cool statistical information about each file. That’s beyond the scope of this post, but this information is key to how Iceberg performs at scale, since it doesn’t need to read the actual data files to determine whether the data it is looking for might be in them.
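
If you want to see some of those statistics for yourself, the $files metadata table exposes per-file, per-column details. A sketch follows; the exact set of statistics columns varies by Trino version.

SELECT file_path, record_count, null_value_counts,
       lower_bounds, upper_bounds
  FROM "try_it$files";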

Step 17 – Review Snapshot metadata table

Take a harder look at the $snapshots metadata table to get a summary of what happened in the last transaction.

SELECT snapshot_id, summary
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC 
 LIMIT 1;

Step 18 – Review summary column

Here are the contents of the summary column.

For this use case’s transaction, here are the key confirmations of what happened.

  • 2 new records were added for a total of 3 records
  • 2 new partitions were created; each with a new file in them
  • 3 total files are present

Txn 3: UPDATE multiple rows across multiple partitions

Step 19 – Modify all records with a single statement

The fact that the underlying data lake files are immutable makes updates a bit tricky for table formats like Iceberg. In-place updates of the actual files can’t happen. Basically, Iceberg marks each record being updated as deleted and then does a net-new insert to account for what the updated record should look like at the end of the SQL statement. This section will show what that looks like behind the scenes.

To make the use case more interesting, we’ll make the update span records across multiple partitions so that, as in the prior use case, we can see a single transaction (snapshot) spanning all of the affected partitions.

Let’s start off with the SQL.

UPDATE try_it SET b_val = 'bogus2'
 WHERE a_val = 'noise';
select * from try_it;

Step 20 – Review summary

See what the summary column shows for the current snapshot.

SELECT snapshot_id, summary
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC
 LIMIT 1;

Step 21 – Assess key confirmations

For this use case’s transaction, here are the key confirmations of what happened.

  • 3 new delete files (these are files that reference records that are being updated) were added with a single record in each; if all 3 records had been in the same partition, there would likely have been a single delete file that referenced all 3 records
  • 3 new regular data files were created; each with a single row in them representing the updated rows — again, so many because each record was in a different partition
  • A total of 9 data files make up this snapshot; 6 data files (3 for the original inserts and 3 for the revised updates) and 3 delete files

Step 22 – Verify $files metadata table

Verify that the $files metadata table lists all 9 of these files.

SELECT file_path, partition, record_count
  FROM "try_it$files" ORDER BY partition;

Step 23 – Inspect file partition update files

Let’s think about it again. For each partition, there is a file that has the original row inserted into it. Then for the update, there is a delete file that identifies the record as it can’t be updated directly. The third file is a total rewrite of the row with the updates and all the values that did not need to be changed. Let’s check the files on S3 just for the prt=p3 partition as an example of what happened.

Step 24 – Review parquet file

First, here are the contents of the first Parquet file; ...8e8b3.parquet

Second, here is the delete file that identifies the first record in the file above that needs to be deleted.

Last, here is the third data file, which shows the full replacement with the updated column values.

As a reminder, only the value of b_val was changed.
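
Because the earlier snapshots and their files are still present, you can also confirm the pre-update values with a time-travel query. This is a sketch using the snapshot ID from Txn 1 above; substitute a snapshot_id from your own "try_it$history" output.

SELECT * FROM try_it FOR VERSION AS OF 4900337259264042703;

This reads the table as it existed after the first insert, so b_val still shows the original 'bogus' value.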

Txn 4: UPDATE single row (leveraging partitioning)

Step 25 – Modify a row’s partitioning column

This use case calls out that we should use the partitioning column in the update statement whenever possible. That lets Iceberg look only at the folders that could possibly be affected instead of walking the full table’s contents, which makes the work effort much smaller.

UPDATE try_it SET b_val = 'bogus3'
 WHERE b_val = 'bogus2' AND prt = 'p2';
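
If you want to confirm that the partition predicate actually limits the work, you can ask Trino for the query plan. A hedged sketch; the plan format varies by version, but you should see the prt = 'p2' filter applied to the table scan rather than a full-table walk.

EXPLAIN
UPDATE try_it SET b_val = 'bogus3'
 WHERE b_val = 'bogus2' AND prt = 'p2';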

Step 26 – Review snapshot column

The $snapshots.summary column now looks like this.

Step 27 – Understanding delete files in context

In a nutshell, this states that two new files were added (a delete file marking the single row being updated and a data file holding its new version), and that only a single partition was affected. The total file count went up by 2 (one more delete file for a total of 4, and one more data file, bringing that count to 7) to land at a grand total of 11. We can verify that by querying the $files metadata table again.

Step 28 – Reviewing changes

The changes this time are in the prt=p2 partition and you can see these 5 files in S3 as well.

Note: This concept of marking a row as needing to be deleted in one file and then adding the record back as a net-new row in another file is referred to as a Merge On Read strategy. It is fast for the update operation and defers the responsibility of pulling together the correct row representation to query time.
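
If the read-time merge cost ever becomes a concern, the small data and delete files can be compacted back into larger data files with Trino’s table maintenance procedure. A hedged sketch; exactly how delete files are handled during compaction depends on the Trino version in use.

ALTER TABLE try_it EXECUTE optimize;

After running it, the $files metadata table for the new snapshot should show fewer, larger files.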

Parting thoughts on Iceberg metadata

Data lake tables have evolved quite a bit from the very early days of the original Hive table format, and we should now expect ACID transactions as a base feature. That said, these two callouts should be stated clearly, at least at this point in early 2025.

  1. Iceberg transactions are single-statement / single-table (many records, of course) in nature — no BEGIN/COMMIT TRANSACTION statements
  2. While ideal for highly concurrent querying, Iceberg transactions are not designed for massive concurrent transactions beyond insert-only operations — not a general-purpose replacement for RDBMSs

There is PLENTY more to learn about Apache Iceberg, and I encourage you to return to the top of this post and read the additional articles in this Apache Iceberg metadata deep-dive series. For some great hands-on tutorials for Iceberg, head on over to Starburst Tutorials, as well.


Post originally published by Lester Martin as “Iceberg acid transactions with partitions (a behind the scenes perspective)” at Lester Martin (l11n) Blog.