
Apache Iceberg’s architecture retains multiple snapshots of a table. These snapshots provide version control, enabling features such as time-travel queries and rollbacks (both sketched just below). This flexibility is shown in Visualizing Iceberg Snapshot Linear History. There’s a belief out there that keeping a long tail of prior versions around hurts query performance, but it actually doesn’t.
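Both of those features are a query clause and a stored procedure away. Here is a minimal PySpark sketch, assuming a Spark session already wired to an Iceberg catalog named `lake`, a hypothetical `db.orders` table, and a made-up snapshot id:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime is on the classpath and a catalog named "lake"
# is already configured (e.g., via spark-defaults.conf); "db.orders" and the
# snapshot id below are purely illustrative.
spark = SparkSession.builder.getOrCreate()

# Time travel: read the table as of an older snapshot or a point in time.
spark.sql("SELECT * FROM lake.db.orders VERSION AS OF 6182416452484929671").show()
spark.sql("SELECT * FROM lake.db.orders TIMESTAMP AS OF '2024-06-01 00:00:00'").show()

# Rollback: make an older snapshot the current version again.
spark.sql("""
  CALL lake.system.rollback_to_snapshot(
    table       => 'db.orders',
    snapshot_id => 6182416452484929671)
""")
```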
Most queries simply work against the current version: the query engine gets the location of the current metadata file from the Iceberg catalog and never has to touch the other metadata files hanging around to support that long line of prior snapshots. The real consequence of keeping all those snapshots is the footprint of the underlying data files that must remain on your data lake to support each and every older version.
This post is focused on showing how quickly you can end up storing 2-10 times (or more) the data size of your current version, and how expiring snapshots on a regular schedule keeps that footprint within an acceptable range. The cloud providers will happily let you keep as much data as you want, but they do keep asking us to pay for it.
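One way to see that multiplier on your own tables is to compare the bytes referenced by the current snapshot against the bytes referenced by any live snapshot, using Iceberg’s `files` and `all_data_files` metadata tables. A rough sketch, again assuming the `lake` catalog and hypothetical `db.orders` table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# Bytes referenced by the current snapshot only.
current_bytes = spark.sql("""
  SELECT sum(file_size_in_bytes) AS bytes
  FROM lake.db.orders.files
""").first()["bytes"]

# Bytes referenced by *any* live snapshot, i.e., what must stay on the data lake
# until those snapshots are expired (distinct paths, since snapshots share files).
all_bytes = spark.sql("""
  SELECT sum(file_size_in_bytes) AS bytes
  FROM (SELECT DISTINCT file_path, file_size_in_bytes
        FROM lake.db.orders.all_data_files) AS f
""").first()["bytes"]

print(f"current version: {current_bytes / 1e12:.2f} TB")
print(f"all snapshots:   {all_bytes / 1e12:.2f} TB "
      f"({all_bytes / current_bytes:.1f}x the current version)")
```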
Creating new versions of Iceberg snapshots
v1: Initial inserts
In this logical scenario, version 1 of an Iceberg table is created by inserting 1.5TB of data spread across 3 files of 500GB each.
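A hedged sketch of what that first version might look like, with a made-up schema and a hypothetical `lake.staging.orders_raw` source standing in for the real ingestion pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# v1: create the table (format-version 2 is what enables row-level delete files)
# and load the initial ~1.5TB from a made-up staging source.
spark.sql("""
  CREATE TABLE IF NOT EXISTS lake.db.orders (
    order_id   BIGINT,
    customer   STRING,
    amount     DECIMAL(10,2),
    order_date DATE)
  USING iceberg
  TBLPROPERTIES ('format-version' = '2')
""")

spark.sql("INSERT INTO lake.db.orders SELECT * FROM lake.staging.orders_raw")
```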
v2: Delete a few rows
When a delete statement removes 3 rows from file A and another row from file C, the new version references the same data files as version 1, plus a small “delete file” that records the row positions of the 4 deleted records. The good news is that you still only need to store roughly 1.5TB of data.
This is because the data lake files themselves are immutable (they cannot be updated once created, only deleted), and this table’s DML operations use a Merge-on-Read (MoR) strategy instead of Copy-on-Write (CoW) to account for deletes and updates. Need a primer? Here’s more on MoR vs CoW.
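A sketch of both halves of this step: pinning the hypothetical table to merge-on-read for row-level operations, then running the version-2 delete and inspecting the resulting delete file via the `delete_files` metadata table (the order ids are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# Pin row-level operations to merge-on-read (copy-on-write is the alternative).
spark.sql("""
  ALTER TABLE lake.db.orders SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read')
""")

# v2: the row-level delete writes a small delete file instead of rewriting
# the 500GB data files.
spark.sql("DELETE FROM lake.db.orders WHERE order_id IN (101, 102, 103, 104)")

# The resulting positional delete file(s) show up in the delete_files metadata table.
spark.sql("""
  SELECT file_path, record_count, file_size_in_bytes
  FROM lake.db.orders.delete_files
""").show(truncate=False)
```

These write modes can also be set in the CREATE TABLE statement; either way, they only take effect on format-version 2 tables.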
v3: Compaction
Iceberg has a cool feature for tackling compaction (i.e., replacing many smaller files with fewer, larger ones). Compaction can also apply any outstanding delete files, folding the surviving rows into the newly rewritten data files that replace the older ones. In the scenario below, the process created new versions of files A & C without the rows that were logically deleted in version 2.
At this point, we have 2.5TB of data on the lake, even though version 3 only references 1.5TB of data files.
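Compaction is typically kicked off with the `rewrite_data_files` procedure; the option shown below asks it to also rewrite any data file that has a delete file attached, which is what produces the fresh copies of A & C in this scenario (the strategy and option values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# v3: binpack small files and rewrite any data file carrying at least one
# delete file, producing the new copies of A & C.
spark.sql("""
  CALL lake.system.rewrite_data_files(
    table    => 'db.orders',
    strategy => 'binpack',
    options  => map('delete-file-threshold', '1'))
""").show()
```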
v4: Update one row
When a single record stored in file B is updated, a positional delete file is created, along with a new data file containing the full record as it looks after the update is applied. Both additions are tiny under the MoR strategy, so the overall footprint stays at roughly the current 2.5TB.
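The update itself is plain SQL against the same hypothetical table; under merge-on-read it only adds that positional delete entry plus a small data file for the rewritten row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# v4: a merge-on-read update writes one positional delete entry against file B
# plus a tiny new data file holding the rewritten row (values are placeholders).
spark.sql("""
  UPDATE lake.db.orders
  SET amount = 42.00
  WHERE order_id = 205
""")
```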
v5: Compaction
This time, compaction rewrites file B, folding in the updated record from version 4. The newly rewritten file is yet another 500GB of data being stored, bringing our total up to 3TB.
Expiring snapshots
Ultimately, coupling snapshot expiration with the orphan file cleanup action lets us start reclaiming data lake storage space as older files are no longer referenced by any remaining version.
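A sketch of such a maintenance routine using the `expire_snapshots` and `remove_orphan_files` procedures; the 7-day window and retain-last-5 values are just examples, so pick numbers that match how far back you actually need to time travel:

```python
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# Keep a week of history, but never fewer than the 5 most recent snapshots.
cutoff = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
  CALL lake.system.expire_snapshots(
    table       => 'db.orders',
    older_than  => TIMESTAMP '{cutoff}',
    retain_last => 5)
""").show()

# Then sweep up any files that no table metadata references at all.
spark.sql(f"""
  CALL lake.system.remove_orphan_files(
    table      => 'db.orders',
    older_than => TIMESTAMP '{cutoff}')
""").show()
```

Running something like this on a schedule, alongside whatever drives your other table maintenance, is what keeps the multiplier described earlier in check.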
Eliminate v1
Removing version 1 does not save any data lake storage space: version 2 still references the same files A, B & C, so every existing file must be retained.
Eliminate v2
We can eliminate two of the older 500GB files (the original A & C), along with the small delete file (D), thereby lowering the overall footprint to 2TB.
Eliminate v3
Yep, version 4 still references the same compacted files, so we are storing everything we were before expiring this snapshot.
Eliminate v4
Finally, the original file B (plus version 4’s small delete and data files) can be removed, bringing us back down to 1.5TB of overall data lake storage needed for the single remaining version.
Conclusion
Every scenario will be a bit different, but the volume & velocity of your ingestion pipeline will heavily influence how quickly old snapshots pile up. The frequency & size of your DML statements will also contribute to a particular table’s situation, as will the number of prior versions you need to keep available.
You’ll likely settle on a general retention standard for the majority of your Iceberg tables, but keep a close eye on the data lake storage consequences for your largest ones. Periodically validate that your table maintenance routines are achieving the desired results: prior-snapshot features stay available without incurring excessive storage costs.
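One lightweight validation is simply asking the `snapshots` metadata table how much history a table is still carrying; a sketch against the same hypothetical table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "lake" assumed configured

# How many snapshots is this table still carrying, and how old is the oldest one?
spark.sql("""
  SELECT count(*)          AS live_snapshots,
         min(committed_at) AS oldest_snapshot
  FROM lake.db.orders.snapshots
""").show(truncate=False)
```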
Post originally published by Lester Martin as “iceberg snapshots affect storage footprint (not performance)” at Lester Martin (l11n) Blog.