
This post is part of the Apache Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in Trino
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
- Improving performance with Iceberg sorted tables
- Hive vs. Iceberg: Choosing the best table format for your analytics workload
TL;DR
Apache Iceberg is an open source table format that brings high-performance database functionality to object storage such as AWS S3, Azure’s ADLS, Google Cloud Storage and MinIO. This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and a data warehouse experience to its end users, without being locked into a single vendor.
What is Apache Iceberg?
Apache Iceberg is an open table format, originally created at Netflix and now a project of the Apache Software Foundation, that provides database-like functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses, with reliable ACID transactions, in an open architecture that avoids vendor and technology lock-in.
The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Apache Hive tables to it or have already started. They are excited that a true open source table format has been created, with many engines, both open source and proprietary, jumping on board.
Advantages of Iceberg Table Format
One of the best things about Iceberg is its broad adoption across many different engines. In the diagram below, you can see that many different technologies can work on the same set of data as long as they use the open-source Iceberg API. The work each engine has put into supporting Iceberg is a great indicator of the popularity and usefulness of this exciting technology.
With more and more technologies jumping on board, Iceberg isn’t a passing fad. It has been growing in popularity not only because of how useful it is, but also because it is a truly open source table format: many companies have contributed to and helped improve the specification, making it a true community-based effort.
Here is a list of the many features Iceberg provides:
Choose your engine | As you can see from the diagram above, there are many engines that support Iceberg. This offers the ultimate flexibility to own your datasets and choose the engine that fits your use cases. |
Avoid Data Lock-in | The data that Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in. |
Avoid Vendor Lock-out | Iceberg metadata is always available to all engines, so you can guarantee consistency, even with multiple writers. |
DML (modifying table data) | Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to meet use case and compliance requirements such as GDPR. |
Schema evolution | Much like a database, Iceberg supports full schema evolution, including changes to columns and even partitions. |
Performance | Since Iceberg stores the table state in a snapshot, the engine simply needs to read the metadata in that snapshot and then start retrieving the data from storage, saving valuable time and reducing cloud object store retrieval costs. |
Database feel | Partitioning can be performed on any column, and end users query Iceberg tables just like they would a database. |
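To make the Performance row above more concrete, here is a minimal Python sketch of how an engine could plan a query from snapshot metadata alone. This is a toy model, not the Iceberg API: the `DataFile`, `Snapshot` and `plan_scan` names, the file paths and the `amount` statistics are all hypothetical, and real Iceberg metadata is organized into manifest lists and manifest files rather than a single in-memory object. The idea it illustrates is that the snapshot already records each file's partition value and column bounds, so the engine can skip files without listing or opening anything in object storage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry tracked in table metadata."""
    path: str
    partition: str      # partition value, here the order date
    min_amount: float   # lower bound of the "amount" column in this file
    max_amount: float   # upper bound of the "amount" column in this file


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot: the full file list at one point in time."""
    snapshot_id: int
    files: Tuple[DataFile, ...]


def plan_scan(snapshot: Snapshot, partition: str, min_amount: float) -> List[str]:
    """Return the paths a query must read, using only snapshot metadata.

    A file is kept only if it belongs to the requested partition and its
    upper bound shows it could contain rows with amount >= min_amount.
    """
    return [
        f.path
        for f in snapshot.files
        if f.partition == partition and f.max_amount >= min_amount
    ]


snapshot = Snapshot(
    snapshot_id=1,
    files=(
        DataFile("orders/2022-01-01/a.parquet", "2022-01-01", 5.0, 80.0),
        DataFile("orders/2022-01-01/b.parquet", "2022-01-01", 90.0, 400.0),
        DataFile("orders/2022-01-02/c.parquet", "2022-01-02", 10.0, 950.0),
    ),
)

# Query: orders placed on 2022-01-01 with amount >= 100.
files_to_read = plan_scan(snapshot, "2022-01-01", 100.0)
print(files_to_read)  # ['orders/2022-01-01/b.parquet']
```

In this sketch only one of the three files has to be fetched from storage: one file is skipped because of its partition value and another because its maximum `amount` is below the filter. Fewer object store requests is where the time and cost savings come from.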