
What is Apache Iceberg?

Explore the role that Apache Iceberg is playing in analytics and AI


Apache Iceberg is a modern open table format designed to bring powerful data management capabilities to large-scale data lakes. It was created to address the limitations of older formats like Apache Hive, enabling more reliable, efficient, and flexible data operations across complex architectures.

Iceberg is ideal for today’s evolving data needs. Whether you’re building advanced analytics pipelines or preparing high-quality datasets for AI and machine learning, Iceberg provides the transactional consistency, schema evolution, and performance optimizations needed to support both.

In this article, we’ll walk through 12 key features that make Apache Iceberg a cornerstone of open data lakehouse architecture.

12 Apache Iceberg table benefits & features

Apache Iceberg brings advanced capabilities to open lakehouse architectures by addressing the limitations of legacy table formats like Hive. Below are twelve core features that make Iceberg a powerful choice for managing large-scale analytics and AI workloads.

1. Reduced metastore reliance

Iceberg avoids heavy reliance on the Hive Metastore by using its own catalog system. The catalog stores only a pointer to the table’s current metadata file, reducing metadata complexity and improving query speed and reliability.

2. Time travel & rollbacks

Iceberg enables users to query previous states of a table with built-in version control. This makes it easy to reproduce historical queries or roll back to earlier versions for audit, debugging, or recovery purposes.
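
As a quick sketch of what this looks like in practice, here is Trino-flavored SQL against a hypothetical orders table; the snapshot ID, schema, and catalog names are illustrative:

```sql
-- Query the table as it existed at a point in time
SELECT *
FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

-- Query a specific snapshot by its ID
SELECT *
FROM orders FOR VERSION AS OF 8954597067493422955;

-- Roll the table back to an earlier snapshot (catalog named 'iceberg' here)
CALL iceberg.system.rollback_to_snapshot('sales', 'orders', 8954597067493422955);
```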

3. Optimistic concurrency

Multiple users or applications can write to the same Iceberg table at the same time. Iceberg uses an optimistic concurrency model, checking for conflicts only during the final commit stage, which ensures consistency without locking.

4. Hidden partitioning

Unlike Hive, which exposes partition logic to users, Iceberg hides the complexity. This protects against user errors and ensures queries are optimized without requiring deep knowledge of the table’s physical structure.
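
For example, with Trino’s Iceberg connector you can declare a partition transform at creation time and let Iceberg derive partition values automatically; the table and columns below are hypothetical:

```sql
-- Iceberg derives a monthly partition value from order_date for each row;
-- users filter on order_date directly and never manage a partition column
CREATE TABLE orders (
    order_id   bigint,
    order_date date,
    total      double
)
WITH (partitioning = ARRAY['month(order_date)']);

-- This filter is automatically pruned to the relevant monthly partitions
SELECT sum(total)
FROM orders
WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-03-31';
```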

5. Snapshots

Every change to a table generates a new snapshot. These snapshots maintain a complete history of the table, enabling both auditability and time-based analysis without performance tradeoffs.
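
Engines such as Trino expose this history through a $snapshots metadata table you can query directly (the table name is illustrative):

```sql
-- List the table's snapshot history: when each change was committed and what it did
SELECT committed_at, snapshot_id, operation
FROM "orders$snapshots"
ORDER BY committed_at DESC;
```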

6. Snapshot expiration

To manage storage efficiently, Iceberg allows teams to configure how long historical snapshots are retained. Older snapshots can be expired automatically based on retention policies.
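
In Trino, for instance, this is exposed as a table procedure; the seven-day retention threshold below is an illustrative value, not a recommendation:

```sql
-- Expire snapshots older than seven days and clean up files only they reference
ALTER TABLE orders EXECUTE expire_snapshots(retention_threshold => '7d');
```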

7. Schema evolution

Iceberg supports dynamic schema changes with minimal disruption. You can add, remove, or rename columns, even within nested structures, and Iceberg tracks these changes by column ID to maintain consistency across schema versions.
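
In SQL terms these are ordinary ALTER TABLE statements; because Iceberg tracks columns by ID rather than by name or position, none of them require rewriting data files (the column names are hypothetical):

```sql
ALTER TABLE orders ADD COLUMN discount_code varchar;           -- add a new column
ALTER TABLE orders RENAME COLUMN discount_code TO promo_code;  -- rename it safely
ALTER TABLE orders DROP COLUMN promo_code;                     -- drop it without touching data files
```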

8. Performance optimizations

Iceberg avoids expensive file-listing operations and enables intelligent query planning. With built-in stats and partition pruning, the engine can skip irrelevant data files, leading to faster query execution.
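
As a small illustration in Trino, you can inspect the statistics the planner works from, and a selective filter then lets the engine skip files whose value ranges cannot match (table and column are hypothetical):

```sql
-- Show the per-column statistics Trino uses for query planning
SHOW STATS FOR orders;

-- File-level min/max statistics and partition metadata let the engine
-- skip every data file that cannot contain matching rows
SELECT count(*) FROM orders WHERE order_date = DATE '2024-06-01';
```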

9. Sorted tables

Tables can be sorted by one or more columns at write time, which enhances filtering and supports efficient data skipping during queries. This feature further boosts performance.
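
In Trino’s Iceberg connector, for example, the sort order can be declared as a table property at creation time (names are illustrative):

```sql
-- Data files are written sorted by event_time, so range filters on that
-- column can skip whole files using their min/max statistics
CREATE TABLE events (
    event_id   bigint,
    event_time timestamp(6),
    payload    varchar
)
WITH (sorted_by = ARRAY['event_time']);
```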

10. Table maintenance

Iceberg can automatically compact many small files into fewer, larger ones. This optimization reduces I/O overhead and improves scan efficiency, especially in distributed environments.
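
In Trino this is exposed as the optimize table procedure; a minimal sketch against a hypothetical table:

```sql
-- Rewrite many small data files into fewer, larger ones to cut I/O overhead
ALTER TABLE orders EXECUTE optimize;
```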

11. Full DML support

Iceberg supports standard Data Manipulation Language (DML) operations such as UPDATE, INSERT, and DELETE. This brings SQL-like flexibility to data lake environments without needing to move data into a warehouse.
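
For example, Trino’s Iceberg connector runs row-level DML, including MERGE for upserts, directly against the lake (the tables here are hypothetical):

```sql
-- Row-level changes without moving data into a warehouse
UPDATE orders SET status = 'shipped' WHERE order_id = 1001;
DELETE FROM orders WHERE status = 'cancelled';

-- Upsert: update matching rows, insert the rest
MERGE INTO orders t
USING order_updates s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, order_date, status)
    VALUES (s.order_id, s.order_date, s.status);
```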

12. Leverage optimization commands

You can explicitly run optimization commands to consolidate files, reorganize partitions, and improve query performance—giving administrators more control over system tuning.
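
As a sketch of what this looks like in Trino, the optimize procedure accepts a file-size threshold, and ANALYZE refreshes the extended statistics the planner uses (the values and names are illustrative):

```sql
-- Compact only data files smaller than the given threshold
ALTER TABLE orders EXECUTE optimize(file_size_threshold => '128MB');

-- Collect extended column statistics (such as distinct-value counts)
ANALYZE orders;
```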

Apache Iceberg | Building an open data lakehouse architecture

An open data lakehouse comprises four elements: commodity storage, open file formats, open table formats, and high-performance query engines.

Commodity storage: You can build a lakehouse on storage platforms like Amazon S3. These efficient cloud services offer scalable storage solutions for various data types. Related reading: Using Apache Iceberg, AWS S3, and AWS Glue to manage a data lakehouse architecture

Open file formats: Various open file formats like Avro, Parquet, and ORC let you optimize how you collect and store data in your lake.

Open table formats: Iceberg is the open table format of the data lakehouse architecture. Its rich metadata files and analytics-optimized structure allow query engines to run more efficiently.

Query engines: High-performance query engines like Apache Spark and Trino are optimized for big data analytics.

Iceberg’s table structure

Iceberg tables use metadata, snapshots, and manifests to track individual data files. Any changes to the table are made to these components rather than the data itself. This approach gives Iceberg more robust functionality than predecessors like Apache Hive.

Much like table definitions in a SQL database, metadata files describe the table’s schema, partitioning, and other properties. They also record snapshots of the table’s data files. Iceberg generates a new snapshot any time the table’s state changes, so these tables retain a complete history of state changes.

A manifest file describes a subset of the table’s data files, along with per-file statistics. A snapshot may reference multiple manifest files, which it tracks in a manifest list. This layered approach to manifest files and lists reduces overhead and makes queries more efficient.
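
Engines such as Trino expose these layers as queryable metadata tables, which is a convenient way to see the structure for yourself (the table name is illustrative):

```sql
-- Manifest files referenced by the current snapshot's manifest list
SELECT path, added_rows_count
FROM "orders$manifests";

-- The individual data files those manifests describe
SELECT file_path, record_count, file_size_in_bytes
FROM "orders$files";
```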

What is Apache Iceberg used for?

Companies with petabyte-scale data ecosystems are the primary users of Iceberg. Large datasets place enormous demands on information architectures, and Iceberg simplifies big data management on a data lake across several use cases.

Simpler data architectures

Data lakes can store unstructured and structured data, making them better suited to modern data analytics demands. However, data lakes cannot match a data warehouse platform’s full suite of analytics capabilities. As a result, companies often layer data warehouses on top of their data lakes. Besides adding complexity, this approach increases costs and the risks associated with data movement and duplication.

Iceberg’s open table format adds rich metadata and query engine compatibility to blend a warehouse’s analytics capabilities with a lake’s storage efficiency. This simpler architecture eliminates the need for a separate data warehouse and turns the lake into the enterprise’s central analytics resource.

Manage complex data processing

Once generated, data takes time to settle. For example, data associated with customer orders can change anywhere from their initial creation to the end of a return window. Regulated personal data must be purged at intervals set by compliance policies. Frequent, small changes to large datasets place enormous workloads on data systems.

Iceberg’s design allows these granular changes to occur without imposing performance penalties.
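
For instance, a compliance-driven purge becomes a single row-level statement rather than a rewrite of whole partitions; the table and column here are hypothetical:

```sql
-- Purge one customer's personal data at the end of its retention window
DELETE FROM customer_profiles
WHERE customer_id = 42;
```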

Concurrent data usage

Enterprise applications and users often need access to the same data simultaneously. However, allowing concurrent access can be risky. When users of a dataset read and write at the same time, the resulting inconsistencies may contaminate downstream analysis.

Iceberg isolates the lake’s raw data through metadata abstraction, instead giving users access to unique snapshots of the data table. Changes result in a new snapshot, but the users can continue using the original snapshot to preserve consistency and repeatability.

What is hidden partitioning in Iceberg?

Many table formats can group data by common properties. This partitioning enables queries to skip irrelevant data, returning results faster at a lower cost. However, formats like Hive force users to have a deep understanding of table structure and partitioning to prevent errors or inaccurate query results.

Iceberg hides aspects of partitioning from users by, for example, automating the creation of partition values and avoiding irrelevant partitions. Hidden partitioning lets Iceberg partitions evolve without affecting queries.
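
Because queries never reference partition values directly, the layout can evolve in place. In Trino’s Iceberg connector, for example, this is a table property change (the transform shown is illustrative):

```sql
-- Switch from monthly to daily partitioning: existing data keeps its old
-- layout, new writes use the new one, and queries continue to work unchanged
ALTER TABLE orders SET PROPERTIES partitioning = ARRAY['day(order_date)'];
```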

Related reading: Iceberg Partitioning and Performance Optimizations in Trino

Iceberg vs Delta Lake vs Hudi vs Hive

| Hive | Hudi | Delta Lake | Iceberg |
| --- | --- | --- | --- |
| Original table format | Created for time/event series data | Open source version doesn’t support concurrent writers | Hidden partitioning |
| Supports ORC, Parquet, JSON, etc. | Great for streaming use cases | Only supports Parquet | Metadata tree is more performant using Avro |
| Partition columns must be part of the table | Copy-on-write & merge-on-read | Can’t change partitioning | Partition and table evolution |
| Relies heavily on the metastore | Table evolution, compaction, etc. | Checkpoints every 10 commits, so every 10th write is slower | Full DML |
| | Full DML | Associated with Databricks | Associated with Trino |

Advantages of Iceberg over other table formats

Data lakehouses are still relatively recent developments, with solutions based on Apache’s Iceberg or Hudi projects or Databricks’s Delta Lake format. These three options have similar functionality, but the devil is in the details.

  • All three table formats are open source. Iceberg and Hudi are Apache projects, although Iceberg has the larger developer community.
  • The Delta Lake format, while nominally open source, is primarily supported by Databricks, the company that first developed it.
  • Amazon Web Services, Microsoft Azure, Google Cloud, and other data platforms support all three to varying degrees.
  • Ultimately, the right choice depends upon an enterprise’s existing infrastructure and data use cases.


Hive vs Iceberg

| Feature | Apache Iceberg | Apache Hive |
| --- | --- | --- |
| Transaction support (ACID) | Yes | Limited |
| File format | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
| Schema evolution | Full | Partial |
| Partition evolution | Yes | No |
| Data versioning | Yes | No |
| Time travel queries | Yes | No |
| Concurrency control | Optimistic locking | Pessimistic locking |
| Object store cost optimization | Yes | No |
| Community and ecosystem | Growing | Established |

Developers at Netflix created Iceberg to address the challenges of using Apache Hive on the streaming service’s extensive data infrastructure.

Hive uses a subsystem called a metastore that points to a table’s data. However, it only points to the folder containing the relevant data files. That may be acceptable in a structured environment like a Hadoop-based data warehouse, but with object storage this approach imposes stiff performance penalties, since the engine must list every file in the folder before it can plan a query.

Another performance hit comes from how Hive interacts with Hadoop, which relies on Java-based MapReduce jobs to process data. Few data consumers have the specialized Java and MapReduce skills needed to query Hadoop data stores directly. Hive implements HiveQL to provide a SQL-like approach for generating Hadoop queries. However, this means every query requires a translation step between HiveQL and the underlying Java jobs.

Netflix’s developers set out to create a new table format that addressed these and other issues Hive creates when analyzing petabytes of data. Eventually, Netflix handed the project over to the Apache Software Foundation, where it has flourished.

Both Iceberg and Hive can query large datasets; which one to use depends on your use case.

Related reading: Hive vs Iceberg: How to migrate your Hive tables to Iceberg

Apache Iceberg vs Apache Parquet

Although both Apache Iceberg and Parquet are open source projects, they address different aspects of the data lakehouse architecture. Whereas Iceberg is an open table format, Parquet is an open file format for creating column-oriented data files on a data lake. This structure compresses more efficiently than a row-oriented format like Avro, which reduces overall storage costs. In addition, Parquet files help speed queries by, for example, providing metadata that queries can use to skip irrelevant data.
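
The two compose rather than compete: an Iceberg table typically stores its underlying data as Parquet files. In Trino, for instance, the file format is simply a table property (the table is hypothetical):

```sql
-- An Iceberg table whose data files are written as Parquet
CREATE TABLE page_views (
    user_id   bigint,
    url       varchar,
    viewed_at timestamp(6)
)
WITH (format = 'PARQUET');
```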

Build your open data lakehouse with Starburst and the Apache Iceberg open table format

Iceberg’s open table format lets you connect your data lakehouse to any query engine. Starburst’s modern data lake analytics platform enables you to connect to any data source. Using Starburst to power the analytics of your Iceberg-based data lake makes data more manageable, optimizes compute and storage investments, and speeds time to insight for more effective decision-making.

Starburst is based on the Trino open source project’s massively parallel query engine but with optimizations designed to maximize the features of Iceberg data tables, including schema evolution, time travel, and partitioning.

Since some workloads work best with different table formats, we created Great Lakes, a connectivity feature of Starburst Galaxy. Great Lakes abstracts the details of a data lake’s table and file formats to simplify accessing tables, whether they are based on Iceberg, Hudi, or Delta Lake. Starburst’s Great Lakes enables data teams to optimize their data lake architectures for various use cases, and data consumers can run SQL queries without knowing the details of each table’s format.

Demo: Iceberg and Trino

In this demo, we explore the powerful combination of Apache Iceberg and Trino, two tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare the experience with AWS Athena.
