
This post is part of the Iceberg blog series. Read the entire series:
- Introduction to Apache Iceberg in Trino
- Iceberg Partitioning and Performance Optimizations in Trino
- Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
- Apache Iceberg Schema Evolution in Trino
- Apache Iceberg Time Travel & Rollbacks in open source Trino query engine
- Automated maintenance for Apache Iceberg tables in Starburst Galaxy
- Improving performance with Iceberg sorted tables
- Hive vs. Iceberg: Choosing the best table format for your analytics workload
Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and workload capabilities.
In this blog post, we’ll walk through the differences between Hive and Iceberg, the use cases for both formats, and how to start planning your migration strategy.
What is Apache Hive?
Apache Hive is an open-source data warehouse project built on top of Apache Hadoop that provides data query and analysis capabilities through a SQL-like interface. Hive stores data in the Hadoop Distributed File System (HDFS) and, through Hadoop-compatible file system connectors, in cloud object stores such as AWS S3, ADLS, and GCS. With Hive, non-programmers familiar with SQL can read, write, and manage petabytes of data.
Apache Hive architecture
There are four main components of Apache Hive:
- Driver – The component that receives queries
- Compiler – The component that parses queries
- Metastore – The component that stores all the structure information of the various tables and partitions
- Execution Engine – The component that executes the execution plan created by the compiler
For the purposes of comparison to Apache Iceberg, we will focus strictly on the Hive data model and the Hive metastore.
Data in Hive is organized into tables, similar to a relational database, and the data for each table is stored in a directory in HDFS. The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions that operates independently of Apache Hive. The HMS has become a building block for data lakes, providing critical data abstraction and data discovery capabilities.
Challenges of Apache Hive
The majority of the challenges associated with Apache Hive stem from the fact that data in tables is tracked at the folder level. This leads to several challenges, including:
- Slow file list operations: Each time you query data in Hive, the engine must list the files in every relevant partition directory, which gets expensive and slow on datasets with many partitions.
- Inefficient DML: If you update and delete data frequently, you may experience high latency because Hive requires you to replace entire files.
- Costly schema changes: If you change your schema in Hive, you have to rewrite the entire dataset, which is costly and time-intensive.
- No transaction support: Hive is not ACID-compliant by default, so it cannot guarantee data consistency and integrity for transactions. Hive does offer optional Hive ACID transactional (version-less) tables, but they are not universally or consistently supported by major SQL engines.
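To make the file-listing cost concrete, here is a minimal Python sketch of folder-level tracking. It is a toy model, not Hive code: the `event_date=...` directory names, the helper functions, and the local temporary directory standing in for HDFS or object storage are all assumptions made for illustration. The point it shows is that the number of list operations grows with the number of partitions the query touches.

```python
import os
import tempfile


def build_hive_style_table(root, partitions, files_per_partition):
    """Create a toy layout with one directory per partition value.

    Returns the partition directories, standing in for the partition
    list a metastore would hand to the engine (hypothetical helper).
    """
    partition_dirs = []
    for day in range(1, partitions + 1):
        part_dir = os.path.join(root, f"event_date=2024-01-{day:02d}")
        os.makedirs(part_dir)
        for n in range(files_per_partition):
            # Empty placeholder files stand in for real data files.
            open(os.path.join(part_dir, f"part-{n:05d}.parquet"), "w").close()
        partition_dirs.append(part_dir)
    return partition_dirs


def plan_scan_by_listing(partition_dirs):
    """Find data files the folder-level way: one listing per partition."""
    data_files = []
    list_calls = 0
    for part_dir in partition_dirs:
        names = sorted(os.listdir(part_dir))
        list_calls += 1
        data_files.extend(os.path.join(part_dir, name) for name in names)
    return data_files, list_calls


with tempfile.TemporaryDirectory() as root:
    partition_dirs = build_hive_style_table(root, partitions=30, files_per_partition=4)
    files, list_calls = plan_scan_by_listing(partition_dirs)

print(f"{len(files)} files found using {list_calls} list operations")
```

With 30 partitions of 4 files each, the sketch needs 30 list operations to discover 120 files. On a local disk that is cheap; against cloud object storage each listing is a remote call, which is where the slowdown described above comes from.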
What is Apache Iceberg?
Apache Iceberg is an open table format that was designed with modern cloud infrastructure in mind. It was created at Netflix to overcome the limitations of Apache Hive and includes key features like efficient updates and deletes, snapshot isolation, and partitioning.
For more background, check out Ryan Blue’s talk on creating the Apache Iceberg table format at Netflix.
Apache Iceberg architecture
As Tom Nats mentions in his “Introduction to Apache Iceberg in Trino” blog, Apache Iceberg is made up of three layers:
- The Iceberg catalog
- The metadata layer
- The data layer
The key difference from Hive is that Iceberg defines the data in a table at the file level, rather than having the table point to a directory or a set of directories.
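The three layers can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not Iceberg's actual metadata format: real Iceberg stores metadata files, manifest lists, and manifest files in storage, while here each snapshot holds its file entries directly. The class names, the `append_files` and `plan_scan` helpers, and the `s3://bucket/...` paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    """Data layer: one entry per file, with the partition it belongs to."""
    path: str
    partition: str


@dataclass(frozen=True)
class Snapshot:
    """Metadata layer: the complete set of files in the table at one point in time."""
    snapshot_id: int
    files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class TableMetadata:
    """Metadata layer: all snapshots plus a pointer to the current one."""
    snapshots: Tuple[Snapshot, ...] = ()
    current_snapshot_id: Optional[int] = None


# Catalog layer: maps a table name to its current metadata.
catalog: Dict[str, TableMetadata] = {}


def append_files(table, new_files):
    """Commit new files by writing new metadata and swapping the catalog pointer."""
    meta = catalog.get(table, TableMetadata())
    current_files = ()
    for snap in meta.snapshots:
        if snap.snapshot_id == meta.current_snapshot_id:
            current_files = snap.files
    new_snap = Snapshot(len(meta.snapshots) + 1, current_files + tuple(new_files))
    catalog[table] = TableMetadata(meta.snapshots + (new_snap,), new_snap.snapshot_id)
    return new_snap.snapshot_id


def plan_scan(table, partition=None, snapshot_id=None):
    """List the files to read from metadata alone, with no directory listing."""
    meta = catalog[table]
    wanted = meta.current_snapshot_id if snapshot_id is None else snapshot_id
    snap = next(s for s in meta.snapshots if s.snapshot_id == wanted)
    return [f.path for f in snap.files if partition is None or f.partition == partition]


first = append_files("events", [DataFile("s3://bucket/events/a.parquet", "2024-01-01")])
second = append_files("events", [DataFile("s3://bucket/events/b.parquet", "2024-01-02")])

print(plan_scan("events"))                          # files in the current snapshot
print(plan_scan("events", partition="2024-01-02"))  # pruned by partition value
print(plan_scan("events", snapshot_id=first))       # an earlier snapshot of the table
```

Because every snapshot records its files explicitly, the same structure that replaces directory listings also lets a query read an older version of the table.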
Advantages of Apache Iceberg
Apache Iceberg brings new capabilities to the data lake – including warehouse-like DML capabilities and data consistency. Specifically, Apache Iceberg offers the following advantages:
- Fast snapshots: Snapshots eliminate costly and slow directory listings by allowing the engine to read the list of data files straight from the table metadata.
- Efficient DML: Iceberg supports full DML (update, delete, and merge) on tables in cloud storage.
- In-Place Schema Changes: Iceberg supports in-place schema evolution, meaning you can evolve a table’s schema without costly rewrites.
- Transactions: Iceberg provides ACID-compliant versioning, which means that data consistency and integrity are ensured for all transactions, and it functions consistently across SQL engines.
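As a rough illustration of why in-place schema changes avoid rewrites, the Python sketch below relies on one fact about Iceberg: columns are tracked by a unique field ID rather than by name. Everything else is an assumption for illustration, including the column names, the dictionary layout, and the `read_rows` helper. Renaming or adding a column changes only the schema in metadata, and rows written earlier are reinterpreted at read time.

```python
# Table schema as kept in metadata: field ID -> current column name.
schema_v1 = {1: "user_id", 2: "ts"}

# Rows in an older data file are keyed by field ID, not by column name.
old_file_rows = [{1: 42, 2: "2024-01-01"}]


def read_rows(rows, schema):
    """Project stored rows through the current schema; missing fields read as None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


# Rename column 2 and add column 3: both are metadata-only changes.
schema_v2 = {**schema_v1, 2: "event_time"}
schema_v3 = {**schema_v2, 3: "country"}

print(read_rows(old_file_rows, schema_v1))
print(read_rows(old_file_rows, schema_v3))
```

The old data file is never touched: under the evolved schema the same stored row is read with the new column name, and the added column comes back empty for rows written before it existed.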