What is Apache Parquet?
Apache Parquet is an open source, column-oriented data file format that accelerates query performance and uses column-specific compression and encoding schemes to reduce storage consumption.
The Parquet file format
Earlier columnar formats, such as RCFile and later ORC, offered optimizations for Hive-based data processing platforms but remained closely tied to that ecosystem. In 2013, developers at Twitter and Cloudera released Parquet as a more efficient, platform-agnostic column-oriented file format. It has since become an Apache top-level project.
Column-oriented file formats
Analytics systems differ from transactional systems in several ways. For instance, write operations matter less than read operations: a data warehouse writes data once at ingestion, then queries read it many times, whether every month, every day, or every minute. Analytical queries also tend to process one column at a time, often without touching data in other columns. Because columns matter more than whole records in these workloads, analytics systems benefit from a file format built around that access pattern.
Rather than storing the data elements in each record in one place, columnar storage formats write all data elements from a given column together. Queries can quickly retrieve the data they need without touching other columns.
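As a minimal sketch with the pyarrow library, a reader can pull only the columns it needs; the file name and column names here are assumptions for illustration:

```python
# Sketch only: file and column names are illustrative assumptions.
import pyarrow.parquet as pq

# Read just two columns; the remaining columns in the file are never
# decoded or even transferred from storage.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
print(table.num_rows, table.column_names)
```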
Apache Parquet framework
By storing columnar data together, Parquet files can apply type-specific encoding and compression formats to each column. As a result, Parquet files make the most efficient use of a storage system, whether on-premises or in the cloud.
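As a rough sketch, pyarrow's write_table accepts per-column settings, so a repetitive text column can be dictionary-encoded while other columns use codecs suited to their types; the dataset and column names below are hypothetical:

```python
# Sketch only: the dataset and per-column choices are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "US", "DE", "US"],   # low-cardinality text
    "amount": [9.99, 14.50, 3.25, 20.00],
})

pq.write_table(
    table,
    "sales.parquet",
    compression={"user_id": "snappy", "country": "snappy", "amount": "zstd"},
    use_dictionary=["country"],            # dictionary-encode the repetitive column
)
```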
Parquet files support a number of features that streamline data analytics. For example, queries can read Parquet's footer metadata and use predicate pushdown to skip irrelevant data. Parquet also supports schema evolution: columns can be added or changed without rewriting existing data.
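A hedged sketch of these features with pyarrow, again with assumed file and column names: the footer metadata exposes per-row-group statistics, and a filter passed to the reader skips row groups whose statistics rule out the predicate.

```python
# Sketch only: file, column names, and the predicate are assumptions.
import pyarrow.parquet as pq

# Inspect the footer metadata that readers use to plan their scans.
meta = pq.ParquetFile("sales.parquet").metadata
print(meta.num_row_groups)
print(meta.row_group(0).column(0).statistics)   # min/max, null count, etc.

# Predicate pushdown: row groups that cannot contain matching rows are skipped.
high_value = pq.read_table("sales.parquet", filters=[("amount", ">", 10.0)])
```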
The Parquet format is supported by many big data processing frameworks and programming languages. Besides Hive, popular engines such as Apache Impala, Apache Spark, and Trino read and write Parquet files, and developers can work with the format from Python, Java, C++, and other languages.
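In Python, for example, a single call loads a Parquet file into a DataFrame (pandas delegates to pyarrow or fastparquet under the hood); the file name is a placeholder:

```python
# Sketch only: "sales.parquet" is an assumed file name.
import pandas as pd

df = pd.read_parquet("sales.parquet")
print(df.head())
```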
Parquet use cases
Parquet is widely adopted by organizations that need to store and process large datasets. Data warehouses, data lakes, and other central repositories are ideal candidates for the format. Parquet also makes data management more efficient by streamlining query workloads in ETL data pipelines.
Apache Parquet, Avro, and big data
The need for formats like Parquet and Avro arose from the inexorable increase in the volume, velocity, and variety of data generated in the big data era. Data engineering teams require file formats that are easy to process, to minimize compute costs, yet also make the most efficient use of their organizations' data storage infrastructure.
These choices become more important as companies seek to modernize their data architectures and migrate away from their legacy Hadoop systems.
Avro vs Apache Parquet
Choosing between Avro and Parquet comes down to which format better suits a given application. Row-based and columnar formats have distinct strengths, so the decision hinges on whether record-by-record processing or column-level aggregation matters more.
Shared benefits
Despite their differences, these modern file formats have a few similarities. For example, Avro and Parquet support schema evolution and work with multiple programming languages and data processing architectures.
While both compress their data using algorithms like Snappy, each Avro row mixes many different data types, which limits the potential gains. Because Parquet applies encodings and compression algorithms specific to the data type in each column, it can squeeze more data into the same storage space.
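One way to see the difference is to write the same records in both formats and compare the resulting file sizes. The sketch below assumes the fastavro and pyarrow packages (plus python-snappy for Avro's Snappy codec) and uses made-up data, so exact numbers will vary:

```python
# Sketch only: libraries, schema, and data are illustrative assumptions.
import os
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

records = [{"user_id": i, "country": "US", "amount": float(i % 100)} for i in range(100_000)]
schema = {
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "country", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Row-oriented Avro with block-level Snappy compression.
with open("sales.avro", "wb") as out:
    fastavro.writer(out, schema, records, codec="snappy")

# Columnar Parquet with per-column encoding plus Snappy compression.
table = pa.Table.from_pylist(records)
pq.write_table(table, "sales_cmp.parquet", compression="snappy")

print("avro bytes:   ", os.path.getsize("sales.avro"))
print("parquet bytes:", os.path.getsize("sales_cmp.parquet"))
```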
When to choose Avro
Avro’s row-oriented files are the best choice for online transactional processing (OLTP) systems that process entire records and need balanced read and write performance.
As a self-describing format that carries its schema inside every file, Avro also excels in dynamic environments where schema changes are frequent.
When to choose Parquet
Parquet excels as a file format for data analytics platforms. Its column-based structure lets queries return results quickly without having to scan through irrelevant data. The file format’s efficient approach to data compression makes Parquet particularly useful for on-premises data warehouses where storage infrastructure cannot scale affordably.
Integration with Trino and Iceberg
Avro and Parquet also integrate with Trino and Iceberg to play a role in the open data lakehouse. This relatively recent architecture brings the analytics performance of a data warehouse to the cloud scalability of a data lake. Unlike those legacy architectures, the data lakehouse does not attempt to be a central repository for all enterprise data. Instead, it consolidates just the critical, frequently accessed data in object storage. Trino's connectors abstract all other data sources into a virtual access layer, making the entire data architecture accessible from a single interface.
Faster SQL queries
Trino queries use ANSI-standard SQL, so anyone can analyze data from the command line, in code, or through their preferred business intelligence application.
Trino also leverages the metadata stored within data files to accelerate queries further. Dynamic filtering, for example, uses values from one side of a join to prune irrelevant data from the other before it is returned, improving query performance, reducing network traffic, and easing the query workload on the data source.
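A minimal sketch with the trino Python client shows the kind of selective join where dynamic filtering applies; the host, catalog, schema, and table names are all assumptions:

```python
# Sketch only: connection details and table names are assumptions.
import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="sales",
)
cur = conn.cursor()

# Dynamic filtering collects customer_id values from the selective side of the
# join and prunes the scan of the much larger orders table at the source.
cur.execute("""
    SELECT o.order_id, o.total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.country = 'DE'
""")
for row in cur.fetchmany(10):
    print(row)
```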
Streamlined data pipelines
Engineers can use Trino to build their data pipelines and ingest datasets in row or column formats.
When building ingestion pipelines from streaming sources, the incoming datasets are often in row-oriented formats like Avro. Trino's Kafka connector applies an Avro schema when deserializing these row-oriented messages.
When ingesting data from columnar data sources, Trino’s connectors to Delta Lake, Hudi, and Hive let queries access data written in the Parquet format.
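As a sketch of such a pipeline step, a CREATE TABLE AS statement issued through the same Python client can materialize a row-oriented Kafka topic into a Parquet table managed by the Hive connector; the catalog, schema, and topic names are assumptions:

```python
# Sketch only: catalog, schema, and table names are assumptions.
import trino

cur = trino.dbapi.connect(host="localhost", port=8080, user="etl").cursor()

# Read row-oriented (e.g. Avro-encoded) messages from Kafka and land them
# as a columnar Parquet table in the lake.
cur.execute("""
    CREATE TABLE lake.analytics.page_views
    WITH (format = 'PARQUET')
    AS SELECT user_id, url, event_time
    FROM kafka.default.page_views
""")
cur.fetchall()   # drain the result so the statement runs to completion
```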
Integration with Apache Iceberg
Iceberg is more flexible than other table formats, supporting Avro, ORC, and Parquet. As engineers design their open data lakehouses, they can choose the most appropriate format for each dataset.
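For example, through Trino's Iceberg connector the file format can be set per table with the format property; the catalog, schema, and table definitions below are assumptions:

```python
# Sketch only: catalog, schema, and table definitions are assumptions.
import trino

cur = trino.dbapi.connect(
    host="localhost", port=8080, user="etl",
    catalog="iceberg", schema="lakehouse",
).cursor()

# A landing table for streamed records, kept in row-oriented Avro...
cur.execute("CREATE TABLE raw_events (user_id BIGINT, payload VARCHAR) WITH (format = 'AVRO')")
cur.fetchall()

# ...and a curated table stored as Parquet for analytical queries.
cur.execute("CREATE TABLE curated_events (user_id BIGINT, url VARCHAR) WITH (format = 'PARQUET')")
cur.fetchall()
```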
As mentioned earlier, data ingested from streaming sources often arrives as row-oriented Avro files. An Iceberg table can store this raw dataset in the lakehouse without first transforming it into a more structured format, and Avro's embedded schemas remain directly accessible through Trino.
For most other use cases, Parquet is the preferred file format. Starburst’s Trino-based open data lakehouse analytics platform has an enhanced Parquet reader that provides a 20% boost to read performance over Trino. We also extend Trino’s dynamic filtering to the row level to reduce data processing workloads further.