
Open-source frameworks for data lake analytics have existed for over a decade now and have been widely embraced across all industries. Three of the most popular ones are Apache Hive, Trino, and Apache Spark. Each of these frameworks has at least one vendor product built on top of it; Starburst’s relationship with Trino is no different.
Many aspects of these frameworks can be compared & contrasted, but I want to focus on the following three features and walk through each framework’s journey to attain all three (spoiler: they all get there).
- SQL – Structured Query Language, like it or not, is THE most accepted analysis language for business data with a known structure
- Performance – Obviously, this suggests that we want these SQL queries to run as fast as possible
- Durability – Many SQL queries & operations take a long time to complete, and durability ensures that user requests run to completion even when software/hardware failures occur
Let’s walk the historical timeline to see when these popular open-source frameworks tackled each of these features.
Apache Hadoop surfaces (2006)

Hadoop was released to attack large-scale data analysis tasks that existing technologies either could not process at all, or organizations could not afford to scale those technologies to the needed level. The cluster combines storage (HDFS) and compute (YARN) to enable an awesome feature called “data locality,” which means taking the processing to the data instead of the other way around.
Initially, Hadoop developers were only presented with the Java MapReduce API, which did offer data analysis processing abilities with inherent job reliability and durability features. This approach did not offer SQL (Hadoop would quickly gain a SQL abstraction layer), but was focused on guaranteeing a job would complete, regardless of how long it took.
|  | Hadoop (Hive) | Trino | Spark |
|---|---|---|---|
| SQL |  |  |  |
| Performance |  |  |  |
| Durability | 2006 |  |  |
Apache Hive is created (2010)

Developers at Facebook built Hive, a SQL abstraction layer on top of Hadoop, to overcome the Java programming barrier associated with Hadoop. Hive introduced a component called the metastore, which stores the information needed by this schema-on-read data warehouse technology. The metadata for each table includes the following.

Hive is tightly coupled with Hadoop and ultimately submits MapReduce jobs (now optimized with the Tez engine) that run in the cluster alongside other queries and various workloads.
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 |  |  |
| Performance |  |  |  |
| Durability | 2006 |  |  |
Hive also brought us the ORC file format, but even with ORC and Tez, it wouldn’t be fair to say Hive had fully checked off the Performance checkbox yet.
PrestoDB (see Trino family tree) invented for interactive queries (2012)

Yep, I referenced PrestoDB in this section’s header and THEN dropped a Commander Bun Bun logo right after it. Here’s a great resource describing The journey from Presto to Trino and Starburst. Armed with that awareness, bear with me as I focus on just saying Trino hereafter.
Still over at Facebook, it was determined that Hive was great for long-running analytical queries and for data engineering pipelines, but not for interactive work. Trino (fka PrestoSQL) was created to execute fast queries. It achieved this by maintaining its own cluster of dedicated compute nodes, separate from Hadoop, and by keeping intermediate data in memory rather than spilling it to disk during the intermediary steps of a query.
This improved speed significantly, but at the cost of reliability. If anything went wrong with the execution, an error was returned to the person or process that submitted the query.
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 | 2012 |  |
| Performance |  | 2012 |  |
| Durability | 2006 |  |  |
An added benefit of separating compute and storage is that Trino can utilize a variety of connectors, making it a single point of access for multiple data systems, not just different file formats on the data lake. Additionally, federated queries can be run across the connectors. Starburst has gone further with additional and enhanced Connectors.
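To illustrate, here’s a hedged sketch of a federated Trino query; the catalog, schema, and table names are hypothetical and depend on how your connectors are configured.

```sql
-- Join a table on the data lake (Hive connector) with a table in an
-- operational PostgreSQL database, all in a single Trino query.
SELECT
  l.user_id,
  count(*)    AS page_views,
  max(c.plan) AS plan
FROM hive.web.web_logs AS l
JOIN postgresql.public.customers AS c
  ON l.user_id = c.user_id
GROUP BY l.user_id;
```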
Apache Spark emerges (2014)

Some grad students working at UC Berkeley’s AMPLab were enjoying the new ability to run jobs on Hadoop, but they realized that for iterative processing (such as machine learning algorithms), Hadoop’s inherent sharing model of resource management was hurting them. The Spark creators recognized the existence of resource managers like Hadoop YARN and chose to utilize these existing tools rather than reinvent them.
They started building Spark (still a MapReduce-style engine) and realized that if they allocated all the resources they needed at the start of a program, and coupled that with in-memory caching options (for when the processing really needed to revisit the same immutable data over and over), they could run jobs 50-100x faster.
For non-iterative (i.e., good old-fashioned data engineering) jobs, the execution could easily be 3-7x faster due to not needing to request resources from parallel task to parallel task; therefore, we can declare Spark a performance-oriented framework for a variety of workloads.
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 | 2012 |  |
| Performance |  | 2012 | 2014 |
| Durability | 2006 |  | 2014 |
At this point, the primary API was focused on the Resilient Distributed Dataset (RDD), which required programming expertise.
Spark adds SQL support (2015)

As we know, the data analysis world is fueled by SQL. It didn’t take Spark long to add its DataFrame API, which, in addition to a programmatic API, allows for classical SQL operations. This rounded out the Spark platform regarding the features of SQL, Performance, and Durability.
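Here’s a minimal Spark SQL sketch (the table name is hypothetical) showing both halves of the story: classical SQL operations, plus the in-memory caching that makes revisiting the same data fast.

```sql
-- Pin the table in Spark's in-memory cache so repeated passes over the
-- same immutable data (typical of iterative workloads) skip re-reading files.
CACHE TABLE web_logs;

-- Classical SQL then runs against the cached copy.
SELECT url, count(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```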
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 | 2012 | 2015 |
| Performance |  | 2012 | 2014 |
| Durability | 2006 |  | 2014 |
LLAP hits the scene (2017)

As this whole blog post is a testament to user requirements driving features, AND imitation is the best form of flattery, the Hive community created an optional framework called Live Long and Process (LLAP). It has a lot of sophistication, but I’ll boil it down to this: LLAP keeps processing resources allocated, online, and ready for querying, along with a shared memory cache across all the nodes those resources live on.
This solution can easily attain sub-second query results on datasets that fit in the shared cache, and LLAP doesn’t have to coordinate with YARN for every query. While LLAP is an optional element of Hive, it truly does bring high performance to a stable SQL engine.
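As a rough sketch, turning LLAP on typically comes down to a few Hive settings like the ones below; exact property values vary by distribution and version, so treat this as illustrative.

```sql
-- Run queries on the Tez engine and route execution to the long-lived
-- LLAP daemons instead of spinning up per-query YARN containers.
SET hive.execution.engine=tez;
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;
```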
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 | 2012 | 2015 |
| Performance | 2017 | 2012 | 2014 |
| Durability | 2006 |  | 2014 |
Project Tardigrade (2022)

Over the years that Trino has been fulfilling its role as a fast query processor, users have also been leveraging it in their ETL pipelines. While Facebook has used fault-tolerant execution with Presto for years, this feature has now been introduced to open-source Trino. This feature-release blog post offers more details.
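Fault-tolerant execution is switched on through cluster configuration rather than SQL; here’s a minimal sketch based on the Trino documentation, with a placeholder S3 bucket for the spooling location.

```properties
# config.properties: retry individual tasks instead of failing the whole query
retry-policy=TASK

# exchange-manager.properties: durable spooling of intermediate exchange data
exchange-manager.name=filesystem
exchange.base-directories=s3://example-bucket/trino-exchange
```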
|  | Hive | Trino | Spark |
|---|---|---|---|
| SQL | 2010 | 2012 | 2015 |
| Performance | 2017 | 2012 | 2014 |
| Durability | 2006 | 2022 | 2014 |
Now, as promised, we finally have a full table indicating that all three popular SQL engines deliver SQL, Performance, AND Durability. This concludes our history lesson for today. 🙂
Post originally published by Lester Martin as “hive, trino & spark features (their journeys to sql, performance & durability)” at Lester Martin (l11n) Blog.



