From Apache Hive to Apache Spark
Hive has an inherent limitation that undermines its usefulness for modern analytics. As middleware for Hadoop, Hive needs time to translate each HiveQL statement into Hadoop-compatible execution plans and return results. This added latency compounds the limitations of MapReduce itself. In multi-stage workflows, MapReduce writes each interim result to physical storage and reads it back as input for the next stage. This constant reading and writing makes iterative processing difficult to scale.
Data scientists at the University of California, Berkeley developed Spark, a faster system for their machine learning projects that relies on in-memory processing. Spark reads data from storage when a job starts and writes only the final results.
The Spark framework is composed of the following elements:
Spark Core – The underlying execution engine that provides scheduling, memory management, and I/O for the other components.
Spark SQL – A module for working with structured data using ANSI-standard SQL.
Machine Learning library (MLlib) – Libraries of optimized machine learning algorithms for the Java, Scala, Python, and R programming languages.
Structured Streaming – Micro-batching APIs for processing streaming data in near real-time.
GraphX – Extends Spark’s resilient distributed dataset (RDD) to support graph processing.
Hadoop and Apache Spark vs Trino
Engineers at Facebook experienced similar challenges with Hadoop and Hive. In addition to Hive’s inherent latency, using Hive to process petabytes of social media data incurred significant performance penalties due to its data management approach.
Hive organizes data at the folder level, so queries must run file list operations against each partition directory to gather the required metadata. The more partitions, the more list operations, and the slower queries run. Another issue comes from running on Hadoop, which forces rewrites of files and tables to accommodate changes.
Introducing Trino
Facebook’s internal project eventually evolved into Trino, an open-source, massively parallel SQL query engine that federates disparate data sources within a single interface. Like Spark, Trino queries data directly without using MapReduce. Also like Spark, Trino uses ANSI-standard SQL, which makes datasets more accessible to a broader range of users.
Trino’s strength lies in its ability to process data from multiple sources simultaneously. Rather than limiting analytics to a monolithic data warehouse, Trino connectors can access structured and unstructured data in relational databases, real-time data processing systems, and other enterprise sources.
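For example, a single Trino query can join a table stored in Hadoop with a table in an operational database. The sketch below assumes a Hive connector catalog named hive and a PostgreSQL connector catalog named postgresql; all schema and table names are illustrative placeholders, not values from any specific deployment.

```sql
-- Hypothetical federated query: join data in Hadoop/HDFS (via the Hive
-- connector) with data in an operational PostgreSQL database.
-- Catalog, schema, and table names are illustrative assumptions.
SELECT c.customer_id,
       c.region,
       SUM(o.order_total) AS lifetime_value
FROM hive.sales.orders AS o              -- table stored in Hadoop
JOIN postgresql.crm.customers AS c       -- table in PostgreSQL
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region
ORDER BY lifetime_value DESC
LIMIT 100;
```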
This flexibility enables users to conduct large-scale data analysis projects, yielding more profound and more nuanced insights to inform decision-making.
Related reading: Spark vs Trino
Open data lakehouses
Trino becomes the query engine and interface for modern open data lakehouse architectures when combined with cost-effective object storage services and the Iceberg open table format.
Object storage services use flat namespaces to store data more efficiently, eliminating the file-listing overhead of Hive-like hierarchical folder structures.
Iceberg’s metadata-based table format greatly simplifies lakehouse data management. Schema and partition changes do not require rewriting data files; Iceberg records each change as a new snapshot of the table metadata.
Trino provides a SQL-based interface for querying and managing Iceberg tables, as well as the underlying data. The engine’s low latency, even when processing large volumes of data, makes Trino ideal for interactive, ad hoc queries and data science projects. Recent updates have added fault tolerance for reliable batch processing, and Trino can ingest data from real-time data processing systems such as Kafka.
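To illustrate, here is a minimal sketch of managing an Iceberg table through Trino SQL. The catalog, schema, and table names, as well as the timestamp, are placeholder values that depend on your deployment.

```sql
-- Hypothetical Iceberg table created and managed through Trino.
CREATE TABLE iceberg.analytics.page_views (
    user_id   BIGINT,
    url       VARCHAR,
    event_ts  TIMESTAMP(6)
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(event_ts)']
);

-- Every table change is recorded as a new metadata snapshot,
-- which can be inspected through the hidden $snapshots table.
SELECT snapshot_id, committed_at, operation
FROM iceberg.analytics."page_views$snapshots"
ORDER BY committed_at DESC;

-- Time travel: query the table as it existed at an earlier point in time
-- (the timestamp below is only an example).
SELECT count(*)
FROM iceberg.analytics.page_views
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```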
Data migration strategy: Navigating the shift from Hadoop to cloud data lakehouses
Hadoop’s rapid adoption was how companies kept pace with the explosion of data over the past twenty years. However, these legacy systems are increasingly complex and costly to maintain, especially when run on-premises. Moving to the cloud reduces costs, improves performance, and makes enterprise data architectures far more scalable. At the same time, any migration project risks significant business disruption should anything go wrong.
Starburst’s enterprise-enhanced Trino solution is ideal for migrating data from Hadoop to a cloud data lakehouse. Starburst features that streamline data migrations include:
Federation
Starburst enhances Trino’s connectors with performance and security features, resulting in more robust integrations between on-premises and cloud-based data sources. These connectors enable companies to dissolve data silos, allowing users to access data where it resides.
Unified interface
By abstracting all data sources within a single interface, Starburst becomes the single point of access and governance for a company’s data architecture. Engineers can build pipelines and manage analytics workloads from a single pane of glass. At the same time, data consumers can use the SQL tools they already know to query data across the company.
Migration transparency
Once users become accustomed to accessing data through the Starburst interface, they no longer need to worry about where that data resides; the architecture becomes transparent. Engineers can migrate data from Hadoop to a data lakehouse without impacting day-to-day business operations, and when the migration is complete, the data team simply switches connectors behind the scenes.
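As a rough sketch, assume a catalog named lake that analysts query throughout the migration; the catalog, schema, and table names are hypothetical. The user-facing SQL stays the same while the connector behind the catalog changes.

```sql
-- Before migration: the "lake" catalog is configured with the Hive
-- connector over HDFS.
SELECT region, SUM(order_total) AS revenue
FROM lake.sales.orders
GROUP BY region;

-- After migration: the "lake" catalog is reconfigured to the Iceberg
-- connector over object storage. Dashboards and queries run unchanged.
SELECT region, SUM(order_total) AS revenue
FROM lake.sales.orders
GROUP BY region;
```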
How Starburst helps with Apache Hadoop modernization
Starburst gives companies the optionality they never had with proprietary analytics platforms. While pairing Trino and Iceberg provides the most robust, performant alternative to Hadoop, companies can use Starburst to migrate to any modern data platform.
1. Migration to Apache Iceberg
Apache Iceberg’s open table format enjoys widespread support in the enterprise space as well as among its deep pool of open-source contributors. Iceberg’s symbiotic relationship with Trino contributes to Starburst’s enhanced features, including automated Iceberg table maintenance.
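For instance, Trino’s Iceberg connector exposes table maintenance as SQL commands, which Starburst can run on an automated schedule. The table name and retention values below are illustrative only.

```sql
-- Compact small data files into larger ones.
ALTER TABLE iceberg.analytics.page_views EXECUTE optimize;

-- Remove metadata snapshots older than the retention threshold.
ALTER TABLE iceberg.analytics.page_views
    EXECUTE expire_snapshots(retention_threshold => '7d');

-- Delete data files no longer referenced by any snapshot.
ALTER TABLE iceberg.analytics.page_views
    EXECUTE remove_orphan_files(retention_threshold => '7d');
```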
2. Migration to Delta Lake
Although closely entwined with a single company, Delta Lake is an alternative open table format for companies committed to a Databricks solution. Starburst’s Delta Lake connector can support migrations from Hadoop.
3. Migration to Apache Hudi
Faced with the same Hadoop/Hive limitations as everyone else, Uber developers created Apache Hudi, an incremental processing stack that accelerates integration workloads. The framework is compatible with several open query engines, including Trino, enabling companies to utilize Starburst as their analytics interface for a Hudi architecture.
Cost and productivity insights
After dealing with the expense, limitations, and complexity of Hadoop management, Starburst customers who migrated to open data lakehouses realized significant improvements in cost and productivity.
Migration – Starburst’s integration with Hadoop implementations and object storage services streamlines data migration, while also minimizing the hardware and cloud resources required to complete the project.
Operations – Starburst unifies enterprise architectures to simplify data operations and reduce maintenance demands.
Productivity – Starburst’s single point of access enables analysts to access the data they need faster, resulting in richer insights and more agile, informed decision-making.
Learn more about data migration with Starburst through our webinar and ebook.