
For almost two decades, companies have built big data processing architectures on the Hadoop ecosystem. To extend Hadoop beyond its core design, they rely on other Apache Software Foundation projects, such as the HBase non-relational database or the Oozie workflow scheduler.
Yet, Hadoop cannot address the full scope of enterprise data requirements.
Apache Hadoop, together with its extended data management ecosystem, lets enterprises store and process large amounts of data using distributed computing on commodity hardware. The open-source framework popularized big data analytics by letting companies manage large datasets affordably. Although revolutionary for its time, the Hadoop framework poses complexity, performance, and latency challenges for the way organizations use data today.
This article will discuss how the Trino massively parallel processing SQL query engine, enhanced by Starburst, significantly improves Hadoop data management.
Data Ingestion
Ingestion, the landing and staging of raw data from a source, is the traditional first step in Hadoop data integration. However, batch ingestion is no longer the only process for integrating enterprise data.
Batch Ingestion
The relative obscurity of Hadoop’s programming model is challenging. MapReduce jobs are typically written in Java against a low-level API that few people within, much less beyond, the data management community understand. Building even basic batch processing jobs requires input from a data team’s overstretched Hadoop experts. Projects like Apache Sqoop, now archived, let data teams extract data from relational databases using more widely understood SQL queries.
How Starburst Helps
Besides MapReduce’s obscurity, ETL ingestion pipelines impose storage and network penalties as large volumes of data flow from a relational database to Hadoop’s staging area. Trino’s SQL engine can push down parts of a query, such as filters, so that the source system processes the data directly. Rather than transferring huge datasets into interim storage, Trino lets pipelines load only the final results into the destination.
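As a rough illustration of loading only final results, the sketch below builds a single Trino CREATE TABLE AS statement that filters and aggregates a relational table and writes the output to a data lake table. The catalog, schema, table, and column names (hive.analytics.customer_totals, postgresql.public.orders, customer_id, total, order_date) are hypothetical, and whether the filter and aggregation actually run at the source depends on the connector in use.

```python
def build_pushdown_ctas(target_table: str, source_table: str, since: str) -> str:
    """Return a Trino CREATE TABLE AS statement that loads only aggregated results.

    All table and column names are illustrative assumptions, not part of any
    real deployment.
    """
    # Double any single quotes so the date literal cannot break out of the string.
    safe_since = since.replace("'", "''")
    return (
        f"CREATE TABLE {target_table} AS\n"
        f"SELECT customer_id, sum(total) AS lifetime_total\n"
        f"FROM {source_table}\n"
        f"WHERE order_date >= DATE '{safe_since}'\n"
        f"GROUP BY customer_id"
    )


if __name__ == "__main__":
    print(
        build_pushdown_ctas(
            "hive.analytics.customer_totals",  # hypothetical destination table
            "postgresql.public.orders",        # hypothetical source table
            "2024-01-01",
        )
    )
```

The statement names the source and the destination in one query, so no interim staging table appears in the pipeline.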
Real-Time Ingestion
Hadoop wasn’t designed for the constant flow of small files generated by e-commerce, social media, and other real-time sources. Event streaming platforms like Apache Kafka and stream processing frameworks like Apache Flink collect data from these sources for ingestion into Hadoop repositories.
However, streaming data is often sparse: much of what these sources generate is not useful for analysis. The Hadoop Distributed File System (HDFS) is optimized for large files, and its NameNode tracks metadata for every file and block in memory, so ingesting a constant stream of small files strains cluster capacity and raises storage costs.
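A back-of-the-envelope calculation can make the small-files cost concrete. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object, along with the common 128 MiB default block size. Both figures are assumptions for illustration, not measurements from any particular cluster.

```python
BYTES_PER_OBJECT = 150         # assumed rule-of-thumb heap cost per file or block object
BLOCK_SIZE = 128 * 1024 ** 2   # 128 MiB, a common HDFS default block size


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def namenode_bytes(total_bytes: int, file_size: int) -> int:
    """Estimate NameNode heap needed to track total_bytes stored as equal-sized files."""
    files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, BLOCK_SIZE)
    # One object per file plus one object per block of that file.
    objects = files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_tib = 1024 ** 4
    small = namenode_bytes(one_tib, 1024 ** 2)        # 1 MiB files
    large = namenode_bytes(one_tib, 128 * 1024 ** 2)  # 128 MiB files
    print(f"1 MiB files:   {small / 1024 ** 2:.2f} MiB of heap")
    print(f"128 MiB files: {large / 1024 ** 2:.2f} MiB of heap")
```

Under these assumptions, one tebibyte stored as 1 MiB files needs about 300 MiB of NameNode heap, compared with about 2.3 MiB when the same data is stored as 128 MiB files, a 128-fold difference.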
How Starburst Helps
Trino’s Kafka connector lets data managers query streaming sources directly without transferring full data streams. Starburst administrators can add schema metadata to a Kafka topic, allowing users to run real-time SQL queries from their preferred application. Starburst’s JSON functions help queries apply schema-on-read to semi-structured data streams. As with batch ingestion, Starburst lets engineers push down query processing to reduce storage and compute costs.
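To show what schema-on-read over a stream might look like, the sketch below assembles a query that extracts fields from raw JSON messages with Trino's json_extract_scalar function. It assumes a catalog named kafka with the default schema, a hypothetical topic called clicks, and messages exposed through the connector's _message column. The field names and JSON paths are illustrative only.

```python
def build_kafka_json_query(topic: str, fields: dict[str, str], limit: int = 100) -> str:
    """Return a Trino query that applies schema-on-read to raw JSON Kafka messages.

    `fields` maps an output column alias to a JSON path. The `kafka.default`
    catalog and schema names are assumptions for this sketch.
    """
    columns = ",\n".join(
        f"  json_extract_scalar(_message, '{path}') AS {alias}"
        for alias, path in fields.items()
    )
    return (
        f"SELECT\n{columns}\n"
        f"FROM kafka.default.{topic}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical topic and message layout.
    print(build_kafka_json_query("clicks", {"user_id": "$.user.id", "page": "$.page"}, limit=10))
```

Because the query applies the structure at read time, the topic itself needs no fixed schema before analysts can start exploring it.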
Data Processing
Processing large datasets involves enforcing data warehouse schemas during integration and supporting data analysis. Hadoop made processing more cost-effective than the proprietary systems of twenty years ago, but inherent performance limitations pose challenges for today’s large workloads.
Hadoop: MapReduce
MapReduce is one of Hadoop’s core modules, along with HDFS, Yet Another Resource Negotiator (YARN), and Hadoop Common. MapReduce and YARN work with the NameNode and DataNodes in Hadoop clusters to create parallelized workflows that reliably process data. However, Hadoop prioritizes high availability and fault tolerance over performance, which imposes significant latency penalties on high-throughput jobs.
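For readers unfamiliar with the programming model, here is a minimal single-process sketch of the map, shuffle, and reduce phases using a word count. It is a conceptual illustration in Python, not Hadoop's Java API. In a real cluster the intermediate pairs are written to disk and moved between nodes during the shuffle, which is one source of the latency described above.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the shuffle step does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```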
How Starburst Helps
Trino bypasses MapReduce and YARN entirely to query Hadoop data directly. Starburst further improves Hadoop data processing by enhancing Trino with cost-based optimizations, dynamic filtering, cached views, and more. Creating Trino queries through Starburst accelerates processing jobs and reduces network traffic to deliver results faster.
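The idea behind dynamic filtering can be sketched in a few lines: the join keys that survive a selective filter on the small table are collected first, then used to skip non-matching rows while the large table is scanned. The code below is a simplified, in-memory illustration of that concept with hypothetical row layouts. It is not how Trino or Starburst implement the feature internally.

```python
def join_with_dynamic_filter(fact_rows, dim_rows, dim_predicate):
    """Join fact rows to filtered dimension rows, skipping fact rows early.

    Returns the joined rows and the number of fact rows that passed the filter.
    """
    # Build side: apply the selective predicate to the small table first.
    build = {row["id"]: row for row in dim_rows if dim_predicate(row)}
    # The dynamic filter is the set of join keys that survived the build side.
    dynamic_filter = set(build)

    rows_passed = 0
    joined = []
    for row in fact_rows:
        if row["dim_id"] not in dynamic_filter:
            continue  # dropped during the scan, before reaching the join
        rows_passed += 1
        joined.append({**row, **build[row["dim_id"]]})
    return joined, rows_passed


if __name__ == "__main__":
    dims = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
    facts = [
        {"dim_id": 1, "amount": 10},
        {"dim_id": 2, "amount": 20},
        {"dim_id": 1, "amount": 5},
    ]
    joined, rows_passed = join_with_dynamic_filter(facts, dims, lambda d: d["region"] == "EU")
    print(rows_passed, joined)
```

The fewer rows that pass the scan, the less data has to cross the network to reach the join, which is where the reduction in traffic comes from.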