What is stream processing?
Stream processing is a mechanism for collecting and transforming the output generated by continuous data sources like social media, e-commerce clickstreams, and Industrial Internet of Things (IIoT) sensors. Because data is processed as it arrives rather than in scheduled batches, stream processing introduces minimal latency and enables near real-time analysis of data streams.
Stream processing: use cases and key features
Stream processing works with continuous data sources that generate large amounts of sparse data: an unbounded stream with no beginning or end in which only a small fraction of the data points carry business value.
Some use cases for stream processing include:
Anomaly detection: Transaction and activity logging systems generate a large amount of data, some of which could indicate anomalous behavior. Near real-time data analysis supports fraud detection in the financial industry and cybersecurity across all industries.
Predictive analysis: IIoT sensors report environmental information and the status of industrial processes. Stream processing collects this data for predictive models that anticipate equipment failure or process variances.
Customer experience: Analyzing e-commerce clickstreams in real-time lets automated systems tailor optimal customer experiences that maximize sales and customer satisfaction.
Market trading: Automated trading systems operate orders of magnitude faster than individual traders and rely on real-time market, weather, and other data streams.
These use cases are significantly more time-sensitive than those for batch processing. They prioritize low latency because they rely on near real-time insights to drive effective actions. In market trading, a few milliseconds’ delay could cause significant trading losses. Automated marketing systems, by contrast, respond on human timescales and can accept latencies measured in seconds.
Technologies and tools: Streaming ingestion and processing
Open-source frameworks like Apache Kafka and Apache Flink let organizations build stream processing systems. For companies already using Apache Spark for batch processing, the framework’s Structured Streaming feature provides a unified API for processing data streams.
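As an illustration, the sketch below uses PySpark Structured Streaming to subscribe to a Kafka topic and maintain a running count of page views. The clickstream topic name, broker address, and JSON payload fields are hypothetical, and the job assumes the Spark Kafka connector package is available on the cluster.

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and
# count events per URL in near real time. Assumes the spark-sql-kafka
# connector is on the classpath and that a hypothetical "clickstream"
# topic exists on localhost:9092.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

# Subscribe to the topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka values arrive as bytes; decode them and pull out a JSON field.
page_views = (
    events
    .select(get_json_object(col("value").cast("string"), "$.url").alias("url"))
    .groupBy("url")
    .count()
)

# Print running counts to the console; a production job would write to a
# durable sink such as a data lake table instead.
query = (
    page_views.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```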
Stream processing with Starburst and Trino
Stream processing systems’ output often goes to a permanent repository for near real-time analysis or longer-term analytics. The challenge is ingesting these streams at scale without relying on complex code or risking data duplication.
Starburst’s streaming ingestion solution automatically ingests data streams from Kafka-compliant topics and writes the data to Iceberg tables in an open data lakehouse with exactly-once guarantees. This process converts the incoming streams into micro-batches for copying into the Iceberg table, which results in tables containing many very small files. Trino’s Iceberg connector lets data teams programmatically compact the Iceberg tables to contain fewer, larger files.
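The compaction step can be scripted against Trino’s Iceberg connector using its table maintenance commands. The sketch below uses the Trino Python client; the coordinator host, catalog, schema, table name, and thresholds are hypothetical placeholders, and authentication is omitted for brevity.

```python
# Sketch of programmatic Iceberg table maintenance through Trino, using
# the trino Python client (pip install trino). Connection details and
# table names are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator hostname
    port=443,
    user="data-engineer",
    http_scheme="https",
    catalog="iceberg",
    schema="streaming",
)
cur = conn.cursor()

# Rewrite the many small files produced by micro-batch ingestion into
# fewer, larger files.
cur.execute(
    "ALTER TABLE iceberg.streaming.clickstream_events "
    "EXECUTE optimize(file_size_threshold => '128MB')"
)
cur.fetchall()

# Optionally expire old snapshots to reclaim object storage.
cur.execute(
    "ALTER TABLE iceberg.streaming.clickstream_events "
    "EXECUTE expire_snapshots(retention_threshold => '7d')"
)
cur.fetchall()
```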
Key differences between batch processing and stream processing
These two data processing approaches serve different purposes and are not mutually exclusive. Most companies use both in their data architectures. Key differences between batch and stream processing include:
Bounded vs. unbounded data
Batch jobs process complete, bounded datasets, which makes it possible to schedule them during periods of low resource utilization. Data streams have no beginning or end, so stream processing operates continuously on unbounded data.
High vs. low latency
Within the context of a batch job, latency is less critical than reliability. Whether it takes an hour or a weekend to process payroll doesn’t matter as long as the job completes successfully and on time. By contrast, stream processing use cases depend on receiving data as quickly as possible to close trades or catch security breaches.
Historical vs. real-time analytics
Batch processing is closely associated with historical analysis, as in financial reporting and machine learning. These applications don’t require the most recent data possible since they seek long-term business insights. Analyzing real-time data streams lets organizations react to events without delay, often in conjunction with predictive models.
Key differences between ingestion and processing
The two data processing approaches have different implications for how data teams ingest and process data destined for their repositories.
Batch processing implications
Batch ingestion periodically lands large chunks of data, which causes short-term spikes in network and storage utilization. Moving petabyte-scale batches from source to destination congests networks and requires carving out large staging areas in storage. The resource impact of ingesting such large amounts of data is why organizations typically schedule batches when network and system utilization rates are low.
Once ingested, a batch is processed all at once, even if complex jobs take hours or even days to complete. As a result, batch processing creates spikes in demand for compute resources. Without the scalability of a cloud services provider, organizations would overinvest to meet peak demand.
Because batch processing is periodic and high latency, it depends on careful scheduling to ensure jobs are completed on time without disrupting operational systems.
Stream processing implications
Unlike flexible, discontinuous batch processes, streaming processes are constant and must meet low-latency service levels. As a result, IT departments must invest in baseline capacity for network, compute, and storage.
Icehouse for data ingestion and data processing
An Icehouse architecture builds on Iceberg’s open table format, Trino’s massively parallel processing engine, and cloud object storage. Unlike data lakes or data warehouses, the Icehouse is not a central repository for all enterprise data. Trino’s connectors let organizations federate their data architecture, consolidating critical data in object storage while leaving infrequently accessed data at the source. A single ANSI-standard SQL statement can query multiple sources simultaneously.
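To make the federation point concrete, here is one way a single federated statement might look, reusing the hypothetical Trino connection from the earlier maintenance sketch. The catalog, schema, table, and column names are placeholders for whatever sources a deployment actually registers.

```python
# A single SQL statement joining an Iceberg table in object storage with
# an operational PostgreSQL table through Trino. All names are
# hypothetical placeholders.
cur.execute("""
    SELECT c.customer_id,
           c.segment,
           count(*) AS page_views
    FROM iceberg.streaming.clickstream_events e
    JOIN postgresql.crm.customers c
      ON e.customer_id = c.customer_id
    WHERE e.event_date >= current_date - INTERVAL '7' DAY
    GROUP BY c.customer_id, c.segment
    ORDER BY page_views DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
```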
Founded by Trino’s developers, Starburst enhances the query engine’s core features with performance, management, and governance optimizations to streamline Icehouse administration.
A Starburst Icehouse replaces a data warehouse’s complex, difficult-to-manage ETL pipelines and rigid schema-on-write structure with ELT workflows and a schema-on-read paradigm. Whether ingesting data from streaming sources or in batches, Starburst lands raw data in Iceberg tables. Queries apply schema and transform data at runtime to keep data as useful as possible for future use cases.
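The schema-on-read pattern can be illustrated with a query that parses raw JSON payloads at runtime rather than enforcing a schema on ingest. The table and field names below are hypothetical, and the query again reuses the earlier Trino connection sketch.

```python
# Schema-on-read sketch: raw JSON events are landed as-is in an Iceberg
# table, and the schema is applied at query time. Names are hypothetical.
cur.execute("""
    SELECT
        CAST(json_extract_scalar(raw_payload, '$.user_id') AS BIGINT) AS user_id,
        json_extract_scalar(raw_payload, '$.event_type') AS event_type,
        from_iso8601_timestamp(json_extract_scalar(raw_payload, '$.ts')) AS event_time
    FROM iceberg.raw.landing_events
    WHERE json_extract_scalar(raw_payload, '$.event_type') = 'purchase'
""")
purchases = cur.fetchall()
```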
Starburst streamlines the management of your Icehouse. Automatic capacity management features like auto-scaling, smart indexing, and enhanced fault-tolerant execution optimize Trino clusters. Data compaction, profiling, and other data lake optimization resources let you automate object storage management. Finally, Starburst’s universal discovery and governance resource, Gravity, automates cataloging and the enforcement of fine-grained role and attribute-based access controls.