
Computing history has been a steady progression from expensive, proprietary technologies to more affordable, scalable, open platforms. Enterprise reliance on mainframe vendors to process large amounts of data began to fade in 2006 with the release of two competing storage systems.
The Apache Hadoop Distributed File System (HDFS) lets companies process large datasets on commodity hardware running in their on-premises data centers.
Cloud-based object storage, pioneered by the Amazon Simple Storage Service (Amazon S3), lets companies leverage the scalability and capacity of third-party storage services.
This article compares these two approaches to big data management, explains how enterprise analytics works with each, and looks at how companies are modernizing their HDFS architectures by migrating to cloud-based open data lakehouses.
What is the Hadoop Distributed File System (HDFS)?
HDFS is a distributed file system and data store that lets companies manage terabytes of data on the commodity servers running in their data centers. One of the Hadoop framework’s four core modules, HDFS enables fault-tolerant, high-throughput data access for various use cases, including data warehouses.
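For a sense of what that looks like in practice, here is a minimal sketch of writing and reading a file on HDFS from Python using the pyarrow library’s HDFS bindings. The hostname, port, and paths are hypothetical, and the client assumes a reachable NameNode plus a local Hadoop installation (pyarrow loads libhdfs through the JVM).

```python
from pyarrow import fs

# Connect to the cluster's NameNode (hostname and port are placeholders;
# pyarrow resolves libhdfs through the local Hadoop installation).
hdfs = fs.HadoopFileSystem(host="namenode.example.internal", port=8020)

# Write a file; HDFS transparently splits and replicates it across DataNodes.
with hdfs.open_output_stream("/warehouse/raw/events.csv") as f:
    f.write(b"user_id,event,timestamp\n42,click,2024-01-01T00:00:00Z\n")

# Read it back as one logical file, regardless of where the blocks live.
with hdfs.open_input_stream("/warehouse/raw/events.csv") as f:
    print(f.read().decode())
```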
What is Amazon Simple Storage Service (Amazon S3)?
Amazon S3 is a cloud-based data storage service that saves data as objects in an efficient and scalable flat structure. One of the pillars of Amazon Web Services (AWS), along with Elastic Compute Cloud (EC2), S3 lets companies build entire data architectures in the cloud without the expense of an on-premises data center.
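The difference is easiest to see through the API. The sketch below uses boto3, the AWS SDK for Python, to store and retrieve an object; the bucket name and keys are made up, and credentials are assumed to come from the environment or an IAM role. Although the keys contain slashes, S3 treats them as opaque strings in a flat namespace rather than as nested directories.

```python
import boto3

s3 = boto3.client("s3")                # credentials resolved from env/IAM role
bucket = "example-analytics-bucket"    # hypothetical bucket name

# Store an object; "raw/2024/events.json" is a single flat key, not a folder path.
s3.put_object(Bucket=bucket,
              Key="raw/2024/events.json",
              Body=b'{"event": "click", "user_id": 42}')

# Retrieve the object over HTTPS from anywhere with access to the bucket.
obj = s3.get_object(Bucket=bucket, Key="raw/2024/events.json")
print(obj["Body"].read().decode())

# Prefix listings simulate directory browsing over the flat key space.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="raw/2024/")
for item in resp.get("Contents", []):
    print(item["Key"], item["Size"])
```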
HDFS and S3 architecture
HDFS and S3 represent fundamentally different approaches to storing and managing data at scale. Understanding their origins and architectures will help explain why companies are re-evaluating their big data infrastructures to meet data’s ever-increasing volume, velocity, and variety.
HDFS origins
Based on research published by Google and formalized by engineers at Yahoo!, Hadoop and HDFS addressed critical issues facing these pioneering web search companies. Websites had proliferated exponentially since the mid-1990s, forcing search companies to manage data at unprecedented scales and making conventional technology providers prohibitively expensive. As a result, companies like Yahoo! built their own data centers on cheaper, but more failure-prone, commodity servers.
Although Google kept its big data processing systems to itself, Yahoo! handed Hadoop and HDFS to the Apache Software Foundation, letting any company implement more affordable and scalable data architectures. Several design principles guided Hadoop’s development, including:
- Prioritization of read throughput over latency.
- Fault tolerance through data replication and rapid fault detection.
- Data locality, which moves computation to where the data lives to improve performance and avoid network congestion.
Hadoop modules
In addition to HDFS, the Hadoop framework consists of three other modules for storing, managing, and processing massive amounts of data:
- Hadoop Common is a set of shared utilities and libraries.
- Hadoop Yet Another Resource Negotiator (YARN) provides resource and task management.
- Hadoop MapReduce is a Java-based framework for parallelized data processing (illustrated by the sketch after this list).
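To make the MapReduce programming model concrete, here is a minimal word-count job written for Hadoop Streaming, which lets any executable that reads standard input and writes tab-separated key/value pairs act as a mapper or reducer. This is an illustrative sketch rather than the native Java API, and the script names are hypothetical.

```python
#!/usr/bin/env python3
# mapper.py -- emit one "<word>\t1" pair per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop delivers mapper output grouped and sorted by key,
# so a running total per word is all the reducer needs.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job like this is typically submitted with the hadoop-streaming JAR shipped in the Hadoop distribution, passing the input and output HDFS paths along with the mapper and reducer scripts; YARN then schedules the parallel map and reduce tasks across the cluster.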
HDFS architecture
An HDFS cluster consists of a NameNode, which manages the file system’s namespace, and multiple DataNodes, which store the data and serve read and write requests. Rather than keeping each file as a single contiguous unit, HDFS splits files into large blocks (128 MB by default) distributed across DataNodes on different servers and racks. Replicating these blocks ensures that a single server’s failure does not result in data loss.
Although HDFS distributes data blocks across multiple machines, the HDFS namespace presents users and applications with a hierarchical directory structure, much like a desktop operating system’s file system.
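As a rough illustration of those moving parts, the sketch below asks pyarrow’s HDFS client for a specific replication factor and block size when connecting, writes a file, and then lists the hierarchical namespace. The hostname and paths are placeholders, and the replication and default_block_size options are assumptions based on pyarrow’s HadoopFileSystem interface; in any case, the NameNode, not the client, ultimately tracks where each block replica lands.

```python
from pyarrow import fs

# Connect via the NameNode; request 3 replicas per block and 128 MiB blocks.
# (These constructor options are assumed here; the cluster's own
# configuration can override them.)
hdfs = fs.HadoopFileSystem(
    host="namenode.example.internal",
    port=8020,
    replication=3,
    default_block_size=128 * 1024 * 1024,
)

hdfs.create_dir("/warehouse/curated")
with hdfs.open_output_stream("/warehouse/curated/orders.parquet") as f:
    f.write(b"...")  # placeholder payload; a large file would span many blocks

# The namespace still looks like an ordinary directory tree, even though the
# underlying blocks are scattered across DataNodes and racks.
for info in hdfs.get_file_info(fs.FileSelector("/warehouse", recursive=True)):
    print(info.path, info.type, info.size)
```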