Apache Hadoop
Apache Hadoop is a collection of open-source software tools that speeds up the processing of very large amounts of data by distributing the data, and the work done on it, across multiple computers. Its core consists of:
- A storage part, the Hadoop Distributed File System (HDFS). This abstracts away the distributed nature of storage so that data can be moved around the cluster of computers more easily (see the HDFS client sketch after this list).
- A processing part, Hadoop MapReduce. This is used to program the transformations in the intended data processing pipeline (see the word-count sketch after this list):
  - The Map phase organises the data into key:value pairs.
  - The Shuffle step then redistributes those pairs across the nodes of the cluster so that all pairs with the same key end up on the same node.
  - The Reduce phase applies a transformation to each group, perhaps just a mathematical operation such as addition, and writes the result back to HDFS.
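The storage part can be made concrete with a minimal sketch of Hadoop's Java client API for HDFS: a program reads and writes paths much as it would on a local file system, while the framework handles block placement and replication behind the scenes. The NameNode address and the file paths below are made-up placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's NameNode; the address is a placeholder.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Copy a local file into the cluster. HDFS splits it into blocks and
        // replicates them across DataNodes, but the caller only sees one path.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/data/input.txt"));

        // List what is stored under /data as if it were an ordinary directory.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```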
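The Map, Shuffle and Reduce steps can be illustrated with the classic word-count job written against the org.apache.hadoop.mapreduce API. This is a minimal sketch rather than a production job; the input and output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: turn each line of input into (word, 1) key:value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Shuffle happens between map and reduce: the framework groups all values
    // with the same key and delivers each group to a single reduce call.

    // Reduce phase: add up the counts for each word and write the result,
    // which the framework stores back in HDFS.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```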
The distributed nature of Hadoop means that processing:
- can be done in parallel rather than only sequentially, where one step must finish before another begins.
- is far more fault tolerant, because data blocks are replicated across nodes and failed tasks can be rerun on other machines, so the pipeline has higher availability.
Because Hadoop MapReduce reads from and writes to disk for each transformation, it is slower for most pipelines than Apache Spark, which keeps intermediate data in RAM rather than on disk and also runs within the Hadoop ecosystem.
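For contrast, a rough sketch of the same word count using Spark's Java RDD API is shown below; the intermediate pairs stay in memory between steps instead of being written to disk. The input and output paths are placeholders.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input"); // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum); // intermediate results stay in memory
            counts.saveAsTextFile("hdfs:///data/output"); // placeholder path
        }
    }
}
```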