Spark 1TB Data Processing
The HDFS blocks are distributed across the 2 EC2 instances, which
are configured as a Hadoop cluster. Each instance runs a DataNode
daemon that stores a portion of the data.
The Spark executors load the HDFS blocks into executor memory, which
Spark divides into two parts: storage memory, which holds cached data,
and execution memory, which is used for shuffles, joins, and aggregations.
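As a rough illustration, the split between those two regions is controlled by a pair of standard Spark settings. The session below is a minimal sketch for the 2-instance cluster described above; the app name and the 8g heap size are assumptions, and the fraction values shown are simply Spark's defaults.

from pyspark.sql import SparkSession

# Hypothetical session for the 2-instance cluster described above.
# spark.memory.fraction: share of the heap used for execution + storage (default 0.6).
# spark.memory.storageFraction: share of that region reserved for cached data (default 0.5).
spark = (
    SparkSession.builder
    .appName("1tb-processing")                      # assumed app name
    .config("spark.executor.memory", "8g")          # assumed executor heap size
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)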
The Spark executors process the data in parallel in three phases (a
minimal sketch follows the list):
Map phase: Each executor applies the map transformation to every
element of its partitions, producing intermediate key-value pairs.
Shuffle phase: The intermediate data is redistributed across the
executors by key, so that all records sharing a key land on the same
executor.
Reduce phase: Each executor aggregates its shuffled records with the
reduce function, combining the values for each key.
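To make the three phases concrete, here is a minimal word-count-style sketch. The input and output paths are hypothetical, and the use of the RDD API rather than DataFrames is an assumption made to keep the map/shuffle/reduce boundaries visible.

from pyspark import SparkContext

sc = SparkContext(appName="map-shuffle-reduce-sketch")

# Map phase: turn each input line into (key, value) pairs.
pairs = (
    sc.textFile("hdfs:///data/input")         # hypothetical 1TB input path
      .flatMap(lambda line: line.split())     # one record per word
      .map(lambda word: (word, 1))            # key-value pairs for aggregation
)

# Shuffle + reduce phase: reduceByKey repartitions the pairs so that all
# records with the same key land on the same executor, then aggregates them.
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs:///data/output")  # hypothetical output path

Note that reduceByKey also combines values map-side before the shuffle, so only partial aggregates cross the network.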
MapReduce Processing
The mappers process the data in parallel using the following steps (a
mapper sketch follows the list):
Map phase: Each mapper applies the map function to every record in its
input split, emitting intermediate key-value pairs.
Shuffle phase: The framework partitions the mapper output by key and
copies each partition to the reducer responsible for it, so every
reducer receives all values for its assigned keys.
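For illustration, a mapper in this model can be written as a Hadoop Streaming script; reading lines from stdin and printing tab-separated key-value pairs is the Streaming convention, and the word-count logic is an assumed example.

import sys

# Hadoop Streaming mapper: read raw lines from stdin and emit
# tab-separated (key, value) pairs, one per output record.
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")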
Step 10: Reducer Allocation
Each reducer aggregates the values for its assigned keys in parallel
with the other reducers and writes its output to HDFS (a reducer sketch
follows). The processed data is stored in HDFS with one part file per
reducer, so with two reducers it is divided into two parts: part-00000
and part-00001.
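A matching Streaming reducer might look like the sketch below; it relies on the framework sorting the shuffled pairs by key before they reach stdin, so equal keys arrive adjacent to one another.

import sys

# Hadoop Streaming reducer: input arrives sorted by key, so equal keys
# are adjacent and can be summed with a running total.
current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")  # emit the aggregate for the previous key
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")          # emit the final key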