Spark 1TB Data Processing


The 1 TB of data is split into fixed-size chunks called HDFS blocks (128 MB by default; 64 MB in older Hadoop versions). Splitting the data this way means each block can be processed independently and in parallel.
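As a rough sanity check (assuming the data divides evenly, with no partial final block), the number of blocks follows directly from the sizes:

```python
# Rough arithmetic, not an exact HDFS calculation: how many blocks 1 TB yields.
ONE_TB = 1024**4              # bytes
BLOCK_128MB = 128 * 1024**2   # current default block size
BLOCK_64MB = 64 * 1024**2     # older default block size

blocks_128 = ONE_TB // BLOCK_128MB   # 8192 blocks at 128 MB
blocks_64 = ONE_TB // BLOCK_64MB     # 16384 blocks at 64 MB
print(blocks_128, blocks_64)
```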

The HDFS blocks are distributed across the 2 EC2 instances, which are configured as a Hadoop cluster. Each instance runs a DataNode that stores a portion of the blocks.
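A toy sketch of spreading those blocks over the two DataNodes (real HDFS placement also considers replication and rack awareness, which this simplification ignores; the instance names are illustrative):

```python
# Round-robin placement of 8192 blocks (1 TB / 128 MB) across two DataNodes.
NUM_BLOCKS = 8192
datanodes = {"instance-1": [], "instance-2": []}
names = list(datanodes)

for block_id in range(NUM_BLOCKS):
    # Alternate blocks between the two instances.
    datanodes[names[block_id % 2]].append(block_id)
```

With an even block count, each DataNode ends up storing half the blocks.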

A Spark job is submitted to the YARN cluster (which handles resource allocation and task scheduling). The job is configured to process the 1 TB of data in parallel using multiple executors.

YARN allocates 2 executors, one on each EC2 instance, to process the job.
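A submission along these lines would request the two executors described above (the script name, core counts, and memory sizes are illustrative assumptions, not values given in this document):

```shell
# Illustrative spark-submit invocation; resource sizes are assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 8g \
  process_1tb.py
```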

The Spark executors load the HDFS blocks into memory, which is divided into two parts:

Storage memory: a portion of each executor's memory caches data that is accessed repeatedly, so it does not have to be re-read from HDFS.
Execution memory: the remaining memory is used for the processing itself, such as the shuffles, sorts, and aggregations performed by the Spark job.
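The split can be sketched with Spark's documented defaults for unified memory (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the 8 GiB heap size is an illustrative assumption:

```python
# Sketch of Spark's unified memory split using the documented defaults.
heap = 8 * 1024**3                # executor JVM heap in bytes (assumed)
reserved = 300 * 1024**2          # fixed reserved memory
usable = (heap - reserved) * 0.6  # unified region (storage + execution)
storage = usable * 0.5            # cache side; execution can borrow from it
execution = usable - storage      # processing side (shuffles, joins, sorts)
```

In practice the boundary is soft: execution can evict cached blocks and borrow storage memory when it needs more room.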

The Spark executors process the data in parallel using the following
steps:
 Map phase: the map function applies a transformation to each
element of the data, in parallel across the executors.
 Shuffle phase: the data is redistributed across the executors by
key, so that all records belonging to the same key end up on the
same executor.
 Reduce phase: the reduce function aggregates the shuffled data
in parallel.
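The three phases above can be mimicked in plain Python (a conceptual word-count sketch, not actual Spark API calls):

```python
from collections import defaultdict
from functools import reduce

# Two "executors", each holding one partition of the input.
partitions = [["spark", "hdfs", "spark"], ["yarn", "hdfs", "spark"]]

# Map phase: each executor transforms its records independently.
mapped = [[(word, 1) for word in part] for part in partitions]

# Shuffle phase: redistribute pairs so all values for a key land together.
shuffled = defaultdict(list)
for part in mapped:
    for key, value in part:
        shuffled[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in shuffled.items()}
print(counts)  # prints {'spark': 3, 'hdfs': 2, 'yarn': 1}
```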

MapReduce Processing

The MapReduce job is submitted to the Hadoop cluster, which processes the data in parallel using multiple mappers and reducers.

Step 8: Mapper Allocation

TCB Internal Document


The MapReduce job allocates 2 mappers, one on each EC2 instance,
to process the data in parallel.

Step 9: Data Mapping

The mappers process the data in parallel using the following steps:

Map phase: each mapper applies the map function to every element of its input split, emitting intermediate key-value pairs.
Shuffle phase: the mappers' output is partitioned by key and transferred to the reducers, so that each reducer receives all the values for its assigned keys.
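The partitioning step can be sketched like this (a simplification mirroring Hadoop's default hash partitioner; the log-level keys are made up for illustration):

```python
# Mapper output is partitioned by key so each reducer receives
# all the values for its keys.
NUM_REDUCERS = 1  # matching the single reducer allocated in Step 10

mapper_outputs = [
    [("error", 1), ("info", 1)],   # mapper on instance 1
    [("error", 1), ("warn", 1)],   # mapper on instance 2
]

partitions = [[] for _ in range(NUM_REDUCERS)]
for output in mapper_outputs:
    for key, value in output:
        partitions[hash(key) % NUM_REDUCERS].append((key, value))
```

With one reducer, every key hashes to partition 0, so the single reducer sees all four pairs.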
Step 10: Reducer Allocation

The MapReduce job allocates 1 reducer, which is responsible for aggregating the mappers' output.

Step 11: Data Reducing

The reducer processes its input using the following steps:

Reduce phase: the reduce function aggregates all the values for each key.
Output phase: the aggregated results are written to HDFS.
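Continuing the sketch, the single reducer aggregates its sorted input per key (plain Python, not Hadoop API calls; the keys are the same illustrative ones as above):

```python
from functools import reduce
from itertools import groupby

# The reducer receives key-sorted pairs and aggregates per key.
shuffled = sorted([("error", 1), ("info", 1), ("error", 1), ("warn", 1)])
reduced = {
    key: reduce(lambda a, b: a + b, (v for _, v in group))
    for key, group in groupby(shuffled, key=lambda kv: kv[0])
}
# In a real job, `reduced` would then be written to HDFS as files such as
# part-r-00000 under the job's output directory.
```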
Step 12: Data Output

The processed data is stored in HDFS, which persists the output blocks on disk across the DataNodes. Two layers are involved in serving it:

Memory caching: frequently accessed blocks may be held in memory (for example via the operating system's page cache), speeding up repeated reads.
Disk storage: all blocks are durably stored on disk, which serves the data that is not frequently accessed.
