Spark 1TB Data Processing


The 1 TB of data is split into fixed-size chunks called HDFS blocks (128 MB by default; 64 MB in older Hadoop versions). Splitting the data this way means each block can be processed independently and in parallel.
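As a rough sanity check (assuming the data divides evenly, with no partial final block), the number of blocks follows directly from the sizes:

```python
# Rough arithmetic, not an exact HDFS calculation: how many blocks 1 TB yields.
ONE_TB = 1024**4              # bytes
BLOCK_128MB = 128 * 1024**2   # current default block size
BLOCK_64MB = 64 * 1024**2     # older default block size

blocks_128 = ONE_TB // BLOCK_128MB   # 8192 blocks at 128 MB
blocks_64 = ONE_TB // BLOCK_64MB     # 16384 blocks at 64 MB
print(blocks_128, blocks_64)
```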

The HDFS blocks are distributed across the 2 EC2 instances, which are configured as a Hadoop cluster. Each instance runs a DataNode that stores a portion of the blocks.
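A toy sketch of spreading those blocks over the two DataNodes (real HDFS placement also considers replication and rack awareness, which this simplification ignores; the instance names are illustrative):

```python
# Round-robin placement of 8192 blocks (1 TB / 128 MB) across two DataNodes.
NUM_BLOCKS = 8192
datanodes = {"instance-1": [], "instance-2": []}
names = list(datanodes)

for block_id in range(NUM_BLOCKS):
    # Alternate blocks between the two instances.
    datanodes[names[block_id % 2]].append(block_id)
```

With an even block count, each DataNode ends up storing half the blocks.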

A Spark job is submitted to the YARN cluster (which handles resource allocation and task scheduling). The job is configured to process the 1 TB of data in parallel using multiple executors.

YARN allocates 2 executors, one on each EC2 instance, to process the job.
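A submission along these lines would request the two executors described above (the script name, core counts, and memory sizes are illustrative assumptions, not values given in this document):

```shell
# Illustrative spark-submit invocation; resource sizes are assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 8g \
  process_1tb.py
```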

The Spark executors load the HDFS blocks into memory, which is divided into two parts:

Storage memory: a portion of each executor's memory caches data that is accessed repeatedly, so it does not have to be re-read from HDFS.
Execution memory: the remaining memory is used for the processing itself, such as the shuffles, sorts, and aggregations performed by the Spark job.
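The split can be sketched with Spark's documented defaults for unified memory (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the 8 GiB heap size is an illustrative assumption:

```python
# Sketch of Spark's unified memory split using the documented defaults.
heap = 8 * 1024**3                # executor JVM heap in bytes (assumed)
reserved = 300 * 1024**2          # fixed reserved memory
usable = (heap - reserved) * 0.6  # unified region (storage + execution)
storage = usable * 0.5            # cache side; execution can borrow from it
execution = usable - storage      # processing side (shuffles, joins, sorts)
```

In practice the boundary is soft: execution can evict cached blocks and borrow storage memory when it needs more room.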

The Spark executors process the data in parallel using the following
steps:
 Map phase: the map function applies a transformation to each
element of the data, in parallel across the executors.
 Shuffle phase: the data is redistributed across the executors by
key, so that all records belonging to the same key end up on the
same executor.
 Reduce phase: the reduce function aggregates the shuffled data
in parallel.
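The three phases above can be mimicked in plain Python (a conceptual word-count sketch, not actual Spark API calls):

```python
from collections import defaultdict
from functools import reduce

# Two "executors", each holding one partition of the input.
partitions = [["spark", "hdfs", "spark"], ["yarn", "hdfs", "spark"]]

# Map phase: each executor transforms its records independently.
mapped = [[(word, 1) for word in part] for part in partitions]

# Shuffle phase: redistribute pairs so all values for a key land together.
shuffled = defaultdict(list)
for part in mapped:
    for key, value in part:
        shuffled[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in shuffled.items()}
print(counts)  # prints {'spark': 3, 'hdfs': 2, 'yarn': 1}
```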

MapReduce Processing

The MapReduce job is submitted to the Hadoop cluster, which processes the data in parallel using multiple mappers and reducers.

Step 8: Mapper Allocation

TCB Internal Document


The MapReduce job allocates 2 mappers, one on each EC2 instance,
to process the data in parallel.

Step 9: Data Mapping

The mappers process the data in parallel using the following steps:

Map phase: each mapper applies the map function to every element of its input split, emitting intermediate key-value pairs.
Shuffle phase: the mappers' output is partitioned by key and transferred to the reducers, so that each reducer receives all the values for its assigned keys.
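The partitioning step can be sketched like this (a simplification mirroring Hadoop's default hash partitioner; the log-level keys are made up for illustration):

```python
# Mapper output is partitioned by key so each reducer receives
# all the values for its keys.
NUM_REDUCERS = 1  # matching the single reducer allocated in Step 10

mapper_outputs = [
    [("error", 1), ("info", 1)],   # mapper on instance 1
    [("error", 1), ("warn", 1)],   # mapper on instance 2
]

partitions = [[] for _ in range(NUM_REDUCERS)]
for output in mapper_outputs:
    for key, value in output:
        partitions[hash(key) % NUM_REDUCERS].append((key, value))
```

With one reducer, every key hashes to partition 0, so the single reducer sees all four pairs.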
Step 10: Reducer Allocation

The MapReduce job allocates 1 reducer, which is responsible for aggregating the mappers' output.

Step 11: Data Reducing

The reducer processes its input using the following steps:

Reduce phase: the reduce function aggregates all the values for each key.
Output phase: the aggregated results are written to HDFS.
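Continuing the sketch, the single reducer aggregates its sorted input per key (plain Python, not Hadoop API calls; the keys are the same illustrative ones as above):

```python
from functools import reduce
from itertools import groupby

# The reducer receives key-sorted pairs and aggregates per key.
shuffled = sorted([("error", 1), ("info", 1), ("error", 1), ("warn", 1)])
reduced = {
    key: reduce(lambda a, b: a + b, (v for _, v in group))
    for key, group in groupby(shuffled, key=lambda kv: kv[0])
}
# In a real job, `reduced` would then be written to HDFS as files such as
# part-r-00000 under the job's output directory.
```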
Step 12: Data Output

The processed data is stored in HDFS, which persists the output blocks on disk across the DataNodes. Two layers are involved in serving it:

Memory caching: frequently accessed blocks may be held in memory (for example via the operating system's page cache), speeding up repeated reads.
Disk storage: all blocks are durably stored on disk, which serves the data that is not frequently accessed.
