Introduction to batch processing – MapReduce

The volume of data is often too big for a single server (a node)
to process, so there was a need to develop code that runs on
multiple nodes. Writing distributed systems brings an endless
array of problems, so people developed frameworks to make our
lives easier. MapReduce is a framework that allows the user to
write code that is executed on multiple nodes without having to
worry about fault tolerance, reliability, synchronization or
availability.
Batch processing
 Batch processing is an automated job that does some
computation, usually run as a periodic job. It runs the
processing code on a set of inputs, called a batch.
Usually, the job reads the batch data from a database
and stores the result in the same or a different database.
 An example of a batch processing job could be reading
all the sale logs from an online shop for a single day and
aggregating them into statistics for that day (number of users
per country, the average amount spent, etc.), as sketched
below. Doing this as a daily job can give insights into
customer trends.
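To make this concrete, here is a minimal sketch of such a daily job in plain Python. The file name sales_2024-01-01.csv and its columns (user_id, country, amount) are hypothetical assumptions for this example; a real job would read from and write back to a database.

# Hypothetical daily batch job: aggregate one day of sale logs into statistics
import csv
from collections import defaultdict

users_per_country = defaultdict(set)
total_amount = 0.0
num_sales = 0

with open("sales_2024-01-01.csv") as f:      # hypothetical input file
    for row in csv.DictReader(f):            # columns: user_id, country, amount
        users_per_country[row["country"]].add(row["user_id"])
        total_amount += float(row["amount"])
        num_sales += 1

stats = {
    "users_per_country": {c: len(users) for c, users in users_per_country.items()},
    "average_amount": total_amount / num_sales if num_sales else 0.0,
}
print(stats)  # a real job would store this result in a database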
MapReduce
 MapReduce is a programming model that was introduced
in a white paper by Google in 2004. Today, it is
implemented in various data processing and storage
systems (Hadoop, Spark, MongoDB, …) and it is a
foundational building block of most big data batch
processing systems.
 For MapReduce to be able to do computation on large
amounts of data, it has to be a distributed model that
executes its code on multiple nodes. This allows the
computation to handle larger amounts of data by adding
more machines – horizontal scaling. This is different
from vertical scaling, which implies increasing the
performance of a single machine.
Execution
 In order to decrease the duration of our distributed
computation, MapReduce tries to
reduce shuffling (moving) the data from one node to
another by distributing the computation so that it is done
on the same node where the data is stored. This way, the
data stays on the same node, but the code is moved via
the network. This is ideal because the code is much
smaller than the data.
 To run a MapReduce job, the user has to implement two
functions, map and reduce, and those implemented
functions are distributed to the nodes that contain the data by
the MapReduce framework. Each node runs (executes)
the given functions on the data it holds in order to
minimize network traffic (shuffling data).
The computation performance of MapReduce comes at the
cost of its expressivity. When writing a MapReduce job we
have to follow the strict interface (input and return data
structures) of the map and the reduce functions. The map phase
generates key-value pairs from the input data (partitions),
which are then grouped by key and consumed by the reduce
tasks in the reduce phase. Everything except the interface of
the functions is up to the user.
Map
Hadoop, along with its many other features, had the first
open-source implementation of MapReduce. It also has its
own distributed file storage called HDFS. In Hadoop, the
typical input into a MapReduce job is a directory in HDFS. In
order to increase parallelization, each directory is made up of
smaller units called partitions and each partition can be
processed separately by a map task (the process that executes
the map function). This is hidden from the user, but it is
important to be aware of it because the number of partitions
can affect the speed of execution.

The map task (mapper) is called once for every input partition
and its job is to extract key-value pairs from the input
partition. The mapper can generate any number of key-value
pairs from a single input (including zero). The user only needs
to define the code inside the mapper. Below is an example of a
simple mapper that takes the input partition and outputs each
word as a key with the value 1.
# Map function, applied to a single input partition
def mapper(key, value):
    # Split the text into words and yield (word, 1) pairs
    for word in value.split():
        normalized_word = word.lower()
        yield normalized_word, 1

Reduce
The MapReduce framework collects all the key-value pairs
produced by the mappers, arranges them into groups with the
same key and applies the reduce function. All the grouped
values entering the reducers are sorted by the framework. The
reducer can produce output files which can serve as input into
another MapReduce job, thus enabling multiple MapReduce
jobs to chain into a more complex data processing pipeline.
# Reduce function, applied to a group of values sharing the same key
def reducer(key, values):
    # Sum all the values with the same key
    result = sum(values)
    return result
The mapper yielded key-value pairs with the word as the key
and the number 1 as the value. The reducer is then called on
all the values with the same key (word), creating a distributed
word counting pipeline. Note that not every sorted group gets
its own reduce task at the same time: the user defines the
number of reducers (3 in our case), and after a reducer is done
with its task, it picks up another group if one has not yet been
processed.
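As a purely local illustration (not how the framework actually distributes work), the sketch below simulates the shuffle step in a single process: it runs a mapper like the one above on each partition, groups the emitted pairs by key, and hands each group to the reducer. The toy partition texts are arbitrary example data.

from collections import defaultdict

def mapper(key, value):                   # same word-counting mapper as above
    for word in value.split():
        yield word.lower(), 1

def reducer(key, values):                 # same summing reducer as above
    return sum(values)

# Two toy partitions, keyed by partition id (arbitrary example data)
partitions = {0: "the quick brown fox", 1: "the lazy dog and the fox"}

# "Map" phase: run the mapper on every partition
grouped = defaultdict(list)
for part_id, text in partitions.items():
    for word, count in mapper(part_id, text):
        grouped[word].append(count)       # shuffle: group values by key

# "Reduce" phase: run the reducer once per key group
print({word: reducer(word, values) for word, values in grouped.items()})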

Practical example
In order for this post to not be only dry words and images, I
have added these examples to a lightweight MapReduce in
Python that you can easily run on your local machine. If
you want to try this, download the code for the Python
MapReduce from GitHub. The example code is in the usual
place, the DataWhatNow GitHub repo. The map and reduce
functions are the same as the ones above (word counting). The
input is the first paragraph of Introduction to web scraping
with Python, split into partitions (defined manually by me).
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import mincemeat

partitions = [
    'Data is the core of predictive modeling, visualization, and analytics.',
    'Unfortunately, the needed data is not always readily available to the user,',
    'it is most often unstructured. The biggest source of data is the Internet, and',
    'with programming, we can extract and process the data found on the Internet for',
    'our use – this is called web scraping.',
    'Web scraping allows us to extract data from websites and to do what we please with it.',
    'In this post, I will show you how to scrape a website with only a few of lines of code in Python.',
    'All the code used in this post can be found in my GitHub notebook.'
]

# The data source can be any dictionary-like object
datasource = dict(enumerate(partitions))


def mapper(key, value):
    for word in value.split():
        normalized_word = word.lower()
        yield normalized_word, 1


def reducer(key, values):
    result = sum(values)
    return result


s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapper
s.reducefn = reducer

results = s.run_server(password="datawhatnow")
print(results)
# Output
{'and': 4, 'predictive': 1, 'all': 1, 'code': 2, 'often': 1, 'show': 1,
'process': 1, 'allows': 1, 'is': 5, 'it': 1, 'not': 1, 'python.': 1, 'us': 1,
'modeling,': 1, 'in': 4, 'our': 1, 'user,': 1, 'extract': 2,
'unfortunately,': 1, 'readily': 1, 'available': 1, 'web': 2, 'use': 1,
'from': 1, 'i': 1, 'visualization,': 1, 'needed': 1, 'data': 5, 'please':
1, 'scrape': 1, 'website': 1, 'few': 1, 'only': 1, 'post,': 1,
'unstructured.': 1, 'biggest': 1, 'you': 1, 'it.': 1, 'do': 1, 'we': 2,
'used': 1, 'scraping.': 1, 'to': 4, 'post': 1, 'internet': 1, 'what': 1,
'how': 1, 'most': 1, 'analytics.': 1, 'programming,': 1, 'internet,':
1, 'core': 1, 'with': 3, 'source': 1, 'a': 2, 'on': 1, '\xe2\x80\x93': 1,
'github': 1, 'for': 1, 'always': 1, 'be': 1, 'scraping': 1, 'lines': 1,
'websites': 1, 'will': 1, 'this': 3, 'can': 2, 'notebook.': 1, 'of': 4,
'found': 2, 'the': 8, 'my': 1, 'called': 1}
In order to run the Python MapReduce server and the example
above, run the following inside your bash terminal:
# Run the command
python2 example.py
# In another window run
python2 mincemeat.py -p datawhatnow localhost
If you are still having problems running the example
above, try following the official documentation on GitHub.
Congrats, you just created a MapReduce word counting
pipeline. Even if this does not sound impressive, the
flexibility of MapReduce allows the user to do more complex
data processing such as table joins, PageRank, sorting and
anything else you can code inside the limitations of the
framework.
Conclusion
MapReduce is a programming model that allows the user to
write batch processing jobs with a small amount of code. It is
flexible in the sense that you, the user, can write code to
modify the behavior, but making complex data processing
pipelines becomes cumbersome because every MapReduce
job has to be managed and scheduled on its own. The
intermediate output of map tasks is written to a file, which
allows the framework to recover easily if a node fails.
This stability comes at the cost of performance, as the data
could instead have been forwarded to the reduce tasks through
a small buffer, creating a stream.
Keep in mind that this was a practical example of getting
familiar with the MapReduce framework. Today, some
databases and data processing systems allow the user to do
computation over multiple machines without having to write
the map and reduce functions. These systems offer higher-
level libraries that allow the user to define the logic using
SQL, Python, Scala, etc. The system translates the code
written by the user into one or more MapReduce jobs, so the
user does not have to write the actual map and reduce
functions. This lets users already familiar with those
languages leverage the power of the MapReduce framework
using tools they already know.
Apache Spark
What is Spark?
 Apache Spark is a framework aimed at performing fast
distributed computing on Big Data by using in-
memory primitives.
 It allows user programs to load data into memory and
query it repeatedly, making it a well-suited tool for online
and iterative processing (especially for ML algorithms); a
minimal sketch follows after this list.
 It was motivated by the limitations of the
MapReduce/Hadoop paradigm, which forces a
linear dataflow with intensive disk usage.
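As a hedged illustration, the PySpark sketch below loads a dataset into memory and queries it repeatedly; pyspark is assumed to be installed and the file name data.txt is a hypothetical placeholder.

from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryExample")

lines = sc.textFile("data.txt")   # hypothetical input file, lazily defined
lines.cache()                     # keep the RDD in memory after the first computation

# The data is read from disk once; repeated queries are then served from memory
print(lines.count())
print(lines.filter(lambda line: "spark" in line.lower()).count())

sc.stop()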

 Spark Platform

Spark Model
Directed Acyclic Graphs
 The MapReduce programming model has only two phases:
map and/or reduce.
 Complex applications and data flows can be
implemented by chaining these phases
 This chaining forms a ‘graph’ of operations, known as a
“directed acyclic graph”, or DAG
 DAGs contain a series of actions connected to each other
in a workflow
 In the case of MapReduce, the DAG is a series of map
and reduce tasks used to implement the application —
and it is the developer’s job to define each task and chain
them together.
Spark vs Hadoop MR
Main differences between Hadoop MR and Spark:
 With Spark, the engine itself creates those complex
chains of steps from the application’s logic. This allows
developers to express complex algorithms and data
processing pipelines within the same job and allows the
framework to optimize the job as a whole, leading to
improved performance.
 Memory-based computations
Common features:
 Data locality
 Staged execution (stages separated by shuffle phases)
 Reliance on distributed file system for on-disk
persistence (HDFS)
Spark Core

RDD: Resilient Distributed Dataset


 Spark is based on distributed data structures called
Resilient Distributed Datasets (RDDs) which can be
thought of as immutable parallel data structures
 Intermediate results can be persisted in memory or on disk
for re-use, and the partitioning can be customized to
optimize data placement.
 RDDs are also fault-tolerant by nature. An RDD stores
information about its parents to optimize execution (via
pipelining of operations) and to recompute partitions in case
of failure
 RDDs provide an API for various transformations and
materializations of data
 There are two ways to create RDDs: parallelizing an
existing collection in your driver program, or referencing
a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat.
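As a brief, hedged sketch, the first creation path (parallelizing a driver-side collection) looks like this in PySpark; the second path (referencing external storage) is sketched in the RDD Creation section below.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDFromCollection")

# Distribute a local Python list across the cluster as an RDD
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.map(lambda x: x * x).collect())   # [1, 4, 9, 16, 25]

sc.stop()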
RDDs — Partitions
 RDDs are designed to contain huge amounts of data
that cannot fit onto one single machine → hence, the data
has to be partitioned across multiple machines/nodes
 Spark automatically partitions RDDs and distributes the
partitions across different nodes
 A partition in Spark is an atomic chunk of data stored on
a node in the cluster
 Partitions are the basic units of parallelism in Apache Spark
 RDDs in Apache Spark are collections of partitions
 Operations on RDDs automatically place tasks into
partitions, maintaining the locality of persisted data
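A short, hedged sketch of inspecting and changing an RDD's partitioning in PySpark; the partition counts are arbitrary choices for this example.

from pyspark import SparkContext

sc = SparkContext("local[*]", "Partitions")

# Explicitly ask for 4 partitions when creating the RDD
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())        # 4: the basic units of parallelism

# Redistribute the data across 8 partitions (causes a shuffle)
repartitioned = rdd.repartition(8)
print(repartitioned.getNumPartitions())

sc.stop()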
RDD — Properties

RDD Creation
Here’s an example of the RDDs created during a method call
which first loads HDFS blocks into memory and then applies a
map() function to filter out keys, creating two RDDs (sketched below):
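Below is a hedged sketch of such a chain in PySpark: the first RDD is backed by the HDFS blocks of the input file, and the second is derived from it with map(). The HDFS path and the comma-separated record format are hypothetical assumptions.

from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDFromHDFS")

# First RDD: one partition per HDFS block of the input file (hypothetical path)
lines = sc.textFile("hdfs://namenode:9000/data/events.txt")

# Second RDD: derived from the first, keeping only the key of each record
keys = lines.map(lambda line: line.split(",")[0])

print(keys.take(5))
sc.stop()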
Spark — Job Architecture

Apache Spark follows a master/slave architecture with two
main daemons and a cluster manager:
 Master Daemon (Master/Driver Process)
 Worker Daemon (Slave Process)
Spark — Driver

 The Driver is the code that includes the “main” function
and defines the RDDs
 Parallel operations on the RDDs are sent to the DAG
scheduler, which will optimize the code and arrive at an
efficient DAG that represents the data processing steps in
the application.
Spark — ClusterManager
 The resulting DAG is sent to the ClusterManager. The
cluster manager has information about the workers,
assigned threads, and location of data blocks and is
responsible for assigning specific processing tasks to
workers.
 The cluster manager is also the service that handles DAG
play-back in the case of worker failure
Spark — Executor/Workers
Executors/Workers:
 run tasks scheduled by the driver
 execute their specific tasks without knowledge of the entire
DAG
 store computation results in memory, on disk or off-heap
 interact with storage systems
 send results back to the Driver application
Spark — Fault Tolerance
RDDs store their lineage — the set of transformations that
was used to create the current state, starting from the first
input format that was used to create the RDD.
If the data is lost, Spark will replay the lineage to rebuild the
lost RDDs so the job can continue.
Let’s see how this works:

Picture the diagram commonly used to illustrate a DAG in
Spark: the inner boxes are RDD partitions, and each layer of
boxes is an RDD produced by a single chained operation.
Now suppose one partition (a “lost block”) disappears because
its node failed. Spark replays the lineage for that partition,
recomputing the upstream partitions it needs, to rebuild the
lost data and execute the final step.
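A small, hedged sketch of inspecting this lineage in PySpark: toDebugString() prints the chain of parent RDDs that Spark would replay to recompute lost partitions (the output may appear as a bytes object depending on the PySpark version).

from pyspark import SparkContext

sc = SparkContext("local[*]", "LineageExample")

rdd = (sc.parallelize(range(1000))
         .map(lambda x: (x % 10, x))
         .filter(lambda kv: kv[1] > 100))

# The lineage: the transformations Spark replays if a partition is lost
print(rdd.toDebugString())

sc.stop()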
Spark — Execution Workflow

1. The client/user defines the RDD transformations and actions
for the input data
2. The DAGScheduler forms the most optimal Directed
Acyclic Graph, which is then split into stages of tasks
3. Stages combine tasks which don’t require
shuffling/repartitioning of the data
4. Tasks are then run on workers and the results are returned to
the client
Let’s take a look at how stages are determined by looking
at an example of a more complex job’s DAG.
Spark — Dependency Types

Narrow (pipelineable)
 Each partition of the parent RDD is used by at most one
partition of the child RDD
 Allow for pipelined execution on one cluster node
 Failure recovery is more efficient as only lost parent
partitions need to be recomputed
Wide (shuffle)
 Multiple child partitions may depend on one parent
partition
 Require data from all parent partitions to be available
and to be shuffled across the nodes
 If a partition is lost, a complete re-computation from all
the ancestor partitions is needed
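As a hedged sketch, the PySpark job below mixes both dependency types: map() and filter() are narrow and get pipelined, while reduceByKey() is a wide dependency that shuffles data across partitions and introduces a stage boundary. The input words are arbitrary example data.

from pyspark import SparkContext

sc = SparkContext("local[*]", "DependencyTypes")

words = sc.parallelize(["spark", "hadoop", "spark", "hdfs", "spark"])

# Narrow dependencies: each output partition depends on one input
# partition, so these steps are pipelined on the same node
pairs = words.map(lambda w: (w, 1)).filter(lambda kv: len(kv[0]) > 3)

# Wide (shuffle) dependency: values for the same key must be gathered
# from all partitions, which creates a shuffle and a new stage
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1), ('hdfs', 1)]
sc.stop()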
Spark — Stages and Tasks

Stages breakdown strategy


 Check backwards from final RDD
 Add each “narrow” dependency to the current stage
 Create new stage when there’s a shuffle dependency
Tasks
 ShuffleMapTask partitions its input for shuffle
 ResultTask sends its output to the driver
Spark — Stages Summary
Summary of the staging strategy:
 RDD operations with “narrow” dependencies, like map()
and filter(), are pipelined together into one set of tasks in
each stage
 Operations with “wide” /shuffle dependencies require
multiple stages (one to write a set of map output files,
and another to read those files after a barrier).
 In the end, every stage will have only shuffle
dependencies on other stages, and may compute multiple
operations inside it
Shared Variables
 Spark includes two types of variables that allow sharing
information between the execution nodes: broadcast
variables and accumulator variables.
 Broadcast variables are sent to all the remote execution
nodes, where they can be used for data processing.
 This is similar to the role that Configuration objects play
in MapReduce.
 Accumulators are also sent to the remote execution
nodes, but unlike broadcast variables, they can be
modified by the executors, with the limitation that you
can only add to the accumulator variables.
 Accumulators are somewhat similar to MapReduce
counters.
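A brief, hedged sketch of both shared variable types in PySpark; the stop-word set and the input words are arbitrary example data.

from pyspark import SparkContext

sc = SparkContext("local[*]", "SharedVariables")

# Broadcast variable: read-only data shipped once to every executor
stop_words = sc.broadcast({"the", "a", "of"})

# Accumulator: executors can only add to it; the driver reads the total
skipped = sc.accumulator(0)

def keep(word):
    if word in stop_words.value:
        skipped.add(1)
        return False
    return True

words = sc.parallelize(["the", "quick", "fox", "of", "doom"])
print(words.filter(keep).collect())   # ['quick', 'fox', 'doom']
print("skipped:", skipped.value)      # the driver reads the accumulated total

sc.stop()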
SparkContext
 SparkContext is an object that represents the connection
to a Spark cluster.
 It is used to create RDDs, broadcast data, and initialize
accumulators.
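A minimal, hedged sketch of creating a SparkContext; local[*] runs Spark locally on all available cores, and the application name is an arbitrary choice for this example.

from pyspark import SparkConf, SparkContext

# Configure the connection to the (local) Spark cluster
conf = SparkConf().setMaster("local[*]").setAppName("ExampleApp")
sc = SparkContext(conf=conf)

# The context is the entry point for creating RDDs, broadcasts and accumulators
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.count())   # 4

sc.stop()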
Transformations
 Transformations are functions that take one RDD and
return another
 RDDs are immutable, so transformations never modify
their input; they return a new, transformed RDD instead.
 Transformations in Spark are always lazy, so they don’t
compute their results immediately. Instead, calling a
transformation function only creates a new RDD with this
specific transformation as part of its lineage.
 The complete set of transformations is only executed
when an action is called
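A short, hedged sketch of this laziness in PySpark: the transformations only extend the lineage, and nothing runs until the collect() action is called.

from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyTransformations")

numbers = sc.parallelize(range(10))

# Nothing is computed here: filter() and map() only build up the lineage
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# The action collect() triggers execution of the whole chain
print(evens_squared.collect())   # [0, 4, 16, 36, 64]

sc.stop()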
