
BSc in Information Technology (Data Science)
SLIIT – 2019 (Semester 2)

Massive or BIG Data Processing

J. Alosius
Introduction to MapReduce
BIG Data Processing and Abstraction
MapReduce Overview
• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
• Map
• Reduce

MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
MapReduce Algorithm
• Map
  • Iterate over a large number of records
  • Extract something of interest from each
• Reduce
  • Shuffle and sort intermediate results
  • Aggregate intermediate results
  • Generate final output

Key idea: provide a functional abstraction for these two operations

Programmers specify two functions:

map (k1, v1) → [(k2, v2)]
reduce (k2, [v2]) → [(k3, v3)]

• All values with the same key are sent to the same reducer
• The execution framework handles everything else…
MapReduce Algorithm
Runtime
• Handles scheduling
  • Assigns workers to map and reduce tasks
• Handles “data distribution”
  • Moves processes to data
• Handles synchronization
  • Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  • Detects worker failures and restarts
• Everything happens on top of a distributed FS (HDFS)
MapReduce Algorithm
The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs

Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine

The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
• Usually just one output per input key

The Combiner
• Called once for each unique key output by a mapper
• Gets a list of all values associated with a key as input
• The combiner outputs zero or more key/value pairs
• Usually just one output per input key
• Example: local counting for Word Count (wired into the job as sketched below):
  • def combiner(key, values):
  •   output(key, sum(values))

Partition
• In MapReduce, intermediate output values are not usually reduced together
• All values with the same key are presented to a single Reducer together
• More specifically, a different subset of the intermediate key space is assigned to each Reducer
• These subsets are known as partitions
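In the Hadoop Java API, a combiner is registered on the job exactly like the mapper and reducer. A minimal sketch, assuming the WordCount driver and the Map/Reduce classes shown later in these slides; the word-count reducer can double as the combiner because addition is associative and commutative:

  // Fragment of the WordCount driver (full driver shown later in these slides)
  job.setMapperClass(Map.class);
  job.setCombinerClass(Reduce.class);   // mini-reduce run locally on each map task's output
  job.setReducerClass(Reduce.class);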
MapReduce Algorithm
• Programmers specify two functions:
  • map (k1, v1) → [(k2, v2)]
  • reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite… usually, programmers also specify:
  • partition (k2, number of partitions) → partition for k2
    • Often a simple hash of the key, e.g., hash(k2) mod n (a sketch of such a partitioner follows below)
    • Divides up the key space for parallel reduce operations
  • combine (k2, [v2]) → [(k2, v2’)]
    • Mini-reducers that run in memory after the map phase
    • Used as an optimization to reduce network traffic
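Hadoop's default partitioner (HashPartitioner) already implements hash(k2) mod n; a custom one only needs to override getPartition(). A minimal sketch, assuming the Text/IntWritable intermediate types used in the Word Count example below (the class name WordPartitioner is illustrative):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Sends each intermediate key to a reducer chosen by hash(key) mod numPartitions,
  // mirroring Hadoop's built-in HashPartitioner.
  public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Mask off the sign bit so the result is non-negative
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // In the driver: job.setPartitionerClass(WordPartitioner.class);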
Word Count Example
MapReduce Program
A MapReduce program consists of the following 3 parts:

• Driver (main – triggers the map and reduce methods)
• Mapper
• Reducer

It is better to include the map, reduce and main methods in 3 different classes.
Mapper Code:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      value.set(tokenizer.nextToken());
      context.write(value, new IntWritable(1));
    }
  }
}

Input:
The key is nothing but the offset of each line in the text file: LongWritable
The value is each individual line: Text
Output:
The key is the tokenized word: Text
The value is hardcoded to 1 in our case: IntWritable
Example – Dear 1, Bear 1, etc.
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int sum = 0;
    for (IntWritable x : values) {
      sum += x.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Both the input and the output of the Reducer are key-value pairs.
Input:
The key is nothing but the unique words generated after the sorting and shuffling phase: Text
The value is a list of integers corresponding to each key: IntWritable
Example – Bear, [1, 1], etc.
Output:
The key is all the unique words present in the input text file: Text
The value is the number of occurrences of each of the unique words: IntWritable
Example – Bear, 2; Car, 3, etc. 
We have aggregated the values present in each of the lists corresponding to each key and produced the final answer.
Driver Code:

Configuration conf = new Configuration();
Job job = new Job(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);

//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

• In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
• We specify the name of the job and the data types of the input/output of the mapper and reducer.
• We specify the names of the mapper and reducer classes.
• We specify the paths of the input and output folders.
• The method setInputFormatClass() specifies how the Mapper will read the input data, i.e. what the unit of work will be. Here we have chosen TextInputFormat so that a single line is read by the mapper at a time from the input text file.
• The main() method is the entry point for the driver. In this method, we instantiate a new Configuration object for the job (a complete sketch follows below).

The job is then packaged into a jar and run as:

hadoop jar hadoop-mapreduce-example.jar WordCount /sample/input /sample/output
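For completeness, the driver statements above normally live inside the driver's main() method, which submits the job and waits for it to finish. A minimal sketch, assuming the Map and Reduce classes from the previous slides are nested inside WordCount; the imports and the waitForCompletion() call are additions not shown on the slide:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCount {
    // Map and Reduce nested classes as shown on the previous slides

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "My Word Count Program");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // Block until the job finishes, then exit with 0 on success and 1 on failure
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }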
HDFS Architecture
Distributed File System
• Don’t move data to workers… move workers to the data!
  • Store data on the local disks of nodes in the cluster
  • Start up the workers on the node that has the data local
• Why DFS?
  • Not enough RAM to hold all the data in memory
  • Disk access is slow, but disk throughput is reasonable

DFS - Features
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
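The "intelligent client" point above is visible directly in the HDFS Java API: the client asks the NameNode for block locations and then reads from the DataNodes itself. A minimal sketch, assuming a file already exists in HDFS (the path /sample/input/data.txt is illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocations {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Hypothetical file path; any file already stored in HDFS works
      Path file = new Path("/sample/input/data.txt");
      FileStatus status = fs.getFileStatus(file);

      // The NameNode returns, for each block, the DataNodes holding a replica;
      // the client then reads the data directly from those DataNodes.
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset=" + block.getOffset()
            + " length=" + block.getLength()
            + " hosts=" + String.join(",", block.getHosts()));
      }
      fs.close();
    }
  }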
NameNode – Metadata
• Meta-data in Memory
  • The entire metadata is in main memory
  • No demand paging of meta-data
• Types of Metadata
  • List of files
  • List of Blocks for each file
  • List of DataNodes for each block
  • File attributes, e.g. creation time, replication factor
• A Transaction Log
  • Records file creations, file deletions, etc.

NameNode – Responsibilities
• Managing the file system namespace:
  • Holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
  • Directs clients to DataNodes for reads and writes
  • No data is moved through the NameNode
• Maintaining overall health:
  • Periodic communication with the DataNodes
  • Block re-replication and rebalancing
  • Garbage collection
Datanode
• A Block Server
  • Stores data in the local file system
  • Stores meta-data of a block
  • Serves data and meta-data to Clients
• Block Report
  • Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  • Forwards data to other specified DataNodes

Block Placement
• Current Strategy
  • One replica on the local node
  • Second replica on a remote rack
  • Third replica on the same remote rack
  • Additional replicas are randomly placed
• Clients read from the nearest replica
• Would like to make this policy pluggable
Data Correctness
• Use Checksums to validate data
• Use CRC32
• File Creation
• Client computes a checksum per 512 bytes
• DataNode stores the checksum
• File access
• Client retrieves the data and checksum from DataNode
• If Validation fails, Client tries other replicas
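As an illustration of the checksum idea (not HDFS's internal implementation), the sketch below computes one CRC32 value per 512-byte chunk of a buffer using the standard java.util.zip.CRC32 class; the class and method names are illustrative:

  import java.util.zip.CRC32;

  public class ChunkChecksums {
    // Illustrative only: compute one CRC32 per 512-byte chunk of the given data,
    // mirroring the "checksum per 512 bytes" idea described above.
    public static long[] checksumChunks(byte[] data) {
      int chunkSize = 512;
      int chunks = (data.length + chunkSize - 1) / chunkSize;
      long[] checksums = new long[chunks];
      CRC32 crc = new CRC32();
      for (int i = 0; i < chunks; i++) {
        int offset = i * chunkSize;
        int length = Math.min(chunkSize, data.length - offset);
        crc.reset();
        crc.update(data, offset, length);
        checksums[i] = crc.getValue();
      }
      return checksums;
    }
  }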

NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
• A directory on the local file system
• A directory on a remote file system (NFS/CIFS)
Writing a file to HDFS:
1. Client consults the Name Node
2. Name Node replies with the location of the Data Node
3. Client writes the block directly to one Data Node
   • Data Nodes replicate the block
   • The cycle repeats for the next block
4. Data Node replies with an acknowledgement
5. Client sends the Name Node a request to close the file

Reading a file from HDFS:
1. An application client wishing to read a file must first contact the Name Node to determine where the actual data is stored.
2. In response to the client request, the Name Node returns:
   • The relevant block ids
   • The locations where the blocks are held
3. The client then contacts the Data Nodes to retrieve the data.

• Important features of the design:
  • Data is never moved through the Name Node
  • All data transfer occurs directly between clients and the Data Nodes
  • Communication with the Name Node only involves the transfer of metadata
Introduction to YARN
MapReduce Vs YARN

Limitations of MapReduce v1:
• Limited scalability
  • Maximum cluster size: 4,000 nodes
  • Maximum concurrent tasks: 40,000
• Availability – the JobTracker is a SPOF (Single Point of Failure)
• Problems with resource utilization
  • Predefined number of map slots and reduce slots for each TaskTracker
  • Underutilization when more map tasks or more reduce tasks are running
• Runs only MapReduce applications
Advantages of YARN

• YARN makes efficient use of cluster resources
• Centralized resource management
• Multiple applications in Hadoop, all sharing a common pool of resources
• No more fixed map-reduce slots
• Supports applications that do not follow the MapReduce model
  • Apache Spark, Apache Giraph, Tez
• Most JobTracker functions moved to the Application Master
  • One cluster can have many Application Masters
Components of YARN

Resource Manager (RM)
• Runs on the Master Node
• Global resource scheduler
• Arbitrates system resources between competing applications

Node Manager (NM)
• Runs on slave nodes
• Communicates with the RM

Container
• Created by the RM upon request
• Allocates a certain amount of resources (memory, CPU) on a slave node
• Applications run in one or more containers

Application Master
• One per application
• Framework/application specific
• Runs in a container
• Requests more containers to run application tasks
Fault Tolerance
• Task (Container) – handled just like MRv1
  • MR AppMaster will re-attempt tasks that complete with exceptions or stop responding (4 times by default)
  • Applications with too many failed tasks are considered failed

• Application Master
  • If the application fails or the AM stops sending heartbeats, the RM will re-attempt the whole application (2 times by default)
  • MR AppMaster optional setting: Job Recovery
    • If false, all tasks will re-run
    • If true, MR AppMaster retrieves the state of tasks when it restarts; only incomplete tasks will be re-run

• NodeManager
• If NM stops sending heartbeats to RM, it is removed from list of active nodes
• Tasks on the node will be treated as failed by MR AppMaster
• If the AppMaster node fails, it will be treated as a failed application

• Resource Manager
  • No application or tasks can be launched if the RM is unavailable
  • Can be configured with High Availability
MapReduce Program

• Every mapper class must extend the MapReduceBase class and implement the Mapper interface.
• The main part of the Mapper class is the 'map()' method, which accepts four arguments.
• At every call to the 'map()' method, a key-value pair ('key' and 'value' in this code) is passed.
• The 'map()' method begins by splitting the input text which is received as an argument. It uses the tokenizer to split these lines into words.
• After this, a pair is formed using the record at the 7th index of the array 'SingleCountryData' and the value '1'.
• We select the 7th index because the Country data is located at the 7th index in the array 'SingleCountryData'.
• Please note that the input data is in the format below (where Country is at the 7th index, with 0 as the starting index):
  Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
• The output of the mapper is again a key-value pair, which is emitted using the 'collect()' method of 'OutputCollector' (a sketch of this mapper follows below).
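A minimal sketch of the mapper described above, assuming the old org.apache.hadoop.mapred API (MapReduceBase, Mapper interface, OutputCollector). The class name CountryMapper is illustrative, and String.split(",") is used here as one plausible way to produce the 'SingleCountryData' array mentioned on the slide:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CountryMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Split the CSV record; Country is at index 7 (0-based)
      String[] SingleCountryData = value.toString().split(",");
      // Emit <Country, 1> for each input record
      output.collect(new Text(SingleCountryData[7]), ONE);
    }
  }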
MapReduce Program
• The input to the reduce() method is a key with a list of multiple values.
• For example, in our case, it will be <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>.
• This is given to the reducer as <United Arab Emirates, {1,1,1,1,1,1}>.
• So, to accept arguments of this form, the first two data types are used, viz., Text and Iterator<IntWritable>. Text is the data type of the key and Iterator<IntWritable> is the data type for the list of values for that key.
• The next argument is of type OutputCollector<Text,IntWritable>, which collects the output of the reducer phase.
• The reduce() method begins by copying the key value and initializing the frequency count to 0.
• Then a 'while' loop is used to iterate through the list of values associated with the key and calculate the final frequency by summing up all the values.
• Finally, we push the result to the output collector in the form of the key and the obtained frequency count (a sketch of this reducer follows below).
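A minimal sketch of the reducer described above, again assuming the old org.apache.hadoop.mapred API; the class name CountryReducer is illustrative:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class CountryReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      Text countryKey = key;   // copy of the key, as described above
      int frequency = 0;
      // Sum up all the 1s for this country
      while (values.hasNext()) {
        frequency += values.next().get();
      }
      output.collect(countryKey, new IntWritable(frequency));
    }
  }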