Map Reduce 1

MapReduce is a solution developed by Google to address the limitations of traditional centralized enterprise systems in processing large volumes of data. It works by dividing tasks into smaller parts processed by multiple computers, utilizing two main tasks: Map, which converts data into key-value pairs, and Reduce, which aggregates these pairs into a final dataset. Additionally, there are various ways to integrate R with Hadoop to enhance data analysis capabilities, including RHadoop, ORCH, RHIPE, and Hadoop Streaming.


Why MapReduce?

• Traditional Enterprise Systems normally have a centralized server to store and process data.
• The following illustration depicts a schematic view of a
traditional enterprise system.
• The traditional model is not suitable for processing huge volumes of scalable data, and such data cannot be accommodated by standard database servers.
• Moreover, the centralized system creates too much of a
bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm
called MapReduce.
MapReduce divides a task into small parts and assigns them to
many computers. Later, the results are collected at one place
and integrated to form the result dataset.
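To make this concrete, here is a tiny sketch in plain R (purely illustrative, no Hadoop involved) of splitting a task into parts, processing the parts independently, and integrating the partial results:

    # divide the work into 4 chunks, as if assigned to 4 computers
    data   <- 1:1000
    chunks <- split(data, cut(seq_along(data), 4, labels = FALSE))

    # each chunk is processed independently
    partials <- lapply(chunks, sum)

    # the partial results are collected and integrated into the final result
    total <- Reduce(`+`, partials)   # same as sum(data)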
How MapReduce Works?

 The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it
into another set of data, where individual
elements are broken down into tuples (key-value
pairs).
 The Reduce task takes the output from the Map
as an input and combines those data tuples
(key-value pairs) into a smaller set of tuples.
 The reduce task is always performed after the
map job.
 Let us now take a close look at each of the
phases and try to understand their significance.
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in the
form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero or more
key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper
are known as intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar
data from the map phase into identifiable sets. It takes the
intermediate keys from the mapper as input and applies a user-defined
code to aggregate the values in a small scope of one mapper. It is not
a part of the main MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort
step. It downloads the grouped key-value pairs onto the local machine,
where the Reducer is running. The individual key-value pairs are sorted
by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data
as input and runs a Reducer function on each one of them. Here,
the data can be aggregated, filtered, and combined in a number of
ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final
step.
 Output Phase − In the output phase, we have an output
formatter that translates the final key-value pairs from the
Reducer function and writes them onto a file using a record writer.
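To make the phases concrete, here is a minimal sketch in plain R (purely illustrative, no Hadoop involved) that simulates the Map, Shuffle and Sort, and Reduce steps for a simple word count:

    # Input Phase: one record per line
    lines <- c("deer bear river", "car car river", "deer car bear")

    # Map: emit one (word, 1) key-value pair per token
    map_fn <- function(line) {
      lapply(strsplit(line, " ")[[1]], function(w) list(key = w, value = 1))
    }
    pairs <- unlist(lapply(lines, map_fn), recursive = FALSE)

    # Shuffle and Sort: group the values belonging to the same key, sorted by key
    grouped <- split(sapply(pairs, `[[`, "value"),
                     sapply(pairs, `[[`, "key"))

    # Reduce: aggregate the list of values for each key
    result <- lapply(grouped, sum)

    # Output Phase: final key-value pairs (bear = 2, car = 3, deer = 2, river = 2)
    print(result)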
MapReduce-Example
 Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units (a sketch of these steps follows this list).
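The following compact sketch (hypothetical sample tweets and an illustrative stop-word list, plain R with no Hadoop) walks through the same four steps:

    # hypothetical sample tweets
    tweets <- c("hadoop makes big data simple",
                "big data needs hadoop and mapreduce",
                "mapreduce makes hadoop scale")
    stopwords <- c("and", "makes", "needs")          # illustrative unwanted words

    tokens   <- unlist(strsplit(tweets, " "))        # Tokenize
    tokens   <- tokens[!tokens %in% stopwords]       # Filter
    counters <- rep(1, length(tokens))               # Count: one counter per token
    aggregated <- tapply(counters, tokens, sum)      # Aggregate Counters
    print(aggregated)                                # e.g. hadoop = 3, big = 2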
[Diagram: big data platform components − data ingestion, NoSQL (Google Big Data) store, job scheduling, monitoring and management, IMPALA, and search.]
MapReduce - Algorithm

 The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The map task is done by means of Mapper
Class
 The reduce task is done by means of Reducer
Class.
 Mapper class takes the input, tokenizes it,
maps and sorts it. The output of Mapper class
is used as input by Reducer class, which in turn
searches matching pairs and reduces them.
Several ways in which one can integrate R and Hadoop:
 The first layer: It is the hardware layer, consisting of a cluster of computer systems.
 The second layer: It is the middleware layer of Hadoop. This layer takes care of distributing the files seamlessly using HDFS and provides the features of the MapReduce job.
 The third layer: It is the interface layer that provides the interface for analyzing the data. At this level, we can use an effective tool like Pig, which provides a high-level platform for creating MapReduce programs using a language called Pig Latin. We can also use Hive, which is a data warehouse infrastructure developed by Apache and built on top of Hadoop. Hive provides a number of facilities for running complex queries and helps analyze the data using an SQL-like language called HiveQL; it also supports implementing MapReduce tasks.
 Besides Hive and Pig, we can also use the RHIPE or RHadoop libraries, which build an interface between Hadoop and R, enabling users to access data from the Hadoop file system and to write their own scripts to implement the Map and Reduce jobs. We can also use Hadoop Streaming, a technology used to integrate Hadoop with scripting languages such as R.
Using R and Hadoop
There are four different ways of using
Hadoop and R together:

1. RHadoop

 RHadoop is a collection of three R packages:
rmr, rhdfs and rhbase. rmr package provides
Hadoop MapReduce functionality in R, rhdfs
provides HDFS file management in R and
rhbase provides HBase database
management from within R. Each of these
primary packages can be used to analyze
and manage Hadoop framework data better.
 The rmr package – The rmr package provides Hadoop MapReduce functionality in R. The R programmer only has to divide the logic of their application into the map and reduce phases and submit it with the rmr methods. After that, the rmr package makes a call to Hadoop streaming and the MapReduce API, passing job parameters such as input directory, output directory, mapper, and reducer, to perform the R MapReduce job over the Hadoop cluster (most of the components are similar to Hadoop streaming). A short usage sketch follows this list.
 The rhbase package – Allows the R developer to connect Hadoop HBase to R using the Thrift server. It also offers functionality to read, write, and modify tables stored in HBase from R.
 The rhdfs package – It provides HDFS file management in R, because the data itself is stored in the Hadoop file system. The functions of this package are as follows: file manipulation (hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get, etc.), file read/write (hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, etc.), directory (hdfs.dircreate, hdfs.mkdir), and initialization (hdfs.init, hdfs.defaults).
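As a minimal sketch of how these packages fit together, a word count might look like the following. This assumes the rmr package in its rmr2 form and rhdfs are installed and a Hadoop cluster is configured; the data set is illustrative and exact argument names can vary between RHadoop versions.

    library(rhdfs)
    library(rmr2)

    hdfs.init()   # rhdfs: initialize the connection to HDFS

    # put a tiny illustrative data set into HDFS
    input <- to.dfs(c("deer bear river", "car car river", "deer car bear"))

    # rmr: map emits (word, 1) pairs, reduce sums the counts per word
    wordcount <- mapreduce(
      input  = input,
      map    = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
      reduce = function(k, counts) keyval(k, sum(counts))
    )

    from.dfs(wordcount)   # read the resulting key-value pairs back into R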
2. ORCH
 ORCH stands for Oracle R Connector for Hadoop.
 It is a collection of R packages that provide the relevant
interfaces to work with Hive tables, the Apache Hadoop
compute infrastructure, the local R environment, and Oracle
database tables.
 ORCH also provides predictive analytic techniques that can be
applied to data in HDFS files.
 ORCH is a collection of R packages that provide the following features:
 Interfaces to work with the data maintained in Hive tables, the ability to use the Apache Hadoop-based computing infrastructure, the local R environment, and Oracle database tables.
 Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data stored in HDFS files.
3. RHIPE

 RHIPE is an R package which provides an API to use Hadoop.
RHIPE stands for R and Hadoop Integrated
Programming Environment, and is essentially RHadoop
with a different API.
 RHIPE is used in R to do intricate analysis of large collections of data sets via Hadoop. It is an integrated programming environment tool built on the Divide and Recombine (D&R) approach to analyzing huge amounts of data.
 We can read and save the complete data sets created using RHIPE MapReduce. RHIPE is deployed with many features that help us interact effectively with HDFS. One can also use various languages like Perl, Java, or Python to read data sets in RHIPE.
 RHIPE includes a Java component, so it acts as a Java bridge between Hadoop and R. During serialization, RHIPE converts the input data into Java types so that it can be interpreted by Hadoop.
4. Hadoop Streaming
 Hadoop streaming is a Hadoop utility for running
the Hadoop MapReduce job with executable
scripts such as Mapper and Reducer.
 The script is available as part of the R package on CRAN, and its aim is to make R more accessible to Hadoop streaming-based applications.
 This is analogous to the pipe operation in Linux. With this, the text input file is printed on a stream (stdin), which is provided as input to the Mapper; the output (stdout) of the Mapper is provided as input to the Reducer; finally, the Reducer writes the output to the HDFS directory.
 The main benefit of Hadoop streaming is that it allows both Java and non-Java MapReduce jobs to be executed over Hadoop clusters.

 Hadoop streaming supports various languages like Perl, Python, PHP, R, and C++ efficiently. The main components of a Hadoop streaming MapReduce job are the Mapper and Reducer scripts, as in the sketch below.
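A minimal sketch of this flow with R scripts as the Mapper and Reducer (the file names, input/output paths, and jar location are illustrative; the scripts read stdin and write stdout exactly as described above):

    #!/usr/bin/env Rscript
    # mapper.R - read lines from stdin, emit one "word<TAB>1" pair per token
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      for (w in strsplit(line, " ")[[1]]) cat(w, "\t1\n", sep = "")
    }
    close(con)

    #!/usr/bin/env Rscript
    # reducer.R - read "word<TAB>count" pairs (already sorted by key),
    # sum the counts per word and write "word<TAB>total" to stdout
    con <- file("stdin", open = "r")
    counts <- list()
    while (length(line <- readLines(con, n = 1)) > 0) {
      kv <- strsplit(line, "\t")[[1]]
      prev <- if (is.null(counts[[kv[1]]])) 0 else counts[[kv[1]]]
      counts[[kv[1]]] <- prev + as.numeric(kv[2])
    }
    close(con)
    for (w in names(counts)) cat(w, "\t", counts[[w]], "\n", sep = "")

    # The job is then submitted with the standard streaming jar, for example:
    # hadoop jar hadoop-streaming.jar -input /in -output /out \
    #   -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R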
