
Big Data Technologies

(IS 365)
Lecture 4
MapReduce
Dr. Wael Abbas
2024 - 2025

All slides in this file are from the following book: Tom White (2015). Hadoop: The Definitive Guide, 4th Edition. O'Reilly Media, Inc.
Reading data in MapReduce
 Hadoop can process many different types of data formats, from flat text
files to databases.

 There are three main Java classes provided in Hadoop to read data in
MapReduce:

1. InputSplit

2. RecordReader

3. InputFormat
MapReduce : InputFormat
InputFormat              | Description                                            | Key                                       | Value                      | File type
TextInputFormat          | Default format; reads lines of text files              | The byte offset of the line               | The line contents          | Text
KeyValueTextInputFormat  | Parses each line into a (K, V) pair                    | Everything up to the first tab character  | The remainder of the line  | Text
NLineInputFormat         | Each mapper receives a fixed number of lines of input  | The byte offset of the line               | The line contents          | Text
SequenceFileInputFormat  | A Hadoop-specific high-performance binary format       | User-defined                              | User-defined               | Binary
MapReduce : InputFormat
Text input format
 TextInputFormat is the default InputFormat.
 Each record is a line of input.
 The key, a LongWritable, is the byte offset within the file of the beginning
of the line.
 The value is the contents of the line, excluding any line terminators (e.g.,
newline or carriage return), and is packaged as a Text object. So, a file
containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
MapReduce : InputFormat
Text input format
The text is divided into one split of four records. The records are interpreted as
the following key value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

• The byte offset is the number of bytes from the beginning of the file to the start of the line.
MapReduce : InputFormat
Text input format
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
A single file is broken into lines, and the line boundaries do not correspond
with the HDFS block boundaries. Splits honor logical record boundaries (in this
case, lines), so we see that the first split contains line 5, even though it spans the
first and second block. The second split starts at line 6.

Note: This figure is from Hadoop: The Definitive Guide.


MapReduce : InputFormat

Key-value input format


 TextInputFormat’s keys, being simply the offsets within the file, are not
normally very useful. It is common for each line in a file to be a key-value
pair, separated by a delimiter such as a tab character.
 For example, this is the kind of output produced by TextOutputFormat,
Hadoop’s default OutputFormat.
 To interpret such files correctly, KeyValueTextInputFormat is appropriate.
MapReduce : InputFormat

Key-value input format


 You can specify the separator via the
mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
 It is a tab character by default (a driver configuration sketch follows the example below). Consider the following input file, where →
represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
MapReduce : InputFormat

Key-value input format


Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
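As an illustration, here is a minimal driver-side sketch of how KeyValueTextInputFormat and the separator property above might be configured, assuming the new MapReduce API; the class name and job name are placeholders, not part of these slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class kvconfigsketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The separator is a tab by default; it can be set explicitly like this.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "kv-example");
        // Keys are the Text before the separator and values are the remainder of the line,
        // so the mapper must be declared as Mapper<Text, Text, ...>.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper, reducer, output types, and paths as in the word count driver ...
    }
}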
MapReduce : InputFormat

NLineInputFormat input format


 With TextInputFormat and KeyValueTextInputFormat, each mapper receives
a variable number of lines of input.
 The number depends on the size of the split and the length of the lines.
 If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use.
 Like with TextInputFormat, the keys are the byte offsets within the file and
the values are the lines themselves.
MapReduce : InputFormat

NLineInputFormat input format


 N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input.
 The mapreduce.input.lineinputformat.linespermap property controls the
value of N (see the configuration sketch after this example). By way of example, consider these four lines again:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
 If, for example, N is 2, then each split contains two lines. One mapper will
receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
MapReduce : InputFormat

NLineInputFormat input format


 And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
 The keys and values are the same as those that TextInputFormat produces.
 The difference is in the way the splits are constructed.
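A minimal sketch, assuming the new MapReduce API, of how N could be set in the driver; the class and job names are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class nlineconfigsketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // N = 2: each mapper receives exactly two lines of input.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
        Job job = Job.getInstance(conf, "nline-example");
        job.setInputFormatClass(NLineInputFormat.class);
        // Keys are byte offsets and values are lines, exactly as with TextInputFormat.
        // ... set mapper, reducer, output types, and paths as in the word count driver ...
    }
}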
MapReduce : InputFormat
Binary input format
 Hadoop MapReduce is not restricted to processing textual data. It has
support for binary formats.
 Hadoop’s sequence file format stores sequences of binary key-value pairs.
 Sequence files are well suited as a format for MapReduce data because they
are splittable (they have sync points so that readers can synchronize with
record boundaries from an arbitrary point in the file, such as the start of a
split), they support compression as a part of the format, and they can store
arbitrary types using a variety of serialization frameworks.
MapReduce : InputFormat
Binary input format (SequenceFileInputFormat)
 To use data from sequence files as the input to MapReduce, you can use
SequenceFileInputFormat.
 The keys and values are determined by the sequence file, and you need to
make sure that your map input types correspond.
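As a hedged illustration of "map input types correspond": suppose the sequence file stores (Text, IntWritable) pairs (an assumption for this sketch; the real types depend on how the file was written). A pass-through mapper would then look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class seqfilemapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        // The input key/value types come from the sequence file itself.
        context.write(key, value);
    }
}

// In the driver:
// job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);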
What problems does the SequenceFile try to solve?
For HDFS:
 SequenceFile is one of the solutions to the small file problem in Hadoop.
 A small file is one that is significantly smaller than the HDFS block size (128 MB).
 Each file, directory, and block in HDFS is represented as an object in the NameNode's memory and occupies about 150 bytes.
 10 million files would therefore use about 3 GB of NameNode memory (roughly two objects per file, one for the file and one for its block: 10,000,000 × 2 × 150 bytes ≈ 3 GB).
 A billion files is not feasible.
What problems does the SequenceFile try to solve?
For MapReduce:
 Map tasks usually process one block of input at a time (using the default FileInputFormat).
 The more files there are, the more map tasks are needed, and the slower the job can become.
Small file scenario:
 The files are pieces of a larger logical file.
How can SequenceFile help to solve the problems?
 The idea of SequenceFile is to pack many small files into a single larger file.
 For example, suppose there are 10,000 files of 100 KB each; we can write a program to put them into a single SequenceFile, as sketched below, using the filename as the key and the file content as the value.
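A hedged sketch of such a packing program, assuming the Option-based SequenceFile.Writer API; the input directory, output path, and class name are illustrative assumptions:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class smallfilepacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Destination sequence file on HDFS (placeholder path).
        Path out = new Path("hdfs://localhost:8020/user/cloudera/smallfiles.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Append each local small file as one (filename, content) record.
            for (File f : new File("/tmp/smallfiles").listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}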
How can SequenceFile help to solve the problems?
1. Less NameNode memory is needed. Continuing with the 10,000 files of 100 KB each:
o Before using SequenceFile, the 10,000 files' objects occupy about 4.5 MB of RAM on the NameNode.
o After using SequenceFile, a 1 GB SequenceFile with 8 HDFS blocks occupies only about 3.6 KB of RAM on the NameNode.
2. SequenceFile is splittable, so it is suitable for MapReduce.
3. SequenceFile supports compression.
MapReduce : RecordReader
MapReduce word count : Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class wordcountmapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each word.
        String mytext = value.toString();
        String[] allwords = mytext.split(" ");
        for (String x : allwords) {
            context.write(new Text(x), new IntWritable(1));
        }
    }
}
MapReduce word count : Mapper

• The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
• In the word count example, the input key is an Object (the byte offset of the line), the input value is a line of text (Text), the output key is a word (Text), and the output value is a count (IntWritable).
MapReduce word count : Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word and emit (word, total).
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
MapReduce word count : Reducer
• The Reducer class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the reduce function.
• The input types of the reduce function must match the output types of the map function.
• In the word count example, the input key is Text, the input value is IntWritable, the output key is Text, and the output value is IntWritable.
MapReduce word count : Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wordcountdriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration c = new Configuration();
        Job j = Job.getInstance(c, "mywordcount");
        j.setMapperClass(wordcountmapper.class);
        j.setReducerClass(wordcountreducer.class);
        //j.setCombinerClass(wordcountreducer.class);
        j.setJarByClass(wordcountdriver.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        // Input file and output directory on HDFS.
        FileInputFormat.addInputPath(j, new Path("hdfs://localhost:8020/user/cloudera/input/data.dat"));
        FileOutputFormat.setOutputPath(j, new Path("hdfs://localhost:8020/user/cloudera/2019c"));
        // Submit the job and wait for it to finish.
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
}
MapReduce word count : Driver

• The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce function, and must match what the Reducer class produces.
• The setMapOutputKeyClass() and setMapOutputValueClass() methods are used when the data types of the map output differ from the data types of the reduce output, as sketched below.
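A short sketch of that case, continuing the driver above; the (Text, Text) map output types are an illustrative assumption, not part of the word count example:

// Suppose the map emits (Text, Text) pairs but the reduce emits (Text, IntWritable) pairs.
j.setMapOutputKeyClass(Text.class);
j.setMapOutputValueClass(Text.class);
// These remain the final (reduce) output types:
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);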
Serialization and deserialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream
for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a
series of structured objects.
• Serialization is used in two quite distinct areas of distributed data processing:
for interprocess communication and for persistent storage.
• In Hadoop, interprocess communication between nodes in the system is
implemented using remote procedure calls (RPCs). The RPC protocol uses
serialization to render the message into a binary stream to be sent to the
remote node, which then deserializes the binary stream into the original
message.
Serialization and deserialization in Hadoop
Why does Hadoop use classes such as IntWritable and Text instead of int and String?
Because Java's built-in serialization (Serializable) is too big and heavyweight for Hadoop; the Writable interface serializes Hadoop objects in a much lighter, more compact way.
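A small standalone illustration of how compact a Writable is on the wire; the class name is an assumption for this sketch:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class writablesizedemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        IntWritable value = new IntWritable(163);
        // A Writable serializes itself directly to a DataOutput stream.
        value.write(new DataOutputStream(bytes));
        // Prints 4: just the four bytes of the int, with none of the class
        // metadata that java.io.Serializable would add.
        System.out.println(bytes.toByteArray().length);
    }
}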
Why & where Hadoop is used / not used?
 What Hadoop is good for:
1. Massive amounts of data through parallelism
2. A variety of data (structured, unstructured, semi-structured)
3. Inexpensive commodity hardware
 What Hadoop is not good for:
1. Processing transactions (random access)
2. Work that cannot be parallelized
3. Low-latency data access
4. Processing lots of small files
5. Intensive calculations with little data
