
Big Data Technologies

(IS 365)
Lecture 4
MapReduce
Dr. Wael Abbas
2024 - 2025

All slides in this file are from the following book: Tom White (2015). Hadoop: The Definitive Guide, 4th Edition. O'Reilly Media, Inc.
Reading data in MapReduce
 Hadoop can process many different types of data formats, from flat text
files to databases.

 There are three main Java classes provided in Hadoop to read data in
MapReduce:

1. InputSplit

2. RecordReader

3. InputFormat
MapReduce : InputFormat
InputFormat              | Description                                            | Key                                       | Value                      | File type
TextInputFormat          | Default format; reads lines of text files              | The byte offset of the line               | The line contents          | Text
KeyValueTextInputFormat  | Parses each line into a (K, V) pair                    | Everything up to the first tab character  | The remainder of the line  | Text
NLineInputFormat         | Each mapper receives a fixed number of lines of input  | The byte offset of the line               | The line contents          | Text
SequenceFileInputFormat  | A Hadoop-specific high-performance binary format       | User-defined                              | User-defined               | Binary
MapReduce : InputFormat
Text input format
 TextInputFormat is the default InputFormat.
 Each record is a line of input.
 The key, a LongWritable, is the byte offset within the file of the beginning
of the line.
 The value is the contents of the line, excluding any line terminators (e.g.,
newline or carriage return), and is packaged as a Text object. So, a file
containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
MapReduce : InputFormat
Text input format
The text is divided into one split of four records. The records are interpreted as
the following key value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)

• The byte offset is the number of bytes from the beginning of the file to the start of the line.
MapReduce : InputFormat
Text input format
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
A single file is broken into lines, and the line boundaries do not correspond
with the HDFS block boundaries. Splits honor logical record boundaries (in this
case, lines), so we see that the first split contains line 5, even though it spans the
first and second block. The second split starts at line 6.

Note: This figure is from Hadoop: The Definitive Guide.


MapReduce : InputFormat

Key-value input format


 TextInputFormat’s keys, being simply the offsets within the file, are not
normally very useful. It is common for each line in a file to be a key-value
pair, separated by a delimiter such as a tab character.
 For example, this is the kind of output produced by TextOutputFormat,
Hadoop’s default OutputFormat.
 To interpret such files correctly, KeyValueTextInputFormat is appropriate.
MapReduce : InputFormat

Key-value input format


 You can specify the separator via the
mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
 It is a tab character by default (a driver configuration sketch follows the example below). Consider the following input file, where →
represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
MapReduce : InputFormat

Key-value input format


Like in the TextInputFormat case, the input is in a single split comprising four
records, although this time the keys are the Text sequences before the tab in
each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
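As an illustration, here is a minimal driver-side sketch of how KeyValueTextInputFormat and the separator property above might be configured, assuming the new MapReduce API; the class name and job name are placeholders, not part of these slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class kvconfigsketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The separator is a tab by default; it can be set explicitly like this.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf, "kv-example");
        // Keys are the Text before the separator and values are the remainder of the line,
        // so the mapper must be declared as Mapper<Text, Text, ...>.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper, reducer, output types, and paths as in the word count driver ...
    }
}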
MapReduce : InputFormat

NLineInputFormat input format


 With TextInputFormat and KeyValueTextInputFormat, each mapper receives
a variable number of lines of input.
 The number depends on the size of the split and the length of the lines.
 If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use.
 Like with TextInputFormat, the keys are the byte offsets within the file and
the values are the lines themselves.
MapReduce : InputFormat

NLineInputFormat input format


 N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input.
 The mapreduce.input.lineinputformat.linespermap property controls the
value of N (see the configuration sketch after this example). By way of example, consider these four lines again:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
 If, for example, N is 2, then each split contains two lines. One mapper will
receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
MapReduce : InputFormat

NLineInputFormat input format


 And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
 The keys and values are the same as those that TextInputFormat produces.
 The difference is in the way the splits are constructed.
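A minimal sketch, assuming the new MapReduce API, of how N could be set in the driver; the class and job names are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class nlineconfigsketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // N = 2: each mapper receives exactly two lines of input.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);
        Job job = Job.getInstance(conf, "nline-example");
        job.setInputFormatClass(NLineInputFormat.class);
        // Keys are byte offsets and values are lines, exactly as with TextInputFormat.
        // ... set mapper, reducer, output types, and paths as in the word count driver ...
    }
}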
MapReduce : InputFormat
Binary input format
 Hadoop MapReduce is not restricted to processing textual data. It has
support for binary formats.
 Hadoop’s sequence file format stores sequences of binary key-value pairs.
 Sequence files are well suited as a format for MapReduce data because they
are splittable (they have sync points so that readers can synchronize with
record boundaries from an arbitrary point in the file, such as the start of a
split), they support compression as a part of the format, and they can store
arbitrary types using a variety of serialization frameworks.
MapReduce : InputFormat
Binary input format (SequenceFileInputFormat)
 To use data from sequence files as the input to MapReduce, you can use
SequenceFileInputFormat.
 The keys and values are determined by the sequence file, and you need to
make sure that your map input types correspond.
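As a hedged illustration of "map input types correspond": suppose the sequence file stores (Text, IntWritable) pairs (an assumption for this sketch; the real types depend on how the file was written). A pass-through mapper would then look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class seqfilemapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        // The input key/value types come from the sequence file itself.
        context.write(key, value);
    }
}

// In the driver:
// job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);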
What problems does the SequenceFile try to solve?
For HDFS:
 SequenceFile is one of the solutions to the small file problem in Hadoop.
 A small file is one that is significantly smaller than the HDFS block size (128 MB).
 Each file, directory, and block in HDFS is represented as an object in the NameNode's memory and occupies about 150 bytes.
 10 million files would therefore use about 3 GB of NameNode memory (roughly two objects per file, one for the file and one for its block: 10,000,000 × 2 × 150 bytes ≈ 3 GB).
 A billion files is not feasible.
What problems does the SequenceFile try to solve?
For MapReduce:
 Map tasks usually process one block of input at a time (using the default FileInputFormat).
 The more files there are, the more map tasks are needed, and the slower the job can become.
Small file scenario:
 The files are pieces of a larger logical file.
How can SequenceFile help to solve the problems?
 The idea of SequenceFile is to pack many small files into a single larger file.
 For example, suppose there are 10,000 files of 100 KB each; we can write a program to put them into a single SequenceFile, as sketched below, using the filename as the key and the file content as the value.
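A hedged sketch of such a packing program, assuming the Option-based SequenceFile.Writer API; the input directory, output path, and class name are illustrative assumptions:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class smallfilepacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Destination sequence file on HDFS (placeholder path).
        Path out = new Path("hdfs://localhost:8020/user/cloudera/smallfiles.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Append each local small file as one (filename, content) record.
            for (File f : new File("/tmp/smallfiles").listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        }
    }
}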
How can SequenceFile help to solve the problems?
1. Less NameNode memory is needed. Continuing with the 10,000 files of 100 KB each:
o Before using SequenceFile, the 10,000 files' objects occupy about 4.5 MB of RAM on the NameNode.
o After using SequenceFile, a 1 GB SequenceFile with 8 HDFS blocks occupies only about 3.6 KB of RAM on the NameNode.
2. SequenceFile is splittable, so it is suitable for MapReduce.
3. SequenceFile supports compression.
MapReduce : RecordReader
MapReduce word count : Mapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class wordcountmapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each word.
        String mytext = value.toString();
        String[] allwords = mytext.split(" ");
        for (String x : allwords) {
            context.write(new Text(x), new IntWritable(1));
        }
    }
}
MapReduce word count : Mapper

• The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
• In the word count example, the input key is an Object (the byte offset of the line), the input value is a line of text (Text), the output key is a word (Text), and the output value is a count (IntWritable).
MapReduce word count : Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word and emit (word, total).
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
MapReduce word count : Reducer
• The Reducer class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the reduce function.
• The input types of the reduce function must match the output types of the map function.
• In the word count example, the input key is Text, the input value is IntWritable, the output key is Text, and the output value is IntWritable.
MapReduce word count : Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wordcountdriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration c = new Configuration();
        Job j = Job.getInstance(c, "mywordcount");
        j.setMapperClass(wordcountmapper.class);
        j.setReducerClass(wordcountreducer.class);
        //j.setCombinerClass(wordcountreducer.class);
        j.setJarByClass(wordcountdriver.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        // Input file and output directory on HDFS.
        FileInputFormat.addInputPath(j, new Path("hdfs://localhost:8020/user/cloudera/input/data.dat"));
        FileOutputFormat.setOutputPath(j, new Path("hdfs://localhost:8020/user/cloudera/2019c"));
        // Submit the job and wait for it to finish.
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
}
MapReduce word count : Driver

• The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce function, and must match what the Reducer class produces.
• The setMapOutputKeyClass() and setMapOutputValueClass() methods are used when the data types of the map output differ from the data types of the reduce output, as sketched below.
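A short sketch of that case, continuing the driver above; the (Text, Text) map output types are an illustrative assumption, not part of the word count example:

// Suppose the map emits (Text, Text) pairs but the reduce emits (Text, IntWritable) pairs.
j.setMapOutputKeyClass(Text.class);
j.setMapOutputValueClass(Text.class);
// These remain the final (reduce) output types:
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);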
Serialization and deserialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream
for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a
series of structured objects.
• Serialization is used in two quite distinct areas of distributed data processing:
for interprocess communication and for persistent storage.
• In Hadoop, interprocess communication between nodes in the system is
implemented using remote procedure calls (RPCs). The RPC protocol uses
serialization to render the message into a binary stream to be sent to the
remote node, which then deserializes the binary stream into the original
message.
Serialization and deserialization in Hadoop
Why does Hadoop use classes such as IntWritable and Text instead of int and String?
Because Java's built-in serialization (Serializable) is too big and heavyweight for Hadoop; the Writable interface serializes Hadoop objects in a much lighter, more compact way.
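A small standalone illustration of how compact a Writable is on the wire; the class name is an assumption for this sketch:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class writablesizedemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        IntWritable value = new IntWritable(163);
        // A Writable serializes itself directly to a DataOutput stream.
        value.write(new DataOutputStream(bytes));
        // Prints 4: just the four bytes of the int, with none of the class
        // metadata that java.io.Serializable would add.
        System.out.println(bytes.toByteArray().length);
    }
}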
Why & where Hadoop is used / not used?
 What Hadoop is good for:
1. Massive amounts of data through parallelism
2. A variety of data (structured, unstructured, semi-structured)
3. Inexpensive commodity hardware
 What Hadoop is not good for:
1. Processing transactions (random access)
2. Work that cannot be parallelized
3. Low-latency data access
4. Processing lots of small files
5. Intensive calculations with little data
