Lecture 4 (IS 365)
MapReduce
Dr. Wael Abbas
2024 - 2025
All slides in this file are from the following book: Tom White (2015). Hadoop: The Definitive Guide, 4th Edition. O'Reilly Media, Inc.
Reading data in MapReduce
Hadoop can process many different types of data formats, from flat text
files to databases.
There are three main Java classes provided in Hadoop to read data in
MapReduce (a configuration sketch follows this list):
1. InputSplit
2. RecordReader
3. InputFormat
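Of these three, a driver normally configures only the InputFormat; it in turn computes the InputSplits and supplies a RecordReader that turns each split into (key, value) records. A minimal sketch, with a hypothetical class name (the explicit call is optional here, since TextInputFormat is already the default):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class inputformatdemo {
    public static void main(String[] args) throws IOException {
        Job j = Job.getInstance();
        // Select the line-oriented default input format; TextInputFormat
        // creates the InputSplits and the RecordReader for each split.
        j.setInputFormatClass(TextInputFormat.class);
    }
}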
MapReduce : InputFormat
InputFormat      Description                                 Key                          Value              File type
TextInputFormat  Default format; reads lines of text files   The byte offset of the line  The line contents  Text
• The byte offset is the number of bytes, counted from the beginning of the
file, at which the line starts.
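A small worked example (the file content here is illustrative, not from the slides): for a text file whose first two lines are "hello" and "world", TextInputFormat produces the records

(0, hello)   // "hello" starts at byte 0 of the file
(6, world)   // 5 bytes of "hello" + 1 newline byte = offset 6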
MapReduce : InputFormat
Text input format
THE RELATIONSHIP BETWEEN INPUT SPLITS AND HDFS BLOCKS
A single file is broken into lines, and the line boundaries do not correspond
with the HDFS block boundaries. Splits honor logical record boundaries (in this
case, lines), so the first split contains line 5, even though that line spans the
first and second blocks. The second split starts at line 6.
MapReduce word count : Mapper
• The Mapper class is a generic type, with four formal type parameters
that specify the input key, input value, output key, and output value
types of the map function.
• In the word count example, the input key is an Object (the line's byte
offset), the input value is a line of text (Text), the output key is a word
(Text), and the output value is a count (IntWritable); a sketch follows this list.
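The mapper class itself is not printed in these slides. A minimal sketch of a wordcountmapper consistent with the driver below (the StringTokenizer-based splitting is an assumption taken from the standard word count example):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class wordcountmapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}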
MapReduce word count : Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class wordcountreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        // Emit (word, total count)
        context.write(key, new IntWritable(sum));
    }
}
MapReduce word count : Reducer
• The Reducer class is a generic type, with four formal type parameters that
specify the input key, input value, output key, and output value types of the
reduce function.
• The input types of the reduce function must match the output types of the
map function (see the side-by-side sketch after this list).
• In the word count example, the input key is Text, the input value is
IntWritable, the output key is Text, and the output value is IntWritable.
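Written side by side, the generic parameters make the correspondence visible (the mapper signature follows from the earlier slide):

Mapper<Object, Text, Text, IntWritable>         // map output types:   (Text, IntWritable)
Reducer<Text, IntWritable, Text, IntWritable>   // reduce input types: (Text, IntWritable)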
MapReduce word count : Driver
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wordcountdriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration c = new Configuration();
        Job j = Job.getInstance(c, "mywordcount");
        // Register the mapper and reducer classes
        j.setMapperClass(wordcountmapper.class);
        j.setReducerClass(wordcountreducer.class);
        //j.setCombinerClass(wordcountreducer.class);
        j.setJarByClass(wordcountdriver.class);
        // Output key/value types of the reduce function
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        // Input file and output directory on HDFS (the output directory must not already exist)
        FileInputFormat.addInputPath(j, new Path("hdfs://localhost:8020/user/cloudera/input/data.dat"));
        FileOutputFormat.setOutputPath(j, new Path("hdfs://localhost:8020/user/cloudera/2019c"));
        // Submit the job and wait; exit 0 on success, 1 on failure
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
}
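Assuming the three classes are compiled and packaged into a jar (the jar name here is hypothetical), the job can be launched with the hadoop jar command; no command-line arguments are needed because the input and output paths are hardcoded in the driver:

hadoop jar wordcount.jar wordcountdriver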