Map Reduce-LO2
MapReduce Overview
• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases
• Map
• Reduce
Map Reduce
• Hadoop Ecosystem component ‘MapReduce’ works by breaking the
processing into two phases:
• Map phase
• Reduce phase
• Each phase has key-value pairs as input and output. In addition, the
programmer also specifies two functions: the map function and the reduce
function (sketched below)
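• As a rough sketch (using the new-API Java method signatures that the later Java example also uses; K1/V1, K2/V2, K3/V3 are placeholder type parameters, not names from these slides):
void map(K1 key, V1 value, Context context)                // emits intermediate (K2, V2) pairs
void reduce(K2 key, Iterable<V2> values, Context context)  // emits final (K3, V3) pairs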
Map Reduce
• Map function takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples
(key/value pairs).
• Reduce function takes the output from the Map as input and combines
those data tuples based on the key, aggregating the values associated
with each key.
MR – Important Notes
• Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function for each
record in the split.
• For most jobs, a good split size tends to be the size of an HDFS block, which is 128 MB by default
• For best performance, the optimal split size is the same as the block size:
it is the largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the
split would have to be transferred across the network to the node running the map task, which is
clearly less efficient than running the whole map task using local data.
• Map tasks write their output to the local disk, not to HDFS.
• If the node running the map task fails before the map output has been consumed by the reduce
task, then Hadoop will automatically rerun the map task on another node to re-create the map
output.
• The input to a single reduce task is normally the output from all mappers. The output of the reduce
is normally stored in HDFS for reliability.
• The number of reducers can be zero, one, or more than one
Map Reduce – Detail Flow Process
MapReduce Features
• Automatic parallelization and distribution
• Fault-Tolerance
• Provides a clean abstraction for programmers to use
The Mapper
• All map output values with the same key are guaranteed to go to the same
machine (the same reducer)
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
• Usually just one output per input key
MR- Data Flow
RDBMS Vs Map Reduce
• MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion.
An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver
low-latency retrieval and update times of a relatively small amount of data.
• MapReduce suits applications where the data is written once and read many times, whereas a
relational database is good for datasets that are continually updated.
Map Reduce Paradigm
• Map and Reduce are based on functional programming
MR Example For Weather Dataset
• Weather sensors collect data every hour at many locations across the globe
and gather a large volume of log data, which is a good candidate for analysis
with MapReduce because we want to process all the data, and the data is
semi-structured and record-oriented.
• What’s the highest recorded global temperature for each year in the dataset?
• Data Format:
• The data is stored using a line-oriented ASCII format, in which each line is a record. The
format supports a rich set of meteorological elements, many of which are optional or
with variable data lengths. For simplicity, we focus on the basic elements, such as
temperature, which are always present and are of fixed width.
Map Reduce -Example
Map Reduce
• Data Files example:
• Datafiles are organized by date and weather station. There is a
directory for each year from 1901 to 2001, each containing a gzipped
file for each weather station with its readings for that year
Map Reduce –Example for Weather Data set
• Unix command-line tools can also give you the result, but the process
becomes very slow with a large dataset.
• Let's see how MapReduce works.
• MapReduce works by breaking the processing into two phases: the
map phase and the reduce phase.
• Each phase has key-value pairs as input and output, the types of
which may be chosen by the programmer.
• The programmer also specifies two functions: the map function and
the reduce function.
Map Reduce –Example for Weather Data set
• The input to our map phase is the raw data. We choose a text input format
that gives us each line in the dataset as a text value.
• The key is the offset of the beginning of the line from the beginning of the
file, but as we have no need for this, we ignore it.
• Our map function is simple. We pull out the year and the air temperature,
because these are the only fields we are interested in. In this case, the map
function is just a data preparation phase, setting up the data in such a way
that the reduce function can do its work on it: finding the maximum
temperature for each year.
• The map function is also a good place to drop bad records: here we filter out
temperatures that are missing, suspect, or erroneous.
Map Reduce –Example for Weather Data set
• To visualize the way the map works, consider the following sample
lines of input data
• These lines are presented to the map function as the key-value pairs:
Map Reduce –Example for Weather Data set
• The keys are the line offsets within the file, which we ignore in our
map function. The map function merely extracts the year and the air
temperature, and emits them as its output (the temperature values
have been interpreted as integers):
• (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
• The output from the map function is processed by the MapReduce
framework before being sent to the reduce function. This processing
sorts and groups the key-value pairs by key.
• So, continuing the example, our reduce function sees the following
input: (1949, [111, 78]) (1950, [0, 22, −11])
Map Reduce –Example for Weather Data set
• Each year appears with a list of all its air temperature readings.
• All the reduce function has to do now is iterate through the list and
pick up the maximum reading: (1949, 111) (1950, 22).
• This is the final output: the maximum global temperature recorded
in each year.
Map Reduce- Example for Weather Data set
MR - Example for Weather Data set
• Let's code the above explanation.
• We need three things:
• a map function
• a reduce function
• and some code to run the job.
• The map function is represented by the Mapper class, which declares
an abstract map() method.
Java MR - Example for Weather Data set
• Mapper for the maximum temperature example
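• The slide shows the mapper as a code listing that is not reproduced here; below is a minimal sketch using the new (org.apache.hadoop.mapreduce) API. The class name MaxTemperatureMapper and the fixed-width column offsets in substring() are illustrative assumptions about the NCDC record layout, not taken from these slides.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;   // sentinel for a missing temperature

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);              // year columns (assumed offsets)
    int airTemperature;
    if (line.charAt(87) == '+') {                       // parseInt does not accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));  // drop bad records, emit (year, temp)
    }
  }
}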
Java MR - Example for Weather Data set
• The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the
input key is a long integer offset, the input value is a line of text, the output key is a year, and the
output value is an air temperature (an integer). Output from the mapper: (1950, 0) (1950, 22)
(1950, −11) (1949, 111) (1949, 78)
• Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and
IntWritable (like Java Integer).
• The map() method is passed a key and a value. We convert the Text value containing the line of
input into a Java String, then use its substring() method to extract the columns we are interested
in.
• The map() method also provides an instance of Context to write the output to. In this case, we
write the year as a Text object (since we are just using it as a key), and the temperature is
wrapped in an IntWritable. We write an output record only if the temperature is present.
Java MR - Example for Weather Data set
• Reducer for the maximum temperature example
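• Again, the slide's listing is an image; a minimal sketch of such a reducer in the new API (the class name is an assumption) is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());   // keep the highest reading seen so far
    }
    context.write(key, new IntWritable(maxValue));  // emit (year, max temperature)
  }
}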
Java MR - Example for Weather Data set
• Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function.
• The input types of the reduce function must match the output types of the map function: Text
and IntWritable.
• In this case, the output types of the reduce function are Text and IntWritable, for a year and its
maximum temperature. We find the maximum by iterating through the temperatures and comparing
each with a record of the highest found so far.
Java MR - Example for Weather Data set
• Application to find the maximum temperature in the weather dataset
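• The slide's listing is an image; below is a minimal sketch of the driver, matching the steps described on the following slides. Class names are assumptions, and Job.getInstance() is the Hadoop 2+ way of constructing a Job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);   // Hadoop locates the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input file, directory, or pattern
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);  // true = verbose progress output
  }
}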
Java MR - Example for Weather Data set
• A Job object forms the specification of the job and gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop
will distribute around the cluster).
• Rather than explicitly specifying the name of the JAR file, we can pass a class in the Job’s
setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the
JAR file containing this class.
• Having constructed a Job object, we specify the input and output paths.
• An input path is specified by calling the static addInputPath() method on FileInputFormat, and it
can be a single file, a directory (in which case, the input forms all the files in that directory), or a
file pattern. As the name suggests, addInputPath() can be called more than once to use input
from multiple paths.
Java MR - Example for Weather Data set
• The output path (of which there is only one) is specified by the static setOutputPath() method on
FileOutputFormat. It specifies a directory where the output files from the reduce function are
written.
• The directory shouldn’t exist before running the job because Hadoop will complain and not run the
job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the
output of a long job with that of another).
• Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass()
methods.
• The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce
function, and must match what the Reduce class produces.
• The map output types default to the same types, so they do not need to be set if the mapper
produces the same types as the reducer (as it does in our case). However, if they are different, the
map output types must be set using the setMapOutputKeyClass() and setMapOutputValueClass()
methods.
Java MR - Example for Weather Data set
• The input types are controlled via the input format, which we have not explicitly set because we
are using the default TextInputFormat.
• After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish.
• The single argument to the method is a flag indicating whether verbose output is generated.
When true, the job writes information about its progress to the console.
• The return value of the waitForCompletion() method is a Boolean indicating success (true) or
failure (false), which we translate into the program’s exit code of 0 or 1.
Java MR - Example for Weather Data set
• Now, test it on a few sample records. The log will look like:
Java MR - Example for Weather Data set
• Log output (continued):
Java MR - Example for Weather Data set
• When the hadoop command is invoked with a classname as the first argument, it launches a Java
virtual machine (JVM) to run the class.
• The hadoop command adds the Hadoop libraries (and their dependencies) to the classpath and
picks up the Hadoop configuration, too. To add the application classes to the classpath, we’ve
defined an environment variable called HADOOP_CLASSPATH, which the hadoop script picks up.
• The output from running the job provides some useful information.
• For example, we can see that the job was given an ID of job_local26392882_0001, and it ran one map task and one
reduce task (with the following IDs: attempt_local26392882_0001_m_000000_0 and
attempt_local26392882_0001_r_000000_0). Knowing the job and task IDs can be very useful when debugging
MapReduce jobs.
• Counters from Log: we can follow the number of records that went through the system: five map
input records produced five map output records (since the mapper emitted one output record for
each valid input record), then five reduce input records in two groups (one for each unique key)
produced two reduce output records.
• The output was written to the output directory, which contains one output file per reducer. The
job had a single reducer, so we find a single file, named part-r-00000:
Map Reduce
• That’s all. Well Done.
Another Example For Map Reduce - Word Count
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
Word Count Mapper and Launching Program
// assumes the usual old-API imports: org.apache.hadoop.mapred.*, org.apache.hadoop.io.*,
// java.util.StringTokenizer, java.io.IOException
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);        // emit (word, 1) for every token
    }
  }
}
// Launching program (old-API JobConf driver)
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
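• The launching program above references a Reduce class that is not shown on these slides; a minimal old-API sketch of it (names assumed) would be:
// assumes the same old-API imports plus java.util.Iterator
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();        // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}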
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read by
specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be written
by specifying an OutputFormat to be used
• These default to TextInputFormat and TextOutputFormat,
which process line-based text data
• Another common choice is SequenceFileInputFormat and
SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
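• For example, in the old-API JobConf style used in the WordCount launching program above (conf is assumed to be a JobConf), switching to sequence files would be:
conf.setInputFormat(SequenceFileInputFormat.class);    // read binary key/value sequence files
conf.setOutputFormat(SequenceFileOutputFormat.class);  // write binary key/value sequence files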
How many Maps and Reduces
• Maps
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise, the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
max(min(block_size, total_data_size / num_maps), min_split_size)
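• As a sketch in the same old-API style (the property name is the classic Hadoop 1.x one; newer releases use mapreduce.input.fileinputformat.split.minsize):
conf.setNumMapTasks(20);                                   // a hint only, not a guarantee
conf.setLong("mapred.min.split.size", 128 * 1024 * 1024);  // raise the minimum split size to 128 MB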
How many Maps and Reduces
• Reduces
• Unless the amount of data being processed is small, a common rule of thumb is:
• 0.95 * num_nodes * mapred.tasktracker.reduce.tasks.maximum
Some handy tools
• Partitioners
• Combiners
• Compression
• Counters
• Speculation
• Zero Reduces
• Distributed File Cache
• Tool
Partitioners
• Partitioners are application code that define how
keys are assigned to reduces
• Default partitioning spreads keys evenly, but
randomly
• Uses key.hashCode() % num_reduces
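• A sketch of a custom partitioner in the old API (the class name and the year-based keying are illustrative assumptions, not from these slides):
public static class YearPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }           // no configuration needed here

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int year = Integer.parseInt(key.toString());   // key is assumed to be a year
    return (year / 10) % numPartitions;            // send each decade to the same reduce
  }
}
// in the driver: conf.setPartitionerClass(YearPartitioner.class);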
Combiners
• When maps produce many repeated keys
• It is often useful to do a local aggregation following the
map
• Done by specifying a Combiner
• Goal is to decrease size of the transient data
• Combiners have the same interface as Reduces, and
often are the same class
• Combiners must not have side effects, because they run an
indeterminate number of times
• In WordCount,
conf.setCombinerClass(Reduce.class);
Compression
Counters
• Often Map/Reduce applications have countable events
• For example, framework counts records in to and out of
Mapper and Reducer
• To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
• Define nice names in a MyClass_Counter.properties file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
Speculative execution
• The framework can run multiple instances of slow tasks
• Output from instance that finishes first is used
• Controlled by the configuration properties
mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution
• Can dramatically shorten the long tail of a job
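• For example, in an old-API driver (property names from classic Hadoop 1.x):
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);  // often disabled if the reduce has external side effects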
Zero Reduces
• Frequently, we only need to run a filter on the input
data
• No sorting or shuffling required by the job
• Set the number of reduces to 0
• Output from maps will go directly to OutputFormat and
disk
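• In an old-API driver this is a single call:
conf.setNumReduceTasks(0);   // map output goes straight to the OutputFormat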
Distributed File Cache
• Sometimes need read-only copies of data on the local
computer
• Downloading 1GB of data for each Mapper is expensive
• Define list of files you need to download in JobConf
• Files are downloaded once per computer
• Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
• Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool
• Handle “standard” Hadoop command line options
• -conf file - load a configuration file named file
• -D prop=value - define a single configuration property prop
• Class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    // ... getConf() returns the Configuration with -conf/-D options already applied ...
    return 0;
  }
}
File Formats & Compression
Logical Table