Analyzing Data with Hadoop
A Weather Dataset
• The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/).
• Weather sensors collect data every hour at many locations
across the globe and gather a large volume of log data, which
is a good candidate for analysis with MapReduce because we
want to process all the data, and the data is semi-structured
and record-oriented.
• The data is stored using a line-oriented ASCII format, in which
each line is a record.
• The format supports a rich set of meteorological elements,
many of which are optional or with variable data lengths.
Analyzing the Data with Hadoop
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
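The mapper and reducer for the maximum-temperature example are referred to later (as MaxTemperatureMapper and MaxTemperatureReducer) but are not shown on these slides. The following is a minimal sketch of what they might look like, assuming the fixed-width NCDC record layout commonly used in this example (year in columns 15-19, air temperature in columns 87-92, quality code in column 92); treat the offsets and the quality-code filter as assumptions.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
// Mapper: extracts a (year, temperature) pair from each NCDC record line.
// (Each class goes in its own source file in a real project.)
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {  // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
// Reducer: picks the maximum temperature for each year.
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}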
• The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. It determines the location of the data by communicating with the namenode and uses this to choose a suitable tasktracker for each task.
• Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. The tasktracker carries out the map, shuffle, and reduce operations on the data.
• Each tasktracker continuously reports its status to the jobtracker, including the number of free task slots it has available. If a tasktracker becomes unresponsive, the jobtracker reassigns its work to other nodes.
Daemons of Hadoop
Distributed computing system – MapReduce Framework
Job Tracker:
• Centrally monitors the submitted jobs and controls all processes running on the nodes of the cluster.
Task Tracker:
• Constantly communicates with the Job Tracker, reporting task progress.
Daemons Architecture
Hadoop Server Roles
• Hadoop divides the input to a Map Reduce job into fixed-size
pieces called input splits, or just splits.
• Hadoop creates one map task for each split, which runs the
user defined map function for each record in the split.
• Having many splits means the time taken to process each
split is small compared to the time to process the whole
input.
• So if we are processing the splits in parallel, the processing is
better load-balanced if the splits are small, since a faster
machine will be able to process proportionally more splits
over the course of the job than a slower machine.
• If splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
• For most jobs, a good split size tends to be the size of an HDFS block, 128 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created (see the configuration sketch below).
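The following is a minimal configuration sketch, not taken from these slides, showing where these sizes can be adjusted. It assumes a Hadoop 2.x client and the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat; the class name SplitSizeExample and the 256 MB figure are illustrative assumptions, not recommendations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size applies to files newly created through this configuration;
        // the cluster-wide default is the dfs.blocksize property.
        conf.setLong("dfs.blocksize", 256 * 1024 * 1024L);
        Job job = Job.getInstance(conf, "split size example");
        // Bound the split size seen by FileInputFormat-based jobs.
        FileInputFormat.setMinInputSplitSize(job, 128 * 1024 * 1024L);
        FileInputFormat.setMaxInputSplitSize(job, 256 * 1024 * 1024L);
    }
}
In practice most jobs simply leave the split size equal to the HDFS block size, as noted above.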
• Hadoop does its best to run the map task on a node where the
input data resides in HDFS. This is called the data locality
optimization since it doesn’t use valuable cluster bandwidth.
• Sometimes, however, all three nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks
so the job scheduler will look for a free map slot on a node in
the same rack as one of the blocks.
• Map tasks write their output to the local disk, not to HDFS.
• This is because map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away.
• So storing it in HDFS with replication would be overkill.
• If the node running the map task fails before the map output
has been consumed by the reduce task, then Hadoop will
automatically rerun the map task on another node to re-create
the map output.
• Reduce tasks don’t have the advantage of data locality; the
input to a single reduce task is normally the output from all
mappers.
• In the present example, we have a single reduce task that is fed
by all of the map tasks. Therefore, the sorted map outputs have
to be transferred across the network to the node where the
reduce task is running, where they are merged and then passed
to the user-defined reduce function.
• The output of the reduce is normally stored in HDFS for
reliability.
The whole data flow with a single reduce task is as follows:
Map Reduce data flow diagram
• The number of reduce tasks is not governed by the size of the
input, but instead is specified independently.
• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task. There
can be many keys (and their associated values) in each partition,
but the records for any given key are all in a single partition.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well (a sketch of a custom partitioner and of setting the number of reducers follows below).
Map Reduce data flow with multiple reduce tasks
• It’s also possible to have zero reduce tasks.
• This can be appropriate when we don’t need the shuffle because the processing can be carried out entirely in parallel.
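As a minimal sketch (not from these slides) of the two points above: the number of reduce tasks is set explicitly on the Job, and a user-defined partitioner can replace the default hash-based one. The class name YearPartitioner is an illustrative assumption; its logic simply mirrors the default HashPartitioner.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
// Illustrative custom partitioner: sends all records for a given key
// (here, a year) to the same reduce task, just as the default does.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// In the driver:
//   job.setNumReduceTasks(4);                       // number of reducers is chosen explicitly
//   job.setPartitionerClass(YearPartitioner.class); // optional; the default HashPartitioner behaves the same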
Combiner Functions
• Many MapReduce jobs are limited by the bandwidth available on the
cluster, so it pays to minimize the data transferred between map and
reduce tasks.
• Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function.
• Because the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record.
• The contract for the combiner function constrains the type of function
that may be used.
• Suppose that for the maximum temperature example, readings for the year 1950 were
processed by two maps (because they were in different splits). Imagine the first map
produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
• since 25 is the maximum value in the list. We could use a combiner function
that, just like the reduce function, finds the maximum temperature for each
map output.
• The reduce function would then be called with:
(1950, [20, 25])
and would produce the same output as before. More succinctly, we may
express the function calls on the temperature values in this case as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
• Not all functions possess this property. For example, if we were calculating mean temperatures, we couldn’t use the mean as our combiner function, because:
• mean(0, 20, 10, 25, 15) = 14
• but:
• mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
• The combiner function doesn’t replace the reduce function.
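A common workaround, sketched below as an assumption rather than anything shown on these slides, is to make the intermediate values combinable by carrying partial sums and counts instead of means. The class name MeanCombiner and the "sum,count" text encoding are illustrative choices; the final reduce function would divide the accumulated sum by the count.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Combiner that merges partial (sum, count) pairs.
// Each value is assumed to be encoded as the text "partialSum,partialCount".
public class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        // Emit another (sum, count) pair so the reduce function can still compute the true mean.
        context.write(key, new Text(sum + "," + count));
    }
}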
Specifying a combiner function
• The combiner has the same implementation as the reduce function in MaxTemperatureReducer.
• The only change we need to make is to set the combiner class on the Job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
                    "<output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        // The reducer doubles as the combiner, since max is commutative and associative.
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Hadoop Streaming
• Hadoop provides an API to MapReduce that allows us to write our
map and reduce functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and our program, so we can use any language that
can read standard input and write to standard output to write our
MapReduce program.
Word count using Python: Mapper code
#!/usr/bin/env python
import sys

# read lines from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace and split the line into words
    words = line.strip().split()
    # loop over the words array and print each word with the count of 1 to STDOUT;
    # what we output here will be the input for the Reduce step, i.e. the input for reducer.py
    for word in words:
        print '%s\t%s' % (word, 1)
Word count using Python: Reducer code
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# read the entire line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # splitting the data on the basis of tab we have provided in mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)