Analyzing the Data with Hadoop

A Weather Dataset

• The data we will use is from the National Climatic Data Center
(NCDC, http://www.ncdc.noaa.gov/).
• Weather sensors collect data every hour at many locations
across the globe and gather a large volume of log data, which
is a good candidate for analysis with MapReduce because we
want to process all the data, and the data is semi-structured
and record-oriented.
• The data is stored using a line-oriented ASCII format, in which
each line is a record.
• The format supports a rich set of meteorological elements,
many of which are optional or with variable data lengths.
Analyzing the Data with Hadoop
Map and Reduce
• MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
• Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.
• The programmer also specifies two functions: the map function and the reduce function.
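In general terms, the two functions can be summarized by the following type signatures (standard MapReduce notation; in the maximum-temperature example that follows, the map input types are LongWritable and Text, and the output types of both phases are Text and IntWritable):
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)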
• The input to our map phase is the raw NCDC data. We choose a TextInputFormat that gives us each line in the dataset as a text value.
• The map function is just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the maximum temperature for each year.
• The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
• The keys are the line offsets within the file, which we ignore in our map function.
• The map function merely extracts the year and the air temperature from each line and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
• The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key.
(1949, [111, 78])
(1950, [0, 22, −11])
• Each year appears with a list of all its air temperature
readings. All the reduce function has to do now is
iterate through the list and pick up the maximum
reading:
(1949, 111)
(1950, 22)
This is the final output
MapReduce logical data flow
• The input for our program is a set of weather data files, one per year. This weather data is collected by the National Climatic Data Center (NCDC) from weather sensors all over the world.
• You can find the weather data for each year at ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. All files are zipped by year and weather station.
• For each year, there are multiple files for different weather stations.
Meaning of a sample record
• 0029029070999991905010106004+64333+023450FM12+000599999V0202301N008219999999N0000001N9-01391+99999102641ADDGF102991999999999999999999
• 029070 is the USAF weather station identifier.
• The next field (19050101) represents the observation date.
• The field -0139 represents the air temperature in Celsius times ten, so a reading of -0139 equates to -13.9 degrees Celsius.
• The field that follows the temperature is a reading quality code.
Map Phase
• The input for the Map phase is the set of weather data files described above. The types of the input key-value pairs are LongWritable and Text, and the types of the output key-value pairs are Text and IntWritable.
• Each map task extracts the temperature data from its year file. The output of the map phase is a set of key-value pairs: the keys are the years and the values are the temperature readings for each year.
• The Mapper class is a generic type, with four formal type
parameters that specify the input key, input value, output key,
and output value types of the map function.
Java MapReduce
Mapper for maximum temperature example
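A representative implementation of this mapper is sketched below; the field offsets follow the NCDC record format described earlier, and the MISSING sentinel (9999) marks absent temperature readings:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // drop missing, suspect, or erroneous readings
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}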
Reduce Phase
• The Reduce phase takes all the values associated with a particular key; that is, all the temperature values belonging to a particular year are fed to the same reducer. Each reducer then finds the highest recorded temperature for its years.
• The types of the output key-value pairs of the Map phase are the same as the types of the input key-value pairs of the Reduce phase (Text and IntWritable). The types of the output key-value pairs of the Reduce phase are also Text and IntWritable.
Reducer for maximum temperature example
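A representative implementation of this reducer is sketched below; it simply iterates over all the temperature values for a year and keeps the maximum:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        // emit (year, maximum temperature) as the final output
        context.write(key, new IntWritable(maxValue));
    }
}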
Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
• % export HADOOP_CLASSPATH=hadoop-examples.jar
• % hadoop MaxTemperature input/ncdc/sample.txt output
• Job object forms the specification of the job and gives you control over how
the job is run. When we run this job on a Hadoop cluster, we will package
the code into a JAR file (which Hadoop will distribute around the cluster).
• Rather than explicitly specifying the name of the JAR file, we can pass a class
in the Job’s setJarByClass() method, which Hadoop will use to locate the
relevant JAR file by looking for the JAR file containing this class.
• The return value of the waitForCompletion() method is a Boolean indicating
success (true) or failure (false), which we translate into the program’s exit
code of 0 or 1.
Scaling Out
• We’ve seen how MapReduce works for small inputs.
• For simplicity, the examples so far have used files on the local
filesystem. However, to scale out, we need to store the data in a
distributed filesystem.
• This allows Hadoop to move the MapReduce computation to each
machine hosting a part of the data, using Hadoop’s resource
management system, called YARN.
Data Flow
• A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program,
and configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two
types: map tasks and reduce tasks.
• The tasks are scheduled using YARN and run on nodes in the cluster. If
a task fails, it will be automatically rescheduled to run on a different
node.
• Hadoop divides the input to a MapReduce job into fixed-size pieces
called input splits,or just splits. Hadoop creates one map task for each
split, which runs the user-defined map function for each record in the
split.
• Hadoop does its best to run the map task on a node where the input
data resides in HDFS, because it doesn’t use valuable cluster
bandwidth. This is called the data locality optimization.
• There are two types of nodes that control the job execution process: a
jobtracker and a number of tasktrackers.
• The jobtracker coordinates all the jobs run on the system by scheduling
tasks to run on tasktrackers. Job Tracker determines the location of the data
by communicating with the Name Node. Job Tracker also helps in finding
the Task Tracker.
• Tasktrackers run tasks and send progress reports to the jobtracker, which
keeps a record of the overall progress of each job. The Task Tracker helps in
mapping, shuffling and reducing the data operations.
• Task Tracker continuously updates the status of the Job Tracker. It also
informs about the number of slots available in the cluster. In case the Task
Tracker is unresponsive, then Job Tracker assigns the work to some other
nodes.
Daemons of Hadoop
Distributed computing system – MapReduce Framework
Job Tracker:
• Centrally monitors the submitted job and controls all processes running on the nodes of the cluster.
Task Tracker:
• Constantly communicates with the Job Tracker.
Daemons Architecture
Hadoop Server Roles
• Hadoop divides the input to a Map Reduce job into fixed-size
pieces called input splits, or just splits.
• Hadoop creates one map task for each split, which runs the
user defined map function for each record in the split.
• Having many splits means the time taken to process each
split is small compared to the time to process the whole
input.
• So if we are processing the splits in parallel, the processing is
better load-balanced if the splits are small, since a faster
machine will be able to process proportionally more splits
over the course of the job than a slower machine.
• However, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
• For most jobs, a good split size tends to be the size of an HDFS
block, 128 MB by default, although this can be changed for the
cluster (for all newly created files), or specified when each file
is created.
• Hadoop does its best to run the map task on a node where the
input data resides in HDFS. This is called the data locality
optimization since it doesn’t use valuable cluster bandwidth.
• Sometimes, however, all three nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks
so the job scheduler will look for a free map slot on a node in
the same rack as one of the blocks.
• Map tasks write their output to the local disk, not to HDFS.
• This is because map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away.
• So, storing it in HDFS with replication would be overkill.
• If the node running the map task fails before the map output
has been consumed by the reduce task, then Hadoop will
automatically rerun the map task on another node to re-create
the map output.
• Reduce tasks don’t have the advantage of data locality; the
input to a single reduce task is normally the output from all
mappers.
• In the present example, we have a single reduce task that is fed
by all of the map tasks. Therefore, the sorted map outputs have
to be transferred across the network to the node where the
reduce task is running, where they are merged and then passed
to the user-defined reduce function.
• The output of the reduce is normally stored in HDFS for
reliability.
The whole data flow with a single reduce task is as follows:
Map Reduce data flow diagram
• The number of reduce tasks is not governed by the size of the
input, but instead is specified independently.
• When there are multiple reducers, the map tasks partition their
output, each creating one partition for each reduce task. There
can be many keys (and their associated values) in each partition,
but the records for any given key are all in a single partition.
• The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
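As an illustration of how hash partitioning keeps every record for a given key in one partition, the sketch below mimics the behaviour of Hadoop's default org.apache.hadoop.mapreduce.lib.partition.HashPartitioner; the class name YearPartitioner is made up for this example, and a custom partitioner would be registered with job.setPartitionerClass(...):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: every record for a given year goes to the same
// reduce task, because the partition is a pure function of the key.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then bucket by key hash.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}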
Map Reduce data flow with multiple reduce tasks
• It’s also possible to have zero reduce tasks.
• This can be appropriate when we don't need the shuffle because the processing can be carried out entirely in parallel.
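A map-only job is obtained by setting the number of reduce tasks to zero on the Job; a minimal driver fragment (setNumReduceTasks is the standard Job method):
// With zero reduce tasks there is no shuffle or sort phase:
// each map task writes its output directly to the output path in HDFS.
job.setNumReduceTasks(0);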
Combiner Functions
• Many MapReduce jobs are limited by the bandwidth available on the
cluster, so it pays to minimize the data transferred between map and
reduce tasks.
• Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function's output forms the input to the reduce function.
• Because the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record.
• The contract for the combiner function therefore constrains the type of function that may be used: since it may be applied zero, one, or many times, the result must not depend on how often it runs (max has this property, as the example below shows).
• Suppose that for the maximum temperature example, readings for the year 1950 were
processed by two maps (because they were in different splits). Imagine the first map
produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
• since 25 is the maximum value in the list. We could use a combiner function
that, just like the reduce function, finds the maximum temperature for each
map output.
• The reduce function would then be called with:
(1950, [20, 25])
and would produce the same output as before. More succinctly, we may
express the function calls on the temperature values in this case as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
• Not all functions possess this property. For example, if we were calculating mean temperatures, we couldn't use the mean as our combiner function, because:
• mean(0, 20, 10, 25, 15) = 14
• but:
• mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
• The combiner function doesn’t replace the reduce function.
Specifying a combiner function
• Same implementation as the reduce function in
MaxTemperatureReducer.
• The only change we need to make is to set the combiner class on the Job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class); // the reducer doubles as the combiner
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Hadoop Streaming
• Hadoop provides an API to MapReduce that allows us to write our
map and reduce functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface
between Hadoop and our program, so we can use any language that
can read standard input and write to standard output to write our
MapReduce program
Ruby
• The map function can be expressed in Ruby as follows.
Map function for maximum temperature in Ruby
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
% cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
Reduce function for maximum temperature in Ruby
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
% cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/ruby/max_temperature_map.rb | \
sort | ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
• Specify the Streaming JAR file along with the jar option. Options to
the Streaming program specify the input and output paths and the
map and reduce scripts.
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
-reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
Python
Map function for maximum temperature in Python
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
Reduce function for maximum temperature in Python
#!/usr/bin/env python
import sys
(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
% cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/python/max_temperature_map.py | \
sort | ch02-mr-intro/src/main/python/max_temperature_reduce.py
1949 111
1950 22
Word count using python : Mapper code
#!/usr/bin/env python
# import sys because we need to read and write data to STDIN and STDOUT
import sys
# reading entire line from STDIN (standard input)
for line in sys.stdin:
    # to remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # we are looping over the words array and printing the word
    # with the count of 1 to the STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print '%s\t%s' % (word, 1)
Word count using python : Reducer code
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# read the entire line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # splitting the data on the basis of tab we have provided in mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
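The two scripts can be checked locally with ordinary Unix pipes before submitting them as a Streaming job (the echo input is arbitrary, and mapper.py and reducer.py are assumed to be the executable scripts above):
% echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2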