Hadoop
! Dedicated to scalable, distributed, data-intensive computing
! Handles thousands of nodes and petabytes of data
! Supports applications under a free license
! 3 Hadoop subprojects:
!! Hadoop Common: common utilities package
!! HDFS: Hadoop Distributed File System with high throughput access to application data
!! MapReduce: a software framework for distributed processing of large data sets on computer clusters
Hadoop MapReduce
! MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004)
! Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
!! Petabytes of data
!! Thousands of nodes
!! Reliability and fault tolerance ensured by replicating data across multiple hosts
! MapReduce layer has job and task tracker nodes
! HDFS layer has name and data nodes
MapReduce framework
! Per cluster node:
!! Single JobTracker per master
!! Responsible for scheduling the job's component tasks on the slaves
!! Monitors slave progress
!! Re-executes failed tasks
!! Reduce step
!! Master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem
! The framework sorts the outputs of the maps
! The sorted map outputs become the input to the reduce tasks, which combine them
! Both the input and output of the job are stored in a filesystem
! Framework handles scheduling
!! Monitors and re-executes failed tasks
[Diagram: input → map → <k2, v2> → combine* → <k2, v2> → reduce → output <k3, v3>]
From http://code.google.com/edu/parallel/mapreduce-tutorial.html
To explain in detail, we'll use a code example: WordCount
Counts occurrences of each word across different files
Two input files:
file1: hello world hello moon
file2: goodbye world goodnight moon
Three operations: map, combine, reduce
MAP:
First map:  < hello, 1 >  < world, 1 >  < hello, 1 >  < moon, 1 >
Second map: < goodbye, 1 >  < world, 1 >  < goodnight, 1 >  < moon, 1 >
COMBINE:
First map:  < moon, 1 >  < world, 1 >  < hello, 2 >
Second map: < goodbye, 1 >  < world, 1 >  < goodnight, 1 >  < moon, 1 >
REDUCE:
< goodbye, 1 >  < goodnight, 1 >  < moon, 2 >  < world, 2 >  < hello, 2 >
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
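For context, here is a minimal sketch of the full driver that the fragment above would sit in. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative assumptions, not from the original slides; the mapper and reducer are sketched in the sections below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");     // newer versions: Job.getInstance(conf, "word count")
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);  // hypothetical Mapper, sketched below
    job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);   // hypothetical Reducer, sketched below
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // ...followed by the input/output wiring shown in the fragment above...
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(run(args));
  }
}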
Job details
! Job sets the overall MapReduce job configuration
! Job is specified client-side
! Primary interface for a user to describe a MapReduce job to the Hadoop framework for execution
! Used to specify
!! Mapper
!! Combiner (if any)
!! Partitioner (to partition key space)
!! Reducer
!! InputFormat
!! OutputFormat
!! Many user options; high customizability
Mapper
! Mapper maps input key/value pairs to a set of intermediate key/value pairs
! Implementing classes extend Mapper and override map()
!! Main Mapper engine: Mapper.run()
!! setup()
!! map() for each input record
!! cleanup()
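In code, the default Mapper.run() loop is essentially the following (a simplified paraphrase of the framework source, not taken from the slides):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);                 // called once, before any records
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);  // once per record
  }
  cleanup(context);               // called once, after the last record
}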
! Mapper implementations are specified in the Job
! Mapper instantiated in the Job
! Output data is emitted from Mapper via the Context object
! Hadoop MapReduce framework spawns one map task for each logical representation of a unit of input work for a map task
!! E.g. a filename and a byte range within that file
! Context object: allows the Mapper to interact with the rest of the Hadoop system
! Includes configuration data for the job as well as interfaces which allow it to emit output
! Applications can use the Context
!! to report progress
!! to set application-level status messages
!! to update Counters
!! to indicate they are alive
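As an illustration, a minimal sketch of the WordCount mapper emitting through the Context and updating a Counter. The class name TokenizerMapper and the counter names are assumptions matching the driver sketch earlier:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical WordCount mapper: emits a <word, 1> pair for every token in the input line
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);  // emit intermediate <k2, v2> via the Context
    }
    // Context also gives access to Counters, status, and progress reporting
    context.getCounter("WordCount", "InputLines").increment(1);
  }
}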
Combiner class
! Specifies how to combine the maps for local aggregation
! In this example, it is the same as the Reduce class
! Output after running combiner:
First map:  < moon, 1 >  < world, 1 >  < hello, 2 >
Second map: < goodbye, 1 >  < world, 1 >  < goodnight, 1 >  < moon, 1 >
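In the Job, reusing the reducer as the combiner is a single call (IntSumReducer is the hypothetical reducer class from the driver sketch):

job.setCombinerClass(IntSumReducer.class);  // run the reduce logic locally on each map's output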
! Framework groups all intermediate values associated with a given output key
! Passed to the Reducer class to get final output
! User-specified Comparator can be used to control grouping
! Combiner class can be user specified to perform local aggregation of the intermediate outputs
! Intermediate, sorted outputs always stored in a simple format
!! Applications can control if (and how) intermediate outputs are to be compressed (and the CompressionCodec to use) in the Job
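For example, map-output compression can be turned on through the job's Configuration. The property names below are the classic Hadoop 1.x keys and may differ in your version; treat them as an assumption to verify:

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate map output
conf.setClass("mapred.map.output.compression.codec",   // which CompressionCodec to use
    org.apache.hadoop.io.compress.GzipCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);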
The framework puts together all the pairs with the same key and feeds them to the reduce function, which then sums the values to give occurrence counts.
Reducer (III)
! Reduces a set of intermediate values which share a key to a (usually smaller) set of values
! Sorts and partitions Mapper outputs
! Number of reduces for the job set by user via Job.setNumReduceTasks(int)
! Reduce engine
!! Receives a Context containing the job's configuration as well as interfacing methods that return data back to the framework
!! Reducer.run()
!! setup()
!! reduce() per key associated with the reduce task
!! cleanup()
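Mirroring the Mapper, the default Reducer.run() boils down to the following loop (again a simplified paraphrase of the framework source):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);                // called once, before any keys
  while (context.nextKey()) {
    reduce(context.getCurrentKey(), context.getValues(), context);  // once per key
  }
  cleanup(context);              // called once, after the last key
}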
Reducer (IV)
! Reducer.reduce()
!! Called once per key
!! Passed an Iterable which returns all values associated with that key
!! Emits output with Context.write()
!! Output is not sorted
! 3 primary phases
!! Shuffle: the framework fetches relevant partitions of the output of all mappers via HTTP
!! Sort: the framework groups Reducer inputs by key
!! Reduce: reduce() is called on each <key, (value list)> pair
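Putting it together, a minimal sketch of the WordCount reduce(), which sums the per-word counts (IntSumReducer is the hypothetical class name used in the driver sketch):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical WordCount reducer: sums the counts for each word
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();       // add up every 1 (or combined partial sum) for this word
    }
    result.set(sum);
    context.write(word, result);  // emit final <word, total> pair
  }
}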
Scheduling
! By default, Hadoop uses FIFO to schedule jobs
! Alternate scheduler options: capacity and fair
! Capacity scheduler
!! Developed by Yahoo
!! Jobs are submitted to queues
!! Jobs can be prioritized
!! Queues are allocated a fraction of the total resource capacity
!! Free resources are allocated to queues beyond their total capacity
!! No preemption once a job is running
! Fair scheduler
!! Developed by Facebook
!! Provides fast response times for small jobs
!! Jobs are grouped into Pools
!! Each pool assigned a guaranteed minimum share
!! Excess capacity split between jobs
!! By default, uncategorized jobs go into a default pool
!! Pools specify the minimum number of map slots and reduce slots, and a limit on the number of running jobs
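As a sketch of what that looks like in practice (classic Hadoop 1.x configuration; the property and element names should be checked against your version, and the pool name is illustrative):

<!-- mapred-site.xml: select the fair scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- allocation file: per-pool minimum slots and running-job limit -->
<allocations>
  <pool name="analytics">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
</allocations>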
! Job client then submits the job (jar/executables, etc.) and the configuration to the JobTracker
! Framework goes into skipping mode after a certain number of map failures
! Number of records skipped depends on how frequently the processed record counter is incremented by the application
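Record skipping is configured through the SkipBadRecords helper class; a fragment from a driver, with illustrative thresholds:

import org.apache.hadoop.mapred.SkipBadRecords;

// In the driver, before submitting the job:
SkipBadRecords.setAttemptsToStartSkipping(job.getConfiguration(), 2);  // start skipping after 2 failed attempts
SkipBadRecords.setMapperMaxSkipRecords(job.getConfiguration(), 100);   // tolerate skipping up to 100 records around a bad one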
! In 2007 IBM and Google announced an initiative to use Hadoop to support university courses in distributed computer programming
! In 2008 this collaboration and the Academic Cloud Computing Initiative were funded by the NSF and produced the Cluster Exploratory Program (CLuE)
! Fault tolerant, reliable, and supports thousands of nodes and petabytes of data
! If you can rewrite algorithms into Maps and Reduces, and your problem can be broken up into small pieces solvable in parallel, then Hadoop's MapReduce is the way to go for a distributed problem-solving approach to large datasets
! Tried and tested in production
! Many implementation options