Big Data Management Continued
MapReduce Ecosystem
• Pig (Think Shell Scripting)
• Hive (Think SQL using MapReduce)
Spark
Distributed Big Data Processing Framework
Computing Engine
Does not handle data storage itself
Spark Core – 3 Building Blocks
RDD
Data Abstraction (Container for storing data)
All Spark programs understand RDD
SparkContext
Entry point to the Spark program
A Spark program is made up of RDDs and functions that operate on those RDDs
Think of Excel for BigData
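A minimal PySpark sketch of these building blocks (assumes a local Spark installation with the pyspark package; the input path is a placeholder): the SparkContext is the entry point, textFile creates an RDD, and map/reduce are functions operating on that RDD.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LineLengths")   # SparkContext: entry point to the program

    lines = sc.textFile("input.txt")               # RDD: container holding the input data
    lengths = lines.map(len)                       # a function applied to every element
    total = lengths.reduce(lambda a, b: a + b)     # functions operate on RDDs

    print(total)                                   # total characters across all lines
    sc.stop()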
HADOOP
One of the most popular Big Data frameworks.
Open source project written in Java
Based on the Google File System (implemented in Hadoop as HDFS) for storage and the Google MapReduce framework for computation
JobTracker – Master
Schedules Map and Reduce tasks and monitors their progress
TaskTracker – Slaves
Execute Map and Reduce tasks
TYPICAL HADOOP CLUSTER
MAP REDUCE – BASIC IDEA
MAPREDUCE - BASICS
• Basically, LIST PROCESSING
• Input data list -> Output data list
• Done in two phases
• Mapper
• Reducer
• The idea is not new
MAP REDUCE PARADIGM
MAP REDUCE BASICS
•Inspired by functional language primitives
•map f list : applies a given function f to each element of list and returns a new list
• map square [1 2 3 4 5] = [1 4 9 16 25]
•reduce g list : combines elements of list using function g to generate a new value
• reduce sum [1 2 3 4 5] = [15]
•Map and reduce do not modify input data. They always create new data
•A Hadoop Map Reduce program consists of a mapper and a reducer
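As a rough illustration in Python (using the built-in map and functools.reduce rather than Hadoop), the two examples above look like this:

    from functools import reduce

    squares = list(map(lambda x: x * x, [1, 2, 3, 4, 5]))   # [1, 4, 9, 16, 25]
    total = reduce(lambda a, b: a + b, [1, 2, 3, 4, 5])     # 15

    # Neither call modifies the input list; both produce new values.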
MAPREDUCE
• Simple programming model to process large datasets in parallel
• Divide task into subtasks
• Handle sub-tasks in parallel
• Aggregate results to form final output
• Programmer perspective: Two functions
• Map
• Reduce
• Auxiliary phases such as sorting and partitioning can also occur
• Input and output are expressed as key-value pairs
• Also has a Driver component that configures and submits the job
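A toy, single-machine sketch of this flow (plain Python, no Hadoop; the helper name run_mapreduce is made up for illustration): the map phase, the grouping of values by key, and the reduce phase are all simulated in memory.

    from collections import defaultdict

    def run_mapreduce(records, mapper, reducer):
        # Map phase: every input record yields zero or more (key, value) pairs
        intermediate = defaultdict(list)
        for record in records:
            for key, value in mapper(record):
                intermediate[key].append(value)
        # Shuffle/sort is simulated by grouping all values under their key
        # Reduce phase: each key and its list of values produce one final value
        return {key: reducer(key, values) for key, values in sorted(intermediate.items())}

    counts = run_mapreduce(
        ["a b a", "b c"],
        mapper=lambda line: [(word, 1) for word in line.split()],
        reducer=lambda key, values: sum(values),
    )
    # counts == {'a': 2, 'b': 2, 'c': 1}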
MAP/REDUCE DATA FLOW
[Diagram: input stored as HDFS blocks is divided into splits; each split is fed to a map task; the intermediate map output is shuffled and sorted by key; reduce tasks consume the grouped intermediate data and write the final output]
MAPPING LIST
• First phase of a map reduce program
• Transforms each element to an output data element
• For example, toUpper() to convert all strings to upper case
Mapper
Records (lines, database rows, etc.) are input as key/value pairs
The mapper outputs one or more intermediate key/value pairs for each input
map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
REDUCING LIST
• Aggregate values together
• Iterate over input values
• Give a single final output value by combining input values
Reducer
After the map phase, all the intermediate values for a given output key are combined together into a list
The reducer combines those intermediate values into one or more final key/value pairs
reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Input and output key/value types can be different
PUTTING IT TOGETHER
• MapReduce takes these concepts in context of large datasets
• MapReduce has two components
• One implements the Mapper
• The other implements the Reducer
• Key-value pairs
• Values are associated with keys, for example speed readings keyed by license plate:
• AAA-123 65mph, 12:00pm
• ZZZ-789 50mph, 12:02pm
• AAA-123 40mph, 12:05pm
• CCC-456 25mph, 12:15pm
MAPREDUCE
• Mapper and Reducer functions receive key/value pairs
• Output is also key/value pairs
• Less strict than the functional-language primitives: input and output types may differ
• Keys partition the reduce space
• The reducer processes all values that share the same key together
FUNDAMENTAL DATA TYPES
MapReduce has two fundamental data types:
key-value pairs
lists
          Input               Output
Map       (k1, v1)            list( (k2, v2) )
Reduce    (k2, list(v2))      list( (k3, v3) )
EXAMPLE: MAPPER AND REDUCER
Mapper
Input: <key: offset, value: line of a document>
Output: for each word w in the input line, output <key: w, value: 1>
Input: (2133, The quick brown fox jumps over the lazy dog.)
Output: (the, 1), (quick, 1), (brown, 1) … (fox, 1), (the, 1)
Reducer
Input: <key: word, value: list<integer>>
Output: sum all the values in the input list for the given key and output <key: word, value: count>
Input: (the, [1, 1, 1, 1, 1]), (fox, [1, 1, 1]) …
Output: (the, 5)
(fox, 3)
EXAMPLE: COUNTING WORDS IN TEXT
[Diagram: input text is split across map tasks, which emit intermediate (word, 1) pairs; these become the input to the reducer, which produces the final word counts]
Job tracker
Splits the input and assigns the splits to map tasks
Schedules and monitors map tasks (heartbeat)
On completion, schedules reduce tasks
Task tracker
Execute map tasks – call mapper for every input record
Execute reduce tasks – call reducer for every intermediate key, list of values pair
Handle partitioning of map outputs
Handle sorting and grouping of reducer input
ADVANTAGES
Locality
Job tracker divides tasks based on the location of the data: it tries to schedule map tasks on the same machine that holds the physical data
Parallelism
Map tasks run in parallel, working on different input data splits
Reduce tasks run in parallel working on different intermediate keys
Reduce tasks wait until all map tasks are finished
Fault tolerance
Job tracker maintains a heartbeat with task trackers
Failures are handled by re-execution
If a task tracker node fails, then all tasks scheduled on it (completed or incomplete) are re-executed on another node
PROGRAMMING IN MAPREDUCE
HOW TO PROGRAM IN MAPREDUCE FRAMEWORK
There are many libraries and utilities that make MapReduce programming (relatively) easy
HADOOP STREAMING
•Has nothing to do with streaming applications (Do not confuse with Spark Streaming)
•Wrapper provided by Hadoop
MRJOB
•External Python library for writing MapReduce programs
•Easier to use than Hadoop Streaming
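A minimal mrjob word-count sketch (assumes the mrjob package is installed; the class name and file name are illustrative):

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):           # key is unused, value is one input line
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):     # counts iterates the values for this word
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Saved as word_count.py, this can be run locally with "python word_count.py input.txt" or, with a suitable configuration, on a Hadoop cluster with "python word_count.py -r hadoop ..." (file names and paths are placeholders).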
MAPREDUCE PROGRAMMING – HADOOP STREAMING
• Utility comes packaged with Hadoop
• Enables multiple languages such as Python and Ruby to be used for the mapper, the reducer, or both
• Creates a MapReduce job, submits the job to the cluster,
and monitors its progress until it is complete
• Mapper and Reducer are both executables
• Let us understand this by an example
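For example, word count can be written as two small Python executables (the file names mapper.py and reducer.py are placeholders): the mapper reads lines from standard input and emits tab-separated (word, 1) pairs, and the reducer receives those pairs sorted by key and sums the counts per word.

    #!/usr/bin/env python3
    # mapper.py: emit "word<TAB>1" for every word on every input line
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word.lower() + "\t1")

    #!/usr/bin/env python3
    # reducer.py: input arrives sorted by key, so keep a running count per word
    # and flush it whenever the key changes
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))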
To run the MapReduce program on Hadoop, type the following commands:
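A typical invocation looks roughly like the following (the streaming jar location and the HDFS input/output paths are placeholders that depend on the installation):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/hadoop/input \
        -output /user/hadoop/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py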