Big Data Management Continued

The document provides an overview of a Big Data Management course focusing on MapReduce, its ecosystem, and programming concepts. It covers key components like Hadoop, Pig, Hive, and NoSQL databases, particularly MongoDB, while outlining course objectives, grading, and recommended resources. Additionally, it explains the MapReduce architecture, data flow, and programming methodologies using Hadoop Streaming and MRJob.

BIG DATA MANAGEMENT CONTINUED

Instructor: Peeyush Taori
TA: Yogesh Khandelwal


AGENDA
Introduction to MapReduce
– Alternative framework to Apache Spark
– MapReduce concepts
– Programming in MapReduce

MapReduce Ecosystem
• Pig (think shell scripting)
• Hive (think SQL using MapReduce)

Focus on NoSQL Databases
– MongoDB
– Can be used along with Big Data, but is not a requirement


COURSE OBJECTIVES & GRADING

• Getting a good grade :)
• Appreciate and understand MapReduce and its usage in Big Data
• Understand Pig and Hive, and be able to program using Pig and Hive
• Understand NoSQL databases and make an informed choice about when to use what
• 1 assignment – 40% weightage
• Final exam – 60% weightage
RECOMMENDED TEXTBOOKS & SOFTWARE
• Hadoop: The Definitive Guide (4th edition). T. White. O'Reilly Media, 2015
• MongoDB: The Definitive Guide: Powerful and Scalable Data Storage (2nd edition). Kristina Chodorow. O'Reilly Media, 2015
• We will be using the Cloudera QuickStart VM
• Python as the language of implementation (agnostic if you want to use any other language)
• Materials on LMS
  • Lecture slides
  • Recommended readings
BRIEF RECAP
Big Data
– 3 Vs – Volume, Velocity, Variety

To solve any Big Data problem, we need to put three pieces together:
– Storage – HDFS
– Resource management – YARN, Mesos
– Processing – Spark/MapReduce

Spark
– Distributed Big Data processing framework
– Computing engine
– Does not deal with data storage itself
Spark Core – 3 Building Blocks
RDD
– Data abstraction (container for storing data)
– All Spark programs understand RDDs

Actions and Transformations
– Functions that operate on RDDs

SparkContext
– Entry point to the Spark program

A Spark program is made up of RDDs and functions that operate on those RDDs.
Think of it as Excel for Big Data.
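To make the recap concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the data is arbitrary) showing the SparkContext entry point, an RDD, and a transformation followed by an action:

from pyspark import SparkContext

sc = SparkContext("local[*]", "recap-example")   # entry point to the Spark program
numbers = sc.parallelize([1, 2, 3, 4, 5])        # RDD: the data abstraction
squares = numbers.map(lambda x: x * x)           # transformation (lazy, returns a new RDD)
total = squares.reduce(lambda a, b: a + b)       # action (triggers the computation)
print(total)                                     # 55
sc.stop()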
HADOOP
One of the most popular Big Data frameworks.
Open-source project written in Java.
Based on the Google File System (for storage) and the Google MapReduce framework (for computation).

Has three pieces.

HDFS – Hadoop Distributed File System
• File system for storing large files across connected computers
• Single namespace for the entire cluster
• Files are broken into 64 MB or 128 MB blocks
• Namenode – main node that stores metadata information about files
• Datanode – individual slave nodes that store the actual chunks of data
• Secondary Namenode – backup for the namenode

We will now talk about the second piece – MapReduce.

Hadoop also has a diverse ecosystem built on MapReduce (we will discuss some of these in the next sessions):
• Pig
• Hive
• HBase
• Impala
• Mahout
HADOOP ECOSYSTEM
INTRODUCTION TO MAPREDUCE
WHAT IS MAPREDUCE?
MapReduce was described in a research paper by Google in 2004. It is a
programming model for parallel data analysis.
The big idea is that the input data can be split into independent parts, that these
parts can be analyzed or processed separately, and that the results can then be
combined to produce the final output.
It must be possible to process each piece of data separately from the others – that is
hard for some types of analysis, and thinking about how to solve a problem in
parallel is a real intellectual challenge.
MAPREDUCE – KEY BENEFITS
• Simplicity
• Scalability – Parallel programming model
• Speed
• Recovery – Scheduling and fault tolerance built in
• Status reporting and monitoring
• Minimal data motion – Moves computation to where data is
APPLICATIONS

MapReduce greatly simplifies writing large-scale distributed applications.

• Used for building search indexes at Google and Amazon
• Widely used for analyzing user logs, data warehousing, and analytics
• Also used for large-scale machine learning and data mining applications
TYPICAL USE CASES
MAP REDUCE ARCHITECTURE

Each node is part of an HDFS cluster.
Input data is stored in HDFS across the nodes.
Just as we have the Namenode and Datanodes in HDFS, we have the JobTracker and TaskTrackers in MapReduce.
The end user (programmer) submits a job (e.g. WordCount – mapper, reducer, input) to the JobTracker.

JobTracker – master node
– Splits the input data
– Schedules and monitors the map and reduce tasks

TaskTracker – slaves
– Execute the map and reduce tasks
TYPICAL HADOOP CLUSTER
MAP REDUCE – BASIC IDEA
MAPREDUCE - BASICS
• Basically, LIST PROCESSING
• Input data list -> output data list
• Done in two phases
  • Mapper
  • Reducer
• The idea is not new
MAP REDUCE PARADIGM
MAP REDUCE BASICS
• Inspired by functional language primitives
• map f list: applies a given function f to each element of list and returns a new list
  • map square [1 2 3 4 5] = [1 4 9 16 25]
• reduce g list: combines the elements of list using function g to generate a new value
  • reduce sum [1 2 3 4 5] = 15
• map and reduce do not modify the input data; they always create new data
• A Hadoop MapReduce program consists of a mapper and a reducer
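The same primitives are available in Python as the built-in map and functools.reduce; a quick sketch using the numbers from the example above:

from functools import reduce

numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x * x, numbers))   # map square [1 2 3 4 5] -> [1, 4, 9, 16, 25]
total = reduce(lambda a, b: a + b, numbers)     # reduce sum [1 2 3 4 5] -> 15
print(squared, total)                           # the input list is never modified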
MAPREDUCE
• Simple programming model to process large datasets in parallel
• Divide a task into subtasks
• Handle the subtasks in parallel
• Aggregate the results to form the final output
• Programmer perspective: two functions
  • Map
  • Reduce
• Auxiliary phases such as sorting and partitioning can also occur
• Input and output are expressed in the form of key-value pairs
• Also has a component called the Driver
MAP/REDUCE DATA FLOW

[Diagram: input splits (HDFS blocks) -> map -> intermediate output -> shuffle & sort -> reduce -> output]
MAPPING LIST
• First phase of a MapReduce program
• Transforms each input element into an output data element
• For example, toUpper() to convert all strings to upper case

Mapper
– Records (lines, database rows, etc.) are input as key/value pairs
– The mapper outputs one or more intermediate key/value pairs for each input
– map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
MAPPING LIST
REDUCING LIST
• Aggregates values together
• Iterates over the input values
• Gives a single final output value by combining the input values

Reducer
– After the map phase, all the intermediate values for a given output key are combined into a list
– The reducer combines those intermediate values into one or more final key/value pairs
– reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
– Input and output key/value types can be different
REDUCING LIST
PUTTING IT TOGETHER
• MapReduce takes these concepts to the context of large datasets
• A MapReduce program has two components
  • One implements the Mapper
  • The other implements the Reducer
• Key-value pairs
  • Values are associated with keys, e.g.:
  • AAA-123  65 mph, 12:00pm
  • ZZZ-789  50 mph, 12:02pm
  • AAA-123  40 mph, 12:05pm
  • CCC-456  25 mph, 12:15pm
MAPREDUCE
• The Mapper and Reducer functions receive key/value pairs
• Their output is also key/value pairs
• Less strict than other languages
• Keys divide the reduce space
• The Reducer processes all values with the same key together
FUNDAMENTAL DATA TYPES
MapReduce has two fundamental data types:
– key-value pairs
– lists

         Input              Output
Map      (k1, v1)           list((k2, v2))
Reduce   (k2, list(v2))     list((k3, v3))
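As a rough illustration of that contract (the names map_fn and reduce_fn are hypothetical, not part of any Hadoop API), the same signatures written as Python type hints:

from typing import Iterable, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2")
K3 = TypeVar("K3"); V3 = TypeVar("V3")

def map_fn(key: K1, value: V1) -> Iterable[Tuple[K2, V2]]:
    ...   # emits zero or more intermediate (k2, v2) pairs

def reduce_fn(key: K2, values: Iterable[V2]) -> Iterable[Tuple[K3, V3]]:
    ...   # combines all v2 values for one k2 into final (k3, v3) pairs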
MAPPER EXAMPLE
REDUCER EXAMPLE
EXAMPLE
Mapper
– Input: <key: offset, value: line of a document>
– Output: for each word w in the input line, output <key: w, value: 1>
Input: (2133, The quick brown fox jumps over the lazy dog.)
Output: (the, 1), (quick, 1), (brown, 1), (fox, 1), ... (the, 1)

Reducer
– Input: <key: word, value: list<integer>>
– Output: sum all values in the input list for the given key and output <key: word, value: count>
Input: (the, [1, 1, 1, 1, 1]), (fox, [1, 1, 1]) ...
Output: (the, 5)
        (fox, 3)
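A minimal in-memory sketch of that word-count mapper and reducer in plain Python (the function names and the punctuation clean-up are just for illustration):

def mapper(offset, line):
    # emit (word, 1) for every word in the input line
    return [(word.lower().strip(".,"), 1) for word in line.split()]

def reducer(word, counts):
    # sum all the 1s emitted for this word
    return (word, sum(counts))

print(mapper(2133, "The quick brown fox jumps over the lazy dog."))
# [('the', 1), ('quick', 1), ('brown', 1), ..., ('the', 1), ('lazy', 1), ('dog', 1)]
print(reducer("the", [1, 1, 1, 1, 1]))   # ('the', 5)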
EXAMPLE: COUNTING WORDS IN TEXT
input split     map output       input to reducer     reducer output
(k1, v1)        list(k2, v2)     (k2, list(v2))       (k3, v3)
(0, the)        (the, 1)         (the, [1, 1])        (the, 2)
(1, quick)      (quick, 1)       (quick, [1])         (quick, 1)
(2, brown)      (brown, 1)       (brown, [1])         (brown, 1)
(3, fox)        (fox, 1)         (fox, [1])           (fox, 1)
(4, jumped)     (jumped, 1)      (jumped, [1])        (jumped, 1)
(5, over)       (over, 1)        (over, [1])          (over, 1)
(6, the)        (the, 1)         (lazy, [1])          (lazy, 1)
(7, lazy)       (lazy, 1)        (dog, [1])           (dog, 1)
(8, dog)        (dog, 1)

(map produces the intermediate output; shuffle groups the values by key; reduce produces the word counts)
HOW IT ACTUALLY FLOWS
Map phase
– Map tasks run in parallel – their output is intermediate key/value pairs

Shuffle and sort phase
– Map task output is partitioned by hashing the output key
– The number of partitions equals the number of reducers
– Partitioning ensures that all key/value pairs sharing the same key belong to the same partition
– Each map output partition is sorted by key to group all values for the same key

Reduce phase
– Each partition is assigned to one reducer
– Reducers also run in parallel
– No two reducers process the same intermediate key
– A reducer gets all values for a given key at the same time
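A toy Python sketch of the shuffle step described above (three reducers assumed; Hadoop's default partitioner uses a similar hash-modulo idea):

num_reducers = 3
intermediate = [("the", 1), ("fox", 1), ("the", 1), ("lazy", 1), ("fox", 1)]

partitions = {p: [] for p in range(num_reducers)}
for key, value in intermediate:
    p = hash(key) % num_reducers        # the same key always lands in the same partition
    partitions[p].append((key, value))

for p in partitions:
    partitions[p].sort()                # sorting by key groups all values for the same key
    print(p, partitions[p])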
HOW IT ACTUALLY FLOWS
MAPREDUCE DATA FLOW
MAP REDUCE ARCHITECTURE
[Diagram: the client submits a job (mapper, reducer, input) to the JobTracker, which assigns tasks to the TaskTrackers; data transfer happens between the TaskTrackers]

JobTracker
– Splits the input and assigns it to the various map tasks
– Schedules and monitors the map tasks (heartbeat)
– On completion, schedules the reduce tasks

TaskTracker
– Executes map tasks – calls the mapper for every input record
– Executes reduce tasks – calls the reducer for every (intermediate key, list of values) pair
– Handles partitioning of the map outputs
– Handles sorting and grouping of the reducer input
ADVANTAGES
Locality
– The JobTracker divides tasks based on the location of the data: it tries to schedule a map task on the same machine that holds the physical data

Parallelism
– Map tasks run in parallel, working on different input data splits
– Reduce tasks run in parallel, working on different intermediate keys
– Reduce tasks wait until all map tasks are finished

Fault tolerance
– The JobTracker maintains a heartbeat with the TaskTrackers
– Failures are handled by re-execution
– If a TaskTracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node
PROGRAMMING IN MAPREDUCE
HOW TO PROGRAM IN THE MAPREDUCE FRAMEWORK
There are many libraries with which MapReduce programming can be done (relatively) easily.

We will look at two such implementations.

HADOOP STREAMING
• Has nothing to do with streaming applications (do not confuse it with Spark Streaming)
• Wrapper provided by Hadoop

MRJOB
• External library for writing MapReduce programs
• Easier to use than Hadoop Streaming
MAPREDUCE PROGRAMMING – HADOOP STREAMING
• Utility that comes packaged with Hadoop
• Enables multiple languages such as Python and Ruby to be used for the mapper, the reducer, or both
• Creates a MapReduce job, submits the job to the cluster, and monitors its progress until it is complete
• The mapper and reducer are both executables
• Let us understand this with an example

To run a MapReduce program on Hadoop, type the following command:

/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
  -input /frost.txt -output /output

Option     Description
-files     A comma-separated list of files to be copied to the MapReduce cluster
-mapper    The command to be run as the mapper
-reducer   The command to be run as the reducer
-input     The DFS input path for the map step
-output    The DFS output directory for the reduce step
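For completeness, here is a minimal sketch of what the mapper.py and reducer.py referenced above might contain for word count (an assumption about their contents, not the course's exact files). Hadoop Streaming feeds raw lines on stdin and expects tab-separated key/value pairs on stdout, and the reducer receives its input sorted by key:

#!/usr/bin/env python
# mapper.py - reads lines from stdin and emits "word<TAB>1" for every word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts for each word are summed in one pass
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))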


MRJOB
MapReduce library created by Yelp
Provides a wrapper around Hadoop Streaming for MapReduce applications
Applications can be written in a Pythonic manner
Can be written in pure Python
Very actively developed framework
Let us again understand with an example

To install mrjob:
sudo yum -y install python-pip
sudo pip install mrjob

Let us run a word count example:

python word_count.py -r hadoop hdfs:///user/peeyush/file1.txt
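A minimal sketch of what word_count.py might contain, following the canonical word-count example from the mrjob documentation:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # the input key is ignored; the value is one line of text
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts is an iterator over all the 1s emitted for this word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()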
