MapReduce and Hadoop Distributed File System: K. Madurai and B. Ramamurthy
Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
[email protected]
http://www.cse.buffalo.edu/faculty/bina
Partially supported by NSF DUE Grant: 0737243
10/6/2019
The Context: Big-data
Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
Google collects 270PB data in a month (2007), about 20PB (20000TB) a day (2008)
2010 census data is expected to be a huge gold mine of information
Data mining huge amounts of data collected in a wide range of domains from
astronomy to healthcare has become essential for planning and performance.
We are in a knowledge economy.
◦ Data is an important asset to any organization
◦ Discovery of knowledge; Enabling discovery; annotation of data
We are looking at newer
◦ programming models, and
◦ Supporting algorithms and data structures.
NSF refers to it as “data-intensive computing” and industry calls it “big-data” and
“cloud computing”
Purpose of this talk
To provide a simple introduction to:
“Big-data computing”: an important advancement with the potential to
significantly impact the CS undergraduate curriculum.
A programming model called MapReduce for
processing “big-data”
A supporting file system called Hadoop Distributed
File System (HDFS)
To encourage educators to explore ways to infuse
relevant concepts of this emerging area into their
curriculum.
The Outline
Introduction to MapReduce
From CS Foundation to MapReduce
MapReduce programming model
Hadoop Distributed File System
Relevance to Undergraduate Curriculum
Demo (Internet access needed)
Our experience with the framework
Summary
References
MapReduce
What is MapReduce?
MapReduce is a programming model Google has used successfully in
processing its “big-data” sets (more than 20 petabytes per day)
Users specify the computation in terms of a map and a reduce function,
The underlying runtime system automatically parallelizes the computation
across large-scale clusters of machines, and
The underlying system also handles machine failures, efficient
communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified
data processing on large clusters. Communications of the ACM 51, 1 (Jan.
2008), 107-113.
From CS Foundations to MapReduce
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green,…}
Problem: Count the occurrences of the different words in the collection.
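As a single-machine baseline for the problem above, here is a minimal Python sketch. The function name count_words is hypothetical, and the sample list is only the visible part of the collection shown above (the elided tail is left out).

```python
from collections import Counter


def count_words(collection):
    """Return a result table mapping each distinct word to its occurrence count."""
    return Counter(collection)


# The visible part of the example collection.
data_collection = ["web", "weed", "green", "sun", "moon", "land", "part", "web", "green"]

result_table = count_words(data_collection)
for word, count in result_table.items():
    print(word, count)
```

Everything that follows keeps this same result but changes where and how the parsing and counting are carried out.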
Word Counter and Result Table
[Figure: Main uses a WordCounter, with parse( ) and count( ), to read the DataCollection {web, weed, green, sun, moon, land, part, web, green,…} and fill the ResultTable.]
ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
Multiple Instances of Word Counter
[Figure: Main runs 1..* Threads, each with a WordCounter (parse( ), count( )), over the data collection to produce the result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Improve Word Counter for Performance
[Figure: Main runs 1..* Threads; the WordCounter is split into 1..* Parsers and 1..* Counters. Parsers turn the DataCollection into a WordList of KEY/VALUE entries (web, weed, green, sun, moon, land, part, web, green, …), and separate counters fill the ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
Separate counters: no need for lock
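The "separate counters, no lock" idea can be sketched as follows. This is an illustrative Python sketch, not the code behind the slide: the function names, the chunking of the input, and the worker count are all assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(chunk):
    """Parser: turn one chunk of raw text into a word list."""
    return chunk.split()


def count(word_list):
    """Counter: build a private table for one word list, so no lock is required."""
    return Counter(word_list)


def parse_and_count(chunk):
    return count(parse(chunk))


def threaded_word_count(chunks, workers=4):
    # Each worker fills its own Counter; nothing is shared while counting.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_tables = list(pool.map(parse_and_count, chunks))
    # Only the main thread merges the private tables into the result table.
    result_table = Counter()
    for table in partial_tables:
        result_table.update(table)
    return result_table


chunks = ["web weed green sun", "moon land part", "web green"]
result_table = threaded_word_count(chunks)
print(dict(result_table))
```

In CPython the threads mainly illustrate the structure, since the global interpreter lock limits the speedup for CPU-bound counting; the design point is that private tables remove the need for locking.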
Peta-scale Data
[Figure: the same design (Main, 1..* Threads, 1..* Parsers, 1..* Counters, KEY/VALUE word list, result table) applied to a peta-scale data collection.]
Addressing the Scale Issue
A single machine cannot serve all the data: you need a special distributed
(file) system
Large number of commodity hardware disks: say, 1000 disks of 1TB each
Issue: with a failure rate of 1/1000 per disk (derived from the mean time
between failures, MTBF), about 1 of the above 1000 disks can be expected
to be down at any given time.
Thus failure is the norm and not an exception.
The file system has to be fault-tolerant: replication, checksum
Data transfer bandwidth is critical (location of data)
Critical aspects: fault tolerance + replication + load balancing, monitoring
Exploit parallelism afforded by splitting parsing and counting
Provision and locate computing at data locations
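The failure arithmetic above can be checked with a short calculation. This sketch assumes that disks fail independently and that each disk is down with probability 1/1000 at any instant; both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_down(disks, p_down):
    """Expected number of disks that are down at a given instant."""
    return disks * p_down


def prob_at_least_one_down(disks, p_down):
    """Probability that one or more disks are down, assuming independent failures."""
    return 1.0 - (1.0 - p_down) ** disks


disks = 1000
p_down = 1 / 1000

print(expected_down(disks, p_down))                      # about 1 disk on average
print(round(prob_at_least_one_down(disks, p_down), 3))   # roughly 0.632
```

Under these assumptions, some disk is down most of the time, which is why the file system has to treat failure as the normal case.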
Peta Scale Data is Commonly Distributed
[Figure: the same Parser/Counter design (Main, 1..* Threads, WordList, DataCollection, ResultTable), with the input now spread over several separate data collections.]
Divide and Conquer: Provision Computing at Data Location
[Figure: each data collection gets its own Main, Thread, 1..* Parsers and 1..* Counters, so the computation runs where the data is stored.]
For our example,
Our parse is a mapping operation:
MAP: input <key, value> pairs
Our count is a reduce operation:
REDUCE: <key, value> pairs reduced
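The correspondence between parse/count and map/reduce can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not Hadoop's API: the names map_words, shuffle and reduce_counts are hypothetical, and the shuffle step stands in for the grouping a real runtime performs between the two phases.

```python
from collections import defaultdict


def map_words(split):
    """MAP: emit a <word, 1> pair for every word parsed from one input split."""
    for word in split.split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key so that each key reaches exactly one reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """REDUCE: collapse all values for one key into a single <word, count> pair."""
    return (key, sum(values))


splits = ["web weed green sun", "moon land part web green"]
intermediate = [pair for split in splits for pair in map_words(split)]
result_table = dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())
print(result_table)
```

The map function never looks at more than one split and the reduce function never looks at more than one key, which is what lets a runtime spread both across many machines.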
Mapper and Reducer
[Figure: a MapReduceTask has a Mapper and a Reducer; YourMapper (the Parser) and YourReducer (the Counter) take those two roles.]
MAP: Input data <key, value> pair
[Figure: the data collection is split (split 1 … split n) to supply multiple processors; each Map turns its split into a KEY/VALUE table such as web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …]
Reduce Operation
MAP: Input data <key, value> pair
REDUCE: <key, value> pair <result>
[Figure: the data collection is split (split 1, split 2, … split n) to supply multiple processors; each split feeds a Map, and the Map outputs feed the Reduce tasks.]
Large scale data splits
[Figure: large-scale data splits feed Map tasks that emit <key, 1>; a parse-hash step routes each key to one of the reducers (say, Count), which write the partitions P-0000, P-0001 and P-0002 with the counts count1, count2 and count3.]
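The parse-hash routing can be sketched as a partition function that sends every occurrence of a key to the same reducer. This is an illustrative sketch: the choice of zlib.crc32 as the hash, the number of reducers, and the function names are assumptions, not what Hadoop or Google's runtime uses.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick the reducer for a key; a stable hash keeps the choice repeatable."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_reducers(pairs, num_reducers=3):
    """Route each <key, 1> pair to its reducer and let that reducer count it."""
    parts = [Counter() for _ in range(num_reducers)]
    for key, value in pairs:
        parts[partition(key, num_reducers)][key] += value
    return parts


pairs = [(w, 1) for w in "web weed green sun moon land part web green".split()]
parts = run_reducers(pairs)
for i, part in enumerate(parts):
    print(f"P-{i:04d}", dict(part))
```

Because a given key always hashes to the same partition, each output file holds the complete count for its keys and no cross-reducer merge is needed.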
MapReduce Example in my operating systems class
[Figure: the input words (Cat, Bat, Dog, other words; size: TByte) are divided into splits; each split goes through map and then combine; the reduce tasks write part0, part1 and part2.]
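The combine step in this pipeline pre-aggregates each map task's output before it is sent to the reducers. The sketch below is a single-process illustration with hypothetical function names and made-up sample splits; it only shows the reduction in intermediate pairs, not a real distributed run.

```python
from collections import Counter


def map_words(split):
    """Map: one <word, 1> pair per word in the split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Combine: sum the pairs of a single map task locally, before any transfer."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_all(combined_outputs):
    """Reduce: add up the partial counts received from every combiner."""
    total = Counter()
    for pairs in combined_outputs:
        for key, value in pairs:
            total[key] += value
    return total


splits = ["cat bat cat dog", "dog cat bat bat", "dog dog cat"]
mapped = [map_words(s) for s in splits]
combined = [combine(p) for p in mapped]
totals = reduce_all(combined)

print(sum(len(p) for p in mapped), "pairs before combine")
print(sum(len(p) for p in combined), "pairs after combine")
print(dict(totals))
```

Combining works here because addition is associative and commutative, so partial sums give the same totals as summing the raw pairs.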
MapReduce Programming Model
MapReduce Characteristics
Very large scale data: petabytes, exabytes
Write-once, read-many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition (out of the scope
of this talk).
All map tasks must be completed before the reduce operation starts.
Map and reduce operations are typically performed by the same physical processor.
The numbers of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
The runtime takes care of splitting and moving data for operations.
Special distributed file system. Example: Hadoop Distributed File System and Hadoop
Runtime.
Classes of problems “mapreducable”
Benchmark for comparing: Jim Gray’s challenge on data-intensive
computing. Ex: “Sort”
Google uses it (we think) for wordcount, adwords, pagerank, indexing
data.
Simple algorithms such as grep, text-indexing, reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations: demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extra-terrestrial objects.
Expected to play a critical role in semantic web and web3.0
Scope of MapReduce
[Figure: levels of parallelism by data size. Data size: small corresponds to pipelined, instruction-level parallelism.]
Hadoop
What is Hadoop?
At Google, MapReduce operations are run on a special file system called
Google File System (GFS) that is highly optimized for this purpose.
GFS is not open source.
Doug Cutting and Yahoo! reverse engineered GFS and called the result
Hadoop Distributed File System (HDFS).
The software framework that supports HDFS, MapReduce and other
related entities is called the Hadoop project or simply Hadoop.
It is open source and distributed by Apache.
Basic Features: HDFS
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
Hadoop Distributed File System
[Figure: an Application uses the HDFS Client, which sits on the local file system (block size: 2K) and talks to the HDFS Server master node (Name Nodes); HDFS blocks (block size: 128M) are replicated.]
More details: We discuss this in great detail in my Operating
Systems course
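To make the block size and replication concrete, here is a toy Python sketch of how a file divides into 128M blocks and how a blockmap could record replica locations. The replication factor of 3 and the round-robin placement are assumptions made for illustration; real HDFS placement also considers racks and node load. The data node names are the ones from the demo cluster described later in the talk.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128M HDFS block size, as on the slide
REPLICATION = 3                  # assumed replication factor


def block_count(file_size):
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // BLOCK_SIZE)


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy blockmap: block id -> data nodes holding a replica (round-robin)."""
    blockmap = {}
    for block in range(num_blocks):
        blockmap[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return blockmap


one_gib = 1024 ** 3
data_nodes = ["hermes", "dionysus", "aphrodite", "athena"]
blockmap = place_replicas(block_count(one_gib), data_nodes)

print(block_count(one_gib), "blocks")
for block, nodes in blockmap.items():
    print(block, nodes)
```

With every block stored on several nodes, the loss of any single data node leaves at least one copy of each block readable.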
Hadoop Distributed File System
[Figure: the same HDFS architecture, with a blockmap shown at the HDFS Server master node.]
Relevance and Impact on Undergraduate courses
Data structures and algorithms: a new look at traditional algorithms
such as sort: Quicksort may not be your choice! It is not easily
parallelizable. Merge sort is better.
You can identify mappers and reducers among your algorithms.
Mappers and reducers are simply placeholders for algorithms
relevant for your applications.
Large-scale data and analytics are indeed concepts to reckon with,
similar to how we addressed “programming in the large” with OO
concepts.
While a full course on MR/HDFS may not be warranted, the concepts
can perhaps be woven into most courses in our CS curriculum.
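The remark about merge sort can be illustrated by treating "sort each split" as the map step and "merge the sorted runs" as the reduce step. This is a classroom-style sketch with hypothetical function names and made-up sample data, run in a single process.

```python
import heapq


def sort_split(split):
    """Map-like step: sort one split independently of all the others."""
    return sorted(split)


def merge_runs(sorted_runs):
    """Reduce-like step: k-way merge of runs that are already sorted."""
    return list(heapq.merge(*sorted_runs))


splits = [[9, 3, 7], [4, 8, 1], [6, 2, 5]]
runs = [sort_split(s) for s in splits]   # these calls could run in parallel
result = merge_runs(runs)
print(result)
```

The splits never need to see each other until the final merge, which is the property that makes the merge-based approach easy to distribute.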
Demo
VMware simulated Hadoop and MapReduce demo
Remote access to NEXOS system at my Buffalo office
5-node cluster running HDFS on Ubuntu 8.04
1 name node and 4 data nodes
Each is an old commodity PC with 512 MB RAM, 120GB – 160GB
external memory
Name node: zeus; data nodes: hermes, dionysus, aphrodite, athena
Summary
We introduced MapReduce programming model for processing large
scale data
We discussed the supporting Hadoop Distributed File System
The concepts were illustrated using a simple example
We reviewed some important parts of the source code for the example.
Relationship to Cloud Computing
References
1. Apache Hadoop Tutorial: http://hadoop.apache.org
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified
data processing on large clusters. Communications of the
ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball:
http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html