BDA Lec5
[Figure: Spark, MapReduce, Hadoop File System, Large-scale Data Mining/ML, Streaming]
Big Data Analytics (In short)
What will we learn in this lecture?
01. Distributed Computing
02. MapReduce
○ They are not as powerful as traditional parallel computers and are often built
out of less specialized nodes.
Distributed File System
◾ Master node
[Figure: chunks C0, C1, C2, C3, C5, D0, D1, … replicated across nodes]
◾ HDFS stores data in blocks of 64 MB or 128 MB (the default is 128 MB).
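To make the block size concrete, here is a minimal sketch of how many blocks a file would be split into. The helper name `num_blocks` is hypothetical and is not part of any HDFS API; it only does the ceiling division implied by a fixed block size.

```python
MB = 1024 * 1024  # bytes in one megabyte


def num_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return how many fixed-size blocks are needed to hold the file.

    Hypothetical helper for illustration; uses integer ceiling division.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    # A 1 GB file with the 128 MB default versus a 64 MB block size.
    print(num_blocks(1024 * MB))           # 8
    print(num_blocks(1024 * MB, 64 * MB))  # 16
    # A 300 MB file needs a third, partially filled block.
    print(num_blocks(300 * MB))            # 3
```

Under these assumptions, halving the block size doubles the number of blocks the file system has to track for the same file.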
Hadoop Distributed File System
◾ We’re going to consider the case of creating a new file, writing data to it, then closing the
file.
02. MapReduce
MapReduce: Overview
● Easy as 1, 2!
○ Step 1: Map Step 2: Reduce
● Easy as 1, 2, 3!
○ Step 1: Map Step 2: Sort / Group by Step 3: Reduce
Programming Model: MapReduce
◾ MapReduce is a style of programming
(programming model) designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
MapReduce: Overview
◾ 3 steps of MapReduce
◾ 1. Map (written by programmer):
▪ Apply a user-written Map function to each input element
▪ Mapper applies the Map function to a single element
▪ Many mappers grouped in a Map task (the unit of parallelism)
▪ The output of the Map function is a set of 0, 1, or more key-value pairs.
The outline stays the same; Map and Reduce change to fit the problem.
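As an illustration of the Map step, here is a minimal sketch of a user-written Map function. Word counting is an assumed example problem, and the name `map_word_count` is hypothetical. The function takes a single input element (one line of text) and emits 0, 1, or more key-value pairs.

```python
def map_word_count(line):
    """Map function for an assumed word-count problem.

    Input: one element (a line of text).
    Output: zero or more (word, 1) key-value pairs.
    """
    for word in line.lower().split():
        yield (word, 1)


if __name__ == "__main__":
    print(list(map_word_count("")))             # no pairs for an empty line
    print(list(map_word_count("Hadoop")))       # one pair
    print(list(map_word_count("map the map")))  # several pairs
```

A mapper would call this function once per element; the pairs are not combined here, since grouping and reducing happen in later steps.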
MapReduce: In-Parallel
Map: Reads input and produces a set of key-value pairs
Group by key: Collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: Collects all values belonging to each key and outputs the result
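The three phases can be sketched on a single machine as follows. This is only an in-memory illustration, not how a cluster framework is implemented; `run_mapreduce`, `map_max_temp`, and `reduce_max_temp`, as well as the "city,temperature" record format, are assumptions made for the example.

```python
from collections import defaultdict


def map_max_temp(record):
    """Map: parse an assumed 'city,temperature' record into one (city, temp) pair."""
    city, temp = record.split(",")
    yield (city.strip(), int(temp))


def reduce_max_temp(key, values):
    """Reduce: collect all values for one key and output their maximum."""
    return (key, max(values))


def run_mapreduce(records, map_fn, reduce_fn):
    """Run Map, Group by key, and Reduce sequentially in memory."""
    # Map: apply the Map function to every input element.
    pairs = [pair for record in records for pair in map_fn(record)]

    # Group by key: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: process keys in sorted order, one call per key.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    records = ["seoul,3", "busan,7", "seoul,9"]
    print(run_mapreduce(records, map_max_temp, reduce_max_temp))
```

Only the two small functions are specific to the problem; the driver is unchanged when a different Map and Reduce pair is passed in.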
MapReduce: In-Parallel
The phases of MapReduce are distributed, with many tasks doing the work in parallel.
The partitioning function determines which record goes to which reducer.
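A minimal sketch of a partitioning function, assuming the common "hash of the key modulo the number of reducers" scheme. The names `partition` and `shuffle` are hypothetical, and CRC32 is used here only because it gives the same result on every run (Python's built-in `hash` for strings is randomized per process).

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reducer that should receive this key.

    Assumed scheme: deterministic hash of the key, modulo the reducer count.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("emu", 1)]
    for index, bucket in enumerate(shuffle(pairs, 3)):
        print(index, bucket)
```

Because the function depends only on the key, every pair with the same key lands at the same reducer, which is what lets each reducer see all values for its keys.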