
Chapter 4: Introduction to Distributed Platforms & MapReduce
Outline
 Motivation
 Why distribution?
 Distributed computing for data mining
 Scalability
 Popular distributed platforms
 Hadoop
 Spark

Why Distribution?
Motivation
 Why distribution?
 Big data: volume, velocity, variety
 We need more storage space
 Space efficiency
 We need more computing power
 Time efficiency
 We need more I/O throughput
 Inevitable because of the computer architecture
Divide and Conquer

(Diagram: the "Work" is partitioned into units w1, w2, w3; each unit is handled by a worker, producing partial results r1, r2, r3, which are combined into the final "Result")
Parallelization Challenges
 How do we assign work units to workers?
 What if we have more work units than
workers?
 What if workers need to share partial results?
 How do we aggregate partial results?
 How do we know all the workers have
finished?
 What if workers die?

What’s the common theme of all these problems?


Common Theme?
 Parallelization problems arise from:
 Communication between workers (e.g., to
exchange state)
 Access to shared resources (e.g., data)
 Thus, we need a synchronization mechanism
Managing Multiple Workers
 Difficult because
 We don’t know the order in which workers run
 We don’t know when workers interrupt each other
 We don’t know when workers need to communicate partial results
 We don’t know the order in which workers access shared data
 Thus, we need:
 Semaphores (lock, unlock)
 Condition variables (wait, notify, broadcast)
 Barriers
 Still, lots of problems:
 Deadlock, livelock, race conditions...
 Dining philosophers, sleeping barbers, cigarette smokers...
 Moral of the story: be careful!
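To make these primitives concrete, here is a minimal illustrative sketch (my own example, not from the lecture) using Python's threading module: a lock protects the shared partial results, and a barrier tells the main thread that all workers have finished.

import threading

NUM_WORKERS = 3
results = {}                                 # shared resource: partial results
results_lock = threading.Lock()              # protects the shared dict
done = threading.Barrier(NUM_WORKERS + 1)    # main thread waits for all workers

def worker(worker_id, work_unit):
    partial = sum(work_unit)                 # stand-in for real per-worker work
    with results_lock:                       # synchronized access to shared state
        results[worker_id] = partial
    done.wait()                              # signal "I am finished"

work_units = [[1, 2], [3, 4], [5, 6]]
for i, unit in enumerate(work_units):
    threading.Thread(target=worker, args=(i, unit)).start()

done.wait()                                  # block until every worker arrives
print(sum(results.values()))                 # aggregate partial results -> 21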
Current Tools
 Programming models
 Shared memory (pthreads): processes P1…P5 read and write a single shared memory
 Message passing (MPI): processes P1…P5 exchange explicit messages
 Design Patterns
 Master-slaves: a master hands work to slave workers via a shared work queue
 Producer-consumer flows
 Shared work queues
Where the rubber meets the road
 Concurrency is difficult to reason about
 Concurrency is even more difficult to reason about
 At the scale of datacenters and across datacenters
 In the presence of failures
 In terms of multiple interacting services
 Not to mention debugging…
 The reality:
 Lots of one-off solutions, custom code
 Write your own dedicated library, then program with it
 Burden on the programmer to explicitly manage
everything
“Big Ideas”
 Scale “out”, not “up”
 Limits of SMP and large shared-memory machines
 Symmetric multiprocessing (SMP) is a computing architecture in which two or more processors are attached to a single shared memory
 Move processing to the data
 Cluster has limited bandwidth
 Process data sequentially, avoid random access
 Seeks are expensive, disk throughput is reasonable
 Seamless scalability
 From the mythical man-month to the tradable
machine-hour
Scaling “up” vs. “out”
 No single machine is large enough
 Smaller cluster of large SMP machines vs. larger
cluster of commodity machines (e.g., 16 128-core
machines vs. 128 16-core machines)
 Nodes need to talk to each other!
 Intra-node latencies: ~100 ns
 Inter-node latencies: ~100 s
Distributed Computing for
Data Mining
Large-scale Computing
 Large-scale computing for data mining
problems on commodity hardware
 Challenges:
 How do you distribute computation?
 How can we make it easy to write distributed
programs?
 Machines fail:
 One server may stay up 3 years (1,000 days)
 If you have 1,000 servers, expect to lose one per day
 With 1M machines, 1,000 machines fail every day!
An Idea and a Solution
 Issue:
 Copying data over a network takes time
 Idea:
 Bring computation to data
 Store files multiple times for reliability
 Spark/Hadoop address these problems
 Storage Infrastructure – File system
 Google: GFS. Hadoop: HDFS
 Programming model
 MapReduce
 Spark
Storage Infrastructure
 Problem:
 If nodes fail, how to store data persistently?
 Answer:
 Distributed File System
 Provides global file namespace
 Typical usage pattern:
 Huge files (100s of GB to TB)
 Data is rarely updated in place
 Reads and appends are common

Distributed File System
 Chunk servers
 File is split into contiguous chunks
 Typically each chunk is 16-64MB
 Each chunk replicated (usually 2x or 3x)
 Try to keep replicas in different racks
 Master node
 a.k.a. Name Node in Hadoop’s HDFS
 Stores metadata about where files are stored
 Might be replicated
 Client library for file access
 Talks to master to find chunk servers
 Connects directly to chunk servers to access data
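As a back-of-the-envelope sketch (hypothetical file size; chunk size and replication factor taken from the figures above), chunking and replication cost roughly:

import math

file_size_mb  = 10 * 1024      # a hypothetical 10 GB file
chunk_size_mb = 64             # typical chunk size (16-64 MB)
replication   = 3              # each chunk replicated 3x

chunks   = math.ceil(file_size_mb / chunk_size_mb)   # 160 chunks
replicas = chunks * replication                      # 480 chunk copies
raw_storage_gb = replicas * chunk_size_mb / 1024     # ~30 GB of raw disk

print(chunks, replicas, raw_storage_gb)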
Distributed File System
 Reliable distributed file system
 Data kept in “chunks” spread across machines
 Each chunk replicated on different machines
 Seamless recovery from disk or machine failure

(Diagram: chunks C0, C1, C2, C3, C5, D0, D1, … replicated across Chunk servers 1, 2, 3, …, N)

 Bring computation directly to the data!
 Chunk servers also serve as compute servers
Google File System (GFS) Architecture

HDFS Architecture – Master/Slave
Typical Big Data Problem
 Iterate over a large number of records
 Extract something of interest from each (Map)
 Shuffle and sort intermediate results
 Aggregate intermediate results (Reduce)
 Generate final output

 Key idea: provide a functional abstraction for these two operations
Roots in Functional Programming

(Diagram: Map applies a function f independently to every element of a list; Fold/Reduce applies a function g to accumulate a result across the elements)
Programming Model
 MapReduce is a style of programming
designed for:
1. Easy parallel programming
2. Invisible management of hardware and software
failures
3. Easy management of very-large-scale data

 It has several implementations, including Hadoop, Spark (used in this class), Flink, and the original Google implementation just called "MapReduce"
MapReduce Implementations
 Google has a proprietary implementation in C++
 Bindings in Java, Python
 Hadoop is an open-source implementation in Java
 Development led by Yahoo, now an Apache project
 Used in production at Yahoo, Facebook, Twitter,
LinkedIn, Netflix, …
 The de facto big data processing platform
 Large and expanding software ecosystem
 Lots of custom research implementations
 For GPUs, cell processors, etc.
(Diagram: MapReduce execution. (1) The user program submits a job to the master. (2) The master schedules map tasks and reduce tasks on workers. (3) Map workers read their input splits. (4) Map output is written to local disk as intermediate files. (5) Reduce workers remotely read the intermediate files. (6) Reduce workers write the output files.)
MapReduce: Overview
3 steps of MapReduce
 Map:
 Apply a user-written Map function to each input element
 Mapper applies the Map function to a single element
 Many mappers grouped in a Map task (the unit of parallelism)
 The output of the Map function is a set of 0, 1, or more key-value
pairs.
 Group by key: Sort and shuffle
 System sorts all the key-value pairs by key, and
outputs key-(list of values) pairs
 Reduce:
 User-written Reduce function is applied to each
key-(list of values)
Outline stays the same, Map and Reduce change to fit the problem
MapReduce: The Map Step

(Diagram: each input key-value pair (k, v) is passed through the map function, producing zero or more intermediate key-value pairs)
MapReduce: The Reduce Step

(Diagram: intermediate key-value pairs are grouped by key into key-(list of values) groups; the reduce function turns each group into output key-value pairs)
Map-Reduce: A diagram

 MAP: read the input and produce a set of key-value pairs
 Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
 Reduce: collect all values belonging to each key and output the result
Map-Reduce: In Parallel

All phases are distributed with many tasks doing the work
MapReduce Pattern

(Diagram: input is split across Mappers, which emit key-value pairs; the pairs are routed to Reducers, which produce the output)
Example: Word Counting
Example MapReduce task:
 We have a huge text document
 Count the number of times each
distinct word appears in the file
 Many applications of this:
 Analyze web server logs to find popular URLs
 Statistical machine translation:
 Need to count number of times every 5-word sequence
occurs in a large corpus of documents

MapReduce: Word Counting
The Map and Reduce functions are provided by the programmer; the group-by-key step is provided by the system.

 Input: a big text document, e.g.
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. The work we're doing now -- the robotics we're doing -- is what we're going to need …"
(The big document is read sequentially; only sequential reads are needed.)
 MAP: read the input and produce a set of key-value pairs:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …
 Group by key: collect all pairs with the same key:
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), …
 Reduce: collect all values belonging to each key and output:
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
Word Count Using MapReduce

map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
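The same logic as a small runnable Python sketch (my own illustration of the pseudocode above), simulating the three phases in memory:

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield (word, 1)                      # emit(w, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))               # emit(key, result)

documents = {"doc1": "the crew of the space shuttle",
             "doc2": "the space shuttle Endeavor"}

# Map phase
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]

# Group by key (shuffle and sort)
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase
print([reduce_fn(w, c) for w, c in sorted(groups.items())])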

MapReduce: Environment
MapReduce environment takes care of:
 Partitioning the input data
 Scheduling the program’s execution across a
set of machines
 Performing the group by key step
 In practice this is the bottleneck
 Handling machine failures
 Managing required inter-machine
communication

Dealing with Failures
 Map worker failure
 Map tasks completed or in-progress at
worker are reset to idle and rescheduled
 Reduce workers are notified when map task is
rescheduled on another worker
 Reduce worker failure
 Only in-progress tasks are reset to idle and the
reduce task is restarted

Putting everything together…

(Diagram: Hadoop 1 cluster. The namenode runs the namenode daemon; the job submission node runs the jobtracker; each slave node runs a tasktracker and a datanode daemon on top of its local Linux file system.)

(Not Quite… We’ll come back to YARN later)


Major Differences between Hadoop 1 and 2

 HDFS Federation
 Resource manager YARN
Hadoop 2: HDFS Federation
 Two major components in HDFS
 Namespaces
 Blocks storage service
 Hadoop 1
 A single namenode manages the entire namespace
 HDFS federation
 Multiple namenode servers manage namespaces
 Horizontal scaling, performance improvement,
multiple namespaces
Hadoop 2: YARN
 Brings significant performance improvements
for some applications
 Supports additional processing models
 Implements a more flexible execution engine
YARN: from Hadoop 2
What is YARN?
 A resource manager, split out of the MapReduce engine that bundled it in Hadoop 1
 The operating system of Hadoop
 Managing and monitoring workloads
 Maintaining a multi-tenant environment
 Implementing security controls
 Managing high-availability features of Hadoop
 No longer limited to the I/O intensive, high-
latency MapReduce model
Spark
Problems with MapReduce
 Two major limitations of MapReduce:
 Difficulty of programming directly in MR
 Many problems aren’t easily described as map-reduce
 Performance bottlenecks, or batch not fitting the
use cases
 Persistence to disk typically slower than in-memory work

 In short, MR doesn’t compose well for large applications
 Many times one needs to chain multiple map-reduce steps
Data-Flow Systems
 MapReduce uses two “ranks” of tasks:
One for Map, the second for Reduce
 Data flows from the first rank to the second

 Data-Flow Systems generalize this in two ways:


1. Allow any number of tasks/ranks
2. Allow functions other than Map and Reduce
 As long as data flow is in one direction only, we can
have the blocking property and allow recovery of
tasks rather than whole jobs

Spark: Most Popular Data-Flow System

 Expressive computing system, not limited to the map-reduce model
 Additions to MapReduce model:
 Fast data sharing
 Avoids saving intermediate results to disk
 Caches data for repetitive queries (e.g. for machine learning)
 General execution graphs (DAGs)
 Richer functions than just map and reduce
 Compatible with Hadoop

Spark Architecture
Components
 Spark applications
 Independent sets of processes on a cluster,
coordinated by the SparkContext object in main
(Driver) program
 Cluster managers
 Allocate resources across applications
 Executors
 Processes that run computations and store data
 Spark sends your application code (JAR or python
files) to executors, and SparkContext sends tasks
to executors to run
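A minimal driver-program sketch (assuming pyspark is installed; the application name and master URL are illustrative) showing how the SparkContext in the driver coordinates executors:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("my-app")        # hypothetical application name
        .setMaster("local[2]"))      # or e.g. "spark://host:7077", "yarn"

sc = SparkContext(conf=conf)         # driver-side coordinator

# The driver ships code and tasks to executors, e.g.:
rdd = sc.parallelize(range(10))
print(rdd.sum())                     # 45

sc.stop()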
Note about the architecture
 Each application has its own executor processes
 Benefit: isolating applications from each other
 Cons: data cannot be shared across different applications
without writing to external storage
 Spark is agnostic to the underlying cluster manager
 Driver program must listen for and accept incoming
connections from its executors throughout its lifetime
 Driver must be network addressable from the workers
 Driver program should be run close to the worker
nodes, since it schedules tasks on the cluster
Cluster Manager Types
 Standalone: a simple cluster manager included in Spark
 Faster job startup, but it doesn’t support communication with an HDFS
secured with Kerberos authentication protocol
 Apache Mesos: a general cluster manager that can also run Hadoop MapReduce
 A scalable and fault-tolerant “distributed systems kernel” written in C++, which also supports C++ and Python applications
 It is actually a “scheduler of scheduler frameworks” because of its two-level scheduling architecture
 Hadoop YARN: resource manager in Hadoop 2
 YARN lets you run different types of Java applications, not just Spark
 It also provides methods for isolating and prioritizing applications
among users and organizations
 The only cluster type that supports Kerberos-secured HDFS
 You don’t have to install Spark on all nodes in the cluster
Spark Standalone Cluster
 Two deploy modes for Spark Standalone
cluster
 Client mode: driver is launched in the same
process as the client that submits the application
 Cluster mode: driver is launched inside one of the Worker processes in the cluster, and the client exits as soon as it has submitted the application
Client mode vs. Cluster mode

Spark on YARN cluster mode
Spark on YARN client mode
Spark: RDD
Key concept: Resilient Distributed Dataset (RDD)
 Partitioned collection of records
 Generalizes (key-value) pairs
 Spread across the cluster, Read-only
 Caching dataset in memory
 Different storage levels available
 Fallback to disk possible

 RDDs can be created from Hadoop input sources, or by transforming other RDDs (you can stack RDDs)
 RDDs are best suited for applications that apply the same operation to all elements of a dataset
Spark RDD Operations
 Transformations build RDDs through
deterministic operations on other RDDs:
 Transformations include map, filter, join, union,
intersection, distinct
 Lazy evaluation: Nothing computed until an action
requires it

 Actions to return value or export data


 Actions include count, collect, reduce, save
 Actions can be applied to RDDs; actions force
calculations and return values
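An illustrative sketch (assuming the SparkContext sc created earlier) of lazy transformations followed by actions:

rdd = sc.parallelize(["to be or not to be", "to thine own self be true"])

# Transformations: build new RDDs, nothing is computed yet
words  = rdd.flatMap(lambda line: line.split())
pairs  = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.cache()                       # keep the result in memory for reuse

# Actions: trigger the actual computation and return values
print(counts.count())                # number of distinct words
print(counts.collect())              # list of (word, count) pairs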
Task Scheduler: General DAGs
(Diagram: an example job DAG over RDDs A-F, split into stages: Stage 1 ends in a groupBy, Stage 2 applies map and filter, and Stage 3 performs a join; cached partitions are marked)

 Supports general task graphs


 Pipelines functions where possible
 Cache-aware data reuse & locality
 Partitioning-aware to avoid shuffles
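A small sketch of such a DAG (again assuming sc; the data is made up): map and filter are pipelined inside one stage, groupByKey and join introduce shuffle boundaries, and the cached RDD is reused by the second action without recomputation.

logs  = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 2)])
users = sc.parallelize([("alice", "US"), ("bob", "DE")])

cleaned = (logs.map(lambda kv: (kv[0], kv[1] * 2))   # pipelined with
               .filter(lambda kv: kv[1] > 4))        # the map above
grouped = cleaned.groupByKey()                       # stage boundary (shuffle)
joined  = grouped.join(users).cache()                # cache for reuse

print(joined.count())     # first action: runs the whole DAG
print(joined.collect())   # second action: reuses the cached result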
Data Analytics Software Stack

Problems Suited for
MapReduce
Example: Host size
 Suppose we have a large web corpus
 Look at the metadata file
 Lines of the form: (URL, size, date, …)
 For each host, find the total number of bytes
 That is, the sum of the page sizes for all URLs from
that particular host
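A hedged sketch of this host-size example (made-up metadata records), in the same simulate-it-in-Python style as the word count:

from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical metadata lines of the form (URL, size in bytes, date)
metadata = [("http://a.com/x", 1000, "2024-01-01"),
            ("http://a.com/y", 2500, "2024-01-02"),
            ("http://b.org/z",  700, "2024-01-02")]

# Map: emit (host, size) for every record
intermediate = [(urlparse(url).netloc, size) for url, size, _ in metadata]

# Group by key and Reduce: sum the page sizes per host
totals = defaultdict(int)
for host, size in intermediate:
    totals[host] += size

print(dict(totals))       # {'a.com': 3500, 'b.org': 700}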

 Other examples:
 Link analysis and graph processing
 Machine Learning algorithms

Example: Language Model
 Statistical machine translation:
 Need to count number of times every 5-word
sequence occurs in a large corpus of documents

 Very easy with MapReduce:


 Map:
 Extract (5-word sequence, count) from document
 Reduce:
 Combine the counts
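An illustrative sketch of the Map side (the Reduce side is the same summing reducer as in word count):

def map_5grams(doc_id, text):
    words = text.split()
    for i in range(len(words) - 4):
        yield (" ".join(words[i:i + 5]), 1)   # emit (5-word sequence, 1)

doc = "the quick brown fox jumps over the lazy dog"
print(list(map_5grams("doc1", doc))[:2])
# [('the quick brown fox jumps', 1), ('quick brown fox jumps over', 1)]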

Example: Join By Map-Reduce
 Compute the natural join R(A,B) ⋈ S(B,C)
 R and S are each stored in files
 Tuples are pairs (a,b) or (b,c)

      R                S               R ⋈ S
   A    B           B    C           A    B    C
   a1   b1          b2   c1          a3   b2   c1
   a2   b1    ⋈     b2   c2    =     a3   b2   c2
   a3   b2          b3   c3          a4   b3   c3
   a4   b3
Map-Reduce Join
 Use a hash function h from B-values to 1...k
 A Map process turns:
 Each input tuple R(a,b) into key-value pair (b,(a,R))
 Each input tuple S(b,c) into (b,(c,S))

 Map processes send each key-value pair with key b to Reduce process h(b)
 Hadoop does this automatically; just tell it what k is.
 Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and outputs (a,b,c).
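A plain-Python sketch of this repartition join for the small R and S from the previous slide (hash(b) % k stands in for the function h):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]
k = 2                                        # number of Reduce processes

# Map: tag each tuple with its relation and route it by h(b)
reducers = defaultdict(list)
for a, b in R:
    reducers[hash(b) % k].append((b, (a, "R")))
for b, c in S:
    reducers[hash(b) % k].append((b, (c, "S")))

# Reduce: within each reducer, match R-tuples and S-tuples sharing the same b
output = []
for pairs in reducers.values():
    r_side = defaultdict(list)
    s_side = defaultdict(list)
    for b, (x, tag) in pairs:
        (r_side if tag == "R" else s_side)[b].append(x)
    for b in r_side:
        output.extend((a, b, c) for a in r_side[b] for c in s_side[b])

print(sorted(output))  # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]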
Problems NOT suitable for MapReduce

 MapReduce is great for:


 Problems that require sequential data access
 Large batch jobs (not interactive, real-time)

 MapReduce is inefficient for problems where


random (or irregular) access to data required:
 Graphs
 Interdependent data
 Machine learning
 Comparisons of many pairs of items

Cost Measures for Algorithms
 In MapReduce we quantify the cost of an
algorithm using
1. Communication cost = total I/O of all
processes
2. Elapsed communication cost = max of I/O
along any path
3. (Elapsed) computation cost analogous, but
count only running time of processes
Note that here the big-O notation is not the most useful
(adding more machines is always an option)

Example: Cost Measures
 For a map-reduce algorithm:
 Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
 Elapsed communication cost is the sum of the
largest input + output for any map process, plus
the same for any reduce process
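As a quick worked illustration with made-up sizes: if the input files total 15 GB, the Map output shipped to the reducers is also about 15 GB, and the Reduce output is 2 GB, then the communication cost is roughly 15 + 2 × 15 + 2 = 47 GB of total I/O.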

What Cost Measures Mean
 Either the I/O (communication) or processing
(computation) cost dominates
 Ignore one or the other

 Total cost tells what you pay in rent from


your friendly neighborhood cloud

 Elapsed cost is wall-clock time using


parallelism

Cost of Map-Reduce Join
 Total communication cost
= O(|R|+|S|+|R ⋈ S|)
 Elapsed communication cost = O(s)
 We’re going to pick k and the number of Map
processes so that the I/O limit s is respected
 We put a limit s on the amount of input or output that
any one process can have. s could be:
 What fits in main memory
 What fits on local disk
 With proper indexes, computation cost is linear
in the input + output size
 So computation cost behaves like the communication cost
Thanks for Your Attention!

References
 Jure Leskovec, Anand Rajaraman, Jeffrey
David Ullman, Mining of Massive Datasets,
2nd Edition, Cambridge University Press,
November 2014. (Ch.1-2)
 Slides from Stanford CS246 “Mining Massive
Datasets” (by Jure Leskovec)
 Jimmy Lin and Chris Dyer, “Data-Intensive Text
Processing with MapReduce”. (Ch.1-2)
 Slides from Jimmy Lin’s “Big Data Infrastructure”
course, Univ. Maryland (and Univ. Waterloo)

References
 Papers:
 GFS: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System”, SOSP 2003
 MapReduce: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004
