Chapter 4
Why Distribution?
Motivation
Big data: volume, velocity, variety
We need more storage space
Space efficiency
We need more computing power
Time efficiency
We need more I/O throughput
Distribution is inevitable given the limits of single-machine computer architecture
Divide and Conquer
[Diagram: the "Work" is partitioned into units w1, w2, w3; workers produce partial results r1, r2, r3, which are combined into the "Result"]
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
Shared memory vs. message passing (MPI)
Design Patterns
Master-slaves
Producer-consumer flows
Shared work queues (a minimal sketch follows below)
[Diagram: a master coordinating slave processes P1-P5; producers and consumers connected by a shared work queue]
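To make the shared work queue pattern concrete, here is a minimal Python sketch (my illustration, not from the slides): a master thread enqueues work units, several worker threads consume them, and a sentinel value tells each worker when to stop.

import queue
import threading

work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Pull work units until the sentinel (None) arrives.
    while True:
        item = work_queue.get()
        if item is None:
            work_queue.task_done()
            break
        partial = item * item          # stand-in for real processing
        with results_lock:
            results.append(partial)
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for unit in range(10):                 # the "master" assigns work units
    work_queue.put(unit)
for _ in threads:                      # one sentinel per worker
    work_queue.put(None)

work_queue.join()                      # all work units (and sentinels) processed
print(sorted(results))                 # [0, 1, 4, 9, ..., 81]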
Where the rubber meets the road
Concurrency is difficult to reason about
Concurrency is even more difficult to reason about
At the scale of datacenters and across datacenters
In the presence of failures
In terms of multiple interacting services
Not to mention debugging…
The reality:
Lots of one-off solutions, custom code
Write your own dedicated library, then program with it
Burden on the programmer to explicitly manage everything
“Big Ideas”
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Symmetric multiprocessing (SMP) is a computing architecture in which two or more processors are attached to a single shared memory.
Move processing to the data
Cluster has limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
From the mythical man-month to the tradable machine-hour
Scaling “up” vs. “out”
No single machine is large enough
Smaller cluster of large SMP machines vs. larger cluster of commodity machines (e.g., 16 128-core machines vs. 128 16-core machines)
Nodes need to talk to each other!
Intra-node latencies: ~100 ns
Inter-node latencies: ~100 µs
Distributed Computing for Data Mining
Large-scale Computing
Large-scale computing for data mining problems on commodity hardware
Challenges:
How do you distribute computation?
How can we make it easy to write distributed programs?
Machines fail:
One server may stay up 3 years (~1,000 days)
If you have 1,000 servers, expect to lose one per day
With 1M machines, 1,000 machines fail every day!
An Idea and a Solution
Issue:
Copying data over a network takes time
Idea:
Bring computation to data
Store files multiple times for reliability
Spark/Hadoop address these problems
Storage Infrastructure – File system
Google: GFS. Hadoop: HDFS
Programming model
MapReduce
Spark
Storage Infrastructure
Problem:
If nodes fail, how to store data persistently?
Answer:
Distributed File System
Provides global file namespace
Typical usage pattern:
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
Distributed File System
Chunk servers
File is split into contiguous chunks
Typically each chunk is 16-64MB
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node
a.k.a. Name Node in Hadoop’s HDFS
Stores metadata about where files are stored
Might be replicated
Client library for file access
Talks to master to find chunk servers
Connects directly to chunk servers to access data (a conceptual sketch of this read path follows below)
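To illustrate the client read path described above, here is a purely conceptual Python sketch (my illustration); the Master, ChunkServer, and dfs_read names are hypothetical and are not the real GFS/HDFS client API.

class Master:
    # Stores metadata only: which chunk servers hold which chunk of which file.
    def __init__(self, chunk_locations):
        # e.g. {("bigfile.txt", 0): ["server-a", "server-b", "server-c"], ...}
        self.chunk_locations = chunk_locations
    def lookup(self, filename, chunk_index):
        return self.chunk_locations[(filename, chunk_index)]

class ChunkServer:
    # Holds the actual chunk bytes; chunks are replicated across servers.
    def __init__(self, chunks):
        self.chunks = chunks           # {(filename, chunk_index): bytes}
    def read(self, filename, chunk_index):
        return self.chunks[(filename, chunk_index)]

def dfs_read(master, servers, filename, num_chunks):
    # Client library: ask the master where chunks live, then read directly from chunk servers.
    data = b""
    for i in range(num_chunks):
        replicas = master.lookup(filename, i)            # metadata request to the master
        data += servers[replicas[0]].read(filename, i)   # data transfer bypasses the master
    return data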
Distributed File System
Reliable distributed file system
Data kept in “chunks” spread across machines
Each chunk replicated on different machines
Seamless recovery from disk or machine failure
[Diagram: chunks C0, C1, C2, C3, C5, D0, D1, … replicated across Chunk server 1, Chunk server 2, Chunk server 3, …, Chunk server N]
HDFS Architecture – Master/Slave
Typical Big Data Problem
Iterate over a large number of records
Map: extract something of interest from each
Shuffle and sort intermediate results
Reduce: aggregate intermediate results
Generate final output
[Diagram: Map applies a function f to every record; Fold aggregates the results with a function g; a minimal sketch follows below]
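As a minimal illustration of the map/fold pattern above (my example, not from the slides), f extracts a value from each record and g folds the extracted values into a running aggregate:

from functools import reduce

# Records: (URL, size-in-bytes) pairs; f extracts the size, g sums sizes.
records = [("example.com/a", 1200), ("example.com/b", 340), ("example.org/c", 980)]

f = lambda record: record[1]               # "extract something of interest from each"
g = lambda acc, value: acc + value         # "aggregate intermediate results"

intermediate = map(f, records)             # Map phase
total_bytes = reduce(g, intermediate, 0)   # Fold/Reduce phase
print(total_bytes)                         # 2520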
Programming Model
MapReduce is a style of programming designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
[Diagram: MapReduce execution overview: (1) the client submits the job to the Master; (2) the Master assigns map and reduce tasks to workers; (3) map workers read their input splits (split 0 to split 4); (4) intermediate results are written to local disk; (5) reduce workers remote-read the intermediate data; (6) reduce workers write the output files (output file 0, file 1)]
MapReduce: The Map Step
[Diagram: map tasks transform input key-value pairs (k, v) into intermediate key-value pairs]
MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, [v, v, …]), and each group is reduced to output key-value pairs]
Map-Reduce: A diagram
MAP: reads input and produces a set of key-value pairs
Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collects all values belonging to the key and outputs the result
Map-Reduce: In Parallel
All phases are distributed with many tasks doing the work
MapReduce Pattern
[Diagram: input key-value pairs flow through Mappers, are shuffled by key, and Reducers produce the output key-value pairs]
Example: Word Counting
Example MapReduce task:
We have a huge text document
Count the number of times each distinct word appears in the file
Many applications of this:
Analyze web server logs to find popular URLs
Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents
MapReduce: Word Counting
Provided by the programmer: MAP (read input and produce a set of key-value pairs) and Reduce (collect all values belonging to the key and output); the system performs Group by key (collect all pairs with the same key)
[Diagram: a big document ("The crew of the space shuttle Endeavor recently returned to Earth …") is read; MAP emits pairs such as (The, 1), (crew, 1), (of, 1), (space, 1), (the, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …; Group by key gathers, e.g., (crew, 1), (crew, 1) and (the, 1), (the, 1), (the, 1); Reduce outputs (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …]
Only sequential reads of the data are needed
Word Count Using MapReduce
map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
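The pseudocode above maps directly onto plain Python; the sketch below (my illustration, single-process only) simulates the map, group-by-key, and reduce phases for word counting.

from collections import defaultdict

def map_fn(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all partial counts for this word.
    yield (word, sum(counts))

def mapreduce(documents):
    # Map phase
    intermediate = []
    for doc_name, text in documents.items():
        intermediate.extend(map_fn(doc_name, text))
    # Group-by-key (shuffle) phase
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

docs = {"d1": "the crew of the space shuttle", "d2": "the shuttle returned"}
print(sorted(mapreduce(docs)))
# [('crew', 1), ('of', 1), ('returned', 1), ('shuttle', 2), ('space', 1), ('the', 3)]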
MapReduce: Environment
MapReduce environment takes care of:
Partitioning the input data
Scheduling the program’s execution across a set of machines
Performing the group-by-key step
In practice this is the bottleneck (a partitioning sketch follows below)
Handling machine failures
Managing required inter-machine communication
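The group-by-key step is typically implemented by hash-partitioning intermediate pairs across the reduce tasks. The sketch below is my illustration with a hypothetical number of reducers (real frameworks use a deterministic hash; Python's built-in hash is salted per process, which is fine for a single-run demo).

from collections import defaultdict

NUM_REDUCERS = 4   # hypothetical setting

def partition(key, num_reducers=NUM_REDUCERS):
    # Each intermediate key is routed to exactly one reducer.
    return hash(key) % num_reducers

def shuffle(intermediate_pairs):
    # Group pairs by destination reducer, then by key within each reducer.
    per_reducer = defaultdict(lambda: defaultdict(list))
    for key, value in intermediate_pairs:
        per_reducer[partition(key)][key].append(value)
    return per_reducer

pairs = [("the", 1), ("crew", 1), ("the", 1), ("space", 1)]
for reducer_id, groups in shuffle(pairs).items():
    print(reducer_id, dict(groups))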
Dealing with Failures
Map worker failure
Map tasks completed or in-progress at the worker are reset to idle and rescheduled
Reduce workers are notified when a map task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle and the reduce task is restarted
Putting everything together…
[Diagram: a Hadoop cluster of slave nodes, with HDFS Federation providing storage and YARN acting as the resource manager]
Hadoop 2: HDFS Federation
Two major components in HDFS:
Namespaces
Block storage service
Hadoop 1:
A single NameNode manages the entire namespace
HDFS Federation:
Multiple NameNode servers manage namespaces
Horizontal scaling, performance improvement, multiple namespaces
Hadoop 2: YARN
Brings significant performance improvements for some applications
Supports additional processing models
Implements a more flexible execution engine
YARN: from Hadoop 2
What is YARN?
A resource manager separated from MapReduce in Hadoop 1
The operating system of Hadoop:
Managing and monitoring workloads
Maintaining a multi-tenant environment
Implementing security controls
Managing high-availability features of Hadoop
No longer limited to the I/O-intensive, high-latency MapReduce model
Spark
Problems with MapReduce
Two major limitations of MapReduce:
Difficulty of programming directly in MR
Many problems aren’t easily described as map-reduce
Performance bottlenecks, or batch processing not fitting the use cases
Persistence to disk is typically slower than in-memory work
Spark: Most Popular Data-Flow System
Spark Architecture
Components
Spark applications
Independent sets of processes on a cluster, coordinated by the SparkContext object in the main (Driver) program
Cluster managers
Allocate resources across applications
Executors
Processes that run computations and store data
Spark sends your application code (JAR or Python files) to the executors, and SparkContext sends tasks to the executors to run (a minimal PySpark sketch follows below)
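A minimal PySpark driver program, assuming a local Spark installation (the master URL and application name are placeholders, not values from the slides):

from pyspark.sql import SparkSession

# Driver program: creates the SparkSession/SparkContext that coordinates executors.
spark = (SparkSession.builder
         .master("local[*]")          # placeholder: use spark://..., yarn, or mesos:// on a cluster
         .appName("architecture-demo")
         .getOrCreate())
sc = spark.sparkContext

# The driver describes the computation; the tasks run on executor processes.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()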
Note about the architecture
Each application has its own executor processes
Benefit: isolating applications from each other
Con: data cannot be shared across different applications without writing to external storage
Spark is agnostic to the underlying cluster manager
The driver program must listen for and accept incoming connections from its executors throughout its lifetime
The driver must be network addressable from the workers
The driver program should run close to the worker nodes, since it schedules tasks on the cluster
Cluster Manager Types
Standalone: a simple cluster manager included in Spark
Faster job startup, but it doesn’t support communication with an HDFS secured with the Kerberos authentication protocol
Apache Mesos: a general cluster manager that can also run Hadoop MapReduce
A scalable and fault-tolerant “distributed systems kernel” written in C++, which also supports C++ and Python applications
It is actually a “scheduler of scheduler frameworks” because of its two-level scheduling architecture
Hadoop YARN: the resource manager in Hadoop 2
YARN lets you run different types of Java applications, not just Spark
It also provides methods for isolating and prioritizing applications among users and organizations
The only cluster manager type that supports Kerberos-secured HDFS
You don’t have to install Spark on all nodes in the cluster
Spark Standalone Cluster
Two deploy modes for a Spark Standalone cluster:
Client mode: the driver is launched in the same process as the client that submits the application
Cluster mode: the driver is launched inside one of the Worker processes in the cluster, and the client exits as soon as it has submitted the application
Client mode vs. Cluster mode
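For reference, the deploy mode is selected with spark-submit's --deploy-mode flag; the master URL, class name, and JAR below are placeholders, not values from the slides.

# Client mode: the driver runs in the submitting process
spark-submit --master spark://master-host:7077 --deploy-mode client --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside a Worker process; the client returns once the application is submitted
spark-submit --master spark://master-host:7077 --deploy-mode cluster --class com.example.MyApp my-app.jar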
Spark on YARN cluster mode
Spark on YARN client mode
Spark: RDD
Key concept: Resilient Distributed Dataset (RDD)
Partitioned collection of records
Generalizes (key-value) pairs
Spread across the cluster, read-only
Datasets can be cached in memory
Different storage levels available
Fallback to disk possible (a minimal RDD sketch follows below)
[Diagram: an RDD lineage graph, with RDDs (C, D, E, F, …) connected by transformations such as groupBy and join and grouped into stages]
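A small PySpark sketch of these ideas (my illustration, assuming the sc from the earlier SparkSession example): the RDD is partitioned, transformed without mutation, and cached in memory.

# Assumes `sc` is an existing SparkContext (see the earlier SparkSession sketch).
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=4)  # partitioned collection of records

# Transformations build new (read-only) RDDs; nothing is mutated in place.
sums = pairs.reduceByKey(lambda x, y: x + y)

sums.cache()                 # keep the dataset in memory (MEMORY_ONLY by default)
print(sums.collect())        # e.g. [('a', 4), ('b', 2)]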
Problems Suited for MapReduce
Example: Host size
Suppose we have a large web corpus
Look at the metadata file
Lines of the form: (URL, size, date, …)
For each host, find the total number of bytes
That is, the sum of the page sizes for all URLs from that particular host (a sketch follows after this list)
Other examples:
Link analysis and graph processing
Machine learning algorithms
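A sketch of the host-size job in the same style as the word-count example (my illustration; the tab-separated record format and function names are assumptions):

from urllib.parse import urlparse

def map_fn(line):
    # Assumed line format: "URL<TAB>size<TAB>date..."
    url, size, *_ = line.split("\t")
    host = urlparse(url).netloc        # key each page by its host
    yield (host, int(size))

def reduce_fn(host, sizes):
    yield (host, sum(sizes))           # total bytes for all URLs from this host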
Example: Language Model
Statistical machine translation:
Need to count the number of times every 5-word sequence occurs in a large corpus of documents (a sketch follows below)
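In the word-count style above, a possible map function emits each 5-word sequence (5-gram) with a count of 1, and the reduce function sums the counts (my sketch, not from the slides):

def map_fn(doc_id, text):
    words = text.split()
    for i in range(len(words) - 4):
        five_gram = " ".join(words[i:i + 5])
        yield (five_gram, 1)

def reduce_fn(five_gram, counts):
    yield (five_gram, sum(counts))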
Example: Join By Map-Reduce
Compute the natural join R(A,B) ⋈ S(B,C)
R and S are each stored in files
Tuples are pairs (a,b) or (b,c)
R(A, B) = {(a1, b1), (a2, b1), (a3, b2), (a4, b3)}
S(B, C) = {(b2, c1), (b2, c2), (b3, c3)}
R ⋈ S = {(a3, c1), (a3, c2), (a4, c3)}
Map-Reduce Join
Use a hash function h from B-values to 1...k
A Map process turns:
Each input tuple R(a,b) into the key-value pair (b, (a, R))
Each input tuple S(b,c) into (b, (c, S))
Each key b and its associated values go to the Reduce process h(b), which pairs every (a, R) with every (c, S) to produce the joined tuples (a, b, c) (a sketch follows below)
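A single-process sketch of this reduce-side join (my illustration; the in-memory grouping stands in for the shuffle to Reduce process h(b)):

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

# Map: tag every tuple with its relation and key it by the B-value.
groups = defaultdict(lambda: {"R": [], "S": []})
for a, b in R:
    groups[b]["R"].append(a)
for b, c in S:
    groups[b]["S"].append(c)

# Reduce: for each B-value, pair every R-side value with every S-side value.
joined = [(a, b, c)
          for b, sides in groups.items()
          for a in sides["R"]
          for c in sides["S"]]
print(sorted(joined))   # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]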
Cost Measures for Algorithms
In MapReduce we quantify the cost of an algorithm using:
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation cost: analogous, but counting only the running time of processes
Note that here big-O notation is not the most useful (adding more machines is always an option)
Example: Cost Measures
For a map-reduce algorithm:
Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process
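As a small worked example with hypothetical numbers (not from the slides): if the input is 100 GB, the Map processes pass 40 GB of intermediate files to the Reduce processes, and the Reduce output is 10 GB, then the communication cost is 100 + 2 × 40 + 10 = 190 GB, since each intermediate file is both written by a Map process and read by a Reduce process.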
What Cost Measures Mean
Either the I/O (communication) or the processing (computation) cost dominates
Ignore one or the other
Cost of Map-Reduce Join
Total communication cost = O(|R| + |S| + |R ⋈ S|)
Elapsed communication cost = O(s)
We’re going to pick k and the number of Map processes so that the I/O limit s is respected
We put a limit s on the amount of input or output that any one process can have; s could be:
What fits in main memory
What fits on local disk
With proper indexes, the computation cost is linear in the input + output size
So the computation cost is like the communication cost
Thanks for Your Attention!
References
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman, Mining of Massive Datasets, 2nd Edition, Cambridge University Press, November 2014. (Ch. 1-2)
Slides from Stanford CS246 “Mining Massive Datasets” (by Jure Leskovec)
Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010. (Ch. 1-2)
Slides from Jimmy Lin’s “Big Data Infrastructure” course, Univ. of Maryland (and Univ. of Waterloo)
References
Papers:
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” SOSP 2003. (GFS)
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004. (MapReduce)