Week-8 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Content of this Lecture:
In this lecture, we will discuss 'Parameter Servers' and also discuss its Stale Synchronous Parallel Model.
Scalable machine learning spans three layers: algorithms, abstractions, and scalable systems.
Algorithms: Naïve Bayes, Rocchio, graph algorithms, SGD, sampling, graphical models [NIPS'09, NIPS'13]
ML Systems Landscape — abstractions:
• Dataflow systems: Hadoop & Spark
• Graph systems: GraphLab, TensorFlow
• Shared-memory systems: Bosen, DMTK, ParameterServer.org
Parameter server: the model parameters of the ML system are stored in a distributed hash table that is accessible through the network [NIPS'09, NIPS'13].
• Pull: query parts of the model
• Push: update parts of the model

Machine learning update equations — (stochastic) gradient descent, collapsed Gibbs sampling for topic modeling — aggregate push updates via addition (+).
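To see why push updates can be aggregated by addition, consider SGD: each worker pushes a delta Δw = −η·∇, and the server only ever applies "+=", so deltas from many workers can be merged in any order. A minimal single-process sketch in plain Python (all names illustrative, not any real system's API):

```python
# Sketch: additive aggregation of SGD updates, parameter-server style.
# Each worker computes a delta against a pulled snapshot; the server
# applies deltas with "+=", so they commute and can arrive in any order.

def grad(w, x, y):
    """Gradient of squared error 0.5*(w*x - y)**2 for a scalar model w."""
    return (w * x - y) * x

def worker_delta(w, batch, lr=0.1):
    """Delta a worker would push: -lr * sum of gradients over its batch."""
    return -lr * sum(grad(w, x, y) for x, y in batch)

w = 0.0                                          # server-side model state
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # data with y = 2x

for _ in range(50):
    w_snapshot = w                               # workers pull the model
    deltas = [worker_delta(w_snapshot, b) for b in batches]
    w += sum(deltas)                             # push: aggregate via (+)

print(round(w, 2))  # converges toward 2.0
```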
These updates are implemented with a parameter server: worker machines and server machines.
➢ Model parameters are stored on PS machines and accessed via key-
value interface (distributed shared memory)
➢ Extensions: multiple keys (for a matrix); multiple “channels” (for
multiple sparse vectors, multiple clients for same servers, …)
[Smola et al 2010, Ho et al 2013, Li et al 2014]
➢ Extensions: push/pull interface to send/receive most recent copy
of (subset of) parameters, blocking is optional
➢ Extension: can block until push/pulls with clock < (t – τ ) complete
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Figure: the model parameters w1…w9 are partitioned across server machines, while the training data is partitioned across the workers.
1. Different parts of the model live on different servers.
2. Workers retrieve the parts they need, as needed.
The key-value API:
1. get(key) → value
2. add(key, delta)
Big Data Computing Vu Pham Parameter Servers
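A minimal in-memory stand-in for this key-value interface, in plain Python (a toy sketch of one server shard; names are illustrative, not the API of any real parameter-server library):

```python
from collections import defaultdict

class ParameterServer:
    """Toy single-process stand-in for a parameter-server shard:
    get(key) returns the current value, and add(key, delta) applies an
    additive update -- the only write primitive, so concurrent pushes
    from different workers merge cleanly."""

    def __init__(self):
        self.store = defaultdict(float)

    def get(self, key):
        return self.store[key]

    def add(self, key, delta):
        self.store[key] += delta

# A "worker" pulls the parameters it needs, computes, then pushes deltas.
ps = ParameterServer()
ps.add("w1", 0.5)
ps.add("w1", -0.2)   # updates from different workers merge via addition
ps.add("w2", 1.0)
print(ps.get("w1"), ps.get("w2"))  # 0.3 1.0
```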
Iteration in Map-Reduce (IPM)
Starting from the initial model w(0), each Map-Reduce pass over the training data produces an updated model — w(1), w(2), w(3), … — until the learned model is produced.
Cost of Iteration in Map-Reduce
Each iteration repeatedly loads the same training data.
Each iteration also redundantly saves the intermediate model between stages.
Parameter Servers: Stale Synchronous Parallel Model
Map-Reduce vs. Parameter Server:
• Programming abstraction: Map & Reduce vs. Key-Value Store (distributed shared memory)
• Execution semantics: Bulk Synchronous Parallel (BSP) vs. ?
The Problem: Networks Are Slow!
Every get(key) and add(key, delta) crosses the network between worker and server machines.
➢ The network is slow compared to local memory access.
➢ We want to explore options for handling this.
[Smola et al 2010, Ho et al 2013, Li et al 2014]
One solution: workers cache parameters locally and push only sparse changes to the model back to the server.
Bulk synchronous execution: all machines synchronize at a barrier after each iteration, so time spent waiting for stragglers is wasted.
Asynchronous execution: machines proceed through iterations without barriers [Smola et al 2010].
Asynchronous Execution
Problem: asynchronous execution lacks theoretical guarantees, since a distributed environment can have arbitrary delays from the network and stragglers.
But…
Map-Reduce vs. Parameter Server:
• Programming abstraction: Map & Reduce vs. Key-Value Store (distributed shared memory)
• Execution semantics: Bulk Synchronous Parallel (BSP) vs. Bounded Asynchronous
Parameter Server: Bounded Asynchrony
Stale synchronous parallel (SSP):
• There is a global clock time t.
• The parameters a worker "gets" can be out of date,
• but cannot be older than t − τ.
• τ controls the "staleness"; this scheme is also known as bounded asynchronous.
Figure: along the global clock (0–7), updates up to clock t − τ are guaranteed visible to worker 2; more recent updates (e.g., from worker 4) may not yet have been sent.
➢ SSP interpolates between BSP and Async, and subsumes both.
➢ Workers are usually allowed to run at their own pace.
➢ The fastest and slowest threads are not allowed to drift more than τ clocks apart.
➢ Efficiently implemented: cache parameters [Ho et al 2013].
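The staleness rule can be captured in a few lines: a worker may advance to clock t only if the slowest worker has reached at least t − τ. A toy single-threaded simulation in plain Python (illustrative names; a real implementation would block rather than return False):

```python
class SSPClock:
    """Toy simulation of the stale synchronous parallel rule:
    a worker may sit at clock t only if every worker is at >= t - tau."""

    def __init__(self, n_workers, tau):
        self.clocks = [0] * n_workers
        self.tau = tau

    def can_advance(self, worker):
        # Advancing to clocks[worker] + 1 is allowed only if the fastest
        # worker stays within tau clocks of the slowest.
        return self.clocks[worker] + 1 - min(self.clocks) <= self.tau

    def tick(self, worker):
        if not self.can_advance(worker):
            return False          # would block in a real implementation
        self.clocks[worker] += 1
        return True

ssp = SSPClock(n_workers=3, tau=2)
# A fast worker can run ahead, but only tau clocks past the slowest.
for _ in range(5):
    ssp.tick(0)
print(ssp.clocks)  # [2, 0, 0]: worker 0 is held at the staleness limit
```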
SSP spans the spectrum from strong consistency to relaxed consistency [Ho et al 2013].
Conclusion
In this lecture we discussed parameter servers and the stale synchronous parallel model.
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham Page Rank
Preface
Content of this Lecture:
In this lecture, we will discuss the PageRank algorithm on Big Data using different frameworks, in different ways, and at different scales.
Brain-scale graph: 100 billion vertices, 100 trillion edges (the human connectome; Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011).
What is PageRank?
Why is page importance rating important? It answers new challenges for information retrieval on the World Wide Web:
• Huge number of web pages: 150 million by 1998, 1000 billion by 2008.
• Diversity of web pages: different topics, different quality, etc.
What is PageRank?
• A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
• History:
– PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.
– It began as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.
The PageRank of a page E is denoted PR(E). Links from a high-rank page confer high rank.
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
We have shown how to iteratively calculate PageRank for every vertex in the given graph.
We only need to shuffle the new rank contributions, not the graph structure. Further, we have to control the iteration outside of MapReduce.

Batch algorithms on large graphs
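The MapReduce formulation can be sketched as follows: the map phase emits each page's rank contribution to its out-neighbors, and the reduce phase sums contributions per page and applies the damping factor. Only the contributions are shuffled; the graph structure stays put, and the iteration loop runs outside "MapReduce". A plain-Python sketch (a damping factor of 0.85 is assumed for illustration):

```python
from collections import defaultdict

def pagerank_iteration(graph, ranks, damping=0.85):
    """One MapReduce-style PageRank pass.
    graph: {page: [out-neighbors]}, ranks: {page: rank}."""
    # Map: each page sends rank/out-degree to each out-neighbor.
    contribs = defaultdict(float)
    for page, links in graph.items():
        for dest in links:
            contribs[dest] += ranks[page] / len(links)
    # Reduce: sum contributions per page and apply the damping factor.
    return {page: (1 - damping) + damping * contribs[page]
            for page in graph}

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in graph}
for _ in range(30):            # iteration controlled outside "MapReduce"
    ranks = pagerank_iteration(graph, ranks)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```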
…(transportation networks), or corporations (financial transactions).

In this lecture, we have discussed PageRank algorithms for extracting information from graph data using MapReduce and Pregel.
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham GraphX
Preface
Content of this Lecture:
In this lecture, we will discuss GraphX, which unifies graph-parallel and data-parallel computation for Big Data Analytics, and also present a case study of graph analytics with GraphX.
GraphX computes high-level measures of a graph such as path length and triangle counts:
• It can count triangles in the graph and apply the PageRank algorithm to it.
• It can also join graphs together and transform graphs quickly.
• It supports the Pregel API (Google) for traversing a graph.
• It introduces VertexRDD and EdgeRDD, and the Edge data type.
Figure (graph analytics pipeline on Wikipedia): raw XML text is parsed into a Wikipedia table (title, body), from which a term-document graph feeds a topic model (LDA) producing word topics; in parallel, a discussion table yields an editor graph, community detection, and per-community topics (user community, community topic).
Example (influence ranking): the rank of user i is a weighted sum of its neighbors' ranks. Update ranks in parallel and iterate until convergence.
The Graph-Parallel Pattern (Gonzalez et al. [OSDI'12]): the model/algorithm state is placed on the vertices, and computation depends only on the neighbors:
• Gather information from neighboring vertices.
• Apply an update to the vertex property.
• Scatter information to neighboring vertices.
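The gather and apply steps can be sketched generically; here they compute, for each vertex, the maximum value among itself and its in-neighbors, on a small hypothetical graph (plain Python, illustrative names only):

```python
def gather_apply(vertices, edges, gather, apply_fn):
    """One superstep of the gather/apply part of the graph-parallel pattern.
    vertices: {id: value}; edges: list of (src, dst) pairs.
    gather(acc, neighbor_value) accumulates over in-neighbors;
    apply_fn(old_value, acc) produces the new vertex value."""
    acc = {}
    for src, dst in edges:                    # gather from in-neighbors
        if dst in acc:
            acc[dst] = gather(acc[dst], vertices[src])
        else:
            acc[dst] = vertices[src]
    return {v: apply_fn(val, acc[v]) if v in acc else val  # apply update
            for v, val in vertices.items()}

vertices = {"A": 3, "B": 7, "C": 1, "D": 5}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
updated = gather_apply(vertices, edges, gather=max,
                       apply_fn=lambda old, acc: max(old, acc))
print(updated)  # {'A': 3, 'B': 7, 'C': 7, 'D': 5}
```

A scatter step would then push each changed value back out along the out-edges for the next superstep.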
Many Graph-Parallel Algorithms:
• Collaborative filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
• Community detection: Triangle-Counting, K-core Decomposition, K-Truss
• Structured prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
• Graph analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
• Semi-supervised ML: Graph SSL, CoEM
• Classification: Neural Networks
Specialized graph systems (e.g., Google's Pregel) expose specialized APIs to simplify graph programming; data movement is managed by the system, not the user.
Iterative Bulk Synchronous Execution
These systems exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems: Spark is 4x faster than Hadoop, and GraphLab is 16x faster than Spark.
PageRank on the LiveJournal Graph (runtime in seconds):
• Mahout/Hadoop: 1340
• Naïve Spark: 354
• GraphLab: 22
Figure: the same data viewed as a table of rows and as a dependency graph. Maintaining a separate system for each view is difficult to use and inefficient, and leads to brittle and often complex interfaces.
Figure: a typical pipeline shuttles raw XML through HDFS between table and graph systems. The goal of GraphX is enabling users to easily and efficiently express the entire graph analytics pipeline.
Tables and graphs are composable views of the same physical data: GraphX maintains a unified representation exposing a table view and a graph view. Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
Graphs → Relational Algebra
1. Encode graphs as distributed tables
2. Express graph computation in relational algebra
3. Recast graph-system optimizations as:
   1. Distributed join optimization
   2. Incremental materialized view maintenance
Goals: integrate graph and table processing systems, and achieve performance parity with specialized systems.
Contributors include J. E. Gonzalez (postdoc, Berkeley), M. Franklin (Prof., Berkeley), and I. Stoica (Berkeley).
Table (RDD) operators: groupBy, fold, first, sort, union, reduceByKey, groupByKey, partitionBy, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, …
Graph Operators
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}
Triplets Join Vertices and Edges
The vertex table (A, B, C, D) is joined with the edge table (A→B, A→C, B→C, C→D) to form triplets: ((A, B), (A, C), (B, C), (C, D)).
Example: compute, for each vertex, the maximum age among its in-neighbors. mapF is applied per edge (mapF(A→B) → A1, mapF(A→C) → A2) and reduceF combines the messages sent to the same vertex (reduceF(A1, A2) → A):

graph.mrTriplets(
  e => (e.dst.id, e.src.age), // Map
  (a, b) => max(a, b)         // Reduce
).vertices

(Vertex ages in the figure: A = 30, D = 19, E = 75, F = 16.)
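The mrTriplets computation can be emulated in a few lines of plain Python on a small toy graph (illustrative only, not GraphX's actual implementation): map each edge to a (destination, source-age) message, then reduce messages per vertex with max.

```python
def mr_triplets(vertices, edges, map_f, reduce_f):
    """Emulate the shape of GraphX's mrTriplets: map_f runs on each
    (src_attr, dst_id) edge triplet and emits a (vertex_id, message)
    pair; reduce_f merges messages sent to the same vertex."""
    out = {}
    for src, dst in edges:
        vid, msg = map_f(vertices[src], dst)
        out[vid] = reduce_f(out[vid], msg) if vid in out else msg
    return out

# Toy graph: vertex attribute is an age; edges are hypothetical.
ages = {"A": 30, "B": 19, "C": 75, "D": 16}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Max in-neighbor age, mirroring mrTriplets(e => (e.dst.id, e.src.age), max)
oldest_neighbor = mr_triplets(ages, edges,
                              map_f=lambda src_age, dst: (dst, src_age),
                              reduce_f=max)
print(oldest_neighbor)  # {'B': 30, 'C': 30, 'D': 75}
```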
Figure (2D vertex-cut heuristic): edges are partitioned across machines (Part. 1 and Part. 2); vertices whose edges span both partitions (e.g., A, D, E) are replicated on each.
Caching for Iterative mrTriplets
The vertex table (RDD) is joined with the edge table (RDD); each edge partition keeps a mirror cache of the vertex attributes it needs, so vertex data is shipped once and reused across iterations.
Incremental Updates for Iterative mrTriplets
Only changed vertex values (e.g., A and E) are re-shipped to the mirror caches; the edge table is scanned and unchanged entries are reused.
Aggregation for Iterative mrTriplets
Changed vertex values are locally aggregated on each machine before being sent to the mirror caches, further reducing communication.
Reduction in Communication Due to Cached Updates
Figure: network communication (MB, log scale) drops sharply over iterations 0–16; most vertices are within 8 hops of all vertices in their component.
Figure: runtime per iteration (seconds) for scanning all edges vs. using an index of "active" edges; the indexed approach wins as fewer edges remain active across iterations 0–16.
Figure (PageRank on Twitter): join elimination substantially reduces communication (MB) compared with a three-way join.
To efficiently construct sub-graphs, GraphX relies on substantial index and data reuse:
• Reuse routing tables across graphs and sub-graphs
• Reuse edge adjacency information and indices
PageRank runtime (seconds): Mahout/Hadoop 1340, Naïve Spark 354, Giraph 207, GraphX 68, GraphLab 22.
Connected components runtime (seconds): Giraph 749, GraphX 451, GraphLab 203.
End-to-end pipeline runtime (seconds): Spark 1492, Giraph + Spark 605, GraphLab + Spark 375, GraphX 342.
Lines of code per algorithm in GraphX: PageRank (5), Connected Components (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120) — built on the Pregel (28) and GraphLab (50) APIs, implemented in GraphX (3575 lines) on top of Spark.
GraphX enables users to easily and efficiently express the entire graph analytics pipeline.
By embedding graphs in relational algebra:
• Integration with tables and preprocessing
• Leverage advances in relational systems
• Graph optimizations recast to relational systems
A single system that efficiently spans the pipeline:
• Minimizes data movement and duplication
• Eliminates the need to learn and manage multiple systems
Graphs through the lens of database systems:
• Graph-parallel pattern → triplet joins in relational algebra
• Graph systems → distributed join optimizations
Future directions:
• Model and analyze relationships over time
• Serving graph-structured data: allow external systems to interact with GraphX
• Unify distributed graph databases with relational database technology
Figure (degree distribution): the number of vertices falls off steeply with the number of adjacent vertices; most vertices have only one neighbor.
Figure (update distribution, log scale): the number of vertices vs. the number of updates (0–70); most vertices receive few updates.
Graphs are Essential to Data Mining and Machine Learning
• Find communities
• Understand people's shared interests
• Model complex data dependencies
Figure (collaborative filtering): the users × movies rating matrix is approximated by latent factor vectors f(i) attached to user and movie vertices (ratings such as r24 and r25 connect them); iterate the factor updates until convergence.
Figure (belief propagation): inferring whether users are liberal or conservative from a graph of users and posts; unknown labels (?) are filled in by propagating beliefs along the edges.
Figure (triangle counting): the number of triangles through a vertex (e.g., 1, 3, or 4) measures the "cohesiveness" of its local community.
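Per-vertex triangle counting can be sketched directly: for each edge (u, v), every common neighbor w of u and v closes a triangle, which is credited to w. A plain-Python sketch on a small hypothetical undirected graph:

```python
def triangles_per_vertex(edges):
    """Count the triangles through each vertex of an undirected graph."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    counts = {v: 0 for v in nbrs}
    for u, v in edges:
        for w in nbrs[u] & nbrs[v]:   # common neighbors close a triangle
            counts[w] += 1            # each triangle is credited once per
    return counts                     # vertex: at the edge opposite to it

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(triangles_per_vertex(edges))  # {'A': 1, 'B': 1, 'C': 1, 'D': 0}
```

Vertices A, B, and C each sit on one triangle; D sits on none, so its local community is the least cohesive.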
Figure (the GraphX pipeline): Raw Data → ETL → Initial Graph → Slice → Subgraph → Compute (PageRank) → Analyze → Top Users, with the whole loop repeating inside GraphX.
References
Xin, R., Crankshaw, D., Dave, A., Gonzalez, J., Franklin, M.J., & Stoica, I. (2014). GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. CoRR, abs/1402.2394.
http://spark.apache.org/graphx
Conclusion
The growing scale and importance of graph data has driven the development of specialized graph computation engines capable of inferring complex recursive properties of graph-structured data.

In this lecture we have discussed GraphX, a distributed graph processing framework that unifies graph-parallel and data-parallel computation in a single system and is capable of succinctly expressing and efficiently executing the entire graph analytics pipeline.
Case study: flight data analysis. The data comes from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics site, which tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of January 2014 flight delays and cancellations.
Analysis goals:
• Display the airports with the highest incoming flights
• Display the airports with the highest outgoing flights
• List the most important airports according to PageRank
• List the routes with the lowest flight costs
• Display the airport codes with the lowest flight costs
• List the routes with airport codes which have the lowest flight costs
Dataset fields:
dOfW — Day of Week
fNum — Flight Number Reporting Airline
origin_id — Origin Airport Id
origin — Origin Airport
crsarrtime — CRS Arrival Time (local time: hhmm)
arrtime — Actual Arrival Time (local time: hhmm)
arrdelaymins — Arrival delay (minutes)
graph.edges.filter {
  case Edge(org_id, dest_id, distance) => distance > 1000
}.take(5).foreach(println)
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
// Compute the max degrees
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
_._2).map(x => (airportMap(x._1), x._2)).take(10).foreach(println)
"There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(20).foreach(println)
(id, distCost, newDistCost) => math.min(distCost, newDistCost),
triplet => {
  // send message
  if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
    Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
  } else {
    Iterator.empty
  }
},
// Merge messages
(a, b) => math.min(a, b)
)
// Print routes with the lowest flight cost
print("routes with lowest flight cost")
println(sssp.edges.take(10).mkString("\n"))
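The Pregel-style shortest-path loop used here can be emulated in plain Python (an illustrative sketch only; the real GraphX Pregel runs distributed): each vertex keeps its best-known distance, edges that would improve a destination send src-distance + edge-weight messages, and messages to the same vertex merge with min.

```python
import math

def pregel_sssp(n_vertices, edges, source):
    """Pregel-style single-source shortest paths.
    edges: list of (src, dst, weight); vertices hold current distances."""
    dist = {v: math.inf for v in range(n_vertices)}
    dist[source] = 0.0
    messages = {source: 0.0}
    while messages:
        # Vertex program: keep the min of current value and incoming message.
        for v, m in messages.items():
            dist[v] = min(dist[v], m)
        # Send phase: propose improvements along out-edges.
        new_msgs = {}
        for src, dst, w in edges:
            cand = dist[src] + w
            if cand < dist[dst]:
                # Merge messages to the same vertex with min.
                new_msgs[dst] = min(new_msgs.get(dst, math.inf), cand)
        messages = new_msgs   # loop ends when no distance can improve
    return dist

# Hypothetical routes with costs (vertex ids 0-3 standing in for airports).
edges = [(0, 1, 100.0), (0, 2, 500.0), (1, 2, 150.0), (2, 3, 200.0)]
print(pregel_sssp(4, edges, source=0))  # {0: 0.0, 1: 100.0, 2: 250.0, 3: 450.0}
```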
val temp = rank.join(airports)
temp.take(10).foreach(println)

// Output the most influential airports, from most influential to least
val temp2 = temp.sortBy(_._2._1, false)