Week-8 - Lecture Notes

Parameter Servers

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]


Preface
Content of this Lecture:

In this lecture, we will discuss 'Parameter Servers' and the Stale Synchronous Parallel (SSP) model.


ML Systems

Scalable Machine Learning Algorithms
Abstractions
Scalable Systems


ML Systems Landscape

Dataflow Systems (Hadoop, Spark)
  Model: Naïve Bayes, Rocchio
  Abstractions: PIG, GuineaPig, …

Graph Systems (GraphLab, TensorFlow)
  Model: Graph Algorithms, Graphical Models
  Abstractions: Vertex-Programs [VLDB'10]

Shared Memory Systems (Bosen, DMTK, ParameterServer.org)
  Model: SGD, Sampling [NIPS'09, NIPS'13]
  Abstractions: Parameter Server

Parameter Server
Simple case: the parameters of the ML system are stored in a distributed hash table that is accessible through the network [NIPS'09, NIPS'13].
Parameter servers are used in Google, Yahoo, …; academic work by Smola, Xing, … [VLDB'10].


Parameter Server
A machine learning framework that distributes a model over multiple machines.
Offers two operations:
  Pull: query parts of the model
  Push: update parts of the model
Machine learning update equations:
  (Stochastic) gradient descent
  Collapsed Gibbs sampling for topic modeling
Push updates are aggregated via addition (+).


Spark with Parameter Servers


Parameter Server (PS)
Training state is stored in PS shards and updated asynchronously.


Parameter Servers Are Flexible

Implemented with Parameter Server


Parameter Server (PS)

Worker Machines ↔ Server Machines

➢ Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory)
➢ Extensions: multiple keys (for a matrix); multiple "channels" (for multiple sparse vectors, multiple clients for the same servers, …)
[Smola et al 2010, Ho et al 2013, Li et al 2014]


Parameter Server (PS)

Worker Machines ↔ Server Machines

➢ Extensions: push/pull interface to send/receive the most recent copy of (a subset of) the parameters; blocking is optional
➢ Extension: can block until pushes/pulls with clock < (t − τ) complete
[Smola et al 2010, Ho et al 2013, Li et al 2014]


Data parallel learning with PS

(Figure: parameters w1…w9 are sharded across three parameter servers; the data is split across worker machines.)
Data parallel learning with PS

1. Different parts of the model live on different servers.
2. Workers retrieve the parts they need, as needed.
(Data is split across machines.)
Abstraction used for
Data parallel learning with PS
Key-Value API for workers:

1. get(key) → value
2. add(key, delta)

(The model is partitioned across the server machines.)
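To make the worker-side key-value abstraction concrete, here is a minimal sketch of a data-parallel SGD worker loop against a toy in-memory stand-in for the parameter server; the PSClient class, the linear-model data layout and the learning rate are illustrative assumptions, not part of the lecture.

import scala.collection.mutable

// Toy in-memory stand-in for the parameter-server shards (assumption for illustration only).
class PSClient {
  private val store = mutable.Map[Int, Double]().withDefaultValue(0.0)
  def get(key: Int): Double = store(key)                        // pull part of the model
  def add(key: Int, delta: Double): Unit = store(key) += delta  // push an additive update
}

object WorkerSgdSketch {
  def main(args: Array[String]): Unit = {
    val ps = new PSClient()
    // Local data shard: (featureIndex, featureValue, label) triples for a toy linear model.
    val shard = Seq((0, 1.0, 1.0), (1, 0.5, -1.0), (0, 2.0, 1.0))
    val eta = 0.1
    for ((j, x, y) <- shard) {
      val w = ps.get(j)               // pull only the parameter this example needs
      val grad = (w * x - y) * x      // squared-loss gradient for one example
      ps.add(j, -eta * grad)          // push the update; the server aggregates pushes by addition
    }
    println(s"w0 = ${ps.get(0)}, w1 = ${ps.get(1)}")
  }
}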
Iteration in Map-Reduce (IPM)
(Figure: each Map-Reduce iteration reads the training data and refines the model: w(0) → w(1) → w(2) → w(3).)

Cost of Iteration in Map-Reduce
(Same diagram, annotated with two costs: the same training data is repeatedly loaded on every iteration, and output is redundantly saved between stages.)
Parameter Servers

Stale Synchronous Parallel Model


Parameter Server (PS)

Worker Machines ↔ Server Machines

➢ Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory)
[Smola et al 2010, Ho et al 2013, Li et al 2014]


Iterative ML Algorithms

Data → Worker → Model Parameters

➢ Topic models, matrix factorization, SVM, deep neural networks, …


Map-Reduce vs. Parameter Server

                          Map-Reduce                        Parameter Server
Data Model                Independent Records               Independent Data
Programming Abstraction   Map & Reduce                      Key-Value Store (Distributed Shared Memory)
Execution Semantics       Bulk Synchronous Parallel (BSP)   ?
The Problem: Networks Are Slow!

get(key)
add(key, delta)

➢ The network is slow compared to local memory access.
➢ We want to explore options for handling this.
[Smola et al 2010, Ho et al 2013, Li et al 2014]


Solution 1: Cache Synchronization
(Figure: each worker holds a cached copy of the parameters and sends only sparse changes to the model to the server, which synchronizes the caches; this is parameter cache synchronization, aka IPM.)


Solution 2: Asynchronous Execution

(Figure: with barriers, each machine wastes time waiting between its compute and communicate phases at every iteration.)
Enable more frequent coordination on parameter values.

Asynchronous Execution
(Figure: with a logical parameter server holding w1…w9, machines run iterations at their own pace with no barriers.) [Smola et al 2010]

Problem:
Async lacks theoretical guarantees, since a distributed environment can have arbitrary delays from the network and from stragglers.

But…


RECAP

f is the loss function, x are the parameters.

1. Take a gradient step: x' = x_t − η_t g_t
2. If you've restricted the parameters to a subspace X (e.g., must be positive, …), find the closest point in X to x': x_{t+1} = argmin_{x ∈ X} dist(x, x')
3. But… you might be using a "stale" gradient g (from τ steps ago).
Map-Reduce vs. Parameter Server

                          Map-Reduce                        Parameter Server
Data Model                Independent Records               Independent Data
Programming Abstraction   Map & Reduce                      Key-Value Store (Distributed Shared Memory)
Execution Semantics       Bulk Synchronous Parallel (BSP)   Bounded Asynchronous
Parameter Server
Stale Synchronous Parallel (SSP):
• Global clock time t
• Parameters that workers "get" can be out of date, but cannot be older than t − τ
• τ controls the "staleness"
• Also known as Bounded Asynchronous execution


Stale Synchronous Parallel (SSP)

(Figure: workers 1–4 advance through clocks 0–7 with staleness bound s = 3. Worker 2 is blocked until worker 1 reaches clock 2; updates up to that clock are guaranteed visible to worker 2, some incomplete updates have been sent but are not yet visible, and other updates have not yet been sent.)

➢ Interpolates between BSP and Async, and subsumes both
➢ Allows workers to usually run at their own pace
➢ Fastest/slowest threads are not allowed to drift more than s clocks apart
➢ Efficiently implemented: parameters are cached [Ho et al 2013]
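As a concrete illustration of the staleness rule, here is a minimal bookkeeping sketch (an illustrative assumption, not the lecture's implementation): a worker at clock t may only proceed while it is at most s clocks ahead of the slowest worker.

// Minimal SSP clock bookkeeping sketch (illustrative only).
object SspClockSketch {
  def canProceed(workerClocks: Map[String, Int], worker: String, staleness: Int): Boolean = {
    val t = workerClocks(worker)
    val slowest = workerClocks.values.min
    t - slowest <= staleness   // the fastest worker may be at most s clocks ahead of the slowest
  }

  def main(args: Array[String]): Unit = {
    val clocks = Map("w1" -> 2, "w2" -> 5, "w3" -> 4)
    println(canProceed(clocks, "w2", staleness = 3))                   // true: 5 - 2 <= 3
    println(canProceed(clocks.updated("w2", 6), "w2", staleness = 3))  // false: w2 must wait for w1
  }
}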


Consistency Matters

(Figure: convergence under strong consistency vs. relaxed consistency.)
➢ A suitable delay (SSP) gives a big speed-up. [Ho et al 2013]


Stale Synchronous Parallel (SSP)
(Figure.) [Ho et al 2013]
Conclusion

In this lecture, we have discussed parameter servers and the Stale Synchronous Parallel model.


PageRank Algorithm in Big Data

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the PageRank algorithm in Big Data using different frameworks, in different ways and at different scales.


Big Graphs
Social-scale graph: 1 billion vertices, 100 billion edges
(Social graph from Facebook: https://medium.com/@johnrobb/facebook-the-complete-social-graph-b58157ee6594)

Web-scale graph: 50 billion vertices, 1 trillion edges
(Web graph from the SNAP database: https://snap.stanford.edu/data/)

Brain-scale graph: 100 billion vertices, 100 trillion edges
(Human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011)
What is PageRank?
Why is page importance rating important?
New challenges for information retrieval on the World Wide Web:
• Huge number of web pages: 150 million by 1998, 1000 billion by 2008
• Diversity of web pages: different topics, different quality, etc.

What is PageRank?
• A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
• History:
  – PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.
  – It first appeared as part of a research project on a new kind of search engine. The project started in 1995 and led to a functional prototype in 1998.
The PageRank of a page E is denoted PR(E).

Give pages ranks (scores) based on the links to them:
• Links from many pages → high rank
• Links from a high-rank page → high rank
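The formula itself appears only as a figure in the slides; for reference, the standard PageRank definition (stated here for completeness, not taken from the slide) is

PR(E) = \frac{1-d}{N} + d \sum_{v \in B_E} \frac{PR(v)}{L(v)}

where B_E is the set of pages that link to E, L(v) is the number of out-links of v, N is the total number of pages, and d is the damping factor (commonly 0.85).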



Example
(Worked example: figures show several iterations of the PageRank computation on a small graph.)


Page Rank
Stationary Distribution
(Figure slides: the PageRank formulation and its interpretation as the stationary distribution of a random walk on the link graph.)


Summary
PageRank (PR) was the first algorithm used by Google Search to rank websites in its search-engine results.

We have shown how to iteratively calculate the PageRank of every vertex in a given graph.
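A minimal sketch of this iterative calculation on a small in-memory graph (the toy graph, the damping factor 0.85 and the 20 iterations are illustrative assumptions):

// Iterative PageRank on an in-memory adjacency list (illustrative sketch).
object SimplePageRank {
  def main(args: Array[String]): Unit = {
    val links = Map("A" -> Seq("B", "C"), "B" -> Seq("C"), "C" -> Seq("A"))  // page -> out-links
    val n = links.size
    val d = 0.85                                        // damping factor (assumed value)
    var ranks = links.keys.map(_ -> 1.0 / n).toMap      // start from the uniform distribution

    for (_ <- 1 to 20) {                                // fixed number of iterations for simplicity
      val contribs = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dst => dst -> ranks(page) / outs.size) // each page shares its rank among its out-links
      }
      val summed = contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
      ranks = links.keys.map(p => p -> ((1 - d) / n + d * summed.getOrElse(p, 0.0))).toMap
    }
    ranks.foreach { case (p, r) => println(f"$p%s: $r%.4f") }
  }
}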


MapReduce for PageRank



Problems
The entire state of the graph is shuffled on every iteration.

We only need to shuffle the new rank contributions, not the graph structure.

Further, the iteration has to be controlled outside of MapReduce.
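To see why the whole graph state gets shuffled, here is a minimal single-machine simulation of one MapReduce-style PageRank iteration (an illustration, not the lecture's code): the mapper must re-emit each vertex's adjacency list alongside its rank contributions so the reducer can rebuild the graph for the next round.

// One MapReduce-style PageRank iteration, simulated with Scala collections (illustrative sketch).
object MapReducePageRankStep {
  // A shuffled record is either a re-emitted adjacency list or a rank contribution.
  sealed trait Emit
  case class Adjacency(outLinks: Seq[String]) extends Emit
  case class Contribution(value: Double) extends Emit

  def main(args: Array[String]): Unit = {
    val graph = Map("A" -> (1.0, Seq("B", "C")), "B" -> (1.0, Seq("C")), "C" -> (1.0, Seq("A")))

    // Map phase: every vertex emits its structure AND one contribution per out-link; all of it is shuffled.
    val mapped: Seq[(String, Emit)] = graph.toSeq.flatMap { case (v, (rank, outs)) =>
      (v -> Adjacency(outs)) +: outs.map(dst => dst -> (Contribution(rank / outs.size): Emit))
    }

    // Reduce phase: per vertex, sum the contributions and recover the adjacency list.
    val reduced = mapped.groupBy(_._1).map { case (v, emits) =>
      val newRank = 0.15 + 0.85 * emits.collect { case (_, Contribution(c)) => c }.sum  // simplified damping
      val outs = emits.collectFirst { case (_, Adjacency(o)) => o }.getOrElse(Seq.empty)
      v -> (newRank, outs)
    }
    reduced.foreach(println)
  }
}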


Pregel
Originally from Google.
Open-source implementations: Apache Giraph, Stanford GPS, JPregel, Hama.
Batch algorithms on large graphs.


PageRank in Pregel

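The slide shows PageRank written as a Pregel vertex program in a figure; here is a hedged sketch of the same idea using GraphX's pregel operator (the toy graph, the 0.15/0.85 damping constants and the 10 iterations are assumptions, and the program is simplified relative to the built-in PageRank):

import org.apache.spark.graphx._
import org.apache.spark.{SparkConf, SparkContext}

// Pregel-style PageRank sketch using GraphX's pregel operator (simplified, illustrative).
object PregelPageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PregelPageRank").setMaster("local[*]"))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph.fromEdges(edges, defaultValue = 1)

    // Attach (rank, outDegree) to every vertex before iterating.
    val ranked = graph.outerJoinVertices(graph.outDegrees) { (_, _, deg) => (1.0, deg.getOrElse(0)) }

    val result = ranked.pregel(initialMsg = 0.0, maxIterations = 10)(
      // Vertex program: absorb the summed incoming contributions into a new rank.
      (_, attr, msgSum) => (0.15 + 0.85 * msgSum, attr._2),
      // Send message: each vertex sends rank / outDegree along its out-edges.
      triplet => Iterator((triplet.dstId, triplet.srcAttr._1 / math.max(triplet.srcAttr._2, 1))),
      // Merge messages: contributions are summed.
      (a, b) => a + b
    )
    result.vertices.collect.foreach(println)
    sc.stop()
  }
}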


Example
(Worked example: figures step through the Pregel supersteps of the PageRank computation on a small graph.)


Conclusion
Graph-structured data are increasingly common in data science contexts due to their ubiquity in modeling the communication between entities: people (social networks), computers (Internet communication), cities and countries (transportation networks), or corporations (financial transactions).

In this lecture, we have discussed PageRank algorithms for extracting information from graph data using MapReduce and Pregel.


Spark GraphX & Graph Analytics

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss GraphX: a distributed graph computation framework that unifies graph-parallel and data-parallel computation for Big Data analytics. We will also discuss a case study of graph analytics with GraphX.


Introduction
A piece of technology built on top of Spark Core.
Graphs here are graphs in the computer-science/network sense, such as a social network.
However, graphs are only useful for specific things.
GraphX can measure things like "connectedness", degree distribution, average path length, and triangle counts: high-level measures of a graph.
It can count triangles in the graph, and apply the PageRank algorithm to it.
It can also join graphs together and transform graphs quickly.
It supports the Pregel API (from Google) for traversing a graph.
• Introduces VertexRDD and EdgeRDD, and the Edge data type.

Graphs in the Machine Learning Landscape / Graphs / Graph-structured data is everywhere / Graph Algorithms
(Figure slides only.)


Graphs are Central to Analytics
(Pipeline figure: raw Wikipedia XML is parsed into text tables; Hyperlinks → PageRank → Top 20 Pages (Title, PR); Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic); Discussion Table → Editor Graph → Community Detection → User Community and Community Topic tables.)

PageRank: Identifying Leaders

Rank of user i = weighted sum of neighbors' ranks
Update ranks in parallel
Iterate until convergence


PageRank / Single-Source Shortest Path / Finding Communities / Counting Triangles
(Figure slides only.)


The Graph-Parallel Pattern

(Figure: model / algorithm state attached to a graph.)
Computation depends only on the neighbors.


Graph-Parallel Pattern
Gonzalez et al. [OSDI'12]

Gather information from neighboring vertices.
Apply an update to the vertex property.
Scatter information to neighboring vertices.
Many Graph-Parallel Algorithms
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
Semi-supervised ML: Graph SSL, CoEM
Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
Classification: Neural Networks


Graph-Parallel Systems

(Systems such as Google's Pregel, GraphLab, Giraph, …)
Expose specialized APIs to simplify graph programming.


The Pregel (Push) Abstraction
“Think like a Vertex.” - Pregel [SIGMOD’10]
Vertex-Programs interact by sending messages.

The GraphLab (Pull) Abstraction
Vertex programs directly access adjacent vertices and edges.
Data movement is managed by the system, not the user.


Iterative Bulk Synchronous Execution
(Figure.)


Graph-Parallel Systems

Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.


PageRank on the Live-Journal Graph

Runtime (in seconds, PageRank for 10 iterations):
  Mahout/Hadoop  1340
  Naïve Spark     354
  GraphLab         22

Spark is 4x faster than Hadoop.
GraphLab is 60x faster than Hadoop and 16x faster than Spark.


Triangle Counting on Twitter
(Figure.)


Graphs are Central to Analytics
(Same analytics pipeline figure as shown earlier.)


Separate Systems to Support Each View

(Figure: the same data viewed as a Table View (rows of a table) and as a Graph View (a dependency graph producing a result).)


Having separate systems for each view is difficult to use and inefficient.


Difficult to Program and Use
Users must learn, deploy, and manage multiple systems.
This leads to brittle and often complex interfaces.


Inefficient
Extensive data movement and duplication across the network and file system: data is written back to HDFS between each stage of the pipeline.

Limited reuse of internal data structures across stages.
Solution: The GraphX Unified Approach

New API: blurs the distinction between Tables and Graphs.
New System: combines Data-Parallel and Graph-Parallel systems.

Enabling users to easily and efficiently express the entire graph analytics pipeline.
Tables and Graphs are composable views of the
same physical data

Table View ↔ GraphX Unified Representation ↔ Graph View

Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
Graphs → Relational Algebra
1. Encode graphs as distributed tables
2. Express graph computation in relational algebra
3. Recast graph-system optimizations as:
   1. Distributed join optimization
   2. Incremental materialized-view maintenance

Integrate graph and table data processing systems. Achieve performance parity with specialized systems.


View a Property Graph as a Table

A property graph is represented as a Vertex Property Table and an Edge Property Table.

Vertex Property Table:
  Id        Property (V)
  Rxin      (Stu., Berk.)
  Jegonzal  (PstDoc, Berk.)
  Franklin  (Prof., Berk.)
  Istoica   (Prof., Berk.)

Edge Property Table:
  SrcId     DstId     Property (E)
  rxin      jegonzal  Friend
  franklin  rxin      Advisor
  istoica   franklin  Coworker
  franklin  jegonzal  PI
Property Graphs as Distributed Tables



Table Operators
Table (RDD) operators are inherited from Spark:

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
Graph Operators
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}
Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:
(Figure: the vertex table and edge table are joined to form triplets of (source vertex, destination vertex, edge).)

The mrTriplets operator sums adjacent triplets:

SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUPBY t.dstId


Map Reduce Triplets
Map-Reduce for each vertex:

  mapF( A → B ) = A1
  mapF( A → C ) = A2
  reduceF( A1, A2 ) = A

(mapF is applied to each edge adjacent to A, and the results are combined with reduceF into A's new value.)


Example: Oldest Follower
What is the age of the oldest follower for each user?

val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age),   // Map
    (a, b) => max(a, b)           // Reduce
  )
  .vertices

(Figure: a follower graph whose users have ages 23, 42, 30, 19, 75 and 16.)
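For reference, in the GraphX API that shipped with Spark this map/reduce-over-triplets pattern is exposed as aggregateMessages; a sketch of the same query, assuming the vertex attribute of graph is the user's age (Int):

// Equivalent query with GraphX's aggregateMessages (assumes graph: Graph[Int, _], vertex attribute = age).
val oldestFollowerAge = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(ctx.srcAttr),   // Map: send each follower's age to the user being followed
  (a, b) => math.max(a, b)             // Reduce: keep the maximum age seen
)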


GraphX System Design

Distributed Graphs as Tables (RDDs)
(Figure: the property graph is stored as a Vertex Table (RDD), a Routing Table (RDD) and an Edge Table (RDD); edges are partitioned with a 2D vertex-cut heuristic, and the routing table records which edge partitions each vertex appears in.)
Caching for Iterative mrTriplets
(Figure: vertex data is replicated into mirror caches co-located with the edge table partitions, so edges join against locally cached vertex values on each iteration.)

Incremental Updates for Iterative mrTriplets
(Figure: only the vertices whose values changed are re-shipped to the mirror caches; the edge table is scanned against the updated mirrors.)

Aggregation for Iterative mrTriplets
(Figure: changed vertex values are locally aggregated before being sent to the mirror caches, further reducing communication.)
Reduction in Communication Due to Cached Updates

(Chart: network communication (MB, log scale) per iteration for Connected Components on the Twitter graph; communication drops sharply over the iterations because most vertices are within 8 hops of all vertices in their component.)


Benefit of Indexing Active Edges

(Chart: runtime in seconds per iteration for Connected Components on the Twitter graph; an index of the "active" edges beats scanning all edges in later iterations.)


Join Elimination
Identify and bypass joins for unused triplet fields.
Example: PageRank only accesses the source attribute.

(Chart: communication (MB) per iteration for PageRank on Twitter, three-way join vs. join elimination; join elimination gives a factor-of-2 reduction in communication.)


Additional Query Optimizations
Indexing and Bitmaps:
  To accelerate joins across graphs
  To efficiently construct sub-graphs

Substantial Index and Data Reuse:
  Reuse routing tables across graphs and sub-graphs
  Reuse edge adjacency information and indices


Performance Comparisons

Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
  Mahout/Hadoop  1340
  Naïve Spark     354
  Giraph          207
  GraphX           68
  GraphLab         22

GraphX is roughly 3x slower than GraphLab.
GraphX scales to larger graphs

Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations):
  Giraph    749
  GraphX    451
  GraphLab  203

GraphX is roughly 2x slower than GraphLab:
  » Scala + Java overhead: lambdas, GC time, …
  » No shared-memory parallelism: 2x increase in communication
A Small Pipeline in GraphX
Pipeline: Raw Wikipedia XML → (Spark preprocess) → Hyperlinks → (compute PageRank) → (Spark post-processing) → Top 20 Pages

Total runtime (in seconds), timed end-to-end:
  Spark             1492
  Giraph + Spark     605
  GraphX             342
  GraphLab + Spark   375

Timed end-to-end, GraphX is faster than GraphLab.
The GraphX Stack (Lines of Code)

Algorithms: PageRank (5), Connected Components (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120)
Abstractions: Pregel (28) + GraphLab (50)
GraphX (3575)
Spark


Status
Alpha release as part of Spark 0.9.


GraphX: Unified Analytics
Enabling users to easily and efficiently express the entire graph analytics pipeline.


A Case for Algebra in Graphs
A standard algebra is essential for graph systems:
  e.g., SQL → proliferation of relational systems

By embedding graphs in relational algebra:
  Integration with tables and preprocessing
  Leverage advances in relational systems
  Graph optimizations recast onto relational systems


Observations
Domain-specific views: Tables and Graphs
  Tables and graphs are first-class composable objects
  Specialized operators exploit the semantics of each view

Single system that efficiently spans the pipeline
  Minimizes data movement and duplication
  Eliminates the need to learn and manage multiple systems

Graphs through the lens of database systems
  Graph-Parallel Pattern → triplet joins in relational algebra
  Graph Systems → distributed join optimizations


Active Research
Static Data → Dynamic Data
  Apply the GraphX unified approach to time-evolving data
  Model and analyze relationships over time

Serving Graph-Structured Data
  Allow external systems to interact with GraphX
  Unify distributed graph databases with relational database technology


Graph Property 1: Real-World Graphs

Power-law degree distribution; edges >> vertices.

AltaVista web graph: 1.4B vertices, 6.6B edges. More than 10^8 vertices have only one neighbor, while the top 1% of vertices are adjacent to 50% of the edges!
(Charts: log-log degree distribution of the AltaVista web graph; the ratio of edges to vertices on Facebook growing from 2008 to 2012.)


Graph Property 2: Active Vertices

PageRank on the web graph: 51% of vertices are updated only once!
(Chart: number of vertices vs. number of updates; most vertices converge after very few updates.)
Graphs are Essential to Data Mining and
Machine Learning

Identify influential people and information
Find communities
Understand people's shared interests
Model complex data dependencies


Recommending Products
(Figure: bipartite graph of Users, Ratings, Items.)


Recommending Products
Low-Rank Matrix Factorization:
(Figure: the Netflix users × movies rating matrix is approximated by the product of user factors U and movie factors M; each observed rating r_ij links a user factor f(i) and a movie factor f(j) in a bipartite graph. Iterate the factor updates until convergence.)
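The update step appears only as a figure; for reference, the usual low-rank factorization objective being minimized (a standard formulation, not taken from the slide) is

\min_{U, M} \sum_{(i,j)\ \mathrm{observed}} \big( r_{ij} - u_i^{\top} m_j \big)^2 + \lambda \big( \lVert U \rVert_F^2 + \lVert M \rVert_F^2 \big)

with alternating least squares or gradient descent iterating over the user factors U and movie factors M until convergence.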


Predicting User Behavior
(Figure: a social graph where some users are labeled Liberal or Conservative, posts connect users, and the labels of the remaining "?" users are predicted with a Conditional Random Field solved by Belief Propagation.)


Finding Communities

Count the triangles passing through each vertex (e.g., 3, 1 and 4 in the figure).
This measures the "cohesiveness" of the local community:
  Fewer triangles → weaker community
  More triangles → stronger community
Example Graph Analytics Pipeline
Preprocessing → Compute → Post-processing:
Raw Data → ETL → Initial Graph → Slice → Subgraph → Compute (PageRank) → Analyze → Top Users
(Repeat; the whole pipeline runs in GraphX.)
References
Xin, R., Crankshaw, D., Dave, A., Gonzalez, J., Franklin, M. J., & Stoica, I. (2014). GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. CoRR, abs/1402.2394.

http://spark.apache.org/graphx
Conclusion
The growing scale and importance of graph data has driven the development of specialized graph computation engines capable of inferring complex recursive properties of graph-structured data.

In this lecture we have discussed GraphX, a distributed graph processing framework that unifies graph-parallel and data-parallel computation in a single system and is capable of succinctly expressing and efficiently executing the entire graph analytics pipeline.


Case Study: Flight Data Analysis
using Spark GraphX



Problem Statement
To analyze real-time flight data using Spark GraphX, provide near real-time computation results, and visualize the results.


Flight Data Analysis using Spark GraphX
Dataset Description:

The data has been collected from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics site, which tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report, and this dataset covers flight delays and cancellations for January 2014.


Big Data Pipeline for Flight Data Analysis using
Spark GraphX



Use Cases
I.   Monitoring air traffic at airports
II.  Monitoring flight delays
III. Analysis of overall routes and airports
IV.  Analysis of routes and airports per airline


Objectives
Compute the total number of airports
Compute the total number of flight routes
Compute and sort the longest flight routes
Display the airports with the highest incoming flights
Display the airports with the highest outgoing flights
List the most important airports according to PageRank
List the routes with the lowest flight costs
Display the airport codes with the lowest flight costs
List the routes, with airport codes, which have the lowest flight costs


Features
17 Attributes

Attribute Name   Attribute Description
dOfM             Day of Month
dOfW             Day of Week
carrier          Unique Airline Carrier Code
tailNum          Tail Number
fNum             Flight Number Reporting Airline
origin_id        Origin Airport Id
origin           Origin Airport
dest_id          Destination Airport Id
dest             Destination Airport
crsdepttime      CRS Departure Time (local time: hhmm)
deptime          Actual Departure Time


Features
Attribute Name   Attribute Description
depdelaymins     Difference in minutes between scheduled and actual departure time (early departures set to 0)
crsarrtime       CRS Arrival Time (local time: hhmm)
arrtime          Actual Arrival Time (local time: hhmm)
arrdelaymins     Difference in minutes between scheduled and actual arrival time (early arrivals set to 0)
crselapsedtime   CRS Elapsed Time of Flight, in minutes
dist             Distance between airports (miles)


Sample Dataset

(Screenshot of the sample dataset.)


Spark Implementation
(Code screenshots: the dataset is loaded from "/home/iitp/spark-2.2.0-bin-hadoop-2.6/flights/airportdataset.csv" and parsed into a property graph of airports and routes.)
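The implementation itself appears only as screenshots; a minimal sketch of how such a graph could be built (the Flight column positions are assumed from the attribute table above, and the CSV is assumed to have no header row; this is not the lecture's exact code):

import org.apache.spark.graphx._
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: build an airport/route property graph from the flight CSV (column layout assumed).
object FlightGraphBuilder {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FlightGraph").setMaster("local[*]"))
    val lines = sc.textFile("/home/iitp/spark-2.2.0-bin-hadoop-2.6/flights/airportdataset.csv")

    // Assumed columns: ..., origin_id(5), origin(6), dest_id(7), dest(8), ..., dist(16); no header row.
    val rows = lines.map(_.split(",")).filter(_.length >= 17)

    val airports = rows.flatMap(r => Seq((r(5).toLong, r(6)), (r(7).toLong, r(8)))).distinct()
    val routes = rows.map(r => Edge(r(5).toLong, r(7).toLong, r(16).toDouble.toInt)) // edge attr = miles

    val graph = Graph(airports, routes)
    println(s"airports = ${graph.numVertices}, routes = ${graph.numEdges}")
    sc.stop()
  }
}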


Graph Operations
(Code screenshots only.)


Graph Operations
Which airport has the most in-degrees, i.e. unique flights into it?
(Code screenshot.)


Graph Operations
What are our most important airports?
(Code screenshot.)


Graph Operations
Output the routes where the distance between airports exceeds 1000 miles

graph.edges.filter {
  case (Edge(org_id, dest_id, distance)) => distance > 1000
}.take(5).foreach(println)


Graph Operations
Output the airport with the maximum incoming flights

// Define a reduce operation to compute the highest-degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// Compute the max degrees
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)


Graph Operations
Output the airports with the most incoming flights

val maxIncoming = graph.inDegrees.collect.sortWith(_._2 > _._2)
  .map(x => (airportMap(x._1), x._2))
maxIncoming.take(10).foreach(println)


Graph Operations
Output the longest routes

graph.triplets.sortBy(_.attr, ascending = false).map(triplet =>
  "There were " + triplet.attr.toString + " flights from " + triplet.srcAttr +
  " to " + triplet.dstAttr + ".").take(20).foreach(println)


Graph Operations
Output the cheapest airfare routes

val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)

// Call pregel on the graph
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // Vertex program
  (id, distCost, newDistCost) => math.min(distCost, newDistCost),
  // Send message
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  // Merge messages
  (a, b) => math.min(a, b)
)

// Print routes with the lowest flight cost
print("routes with lowest flight cost")
println(sssp.edges.take(10).mkString("\n"))
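Note: initialGraph is not defined in the snippet shown on the slide. A plausible initialization, assumed here for completeness (the choice of source vertex is hypothetical), would seed one source airport with cost 0 and all other airports with infinity before running the single-source shortest-path computation:

// Assumed initialization (not shown on the slide): hypothetical source vertex with cost 0, infinity elsewhere.
val sourceId: VertexId = graph.vertices.first._1
val initialGraph = gg.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)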


Graph Operations (PageRank)
Output the most influential airports using PageRank

val rank = graph.pageRank(0.001).vertices
val temp = rank.join(airports)
temp.take(10).foreach(println)

Output the airports ordered from most to least influential

val temp2 = temp.sortBy(_._2._1, false)


Conclusion
The growing scale and importance of graph data has driven the development of specialized graph computation engines capable of inferring complex recursive properties of graph-structured data.

In this lecture we have discussed GraphX, a distributed graph processing framework that unifies graph-parallel and data-parallel computation in a single system and is capable of succinctly expressing and efficiently executing the entire graph analytics pipeline.
