GraphX: Graph Analytics in Spark
Ankur Dave
Graduate Student, UC Berkeley AMPLab
[Figure: systems compared by Model & Dependencies (Small & Dense, Sparse, Large & Dense) and by Architecture (Spark Dataflow Framework, GraphX, Parameter Server)]
Graphs
Social Networks
Web Graphs
User-Item Graphs
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering
[Figure: the Ratings matrix (Users by Products) factored into user factors f(i) and product factors f(j)]
Collaborative Filtering
[Figure: bipartite graph with User Factors f(1), f(2) and Product Factors f(3), f(4), f(5), connected by rating edges r13, r14, r24, r25]
$$f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left(r_{ij} - w^T f[j]\right)^2 + \lambda \|w\|_2^2$$
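For intuition, the per-vertex update has a closed form when every factor is a single number (d = 1): setting the derivative of the regularized squared error to zero gives w = (sum of r_ij * f[j]) / (sum of f[j]^2 + lambda). The sketch below is a minimal plain-Scala illustration of that special case only. `updateFactor`, the regularization weight `lambda`, and the sample ratings are hypothetical names and values chosen for illustration; they are not part of GraphX.

```scala
// One ALS-style update for a single vertex i, assuming scalar factors (d = 1).
// Each neighbor j contributes a pair (r_ij, f[j]).
// lambda is an assumed regularization weight and should be > 0 so that
// the denominator cannot be zero.
def updateFactor(nbrs: Seq[(Double, Double)], lambda: Double): Double = {
  val num = nbrs.map { case (r, f) => r * f }.sum          // sum of r_ij * f[j]
  val den = nbrs.map { case (_, f) => f * f }.sum + lambda // sum of f[j]^2 + lambda
  num / den
}

// Hypothetical user 1 with ratings r13 = 4.0 and r14 = 2.0,
// and current product factors f(3) = 2.0 and f(4) = 1.0.
val f1: Double = updateFactor(Seq((4.0, 2.0), (2.0, 1.0)), lambda = 1.0)
```

Only the ratings and factors of neighboring vertices enter the update, which is what makes the computation graph-parallel. For d > 1 the same step becomes a small linear solve per vertex.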
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms
Collaborative Filtering
  Alternating Least Squares
  Stochastic Gradient Descent
  Tensor Factorization
Semi-supervised ML
  Graph SSL
  CoEM
Community Detection
  Triangle-Counting
  K-core Decomposition
  K-Truss
Classification
  Neural Networks
Modern Analytics
[Figure: analytics pipeline. Raw Wikipedia XML -> Link Table (Title, Link) -> Hyperlinks -> PageRank -> Top 20 Pages (Title, PR); a second table (Com., PR) feeds Top Communities]
Vertex Property:
User Profile
Current PageRank Value
Edge Property:
Weights
Relationships
Timestamps
Creating a Graph (Scala)
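import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// sc is an existing SparkContext, as in spark-shell.
// Provided by GraphX: type VertexId = Long
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(List(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Charlie")))

// Provided by GraphX (as shown on the slide):
// class Edge[ED](val srcId: VertexId, val dstId: VertexId, val attr: ED)
val edges: RDD[Edge[String]] =
  sc.parallelize(List(
    Edge(1L, 2L, "coworker"),
    Edge(2L, 3L, "friend")))

val graph = Graph(vertices, edges)

[Figure: Graph with 1 Alice --coworker--> 2 Bob --friend--> 3 Charlie]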
Graph Operations (Scala)
class Graph[VD, ED] {
  // Table Views -----------------------
  def vertices: RDD[(VertexId, VD)]
  def edges: RDD[Edge[ED]]
  def triplets: RDD[EdgeTriplet[VD, ED]]

  // Transformations -------------------------------------------
  def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

  // Joins ----------------------------------------
  def outerJoinVertices[U, VD2]
      (tbl: RDD[(VertexId, U)])
      (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

  // Computation ----------------------------------
  def mapReduceTriplets[A](
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): RDD[(VertexId, A)]
Built-in Algorithms (Scala)
  // Continued from previous slide
  def pageRank(tol: Double): Graph[Double, Double]
  def triangleCount(): Graph[Int, ED]
  def connectedComponents(): Graph[VertexId, ED]
  // ...and more: org.apache.spark.graphx.lib
}
[Figure: Graph with 1 Alice --coworker--> 2 Bob --friend--> 3 Charlie, shown next to its triplets view]

triplets (RDD):
srcAttr   attr       dstAttr
Alice     coworker   Bob
Bob       friend     Charlie
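The triplets view can be read as a join of each edge with the attributes of its two endpoints. The following is a minimal sketch of that idea on plain Scala collections rather than Spark RDDs. `SimpleEdge`, `SimpleTriplet`, and `tripletsOf` are hypothetical names for illustration; they are not GraphX types, and this is not how GraphX implements the view.

```scala
// Plain-collection stand-ins for the vertex and edge tables (not Spark RDDs).
case class SimpleEdge(srcId: Long, dstId: Long, attr: String)
case class SimpleTriplet(srcAttr: String, attr: String, dstAttr: String)

val vertexTable: Map[Long, String] =
  Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie")

val edgeTable: Seq[SimpleEdge] = Seq(
  SimpleEdge(1L, 2L, "coworker"),
  SimpleEdge(2L, 3L, "friend"))

// Join each edge with the attributes of its source and destination vertices.
// Assumes every edge endpoint is present in the vertex table.
def tripletsOf(vs: Map[Long, String], es: Seq[SimpleEdge]): Seq[SimpleTriplet] =
  es.map(e => SimpleTriplet(vs(e.srcId), e.attr, vs(e.dstId)))

val tripletView: Seq[SimpleTriplet] = tripletsOf(vertexTable, edgeTable)
```

With the three-vertex example graph, `tripletView` holds one row per edge, in the same order as `edgeTable`.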
The subgraph transformation
class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}

graph.subgraph(epred = (edge) => edge.attr != "relative")
[Figure: Graph -> subgraph -> Graph. The input graph has a relative edge and a friend edge; the output graph keeps only the friend edge]
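The filtering behavior of subgraph can be sketched on plain Scala collections: vertices are kept when they pass the vertex predicate, and an edge is kept when it passes the edge predicate and both of its endpoints were kept. This is a simplified model under stated assumptions, not the GraphX implementation. `Link`, `TinyGraph`, `restrict`, and the sample edges are hypothetical, and the edge predicate here sees only the edge rather than a full EdgeTriplet.

```scala
case class Link(src: Long, dst: Long, label: String)
case class TinyGraph(vertices: Map[Long, String], edges: Seq[Link])

// Keep vertices passing vpred, then keep edges that pass epred and whose
// endpoints both survived the vertex filter.
def restrict(g: TinyGraph,
             epred: Link => Boolean = _ => true,
             vpred: (Long, String) => Boolean = (_, _) => true): TinyGraph = {
  val vs = g.vertices.filter { case (id, attr) => vpred(id, attr) }
  val es = g.edges.filter(e => epred(e) && vs.contains(e.src) && vs.contains(e.dst))
  TinyGraph(vs, es)
}

// Assumed sample data: four people and three labeled edges.
val people = TinyGraph(
  Map(1L -> "Alice", 2L -> "Bob", 3L -> "Charlie", 4L -> "David"),
  Seq(Link(1L, 2L, "coworker"), Link(2L, 3L, "friend"), Link(3L, 4L, "relative")))

// Drop every edge labeled "relative"; all vertices stay.
val noRelatives: TinyGraph = restrict(people, epred = e => e.label != "relative")
```

An edge predicate alone never removes vertices, so David remains in `noRelatives` as an isolated vertex.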
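[Figure: Graph -> mapReduceTriplets -> result table. The input graph has vertices Alice, Bob, Charlie, David and edges labeled coworker, relative, friend, relative (one of them Charlie --relative--> David). Result: Alice 2, Bob 2, Charlie 3, David 1]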
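One computation that yields the values Alice 2, Bob 2, Charlie 3, David 1 is a degree count: every edge sends the message 1 to both endpoints, and messages for the same vertex are summed. The sketch below emulates the sendMsg/mergeMsg shape of mapReduceTriplets on local collections. `LocalTriplet` and `mapReduceLocal` are hypothetical names, and the edge list is an assumption, since only the Charlie-David edge is legible in the figure. In later Spark releases this operator was superseded by `aggregateMessages`.

```scala
case class LocalTriplet(srcId: Long, srcAttr: String,
                        dstId: Long, dstAttr: String, attr: String)

// sendMsg emits (vertexId, message) pairs for each triplet;
// mergeMsg combines messages addressed to the same vertex.
// Vertices that receive no message are absent from the result.
def mapReduceLocal[A](triplets: Seq[LocalTriplet])(
    sendMsg: LocalTriplet => Iterator[(Long, A)],
    mergeMsg: (A, A) => A): Map[Long, A] =
  triplets.flatMap(t => sendMsg(t))
    .groupBy(_._1)
    .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

// Assumed edge list, chosen to be consistent with the figure's result.
val socialTriplets = Seq(
  LocalTriplet(1L, "Alice", 2L, "Bob", "coworker"),
  LocalTriplet(1L, "Alice", 3L, "Charlie", "relative"),
  LocalTriplet(2L, "Bob", 3L, "Charlie", "friend"),
  LocalTriplet(3L, "Charlie", 4L, "David", "relative"))

// Degree count: each edge contributes 1 to its source and its destination.
val degrees: Map[Long, Int] = mapReduceLocal[Int](socialTriplets)(
  t => Iterator((t.srcId, 1), (t.dstId, 1)),
  (a, b) => a + b)
```

Because `mergeMsg` is applied with `reduce`, it should be commutative and associative, as addition is here.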
How GraphX Works
Encoding Property Graphs as RDDs
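[Figure: a Property Graph with vertices A to F, split by a vertex cut across Machine 1 and Machine 2, encoded as three RDDs]

Vertex Table (RDD)
  Part. 1 (Machine 1): A, B, C
  Part. 2 (Machine 2): D, E, F

Routing Table (RDD)
  A -> 1, 2
  B -> 1
  C -> 1
  D -> 1, 2
  E -> 2
  F -> 2

Edge Table (RDD)
  Part. 1: A B, A C, B C, C D
  Part. 2: A E, A F, E D, E F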
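A routing table of this kind can be derived from the edge partitions alone: for each vertex, record which edge partitions contain an edge touching it. The sketch below shows this on plain Scala collections. `edgePartitions` and `routingTable` are hypothetical names, and the edge assignment is an assumed reading of the figure, not GraphX's internal representation.

```scala
// Edge partitions under a vertex cut: each edge lives in exactly one
// partition, while a vertex may be referenced from several partitions.
val edgePartitions: Map[Int, Seq[(String, String)]] = Map(
  1 -> Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")),
  2 -> Seq(("A", "E"), ("A", "F"), ("E", "D"), ("E", "F")))

// For every vertex, collect the edge partitions that reference it.
def routingTable(parts: Map[Int, Seq[(String, String)]]): Map[String, Set[Int]] =
  parts.toSeq
    .flatMap { case (pid, es) =>
      es.flatMap { case (src, dst) => Seq(src -> pid, dst -> pid) }
    }
    .groupBy(_._1)
    .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

val routing: Map[String, Set[Int]] = routingTable(edgePartitions)
```

Vertices A and D appear in both partitions, so their attributes would have to be shipped to both machines, while B, C, E, and F are needed in only one.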
Graph System Optimizations
Specialized Data-Structures
Vertex-Cuts Partitioning
Remote Caching / Mirroring
[Figure: two bar charts of Runtime (Seconds). Left panel: Twitter Graph (42M Vertices, 1.5B Edges), annotated 7x. Right panel: UK-Graph (106M Vertices, 3.7B Edges), annotated 18x]
2. More algorithms
a) LDA (topic modeling): PR #2388
b) Correlation clustering
c) Your algorithm here?
3. Speculative
a) Streaming/time-varying graphs
b) Graph database-like queries
Thanks!
http://spark.apache.org/graphx