
GraphX: Graph Analytics in Spark

GraphX is a graph-parallel processing system built on Apache Spark. It provides APIs for representing graphs as property graphs with vertices and edges, and for performing graph-parallel computations and algorithms like PageRank, triangle counting, and connected components. The GraphX API allows users to create graphs from RDDs, transform graphs using operations like mapVertices and subgraph, and run graph algorithms and custom computations using the triplets view.


GraphX

Graph Analytics in Spark

Ankur Dave
Graduate Student, UC Berkeley AMPLab

Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica

Machine Learning Landscape

Model & Dependencies: Small & Dense | Sparse | Large & Dense
Architecture: MapReduce | Graph-Parallel | Parameter Server

Machine Learning Landscape

Model & Dependencies: Small & Dense | Sparse | Large & Dense
Architecture: Spark Dataflow Framework | GraphX | Parameter Server

Graphs
Social Networks
Web Graphs
User-Item Graphs






Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering

[Figure: the Users x Products ratings matrix approximated as the product of per-user factors f(i) and per-product factors f(j)]

Collaborative Filtering
[Figure: bipartite graph between User Factors and Product Factors, with factor vertices f(1) to f(5) joined by rating edges r13, r14, r24, r25]

$$f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left(r_{ij} - w^{T} f[j]\right)^{2} + \lambda \lVert w \rVert_2^2$$
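
A short derivation may help show why this is a graph-parallel computation: each vertex only needs the factors of its neighbors. The sketch below assumes the objective is the usual regularized least-squares problem; the regularization weight \lambda > 0 and the identity matrix I are assumptions, since the extracted slide text does not show them clearly.

```latex
\begin{aligned}
J(w) &= \sum_{j \in \mathrm{Nbrs}(i)} \left(r_{ij} - w^{T} f[j]\right)^{2}
        + \lambda \lVert w \rVert_2^2 \\
\nabla_w J(w) &= -2 \sum_{j \in \mathrm{Nbrs}(i)} \left(r_{ij} - w^{T} f[j]\right) f[j]
        + 2 \lambda w = 0 \\
f[i] &= \Big( \sum_{j \in \mathrm{Nbrs}(i)} f[j]\, f[j]^{T} + \lambda I \Big)^{-1}
        \sum_{j \in \mathrm{Nbrs}(i)} r_{ij}\, f[j]
\end{aligned}
```

For vertex 1 in the figure, whose rating edges are r13 and r14, this would read f[1] = (f[3] f[3]^T + f[4] f[4]^T + \lambda I)^{-1} (r13 f[3] + r14 f[4]), so the update touches only vertices 3 and 4.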
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms
Collaborative Filtering
  Alternating Least Squares
  Stochastic Gradient Descent
  Tensor Factorization

Structured Prediction
  Loopy Belief Propagation
  Max-Product Linear Programs
  Gibbs Sampling

Semi-supervised ML
  Graph SSL
  CoEM

Community Detection
  Triangle-Counting
  K-core Decomposition
  K-Truss

Graph Analytics
  PageRank
  Personalized PageRank
  Shortest Path
  Graph Coloring

Classification
  Neural Networks
Modern Analytics
[Figure: Wikipedia analytics pipeline. Raw Wikipedia XML is parsed into a Link Table (Title, Link) and an Editor Table (Editor, Title). The Link Table feeds a Hyperlinks graph and PageRank, yielding Top 20 Pages (Title, PR). The Editor Table feeds an Editor Graph and Community Detection, yielding User Community (User, Com.). A final Top Communities table has columns Com. and PR.]
[The same figure is repeated on two follow-up slides titled "Tables" and "Graphs".]
The GraphX API
Property Graphs

Vertex Property:
User Profile
Current PageRank Value

Edge Property:
Weights
Relationships
Timestamps
Creating a Graph (Scala)
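import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// GraphX defines the vertex id type as an alias:
// type VertexId = Long

// Vertex table: (id, name) pairs. `sc` is an existing SparkContext,
// for example the one provided by spark-shell.
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(List(
    (1L, "Alice"),
    (2L, "Bob"),
    (3L, "Charlie")))

// GraphX provides Edge; in simplified form:
// class Edge[ED](
//   val srcId: VertexId,
//   val dstId: VertexId,
//   val attr: ED)

// Edge table: directed edges labelled with a relationship
val edges: RDD[Edge[String]] =
  sc.parallelize(List(
    Edge(1L, 2L, "coworker"),
    Edge(2L, 3L, "friend")))

val graph = Graph(vertices, edges)

[Figure: the resulting graph, 1 Alice -coworker-> 2 Bob -friend-> 3 Charlie]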
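
For readers who want to run the slide outside spark-shell, here is a minimal standalone sketch. It assumes Spark and GraphX are on the classpath; the object name `CreateGraphSketch`, the app name, and the `local[*]` master are illustrative choices, not part of the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object CreateGraphSketch {
  // Builds the three-person graph from the slide on the given SparkContext.
  def build(sc: SparkContext): Graph[String, String] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(List((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
    val edges: RDD[Edge[String]] =
      sc.parallelize(List(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; the app name is arbitrary.
    val conf = new SparkConf().setAppName("create-graph-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    } finally {
      sc.stop()
    }
  }
}
```

Keeping graph construction in `build` and the context lifecycle in `main` makes it easy to reuse the same graph in a test or a notebook.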
Graph Operations (Scala)
class Graph[VD, ED] {
  // Table Views -----------------------
  def vertices: RDD[(VertexId, VD)]
  def edges: RDD[Edge[ED]]
  def triplets: RDD[EdgeTriplet[VD, ED]]

  // Transformations -------------------------------------------
  def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

  // Joins ----------------------------------------
  def outerJoinVertices[U, VD2]
      (tbl: RDD[(VertexId, U)])
      (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

  // Computation ----------------------------------
  def mapReduceTriplets[A](
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): RDD[(VertexId, A)]

Built-in Algorithms (Scala)
  // Continued from previous slide
  def pageRank(tol: Double): Graph[Double, Double]
  def triangleCount(): Graph[Int, ED]
  def connectedComponents(): Graph[VertexId, ED]
  // ...and more: org.apache.spark.graphx.lib
}

[Figures: PageRank, Triangle Count, Connected Components]
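
The operators and built-in algorithms listed on these two slides can be tried together on a small graph. The sketch below is illustrative only: the four-person graph, its vertex ids and edge directions, and the helper names are assumptions. It calls `partitionBy` before `triangleCount` because older Spark releases documented that as a precondition; newer releases may not need it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

object GraphOperatorsSketch {
  // Illustrative graph; ids and edge directions are assumptions.
  def people(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(1L, 3L, "relative"),
      Edge(2L, 3L, "friend"), Edge(3L, 4L, "relative")))
    Graph(vertices, edges)
  }

  // Transformation: replace each vertex property (a name) with its length.
  def nameLengths(g: Graph[String, String]): Graph[Int, String] =
    g.mapVertices((id, name) => name.length)

  // Join: attach each vertex's degree; vertices absent from the table get 0.
  def withDegrees(g: Graph[String, String]): Graph[(String, Int), String] =
    g.outerJoinVertices(g.degrees)((id, name, deg) => (name, deg.getOrElse(0)))

  // Built-in algorithms return a graph whose vertex property holds the result.
  def componentIds(g: Graph[String, String]): Map[VertexId, VertexId] =
    g.connectedComponents().vertices.collect().toMap

  def triangles(g: Graph[String, String]): Map[VertexId, Int] =
    g.partitionBy(PartitionStrategy.RandomVertexCut)
      .triangleCount().vertices.collect().toMap

  def ranks(g: Graph[String, String], tol: Double): Map[VertexId, Double] =
    g.pageRank(tol).vertices.collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graph-operators-sketch").setMaster("local[*]"))
    try {
      val g = people(sc)
      println(withDegrees(g).vertices.collect().mkString(", "))
      println(s"components: ${componentIds(g)}")
      println(s"triangles:  ${triangles(g)}")
      println(s"ranks:      ${ranks(g, 0.001)}")
    } finally {
      sc.stop()
    }
  }
}
```

Each helper returns a new graph or a collected map rather than mutating `g`, which mirrors how the listed operators all return fresh `Graph` values.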
The triplets view
class Graph[VD, ED] {
  def triplets: RDD[EdgeTriplet[VD, ED]]
}

class EdgeTriplet[VD, ED](
  val srcId: VertexId, val dstId: VertexId, val attr: ED,
  val srcAttr: VD, val dstAttr: VD)

[Figure: the graph 1 Alice -coworker-> 2 Bob -friend-> 3 Charlie next to its triplets RDD]

triplets RDD:
srcAttr   attr       dstAttr
Alice     coworker   Bob
Bob       friend     Charlie
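
As a small illustration of the triplets view, the sketch below turns every triplet of the three-person graph into a sentence. The object and method names are hypothetical, and the code assumes Spark with GraphX on the classpath. Each triplet is mapped to a string before collecting, so the result does not depend on how GraphX manages `EdgeTriplet` objects internally.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object TripletsSketch {
  // The three-person graph from the slides.
  def build(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
    Graph(vertices, edges)
  }

  // One sentence per edge, combining source, edge, and destination properties.
  def describe(g: Graph[String, String]): Array[String] =
    g.triplets
      .map(t => s"${t.srcAttr} is the ${t.attr} of ${t.dstAttr}")
      .collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("triplets-sketch").setMaster("local[*]"))
    try {
      describe(build(sc)).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```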
The subgraph transformation
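class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}

// In GraphX both predicates default to always-true, so either may be omitted.
// Keep only the edges that are not "relative"; every vertex is retained.
graph.subgraph(epred = (edge) => edge.attr != "relative")

[Figure: before, the graph has edges Alice-Bob (coworker), Alice-Charlie (relative), Bob-Charlie (friend), and Charlie-David (relative); after subgraph, only Alice-Bob (coworker) and Bob-Charlie (friend) remain, and all four vertices are kept]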


The subgraph transformation
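class Graph[VD, ED] {
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
}

// Drop the vertex named "Bob"; edges touching a dropped vertex are removed too.
graph.subgraph(vpred = (id, name) => name != "Bob")

[Figure: before, the graph has edges Alice-Bob (coworker), Alice-Charlie (relative), Bob-Charlie (friend), and Charlie-David (relative); after subgraph, Bob and his two edges are gone, leaving Alice-Charlie (relative) and Charlie-David (relative)]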
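
Both subgraph slides can be reproduced with a short sketch. The vertex ids, edge directions, and object name below are assumptions made for illustration; only the names and relationship labels come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SubgraphSketch {
  // Four-person graph; ids and edge directions are illustrative assumptions.
  def people(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(1L, 3L, "relative"),
      Edge(2L, 3L, "friend"), Edge(3L, 4L, "relative")))
    Graph(vertices, edges)
  }

  // Edge predicate only: drop "relative" edges, keep every vertex.
  def withoutRelatives(g: Graph[String, String]): Graph[String, String] =
    g.subgraph(epred = edge => edge.attr != "relative")

  // Vertex predicate only: drop Bob, which also drops the edges touching him.
  def withoutBob(g: Graph[String, String]): Graph[String, String] =
    g.subgraph(vpred = (id, name) => name != "Bob")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("subgraph-sketch").setMaster("local[*]"))
    try {
      val g = people(sc)
      println(withoutRelatives(g).edges.collect().mkString(", "))
      println(withoutBob(g).edges.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The named-argument style works because the predicate that is left out falls back to its always-true default.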


Computation with mapReduceTriplets
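class Graph[VD, ED] {
  // Note: upgrade to aggregateMessages in Spark 1.2.0
  def mapReduceTriplets[A](
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): RDD[(VertexId, A)]
}

// Compute vertex degrees: every edge sends 1 to both of its endpoints,
// and the messages arriving at each vertex are summed.
graph.mapReduceTriplets(
  edge => Iterator(
    (edge.srcId, 1),
    (edge.dstId, 1)),
  _ + _)

[Figure: mapReduceTriplets applied to the four-person graph, with edges Alice-Bob (coworker), Alice-Charlie (relative), Bob-Charlie (friend), and Charlie-David (relative), produces the RDD below]

vertex id   degree
Alice       2
Bob         2
Charlie     3
David       1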
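
Since the slide recommends moving to aggregateMessages from Spark 1.2.0 on, here is a sketch of the same degree computation with that operator. The four-person graph, its ids and edge directions, and the object name are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object DegreeSketch {
  // Four-person graph; ids and edge directions are illustrative assumptions.
  def people(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "coworker"), Edge(1L, 3L, "relative"),
      Edge(2L, 3L, "friend"), Edge(3L, 4L, "relative")))
    Graph(vertices, edges)
  }

  // Every edge sends the message 1 to both endpoints; messages are summed per vertex.
  def degrees(g: Graph[String, String]): Map[VertexId, Int] =
    g.aggregateMessages[Int](
      ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
      _ + _
    ).collect().toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("degree-sketch").setMaster("local[*]"))
    try {
      println(degrees(people(sc)))
    } finally {
      sc.stop()
    }
  }
}
```

Compared with mapReduceTriplets, the send function writes messages through the edge context instead of returning an iterator, while the merge function plays the same role as mergeMsg.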
How GraphX Works
Encoding Property Graphs as RDDs
[Figure: a property graph on vertices A to F is divided by a vertex cut into Part. 1 and Part. 2 on Machine 1 and Machine 2, and is stored as three RDDs: a Vertex Table, a Routing Table, and an Edge Table]
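
To make the three-table encoding more concrete, here is a rough sketch that uses plain Scala collections in place of RDDs. It is not GraphX's implementation: the `Edge` case class, the `edgePartition` placement rule, and the sample edges are all hypothetical. It only illustrates the idea that edges are split across partitions and that a routing table records which edge partitions refer to each vertex.

```scala
object RoutingTableSketch {
  type VertexId = Long
  final case class Edge(srcId: VertexId, dstId: VertexId)

  // Hypothetical placement rule: assign an edge by its source id.
  // GraphX offers several partition strategies; this is only a stand-in.
  def edgePartition(e: Edge, numParts: Int): Int = (e.srcId % numParts).toInt

  // Edge table: partition id -> edges stored in that partition.
  def edgeTable(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(e => edgePartition(e, numParts))

  // Routing table: vertex id -> edge partitions that mention the vertex.
  // A vertex listed under several partitions is one that the vertex cut splits.
  def routingTable(parts: Map[Int, Seq[Edge]]): Map[VertexId, Set[Int]] =
    parts.toSeq
      .flatMap { case (pid, es) => es.flatMap(e => Seq(e.srcId -> pid, e.dstId -> pid)) }
      .groupBy(_._1)
      .map { case (vid, pairs) => vid -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1L, 2L), Edge(1L, 3L), Edge(2L, 3L), Edge(3L, 4L))
    val parts = edgeTable(edges, numParts = 2)
    routingTable(parts).toSeq.sortBy(_._1).foreach { case (vid, pids) =>
      println(s"vertex $vid -> edge partitions ${pids.toSeq.sorted.mkString(",")}")
    }
  }
}
```

With a routing table like this, a vertex property only has to be shipped to the partitions that actually hold edges for that vertex.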
Graph System Optimizations
Specialized Data-Structures
Vertex-Cuts Partitioning
Remote Caching / Mirroring
Message Combiners
Active Set Tracking


PageRank Benchmark
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE

[Figure: two bar charts of PageRank runtime (seconds), one per dataset: Twitter Graph (42M vertices, 1.5B edges), annotated 7x, and UK-Graph (106M vertices, 3.7B edges), annotated 18x]

GraphX performs comparably to state-of-the-art graph processing systems.
Future of GraphX
1. Language support
a) Java API: PR #3234
b) Python API: collaborating with Intel, SPARK-3789

2. More algorithms
a) LDA (topic modeling): PR #2388
b) Correlation clustering
c) Your algorithm here?

3. Speculative
a) Streaming/time-varying graphs
b) Graph database-like queries
Thanks!
http://spark.apache.org/graphx
