Spark GraphX
GraphX introduction
• Data is generally stored and processed as a collection of
records or rows. It is represented as a two-dimensional
table with data divided into rows and columns.
• However, collections or tables are not the only way to
represent data. Sometimes, a graph provides a better
representation of data than a collection.
• For example, the Internet is a large graph of
interconnected computers, routers, and switches. The
World Wide Web is a large graph. Web pages connected by
hypertext links form a graph. Social networks on sites such
as Facebook, LinkedIn, and Twitter are graphs.
Transportation networks connecting hubs such as airports, train
terminals, and bus stops can also be represented as graphs.
GraphX introduction
• A graph provides an easy-to-understand and intuitive
model for working with data.
• In addition, specialized graph algorithms are available
for processing graph-oriented data.
• These algorithms provide efficient tools for different
analytics tasks.
• Spark GraphX provides an efficient library for processing
large-scale graph-oriented data.
Spark GraphX
• Spark provides the GraphX API for graphs and graph-parallel
computation.
• It comes with a growing collection of graph algorithms and also
includes graph builders that simplify graph analytics tasks.
• Basically, it extends the Spark RDD with a Resilient Distributed
Property Graph.
• The property graph is a directed multigraph that can have multiple
edges in parallel. Every vertex and edge has user-defined
properties associated with it, and parallel edges allow multiple
relationships between the same pair of vertices.
Spark GraphX Features
• Flexibility: With Spark GraphX we can work with both graphs and
collections in a single system. This covers exploratory analysis, ETL
(Extract, Transform & Load), and iterative graph computation.
• It is possible to view the same data as both a graph and a collection,
and to transform and join graphs with RDDs.
• Using the Pregel API it is also possible to write custom iterative
graph algorithms.
• Speed: GraphX performs comparably to the fastest specialized graph
processing systems, while retaining Spark’s flexibility, fault tolerance,
and ease of use.
Spark GraphX Features
• Growing Algorithm Library: Spark GraphX offers a
growing library of graph algorithms. Popular algorithms
include PageRank, connected components, strongly
connected components, and triangle count.
Property Graph
• A property graph is a directed multigraph in which data is associated
with the vertices and the edges. Each vertex of a property graph has
properties (attributes). Similarly, each edge is associated with a label
and properties.
• A directed multigraph with user-defined objects attached to each
vertex and edge is a property graph.
• It is a graph that can have multiple parallel edges sharing the same
source and destination vertex, so it supports multiple relationships
between the same pair of vertices.
• Each vertex is keyed with a 64-bit long identifier (VertexId).
• Like RDDs, property graphs are immutable, distributed,
and fault-tolerant.
Property Graph Example
An example of a property graph is a graph representing a social
network on Twitter.
Example of Property Graph
• Suppose we want to construct a property graph consisting of the
various collaborators on the GraphX project.
• The vertex property might contain the username and occupation.
• We could annotate edges with a string describing the relationships
between collaborators:
Property Graph Construction
Assume the SparkContext has already been constructed.
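The construction code itself did not survive on this slide; the following is a sketch in the spirit of the GraphX programming guide's collaborator example (the usernames, occupations, and relationship labels are illustrative):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Vertices: (VertexId, (username, occupation))
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Edges annotated with a string describing the relationship between collaborators
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Default vertex property, used when an edge refers to a missing vertex
val defaultUser = ("John Doe", "Missing")

// Build the initial property graph
val graph = Graph(users, relationships, defaultUser)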
GraphX Library
GraphX API
• First, we need to import Spark and GraphX into our
project:
import org.apache.spark._
import org.apache.spark.graphx._
GraphX API
• The GraphX API provides data types for representing graph-
oriented data and operators for graph analytics.
• Just as RDDs have basic operations like map, filter,
and reduceByKey, property graphs also have a collection of basic
operators that take user-defined functions and produce new graphs
with transformed properties and structure.
• It also provides an implementation of Google's Pregel API.
• These operators simplify graph analytics tasks.
• Since GraphX is integrated with Spark, a GraphX user has access to
both the GraphX and Spark APIs, including the RDD and DataFrame
APIs.
• For example, we can compute the in-degree of each vertex (defined
in GraphOps) as follows.
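A minimal sketch, assuming graph is the collaborator graph built above:
// Number of edges pointing to each vertex
val inDegrees: VertexRDD[Int] = graph.inDegrees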
Data Types
• The key data types provided by GraphX for working with property
graphs include VertexRDD, Edge, EdgeRDD, EdgeTriplet, and
Graph.
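As a rough orientation, the Graph class exposes the vertex and edge data through the following members (a simplified view of the API):
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]              // the vertices with their attributes
  val edges: EdgeRDD[ED]                   // the edges with their attributes
  val triplets: RDD[EdgeTriplet[VD, ED]]   // edges joined with both endpoint attributes
}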
Data Types
• Edge: The Edge class abstracts a directed edge in a property graph.
An instance of the Edge class contains the source vertex id, the
destination vertex id, and the edge attribute.
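Conceptually, Edge can be pictured as follows (a simplified sketch; the real class also provides default values and helper methods):
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

// Example: a directed edge from vertex 1 to vertex 2 carrying a String attribute
val e = Edge(1L, 2L, "follows")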
Data Types
• EdgeContext: It combines EdgeTriplet with methods to send
messages to source and destination vertices of an edge.
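A simplified view of the EdgeContext members that are typically used inside aggregateMessages:
abstract class EdgeContext[VD, ED, A] {
  def srcId: VertexId          // id of the source vertex
  def dstId: VertexId          // id of the destination vertex
  def srcAttr: VD              // attribute of the source vertex
  def dstAttr: VD              // attribute of the destination vertex
  def attr: ED                 // attribute of the edge itself
  def sendToSrc(msg: A): Unit  // send a message to the source vertex
  def sendToDst(msg: A): Unit  // send a message to the destination vertex
}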
Property Graph Creation
• The cities and the distances between them are given: the cities are
the vertices and the distances between them are the edge attributes.
We have to create a property graph.
Graph Creation
• To get started, launch the Spark shell:
$ /path/to/spark/bin/spark-shell
• Once you are inside the Spark shell, import the GraphX library:
import org.apache.spark.graphx._
• Create an array of vertices with the attributes city name and
population:
val verArray = Array(
  (1L, ("Philadelphia", 1580863)),
  (2L, ("Baltimore", 620961)),
  (3L, ("Harrisburg", 49528)),
  (4L, ("Wilmington", 70851)),
  (5L, ("New York", 8175133)),
  (6L, ("Scranton", 76089)))
Graph Creation
• To create the edges array, type in the Spark shell:
val edgeArray = Array(
  Edge(2L, 3L, 113), Edge(2L, 4L, 106), Edge(3L, 4L, 128),
  Edge(3L, 5L, 248), Edge(3L, 6L, 162), Edge(4L, 1L, 39),
  Edge(1L, 6L, 168), Edge(1L, 5L, 130), Edge(5L, 6L, 159))
• Next, create RDDs from the vertices and edges arrays by using the
sc.parallelize() command.
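The parallelize step is not shown on the slide; it would look roughly like this:
// Distribute the local arrays as RDDs
val verRDD = sc.parallelize(verArray)
val edgeRDD = sc.parallelize(edgeArray)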
Graph Creation
• Finally, build a property graph
val graph = Graph(verRDD, edgeRDD)
Filter operation
• Find the cities with a population of more than 50,000:
graph.vertices.filter { case (id, (city, population)) => population > 50000 }
  .collect.foreach { case (id, (city, population)) =>
    println(s"The population of $city is $population") }
triplets RDD
• There is one triplet for each edge; it contains information
about both the vertices and the edge itself.
• We can find the distances between the connected cities
by using graph.triplets.collect
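For example, one way to print the distances (a sketch; the attribute names follow the city graph built earlier):
// Each triplet exposes srcAttr, dstAttr, and the edge attribute
graph.triplets.collect.foreach { t =>
  println(s"The distance between ${t.srcAttr._1} and ${t.dstAttr._1} is ${t.attr} km")
}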
Filtration by edges
• We want to find the cities between which the distance is
less than 150 kilometers. Type in the Spark shell:
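The code did not survive on the slide; a sketch of the edge filter on the city graph:
// Keep only triplets whose distance attribute is below 150 km and print the city names
graph.triplets.filter(t => t.attr < 150).collect.foreach { t =>
  println(s"${t.srcAttr._1} and ${t.dstAttr._1} are ${t.attr} km apart")
}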
Aggregation
• We can find the total population of the neighboring cities
by using the aggregateMessages operator.
• GraphX deals only with directed graphs, so to take edges in both
directions into account we should add the reverse directions to the
graph.
• Take a union of the reversed edges and the original ones:
val undirectedEdgeRDD = graph.reverse.edges.union(graph.edges)
val graph = Graph(verRDD, undirectedEdgeRDD)
Perform the aggregation:
val neighbors = graph.aggregateMessages[Int](ectx =>
ectx.sendToSrc(ectx.dstAttr._2), _ + _)
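To read the result, the aggregated values can be joined back with the vertex attributes (a sketch):
// neighbors is a VertexRDD[Int] holding the total neighboring population per vertex
neighbors.join(graph.vertices).collect.foreach {
  case (id, (totalPop, (city, population))) =>
    println(s"Total population of the cities neighboring $city: $totalPop")
}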
GraphX Operators
• Basic Operators
• numEdges
• numVertices
• inDegrees
• outDegrees
• degrees
• Property Operators (a sketch follows this list)
  • mapVertices
  • mapEdges
  • mapTriplets
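A quick sketch of the property operators on the city graph from earlier; they return a new graph with transformed attributes while leaving the structure untouched:
// mapVertices: keep only the city name as the vertex attribute
val nameGraph = graph.mapVertices { case (id, (city, population)) => city }

// mapEdges: convert the distance attribute from kilometers to miles (illustrative)
val milesGraph = graph.mapEdges(e => e.attr * 0.621371)

// mapTriplets: label each edge with a readable description of the connection
val labeledGraph = graph.mapTriplets(t => s"${t.srcAttr._1} -> ${t.dstAttr._1}: ${t.attr} km")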
GraphX Operators
• Structural Operators
  • reverse
  • subgraph
  • mask
  • groupEdges
• Join Operators
  • joinVertices
  • outerJoinVertices
Graph Creation
Create an RDD of (id, user) pairs:
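The code was not preserved on the slide; a sketch with a hypothetical User case class and made-up names and ages (vertex 11 is deliberately left without properties, as discussed on the next slide):
// Vertex property type; the name and age fields are illustrative
case class User(name: String, age: Int)

// Hypothetical users for vertex ids 1 through 10
val users = List(
  (1L, User("Alex", 26)),  (2L, User("Bill", 42)),  (3L, User("Carol", 18)),
  (4L, User("Dave", 16)),  (5L, User("Eve", 45)),   (6L, User("Farrell", 30)),
  (7L, User("Gary", 32)),  (8L, User("Harry", 36)), (9L, User("Ivan", 28)),
  (10L, User("Jill", 48)))

val usersRDD = sc.parallelize(users)
usersRDD.collect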
Graph Creation
• Next, let us create an RDD of connections (edges) between users.
An edge attribute can be of any type, including composite types;
however, to keep things simple, we assign a single attribute of type
Int to each edge.
val follows = List(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1),
  Edge(3L, 5L, 1), Edge(4L, 5L, 1), Edge(6L, 5L, 1), Edge(7L, 6L, 1),
  Edge(6L, 8L, 1), Edge(7L, 8L, 1), Edge(7L, 9L, 1), Edge(9L, 8L, 1),
  Edge(8L, 10L, 1), Edge(10L, 9L, 1), Edge(1L, 11L, 1))
val followsRDD = sc.parallelize(follows)
followsRDD.collect
Graph Creation
• Note that there is an edge connecting vertex with id 1 to
vertex with id 11 (Edge(1L, 11L, 1)). However, the vertex
with id 11 does not have any property.
• GraphX allows you to handle such cases by creating a
default set of properties. It will assign the default
properties to the vertices that have not been explicitly
assigned any properties:
• val defaultUser = User("NA", 0)
Graph Creation
• Now we have all the components required to
construct a property graph:
val socialGraph = Graph(usersRDD, followsRDD,
defaultUser)
Find Graph Information
• Next, we briefly describe how to find useful
information about a property graph.
• You can find the number of edges in a property graph,
as shown next.
val numEdges = socialGraph.numEdges
• You can find the number of vertices in a property
graph, as shown next.
val numVertices = socialGraph.numVertices
Graph Information
• Now, we show how to find the number of edges
terminating at a vertex:
val inDegrees = socialGraph.inDegrees
inDegrees.collect
• Now, we show how to find the number of edges
originating from a vertex:
val outDegrees = socialGraph.outDegrees
outDegrees.collect
Graph Information
• Next, we find the degree of each vertex:
val degrees = socialGraph.degrees
degrees.collect
• We can obtain collection views of the vertices, edges, and triplets in a property
graph.
• Vertices:
val vertices = socialGraph.vertices
vertices.collect
• Edges:
val edges = socialGraph.edges
edges.collect
• Triplets:
val triplets = socialGraph.triplets
triplets.take(3)
Structural Operators
class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  // subgraph, mask, and groupEdges are also defined here (see the operator list above)
}
Reverse Operators
• The reverse operator returns a new graph with all the edge directions
reversed.
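On the social graph this simply flips who follows whom (a sketch):
// Every Edge(src, dst, attr) becomes Edge(dst, src, attr); vertex properties are unchanged
val reversedGraph = socialGraph.reverse
reversedGraph.edges.take(3)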
Subgraph
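The slide content was not preserved; as a sketch, subgraph takes vertex and/or edge predicates and returns the graph restricted to the vertices and edges that satisfy them (the age field assumes the hypothetical User class defined earlier):
// Keep only users older than 30 and the follow edges between them
val olderUsers = socialGraph.subgraph(vpred = (id, user) => user.age > 30)
olderUsers.vertices.collect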
Join Operators
class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])
      (map: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])
      (map: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
}
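A sketch of outerJoinVertices on the social graph, attaching each user's follower count (vertices missing from the joined RDD receive None, which we turn into 0):
// Pair every user with their in-degree; users with no followers get 0
val graphWithFollowers = socialGraph.outerJoinVertices(socialGraph.inDegrees) {
  (id, user, inDegreeOpt) => (user, inDegreeOpt.getOrElse(0))
}
graphWithFollowers.vertices.take(3)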
Connected Components
• The connected components algorithm labels each connected
component of the graph with the ID of its lowest-numbered vertex.
GraphX contains an implementation of the algorithm in the
ConnectedComponents object.
PageRank Algorithm
• PageRank measures the importance of each vertex in a graph,
assuming an edge from u to v represents an endorsement of v’s
importance by u. For example, if a Twitter user is followed by many
others, the user will be ranked highly.
import org.apache.spark.graphx.GraphLoader
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map { case (id, (username, rank)) =>
  (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
Connected component
import org.apache.spark.graphx.GraphLoader
// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map { case (id, (username, cc)) =>
  (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))
Triangle Counting
• GraphX implements a triangle counting algorithm in the
TriangleCount object that determines the number of triangles
passing through each vertex, providing a measure of clustering.
Triangle Counting
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
// Load the edges in canonical order and partition the graph for triangle count
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt", true)
.partitionBy(PartitionStrategy.RandomVertexCut)
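The slide stops here; the example in the GraphX programming guide continues roughly as follows:
// Find the triangle count for each vertex
val triCounts = graph.triangleCount().vertices
// Join the triangle counts with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
  (username, tc)
}
// Print the result
println(triCountByUsername.collect().mkString("\n"))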
https://spark.apache.org/graphx/