Spark GraphX


GraphX introduction
• Data is generally stored and processed as a collection of
records or rows. It is represented as a two-dimensional
table with data divided into rows and columns.
• However, collections or tables are not the only way to
represent data. Sometimes, a graph provides a better
representation of data than a collection.
• For example, the Internet is a large graph of
interconnected computers, routers, and switches. The
World Wide Web is a large graph. Web pages connected by
hypertext links form a graph. Social networks on sites such
as Facebook, LinkedIn, and Twitter are graphs.
Transportation hubs such as airports, train terminals, and
bus stops can also be represented by a graph.

GraphX introduction
• A graph provides an easy-to-understand and intuitive
model for working with data.
• In addition, specialized graph algorithms are available
for processing graph-oriented data.
• These algorithms provide efficient tools for different
analytics tasks.
• Spark GraphX provides an efficient library for processing
large-scale graph-oriented data.

Spark GraphX
• GraphX is the Spark API for graphs and graph-parallel computation.
• It includes a growing collection of graph algorithms and graph
builders that simplify graph analytics tasks.
• It extends the Spark RDD with a Resilient Distributed
Property Graph.
• The property graph is a directed multigraph, so it can have multiple
parallel edges. Every vertex and edge has user-defined
properties associated with it, and parallel edges allow
multiple relationships between the same vertices.

Spark GraphX Features
• Flexibility: Spark GraphX works with both graphs and collections.
It unifies exploratory analysis, ETL (Extract, Transform & Load),
and iterative graph computation in a single system.
• It is possible to view the same data as both graphs and collections,
and to transform and join graphs with RDDs.
• Using the Pregel API, it is possible to write custom iterative
graph algorithms.
• Speed: GraphX delivers performance comparable to the fastest
specialized graph systems while retaining Spark's flexibility,
fault tolerance, and ease of use.

Spark GraphX Features
• Growing Algorithm Library: Spark GraphX offers a
growing library of graph algorithms. It includes popular
algorithms such as PageRank, connected components,
strongly connected components, and triangle count.

Property Graph
• A property graph is a directed multigraph with user-defined data
attached to each vertex and edge. Each vertex has properties
(attributes), and each edge has a label and properties.
• A multigraph can contain multiple parallel edges that share
the same source and destination vertex.
• Parallel edges make it possible to model multiple relationships
between the same pair of vertices.
• Each vertex is keyed by a unique 64-bit long identifier (VertexId).
• Like RDDs, property graphs are immutable, distributed,
and fault-tolerant.
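
The multigraph property can be checked directly. The following is a minimal sketch, assuming a Spark shell (paste it with :paste) or a local Spark installation; the vertex names and edge labels are made up for illustration and do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Hypothetical vertices: (VertexId, name)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))

// Two parallel edges: same source and destination, different relationships.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(1L, 2L, "colleague")))

val g = Graph(vertices, edges)

// Both edges are kept, because the graph is a multigraph.
println(s"vertices = ${g.numVertices}, edges = ${g.numEdges}")

val labels = g.edges.filter(e => e.srcId == 1L && e.dstId == 2L).
  map(_.attr).collect().sorted
println(labels.mkString(", "))
```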

Property Graph Example
An example of a property graph is a graph representing a social
network on Twitter.

Example of Property Graph
• Suppose we want to construct a property graph consisting of the
various collaborators on the GraphX project.
• The vertex property might contain the username and occupation.
• We could annotate edges with a string describing the relationships
between collaborators:
Property Graph Construction
// Assume the SparkContext has already been constructed
val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for the edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")

// Build the initial graph
val graph = Graph(users, relationships, defaultUser)
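
Once the graph is built, its vertex and edge collections can be queried like ordinary RDDs. The sketch below rebuilds the collaborator graph so that it runs on its own, assuming a Spark shell (:paste mode) or a local Spark installation; the two queries are illustrative and not part of the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships, ("John Doe", "Missing"))

// Count the users whose occupation is "postdoc".
val postdocs = graph.vertices.filter {
  case (_, (_, occupation)) => occupation == "postdoc"
}.count()

// Count the edges whose source id is greater than the destination id.
val backEdges = graph.edges.filter(e => e.srcId > e.dstId).count()

println(s"postdocs = $postdocs, edges with srcId > dstId = $backEdges")
```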

GraphX Library

GraphX API
• First, we need to import Spark and GraphX into our
project.

import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD

GraphX API
• The GraphX API provides data types for representing
graph-oriented data and operators for graph analytics.
• Just as RDDs have basic operations like map, filter,
and reduceByKey, property graphs also have a collection of basic
operators that take user-defined functions and produce new graphs
with transformed properties and structure.
• It also provides an optimized variant of Google's Pregel API.
• These operators simplify graph analytics tasks.
• Since GraphX is integrated with Spark, a GraphX user has access to
both GraphX and Spark APIs, including the RDD and DataFrame
APIs.
• For example, the in-degree of each vertex can be computed with
the inDegrees operator (defined in GraphOps).
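
A minimal sketch of the in-degree computation, assuming a Spark shell (:paste mode) or a local Spark installation. The three-edge graph is a made-up example; Graph.fromEdgeTuples is used only to keep the setup short.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Hypothetical edges 1 -> 2, 3 -> 2 and 2 -> 3; every vertex gets the attribute 0.
val graph = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (3L, 2L), (2L, 3L))), defaultValue = 0)

// inDegrees comes from GraphOps and is available on every Graph.
val inDegrees: VertexRDD[Int] = graph.inDegrees
val result = inDegrees.collect().toMap

// Vertices without incoming edges (vertex 1 here) have no entry in the result.
result.toSeq.sortBy(_._1).foreach { case (id, deg) =>
  println(s"vertex $id has in-degree $deg")
}
```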

Data Types
• The key data types provided by GraphX for working with property
graphs include VertexRDD, Edge, EdgeRDD, EdgeTriplet, and
Graph.

• VertexRDD: VertexRDD represents a distributed collection of
vertices in a property graph. VertexRDD stores only one entry for
each vertex.

• Each vertex is represented by a key-value pair, where the key is a
unique id and the value is the data associated with the vertex. The data
type of the key is VertexId, which is essentially a 64-bit Long. The
value can be of any type.

Data Types
• Edge: The Edge class abstracts a directed edge in a property graph.
An instance of the Edge class contains the source vertex id, the
destination vertex id, and the edge attributes.

• EdgeRDD: EdgeRDD represents a distributed collection of the
edges in a property graph.

• EdgeTriplet: EdgeTriplet represents a combination of an edge and
the two vertices that it connects. It stores the attributes of the edge
and of both vertices, together with the unique
identifiers of the source and destination vertices.
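
The triplet view can be sketched as follows, assuming a Spark shell (:paste mode) or a local Spark installation. The data is a three-user subset of the collaborator example, and the sentence format is only an illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

val users = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof"))))
val relationships = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(5L, 7L, "pi")))
val graph = Graph(users, relationships)

// Each triplet exposes srcAttr, attr and dstAttr (plus srcId and dstId).
// The triplets are mapped to strings before collecting them to the driver.
val facts = graph.triplets.map { t =>
  s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}"
}.collect()

facts.foreach(println)
```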

Data Types
• EdgeContext: It combines EdgeTriplet with methods to send
messages to the source and destination vertices of an edge.

• Graph: An instance of the Graph class represents a property graph.
Like an RDD, it is immutable, distributed, and fault-tolerant.

• GraphX partitions and distributes a graph across a cluster using
vertex partitioning heuristics.

Property Graph Creation
• The cities and the distances between them are given. The cities are the
vertices and the distances between them are the edges. We have
to create a property graph.

Graph Creation
• To get started, launch the Spark shell:
$ /path/to/spark/bin/spark-shell
• Once you are inside the Spark shell, import the GraphX library:
import org.apache.spark.graphx._
• Create an array of vertices with the attributes city name and
population:
val verArray = Array((1L, ("Philadelphia", 1580863)),
(2L, ("Baltimore", 620961)), (3L, ("Harrisburg", 49528)),
(4L, ("Wilmington", 70851)), (5L, ("New York", 8175133)),
(6L, ("Scranton", 76089)))

Graph Creation
• To create the edges array, type in the Spark shell:
val edgeArray = Array(Edge(2L, 3L, 113), Edge(2L, 4L, 106),
Edge(3L, 4L, 128), Edge(3L, 5L, 248), Edge(3L, 6L, 162),
Edge(4L, 1L, 39), Edge(1L, 6L, 168), Edge(1L, 5L, 130),
Edge(5L, 6L, 159))

• Next, create RDDs from the vertices and edges arrays by using the
sc.parallelize() command:
val verRDD = sc.parallelize(verArray)
val edgeRDD = sc.parallelize(edgeArray)

Graph Creation
• Finally, build a property graph:
val graph = Graph(verRDD, edgeRDD)

Filter operation
• Find the cities with a population of more than 50000:
graph.vertices.filter { case (id, (city, population)) => population > 50000 }.
  collect.foreach { case (id, (city, population)) =>
    println(s"The population of $city is $population") }

triplets RDD
• There is one triplet for each edge. It contains the attributes of
the edge and of both vertices that the edge connects.
• We can find the distances between the connected cities
by using graph.triplets.collect:

for (triplet <- graph.triplets.collect) {
  println(s"The distance between ${triplet.srcAttr._1} and ${triplet.dstAttr._1} is ${triplet.attr} kilometers")
}

Filtration by edges
• Suppose we want to find the pairs of cities that are less than
150 kilometers apart. Type in the Spark shell:

graph.edges.filter { case Edge(city1, city2, distance) => distance < 150 }.
  collect.foreach { case Edge(city1, city2, distance) =>
    println(s"The distance between $city1 and $city2 is $distance") }

• Note that city1 and city2 are bound to the vertex ids of the two
cities, not to their names.

Aggregation
• We can find the total population of the neighboring cities
by using the aggregateMessages operator.
• GraphX deals only with directed graphs. To take into account
edges in both directions, we should add the reverse directions to the
graph.
• Take a union of the reversed edges and the original ones:
val undirectedEdgeRDD = graph.reverse.edges.union(graph.edges)
val undirectedGraph = Graph(verRDD, undirectedEdgeRDD)
• Perform the aggregation:
val neighbors = undirectedGraph.aggregateMessages[Int](ectx =>
  ectx.sendToSrc(ectx.dstAttr._2), _ + _)
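
An alternative to adding reversed edges is to send a message to both endpoints of every edge. The following is a minimal sketch of that variant, assuming a Spark shell (:paste mode) or a local Spark installation; it uses a three-city subset of the data above, and the name neighborPop is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

val cities = sc.parallelize(Seq(
  (1L, ("Philadelphia", 1580863)),
  (4L, ("Wilmington", 70851)),
  (5L, ("New York", 8175133))))
val roads = sc.parallelize(Seq(Edge(4L, 1L, 39), Edge(1L, 5L, 130)))
val graph = Graph(cities, roads)

// For every edge, each endpoint receives the population of the other endpoint.
// The merge function (_ + _) sums the messages that arrive at the same vertex.
val neighborPop: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => {
    ctx.sendToSrc(ctx.dstAttr._2)
    ctx.sendToDst(ctx.srcAttr._2)
  },
  _ + _)

val totals = neighborPop.collect().toMap
totals.toSeq.sortBy(_._1).foreach { case (id, pop) =>
  println(s"neighbors of vertex $id have a total population of $pop")
}
```

Sending in both directions gives the same sums as the reversed-edge union when the original graph has no edge pair that already runs in both directions.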

GraphX Operators
• Basic Operators
  • numEdges
  • numVertices
  • inDegrees
  • outDegrees
  • degrees

• Property Operators
  • mapVertices
  • mapEdges
  • mapTriplets

GraphX Operators
• Structural Operators
  • reverse
  • subgraph
  • mask
  • groupEdges

• Join Operators
  • joinVertices
  • outerJoinVertices
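
The property operators change attributes while leaving the graph structure untouched. A minimal sketch, assuming a Spark shell (:paste mode) or a local Spark installation; the conversion from kilometers to meters is only an illustrative transformation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

val cities = sc.parallelize(Seq(
  (1L, ("Philadelphia", 1580863)),
  (4L, ("Wilmington", 70851)),
  (5L, ("New York", 8175133))))
val roads = sc.parallelize(Seq(Edge(4L, 1L, 39), Edge(1L, 5L, 130)))
val graph = Graph(cities, roads)

// mapVertices replaces each vertex attribute: keep only the city name.
// mapEdges replaces each edge attribute: convert kilometers to meters.
val renamed: Graph[String, Int] =
  graph.mapVertices((_, attr) => attr._1).mapEdges(e => e.attr * 1000)

val names = renamed.vertices.map(_._2).collect().toSet
val meters = renamed.edges.map(_.attr).collect().toSet
println(names)
println(meters)
```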

Graph Creation
• Define a User type and create an RDD of id and user pairs:

// Assumed definition: the slides use User(name, age) without showing it
case class User(name: String, age: Int)

val users = List((1L, User("Alex", 26)), (2L, User("Bill", 42)),
(3L, User("Carol", 18)), (4L, User("Dave", 16)), (5L, User("Eve", 45)),
(6L, User("Farell", 30)), (7L, User("Garry", 32)), (8L, User("Harry", 36)),
(9L, User("Ivan", 28)), (10L, User("Jill", 48)))
val usersRDD = sc.parallelize(users)
usersRDD.collect

Graph Creation
• Next, let us create an RDD of connections (edges) between users.
An edge can have any number of attributes; however, to keep
things simple, we assign a single attribute of type Int to each
edge.
val follows = List(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L,
1), Edge(3L, 4L, 1), Edge(3L, 5L, 1), Edge(4L, 5L, 1), Edge(6L,
5L, 1), Edge(7L, 6L, 1), Edge(6L, 8L, 1), Edge(7L, 8L, 1),
Edge(7L, 9L, 1), Edge(9L, 8L, 1), Edge(8L, 10L, 1), Edge(10L,
9L, 1), Edge(1L, 11L, 1))
val followsRDD = sc.parallelize(follows)
followsRDD.collect

Graph Creation
• Note that there is an edge connecting the vertex with id 1 to
the vertex with id 11 (Edge(1L, 11L, 1)). However, the vertex
with id 11 does not have any properties.
• GraphX allows you to handle such cases by creating a
default set of properties. It will assign the default
properties to the vertices that have not been explicitly
assigned any properties:
val defaultUser = User("NA", 0)

Graph Creation
• Now we have all the components required to
construct a property graph:
val socialGraph = Graph(usersRDD, followsRDD, defaultUser)

Find Graph Information
• Next, we briefly describe how to find useful
information about a property graph.
• You can find the number of edges in a property graph,
as shown next.
val numEdges = socialGraph.numEdges
• You can find the number of vertices in a property
graph, as shown next.
val numVertices = socialGraph.numVertices

Graph Information
• Now, we show how to find the number of edges
terminating at each vertex:
val inDegrees = socialGraph.inDegrees
inDegrees.collect
• Next, we show how to find the number of edges
originating from each vertex:
val outDegrees = socialGraph.outDegrees
outDegrees.collect

Graph Information
• Next, we find the degree of each vertex, that is, the number of
edges that start or end at it:
val degrees = socialGraph.degrees
degrees.collect
• We can obtain a collections view of the vertices, edges, and triplets in a property
graph.
• Vertices:
val vertices = socialGraph.vertices
vertices.collect
• Edges:
val edges = socialGraph.edges
edges.collect
• Triplets:
val triplets = socialGraph.triplets
triplets.take(3)

Structural Operators
class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
}

Reverse Operators
• The reverse operator returns a new graph with all the edge directions
reversed.

• This can be useful when, for example, computing the inverse PageRank.

• Because the reverse operation does not modify vertex or edge
properties or change the number of edges, it can be implemented
efficiently without data movement or duplication.
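
The effect of reverse can be sketched on a tiny graph, assuming a Spark shell (:paste mode) or a local Spark installation; the two edges are made up for illustration. After reversing, the in-degrees of the new graph equal the out-degrees of the original.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Hypothetical edges 1 -> 2 and 1 -> 3.
val graph = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (1L, 3L))), defaultValue = 0)

val reversed = graph.reverse

// The edges now point from 2 to 1 and from 3 to 1.
val reversedPairs = reversed.edges.map(e => (e.srcId, e.dstId)).collect().toSet
val outBefore = graph.outDegrees.collect().toMap
val inAfter = reversed.inDegrees.collect().toMap

println(reversedPairs)
println(s"out-degrees before: $outBefore, in-degrees after: $inAfter")
```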

Subgraph

• The subgraph operator takes vertex and edge predicates and
returns the graph containing only the vertices that satisfy the
vertex predicate (evaluate to true) and the edges that satisfy the edge
predicate and connect vertices that satisfy the vertex predicate.
• The subgraph operator can be used in a number of situations to
restrict the graph to the vertices and edges of interest or to eliminate
broken links.
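
Eliminating broken links can be sketched as follows, assuming a Spark shell (:paste mode) or a local Spark installation. The users and the ("NA", 0) default attribute are made-up illustration data: vertex 3 appears only in an edge, so it receives the default attribute and is then filtered out.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

val users = sc.parallelize(Seq((1L, ("alice", 30)), (2L, ("bob", 25))))
// Vertex 3 has no user record, so the edge 2 -> 3 is a broken link.
val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph = Graph(users, follows, ("NA", 0))

// Keep only the vertices that carry real user data. Edges that touch a
// removed vertex are dropped automatically.
val validGraph = graph.subgraph(vpred = (id, attr) => attr._1 != "NA")

println(s"before: ${graph.numVertices} vertices, ${graph.numEdges} edges")
println(s"after:  ${validGraph.numVertices} vertices, ${validGraph.numEdges} edges")
```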

Join Operators
class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])
                     (map: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])
                               (map: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
}
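
The difference between the two join operators can be sketched as follows, assuming a Spark shell (:paste mode) or a local Spark installation. The graph and the bonus table are made up for illustration. joinVertices keeps the vertex type and leaves unmatched vertices unchanged, while outerJoinVertices passes an Option to the function and may change the vertex type.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the spark-shell context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Hypothetical edges 1 -> 2, 1 -> 3 and 2 -> 3; every vertex starts with 0.
val graph = Graph.fromEdgeTuples(
  sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L))), defaultValue = 0)

// outerJoinVertices: store the out-degree in every vertex.
// Vertex 3 has no outgoing edge, so it gets None and falls back to 0.
val withOutDegree: Graph[Int, Int] =
  graph.outerJoinVertices(graph.outDegrees) {
    (id, oldAttr, outDegOpt) => outDegOpt.getOrElse(0)
  }

// joinVertices: only the vertices present in the table are updated.
val bonus = sc.parallelize(Seq((1L, 10)))
val withBonus = graph.joinVertices(bonus)((id, attr, b) => attr + b)

val outDegreeMap = withOutDegree.vertices.collect().toMap
val bonusMap = withBonus.vertices.collect().toMap
println(outDegreeMap)
println(bonusMap)
```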

Aggregate Messages
• Connected components algorithm: labels each connected
component of the graph with the ID of its lowest-numbered vertex.
GraphX contains an implementation of the algorithm in the
ConnectedComponents object.

• Aggregation operator: used in computing the shortest
path to a source, the smallest reachable vertex id, connected
components, and PageRank.
• Pregel operator: executes in a series of supersteps in
which vertices receive the sum of their inbound messages from the
previous superstep, compute a new value for the vertex property, and
then send messages to neighboring vertices in the next superstep.
PageRank Algorithm
• PageRank measures the importance of each vertex in a graph,
assuming an edge from u to v represents an endorsement of v's
importance by u. For example, if a Twitter user is followed by many
others, the user will be ranked highly.

• GraphX also includes an example social network dataset that we can
run PageRank on. A set of users is given in data/graphx/users.txt,
and a set of relationships between users is given
in data/graphx/followers.txt. We compute the PageRank of each user
as follows:

import org.apache.spark.graphx.GraphLoader
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))

Connected component
import org.apache.spark.graphx.GraphLoader
// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
// Find the connected components
val cc = graph.connectedComponents().vertices
// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}
// Print the result
println(ccByUsername.collect().mkString("\n"))

Triangle Counting
• GraphX implements a triangle counting algorithm in the
TriangleCount object that determines the number of triangles
passing through each vertex, providing a measure of clustering.

• TriangleCount requires the edges to be in canonical orientation
(srcId < dstId) and the graph to be partitioned using
Graph.partitionBy.

Triangle Counting
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
// Load the edges in canonical order and partition the graph for triangle count
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt", true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
// Find the triangle count for each vertex
val triCounts = graph.triangleCount().vertices
// Join the triangle counts with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val triCountByUsername = users.join(triCounts).map {
  case (id, (username, tc)) => (username, tc)
}
// Print the result
println(triCountByUsername.collect().mkString("\n"))
Resources

https://spark.apache.org/graphx/
