Spark-GraphX and Neo4j


Spark GraphX

GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX unifies the
ETL (Extract, Transform and Load) process, exploratory analysis and iterative graph
computation within a single system. Graphs appear in Facebook’s friends, LinkedIn’s
connections, the internet’s routers, relationships between galaxies and stars in
astrophysics, and Google Maps. Although the concept of graph computation is simple, its
applications are wide-ranging, with use cases in disaster detection, banking, stock
markets and geographical systems, to name a few.

In computer science, a graph is an abstract data type that is meant to implement the
undirected graph and directed graph concepts from mathematics, specifically the field of
graph theory. A graph data structure may also associate with each edge some edge value,
such as a symbolic label or a numeric attribute (cost, capacity, length, etc.).

GraphX also includes a growing collection of graph algorithms and builders to simplify
graph analytics tasks.

GraphX extends the Spark RDD with a Resilient Distributed Property Graph. The property
graph is a directed multigraph, so it can have multiple parallel edges between the same
pair of vertices. Every edge and every vertex has user-defined properties associated with
it. Parallel edges allow multiple relationships between the same vertices.
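
To make the multigraph idea concrete, here is a minimal sketch using plain Scala collections (no Spark required). The Edge case class, the vertex names and the edge labels are all made up for illustration; GraphX keeps the same kind of information in distributed RDDs rather than in local collections.

```scala
object MultigraphSketch {
  // Hypothetical stand-in for a property-graph edge: source, destination, attribute.
  final case class Edge(srcId: Long, dstId: Long, attr: String)

  // Assumed sample data: two parallel edges from vertex 1 to vertex 2.
  val vertices: Map[Long, String] = Map((1L, "alice"), (2L, "bob"))
  val edges: Seq[Edge] = Seq(
    Edge(1L, 2L, "follows"),
    Edge(1L, 2L, "colleague"),
    Edge(2L, 1L, "follows")
  )

  // Group edges by (source, destination) to expose the parallel ones.
  val relationships: Map[(Long, Long), Seq[String]] =
    edges.groupBy(e => (e.srcId, e.dstId)).map { case (k, es) => (k, es.map(_.attr)) }

  def main(args: Array[String]): Unit =
    relationships.foreach { case ((s, d), rs) =>
      println(s"${vertices(s)} -> ${vertices(d)}: ${rs.mkString(", ")}")
    }
}
```

The pair (1, 2) maps to two labels, which is exactly what a simple (non-multi) graph could not represent.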
Understanding GraphX with Examples
We will now understand the concepts of Spark GraphX using an example. Let us consider a
simple graph as shown in the image below.

Figure: Spark GraphX Tutorial – Graph Example

Looking at the graph, we can extract information about the people (vertices) and the
relations between them (edges). The graph here represents Twitter users and whom they
follow on Twitter. For example, Bob follows David and Alice on Twitter.

This means that an outgoing edge from A to B = A follows B.

The value on each edge states for how many years A has followed B.

NOW, CONSTRUCTING A GRAPH USING SCALA (SPARK USES SCALA)

// Importing the necessary classes
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators
Displaying Vertices: We will now display the names and ages of the users (vertices).
The snippets below assume that vertexArray and edgeArray were defined earlier; their
definitions are not shown here.

// Each vertex has an id (1, 2, ...) of type Long and a (name, age) property of
// type (String, Int).
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)

// Each edge carries one Int attribute.
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

// Construct the graph from the vertices and edges.
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Keep only the people whose age is higher than 30 and print them.
graph.vertices.filter { case (id, (name, age)) => age > 30 }
  .collect.foreach { case (id, (name, age)) => println(s"$name is $age") }

Output is:
David is 42
Fran is 50
Ed is 55
Charlie is 65

Displaying Edges: Let us look at which person likes whom on Twitter.

for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
The output for the above code is as below:

Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran

Here, "likes" means that this vertex (person) has an outgoing edge towards the other
vertex (person), i.e. the first person follows the second.
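
The vertexArray and edgeArray behind these examples are not shown in these notes. The following plain-Scala sketch (no Spark required) shows data that would be consistent with both outputs above. Only the names, the four ages above 30 and the follower pairs come from those outputs; the vertex ids, the ages of Alice and Bob, and every edge value are assumptions, and the Edge case class is a local stand-in for the GraphX one.

```scala
object TwitterGraphSketch {
  // Local stand-in for org.apache.spark.graphx.Edge, so no Spark is needed.
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

  // (vertex id, (name, age)). ASSUMPTION: the ids and the ages of Alice and Bob.
  val vertexArray: Array[(Long, (String, Int))] = Array(
    (1L, ("Alice", 28)),
    (2L, ("Bob", 27)),
    (3L, ("Charlie", 65)),
    (4L, ("David", 42)),
    (5L, ("Ed", 55)),
    (6L, ("Fran", 50))
  )

  // Edge(follower, followed, years). ASSUMPTION: every "years" value.
  val edgeArray: Array[Edge[Int]] = Array(
    Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3),
    Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3)
  )

  // Same predicate as the vertex filter: keep users older than 30.
  val olderThan30: Array[String] = vertexArray.collect {
    case (_, (name, age)) if age > 30 => s"$name is $age"
  }

  // Same idea as the triplet loop: look up both endpoints of every edge.
  private val nameOf: Map[Long, String] =
    vertexArray.map { case (id, (name, _)) => (id, name) }.toMap
  val likes: Array[String] =
    edgeArray.map(e => s"${nameOf(e.srcId)} likes ${nameOf(e.dstId)}")

  def main(args: Array[String]): Unit = {
    olderThan30.foreach(println)
    likes.foreach(println)
  }
}
```

On a local array the rows come out in array order; with Spark the order of collected rows depends on partitioning, which is why the filter output above is not sorted by id.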

Number of followers: Every user in our graph has a different number of followers. Let us
look at all the followers for every user.

// Defining a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Creating a user graph. Initially every user has in-degree 0 and out-degree 0,
// i.e. no incoming or outgoing edges are recorded yet.
val initialUserGraph: Graph[User, Int] =
  graph.mapVertices { case (id, (name, age)) => User(name, age, 0, 0) }

// Filling in the degree information: first join the in-degrees, then the out-degrees
val userGraph = initialUserGraph.outerJoinVertices(initialUserGraph.inDegrees) {
  case (id, u, inDegOpt) => User(u.name, u.age, inDegOpt.getOrElse(0), u.outDeg)
}.outerJoinVertices(initialUserGraph.outDegrees) {
  case (id, u, outDegOpt) => User(u.name, u.age, u.inDeg, outDegOpt.getOrElse(0))
}

JoinVertices:
The joinVertices operator joins the vertices with the input RDD and
returns a new graph with the vertex properties obtained by applying the
user defined map function to the result of the joined vertices. Vertices
without a matching value in the RDD retain their original value.
OuterJoinVertices:
The more general outerJoinVertices behaves similarly to joinVertices
except that the user defined map function is applied to all vertices and
can change the vertex property type. Because not all vertices may have
a matching value in the input RDD the map function takes an Option
type.

GetOrElse() method:
getOrElse is a method on Scala's Option type. If the Option holds a value (Some), it
returns that value; if the Option is empty (None), it returns the default value passed
as its argument. In the code above, inDegOpt.getOrElse(0) yields the vertex's in-degree,
or 0 for a vertex that has no entry in the degree RDD.
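
The difference between the two join operators, and the role of getOrElse, can be sketched with plain Scala maps (no Spark required). The helper names joinLike and outerJoinLike are hypothetical and only mimic the behaviour described above; they are not GraphX functions. The sample degrees are assumed values.

```scala
object JoinVerticesSketch {
  // Vertex properties: id -> (name, inDegree); the in-degree starts at 0.
  val vertices: Map[Long, (String, Int)] =
    Map((1L, ("Alice", 0)), (2L, ("Bob", 0)), (5L, ("Ed", 0)))

  // Assumed in-degree table; Ed (5) has no followers, so he has no entry.
  val inDegrees: Map[Long, Int] = Map((1L, 2), (2L, 2))

  // joinVertices-like: same property type; unmatched vertices keep their value.
  def joinLike[VD, U](vs: Map[Long, VD], table: Map[Long, U])(
      f: (Long, VD, U) => VD): Map[Long, VD] =
    vs.map { case (id, vd) => (id, table.get(id).map(u => f(id, vd, u)).getOrElse(vd)) }

  // outerJoinVertices-like: f sees every vertex (with an Option) and may
  // change the property type.
  def outerJoinLike[VD, U, VD2](vs: Map[Long, VD], table: Map[Long, U])(
      f: (Long, VD, Option[U]) => VD2): Map[Long, VD2] =
    vs.map { case (id, vd) => (id, f(id, vd, table.get(id))) }

  val joined: Map[Long, (String, Int)] =
    joinLike(vertices, inDegrees) { (_, v, deg) => (v._1, deg) }

  val described: Map[Long, String] =
    outerJoinLike(vertices, inDegrees) { (_, v, degOpt) =>
      s"${v._1} is followed by ${degOpt.getOrElse(0)} people"
    }

  def main(args: Array[String]): Unit = {
    joined.toSeq.sortBy(_._1).foreach(println)
    described.toSeq.sortBy(_._1).foreach { case (_, line) => println(line) }
  }
}
```

In the first call Ed keeps his original value because nothing matched; in the second the function is still called for Ed, receives None, and getOrElse(0) supplies the default.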

for ((id, property) <- userGraph.vertices.collect) {
  println(s"User $id is called ${property.name} and is liked by ${property.inDeg} people.")
}
Oldest Followers: We can also sort the followers by their characteristics. Let us find
the oldest follower of each user.
The property on the edges is the number of years a person has followed another person.

The mapReduceTriplets operator takes a user-defined map function, which is applied to
each triplet and can yield messages, and a user-defined reduce function, which aggregates
the messages arriving at each vertex. In later Spark versions this operator was replaced
by aggregateMessages, which appears in the operator summary further down.

// Finding the oldest follower for each user
val oldestFollower: VertexRDD[(String, Int)] =
  userGraph.mapReduceTriplets[(String, Int)](
    // For each edge, send a message to the destination vertex with the
    // attributes of the source vertex
    edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
    // To combine messages, take the message for the older follower
    (a, b) => if (a._2 > b._2) a else b
  )
The output for the above code is as below:
David is the oldest follower of Alice.

Charlie is the oldest follower of Bob.

Ed is the oldest follower of Charlie.

Bob is the oldest follower of David.

Ed does not have any followers.

Charlie is the oldest follower of Fran.
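
The map and reduce steps can be traced on local data with plain Scala collections (no Spark required). This is only a sketch of what the operator computes, not how GraphX executes it. The vertex ids and the ages of Alice and Bob are assumptions; the other ages and the follower pairs are taken from the outputs shown earlier.

```scala
object OldestFollowerSketch {
  final case class User(name: String, age: Int)

  // ASSUMPTION: vertex ids and the ages of Alice and Bob.
  val users: Map[Long, User] = Map(
    (1L, User("Alice", 28)), (2L, User("Bob", 27)), (3L, User("Charlie", 65)),
    (4L, User("David", 42)), (5L, User("Ed", 55)), (6L, User("Fran", 50))
  )

  // (follower id, followed id)
  val follows: Seq[(Long, Long)] = Seq(
    (2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L), (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L)
  )

  // Map step: each edge sends the follower's (name, age) to the followed vertex.
  val messages: Seq[(Long, (String, Int))] =
    follows.map { case (src, dst) => (dst, (users(src).name, users(src).age)) }

  // Reduce step: per destination, keep the message with the larger age.
  val oldestFollower: Map[Long, (String, Int)] =
    messages.groupBy(_._1).map { case (dst, msgs) =>
      (dst, msgs.map(_._2).reduce((a, b) => if (a._2 > b._2) a else b))
    }

  // Vertices that received no message have no followers.
  val report: Seq[String] = users.toSeq.sortBy(_._1).map { case (id, u) =>
    oldestFollower.get(id) match {
      case Some((name, _)) => s"$name is the oldest follower of ${u.name}."
      case None            => s"${u.name} does not have any followers."
    }
  }

  def main(args: Array[String]): Unit = report.foreach(println)
}
```

Ed receives no message at all, so he is absent from the reduced map; the "does not have any followers" line comes from joining the result back onto the full vertex set.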

Example Property Graph


Suppose we want to construct a property graph consisting of the various collaborators on
the GraphX project. The vertex property might contain the username and occupation. We
could annotate edges with a string describing the relationships between collaborators:

The resulting graph would have the type signature below. The vertex property is a
(String, String) pair (username, occupation) and the edge property is a String (the
relationship):

val userGraph: Graph[(String, String), String]

For example, vertex 3 with property (rxin, student) has an edge labelled Collaborator
that goes to vertex 7.

There are numerous ways to construct a property graph from raw files, RDDs, and even
synthetic generators and these are discussed in more detail in the section on graph
builders. Probably the most general method is to use the Graph object. For example the
following code constructs a graph from a collection of RDDs:

// Assume the SparkContext has already been constructed
val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                     (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                     Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)

In the above example we make use of the Edge case class. Edges have a srcId and a dstId
corresponding to the source and destination vertex identifiers. In addition, the Edge class
has an attr member which stores the edge property.
We can deconstruct a graph into the respective vertex and edge views by using the
graph.vertices and graph.edges members respectively.

val graph: Graph[(String, String), String] // Constructed from above

// Count all users which are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count

// Count all the edges where src > dst
graph.edges.filter(e => e.srcId > e.dstId).count

Note that graph.vertices returns a VertexRDD[(String, String)], which extends
RDD[(VertexId, (String, String))], and so we use the Scala case expression to deconstruct
the tuple.

On the other hand, graph.edges returns an EdgeRDD containing Edge[String] objects. We
could have also used the case class type constructor as in the following:

graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count

In addition to the vertex and edge views of the property graph, GraphX also exposes a
triplet view. The triplet view logically joins the vertex and edge properties yielding an
RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class. This join can be
expressed in the following SQL expression:

SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
ON e.srcId = src.Id AND e.dstId = dst.Id

or graphically as:


The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members
which contain the source and destination properties respectively. We can use the triplet
view of a graph to render a collection of strings describing relationships between users.
val graph: Graph[(String, String), String] // Constructed from above

// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
facts.collect.foreach(println(_))
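
The triplet join can be reproduced on the collaborator data with plain Scala collections (no Spark required). The Edge and EdgeTriplet case classes below are simplified local stand-ins for the GraphX classes, defined only for this sketch; the vertex and edge data are the ones from the construction example above.

```scala
object TripletViewSketch {
  // Simplified stand-ins for the GraphX classes of the same names.
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)
  final case class EdgeTriplet[VD, ED](
      srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  val users: Map[Long, (String, String)] = Map(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))
  )

  val relationships: Seq[Edge[String]] = Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")
  )

  // The "join": attach the source and destination properties to every edge.
  val triplets: Seq[EdgeTriplet[(String, String), String]] =
    relationships.map { e =>
      EdgeTriplet(e.srcId, users(e.srcId), e.dstId, users(e.dstId), e.attr)
    }

  val facts: Seq[String] =
    triplets.map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")

  def main(args: Array[String]): Unit = facts.foreach(println)
}
```

Each triplet is one row of the SQL join: one edge plus the attributes of both of its endpoints.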

We can compute the in-degree of each vertex (defined in GraphOps) by the following:

val graph: Graph[(String, String), String] // the graph constructed above

// Use the implicit GraphOps.inDegrees operator

val inDegrees: VertexRDD[Int] = graph.inDegrees

The reason for differentiating between core graph operations and GraphOps is to be able to
support different graph representations in the future.
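
What inDegrees computes can be sketched on the collaborator edges with plain Scala collections (no Spark required). This is a local illustration only; the object and value names are made up for the sketch.

```scala
object InDegreeSketch {
  // (srcId, dstId) pairs of the collaborator graph built earlier.
  val edges: Seq[(Long, Long)] = Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L))

  // Count the edges arriving at each destination vertex.
  val inDegrees: Map[Long, Int] =
    edges.groupBy(_._2).map { case (dst, es) => (dst, es.size) }

  def main(args: Array[String]): Unit =
    inDegrees.toSeq.sortBy(_._1).foreach { case (id, deg) =>
      println(s"vertex $id has in-degree $deg")
    }
}
```

Vertex 2 has no incoming edge and therefore no entry in the result. GraphX's inDegrees likewise omits such vertices, which is why the earlier user-graph example combines outerJoinVertices with getOrElse(0).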

Summary Operators:

/** Summary of the functionality in the property graph */
class Graph[VD, ED] {
  // Information about the Graph =========================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]

  // Views of the graph as collections ===================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]

  // Functions for caching graphs ========================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = false): Graph[VD, ED]

  // Change the partitioning heuristic ===================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]

  // Transform vertex and edge attributes ================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]

  // Modify the graph structure ==========================================
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]

  // Join RDDs with the graph ============================================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]

  // Aggregate information about adjacent triplets =======================
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]

  // Iterative graph-parallel computation ================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]

  // Basic graph algorithms ==============================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}
Graph Algorithms
GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are
contained in the org.apache.spark.graphx.lib package and can be accessed directly as
methods on Graph via GraphOps. This section describes the algorithms and how they are
used.

PageRank
PageRank measures the importance of each vertex in a graph, assuming an edge from u to
v represents an endorsement of v’s importance by u. For example, if a Twitter user is
followed by many others, the user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the
PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic
PageRank runs until the ranks converge (i.e., stop changing by more than a specified
tolerance). GraphOps allows calling these algorithms directly as methods on Graph.

GraphX also includes an example social network dataset that we can run PageRank on. A
set of users is given in data/graphx/users.txt, and a set of relationships between users is
given in data/graphx/followers.txt. We compute the PageRank of each user as follows:

import org.apache.spark.graphx.GraphLoader

// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}

// Print the result
println(ranksByUsername.collect().mkString("\n"))

Find full example code at
"examples/src/main/scala/org/apache/spark/examples/graphx/PageRankExample.scala" in the
Spark repo.
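
To see what the tolerance-based ("dynamic") variant does, here is a minimal power-iteration sketch in plain Scala (no Spark required). It uses the textbook formulation in which ranks sum to 1; GraphX's own implementation is distributed and may scale the ranks differently, so the numbers are not meant to match graph.pageRank. The four-vertex graph is made up for illustration, and the sketch assumes every vertex has at least one outgoing edge.

```scala
import scala.annotation.tailrec

object PageRankSketch {
  // Hypothetical graph: (src, dst) means src endorses dst.
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 1L))
  val vertices: Seq[Long] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDegree: Map[Long, Int] =
    edges.groupBy(_._1).map { case (s, es) => (s, es.size) }

  // One iteration: every vertex splits its rank evenly among its out-edges.
  def step(ranks: Map[Long, Double], resetProb: Double): Map[Long, Double] = {
    val incoming: Map[Long, Double] = edges
      .map { case (s, d) => (d, ranks(s) / outDegree(s)) }
      .groupBy(_._1)
      .map { case (d, cs) => (d, cs.map(_._2).sum) }
    vertices.map { v =>
      (v, resetProb / vertices.size + (1 - resetProb) * incoming.getOrElse(v, 0.0))
    }.toMap
  }

  // Repeat until no rank changes by more than tol.
  @tailrec
  def run(ranks: Map[Long, Double], tol: Double, resetProb: Double): Map[Long, Double] = {
    val next = step(ranks, resetProb)
    val delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
    if (delta <= tol) next else run(next, tol, resetProb)
  }

  val ranks: Map[Long, Double] =
    run(vertices.map(v => (v, 1.0 / vertices.size)).toMap, tol = 0.0001, resetProb = 0.15)

  def main(args: Array[String]): Unit =
    ranks.toSeq.sortBy(-_._2).foreach { case (v, r) => println(f"$v%d -> $r%.4f") }
}
```

Vertex 1 is endorsed by two vertices and ends up with the highest rank, while vertex 4, which nobody endorses, keeps only the reset share.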

Connected Components
The connected components algorithm labels each connected component of the graph with
the ID of its lowest-numbered vertex. For example, in a social network, connected
components can approximate clusters. GraphX contains an implementation of the
algorithm in the ConnectedComponents object, and we compute the connected
components of the example social network dataset from the PageRank section as follows:

import org.apache.spark.graphx.GraphLoader

// Load the graph as in the PageRank example
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

// Find the connected components
val cc = graph.connectedComponents().vertices

// Join the connected components with the usernames
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ccByUsername = users.join(cc).map {
  case (id, (username, cc)) => (username, cc)
}

// Print the result
println(ccByUsername.collect().mkString("\n"))
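
The labelling rule (every vertex ends up with the lowest vertex id in its component) can be sketched as repeated label propagation over plain Scala collections (no Spark required). This illustrates the idea only and is not GraphX's implementation; the small graph, including the isolated vertex 6, is made up.

```scala
import scala.annotation.tailrec

object ConnectedComponentsSketch {
  // Hypothetical graph with two components plus one isolated vertex (6).
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (4L, 5L))
  val vertices: Seq[Long] = (edges.flatMap { case (s, d) => Seq(s, d) } :+ 6L).distinct

  // Every vertex starts with its own id as label. Each pass copies the smaller
  // label across every edge (direction is ignored) until nothing changes.
  @tailrec
  def propagate(labels: Map[Long, Long]): Map[Long, Long] = {
    val next = edges.foldLeft(labels) { case (ls, (s, d)) =>
      val smallest = math.min(ls(s), ls(d))
      ls.updated(s, smallest).updated(d, smallest)
    }
    if (next == labels) labels else propagate(next)
  }

  val components: Map[Long, Long] = propagate(vertices.map(v => (v, v)).toMap)

  def main(args: Array[String]): Unit =
    components.toSeq.sortBy(_._1).foreach { case (v, c) =>
      println(s"vertex $v is in component $c")
    }
}
```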
Neo4j
To learn Neo4j’s basic theory use this link:

https://www.tutorialspoint.com/neo4j/neo4j_cql_introduction.htm

Neo4j theory: see the instructor’s PDF.

CRUD operations on movie database:

Neo4j ships with a built-in movie database example.

The built-in guide walks through the CRUD operations. Once you have Neo4j up and
running, check out the tutorial.

CREATE
This creates a whole graph database with lots of nodes and relationships.

CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})

This is a Movie node with 3 properties attached to it.

CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})

CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})

CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})

CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})

CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})


CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})

CREATE (JoelS:Person {name:'Joel Silver', born:1952})

Each of these is a Person node with a name and a year of birth.

CREATE
// ACTED_IN relationships for the actors, each with a roles property
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
// DIRECTED relationships for the directors and a PRODUCED relationship for the producer
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix)

This code shows how to create relationships among the nodes.

The nodes and relationships for the other movies are created in the same manner.

The created graph looks like this:


READ OPERATION

MATCH
Describe a data pattern

The MATCH clause describes a pattern of graph data. Neo4j will collect all paths within the graph
which match this pattern. This is often used with WHERE to filter the collection.

The MATCH describes the structure, and WHERE specifies the content of a query.

MATCH (director:Person)-[:DIRECTED]->(movie)
WHERE director.name = "Steven Spielberg"
RETURN movie.title

Find all the many fine films directed by Steven Spielberg.

1. Find the actor named “Tom Hanks”

MATCH (tom {name: "Tom Hanks"}) RETURN tom


Output:

Find 10 people:

MATCH (people:Person) RETURN people.name LIMIT 10

Output in Text form:

╒════════════════════╕
│"people.name"       │
╞════════════════════╡
│"Keanu Reeves"      │
├────────────────────┤
│"Carrie-Anne Moss"  │
├────────────────────┤
│"Laurence Fishburne"│
├────────────────────┤
│"Hugo Weaving"      │
├────────────────────┤
│"Lilly Wachowski"   │
├────────────────────┤
│"Lana Wachowski"    │
├────────────────────┤
│"Joel Silver"       │
├────────────────────┤
│"Emil Eifrem"       │
├────────────────────┤
│"Charlize Theron"   │
├────────────────────┤
│"Al Pacino"         │
└────────────────────┘

To get movies released in the 1990s:

MATCH (nineties:Movie) WHERE nineties.released >= 1990 AND nineties.released < 2000
RETURN nineties.title

Here, nineties is a variable bound to each node with the Movie label, and
nineties.released and nineties.title are properties of that node. MATCH binds the
variable to every node that fits the pattern, and RETURN selects which properties to
output.

2. QUERIES
List all Tom Hanks movies...
MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(tomHanksMovies)
RETURN tom,tomHanksMovies

OUTPUT:

This output contains five copies of the graph because the CREATE query was run five
times. CREATE does not check for existing data, so every run adds the same nodes and
relationships to the graph again.

Who directed "Cloud Atlas"?

MATCH (cloudAtlas {title: "Cloud Atlas"})<-[:DIRECTED]-(directors) RETURN directors.name

OUTPUT:

╒═════════════════╕
│"directors.name" │
╞═════════════════╡
│"Tom Tykwer"     │
├─────────────────┤
│"Lana Wachowski" │
├─────────────────┤
│"Lilly Wachowski"│
├─────────────────┤
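
As a rough analogy for readers coming from the Scala half of these notes, the split between structure (MATCH) and content (WHERE) resembles a pattern match with a guard over a list of relationships. The sketch below uses plain Scala collections; the Rel case class is hypothetical, and this is not how Neo4j stores or evaluates anything. The sample rows are taken from the CREATE statements and the output above.

```scala
object MatchWhereSketch {
  // Hypothetical flat representation of a relationship: (from)-[relType]->(to).
  final case class Rel(from: String, relType: String, to: String)

  val rels: Seq[Rel] = Seq(
    Rel("Lilly Wachowski", "DIRECTED", "The Matrix"),
    Rel("Lana Wachowski", "DIRECTED", "The Matrix"),
    Rel("Keanu Reeves", "ACTED_IN", "The Matrix"),
    Rel("Tom Tykwer", "DIRECTED", "Cloud Atlas"),
    Rel("Lana Wachowski", "DIRECTED", "Cloud Atlas"),
    Rel("Lilly Wachowski", "DIRECTED", "Cloud Atlas")
  )

  // The pattern plays the role of MATCH (structure: a DIRECTED relationship);
  // the guard plays the role of WHERE (content: the movie title).
  val directors: Seq[String] = rels.collect {
    case Rel(director, "DIRECTED", movie) if movie == "Cloud Atlas" => director
  }

  def main(args: Array[String]): Unit = directors.foreach(println)
}
```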

Tom Hanks' co-actors...

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors) RETURN coActors.name

Tom Hanks has an ACTED_IN relationship with each movie he acted in, and other actors
have the same relationship with those movies. The pattern therefore returns the names
of everyone who acted in a movie together with Tom Hanks.

OUTPUT:

╒════════════════════════╕
│"coActors.name"         │
╞════════════════════════╡
│"Ed Harris"             │
├────────────────────────┤
│"Gary Sinise"           │
├────────────────────────┤
│"Kevin Bacon"           │
├────────────────────────┤
│"Bill Paxton"           │
├────────────────────────┤
│"Parker Posey"          │
├────────────────────────┤
│"Greg Kinnear"          │
├────────────────────────┤
│"Meg Ryan"              │
├────────────────────────┤
│"Steve Zahn"            │
├────────────────────────┤
│"Dave Chappelle"        │
├────────────────────────┤
│"Madonna"               │
├────────────────────────┤
│"Rosie O'Donnell"       │
├────────────────────────┤
│"Geena Davis"           │
├────────────────────────┤
│"Bill Paxton"           │
├────────────────────────┤
│"Lori Petty"            │
├────────────────────────┤
│"Nathan Lane"           │
├────────────────────────┤
│"Meg Ryan"              │
├────────────────────────┤
│"Liv Tyler"             │
├────────────────────────┤
│"Charlize Theron"       │
├────────────────────────┤
│"Ian McKellen"          │
├────────────────────────┤
│"Audrey Tautou"         │
├────────────────────────┤
│"Paul Bettany"          │
├────────────────────────┤
│"Jim Broadbent"         │
├────────────────────────┤
│"Halle Berry"           │
├────────────────────────┤
│"Hugo Weaving"          │
├────────────────────────┤
│"Helen Hunt"            │
├────────────────────────┤
│"Sam Rockwell"          │
├────────────────────────┤
│"Bonnie Hunt"           │
├────────────────────────┤
│"Patricia Clarkson"     │
├────────────────────────┤
│"James Cromwell"        │
├────────────────────────┤
│"Michael Clarke Duncan" │
├────────────────────────┤
│"David Morse"           │
├────────────────────────┤
│"Gary Sinise"           │
├────────────────────────┤
│"Meg Ryan"              │
├────────────────────────┤
│"Victor Garber"         │
├────────────────────────┤
│"Bill Pullman"          │
├────────────────────────┤
│"Rita Wilson"           │
├────────────────────────┤
│"Rosie O'Donnell"       │
├────────────────────────┤
│"Julia Roberts"         │
├────────────────────────┤
│"Philip Seymour Hoffman"│
├────────────────────────┤
│"Steve Zahn"            │
├────────────────────────┤
│"Meg Ryan"              │
├────────────────────────┤
│"Greg Kinnear"          │
├────────────────────────┤
│"Dave Chappelle"        │
├────────────────────────┤
│"Parker Posey"          │
├────────────────────────┤
│"Geena Davis"           │
├────────────────────────┤
│"Lori Petty"            │
├────────────────────────┤
│"Madonna"               │
├────────────────────────┤
│"Bill Paxton"           │
├────────────────────────┤
│"Rosie O'Donnell"       │
├────────────────────────┤
│"Paul Bettany"          │
├────────────────────────┤
│"Ian McKellen"          │
├────────────────────────┤
│"Audrey Tautou"         │

How people are related to "Cloud Atlas"...

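MATCH (people:Person)-[relatedTo]-(:Movie {title: "Cloud Atlas"}) RETURN people.name, Type(relatedTo), relatedTo

Because the relationship pattern has no direction and no type, this matches every
person connected to the movie “Cloud Atlas” by any relationship, incoming or outgoing,
and returns the person's name together with the relationship type and the relationship
itself.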

OUTPUT:

3. SOLVE

Movies and Actors up to 4 “hops” away from “Kevin Bacon”

MATCH (bacon:Person {name:"Kevin Bacon"})-[*1..4]-(hollywood)
RETURN DISTINCT hollywood

Bacon path, the shortest path of any relationships to Meg Ryan

MATCH p=shortestPath((bacon:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"}))
RETURN p
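
Conceptually, a shortest path over unweighted, undirected relationships can be found with a breadth-first search. The sketch below does this over plain Scala collections to show the kind of result to expect; it is not how Neo4j implements shortestPath. The person and movie pairs are a small hand-picked subset assumed for illustration, not the full movie database.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.Queue

object ShortestPathSketch {
  // ASSUMPTION: a tiny subset of (person, movie) ACTED_IN pairs.
  val actedIn: Seq[(String, String)] = Seq(
    ("Kevin Bacon", "Apollo 13"),
    ("Tom Hanks", "Apollo 13"),
    ("Bill Paxton", "Apollo 13"),
    ("Tom Hanks", "Sleepless in Seattle"),
    ("Meg Ryan", "Sleepless in Seattle")
  )

  // Treat relationships as undirected, like the -[*]- pattern does.
  val neighbours: Map[String, Seq[String]] =
    (actedIn ++ actedIn.map(_.swap)).groupBy(_._1).map { case (n, ps) => (n, ps.map(_._2)) }

  // Breadth-first search; each queue entry is a path stored newest-node-first.
  def shortestPath(from: String, to: String): Option[List[String]] = {
    @tailrec
    def loop(queue: Queue[List[String]], seen: Set[String]): Option[List[String]] =
      queue.dequeueOption match {
        case None => None
        case Some((path, rest)) =>
          val node = path.head
          if (node == to) Some(path.reverse)
          else {
            val fresh = neighbours.getOrElse(node, Seq.empty).filterNot(seen)
            loop(rest ++ fresh.map(_ :: path), seen ++ fresh)
          }
      }
    loop(Queue(List(from)), Set(from))
  }

  val path: Option[List[String]] = shortestPath("Kevin Bacon", "Meg Ryan")

  def main(args: Array[String]): Unit =
    println(path.map(_.mkString(" - ")).getOrElse("no path"))
}
```

In this sample the path alternates between people and movies and spans four relationships, so Meg Ryan would also fall inside the [*1..4] pattern of the previous query.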
