Spark-GraphX and Neo4j
GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX unifies
the ETL (Extract, Transform and Load) process, exploratory analysis and iterative graph
computation within a single system. Graphs appear in Facebook’s friend networks,
LinkedIn’s connections, internet routers, the relationships between galaxies and stars in
astrophysics, and Google Maps. Although the concept of graph computation is
simple, its applications are wide-ranging, with use cases in disaster
detection, banking, the stock market and geographical systems, to name a few.
In computer science, a graph is an abstract data type that implements the
undirected graph and directed graph concepts from mathematics, specifically the field of
graph theory. A graph data structure may also associate a value with each edge, such
as a symbolic label or a numeric attribute (cost, capacity, length, etc.).
GraphX also includes a growing collection of graph algorithms and builders to simplify
graph analytics tasks.
GraphX extends the Spark RDD with a Resilient Distributed Property Graph. The property
graph is a directed multigraph, which means it can have multiple parallel edges. Every edge
and vertex has user-defined properties associated with it. The parallel edges allow multiple
relationships between the same vertices.
Understanding GraphX with Examples
We will now look at the concepts of Spark GraphX using an example. Let us consider a
simple graph as shown in the image below.
Looking at the graph, we can extract information about the people (vertices) and the
relations between them (edges). The graph represents Twitter users and whom
they follow on Twitter. For example, Bob follows David and Alice on Twitter.
The value on each edge states for how many years A has followed B.
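A minimal sketch of how such a graph could be built and filtered in GraphX. Only the user names, the four ages that appear in the output, and the fact that Bob follows David and Alice come from the text. The vertex IDs, the ages of Alice and Bob, the remaining edges, the edge values, the local Spark settings and the age > 30 filter are assumptions made for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("graphx-sketch").getOrCreate()
val sc = spark.sparkContext

// Vertices: (id, (name, age)). IDs and the ages of Alice and Bob are placeholders.
val vertexRDD: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
  (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))))

// Edges: Edge(follower, followed, years). Only Bob's two edges are taken from the text.
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(5L, 6L, 3)))

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Keep only the users older than 30 and format them as "<name> is <age>"
val olderThan30 = graph.vertices
  .filter { case (_, (_, age)) => age > 30 }
  .map { case (_, (name, age)) => s"$name is $age" }
  .collect()

olderThan30.foreach(println)
```

The order in which collect() returns the rows is not guaranteed, so the lines may print in a different order from run to run.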
Output is:
David is 42
Fran is 50
Ed is 55
Charlie is 65
Here, “likes” means that this vertex (person) has an outgoing edge towards the other
vertex (person).
Number of followers: Every user in our graph has a different number of followers. Let us
look at all the followers for every user.
JoinVertices:
The joinVertices operator joins the vertices with the input RDD and
returns a new graph with the vertex properties obtained by applying the
user defined map function to the result of the joined vertices. Vertices
without a matching value in the RDD retain their original value.
OuterJoinVertices:
The more general outerJoinVertices behaves similarly to joinVertices
except that the user defined map function is applied to all vertices and
can change the vertex property type. Because not all vertices may have
a matching value in the input RDD the map function takes an Option
type.
GetOrElse() method:
Option.getOrElse returns the value wrapped in the
Option when one is present (Some) and the supplied
default value when the Option is empty (None). It is
a convenient way to unwrap the Option that
outerJoinVertices passes to the map function.
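The three pieces above can be combined to count followers. The following is a small sketch, not a definitive implementation: the three-user graph, the vertex IDs and the local Spark settings are hypothetical. It joins the in-degrees onto the graph with outerJoinVertices and uses getOrElse(0) for users who have no incoming edge.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical follower graph: the vertex property is the user name
val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val follows = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(3L, 2L, 1)))
val graph = Graph(users, follows)

// inDegrees has no entry for vertices without incoming edges (Charlie here)
val inDegrees = graph.inDegrees

// outerJoinVertices visits every vertex; the Option is None where inDegrees has no entry.
// The vertex property type changes from String to (String, Int).
val withFollowers: Graph[(String, Int), Int] =
  graph.outerJoinVertices(inDegrees) { (_, name, degOpt) => (name, degOpt.getOrElse(0)) }

val followerCounts = withFollowers.vertices
  .map { case (_, nameAndCount) => nameAndCount }
  .collect()
  .toMap

followerCounts.foreach { case (name, n) => println(s"$name has $n followers") }
```

joinVertices could not be used for this step, because it must keep the vertex property type and it skips vertices that have no match in the input RDD.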
There are numerous ways to construct a property graph from raw files, RDDs, and even
synthetic generators, and these are discussed in more detail in the section on graph
builders. Probably the most general method is to use the Graph object. For example, the
following code constructs a graph from a collection of RDDs:
// Define a default user in case there are relationships with a missing user
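A self-contained sketch of such a construction, modeled on the property-graph example in the GraphX programming guide. The user names, roles and relationship labels are illustrative values, and the local Spark settings are assumptions.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("property-graph").getOrCreate()
val sc = spark.sparkContext

// Vertices: (id, (username, role)); illustrative values
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Edges: Edge(srcId, dstId, attr), where attr is the relationship label
val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Default property for any vertex that an edge mentions but the vertex RDD lacks
val defaultUser = ("John Doe", "Missing")

// Build the property graph
val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)

println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")
```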
In the above example we make use of the Edge case class. Edges have a srcId and a dstId
corresponding to the source and destination vertex identifiers. In addition, the Edge class
has an attr member which stores the edge property.
We can deconstruct a graph into the respective vertex and edge views by using the
graph.vertices and graph.edges members respectively.
graph.edges.filter {case Edge(src, dst, prop) => src > dst }.count
GraphX also exposes a triplet view. The triplet view logically joins the vertex and edge
properties yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet
class. This join can be expressed in the following SQL expression:
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
ON e.srcId = src.Id AND e.dstId = dst.Id
or graphically as:
The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members
which contain the source and destination properties respectively. We can use the triplet
view of a graph to render a collection of strings describing relationships between users.
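val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] = graph.triplets.map(triplet =>
  triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)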
We can compute the in-degree of each vertex (defined in GraphOps) by the following:
val inDegrees: VertexRDD[Int] = graph.inDegrees
The reason for differentiating between core graph operations and GraphOps is to be able to
support different graph representations in the future.
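Summary Operators (abridged, simplified signatures):
class Graph[VD, ED] {
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2)
    : Graph[VD, ED2]
  def subgraph(
      epred: EdgeTriplet[VD, ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
  def aggregateMessages[A](sendMsg: EdgeContext[VD, ED, A] => Unit, mergeMsg: (A, A) => A)
    : VertexRDD[A]
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], mergeMsg: (A, A) => A)
    : Graph[VD, ED]
}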
Graph Algorithms
GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are
contained in the org.apache.spark.graphx.lib package and can be accessed directly as
methods on Graph via GraphOps. This section describes the algorithms and how they are
used.
PageRank
PageRank measures the importance of each vertex in a graph, assuming an edge from u to
v represents an endorsement of v’s importance by u. For example, if a Twitter user is
followed by many others, the user will be ranked highly.
GraphX comes with static and dynamic implementations of PageRank as methods on the
PageRank object. Static PageRank runs for a fixed number of iterations, while dynamic
PageRank runs until the ranks converge (i.e., stop changing by more than a specified
tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
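The difference between the two variants can be sketched on a tiny graph. The star-shaped follower graph, the iteration count, the tolerance and the local Spark settings below are assumptions chosen for illustration. staticPageRank and pageRank are the GraphOps methods for the static and dynamic implementations.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("pagerank-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical follower graph: users 2, 3 and 4 all follow user 1
val edges = sc.parallelize(Seq(Edge(2L, 1L, 1), Edge(3L, 1L, 1), Edge(4L, 1L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Static PageRank: runs for exactly 10 iterations
val staticRanks = graph.staticPageRank(numIter = 10).vertices.collect().toMap

// Dynamic PageRank: iterates until ranks change by no more than the tolerance
val dynamicRanks = graph.pageRank(tol = 0.0001).vertices.collect().toMap

// The user with the most followers should receive the highest rank in both runs
val topStatic = staticRanks.maxBy(_._2)._1
val topDynamic = dynamicRanks.maxBy(_._2)._1
println(s"Highest-ranked vertex: static=$topStatic, dynamic=$topDynamic")
```

The absolute rank values can differ between the two variants and between Spark versions, so the sketch compares only the ordering.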
GraphX also includes an example social network dataset that we can run PageRank on. A
set of users is given in data/graphx/users.txt, and a set of relationships between users is
given in data/graphx/followers.txt. We compute the PageRank of each user as follows:
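import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt") // sc: an existing SparkContext
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(","); (fields(0).toLong, fields(1)) }
val ranksByUsername = users.join(ranks).map { case (id, (username, rank)) => (username, rank) }
println(ranksByUsername.collect().mkString("\n"))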
Connected Components
The connected components algorithm labels each connected component of the graph with
the ID of its lowest-numbered vertex. For example, in a social network, connected
components can approximate clusters. GraphX contains an implementation of the
algorithm in the ConnectedComponents object, and we compute the connected
components of the example social network dataset from the PageRank section as follows:
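import org.apache.spark.graphx.GraphLoader
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt") // sc: an existing SparkContext
// Find the connected components
val cc = graph.connectedComponents().vertices
val users = sc.textFile("data/graphx/users.txt").map { line =>
  val fields = line.split(","); (fields(0).toLong, fields(1)) }
val ccByUsername = users.join(cc).map { case (id, (username, cc)) => (username, cc) }
println(ccByUsername.collect().mkString("\n"))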
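To see the labeling rule on data that does not depend on the example files, here is a small sketch with two hypothetical clusters. The edge list and the local Spark settings are assumptions. Every vertex should end up labeled with the lowest vertex ID in its component, regardless of edge direction.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Local session for experimentation; master and app name are assumptions
val spark = SparkSession.builder().master("local[*]").appName("cc-sketch").getOrCreate()
val sc = spark.sparkContext

// Two hypothetical clusters: {1, 2, 3} and {4, 5}
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(5L, 4L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Each vertex property becomes the lowest vertex ID in its component
val componentOf = graph.connectedComponents().vertices.collect().toMap

componentOf.toSeq.sorted.foreach { case (v, c) => println(s"vertex $v -> component $c") }
```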
Neo4j
To learn Neo4j’s basic theory, use this link:
https://www.tutorialspoint.com/neo4j/neo4j_cql_introduction.htm
Neo4j’s built-in guide walks through the CRUD operations. Once you have Neo4j up and
running, check out the tutorial.
CREATE
This creates a whole graph database with lots of nodes and relationships.
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix)
This statement shows how to create relationships between nodes.
The nodes and relationships for the other movies are created in the same manner.
MATCH
Describe a data pattern
The MATCH clause describes a pattern of graph data. Neo4j will collect all paths within the graph
which match this pattern. This is often used with WHERE to filter the collection.
The MATCH describes the structure, and WHERE specifies the content of a query.
╒════════════════════╕
│"people.name" │
╞════════════════════╡
│"Keanu Reeves" │
├────────────────────┤
│"Carrie-Anne Moss" │
├────────────────────┤
│"Laurence Fishburne"│
├────────────────────┤
│"Hugo Weaving" │
├────────────────────┤
│"Lilly Wachowski" │
├────────────────────┤
│"Lana Wachowski" │
├────────────────────┤
│"Joel Silver" │
├────────────────────┤
│"Emil Eifrem" │
├────────────────────┤
│"Charlize Theron" │
├────────────────────┤
│"Al Pacino" │
└────────────────────┘
MATCH (nineties:Movie) WHERE nineties.released >= 1990 AND nineties.released < 2000
RETURN nineties.title
2. QUERIES
List all Tom Hanks movies...
MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(tomHanksMovies)
RETURN tom,tomHanksMovies
OUTPUT:
This output contains five copies of the graph because the CREATE
query was run five times, and every run appends its data to the
graph.
OUTPUT:
╒═════════════════╕
│"directors.name" │
╞═════════════════╡
│"Tom Tykwer" │
├─────────────────┤
│"Lana Wachowski" │
├─────────────────┤
│"Lilly Wachowski"│
├─────────────────┤
OUTPUT:
╒════════════════════════╕
│"coActors.name" │
╞════════════════════════╡
│"Ed Harris" │
├────────────────────────┤
│"Gary Sinise" │
├────────────────────────┤
│"Kevin Bacon" │
├────────────────────────┤
│"Bill Paxton" │
├────────────────────────┤
│"Parker Posey" │
├────────────────────────┤
│"Greg Kinnear" │
├────────────────────────┤
│"Meg Ryan" │
├────────────────────────┤
│"Steve Zahn" │
├────────────────────────┤
│"Dave Chappelle" │
├────────────────────────┤
│"Madonna" │
├────────────────────────┤
│"Rosie O'Donnell" │
├────────────────────────┤
│"Geena Davis" │
├────────────────────────┤
│"Bill Paxton" │
├────────────────────────┤
│"Lori Petty" │
├────────────────────────┤
│"Nathan Lane" │
├────────────────────────┤
│"Meg Ryan" │
├────────────────────────┤
│"Liv Tyler" │
├────────────────────────┤
│"Charlize Theron" │
├────────────────────────┤
│"Ian McKellen" │
├────────────────────────┤
│"Audrey Tautou" │
├────────────────────────┤
│"Paul Bettany" │
├────────────────────────┤
│"Jim Broadbent" │
├────────────────────────┤
│"Halle Berry" │
├────────────────────────┤
│"Hugo Weaving" │
├────────────────────────┤
│"Helen Hunt" │
├────────────────────────┤
│"Sam Rockwell" │
├────────────────────────┤
│"Bonnie Hunt" │
├────────────────────────┤
│"Patricia Clarkson" │
├────────────────────────┤
│"James Cromwell" │
├────────────────────────┤
├────────────────────────┤
│"David Morse" │
├────────────────────────┤
│"Gary Sinise" │
├────────────────────────┤
│"Meg Ryan" │
├────────────────────────┤
│"Victor Garber" │
├────────────────────────┤
│"Bill Pullman" │
├────────────────────────┤
│"Rita Wilson" │
├────────────────────────┤
│"Rosie O'Donnell" │
├────────────────────────┤
│"Julia Roberts" │
├────────────────────────┤
├────────────────────────┤
│"Steve Zahn" │
├────────────────────────┤
│"Meg Ryan" │
├────────────────────────┤
│"Greg Kinnear" │
├────────────────────────┤
│"Dave Chappelle" │
├────────────────────────┤
│"Parker Posey" │
├────────────────────────┤
│"Geena Davis" │
├────────────────────────┤
│"Lori Petty" │
├────────────────────────┤
│"Madonna" │
├────────────────────────┤
│"Bill Paxton" │
├────────────────────────┤
│"Rosie O'Donnell" │
├────────────────────────┤
│"Paul Bettany" │
├────────────────────────┤
│"Ian McKellen" │
├────────────────────────┤
│"Audrey Tautou" │
This query matches every node connected to the movie “Cloud Atlas”, regardless of
the direction of the relationship.
OUTPUT:
3. SOLVE