Lab Distributed Big Data Analytics: Worksheet-3: Spark Graphx and Spark SQL Operations
In this lab we are going to perform basic Spark SQL and Spark GraphX operations
(described in “Spark Fundamentals II”). Spark SQL provides the ability to write SQL-like
queries that run on Spark. Its main abstraction is the SchemaRDD (renamed DataFrame since
Spark 1.3), which represents an RDD with an attached schema, on which you can run SQL,
HiveQL, and Scala operations.
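As a minimal sketch of this idea (the session, the sample data, and the query below are illustrative and not part of the worksheet), a DataFrame can be registered as a temporary view and then queried with SQL:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSQLExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small DataFrame registered as a temporary view
    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // SQL query executed by Spark SQL against the view
    spark.sql("SELECT name FROM people WHERE age > 30").show()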
GraphX is the new Spark API for graphs and graph-parallel computation. At a high level,
GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property
Graph: a directed multigraph with properties attached to each vertex and edge.
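As a minimal sketch (the vertex labels and edge relations are illustrative, and a SparkSession named spark is assumed to be in scope), a property graph is built from an RDD of vertices and an RDD of edges:

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices: (VertexId, property); here the property is a simple label
    val vertices = spark.sparkContext.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))

    // Edges: Edge(srcId, dstId, property); here the property is a relation name
    val edges = spark.sparkContext.parallelize(Seq(
      Edge(1L, 2L, "knows"), Edge(2L, 3L, "knows")))

    // The property graph combines both RDDs
    val graph = Graph(vertices, edges)
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")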
In this lab, you will use Spark SQL and GraphX to find the subject distribution over an .nt
(N-Triples) file. The purpose is to demonstrate how to use the Spark SQL and GraphX libraries
on Spark.
IN CLASS
import spark.implicits._
// Inspect the DataFrame of parsed triples
tripleDF.show()
//tripleDF.collect().foreach(println(_))
// Register the DataFrame as a temporary view so it can be queried with SQL
tripleDF.createOrReplaceTempView("triple")
// Print the triples returned by the SQL query over the view
triplerelatedtoEvents.collect().foreach(println(_))
}
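The excerpt above leaves tripleDF and triplerelatedtoEvents undefined. A minimal, self-contained sketch of how they might be produced — assuming a hypothetical Triple case class (defined at top level so Spark can derive an encoder), a simplified whitespace split as the parser, and an illustrative HDFS path and query — could look like:

    case class Triple(subject: String, predicate: String, obj: String)

    // Parse every non-comment line of the N-Triples file into a Triple (simplified split);
    // requires import spark.implicits._ (as above) for .toDF()
    val tripleDF = spark.sparkContext.textFile("hdfs:///yourname/page_links_simple.nt")
      .filter(line => !line.startsWith("#"))
      .map { line =>
        val parts = line.split("\\s+", 3)
        Triple(parts(0), parts(1), parts(2).stripSuffix(" ."))
      }
      .toDF()

    tripleDF.createOrReplaceTempView("triple")

    // Illustrative query; the exact query used in class is not shown in the excerpt
    val triplerelatedtoEvents = spark.sql("SELECT * FROM triple WHERE obj LIKE '%Event%'")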
---------------------------------------------------------------------------------------------------
2. Spark GraphX operations
a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and
uploaded to HDFS under the /yourname folder, create an RDD out of this file.
b. First, create a Scala case class Triple that holds the information of a triple read
from the file. Since the data is an .nt file whose rows are triples in the format
<subject> <predicate> <object>, you need to transform each line into this
representation. Hint: use the map function.
c. Use the filter transformation to return a new RDD with a subset of the triples in
the file, discarding every line that starts with “#”, which in an .nt file marks a
comment.
d. Perform these operations to transform your data into a GraphX graph (a sketch
of this step is given after the list):
i. Generate vertices by mapping each distinct subject and object to a
VertexId together with its value.
ii. Create edges by using the subject as a key to join with the vertices and
generate an Edge in the format (s_index, obj_index, predicate).
e. Compute connected components for the triples containing “author” as a
predicate.
f. Compute the triangle count.
g. List the top 5 connected components by applying PageRank over them.
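Before looking at the solution, a minimal sketch of step (d) — turning parsed triples into GraphX vertices and edges — might look as follows. The tripleRDD name and the Triple fields subject/predicate/obj are assumptions carried over from the earlier sketch, not part of the worksheet:

    import org.apache.spark.graphx.{Edge, Graph}

    // d.i: assign a unique VertexId to every distinct subject and object
    val vertices = tripleRDD
      .flatMap(t => Seq(t.subject, t.obj))
      .distinct()
      .zipWithIndex()                                  // (label, VertexId)
      .map { case (label, id) => (id, label) }

    // Lookup table label -> VertexId, used to build the edges
    val index = vertices.map { case (id, label) => (label, id) }

    // d.ii: join subjects and objects against the index to obtain (s_index, obj_index, predicate)
    val edges = tripleRDD
      .map(t => (t.subject, (t.predicate, t.obj)))
      .join(index)
      .map { case (_, ((predicate, obj), sIndex)) => (obj, (sIndex, predicate)) }
      .join(index)
      .map { case (_, ((sIndex, predicate), objIndex)) => Edge(sIndex, objIndex, predicate) }

    val graph = Graph(vertices, edges)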
------------------------------------------------------Solution----------------------------------------------------------
object graphxlab {
  def main(args: Array[String]) = {
    val input = "src/main/resources/rdf.nt" // args(0)
    // Read the .nt file and parse each line into a triple
    val tripleRDD = spark.sparkContext.textFile(input)
      .map(TripleUtils.parsTriples)
    type VertexId = Long

    // Print all vertices and edges of the graph
    graph.vertices.collect().foreach(println(_))
    println("edges")
    graph.edges.collect().foreach(println(_))

    // Connected components over the "author" subgraph
    val conncompo = subrealsourse.connectedComponents()

    // Join the PageRank scores with the vertex labels and sort by descending rank
    val printoutrankedtriples = pageranl.vertices.join(graph.vertices)
      .map({ case (k, (r, v)) => (k, r, v) })
      .sortBy(5 - _._2)

    println("printoutrankedtriples")
    printoutrankedtriples.take(5).foreach(println(_))
  }
}
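The solution excerpt above references several values that it never defines (graph, subrealsourse, pageranl). Under the assumption that graph was built as in the step-(d) sketch, with string predicates as edge attributes, the missing pieces might look like this:

    // Step (e): keep only edges whose predicate contains "author", then run connected components
    // (assumed definition of the subrealsourse used in the excerpt above)
    val subrealsourse = graph.subgraph(epred = triplet => triplet.attr.contains("author"))
    val conncompo = subrealsourse.connectedComponents()

    // Step (f): triangle count per vertex
    val triangleCounts = graph.triangleCount()
    triangleCounts.vertices.take(10).foreach(println(_))

    // Step (g): PageRank with a convergence tolerance (assumed definition of pageranl)
    val pageranl = graph.pageRank(0.0001)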
---------------------------------------------------------------------------------------------------
AT HOME