
COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF BONN

Lab Distributed Big Data Analytics


Worksheet-3: Spark GraphX and Spark SQL operations

Dr. Hajira Jabeen, Gezim Sejdiu, Prof. Dr. Jens Lehmann


May 9, 2017

In this lab we are going to perform basic Spark SQL and Spark GraphX operations (described in "Spark Fundamentals II"). Spark SQL provides the ability to write SQL-like queries that run on Spark. Its main abstraction is the SchemaRDD (called DataFrame in recent Spark versions), an RDD with an attached schema on which you can run SQL, HiveQL, and Scala operations.
GraphX is the Spark API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
In this lab, you will use Spark SQL and GraphX to compute the subject distribution over an N-Triples (.nt) file. The purpose is to demonstrate how to use the Spark SQL and GraphX libraries on Spark.

IN CLASS

1. Spark SQL operations


a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and uploaded to HDFS under the /yourname folder, create an RDD from it.
b. First create a Scala class Triple holding the information of a triple read from the file; it will be used as the schema. Since the data is an .nt file containing rows of triples in the format <subject> <predicate> <object>, the lines need to be transformed into this representation. Hint: use the map function.
c. Create an RDD of Triple objects.
d. Use the filter transformation to return a new RDD with a subset of the triples by discarding lines that start with "#", which in an .nt file represent comments.
e. Run SQL statements using the sql method provided by the SQLContext:
   i. Take all triples related to 'Category:Events'.
   ii. Take all triples with the predicate 'author'.
   iii. Take all triples authored by 'Andre_Engels'.
   iv. Count how many times each subject is used in the dataset.


f. Since the result is a SchemaRDD, all RDD operations work out of the box. Using the map function, collect and print out the subjects and their frequencies.
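
The Triple schema class and the TripleUtils.parsTriples parser referenced in the solutions are not shown on this worksheet; a minimal sketch (the parsing logic below is an assumption and ignores literals that contain spaces) could be:

// Simple container for one RDF triple; `object` is escaped because it is a Scala keyword
case class Triple(subject: String, predicate: String, `object`: String)

object TripleUtils {
  // Parse one N-Triples line of the form "<subject> <predicate> <object> ."
  def parsTriples(line: String): Triple = {
    val parts = line.split("\\s+", 3)
    // Drop the trailing " ." and the surrounding angle brackets of IRIs
    def strip(term: String): String =
      term.trim.stripSuffix(".").trim.stripPrefix("<").stripSuffix(">")
    Triple(strip(parts(0)), strip(parts(1)), strip(parts(2)))
  }
}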
------------------------------------------------------Solution----------------------------------------------------------
import org.apache.spark.sql.SparkSession

object sqlab {

  def main(args: Array[String]) = {

    val input = "src/main/resources/rdf.nt" // args(0)

    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .appName("SparkSQL example")
      .getOrCreate()

    import spark.implicits._

    // Read the .nt file, parse each line into a Triple and convert the RDD to a DataFrame
    val tripleDF = spark.sparkContext.textFile(input)
      .map(TripleUtils.parsTriples)
      .toDF()

    tripleDF.show()

    // tripleDF.collect().foreach(println(_))

    // Register the DataFrame as a temporary view so it can be queried with SQL
    tripleDF.createOrReplaceTempView("triple")

    // i. All triples related to 'Category:Events'
    val sqlText = "SELECT * FROM triple WHERE subject = 'http://commons.dbpedia.org/resource/Category:Events'"
    val triplerelatedtoEvents = spark.sql(sqlText)

    triplerelatedtoEvents.collect().foreach(println(_))

    // iv. Subject distribution: how many times each subject occurs in the dataset
    val subjectdistribution = spark.sql("SELECT subject, count(*) FROM triple GROUP BY subject")
    println("subjectdistribution:")
    subjectdistribution.collect().foreach(println(_))
  }
}
---------------------------------------------------------------------------------------------------
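
The sample solution only shows queries i and iv. Sketches for queries ii and iii are given below; the full 'author' and 'Andre_Engels' IRIs used here are assumptions and should be adapted to the dataset at hand:

// ii. All triples with the predicate 'author' (IRI assumed)
val authorTriples = spark.sql(
  "SELECT * FROM triple WHERE predicate = 'http://commons.dbpedia.org/property/author'")

// iii. All triples authored by 'Andre_Engels' (IRI assumed)
val byAndreEngels = spark.sql(
  "SELECT * FROM triple WHERE predicate = 'http://commons.dbpedia.org/property/author' " +
  "AND `object` = 'http://commons.dbpedia.org/resource/User:Andre_Engels'")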
2. Spark GraphX operations
a. After the file (page_links_simple.nt.bz2) has been downloaded, unzipped, and uploaded to HDFS under the /yourname folder, create an RDD from it.
b. First create a Scala class Triple holding the information of a triple read from the file. Since the data is an .nt file containing rows of triples in the format <subject> <predicate> <object>, the lines need to be transformed into this representation. Hint: use the map function.
c. Use the filter transformation to return a new RDD with a subset of the triples by discarding lines that start with "#", which in an .nt file represent comments.
d. Perform these operations to transform your data into a GraphX graph:
   i. Generate the vertices by indexing the distinct subjects and objects, producing pairs of (VertexId, value).
   ii. Create the edges by using the subject (and then the object) as a key to join with the vertex index, producing Edge objects of the form (s_index, obj_index, predicate).
e. Compute the connected components for triples containing "author" as predicate.
f. Compute the triangle count.
g. List the top 5 connected components by applying PageRank over them.
------------------------------------------------------Solution----------------------------------------------------------

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object graphxlab {
  def main(args: Array[String]) = {
    val input = "src/main/resources/rdf.nt" // args(0)

    val spark = SparkSession.builder
      .master("local[*]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .appName("GraphX example")
      .getOrCreate()

    // Read the .nt file and parse every line into a Triple(subject, predicate, object)
    val tripleRDD = spark.sparkContext.textFile(input)
      .map(TripleUtils.parsTriples)

    val tutleSubjectObject = tripleRDD.map { x => (x.subject, x.`object`) }

    type VertexId = Long

    // Assign a unique index to every distinct resource (subjects and objects)
    val indexVertexID = (tripleRDD.map(_.subject) union tripleRDD.map(_.`object`))
      .distinct()
      .zipWithIndex()

    // Vertices: (index, resource)
    val vertices: RDD[(VertexId, String)] = indexVertexID.map(f => (f._2, f._1))

    // Join each triple with the index of its subject ...
    val tuples = tripleRDD.keyBy(_.subject).join(indexVertexID).map({
      case (k, (tech.sda.arcana.spark.worksheet3.Triple(s, p, o), si)) => (o, (si, p))
    })

    // ... and with the index of its object to build edges of the form (s_index, obj_index, predicate)
    val edges: RDD[Edge[String]] = tuples.join(indexVertexID).map({
      case (k, ((si, p), oi)) => Edge(si, oi, p)
    })

    val graph = Graph(vertices, edges)

    graph.vertices.collect().foreach(println(_))

    println("edges")
    graph.edges.collect().foreach(println(_))

    // Subgraph containing only the edges with the given predicate
    val subrealsourse = graph.subgraph(t => t.attr == "http://commons.dbpedia.org/property/source")
    println("subrealsourse")
    subrealsourse.vertices.collect().foreach(println(_))

    // Connected components of the filtered subgraph
    val conncompo = subrealsourse.connectedComponents()

    // PageRank until convergence (tolerance 0.0001)
    val pageranl = graph.pageRank(0.0001)

    // Join the ranks back to the vertex labels and sort by rank (descending)
    val printoutrankedtriples = pageranl.vertices.join(graph.vertices)
      .map({ case (k, (r, v)) => (k, r, v) })
      .sortBy(_._2, ascending = false)

    println("printoutrankedtriples")
    printoutrankedtriples.take(5).foreach(println(_))
  }
}

---------------------------------------------------------------------------------------------------
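
Task f (the triangle count) is not covered by the sample solution above; a minimal sketch, assuming the graph value built in the solution, is:

import org.apache.spark.graphx.PartitionStrategy

// triangleCount expects the graph to be partitioned (and, in older Spark
// versions, the edges to be in canonical srcId < dstId orientation)
val triangles = graph.partitionBy(PartitionStrategy.RandomVertexCut).triangleCount()
triangles.vertices.collect().foreach(println(_))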

AT HOME

1. Read and explore
   a. Spark SQL, DataFrames and Datasets Guide
   b. GraphX Programming Guide
2. RDF class distribution - using Spark SQL, count the usage of the respective classes of an RDF dataset.
Hint: a class fulfils the rule (?predicate = rdf:type && ?object.isIRI()).
   a. Read the .nt file into an RDD of triples.
   b. Apply a map function to separate each triple into (Subject, Predicate, Object).
   c. Apply a filter transformation to select the triples that define classes.
   d. Count the frequencies of the objects using a SQL statement (see the first sketch after this list).
   e. Return the top 100 classes used in the dataset.
3. Using GraphX to analyze a real graph (see the second sketch after this list)
   a. Count the number of vertices and edges in the graph.
   b. How many resources are in your graph?
   c. What is the maximum in-degree of the graph?
   d. Which triples are related to 'Category:Events'?
   e. Run PageRank for 50 iterations.
   f. Compute the similarity between two nodes using Spark GraphX.
      i. Apply different similarity measures:
         1. Jaccard similarity
         2. Edit distance
4. Further readings
a. Spark SQL: Relational Data Processing in Spark
b. Shark: SQL and Rich Analytics at Scale
c. GraphX: Graph Processing in a Distributed Dataflow Framework
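
For exercise 2, a minimal Spark SQL sketch of the class distribution, assuming the temporary view "triple" from the in-class solution and using the full rdf:type IRI (the ?object.isIRI() check is omitted here for brevity), could look like:

// Classes are the objects of rdf:type triples; count how often each one is used
val classDistribution = spark.sql(
  "SELECT `object` AS class, count(*) AS freq FROM triple " +
  "WHERE predicate = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' " +
  "GROUP BY `object` ORDER BY freq DESC LIMIT 100")
classDistribution.show(100, false)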
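
For exercise 3, the following sketch assumes the property graph named graph from the in-class GraphX solution; the Jaccard helper is one possible neighbour-set formulation, not the only one:

import org.apache.spark.graphx.{EdgeDirection, Graph, VertexId}

// a./b. Number of vertices (i.e. distinct resources) and edges in the graph
println(graph.numVertices)
println(graph.numEdges)

// c. Maximum in-degree
println(graph.inDegrees.map(_._2).max())

// e. PageRank with a fixed number of 50 iterations (static PageRank)
val ranks50 = graph.staticPageRank(50).vertices

// f.i.1 Jaccard similarity of two vertices, computed over their neighbour-id sets
def jaccard(g: Graph[String, String], a: VertexId, b: VertexId): Double = {
  val neighbours = g.collectNeighborIds(EdgeDirection.Either)
  val na = neighbours.lookup(a).headOption.getOrElse(Array.empty[VertexId]).toSet
  val nb = neighbours.lookup(b).headOption.getOrElse(Array.empty[VertexId]).toSet
  if (na.isEmpty && nb.isEmpty) 0.0
  else (na intersect nb).size.toDouble / (na union nb).size.toDouble
}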
