HomeWork2 Tutorial

This document contains instructions and questions for homework 2 on graph analysis using GraphFrames in Spark. It introduces GraphFrames and describes how to represent graphs with vertices and edges as DataFrames. It also covers algorithms like connected components and PageRank and provides code skeletons for applying these algorithms to analyze a social network dataset. The homework asks students to provide friend recommendations for users and perform graph analysis on the social network data, including computing connected components and PageRank scores.

EECS E6893 Big Data Analytics

HW2: Friend Recommendations, GraphFrames

Hritik Jain, [email protected]

11/13/2020
GraphFrames
● DataFrame-based Graph
● GraphX is to RDDs as GraphFrames are to DataFrames
● Represent graphs: vertices (e.g. users) and edges (e.g. relationships between
users)
● GraphFrames package separate from core Apache Spark
Connected components
● A subgraph in which any two vertices are connected to each other by a path of edges,
and which is not connected to any other vertices in the graph
● In a social network, connected components can approximate clusters
● In the GraphFrame, the connected components algorithm labels each
connected component of the graph with the ID of its lowest-numbered vertex

Reference: https://en.wikipedia.org/wiki/Component_(graph_theory)
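To make the labeling rule above concrete, here is a minimal plain-Python sketch (not the GraphFrames implementation) that labels every vertex with the lowest-numbered vertex ID in its component. The function name `connected_components` and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Map each vertex to the smallest vertex ID in its connected component."""
    # Build an undirected adjacency list from the edge pairs.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    label = {}
    # Visiting start vertices in ascending order guarantees that the first
    # unlabeled vertex we reach is the smallest one in its component.
    for start in sorted(vertices):
        if start in label:
            continue
        label[start] = start
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for w in adj[u]:
                if w not in label:
                    label[w] = start
                    queue.append(w)
    return label


# Hypothetical toy graph: {0, 1, 2} and {3, 4} are connected; 5 is isolated.
labels = connected_components([0, 1, 2, 3, 4, 5], [(0, 1), (1, 2), (3, 4)])
print(labels)
```

Vertices that share a label belong to the same component, which is how the result can be used to approximate clusters in a social network.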
PageRank
● PageRank measures the importance of each vertex in a graph
● An edge from u to v represents an endorsement of v’s importance by u

d: damping factor; default = 0.85. This corresponds to a 15% chance that a typical user won't follow any links on the page and instead navigates to a new random URL.

● Convergence occurs when the change in every PageRank value between successive iterations is within the margin of error.

Reference: https://en.wikipedia.org/wiki/PageRank
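For reference, the damping factor d enters the standard PageRank recurrence as written below. This is the textbook form given in the Wikipedia article cited above, not necessarily the exact expression shown on the original slide. The symbols N (number of vertices), M(p_i) (the set of vertices linking to p_i) and L(p_j) (the number of outgoing edges of p_j) follow that article and are not defined in the slides.

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

With d = 0.85, the first term is the 15% share that every vertex receives from random jumps, and the second term is the rank endorsed by its in-neighbors.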
PageRank (Spark)
pageRank(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)

Parameters:

resetProbability: 1-d, Probability of resetting to a random vertex, default=0.15

maxIter: If set, the algorithm is run for a fixed number of iterations.

tol: If set, the algorithm is run until the ranks change by no more than the given tolerance (margin of error) between iterations.

NOTE: Exactly one of maxIter or tol must be set.
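The two stopping rules can be illustrated with a small plain-Python power iteration. This is a sketch under stated assumptions, not the GraphFrames implementation: the function `pagerank`, its snake_case parameters, the even redistribution of rank from vertices without out-links, and the normalization so that ranks sum to 1 are all choices made for this example.

```python
def pagerank(out_links, reset_probability=0.15, max_iter=None, tol=None):
    """Toy PageRank on a dict {vertex: [vertices it links to]}."""
    # Mirror the rule from the slide: exactly one stopping criterion.
    if (max_iter is None) == (tol is None):
        raise ValueError("Exactly one of max_iter or tol must be set.")

    vertices = sorted(
        set(out_links) | {v for targets in out_links.values() for v in targets}
    )
    n = len(vertices)
    d = 1.0 - reset_probability  # damping factor
    rank = {v: 1.0 / n for v in vertices}

    iteration = 0
    while True:
        # Assumption: rank held by vertices with no out-links is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_links.get(v))
        new_rank = {v: (1.0 - d) / n + d * dangling / n for v in vertices}
        for u, targets in out_links.items():
            for v in targets:
                new_rank[v] += d * rank[u] / len(targets)

        iteration += 1
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank

        # Fixed number of iterations (max_iter) or convergence (tol).
        if max_iter is not None and iteration >= max_iter:
            break
        if tol is not None and delta <= tol:
            break
    return rank


# Hypothetical toy graph: an edge u -> v is an endorsement of v by u.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
by_tolerance = pagerank(graph, tol=1e-8)
by_iterations = pagerank(graph, max_iter=10)
print(by_tolerance)
print(by_iterations)
```

In this toy graph, "c" is endorsed by both "a" and "b", so it ends up ranked above "b", which is endorsed only by "a".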

HW2
● Question 1: Friend Recommendations
● Question 2: Graph Analysis using GraphFrames
Environment Setup
1. Create multiple workers on Dataproc instead of a single node; otherwise it will
take a long time to run.
2. Install the graphframes package in Spark when creating the cluster.
(For reference, see the documentation on configuring Spark properties.)

Cloud Shell: This is for Python 3. You can modify it.

gcloud beta dataproc clusters create <cluster-name> \
    --optional-components=ANACONDA,JUPYTER --image-version=preview \
    --enable-component-gateway --bucket <bucket-name> --project <project-id> \
    --num-workers 3 --metadata PIP_PACKAGES=graphframes==0.6 \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
    --properties spark:spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11

On the slide, the flags from --num-workers through --initialization-actions are marked 1, and the --properties flag is marked 2.
Q1
● Write a Spark program that implements a simple “People You Might Know”
social network friendship recommendation algorithm. The key idea is that if
two people have a lot of mutual friends, then the system should recommend
that they connect with each other.
● Question: Give recommendations for 10 users

● Dataset Format

<User> <Tab> <Friends>

<User> is a unique ID; <Friends> is a comma-separated list of unique IDs
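As a rough illustration of the dataset format and the mutual-friends idea, the following plain-Python sketch parses a few lines and ranks candidates by mutual-friend count. It is not the course's code skeleton and does not use Spark. The helper names `parse_line` and `recommend`, the sample lines, the cutoff `k`, and the tie-break by ascending numeric ID are all assumptions made for this example.

```python
from collections import Counter


def parse_line(line):
    """Split '<User><TAB><Friends>' into (user, [friend, ...])."""
    user, _, friends = line.rstrip("\n").partition("\t")
    # A user with no friends has an empty <Friends> field.
    return user, [f for f in friends.split(",") if f]


def recommend(adjacency, user, k=10):
    """Rank non-friends of `user` by their number of mutual friends."""
    friends = set(adjacency.get(user, []))
    counts = Counter()
    for friend in friends:
        for candidate in adjacency.get(friend, []):
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Assumption: most mutual friends first, ties broken by ascending numeric ID.
    ranked = sorted(counts, key=lambda c: (-counts[c], int(c)))
    return ranked[:k]


# Hypothetical sample data in the <User><TAB><Friends> format.
lines = ["0\t1,2,3", "1\t0,4", "2\t0,4", "3\t0,5", "4\t1,2", "5\t3", "6\t"]
adjacency = dict(parse_line(line) for line in lines)
print(recommend(adjacency, "0"))
```

Here user 4 shares two friends (1 and 2) with user 0, while user 5 shares one (3), so 4 is recommended before 5. In a Spark version, the same counting would typically be expressed as transformations over pairs of users rather than nested loops.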

Q1 - Code Skeleton
Q1 - Function example
Q2
● Use the Q1 dataset again to do the graph analysis
● Connected components
● PageRank
Q2
● Step 1
○ Format the provided dataset into two Spark DataFrames: edges and vertices
■ Notice: If the vertices have no other properties (as in our case), create each vertex as a
one-element tuple with a trailing comma, e.g. (id,). Without that comma, a string inside
parentheses is not treated as a tuple but as a single string. If there are other properties,
the extra comma is not needed.
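The trailing-comma rule is plain Python behavior and can be checked without Spark. The sketch below also builds vertex and edge rows from a small, hypothetical adjacency mapping; the variable names are illustrative only.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
with_comma = ("42",)
without_comma = ("42")
print(type(with_comma).__name__)
print(type(without_comma).__name__)

# Hypothetical adjacency mapping: user -> list of friends.
adjacency = {"0": ["1", "2"], "1": ["0"], "2": ["0"]}

# One single-element tuple per vertex, one (source, destination) pair per edge.
vertex_rows = [(user,) for user in sorted(adjacency)]
edge_rows = [
    (user, friend) for user in sorted(adjacency) for friend in adjacency[user]
]
print(vertex_rows)
print(edge_rows)
```

Rows shaped like these can then be turned into DataFrames. GraphFrames expects the vertex DataFrame to have a column named "id" and the edge DataFrame to have columns named "src" and "dst".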
Q2
● Step 2
○ Convert the RDD to DataFrame
■ Directly convert to DataFrame
■ Save RDD to csv, then read csv to DataFrame

● Step 3
○ Create graph

from graphframes import *
g = GraphFrame(v, e)

If you set the environment correctly following the instructions above, there should be no problem with Jupyter. If you are using Spark shell and it doesn't work, you could try running Spark with:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
Q2 - Connected components
● Notice

If you are using connected components and get an error about the checkpoint directory, you could refer to the following answer on Stack Overflow:

https://stackoverflow.com/questions/49159896/how-to-set-checkpiont-dir-pyspark-data-science-experience
Q2 - PageRank
● results = g.pageRank(resetProbability=0.15, tol=0.01)
● There are multiple parameters. You can experiment with them to see whether the
results differ.
References
● https://graphframes.github.io/graphframes/docs/_site/index.html
● https://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
