HomeWork2 Tutorial

This document contains instructions and questions for homework 2 on graph analysis using GraphFrames in Spark. It introduces GraphFrames and describes how to represent graphs with vertices and edges as DataFrames. It also covers algorithms like connected components and PageRank and provides code skeletons for applying these algorithms to analyze a social network dataset. The homework asks students to provide friend recommendations for users and perform graph analysis on the social network data, including computing connected components and PageRank scores.

EECS E6893 Big Data Analytics

HW2: Friend Recommendations, GraphFrames

Hritik Jain, [email protected]

11/13/2020
GraphFrames
● DataFrame-based Graph
● GraphX is to RDDs as GraphFrames are to DataFrames
● Represent graphs: vertices (e.g. users) and edges (e.g. relationships between
users)
● GraphFrames package separate from core Apache Spark
Connected components
● A subgraph in which any two vertices are connected to each other by a path of edges, and no vertex is connected to any vertex outside the subgraph
● In a social network, connected components can approximate clusters
● In GraphFrames, the connected components algorithm labels each
connected component of the graph with the ID of its lowest-numbered vertex

Reference: https://en.wikipedia.org/wiki/Component_(graph_theory)
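The labeling rule described above can be illustrated with a small plain-Python sketch. It does not use Spark or GraphFrames; the function name connected_components and the toy graph are assumptions made for illustration, and the graph is treated as undirected.

```python
from collections import defaultdict


def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex ID in its component."""
    # Build an undirected adjacency list from the edge pairs.
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    labels = {}
    # Visiting vertices in ascending order means the first unlabeled vertex
    # we meet is the lowest-numbered vertex of its component.
    for start in sorted(vertices):
        if start in labels:
            continue
        labels[start] = start
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbor in neighbors[current]:
                if neighbor not in labels:
                    labels[neighbor] = start
                    stack.append(neighbor)
    return labels


# Toy graph (hypothetical data): two components, {1, 2, 3} and {4, 5}.
labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)
```

Vertices that share a label belong to the same component, so grouping by label gives the candidate clusters.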
PageRank
● PageRank measures the importance of each vertex in a graph
● An edge from u to v represents an endorsement of v’s importance by u

d: damping factor;

default = 0.85, which leaves a 15% chance that a typical user won't follow any link on the page and will instead navigate to a new random URL.

● Convergence occurs when no PageRank value changes by more than the margin of error between iterations.

Reference: https://en.wikipedia.org/wiki/PageRank
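As a rough illustration of the damping factor and the convergence test, here is a minimal plain-Python sketch of the iterative PageRank computation in its normalized form, where ranks sum to 1. It is not the GraphFrames implementation, whose output may be scaled differently. The function name pagerank, the default tol and max_iter values, and the toy edges are assumptions.

```python
def pagerank(edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PageRank until no value changes by more than tol."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Rank held by vertices without out-links is spread over all vertices.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {v: (1 - d) / n + d * dangling / n for v in vertices}
        # Each vertex u endorses v by passing on an equal share of its rank.
        for src in vertices:
            for dst in out_links[src]:
                new_rank[dst] += d * rank[src] / len(out_links[src])
        change = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph (hypothetical data): a 3-cycle plus one extra vertex pointing into it.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
print(ranks)
```

Vertex d has no incoming edges, so it keeps only the (1 - d) / N share, while a is endorsed by both c and d and ranks higher.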
PageRank (Spark)
pageRank(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)

Parameters:

resetProbability: 1-d, Probability of resetting to a random vertex, default=0.15

maxIter: If set, the algorithm is run for a fixed number of iterations.

tol: If set, the algorithm is run until the ranks change by less than the given tolerance (margin of error).

NOTE: Exactly one of maxIter or tol must be set.


HW2
● Question 1: Friend Recommendations
● Question 2: Graph Analysis using GraphFrames
Environment Setup
1. Create multiple workers on Dataproc instead of a single node; otherwise it will
take a long time to run.
2. Install the graphframes package in Spark when creating the cluster.
(You can refer to the Spark properties configuration.)

Cloud Shell: This is for Python 3. You can modify it.
In the command, --num-workers addresses item 1 and --properties addresses item 2.

gcloud beta dataproc clusters create <cluster-name> \
    --optional-components=ANACONDA,JUPYTER --image-version=preview \
    --enable-component-gateway --bucket <bucket-name> --project <project-id> \
    --num-workers 3 --metadata PIP_PACKAGES=graphframes==0.6 \
    --initialization-actions \
    gs://dataproc-initialization-actions/python/pip-install.sh \
    --properties \
    spark:spark.jars.packages=graphframes:graphframes:0.6.0-spark2.3-s_2.11
Q1
● Write a Spark program that implements a simple “People You Might Know”
social network friendship recommendation algorithm. The key idea is that if
two people have a lot of mutual friends, then the system should recommend
that they connect with each other.
● Question: Give recommendations for 10 users

● Dataset Format

<User> <Tab> <Friends>

<User> is a unique ID; <Friends> is a comma-separated list of unique IDs
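The mutual-friend idea and the dataset format can be sketched in plain Python before moving to Spark. This is only an illustration of the logic, not the homework's code skeleton. The function names parse_line and recommend, the top_n cutoff, and the tie-break by ascending user ID are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations


def parse_line(line):
    """Split '<User><Tab><Friends>' into (user_id, list_of_friend_ids)."""
    user, _, friends = line.rstrip("\n").partition("\t")
    return int(user), [int(f) for f in friends.split(",") if f]


def recommend(lines, top_n=10):
    """Recommend non-friends ranked by number of mutual friends."""
    friends = {}
    for line in lines:
        user, friend_ids = parse_line(line)
        friends[user] = set(friend_ids)

    # Any two friends of the same user share that user as a mutual friend.
    mutual_counts = defaultdict(Counter)
    for friend_set in friends.values():
        for a, b in permutations(friend_set, 2):
            if b not in friends.get(a, set()):  # skip existing friendships
                mutual_counts[a][b] += 1

    recommendations = {}
    for user in friends:
        # Assumed ordering: most mutual friends first, ties by ascending ID.
        ranked = sorted(mutual_counts[user].items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[user] = [candidate for candidate, _ in ranked[:top_n]]
    return recommendations


# Toy input (hypothetical data) in the dataset format; user 4 has no friends.
sample = ["0\t1,2,3", "1\t0,2", "2\t0,1", "3\t0", "4\t"]
recs = recommend(sample)
print(recs)
```

In a Spark version, the pair generation would typically become a flatMap over each user's friend list followed by a reduceByKey that sums the mutual-friend counts.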


Q1 - Code Skeleton
Q1 - Function example
Q2
● Use the Q1 dataset again to do the graph analysis
● Connected Component
● PageRank
Q2
● Step 1
○ Format the provided dataset into two Spark DataFrames: edges and vertices
■ Notice: If the vertices have no other properties (as in our case), create each vertex as a
one-element tuple with a trailing comma, e.g. (id,); otherwise a string inside parentheses is
treated as a single string, not as a tuple. If there are other properties, the extra comma is not needed.
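The trailing-comma point can be checked in plain Python. The sketch below turns dataset lines into vertex and edge tuples. The function name to_vertices_and_edges and the sample lines are assumptions, and IDs are kept as strings for simplicity.

```python
def to_vertices_and_edges(lines):
    """Build one-element vertex tuples and (src, dst) edge tuples."""
    vertices, edges = [], []
    for line in lines:
        user, _, friends = line.rstrip("\n").partition("\t")
        # The trailing comma makes this a tuple; (user) would just be a string.
        vertices.append((user,))
        for friend in friends.split(","):
            if friend:
                edges.append((user, friend))
    return vertices, edges


# Toy input (hypothetical data) in the '<User><Tab><Friends>' format.
vertices, edges = to_vertices_and_edges(["0\t1,2", "1\t0", "2\t0"])
print(type(("0")), type(("0",)))
print(vertices)
print(edges)
```

GraphFrames expects the vertex DataFrame to have a column named id and the edge DataFrame to have columns named src and dst, so with a SparkSession available these lists could be passed to spark.createDataFrame(vertices, ["id"]) and spark.createDataFrame(edges, ["src", "dst"]).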
Q2
● Step 2
○ Convert the RDD to DataFrame
■ Directly convert to DataFrame
■ Save RDD to csv, then read csv to DataFrame

● Step 3
○ Create graph

from graphframes import *
g = GraphFrame(v, e)

If you set the environment correctly following the instructions above, there should be no problem with
Jupyter. If you are using Spark shell and it doesn't work, you could try running:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
Q2 - Connected components
● Notice

If you are using connected components and get an error (the linked answer concerns setting the Spark checkpoint directory), you can refer to the following answer on Stack Overflow:

https://stackoverflow.com/questions/49159896/how-to-set-checkpiont-dir-pyspark-data-science-experience
Q2 - PageRank
● results = g.pageRank(resetProbability=0.15, tol=0.01)
● There are multiple parameters. You can experiment with them and see whether the
results differ.
References
● https://graphframes.github.io/graphframes/docs/_site/index.html
● https://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
