HomeWork2 Tutorial
11/13/2020
GraphFrames
● DataFrame-based Graph
● GraphX is to RDDs as GraphFrames are to DataFrames
● Represent graphs: vertices (e.g. users) and edges (e.g. relationships between
users)
● GraphFrames package separate from core Apache Spark
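To make the vertex/edge representation concrete, here is a minimal plain-Python sketch (no Spark required). It assumes the GraphFrames column convention of an "id" column for vertices and "src"/"dst" columns for edges. The sample records and the helper name out_neighbors are hypothetical and only for illustration.

```python
# Vertex table: one record per vertex, keyed by "id" (hypothetical sample data).
vertices = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]

# Edge table: one record per directed edge, from "src" to "dst".
edges = [
    {"src": 1, "dst": 2, "relationship": "follows"},
    {"src": 2, "dst": 3, "relationship": "follows"},
    {"src": 3, "dst": 1, "relationship": "friend"},
]


def out_neighbors(vertices, edges):
    """Build an adjacency list {vertex id: [destination ids]} from the two tables."""
    adjacency = {v["id"]: [] for v in vertices}
    for e in edges:
        adjacency[e["src"]].append(e["dst"])
    return adjacency


adjacency = out_neighbors(vertices, edges)
print(adjacency)
```

In GraphFrames itself the same two tables would be DataFrames rather than lists of dictionaries.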
Connected components
● A subgraph in which any two vertices are connected to each other by a path of
edges, and which has no edges to any other vertex in the graph
● In a social network, connected components can approximate clusters
● In GraphFrames, the connected components algorithm labels each
connected component of the graph with the ID of its lowest-numbered vertex
Reference: https://en.wikipedia.org/wiki/Component_(graph_theory)
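The labeling rule above (each component tagged with its lowest-numbered vertex ID) can be sketched in plain Python with a union-find structure. This is an illustrative sketch only, not the algorithm GraphFrames actually runs. The function name and sample graph are hypothetical, and edges are treated as undirected.

```python
def connected_components(vertex_ids, edges):
    """Map each vertex ID to the smallest vertex ID in its connected component."""
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Follow parent links to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            # Keep the smaller ID as the root so it ends up as the label.
            if root_a < root_b:
                parent[root_b] = root_a
            else:
                parent[root_a] = root_b

    return {v: find(v) for v in vertex_ids}


# Hypothetical graph: {1, 2, 3} and {4, 5} are linked; 6 is isolated.
labels = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 4)])
print(labels)
```

Vertices that share a label belong to the same component, so grouping by label gives the clusters.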
PageRank
● PageRank measures the importance of each vertex in a graph
● An edge from u to v represents an endorsement of v’s importance by u
d: damping factor;
default = 0.85, meaning there is a 15% chance that a typical user won’t follow any link on the page and will instead navigate to a new random URL.
● Convergence occurs when no PageRank value changes by more than the margin of error (tolerance) from one iteration to the next.
Reference: https://en.wikipedia.org/wiki/PageRank
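The damping factor and the convergence test can be illustrated with a small plain-Python power iteration. This sketch assumes the normalized formulation from the referenced Wikipedia article, PR(v) = (1 - d)/N + d * sum of PR(u)/outdegree(u) over all u linking to v. It is not necessarily what Spark computes internally. The function name, the default tol and max_iter values, and the handling of vertices without out-links are assumptions made for this example.

```python
def page_rank(vertex_ids, edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PageRank until no value changes by more than tol (or max_iter is hit)."""
    n = len(vertex_ids)
    out_links = {v: [] for v in vertex_ids}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all vertices.
    ranks = {v: 1.0 / n for v in vertex_ids}
    for _ in range(max_iter):
        # Every vertex receives the random-jump share (1 - d) / N.
        new_ranks = {v: (1.0 - d) / n for v in vertex_ids}
        for v in vertex_ids:
            targets = out_links[v]
            if not targets:
                # Assumption: a vertex with no out-links spreads its rank evenly.
                targets = vertex_ids
            share = d * ranks[v] / len(targets)
            for t in targets:
                new_ranks[t] += share

        # Converged when the largest per-vertex change is within the tolerance.
        converged = max(abs(new_ranks[v] - ranks[v]) for v in vertex_ids) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


# Hypothetical 3-vertex cycle: every vertex is endorsed exactly once.
ranks = page_rank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
print(ranks)
```

In this formulation the ranks sum to 1, and a vertex endorsed by more (or more important) vertices ends up with a higher value.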
PageRank (Spark)
pageRank(resetProbability=0.15, sourceId=None, maxIter=None, tol=None)
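Parameters:
resetProbability: probability of jumping to a random vertex; equal to 1 - d, so the default 0.15 corresponds to d = 0.85.
tol: If set, the algorithm runs until the PageRank values converge to within the given tolerance (margin of error).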
● Dataset Format
If you are using connected components and get an error like