Large Scale Distributed Graph Processing: Data Mining (CS6720)
03-03-2020
Data Distribution

The 𝑘-machine Model
• Input data size 𝑁 words; each word is 𝑂(log 𝑁) bits.
• There are 𝑘 machines, identified by {1, 2, …, 𝑘}.
• Each pair of machines is connected by a link.
• Memory size is unbounded (but usually not abused).
• Computation proceeds in synchronous communication rounds. In each round:
• Local computation within each machine.
• Each machine creates one message of 𝑂(log 𝑛) bits for every other machine.
• Messages are then sent and received.
• Goal: Solve the problem in as few rounds as possible.

The Random Vertex Partitioning (RVP)
• Typically, data is split into words (often as ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs).
• The words may be distributed across the machines either randomly or arbitrarily.
• This setup is typically used in processing large graphs.
• RVP: The most common approach is to randomly partition the vertices into 𝑘 parts and place each part in one of the machines. Then, a copy of each edge is placed in the (≤ 2) machines that contain either of its end points (see the sketch below).
• Other partitionings of the graph data are also conceivable (e.g., random edge partitioning, arbitrary edge partitioning, etc.).
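To make RVP concrete, here is a minimal Python sketch of the partitioning step. The function name and data layout are illustrative assumptions, not part of the lecture.

```python
import random
from collections import defaultdict

def random_vertex_partition(vertices, edges, k, seed=0):
    """RVP sketch: assign each vertex to a machine uniformly at random,
    then place a copy of each edge on the (at most 2) machines that
    hold its endpoints."""
    rng = random.Random(seed)
    machine_of = {v: rng.randrange(1, k + 1) for v in vertices}  # machines 1..k
    local_edges = defaultdict(set)  # machine id -> edges stored there
    for (u, v) in edges:
        local_edges[machine_of[u]].add((u, v))
        local_edges[machine_of[v]].add((u, v))  # no-op if both endpoints share a machine
    return machine_of, local_edges
```

Note that each edge ends up on at most two machines, matching the (≤ 2) copies mentioned above.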
The CC and NCC Models (point to point)
• We have 𝑛 nodes 𝑉 = {1, 2, …, 𝑛}.
• The input graph 𝐺 = (𝑉, 𝐸) is known locally, i.e., each node 𝑣 ∈ 𝑉 knows its incident edges.
• The nodes communicate via synchronous message passing, but each message must be at most 𝑂(log 𝑛) bits.
• CC: Each node can send 𝑛 − 1 messages per round (one for every other node).
• NCC: Each node can send at most 𝑂(log 𝑛) messages to 𝑂(log 𝑛) carefully chosen nodes.

Simulating CC/NCC in the 𝑘-machine Model
• Assume there is a hash function ℎ: 𝑉 → {1, 2, …, 𝑘} that is a simple uniform hash function. (The claims will hold under 𝑂(log 𝑛)-universal families.)
• Assume that each node 𝑣 ∈ 𝑉 is placed in machine ℎ(𝑣).
• Each machine 𝑖 now contains (and therefore simulates) the nodes 𝑉ᵢ = {𝑣 | ℎ(𝑣) = 𝑖}. We know that |𝑉ᵢ| ∈ 𝑂(𝑛/𝑘) whp.

Simulation of one NCC round (at each machine 𝑖), sketched in code below:
1. Machine 𝑖 performs local computation for all nodes in 𝑉ᵢ, as per the CC/NCC algorithm.
2. The messages to be sent are then individually sent to the machines that hold their respective recipient nodes.
3. Incoming messages are received and handed over to the recipient nodes.
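A minimal sketch of one simulated round at a single machine follows. The `compute` and `h` callables and the `state`/`inbox` layout are hypothetical interfaces chosen for illustration; they are not from the lecture.

```python
from collections import defaultdict

def simulate_ncc_round(local_nodes, state, inbox, h, compute):
    """One simulated CC/NCC round at one machine.

    compute(v, state[v], messages) runs the algorithm's local step for
    node v and returns (new_state, outgoing), where outgoing is a list
    of (recipient, payload) pairs; h(v) names the machine hosting v."""
    outbox = defaultdict(list)  # destination machine -> bundled messages
    for v in local_nodes:       # step 1: local computation per simulated node
        state[v], outgoing = compute(v, state[v], inbox.get(v, []))
        for recipient, payload in outgoing:
            # step 2: route each message to the machine hosting its recipient
            outbox[h(recipient)].append((recipient, payload))
    return outbox
```

On receipt, each machine appends the incoming (recipient, payload) pairs to its own per-node inbox, which is step 3 of the simulation.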
Upcasting messages in a tree (under NCC)
• Input: some 𝑅 nodes in the tree have a message each.
• Output: those messages must reach the root.

Strawman attempt: those 𝑅 nodes send their messages directly to the root. This violates NCC, since the root may have to receive far more than 𝑂(log 𝑛) messages in a round.

Correct attempt (simulated in the sketch at the end of this section):
• At the start of round 𝑟, each node 𝑣 ∈ 𝑉 has a set of messages 𝑈ᵥ. Initially, 𝑈ᵥ contains the message that 𝑣 wishes to upcast (and is empty otherwise).
• Then, in round 𝑟, each 𝑣 picks one message 𝑥 from 𝑈ᵥ and sends it to its parent. In turn, it receives up to two messages (say, 𝑦 and 𝑧) from its children. Thus, at the end of the round, 𝑈ᵥ ← (𝑈ᵥ ∖ {𝑥}) ∪ {𝑦, 𝑧}.

Homework: How can we adapt this algorithm to ensure each node knows when to terminate?

Claim: Upcasting takes 𝑂(log 𝑛 + 𝑅) rounds.
• A “clump” is a maximal connected collection of vertices whose 𝑈ᵥ’s are non-empty.
• There is at most one clump containing the root. Call it the root clump.
• Each clump has a node closest to the root. Call it the root of the clump.
• Claim: Say two messages are “friends” if they are part of the same clump. Once two messages become friends, they remain friends.
• Consequence: clumps can coalesce, but never break apart.
• Claim: The root of any non-root clump moves closer to the tree root in each round. This can happen in two ways:
1. The clump’s root moves up without coalescing with another clump whose root is higher.
2. The clump’s root moves up and coalesces with another clump whose root is higher.
• Consequence: Every message becomes part of the root clump within 𝑂(log 𝑛) rounds (using that the tree has depth 𝑂(log 𝑛)).
• Claim: Once a clump becomes the root clump, it reduces to just the root in at most 𝑅 rounds. Homework: articulate why this is true.
• Combining the two claims: every message joins the root clump within 𝑂(log 𝑛) rounds and the root clump drains within 𝑅 further rounds, giving 𝑂(log 𝑛 + 𝑅) in total.
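For intuition, here is a round-by-round Python simulation of the upcast procedure on a rooted tree. The helper name and its data layout are assumptions made for illustration, not lecture code; messages are assumed pairwise distinct (e.g., tagged with their origin).

```python
def upcast(parent, initial_msgs):
    """Simulate the upcast algorithm round by round.

    parent maps each node to its parent (None for the root);
    initial_msgs maps each of the R source nodes to its (distinct) message.
    Returns the number of rounds until the root holds all R messages."""
    nodes = list(parent)
    root = next(v for v in nodes if parent[v] is None)
    U = {v: set() for v in nodes}               # the message buffers U_v
    for v, m in initial_msgs.items():
        U[v].add(m)
    collected = set(U[root])                    # the root's own message, if any
    U[root].clear()
    rounds = 0
    while len(collected) < len(initial_msgs):
        rounds += 1
        received = {v: [] for v in nodes}
        for v in nodes:                         # each buffering node sends once
            if v != root and U[v]:
                x = next(iter(U[v]))            # pick one message x from U_v
                U[v].discard(x)
                received[parent[v]].append(x)
        for v in nodes:                         # apply U_v <- (U_v \ {x}) ∪ {y, z}
            if v == root:
                collected.update(received[v])
            else:
                U[v].update(received[v])
    return rounds
```

For example, on the path 3 → 2 → 1 with root 1 and sources {2, 3}, `upcast({1: None, 2: 1, 3: 2}, {2: "m2", 3: "m3"})` returns 2 rounds.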