Distributed Large-Scale Graph Processing: Data Mining (CS6720)
John Augustine
Jan 16, 2020
26-02-2020
The 𝑘-machine Model
• Input data size: 𝑁 words; each word = 𝑂(log 𝑁) bits.
• The number of machines is 𝑘. (Machines are identified by {1, 2, …, 𝑘}.)
• Memory size is unbounded (but usually not abused).
• Synchronous communication rounds:
  • Local computation within each machine.
  • Each machine creates one message of 𝑂(log 𝑛) bits for every other machine.
  • Send… Receive.
• Goal: Solve the problem in as few rounds as possible.

Data Distribution
• Typically, data is split into words (often as ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs).
• The words could be either randomly distributed or arbitrarily distributed.
• Typically used in processing large graphs.
• RVP: The most common approach is to randomly partition the vertices into 𝑘 parts and place each part into one of the machines. Then, a copy of each edge is placed in the (≤ 2) machines that contain either of its endpoints.
• Other partitionings of graph data are also conceivable (e.g., random edge partitioning, arbitrary edge partitioning, etc.).

The Random Vertex Partitioning (RVP)
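The RVP step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference code: the function name `random_vertex_partition`, the `seed` parameter, and the choice to assign each vertex independently and uniformly at random to a machine in {1, …, 𝑘} are assumptions made here for concreteness.

```python
import random
from collections import defaultdict


def random_vertex_partition(vertices, edges, k, seed=None):
    """Hypothetical helper: place each vertex on one of k machines at random,
    then copy every edge to the machine(s) hosting its endpoints."""
    rng = random.Random(seed)

    # home[v] is the machine (numbered 1..k) that hosts vertex v.
    # Assumption: each vertex picks its machine independently and uniformly.
    home = {v: rng.randint(1, k) for v in vertices}

    # machine_edges[i] is the set of edges stored at machine i.
    machine_edges = defaultdict(set)
    for u, v in edges:
        machine_edges[home[u]].add((u, v))
        # If both endpoints share a machine, the set keeps a single copy,
        # so each edge ends up on at most 2 machines.
        machine_edges[home[v]].add((u, v))

    return home, dict(machine_edges)


if __name__ == "__main__":
    # A 6-cycle split across k = 3 machines.
    vertices = range(6)
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
    home, machine_edges = random_vertex_partition(vertices, edges, k=3, seed=0)
    for machine in sorted(machine_edges):
        print(machine, sorted(machine_edges[machine]))
```

Storing an edge at the machines of both endpoints means a machine knows every edge incident to the vertices it hosts, at the cost of keeping up to two copies of each edge.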