Distributed Large-Scale Graph Processing: Data Mining (CS6720)

1) The document discusses algorithms for maximal matching on graphs in the massively parallel computation (MPC) model. 2) It describes a filtering algorithm that finds a maximal matching in the superlinear memory regime. The algorithm runs in phases: in each phase, every regular machine sends a random sample of its edges to a leader machine, which extends its maximal matching and broadcasts it back so that edges touching matched vertices can be removed. 3) The analysis claims that, with high probability, the leader receives no more edges than fit in its memory in each phase and that the number of remaining edges shrinks from phase to phase, which bounds the total number of rounds.

Uploaded by

Venkata Praneeth

26-02-2020

Distributed Large-Scale Graph Processing
Data Mining (CS6720)
John Augustine
Jan 16, 2020

Parallel & Distributed Computing Models

[Diagram: Parallel & Distributed Computing Models. Labels: Shared Memory (PRAM); Programming Models (MapReduce, Think like a vertex); Message Passing (Massively Parallel Computation, 𝑘-machine model)]

Massively Parallel Computation (MPC) Model

• Input data size 𝑁 words; each word = 𝑂(log 𝑁) bits.
• The number of machines 𝑘. (Machines identified by {1, 2, …, 𝑘}.)
• Memory size per machine 𝑆 words.
  • 𝑆 ≥ 𝑁 is uninteresting. Assume: 𝑆 = 𝑂(𝑁^𝜖) for some 𝜖 ∈ (0,1].
  • Also, require 𝑆𝑘 ≥ 𝑁.
• Synchronous communication rounds
  • Local computation within each machine
  • Create messages for other machines. Sum of message sizes ≤ 𝑆.
  • Send… Receive. Ensure no machine requires > 𝑆 memory.
• Goal: Solve problem in as few rounds as possible.
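The model constraints above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `mpc_parameters_ok` and `round_is_valid` and the dictionary layout for messages are assumptions made for illustration, and the receive-side check is simplified in that it ignores data a machine already holds.

```python
def mpc_parameters_ok(N: int, k: int, S: int) -> bool:
    """Total memory must be able to hold the input: S * k >= N.

    (S >= N is allowed here but is the 'uninteresting' case from the slide.)
    """
    return k >= 1 and S >= 1 and S * k >= N


def round_is_valid(outgoing: dict, S: int) -> bool:
    """Check one communication round against the per-machine budget S.

    outgoing[src][dst] is the number of words src sends to dst (hypothetical layout).
    A round is valid if every machine sends at most S words in total and
    no machine receives more than S words in total.
    """
    received = {}
    for src, messages in outgoing.items():
        if sum(messages.values()) > S:  # sender exceeds its budget
            return False
        for dst, words in messages.items():
            received[dst] = received.get(dst, 0) + words
    return all(total <= S for total in received.values())
```

For example, with 𝑆 = 100 a round in which two machines send 60 and 50 words to the same destination is rejected, because the destination would need more than 𝑆 words of memory.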


Initial Data Distribution

• Typically, data is split into words (often as ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs).
• The words could be either randomly distributed or arbitrarily distributed.
• Load balanced so that no machine has much more than other machines.
• Output: usually distributed & depends on problem.
• Questions
  • How to achieve random load balanced distribution?
  • How to remove duplicates?

On Graphs:

• For a graph with 𝑛 vertices and 𝑚 edges, 𝑁 = 𝑂(𝑛 + 𝑚) = 𝑂(𝑚).
• Memory Size 𝑆
  • (Strongly) Superlinear: 𝑆 = 𝑂(𝑛^(1+𝜖))
  • Near Linear: 𝑆 = 𝑂(𝑛)
  • (Strongly) Sublinear: 𝑆 = 𝑛^𝛼 for 𝛼 ∈ (0,1).
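One common way to approach the two questions on data distribution is to route each word to a machine chosen by a hash of its contents: the hash spreads distinct words roughly evenly, and identical words always land on the same machine, where they can be deduplicated locally. The sketch below is an illustration under that assumption, not necessarily the method intended in the lecture; `machine_for`, `distribute`, and the `salt` parameter are hypothetical names, and machines are numbered from 0 here for convenience.

```python
import hashlib
from collections import defaultdict


def machine_for(word, k, salt=""):
    """Map a word to a machine id in range(k) using a hash of its contents."""
    digest = hashlib.sha256((salt + repr(word)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


def distribute(words, k, salt=""):
    """Send every word to its hashed machine; a set per machine drops duplicates."""
    machines = defaultdict(set)
    for word in words:
        machines[machine_for(word, k, salt)].add(word)
    return dict(machines)


# Usage: duplicates of ("a", 1) and ("b", 2) meet on one machine and collapse.
words = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]
placed = distribute(words, k=4)
```

Hashing the whole word rather than only the key avoids overloading one machine when many words share a key; the price is that words with the same key may end up on different machines.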


Broadcasting

• Let 𝑆 = 𝑛^(1+𝜖) for some constant 𝜖 > 0.
• One machine src needs to broadcast 𝑛 words.
• Approach 1: the machine sends 𝑘 messages of size 𝑛. What if 𝑘 > 𝑛^𝜖? (The total message size 𝑘𝑛 would then exceed 𝑆.)
• Approach 2: Build 𝑛^𝜖-ary tree with src as root.
  • Broadcast takes 𝑂(ℎ𝑒𝑖𝑔ℎ𝑡) rounds
  • ℎ𝑒𝑖𝑔ℎ𝑡 = 𝑂(log_{𝑛^𝜖} 𝑘) = 𝑂(1/𝜖) since 𝑁 = 𝑝𝑜𝑙𝑦(𝑆) (𝑂(𝑛^2) for graphs)

Maximal Matching

• A matching in a graph 𝐺 = (𝑉, 𝐸) is a set of edges that don’t share common vertices.
• A maximum matching is a matching of maximum possible cardinality.
• A maximal matching is a matching that ceases to be one when any further edge is added to it.
• A maximal matching has cardinality at least half of a maximum matching. Homework: Prove this.
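The round count of the tree-based broadcast can be sketched as follows. This is an illustrative calculation, assuming 𝑆 = 𝑛^(1+𝜖) so that a machine can forward the 𝑛-word message to about 𝑆/𝑛 = 𝑛^𝜖 machines per round; the names `fanout_for` and `broadcast_rounds` are hypothetical.

```python
def fanout_for(n: int, S: int) -> int:
    """How many n-word messages fit in one machine's send budget of S words."""
    return S // n


def broadcast_rounds(k: int, fanout: int) -> int:
    """Rounds needed to inform k machines along a fanout-ary tree rooted at src.

    Level h of the tree is informed in round h, so the answer is the smallest
    height h with 1 + fanout + ... + fanout**h >= k.
    """
    if k < 1 or fanout < 1:
        raise ValueError("k and fanout must be positive")
    informed, frontier, rounds = 1, 1, 0  # only src knows the message
    while informed < k:
        frontier *= fanout       # every machine on the last level forwards
        informed += frontier
        rounds += 1
    return rounds


# Usage: n = 10**4 words and S = 10**6 = n**1.5 give fanout 100 = n**0.5.
fanout = fanout_for(10**4, 10**6)
rounds = broadcast_rounds(10**6, fanout)  # three levels suffice for 10**6 machines
```

The number of rounds grows like the logarithm of 𝑘 to the base of the fanout, which is the 𝑂(1/𝜖) bound when 𝑘 is polynomial in 𝑛.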


Sequential Algorithm for finding a maximal matching.

1. Let 𝑋 = ∅.
2. For each 𝑒 = (𝑢, 𝑣) ∈ 𝐸,
   1. If neither 𝑢 nor 𝑣 is an endpoint of any edge in 𝑋, then 𝑋 = 𝑋 ∪ {𝑒}.
3. Output 𝑋.

Correctness:
• Invariant: 𝑋 is a matching at all times.
• Suppose 𝑋 is not maximal at the end. Then some edge 𝑒 can be added to it and it will remain a matching. But why was 𝑒 rejected?

Filtering: Idea to find a maximal matching in the superlinear memory regime

Preprocessing.
Let ℓ be a designated “leader” machine (say, machine 0). Assume it doesn’t hold any edge at the beginning. (Why is this OK?) During the course of the algorithm, ℓ maintains a matching (initially empty).
Other machines are called regular machines. 𝐺_𝑟 = (𝑉_𝑟, 𝐸_𝑟) denotes the graph during phase 𝑟. We use 𝑚_𝑟 for the number of edges in 𝐺_𝑟. 𝐺_0 ← 𝐺.

Steps in each phase 𝑟 = 0, 1, … (until 𝐺_𝑟 becomes empty.)
1. Each regular machine marks each local edge independently with probability 𝑝 = […] and sends the marked edges to the leader ℓ.
2. The leader ℓ recomputes the maximal matching with edges it received but without losing any edge from the previous matching. (How?)
3. The leader ℓ broadcasts the matching so computed (≤ 𝑛/2 edges) to all machines.
4. Each regular machine removes edges that have at least one common vertex with the received matching. Isolated vertices are also removed.
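Both procedures above can be simulated on a single computer. The sketch below is an illustration under stated assumptions, not the lecture's implementation: the function names, the `leader_capacity` parameter, and in particular the sampling probability `leader_capacity / (2 * m_r)` are assumptions (the slide's value of 𝑝 is not recoverable from this copy). The leader keeps its earlier matching by greedily extending it with the newly received edges, which is one possible answer to the "How?" in step 2.

```python
import random


def greedy_maximal_matching(edges, matching=None):
    """Sequential algorithm: scan edges, keep an edge if both endpoints are free.

    If `matching` is given, it is kept intact and only extended.
    """
    result = list(matching or [])
    matched = {x for e in result for x in e}
    for u, v in edges:
        if u != v and u not in matched and v not in matched:
            result.append((u, v))
            matched.update((u, v))
    return result


def filtering_maximal_matching(machine_edges, leader_capacity, rng):
    """Simulate the filtering phases; machine_edges is one edge list per regular machine."""
    # Assumption: simple graph, so self-loops are dropped up front.
    machine_edges = [[(u, v) for u, v in es if u != v] for es in machine_edges]
    matching, phases = [], 0
    while any(machine_edges):
        m_r = sum(len(es) for es in machine_edges)
        p = min(1.0, leader_capacity / (2 * m_r))  # ASSUMED sampling probability
        # Step 1: every regular machine marks edges independently and sends them.
        received = [e for es in machine_edges for e in es if rng.random() < p]
        # Step 2: the leader extends its matching without dropping earlier edges.
        matching = greedy_maximal_matching(received, matching)
        # Steps 3 and 4: broadcast, then drop edges touching a matched vertex.
        matched = {x for e in matching for x in e}
        machine_edges = [
            [(u, v) for u, v in es if u not in matched and v not in matched]
            for es in machine_edges
        ]
        phases += 1
    return matching, phases


# Usage: a random graph split over 4 regular machines.
rng = random.Random(0)
edges = [(u, v) for u in range(60) for v in range(u + 1, 60) if rng.random() < 0.2]
matching, phases = filtering_maximal_matching(
    [edges[i::4] for i in range(4)], leader_capacity=100, rng=rng
)
```

Once the remaining edge count drops to at most half of `leader_capacity`, the assumed probability becomes 1, every remaining edge reaches the leader, and the next phase empties the graph.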


Outline of the Analysis

• Correctness is obvious (similar to the sequential algorithm) if bandwidth limitation is not violated.
• Claims:
  • The leader ℓ receives at most 𝑛^(1+𝜖) edges (whp) in step 1. (Homework)
  • If a phase 𝑟 starts with 𝑚_𝑟 edges, then the number of edges at the end of round 𝑟 is […] with high probability.
  • The total number of rounds is log_{𝑛^𝜖} 𝑚 ∈ 𝑂(1/𝜖). Why?

Claim: At most […] whp at end of round 𝑟

• Let 𝐺_𝑟 = (𝑉_𝑟, 𝐸_𝑟) be the leftover graph at the end of round 𝑟 − 1.
• For some pair of vertices 𝑢, 𝑣 ∈ 𝑉_𝑟, can 𝑒 = (𝑢, 𝑣) have been sent to the leader? No! (Why? If sent, at least one of 𝑢 or 𝑣 would have been matched, and therefore discarded.)
• Consider any set of vertices 𝐽 with > […] edges with both end points in 𝐽.
• What is the chance that 𝑉_𝑟 = 𝐽?
  Pr[all induced edges not sent] ≤ (1 − 𝑝)^[…] ≤ 𝑒^(−[…]).
  There are at most 2^𝑛 subsets of 𝑉, so by union bound, the result holds.
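The union-bound argument can be written out as a worked calculation. The threshold used below, 2𝑛/𝑝 edges, is an assumption borrowed from the usual presentation of the filtering technique; the slide's own threshold and exponents are missing from this copy, so this is a sketch of how such a bound typically goes rather than a reconstruction of the slide.

```latex
% Assumption (not recoverable from the source): the edge threshold is 2n/p.
% Fix a vertex set J whose induced subgraph has more than 2n/p edges.
% Edges are marked independently with probability p, and 1 - p <= e^{-p}.
\begin{align*}
\Pr[\text{no edge induced by } J \text{ is sent}]
  &= (1-p)^{|E_r[J]|} \;\le\; (1-p)^{2n/p} \;\le\; e^{-2n}, \\
\Pr[\exists\, J \subseteq V:\ |E_r[J]| > 2n/p \text{ and no induced edge is sent}]
  &\le 2^{n} \cdot e^{-2n} \;\le\; e^{-n}.
\end{align*}
```

Since the surviving vertex set can only be a set 𝐽 none of whose induced edges were sent, this would show that at most 2𝑛/𝑝 edges survive with high probability. Under the further assumption 𝑝 = 𝑛^(1+𝜖)/(2𝑚_𝑟), that is 4𝑚_𝑟/𝑛^𝜖 edges, a shrinkage by a factor of about 𝑛^𝜖 per phase.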


The 𝑘-machine Model

• Input data size 𝑁 words; each word = 𝑂(log 𝑁) bits.
• The number of machines 𝑘. (Machines identified by {1, 2, …, 𝑘}.)
• Memory size is unbounded (but usually not abused).
• Typically used in processing large graphs.
• Synchronous communication rounds
  • Local computation within each machine
  • Each machine creates one message of 𝑂(log 𝑛) bits for every other machine.
  • Send… Receive.
• Goal: Solve problem in as few rounds as possible.

Data Distribution: The Random Vertex Partitioning (RVP)

• Typically, data is split into words (often as ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs).
• The words could be either randomly distributed or arbitrarily distributed.
• RVP: Most common approach is to randomly partition vertices into 𝑘 parts and place each part into one of the machines. Then, a copy of each edge is placed in the (≤ 2) machines that contain either of its end points.
• Other partitioning of graph data is also conceivable (e.g., random edge partitioning, arbitrary edge partitioning, etc.).
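Random vertex partitioning as described above can be sketched in a few lines. This is an illustration that assumes each vertex independently picks a machine uniformly at random (one natural reading of "randomly partition"); the function name `random_vertex_partition` and the returned data layout are hypothetical, and machines are numbered from 0.

```python
import random
from collections import defaultdict


def random_vertex_partition(vertices, edges, k, rng):
    """Assign each vertex to a uniformly random machine and copy edges accordingly.

    Returns (home, machine_vertices, machine_edges), where home[v] is the machine
    of vertex v and each edge is stored on the machines of both of its endpoints.
    """
    home = {v: rng.randrange(k) for v in vertices}
    machine_vertices = defaultdict(set)
    machine_edges = defaultdict(set)
    for v, i in home.items():
        machine_vertices[i].add(v)
    for u, v in edges:
        machine_edges[home[u]].add((u, v))
        machine_edges[home[v]].add((u, v))  # same set if both endpoints share a machine
    return home, machine_vertices, machine_edges


# Usage: a path on 6 vertices spread over 3 machines.
vertices = range(6)
edges = [(i, i + 1) for i in range(5)]
home, mv, me = random_vertex_partition(vertices, edges, k=3, rng=random.Random(1))
```

An edge whose endpoints land on the same machine is stored once; otherwise it is stored twice, which matches the "(≤ 2) machines" remark.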


RVP is Load Balanced

Claim: Under RVP of a graph 𝐺 = (𝑉, 𝐸) with 𝑛 vertices and 𝑚 edges, whp, every machine has
1. at most Õ(𝑛/𝑘) vertices and
2. at most Õ(𝑚/𝑘 + Δ) edges,
where Δ is the maximum degree in 𝐺 (Õ hides polylogarithmic factors).
Proof of part 1 is easy. Just use Chernoff bound.
Proof of part 2 is more complicated and therefore skipped.
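The Chernoff argument for part 1 can be sketched as follows. The calculation assumes that each vertex is placed independently and uniformly at random, that 𝑘 ≤ 𝑛, and that 𝑛/𝑘 ≥ 6 ln 𝑛; these conditions and the constants are choices made for this illustration, not taken from the slides.

```latex
% X_i = number of vertices placed on machine i, so X_i ~ Bin(n, 1/k) and
% \mu = E[X_i] = n/k.  Multiplicative Chernoff bound with \delta = 1:
%   Pr[X >= (1+\delta)\mu] <= exp(-\delta^2 \mu / 3)   for 0 < \delta <= 1.
\begin{align*}
\Pr\!\left[X_i \ge \tfrac{2n}{k}\right]
  &\le \exp\!\left(-\tfrac{n}{3k}\right)
   \le \exp(-2\ln n) = n^{-2}
   && \text{(using } n/k \ge 6\ln n\text{)}, \\
\Pr\!\left[\exists\, i:\ X_i \ge \tfrac{2n}{k}\right]
  &\le k \cdot n^{-2} \le \tfrac{1}{n}
   && \text{(union bound over } k \le n \text{ machines)}.
\end{align*}
```

So under these assumptions every machine holds fewer than 2𝑛/𝑘 vertices with probability at least 1 − 1/𝑛. When 𝑛/𝑘 is smaller than about log 𝑛, the same style of argument gives a bound with an extra logarithmic term.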
