Unit 5


Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
 Classic model of algorithms
▪ You get to see the entire input, then compute
some function of it
▪ In this context, “offline algorithm”

 Online Algorithms
▪ You get to see the input one piece at a time, and
need to make irrevocable decisions along the way
▪ Similar to the data stream model

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

[Figure: bipartite graph with boys 1, 2, 3, 4 on the left and girls a, b, c, d on the right]

Nodes: Boys and Girls; Edges: Preferences

Goal: Match boys to girls so that the maximum
number of preferences is satisfied

[Figure: the bipartite graph on boys 1-4 and girls a-d]

M = {(1,a),(2,b),(3,d)} is a matching
Cardinality of matching = |M| = 3

[Figure: the bipartite graph on boys 1-4 and girls a-d]

M = {(1,c),(2,b),(3,d),(4,a)} is a perfect matching
Perfect matching: all vertices of the graph are matched
Maximum matching: a matching that contains the largest possible number of matches
 Problem: Find a maximum matching for a
given bipartite graph
▪ A perfect one if it exists

 There is a polynomial-time offline algorithm
based on augmenting paths (Hopcroft & Karp 1973,
see http://en.wikipedia.org/wiki/Hopcroft-Karp_algorithm)

 But what if we do not know the entire
graph upfront?
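As a point of reference for the offline case, the augmenting-path idea can be sketched in a few lines. This is the simple one-path-at-a-time variant (often called Kuhn's algorithm), not the faster Hopcroft-Karp algorithm cited above. The preference edges in `example_prefs` are assumed for illustration; the slides only show the resulting matchings.

```python
from typing import Dict, List, Set


def max_bipartite_matching(prefs: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a maximum matching as {girl: boy} using simple augmenting paths."""
    match_of_girl: Dict[str, str] = {}

    def try_augment(boy: str, visited: Set[str]) -> bool:
        # Try to find a partner for `boy`, re-matching earlier boys if needed.
        for girl in prefs.get(boy, []):
            if girl in visited:
                continue
            visited.add(girl)
            # The girl is free, or her current partner can move elsewhere.
            if girl not in match_of_girl or try_augment(match_of_girl[girl], visited):
                match_of_girl[girl] = boy
                return True
        return False

    for boy in prefs:
        try_augment(boy, set())
    return match_of_girl


# Hypothetical preference edges (an assumption, not taken from the slides).
example_prefs = {"1": ["a", "c"], "2": ["b"], "3": ["b", "d"], "4": ["a"]}
matching = max_bipartite_matching(example_prefs)
pairs = sorted((boy, girl) for girl, boy in matching.items())
print(pairs)
```

On these assumed edges the sketch recovers a perfect matching, because boy 4 can only be served by moving boy 1 from a to c.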

 Initially, we are given the set of boys
 In each round, one girl’s choices are revealed
▪ That is, the girl’s edges are revealed
 At that time, we have to decide to either:
▪ Pair the girl with a boy
▪ Not pair the girl with any boy

 Example of application:
Assigning tasks to servers

[Figure: online matching example on boys 1-4 and girls a-d; the matches made as the girls arrive are (1,a), (2,b), (3,d)]

 Greedy algorithm for the online graph
matching problem:
▪ Pair the new girl with any eligible boy
▪ If there is none, do not pair the girl
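The greedy rule above can be sketched as follows. The arrival order and the edge lists in `arrivals` are assumptions chosen for illustration, and "any eligible boy" is resolved here by taking the first eligible boy in each girl's list.

```python
from typing import Dict, Iterable, List, Tuple


def greedy_online_matching(arrivals: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Match each arriving girl to the first boy on her list who is still free."""
    matched_boys = set()
    matching: Dict[str, str] = {}
    for girl, boys in arrivals:
        for boy in boys:
            if boy not in matched_boys:
                # The decision is irrevocable: the boy is never released again.
                matched_boys.add(boy)
                matching[girl] = boy
                break
    return matching


# Hypothetical stream: girls arrive in the order a, b, c, d with these edges.
arrivals = [("a", ["1", "4"]), ("b", ["2", "3"]), ("c", ["1"]), ("d", ["3"])]
online = greedy_online_matching(arrivals)
print(online)
```

With this assumed stream, girl c stays unmatched because boy 1 was already given to girl a.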

 How good is the algorithm?


 For input I, suppose greedy produces
matching $M_{greedy}$ while an optimal
matching is $M_{opt}$

Competitive ratio =
$\min_{\text{all possible inputs } I} \left( |M_{greedy}| / |M_{opt}| \right)$
(what is greedy’s worst performance over all possible inputs I)

 Consider a case: $M_{greedy} \neq M_{opt}$
 Consider the set G of girls
matched in $M_{opt}$ but not in $M_{greedy}$
 Let B be the set of boys adjacent to girls in G. Then every boy in B
is already matched in $M_{greedy}$:
▪ If there were an unmatched (by $M_{greedy}$) boy adjacent to an unmatched
girl, then greedy would have matched them
 Since the boys in B are already matched in $M_{greedy}$:
(1) $|M_{greedy}| \geq |B|$

[Figure: $M_{opt}$ drawn on boys 1-4 and girls a-d, with the sets B = { } and G = { } left to fill in]
 Summary so far:
▪ Girls G matched in $M_{opt}$ but not in $M_{greedy}$
▪ (1) $|M_{greedy}| \geq |B|$
 There are at least |G| such boys
($|G| \leq |B|$), otherwise the optimal
algorithm couldn’t have matched all girls in G
▪ So: $|G| \leq |B| \leq |M_{greedy}|$
 By definition of G also: $|M_{opt}| \leq |M_{greedy}| + |G|$
▪ Worst case is when $|G| = |B| = |M_{greedy}|$
 $|M_{opt}| \leq 2|M_{greedy}|$, so $|M_{greedy}|/|M_{opt}| \geq 1/2$

[Figure: $M_{opt}$ drawn on boys 1-4 and girls a-d, with the sets B = { } and G = { } left to fill in]
[Figure: example on boys 1-4 and girls a-d in which greedy makes only the matches (1,a) and (2,b)]
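The 1/2 bound can be checked on a small instance. The sketch below uses a hypothetical edge set (an assumption; the figure's actual edges are not recoverable here) on which greedy makes the matches (1,a) and (2,b) while a perfect matching exists. The optimum is computed with simple augmenting paths.

```python
from typing import Dict, List, Set


def greedy_size(arrivals: Dict[str, List[str]]) -> int:
    """Size of the greedy matching; dict order is the arrival order."""
    used: Set[str] = set()
    size = 0
    for _girl, boys in arrivals.items():
        free = [b for b in boys if b not in used]
        if free:
            used.add(free[0])  # take the first eligible boy
            size += 1
    return size


def optimal_size(arrivals: Dict[str, List[str]]) -> int:
    """Size of a maximum matching, found offline with augmenting paths."""
    match_of_boy: Dict[str, str] = {}

    def augment(girl: str, seen: Set[str]) -> bool:
        for boy in arrivals[girl]:
            if boy in seen:
                continue
            seen.add(boy)
            if boy not in match_of_boy or augment(match_of_boy[boy], seen):
                match_of_boy[boy] = girl
                return True
        return False

    return sum(augment(girl, set()) for girl in arrivals)


# Hypothetical worst-case style input: a and b grab the boys that c and d need.
worst_case = {"a": ["1", "3"], "b": ["2", "4"], "c": ["1"], "d": ["2"]}
g, opt = greedy_size(worst_case), optimal_size(worst_case)
print(g, opt, g / opt)
```

Here the ratio is exactly 1/2, matching the lower bound derived above.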

 Banner ads (1995-2001)
▪ Initial form of web advertising
▪ Popular websites charged $X for every 1,000
“impressions” of the ad
▪ Called the “CPM” rate: cost per mille, i.e., cost per thousand impressions
(“mille” means thousand in Latin)
▪ Modeled similar to TV, magazine ads
▪ From untargeted to demographically targeted
▪ Low click-through rates
▪ Low ROI for advertisers
 Performance-based advertising: introduced by Overture around 2000
▪ Advertisers bid on search keywords
▪ When someone searches for that keyword, the
highest bidder’s ad is shown
▪ Advertiser is charged only if the ad is clicked on

 Similar model adopted by Google with some
changes around 2002
▪ Called Adwords
 Performance-based advertising works!
▪ Multi-billion-dollar industry

 Interesting problem:
What ads to show for a given query?
▪ (Today’s lecture)

 If I am an advertiser, which search terms
should I bid on and how much should I bid?
▪ (Not focus of today’s lecture)

 Given:
▪ 1. A set of bids by advertisers for search queries
▪ 2. A click-through rate for each advertiser-query pair
▪ 3. A budget for each advertiser (say for 1 month)
▪ 4. A limit on the number of ads to be displayed with
each search query
 Respond to each search query with a set of
advertisers such that:
▪ 1. The size of the set is no larger than the limit on the
number of ads per query
▪ 2. Each advertiser has bid on the search query
▪ 3. Each advertiser has enough budget left to pay for
the ad if it is clicked upon
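The three constraints above amount to a feasibility filter. The following is a minimal sketch under stated assumptions: the `Advertiser` record, the `eligible_ads` helper, and the sample data are all hypothetical, and ranking the feasible advertisers by raw bid is only one possible choice (the constraints themselves do not fix an order).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Advertiser:
    name: str
    bids: Dict[str, float]  # query -> bid amount
    budget_left: float


def eligible_ads(query: str, advertisers: List[Advertiser], max_ads: int) -> List[str]:
    """Return at most max_ads advertisers who bid on query and can still pay."""
    candidates = [
        a for a in advertisers
        if query in a.bids and a.budget_left >= a.bids[query]
    ]
    # Assumed ordering for illustration: highest bid first.
    candidates.sort(key=lambda a: a.bids[query], reverse=True)
    return [a.name for a in candidates[:max_ads]]


# Hypothetical advertisers and budgets.
pool = [
    Advertiser("A", {"shoes": 1.00}, budget_left=5.0),
    Advertiser("B", {"shoes": 0.75, "boots": 0.50}, budget_left=0.5),
    Advertiser("C", {"boots": 0.50}, budget_left=2.0),
]
print(eligible_ads("shoes", pool, max_ads=2))
print(eligible_ads("boots", pool, max_ads=1))
```

Advertiser B is excluded for "shoes" because its remaining budget cannot cover its own bid.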
 A stream of queries arrives at the search
engine: q1, q2, …
 Several advertisers bid on each query
 When query qi arrives, search engine must
pick a subset of advertisers whose ads are
shown
 Goal: Maximize search engine’s revenues
▪ Simple solution: Instead of raw bids, use the
“expected revenue per click” (i.e., Bid*CTR)
 Clearly we need an online algorithm!
Advertiser   Bid     CTR (click-through rate)   Bid * CTR (expected revenue)
A            $1.00   1%                         1 cent
B            $0.75   2%                         1.5 cents
C            $0.50   2.5%                       1.25 cents

Advertiser   Bid     CTR     Bid * CTR
B            $0.75   2%      1.5 cents
C            $0.50   2.5%    1.25 cents
A            $1.00   1%      1 cent
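The ranking by expected revenue per impression can be reproduced directly from the bids and click-through rates in the table. This is a small sketch; the tuple layout and variable names are illustrative only. Note that C's product is 0.50 × 2.5% = 1.25 cents.

```python
# (advertiser, bid in dollars, click-through rate)
ads = [("A", 1.00, 0.01), ("B", 0.75, 0.02), ("C", 0.50, 0.025)]


def expected_revenue_cents(bid: float, ctr: float) -> float:
    """Expected revenue per impression, in cents: Bid * CTR."""
    return bid * ctr * 100


ranked = sorted(ads, key=lambda ad: expected_revenue_cents(ad[1], ad[2]), reverse=True)
order = [name for name, _bid, _ctr in ranked]
print(order)
for name, bid, ctr in ranked:
    print(name, round(expected_revenue_cents(bid, ctr), 3))
```

Sorting by raw bid would put A first, while sorting by Bid * CTR puts A last.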

 Two complications:
▪ Budget
▪ CTR of an ad is unknown

 Each advertiser has a limited budget
▪ Search engine guarantees that the advertiser
will not be charged more than their daily budget

 CTR: Each ad has a different likelihood of
being clicked
▪ Advertiser 1 bids $2, click probability = 0.1
▪ Advertiser 2 bids $1, click probability = 0.5
▪ Clickthrough rate (CTR) is measured historically
▪ Very hard problem: Exploration vs. exploitation
Exploit: Should we keep showing an ad for which we have
good estimates of click-through rate?
or
Explore: Should we show a brand new ad to get a better
sense of its click-through rate?

 Our setting: Simplified environment
▪ There is 1 ad shown for each query
▪ All advertisers have the same budget B
▪ All ads are equally likely to be clicked
▪ Value of each ad is the same (=1)

 Simplest algorithm is greedy:
▪ For a query pick any advertiser who has
bid 1 for that query
▪ Competitive ratio of greedy is 1/2

 Two advertisers A and B
▪ A bids on query x, B bids on x and y
▪ Both have budgets of $4
 Query stream: x x x x y y y y
▪ Worst case greedy choice: B B B B _ _ _ _
▪ Optimal: A A A A B B B B
▪ Competitive ratio = ½
 This is the worst case!
▪ Note: Greedy algorithm is deterministic: it always
resolves ties in the same way
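The two-advertiser example can be replayed with a small simulation. In this sketch, the helper `greedy_adwords` and its `order` parameter (the fixed tie-breaking preference) are assumptions for illustration. Each query is worth 1 and each advertiser has a budget of 4, as in the simplified setting.

```python
from typing import Dict, List, Optional, Set


def greedy_adwords(queries: str, bids: Dict[str, Set[str]], budget: int,
                   order: List[str]) -> List[Optional[str]]:
    """Give each query to the first advertiser in `order` that bids and has budget."""
    left = {a: budget for a in order}
    out: List[Optional[str]] = []
    for q in queries:
        choice = next((a for a in order if q in bids[a] and left[a] > 0), None)
        if choice is not None:
            left[choice] -= 1
        out.append(choice)
    return out


bids = {"A": {"x"}, "B": {"x", "y"}}
worst = greedy_adwords("xxxxyyyy", bids, 4, order=["B", "A"])  # ties go to B
best = greedy_adwords("xxxxyyyy", bids, 4, order=["A", "B"])   # ties go to A
worst_revenue = sum(c is not None for c in worst)
best_revenue = sum(c is not None for c in best)
print(worst, worst_revenue)
print(best, best_revenue)
```

With ties resolved in favor of B, greedy spends B's whole budget on x and has nobody left for y, which is the B B B B _ _ _ _ outcome above.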

 BALANCE Algorithm by Mehta, Saberi,
Vazirani, and Vazirani
▪ For each query, pick the advertiser with the
largest unspent budget among those who bid on the query
▪ Break ties arbitrarily (but in a deterministic way)

 Two advertisers A and B
▪ A bids on query x, B bids on x and y
▪ Both have budgets of $4

 Query stream: x x x x y y y y
 BALANCE choice: A B A B B B _ _
▪ Optimal: A A A A B B B B
 In general: For BALANCE on 2 advertisers,
competitive ratio = ¾
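A minimal sketch of BALANCE on this example follows. It assumes unit bids, and it breaks ties by advertiser name (one possible deterministic rule; the slides only require that ties be broken deterministically). The function name `balance` is illustrative.

```python
from typing import Dict, List, Optional, Set


def balance(queries: str, bids: Dict[str, Set[str]],
            budgets: Dict[str, int]) -> List[Optional[str]]:
    """Assign each query to the bidding advertiser with the largest unspent budget."""
    left = dict(budgets)
    out: List[Optional[str]] = []
    for q in queries:
        # Candidates in name order, so max() breaks ties by name.
        candidates = [a for a in sorted(left) if q in bids[a] and left[a] > 0]
        if candidates:
            choice: Optional[str] = max(candidates, key=lambda a: left[a])
            left[choice] -= 1
        else:
            choice = None
        out.append(choice)
    return out


bids = {"A": {"x"}, "B": {"x", "y"}}
choices = balance("xxxxyyyy", bids, {"A": 4, "B": 4})
revenue = sum(c is not None for c in choices)
print(choices, revenue)
```

BALANCE serves 6 of the 8 queries here, compared with 8 for the optimal offline assignment.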
 Consider simple case (w.l.o.g.):
▪ 2 advertisers, A1 and A2, each with budget B (≥ 1)
▪ Optimal solution exhausts both advertisers’ budgets
 BALANCE must exhaust at least one
advertiser’s budget:
▪ If not, we can allocate more queries
▪ Whenever BALANCE makes a mistake (both advertisers bid
on the query), advertiser’s unspent budget only decreases
▪ Since optimal exhausts both budgets, one will for sure get
exhausted
▪ Assume BALANCE exhausts A2’s budget,
but allocates x queries fewer than the optimal
▪ Revenue: BAL = 2B - x
[Figure: budget bars for A1 and A2, each of height B. In the optimal solution both bars are full: one holds the queries allocated to A1, the other the queries allocated to A2. In the BALANCE allocation, A2’s budget is exhausted, A1 receives y queries, and x of A1’s budget is not used.]

Optimal revenue = $2B$
Assume BALANCE gives revenue = $2B - x = B + y$
Unassigned queries should be assigned to A2
(if we could assign to A1 we would, since we still have the budget)
Goal: Show we have $y \geq x$
Case 1) ≤ ½ of A1’s queries got assigned to A2
then $y \geq B/2$
Case 2) > ½ of A1’s queries got assigned to A2
then $x \leq B/2$ and $x + y = B$
BALANCE revenue is minimum for $x = y = B/2$
Minimum BALANCE revenue = $3B/2$
Competitive ratio = 3/4
 In the general case, the worst-case competitive ratio
of BALANCE is 1 - 1/e ≈ 0.63
▪ Interestingly, no online algorithm has a better
competitive ratio!

 Let’s see the worst-case example that gives
this ratio

 N advertisers: A1, A2, … AN
▪ Each with budget B > N
 Queries:
▪ N∙B queries appear in N rounds of B queries each
 Bidding:
▪ Round 1 queries: bidders A1, A2, …, AN
▪ Round 2 queries: bidders A2, A3, …, AN
▪ Round i queries: bidders Ai, …, AN
 Optimum allocation:
Allocate round i queries to Ai
▪ Optimum revenue N∙B
[Figure: advertisers A1, A2, A3, …, AN-1, AN on the horizontal axis, with allocation layers of height B/N, B/(N-1), B/(N-2), … stacked round by round]

BALANCE spreads the round 1 queries evenly over all N advertisers (B/N each).

After k rounds, the sum of allocations to each of the advertisers Ak, …, AN is
$S_k = S_{k+1} = \dots = S_N = \sum_{i=1}^{k} \frac{B}{N-(i-1)}$

If we find the smallest k such that $S_k \geq B$, then after k rounds
we cannot allocate any queries to any advertiser
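The adversarial input can also be simulated directly. The sketch below assumes unit bids and a lowest-index tie-break, and the sizes N = 50 and B = 100 are arbitrary choices that satisfy B > N. For finite N and integer budgets the measured ratio only approximates 1 - 1/e.

```python
import math


def balance_worst_case_ratio(n: int, budget: int) -> float:
    """Run BALANCE on the N-round adversarial input and return revenue / optimum."""
    left = [budget] * n  # left[a] is the unspent budget of advertiser A_(a+1)
    revenue = 0
    for rnd in range(n):  # round rnd+1: only A_(rnd+1), ..., A_n bid
        for _ in range(budget):
            best = max(range(rnd, n), key=lambda a: left[a])
            if left[best] == 0:
                break  # every bidder of this round is out of budget
            left[best] -= 1
            revenue += 1
    return revenue / (n * budget)  # the optimum serves all N*B queries


ratio = balance_worst_case_ratio(50, 100)
print(round(ratio, 3), round(1 - 1 / math.e, 3))
```

The revenue lost by BALANCE is exactly the budget left stranded with advertisers whose rounds have already passed.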
[Figure: the terms B/1, B/2, B/3, …, B/(N-(k-1)), …, B/(N-1), B/N in a row, with the partial sums S1, S2, …, Sk marked and Sk = B. After dividing by B: the terms 1/1, 1/2, 1/3, …, 1/(N-(k-1)), …, 1/(N-1), 1/N, with Sk = 1.]

 Fact: $H_n = \sum_{i=1}^{n} 1/i \approx \ln n$ for large n
▪ Result due to Euler

[Figure: the terms 1/1, 1/2, 1/3, …, 1/(N-(k-1)), …, 1/(N-1), 1/N. All N terms sum to about ln(N). The last k terms sum to Sk = 1. The first N-k terms sum to about ln(N-k), but also to ln(N) - 1.]

 $S_k = 1$ implies: $H_{N-k} \approx \ln(N) - 1 = \ln(N/e)$
 We also know: $H_{N-k} \approx \ln(N-k)$
 So: $N - k = N/e$
 Then: $k = N(1 - 1/e)$
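The approximation k ≈ N(1 - 1/e) can be checked numerically by summing the terms 1/N, 1/(N-1), … until the total reaches 1. This is only a sanity check of the harmonic-sum argument; the helper name and the choice N = 100000 are illustrative.

```python
import math


def rounds_until_exhausted(n: int) -> int:
    """Smallest k such that 1/n + 1/(n-1) + ... + 1/(n-(k-1)) >= 1."""
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / (n - (k - 1))
        if total >= 1.0:
            return k
    return n


n = 100_000
k = rounds_until_exhausted(n)
print(k / n, 1 - 1 / math.e)
```

For large N the fraction k/N settles near 1 - 1/e, which is where the competitive ratio comes from.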
 So after the first k=N(1-1/e) rounds, we
cannot allocate a query to any advertiser

 Revenue = B∙N (1-1/e)

 Competitive ratio = 1-1/e


 Arbitrary bids and arbitrary budgets!
 Consider we have 1 query q, advertiser i
▪ Bid = xi
▪ Budget = bi
 In a general setting BALANCE can be terrible
▪ Consider two advertisers A1 and A2
▪ A1: x1 = 1, b1 = 110
▪ A2: x2 = 10, b2 = 100
▪ Consider we see 10 instances of q
▪ BALANCE always selects A1 and earns 10
▪ Optimal earns 100
 Arbitrary bids: consider query q, bidder i
▪ Bid = $x_i$
▪ Budget = $b_i$
▪ Amount spent so far = $m_i$
▪ Fraction of budget left over $f_i = 1 - m_i/b_i$
▪ Define $\psi_i(q) = x_i (1 - e^{-f_i})$

 Allocate query q to bidder i with largest
value of $\psi_i(q)$
 Same competitive ratio (1-1/e)
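The generalized rule can be sketched as follows, using the two-advertiser example in which plain BALANCE earns only 10. The function name, the `(bid, budget)` tuple layout, and the rule of skipping a bidder who cannot pay its full bid are assumptions for illustration.

```python
import math
from typing import Dict, Optional, Tuple


def generalized_balance(num_queries: int,
                        bidders: Dict[str, Tuple[float, float]]) -> float:
    """Serve identical queries; bidders maps name -> (bid x_i, budget b_i)."""
    spent = {name: 0.0 for name in bidders}
    revenue = 0.0
    for _ in range(num_queries):
        best: Optional[str] = None
        best_psi = 0.0
        for name, (bid, budget) in bidders.items():
            if spent[name] + bid > budget:
                continue  # assumed rule: skip bidders who cannot pay in full
            f = 1.0 - spent[name] / budget       # fraction of budget left
            psi = bid * (1.0 - math.exp(-f))     # psi_i(q) = x_i (1 - e^{-f_i})
            if psi > best_psi:
                best, best_psi = name, psi
        if best is None:
            break
        spent[best] += bidders[best][0]
        revenue += bidders[best][0]
    return revenue


# A1: bid 1, budget 110; A2: bid 10, budget 100; 10 instances of the query.
revenue = generalized_balance(10, {"A1": (1.0, 110.0), "A2": (10.0, 100.0)})
print(revenue)
```

Because the bid scales $\psi_i(q)$, A2 keeps winning even when only a tenth of its budget remains, so this sketch earns the optimal 100 on the example.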
