BDA Unit II Lecture 2
Sampling Streams
Bloom Filters
Counting Distinct Items
Computing Moments
Example: Fixed Sample Size
Suppose we start our search-query sample at
10%, but we want to limit the size.
Hash each element to (say) 100 buckets, 0, 1, …, 99.
Take for the sample those elements hashing to
buckets 0 through 9.
If the sample gets too big, get rid of bucket 9,
that is, discard every sampled element that hashed to it.
If it is still too big, get rid of bucket 8, and so on.
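The shrinking rule above can be sketched in Python. This is a minimal illustration, not a reference implementation: the class name FixedSizeSample, the helper bucket_of, the max_size parameter, and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 100  # buckets 0, 1, ..., 99, as in the example


def bucket_of(element: str) -> int:
    """Map an element to a bucket with a stable hash (SHA-256 is an assumption)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


class FixedSizeSample:
    """Keep elements hashing to buckets 0..threshold-1; shrink when too big."""

    def __init__(self, max_size: int, initial_buckets: int = 10):
        self.max_size = max_size
        self.threshold = initial_buckets  # 10 of 100 buckets = a 10% sample
        self.sample = []  # list of (bucket, element) pairs

    def add(self, element: str) -> None:
        b = bucket_of(element)
        if b >= self.threshold:
            return  # element falls outside the currently selected buckets
        self.sample.append((b, element))
        # Too big: drop the highest selected bucket (9, then 8, and so on).
        while len(self.sample) > self.max_size and self.threshold > 0:
            self.threshold -= 1
            self.sample = [(bb, e) for bb, e in self.sample if bb < self.threshold]
```

Because a dropped bucket is never accepted again, the sample always consists of every element seen so far that hashes to the surviving buckets.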
Sampling Key-Value Pairs
This technique generalizes to any form of data
that we can see as tuples (K, V), where K is the
“key” and V is a “value.”
Distinction: We want our sample to be based on
picking some set of keys only, not pairs.
In the search-query example, the data was “all key.”
Hash keys to some number of buckets.
Sample consists of all key-value pairs with a key
that goes into one of the selected buckets.
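As a rough sketch of sampling by key, the Python below keeps a pair only when its key hashes into a selected bucket. The function names key_bucket and sample_by_key, the parameters num_buckets and selected, and the use of SHA-256 are hypothetical choices for illustration.

```python
import hashlib


def key_bucket(key: str, num_buckets: int) -> int:
    """Hash only the key, never the value, to a bucket in 0..num_buckets-1."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_by_key(pairs, num_buckets: int, selected: int):
    """Return all (key, value) pairs whose key lands in buckets 0..selected-1."""
    return [(k, v) for k, v in pairs if key_bucket(k, num_buckets) < selected]
```

Since the decision depends on the key alone, a key is either in the sample with all of its pairs or absent entirely, which is the distinction drawn above.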
Example: Salary Ranges
Data = tuples of the form (EmpID, Dept, Salary).
Query: What is the average range of salaries
within departments?
Key = Dept.
Value = (EmpID, Salary).
The sample picks some departments and has the salaries
of all employees in each picked department, including
its min and max salaries.
The result will be an unbiased estimate of the
average salary range (max minus min per department).
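The salary-range query can be sketched as follows, assuming the data arrives as (EmpID, Dept, Salary) tuples. The names dept_bucket and average_salary_range, the default bucket counts, and the SHA-256 hash are assumptions for this example only.

```python
import hashlib


def dept_bucket(dept: str, num_buckets: int) -> int:
    """Hash the key (Dept) to a bucket; EmpID and Salary play no part."""
    digest = hashlib.sha256(dept.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def average_salary_range(tuples, num_buckets: int = 10, selected: int = 3):
    """Estimate the average (max - min) salary over the sampled departments."""
    lo_hi = {}  # dept -> (min salary, max salary) seen so far
    for _emp_id, dept, salary in tuples:
        if dept_bucket(dept, num_buckets) >= selected:
            continue  # department not in the sample: skip all its employees
        lo, hi = lo_hi.get(dept, (salary, salary))
        lo_hi[dept] = (min(lo, salary), max(hi, salary))
    if not lo_hi:
        return None  # no department was sampled
    return sum(hi - lo for lo, hi in lo_hi.values()) / len(lo_hi)
```

Every employee of a sampled department is seen, so each sampled department's min and max, and hence its range, are exact; only the choice of departments is sampled.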
References
Jure Leskovec, Anand Rajaraman, Jeff Ullman,
Mining of Massive Datasets, Cambridge
University Press, Second Edition, 2014.
http://mmds.org/