Mining Data Streams (Part 1) : Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman

This document discusses summarizing and sampling data streams. It begins by noting that data mining often involves data streams where the entire dataset is not known in advance. It then discusses two problems: (1) sampling a fixed proportion of elements from a stream, and (2) maintaining a random sample of fixed size from a potentially infinite stream. For problem 1, it describes hashing tuple keys to sample proportions. For problem 2, it notes the challenge of maintaining a fixed-size sample as the stream length is unknown, and describes using replacement to preserve sample randomness over time. The document provides examples and applications of data stream mining.

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining Data Streams (Part 1)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
New Topic: Infinite Data

High dim. data             | Graph data          | Infinite data          | Machine learning | Apps
Locality sensitive hashing | PageRank, SimRank   | Filtering data streams | SVM              | Recommender systems
Clustering                 | Community Detection | Queries on streams     | Decision Trees   | Association Rules
Dimensionality reduction   | Spam Detection      | Web advertising        | Perceptron, kNN  | Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Data Streams
 In many data mining situations, we do not know the entire data set in advance
 Stream management is important when the input rate is controlled externally:
 Google queries
 Twitter or Facebook status updates
 We can think of the data as infinite and non-stationary (the distribution changes over time)
The Stream Model
 Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
 We call elements of the stream tuples
 The system cannot store the entire stream accessibly
 Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?
Side note: SGD is a Streaming Alg.
 Stochastic Gradient Descent (SGD) is an
example of a stream algorithm
 In Machine Learning we call this: Online Learning
 Allows for modeling problems where we have
a continuous stream of data
 We want an algorithm to learn from it and
slowly adapt to the changes in data
 Idea: Do slow updates to the model
 SGD (SVM, Perceptron) makes small updates
 So: First train the classifier on training data.
 Then: For every example from the stream, we slightly
update the model (using small learning rate)
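The online-learning idea above can be sketched as follows. This is a minimal illustration, assuming a linear classifier with a hinge-loss (SVM-style) update and labels in {-1, +1}; the names `sgd_update` and `learn_from_stream`, the learning rate, and the toy stream are assumptions made for the example, not part of the slides.

```python
from typing import Iterable, List, Tuple

def sgd_update(w: List[float], x: List[float], y: int, lr: float) -> List[float]:
    """One small hinge-loss SGD step for a linear model (y is -1 or +1)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1:  # misclassified or inside the margin: nudge the weights
        return [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w  # confidently correct: leave the model unchanged

def learn_from_stream(w: List[float],
                      stream: Iterable[Tuple[List[float], int]],
                      lr: float = 0.05) -> List[float]:
    """Update the model once per arriving example, never storing the stream."""
    for x, y in stream:
        w = sgd_update(w, x, y, lr)
    return w

# Toy stream (hypothetical data): the label follows the sign of the first feature.
stream = [([1.0, 0.5], 1), ([-1.0, 0.2], -1)] * 200
w = learn_from_stream([0.0, 0.0], stream)
print(w)
```

Each example is seen once and then discarded, which is what makes the update usable on a stream; a smaller learning rate makes the model adapt more slowly to changes in the data.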
General Stream Processing Model
[Figure: streams enter a Processor over time, e.g. ". . . 1, 5, 2, 7, 0, 9, 3", ". . . a, r, v, t, y, h, b" and ". . . 0, 0, 1, 0, 1, 1, 0". Each stream is composed of elements/tuples. The Processor takes Standing Queries and Ad-Hoc Queries, produces Output, and uses Limited Working Storage and Archival Storage.]

Problems on Data Streams
 Types of queries one wants to answer on a data stream: (we’ll do these today)
 Sampling data from a stream
 Construct a random sample
 Queries over sliding windows
 Number of items of type x in the last k elements of the stream

Problems on Data Streams
 Types of queries one wants to answer on a data stream: (we’ll do these next time)
 Filtering a data stream
 Select elements with property x from the stream
 Counting distinct elements
 Number of distinct elements in the last k elements of the stream
 Estimating moments
 Estimate avg./std. dev. of last k elements
 Finding frequent elements

Applications (1)
 Mining query streams
 Google wants to know what queries are more frequent today than yesterday
 Mining click streams
 Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
 Mining social network news feeds
 E.g., look for trending topics on Twitter, Facebook

Applications (2)
 Sensor Networks
 Many sensors feeding into a central controller
 Telephone call records
 Data feeds into customer bills as well as
settlements between telephone companies
 IP packets monitored at a switch
 Gather information for optimal routing
 Detect denial-of-service attacks


Sampling from a Data Stream:
Sampling a fixed proportion
As the stream grows the sample
also gets bigger
Sampling from a Data Stream
 Since we cannot store the entire stream, one obvious approach is to store a sample
 Two different problems:
 (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
 (2) Maintain a random sample of fixed size over a potentially infinite stream
 At any “time” k we would like a random sample of s elements
 What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has equal probability of being sampled
Sampling a Fixed Proportion
 Problem 1: Sampling fixed proportion
 Scenario: Search engine query stream
 Stream of tuples: (user, query, time)
 Answer questions such as: How often did a user run the same query in a single day?
 Have space to store 1/10th of query stream
 Naïve solution:
 Generate a random integer in [0..9] for each query
 Store the query if the integer is 0, otherwise discard

Problem with Naïve Approach
 Simple question: What fraction of queries by an average search engine user are duplicates?
 Suppose each user issues x queries once and d queries twice (total of x+2d queries)
 Correct answer: d/(x+d)
 Proposed solution: We keep 10% of the queries
 Sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once
 But only d/100 pairs of duplicates
 d/100 = 1/10 ∙ 1/10 ∙ d
 Of d “duplicates” 18d/100 appear exactly once
 18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d
 So the sample-based answer is (d/100) / (x/10 + d/100 + 18d/100) = d/(10x + 19d)
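The arithmetic on this slide can be checked with exact fractions. The sketch below only computes the expected counts described above (it is a ratio of expectations, not a simulation); the function names and the example values x = d = 1000 are assumptions chosen for illustration.

```python
from fractions import Fraction

def true_duplicate_fraction(x: int, d: int) -> Fraction:
    """Fraction of distinct queries that were issued twice."""
    return Fraction(d, x + d)

def sampled_duplicate_fraction(x: int, d: int,
                               p: Fraction = Fraction(1, 10)) -> Fraction:
    """What a per-query sample with keep-probability p is expected to report."""
    singles = p * x                 # singleton queries that survive
    both = p * p * d                # duplicate queries with both copies kept
    one_copy = 2 * p * (1 - p) * d  # duplicate queries with exactly one copy kept
    return both / (singles + both + one_copy)

x, d = 1000, 1000
print(true_duplicate_fraction(x, d))
print(sampled_duplicate_fraction(x, d))
```

With p = 1/10 the second function reduces to d/(10x + 19d), which is far below the true value d/(x+d); this is the bias that sampling by user avoids.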
Solution: Sample Users
 Pick 1/10th of users and take all their searches in the sample
 Use a hash function that hashes the user name or user id uniformly into 10 buckets

Generalized Solution
 Stream of tuples with keys:
 Key is some subset of each tuple’s components
 e.g., tuple is (user, search, time); key is user
 Choice of key depends on application
 To get a sample of a/b fraction of the stream:
 Hash each tuple’s key uniformly into b buckets, numbered 0 to b-1
 Pick the tuple if its hash value is less than a
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
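A minimal sketch of key-based sampling, assuming buckets are numbered 0 to b-1 and that SHA-256 from Python's `hashlib` is an acceptable stand-in for a uniform hash function. The helper names `bucket_of` and `in_sample` and the toy tuples are hypothetical.

```python
import hashlib

def bucket_of(key: str, b: int) -> int:
    """Map a key to one of b buckets (0 .. b-1) with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % b

def in_sample(key: str, a: int, b: int) -> bool:
    """Keep a tuple if its key falls into one of the first a of b buckets."""
    return bucket_of(key, b) < a

# Tuples are (user, query, time); the key is the user.
stream = [("alice", "cats", 1), ("bob", "dogs", 2), ("alice", "cats", 3)]
sample = [t for t in stream if in_sample(t[0], a=3, b=10)]
print(sample)
```

Because the decision depends only on the key, either all of a user's tuples are kept or none are, so per-user statistics such as duplicate queries stay intact. A stable hash is used rather than Python's built-in `hash`, which is salted per process for strings.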
Sampling from a Data Stream:
Sampling a fixed-size sample
As the stream grows, the sample is of
fixed size
Maintaining a fixed-size sample
 Problem 2: Fixed-size sample
 Suppose we need to maintain a random
sample S of size exactly s tuples
 E.g., main memory size constraint
 Why? Don’t know length of stream in advance
 Suppose at time n we have seen n items
 Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g…
At n= 5, each of the first 5 tuples is included in the sample S with equal prob.
At n= 7, each of the first 7 tuples is included in the sample S with equal prob.
Impractical solution would be to store all the n tuples seen
so far and out of them pick s at random
Solution: Fixed Size Sample
 Algorithm (a.k.a. Reservoir Sampling)
 Store all the first s elements of the stream to S
 Suppose we have seen n-1 elements, and now the nth element arrives (n > s)
 With probability s/n, keep the nth element, else discard it
 If we picked the nth element, then it replaces one of the s elements in the sample S, picked uniformly at random
 Claim: This algorithm maintains a sample S with the desired property:
 After n elements, the sample contains each element seen so far with probability s/n
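The steps above can be sketched in a few lines. This is an illustrative version, not a tuned implementation; the function name `reservoir_sample` and the optional `rng` parameter (added so runs can be seeded) are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], s: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of s items from a stream of unknown length."""
    rng = rng or random.Random()
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)             # store the first s elements
        elif rng.random() < s / n:          # keep the nth element with prob. s/n
            sample[rng.randrange(s)] = item  # evict a uniformly chosen element
    return sample

print(reservoir_sample("axcyzkcdeg", 2, random.Random(0)))
```

Only the sample and the counter n are kept in memory, so the space used is independent of the stream length.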
Proof: By Induction
 We prove this by induction:
 Assume that after n elements, the sample contains
each element seen so far with probability s/n
 We need to show that after seeing element n+1 the
sample maintains the property
 Sample contains each element seen so far with probability
s/(n+1)
 Base case:
 After we see n=s elements the sample S has the
desired property
 Each out of n=s elements is in the sample with probability
s/s = 1
Proof: By Induction
 Inductive hypothesis: After n elements, the sample S contains each element seen so far with prob. s/n
 Now element n+1 arrives
 Inductive step: For elements already in S, probability that the algorithm keeps it in S is:
\[
\underbrace{\left(1 - \frac{s}{n+1}\right)}_{\text{element } n+1 \text{ discarded}}
+ \underbrace{\left(\frac{s}{n+1}\right)}_{\text{element } n+1 \text{ not discarded}}
\underbrace{\left(\frac{s-1}{s}\right)}_{\text{element in the sample not picked}}
= \frac{n}{n+1}
\]
 So, at time n, tuples in S were there with prob. s/n
 Time n→n+1, tuple stayed in S with prob. n/(n+1)
 So prob. tuple is in S at time n+1 = s/n ∙ n/(n+1) = s/(n+1)
Queries over a
(long) Sliding Window
Sliding Windows
 A useful model of stream processing is that queries are about a window of length N – the N most recent elements received
 Interesting case: N is so large that the data cannot be stored in memory, or even on disk
 Or, there are so many streams that windows for all cannot be stored
 Amazon example:
 For every product X we keep 0/1 stream of whether that product was sold in the n-th transaction
 We want to answer queries: how many times have we sold X in the last k sales?
Sliding Window: 1 Stream
 Sliding window on a single stream: N=6

qwertyuiopasdfghjklzxcvbnm

Past Future


Counting Bits (1)
 Problem:
 Given a stream of 0s and 1s
 Be prepared to answer queries of the form
How many 1s are in the last k bits? where k ≤ N

 Obvious solution:
Store the most recent N bits
 When new bit comes in, discard the N+1st bit
010011011101010110110110 Suppose N=6
Past Future

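The obvious solution can be sketched with a bounded double-ended queue. The class name `ExactWindowCounter` is an assumption for illustration; the bit string is the one shown on the slide, with N=6.

```python
from collections import deque
from itertools import islice

class ExactWindowCounter:
    """Stores the most recent N bits and answers exact count queries."""

    def __init__(self, N: int) -> None:
        self.window = deque(maxlen=N)  # the oldest bit is dropped automatically

    def add(self, bit: int) -> None:
        self.window.append(bit)

    def count_ones(self, k: int) -> int:
        """Number of 1s among the last k bits, for 1 <= k <= N."""
        if not 0 < k <= self.window.maxlen:
            raise ValueError("k must be between 1 and N")
        return sum(islice(reversed(self.window), k))

counter = ExactWindowCounter(N=6)
for ch in "010011011101010110110110":
    counter.add(int(ch))
print(counter.count_ones(6), counter.count_ones(3))
```

This is exact but needs N bits of storage per stream, which is the cost the rest of the section tries to avoid.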

Counting Bits (2)
 You cannot get an exact answer without storing the entire window
 Real Problem: What if we cannot afford to store N bits?
 E.g., we’re processing 1 billion streams and N = 1 billion
010011011101010110110110
Past Future
 But we are happy with an approximate answer
An attempt: Simple solution
 Q: How many 1s are in the last N bits?
 A simple solution that does not really solve our problem: Uniformity assumption
N
010011100010100100010110110111001010110011010
Past Future
 Maintain 2 counters:
 S: number of 1s from the beginning of the stream
 Z: number of 0s from the beginning of the stream
 How many 1s are in the last N bits? N ∙ S/(S+Z)
 But, what if stream is non-uniform?
 What if distribution changes over time?
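A small sketch of the two-counter estimate and of how it fails on a non-uniform stream. The function name `uniform_estimate` and the toy stream (1000 zeros followed by 100 ones) are assumptions made for the example.

```python
def uniform_estimate(S: int, Z: int, N: int) -> float:
    """Estimate the 1s in the last N bits, assuming 1s are spread uniformly."""
    return N * S / (S + Z)

# Non-uniform toy stream: all the 1s arrive at the end.
stream = [0] * 1000 + [1] * 100
S = sum(stream)          # number of 1s seen so far
Z = len(stream) - S      # number of 0s seen so far
N = 100

estimate = uniform_estimate(S, Z, N)
true_count = sum(stream[-N:])
print(estimate, true_count)
```

The estimate reflects the long-run density of 1s rather than the recent window, so it is far too low here even though only two counters are stored.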
[Datar, Gionis, Indyk, Motwani]

DGIM Method
 DGIM solution that does not assume uniformity
 We store O(log² N) bits per stream
 Solution gives approximate answer, never off by more than 50%
 Error factor can be reduced to any fraction > 0, with more complicated algorithm and proportionally more stored bits

Idea: Exponential Windows
 Solution that doesn’t (quite) work:
 Summarize exponentially increasing regions of the stream, looking backward
 Drop small regions if they begin at the same point as a larger region
[Figure: the stream below with the last N bits marked; regions of exponentially increasing width, looking backward, are labeled with their counts of 1s (one label reads “Window of width 16 has 6 1s”), and a “?” marks the part that cannot be resolved]
010011100010100100010110110111001010110011010
We can reconstruct the count of the last N bits, except we are not sure how many of the last 6 1s are included in the N
What’s Good?
 Stores only O(log² N) bits
 O(log N) counts of log₂ N bits each
 Easy update as more bits enter
 Error in count no greater than the number of 1s in the “unknown” area

What’s Not So Good?
 As long as the 1s are fairly evenly distributed, the error due to the unknown region is small – no more than 50%
 But it could be that all the 1s are in the unknown area at the end
 In that case, the error is unbounded!
[Figure: the same stream with exponentially increasing regions and their counts of 1s; the “?” marks the unknown area]
010011100010100100010110110111001010110011010
N
[Datar, Gionis, Indyk, Motwani]

Fixup: DGIM method
 Idea: Instead of summarizing fixed-length blocks, summarize blocks with specific number of 1s:
 Let the block sizes (number of 1s) increase exponentially
 When there are few 1s in the window, block sizes stay small, so errors are small
1001010110001011010101010101011010101010101110101010111010100010110010
N

DGIM: Timestamps
 Each bit in the stream has a timestamp, starting 1, 2, …
 Record timestamps modulo N (the window size), so we can represent any relevant timestamp in O(log₂ N) bits

DGIM: Buckets
 A bucket in the DGIM method is a record
consisting of:
 (A) The timestamp of its end [O(log N) bits]
 (B) The number of 1s between its beginning and
end [O(log log N) bits]

 Constraint on buckets:
Number of 1s must be a power of 2
 That explains the O(log log N) in (B) above
1001010110001011010101010101011010101010101110101010111010100010110010
N
Representing a Stream by Buckets
 Either one or two buckets with the same power-of-2 number of 1s
 Buckets do not overlap in timestamps
 Buckets are sorted by size
 Earlier buckets are not smaller than later buckets
 Buckets disappear when their end-time is > N time units in the past
Example: Bucketized Stream
Buckets, from oldest to newest: at least 1 of size 16 (partially beyond window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1

1001010110001011010101010101011010101010101110101010111010100010110010

Three properties of buckets that are maintained:

- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Updating Buckets (1)
 When a new bit comes in, drop the last (oldest) bucket if its end-time is prior to N time units before the current time
 2 cases: Current bit is 0 or 1
 If the current bit is 0: no other changes are needed

Updating Buckets (2)
 If the current bit is 1:
 (1) Create a new bucket of size 1, for just this bit
 End timestamp = current time
 (2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
 (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
 (4) And so on …


Example: Updating Buckets
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010

Bit of value 1 arrives

0010101100010110101010101010110101010101011101010101110101000101100101
Two orange buckets get merged into a yellow bucket
0010101100010110101010101010110101010101011101010101110101000101100101

Next bit 1 arrives, new orange bucket is created, then 0 comes, then 1:
0101100010110101010101010110101010101011101010101110101000101100101101

Buckets get merged…

0101100010110101010101010110101010101011101010101110101000101100101101

State of the buckets after merging

0101100010110101010101010110101010101011101010101110101000101100101101


How to Query?
 To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last (note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
 Remember: We do not know how many 1s of the last bucket are still within the wanted window
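The bucket update rules and the query rule can be put together in a short sketch. It is a simplified illustration under stated assumptions: timestamps are kept as plain integers instead of modulo N, buckets live in a Python list (newest first), and the class name `DGIM`, the method names and the random test stream are hypothetical. The `estimate` method also accepts k < N by using the oldest bucket that overlaps the last k bits.

```python
import random
from typing import List, Optional, Tuple

class DGIM:
    """Approximate count of 1s in the last N bits, with 1 or 2 buckets per size."""

    def __init__(self, N: int) -> None:
        self.N = N
        self.t = 0  # current timestamp (unbounded here; the slides store it modulo N)
        self.buckets: List[Tuple[int, int]] = []  # (end timestamp, size), newest first

    def add(self, bit: int) -> None:
        self.t += 1
        # Drop the oldest bucket once its end time has left the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # While three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end, size = self.buckets[i + 1]  # the merged bucket keeps the later end time
            self.buckets[i + 1 : i + 3] = [(end, 2 * size)]
            i += 1

    def estimate(self, k: Optional[int] = None) -> float:
        """Estimate the number of 1s in the last k bits (k defaults to N)."""
        k = self.N if k is None else k
        if not 0 < k <= self.N:
            raise ValueError("k must be between 1 and N")
        cutoff = self.t - k  # buckets ending at or before this lie outside the last k bits
        sizes = [size for end, size in self.buckets if end > cutoff]
        if not sizes:
            return 0.0
        return sum(sizes[:-1]) + sizes[-1] / 2  # oldest overlapping bucket counts half

# Compare against the exact count on a seeded random stream.
rng = random.Random(1)
dgim = DGIM(N=64)
history: List[int] = []
worst = 0.0
for _ in range(2000):
    bit = 1 if rng.random() < 0.4 else 0
    history.append(bit)
    dgim.add(bit)
    true = sum(history[-64:])
    if true:
        worst = max(worst, abs(dgim.estimate() - true) / true)
print(f"worst relative error: {worst:.2f}")
```

The `history` list exists only to measure the error in this demo; the sketch itself keeps just the bucket list, whose length grows logarithmically with the number of 1s in the window.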

Example: Bucketized Stream
Buckets, from oldest to newest: at least 1 of size 16 (partially beyond window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1

1001010110001011010101010101011010101010101110101010111010100010110010

Error Bound: Proof
 Why is error 50%? Let’s prove it!
 Suppose the last bucket has size 2^r
 Then by assuming 2^(r-1) (i.e., half) of its 1s are still within the window, we make an error of at most 2^(r-1)
 Since there is at least one bucket of each of the sizes less than 2^r, the true sum is at least 1 + 2 + 4 + .. + 2^(r-1) = 2^r - 1
 Thus, error at most 50%: the most recent 1 of the last bucket is also within the window, so the true sum is at least 2^r and an error of at most 2^(r-1) is at most half of it
[Figure label: At least 16 1s]
111111110000000011101010101011010101010101110101010111010100010110010
N
Further Reducing the Error
 Instead of maintaining 1 or 2 of each size
bucket, we allow either r-1 or r buckets (r > 2)
 Except for the largest size buckets; we can have
any number between 1 and r of those
 Error is at most O(1/r)
 By picking r appropriately, we can tradeoff
between number of bits we store and the
error


Extensions
 Can we use the same trick to answer queries “How many 1’s in the last k?” where k < N?
 A: Find earliest bucket B that overlaps with k. Number of 1s is the sum of sizes of more recent buckets + ½ size of B
1001010110001011010101010101011010101010101110101010111010100010110010
k
 Can we handle the case where the stream is not bits, but integers, and we want the sum of the last k elements?
Extensions
 Stream of positive integers
 We want the sum of the last k elements
 Amazon: Avg. price of last k sales
 Solution:
 (1) If you know all have at most m bits
 Treat m bits of each integer as a separate stream
 Use DGIM to count 1s in each of the m bit streams
 c_i … estimated count for i-th bit
 The sum is $\sum_{i=0}^{m-1} c_i 2^i$
 (2) Use buckets to keep partial sums
 Sum of elements in size b bucket is at most 2^b
[Figure: the integer stream 2 5 7 1 3 8 4 6 7 9 1 3 7 6 5 3 5 7 1 3 3 1 2 2 6 shown as the integers 3, 2 and 5 arrive in turn, with bucket sizes 16, 8, 4, 2, 1. Idea: the sum in each bucket is at most 2^b (unless the bucket has only 1 integer)]
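The identity behind solution (1) can be checked on a small window. In this sketch the per-bit counts are computed exactly for clarity; in the streaming setting each c_i would instead be a DGIM estimate for the i-th bit stream. The function names and the choice of the first eight integers of the slide's stream are assumptions for illustration.

```python
from typing import List

def bit_counts(window: List[int], m: int) -> List[int]:
    """c[i] = number of integers in the window whose i-th bit is 1."""
    return [sum((value >> i) & 1 for value in window) for i in range(m)]

def sum_from_bit_counts(counts: List[int]) -> int:
    """Recombine per-bit counts: sum over i of c[i] * 2**i."""
    return sum(c << i for i, c in enumerate(counts))

window = [2, 5, 7, 1, 3, 8, 4, 6]   # every value fits in m = 4 bits
counts = bit_counts(window, m=4)
print(counts, sum_from_bit_counts(counts), sum(window))
```

With exact counts the recombined value equals the true sum; with DGIM estimates each c_i carries its own bounded error, and the errors on the high-order bits weigh the most.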
Summary
 Sampling a fixed proportion of a stream
 Sample size grows as the stream grows
 Sampling a fixed-size sample
 Reservoir sampling
 Counting the number of 1s in the last N
elements
 Exponentially increasing windows
 Extensions:
 Number of 1s in any last k (k < N) elements
 Sums of integers in the last N elements


CC Critical Incidents 131127
CC Critical Incidents 131127
2015-5 Dusaran (Production Practices of The Native Chicken Growers in Western Visayas)
2015-5 Dusaran (Production Practices of The Native Chicken Growers in Western Visayas)
11 Part 2 Att 2-2 HSE Pre-Qualification Questionnaire
11 Part 2 Att 2-2 HSE Pre-Qualification Questionnaire
R-Stahl Ammeter Model
R-Stahl Ammeter Model
Conformal Coating UV Curing Technical Bulletin June 09
Conformal Coating UV Curing Technical Bulletin June 09
Uantitative: Data Analysis
Uantitative: Data Analysis
Environmental Psychology Theories
Environmental Psychology Theories
Worksheet 10 Trigonometry
Worksheet 10 Trigonometry
Case Study - Quality Management System at Coca Cola Company - Docx - 1538569969006 PDF
Case Study - Quality Management System at Coca Cola Company - Docx - 1538569969006 PDF
Lung - Pathophysiology
Lung - Pathophysiology
دكتور علي المتناني ورقت بحث
دكتور علي المتناني ورقت بحث
Ionic and Covalent Bonds
Ionic and Covalent Bonds
Grade 11 Mid-Term Test 2018
Grade 11 Mid-Term Test 2018
P1C8 Integration (Exercises)
P1C8 Integration (Exercises)
Account Executive Sales B2B in Indianapolis IN Resume Sheryl Karnes
Account Executive Sales B2B in Indianapolis IN Resume Sheryl Karnes
All About Threads PDF
All About Threads PDF
Unit 1: Hello and Goodbye
Unit 1: Hello and Goodbye
Shareholders' Equity
Shareholders' Equity
Cisco Script PT 3.6.1 Packetracer Skills Challenge
Cisco Script PT 3.6.1 Packetracer Skills Challenge
Nitrogen Clean-Outs: Total Well Hole Volume Total CT Annular
Nitrogen Clean-Outs: Total Well Hole Volume Total CT Annular
China Study Summary
China Study Summary


