Mining Data Streams

(Part 1)
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Contents
◼ Introduction to Data Streams
◼ Examples of Streams
◼ The Stream Model
◼ Filtering Streams: The Bloom Filter
◼ The Count-Distinct Problem
- The Flajolet-Martin Algorithm
◼ Estimating Moments: AMS Algorithm
◼ Higher-Order Moments
◼ Querying on Windows
▪ Counting Ones
▪ DGIM algorithm

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

New Topic: Infinite Data
◼ High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
◼ Graph data: PageRank, SimRank, Community Detection, Spam Detection
◼ Infinite data: Filtering data streams, Queries on streams, Web advertising
◼ Machine learning: SVM, Decision Trees, Perceptron, kNN
◼ Apps: Recommender systems, Association Rules, Duplicate document detection

Data Management vs. Stream Management

Examples of Stream Sources

◼ Sensor data
◼ Image data
◼ Internet and Web traffic

Data Streams
◼ In many data mining situations, we do not know the entire data set in advance
◼ Stream Management is important when the input rate is controlled externally:
▪ Google queries
▪ Twitter or Facebook status updates
◼ We can think of the data as infinite and non-stationary (the distribution changes over time)
General Stream Processing Model
[Figure: several streams enter a stream processor over time, e.g. (. . . 1, 5, 2, 7, 0, 9, 3), (. . . a, r, v, t, y, h, b), and (. . . 0, 0, 1, 0, 1, 1, 0). Each stream is composed of elements/tuples. The processor answers standing queries and ad-hoc queries, produces output, and uses limited working storage plus archival storage.]

The Stream Model
◼ Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
▪ We call elements of the stream tuples
◼ The system cannot store the entire stream accessibly
◼ Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?

Types of queries
1) Ad-hoc queries
A question asked once about the current state of a stream or streams.
Example: What is the max value seen so far?

2) Standing queries
A query that is, in effect, permanently executing and produces output at appropriate times.
Example: The stream produced by the ocean-surface-temperature sensor might have a standing query to output an alert whenever the temperature exceeds 25 degrees centigrade. This query is easily answered, since it depends only on the most recent stream element.

Problems on Data Streams

◼ Types of queries one wants to answer on a data stream:
▪ Sampling data from a stream
  ▪ Construct a random sample
▪ Queries over sliding windows
  ▪ Number of items of type x in the last k elements of the stream

Problems on Data Streams

◼ Types of queries one wants to answer on a data stream:
▪ Filtering a data stream
  ▪ Select elements with property x from the stream
▪ Counting distinct elements
  ▪ Number of distinct elements in the last k elements of the stream
▪ Estimating moments
  ▪ Estimate avg./std. dev. of last k elements
▪ Finding frequent elements

Applications (1)
◼ Mining query streams
▪ Google wants to know what queries are more frequent today than yesterday
◼ Mining click streams
▪ Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
◼ Mining social network news feeds
▪ E.g., look for trending topics on Twitter, Facebook

Applications (2)
◼Sensor Networks
▪ Many sensors feeding into a central controller
◼Telephone call records
▪ Data feeds into customer bills as well as
settlements between telephone companies
◼IP packets monitored at a switch
▪ Gather information for optimal routing
▪ Detect denial-of-service attacks


Algorithms for streams:
(1) Filtering a data stream: Bloom filters
Select elements with property x from stream
(2) Counting distinct elements: Flajolet-Martin
Number of distinct elements in the last k elements
of the stream
(3) Estimating moments: AMS method
Estimate std. dev. of last k elements
(4) Counting frequent items

(1) Filtering Data Streams
Filtering Data Streams
◼ Each element of the data stream is a tuple
◼ Given a list of keys S
◼ Determine which tuples of the stream are in S
◼ Obvious solution: Hash table
▪ But suppose we do not have enough memory to store all of S in a hash table
▪ E.g., we might be processing millions of filters on the same stream
Bloom Filters – Introduction
Example: creating a Gmail account (checking whether a username is already taken).
◼ A space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
◼ The price we pay for efficiency is that it is probabilistic in nature, which means there might be some false positive results.
◼ A false positive means the filter might report that a given username is already taken when it actually is not.
The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
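The three components listed above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the class name `BloomFilter`, the method names, and the way the k hash functions are simulated (salting SHA-256 with the index i) are all assumptions made for this example. Inserting a key sets the k bits it hashes to; a lookup reports "possibly present" only if all k bits are set, which is why false positives can occur but false negatives cannot.

```python
import hashlib


class BloomFilter:
    """Bloom filter with an n-bit array and k hash functions (illustrative)."""

    def __init__(self, n_bits: int, k_hashes: int):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, all 0 initially

    def _positions(self, key: str):
        # Assumption: simulate h1..hk by salting SHA-256 with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        # Set the bit chosen by each hash function.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # True means "possibly in S"; False means "definitely not in S".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(n_bits=1000, k_hashes=3)
for name in ["alice", "bob", "carol"]:   # the key set S
    bf.add(name)
print(bf.might_contain("alice"))  # True: inserted keys are always found
```

A key that was never added usually returns False, but may occasionally return True when other keys happen to have set all of its k bits.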
Examples
◼ Refer to the numerical examples worked in class.
◼ Bloom Filters - Introduction and Implementation - GeeksforGeeks
(2) Counting Distinct Elements
Counting Distinct Elements

Applications

Formal Definition

Using Small Storage
Flajolet-Martin Algorithm
◼ Pseudocode, step by step:
1. Select a hash function h that maps each element of the set to a bit string of at least log2 n bits.
2. Convert each hash value h(x) to its binary representation.
3. For each element x, let r(x) = the number of trailing zeroes in the binary representation of h(x).
4. R = max r(x) over all elements x seen so far.
=> Estimated number of distinct elements = 2^R
Example
Consider the stream x = [1, 5, 10, 5, 15, 1]
h(x) = x mod 11
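The Flajolet-Martin steps can be traced on this example stream with a short Python sketch. The function names are made up for illustration, and one convention is an assumption: a hash value of 0 is given a tail length of 0 here (some treatments handle the all-zero value differently).

```python
def trailing_zeros(value: int) -> int:
    """Number of trailing zero bits; by assumption 0 when value == 0."""
    if value == 0:
        return 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def flajolet_martin(stream, h):
    """Return 2**R, where R is the largest tail length of h(x) seen."""
    max_tail = 0
    for x in stream:
        max_tail = max(max_tail, trailing_zeros(h(x)))
    return 2 ** max_tail


stream = [1, 5, 10, 5, 15, 1]
estimate = flajolet_martin(stream, lambda x: x % 11)
print(estimate)  # 4
```

Here the hash values are 1, 5, 10, 5, 4, 1 (binary 1, 101, 1010, 101, 100, 1), so the tail lengths are 0, 0, 1, 0, 2, 0, giving R = 2 and an estimate of 2^2 = 4, which happens to match the 4 distinct values in the stream.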
Example (2)
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
h(x) = (6x + 1) mod 5

Solved numerically in lecture.
(3) Computing Moments
Generalization: Moments
◼ Suppose a stream has elements chosen from a set A of N values
◼ Let m_i be the number of times value i occurs in the stream
◼ The kth moment is
$$\sum_{i \in A} (m_i)^k$$

Special Cases
$$\sum_{i \in A} (m_i)^k$$
◼ 0th moment = number of distinct elements
▪ The problem just considered
◼ 1st moment = count of the number of elements = length of the stream
▪ Easy to compute
◼ 2nd moment = surprise number S = a measure of how uneven the distribution is

Example: Surprise Number
◼ Stream of length 100
◼ 11 distinct values
◼ Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9
Surprise S = 910
◼ Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Surprise S = 8,110
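The two surprise numbers can be checked directly from the definition of the kth moment as the sum of (m_i)^k over the item counts. The helper name `kth_moment` is hypothetical; this exact computation needs every count in memory, which is precisely what the streaming estimators try to avoid.

```python
def kth_moment(counts, k):
    """Exact k-th moment: sum of m_i ** k over the item counts m_i."""
    return sum(m ** k for m in counts)


even = [10] + [9] * 10     # 11 distinct values, stream length 100
skewed = [90] + [1] * 10   # 11 distinct values, stream length 100

print(kth_moment(even, 0), kth_moment(even, 1))   # 11 100
print(kth_moment(even, 2))                        # 910
print(kth_moment(skewed, 2))                      # 8110
```

Both streams have the same 0th and 1st moments (11 distinct values, length 100); only the 2nd moment separates the even distribution from the skewed one.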

AMS Method [Alon, Matias, and Szegedy]
◼ AMS method works for all moments
◼ Gives an unbiased estimate
◼ We will just concentrate on the 2nd moment S
◼ We pick and keep track of many variables X:
▪ For each variable X we store X.el and X.val
▪ X.el corresponds to the item i
▪ X.val corresponds to the count of item i
▪ Note this requires a count in main memory, so number of Xs is limited
◼ Our goal is to compute $S = \sum_i m_i^2$

One Random Variable (X)
◼ How to set X.val and X.el?
▪ Assume stream has length n (we relax this later)
▪ Pick some random time t (t<n) to start, so that any time is equally likely
▪ Let at time t the stream have item i. We set X.el = i
▪ Then we maintain count c (X.val = c) of the number of i's in the stream starting from the chosen time t
◼ Then the estimate of the 2nd moment ($\sum_i m_i^2$) is:
$$S = f(X) = n\,(2c - 1)$$
▪ Note, we will keep track of multiple Xs (X1, X2, … Xk) and our final estimate will be
$$S = \frac{1}{k} \sum_{j=1}^{k} f(X_j)$$
Example
◼ The stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b.
◼ n= 15

◼ AMS:
◼ Assume that at “random” we pick the 3rd, 8th, and
13th positions
◼ Calculate X1, X2, X3
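The calculation of X1, X2, X3 can be sketched as below, using the second-moment estimate n(2c − 1) for each variable. This is an offline illustration only: it slices the stored stream to count occurrences, whereas a real streaming implementation would increment X.val as elements arrive. The function name `ams_estimates` is an assumption.

```python
from collections import Counter


def ams_estimates(stream, positions):
    """Second-moment estimates n*(2c - 1), one per 1-based start position."""
    n = len(stream)
    estimates = []
    for t in positions:
        element = stream[t - 1]                # X.el
        count = stream[t - 1:].count(element)  # X.val: occurrences from time t on
        estimates.append(n * (2 * count - 1))
    return estimates


stream = list("abcbdacdabdcaab")               # a, b, c, b, d, a, c, d, ...
estimates = ams_estimates(stream, [3, 8, 13])
average = sum(estimates) / len(estimates)
exact = sum(m ** 2 for m in Counter(stream).values())

print(estimates)  # [75, 45, 45]
print(average)    # 55.0
print(exact)      # 59
```

Position 3 holds c, which occurs 3 times from there on (X1.val = 3); position 8 holds d with 2 occurrences; position 13 holds a with 2 occurrences. The averaged estimate of 55 is close to the true second moment 5^2 + 4^2 + 3^2 + 3^2 = 59.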
Expectation Analysis
Count:  1 2 3 … m_a
Stream: a a b b b a b a
◼ 2nd moment is $S = \sum_i m_i^2$
◼ c_t … number of times the item at time t appears from time t onwards (c_1 = m_a, c_2 = m_a − 1, c_3 = m_b)
◼ m_i … total count of item i in the stream (we are assuming the stream has length n)
$$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n\,(2c_t - 1) = \frac{1}{n} \sum_i n\,(1 + 3 + 5 + \dots + (2m_i - 1))$$
◼ The second equality groups times by the value seen: the term 1 comes from the time t when the last i is seen (c_t = 1), the term 3 from the time when the penultimate i is seen (c_t = 2), and the term 2m_i − 1 from the time when the first i is seen (c_t = m_i)

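Expectation Analysis
Count:  1 2 3 … m_a
Stream: a a b b b a b a
$$E[f(X)] = \frac{1}{n} \sum_i n\,(1 + 3 + 5 + \dots + (2m_i - 1))$$
▪ Little side calculation:
$$1 + 3 + 5 + \dots + (2m_i - 1) = \sum_{c=1}^{m_i} (2c - 1) = 2\,\frac{m_i (m_i + 1)}{2} - m_i = (m_i)^2$$
◼ Then $E[f(X)] = \frac{1}{n} \sum_i n\,(m_i)^2$
◼ So, $E[f(X)] = \sum_i (m_i)^2 = S$
◼ We have the second moment (in expectation)!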

Higher-Order Moments
◼ For estimating the kth moment we essentially use the same algorithm but change the estimate:
▪ For k=2 we used n (2·c – 1)
▪ For k=3 we use: n (3·c^2 – 3c + 1) (where c=X.val)
◼ Why?
▪ For k=2: Remember we had $1 + 3 + 5 + \dots + (2m_i - 1)$ and we showed terms 2c-1 (for c=1,…,m) sum to m^2
$$\sum_{c=1}^{m} (2c - 1) = \sum_{c=1}^{m} \left(c^2 - (c-1)^2\right) = m^2$$
▪ So: $2c - 1 = c^2 - (c-1)^2$
▪ For k=3: c^3 - (c-1)^3 = 3c^2 - 3c + 1
◼ Generally: Estimate $= n\,(c^k - (c-1)^k)$
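The telescoping argument can be checked numerically: if the estimate n(c^k − (c−1)^k) is averaged over every possible starting time t (each equally likely), the result should equal the exact kth moment. The sketch below does this for a small stream; the stream contents and the function names are assumptions chosen for illustration, and the counting is done offline rather than in a single pass.

```python
from collections import Counter


def ams_estimate(stream, t, k):
    """k-th moment estimate from a variable started at 0-based index t."""
    n = len(stream)
    c = stream[t:].count(stream[t])   # occurrences of the item from t onwards
    return n * (c ** k - (c - 1) ** k)


def exact_moment(stream, k):
    return sum(m ** k for m in Counter(stream).values())


stream = list("abcbdacdabdcaab")
for k in (2, 3):
    # Expectation over a uniformly random start time t.
    mean = sum(ams_estimate(stream, t, k) for t in range(len(stream))) / len(stream)
    print(k, mean, exact_moment(stream, k))
```

For each item i the terms c^k − (c−1)^k for c = 1, …, m_i telescope to (m_i)^k, so the mean over all start times matches the exact moment for every k, not just k = 2.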
Sampling from a Data Stream:
Techniques of Sampling:
1) Sampling a fixed proportion
2) Fixed Size Sampling
3) Biased Reservoir Sampling

Sampling from a Data Stream
◼ Since we cannot store the entire stream, one obvious approach is to store a sample
◼ Two different problems:
▪ (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
▪ (2) Maintain a random sample of fixed size over a potentially infinite stream
  ▪ At any “time” k we would like a random sample of s elements
  ▪ What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has equal probability of being sampled
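For problem (2), one standard way to maintain the stated property is reservoir sampling. The slides do not spell the procedure out here, so the following is a sketch of the textbook version under that assumption: keep the first s elements, then keep the k-th element with probability s/k and let it replace a uniformly chosen element of the sample. The function name and the `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, s, rng=random):
    """Keep a uniform random sample of s elements from a stream of unknown length."""
    sample = []
    for k, element in enumerate(stream, start=1):
        if k <= s:
            sample.append(element)              # fill the reservoir first
        elif rng.random() < s / k:
            sample[rng.randrange(s)] = element  # evict a uniformly chosen element
    return sample


sample = reservoir_sample(range(1000), s=5, rng=random.Random(0))
print(len(sample))  # 5
```

After k elements have arrived, each of them is in the sample with probability s/k, which is the property asked for above.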
Sampling a Fixed Proportion
◼ Problem 1: Sampling a fixed proportion
◼ Scenario: Search engine query stream
▪ Stream of tuples: (user, query, time)
▪ Answer questions such as: How often did a user run the same query in a single day?
▪ Have space to store 1/10th of the query stream
◼ Naïve solution:
▪ Generate a random integer in [0..9] for each query
▪ Store the query if the integer is 0, otherwise discard

Problem with Naïve Approach


Solution: Sample Users
Solution:
◼ Pick 1/10th of users and take all their searches in the sample
◼ Use a hash function that hashes the user name or user id uniformly into 10 buckets

Generalized Solution
◼ Stream of tuples with keys:
▪ Key is some subset of each tuple’s components
▪ e.g., tuple is (user, search, time); key is user
▪ Choice of key depends on application
◼ To get a sample of a/b fraction of the stream:
▪ Hash each tuple’s key uniformly into b buckets
▪ Pick the tuple if its hash value falls in one of the first a buckets (hash value at most a, with buckets numbered 1 to b)
◼ How to generate a 30% sample?
▪ Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
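A minimal sketch of key-based sampling for the 30% case follows. The function name `in_sample`, the synthetic tuples, and the choice of MD5 as the uniform hash are assumptions for illustration; buckets are numbered 0 to b−1 here, so "the first a buckets" becomes the test `bucket < a`.

```python
import hashlib


def in_sample(key: str, a: int, b: int) -> bool:
    """Keep a tuple if its key hashes into one of the first a of b buckets."""
    digest = hashlib.md5(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % b   # bucket in 0..b-1
    return bucket < a


# Hypothetical stream of (user, query, time) tuples: 1000 users, 3 queries each.
tuples = [(f"user{u}", f"query{q}", q) for u in range(1000) for q in range(3)]
sample = [t for t in tuples if in_sample(t[0], a=3, b=10)]

print(len(sample) / len(tuples))  # roughly 0.3
```

Because the decision depends only on the key, every tuple of a selected user is kept and every tuple of an unselected user is dropped, so per-user statistics computed on the sample are not distorted the way they are when queries are sampled independently.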
Queries over a (long) Sliding Window
Sliding Windows
◼ A useful model of stream processing is that queries are about a window of length N – the N most recent elements received
◼ Interesting case: N is so large that the data cannot be stored in memory, or even on disk
▪ Or, there are so many streams that windows for all cannot be stored
◼ Amazon example:
▪ For every product X we keep a 0/1 stream of whether that product was sold in the n-th transaction
▪ We want to answer queries such as: how many times have we sold X in the last k sales?
Sliding Window: 1 Stream
◼ Sliding window on a single stream: N=6
[Figure: the stream q w e r t y u i o p a s d f g h j k l z x c v b n m, with a window of N = 6 elements sliding from past (left) to future (right)]

Counting Bits (1)
◼ Problem:
▪ Given a stream of 0s and 1s
▪ Be prepared to answer queries of the form: How many 1s are in the last k bits? where k ≤ N
◼ Obvious solution: Store the most recent N bits
▪ When a new bit comes in, discard the (N+1)st bit
[Figure: bit stream 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0, past on the left and future on the right; suppose N=6]
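The obvious solution can be sketched with a bounded double-ended queue, which discards the oldest bit automatically once N bits are stored. The class and method names are hypothetical; the bit string is the one shown on the slide.

```python
from collections import deque
from itertools import islice


class ExactWindowCounter:
    """Stores the most recent N bits and counts 1s exactly."""

    def __init__(self, n: int):
        self.window = deque(maxlen=n)   # appending to a full deque drops the oldest bit

    def add(self, bit: int) -> None:
        self.window.append(bit)

    def count_ones(self, k: int) -> int:
        # Sum the k most recent bits (or fewer, if fewer have arrived).
        return sum(islice(reversed(self.window), k))


counter = ExactWindowCounter(n=6)
for ch in "010011011101010110110110":
    counter.add(int(ch))

print(counter.count_ones(6))  # 4
print(counter.count_ones(3))  # 2
```

This is exact but costs N bits per stream, which is the cost the following slides try to avoid.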

Counting Bits (2)
◼ You cannot get an exact answer without storing the entire window
◼ Real Problem: What if we cannot afford to store N bits?
▪ E.g., we’re processing 1 billion streams and N = 1 billion
[Figure: bit stream 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0, past on the left and future on the right]
◼ But we are happy with an approximate answer
An attempt: Simple solution
[Figure: bit stream 010011100010100100010110110111001010110011010 with the last N bits marked as the window; past on the left, future on the right]

DGIM Method [Datar, Gionis, Indyk, Motwani]

DGIM Elements
◼Timestamp – modulo N
◼Bucket size
◼Rules:
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. Size of a bucket = number of 1’s it contains
4. There are one or two buckets of any given size, up to
some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left
(back in time)
DGIM Example
◼ . . 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1 0
This picture shows how we can form the buckets based on the number of ones by following the rules.

In the given data stream, let us assume the new bit arrives from the right. First consider the case when the new bit = 0.

After the new bit (0) arrives with time stamp 101, there is no change in the buckets.
But if the new bit that arrives is 1, then we need to make changes.
Storage Requirements for the DGIM Algorithm
◼ Each bucket can be represented by O(log N) bits. If the window has length N, then there are no more than N 1’s, surely.
◼ Suppose the largest bucket is of size 2^j. Then j cannot exceed log2 N, or else there are more 1’s in this bucket than there are 1’s in the entire window.
Idea: Exponential Windows
◼ Solution that doesn’t (quite) work:
▪ Summarize exponentially increasing regions of the stream, looking backward
▪ Drop small regions if they begin at the same point as a larger region
[Figure: bit stream 010011100010100100010110110111001010110011010 with the window of the last N bits marked and a count of 1s for each region of exponentially increasing width, looking backward (counts shown: 0, 1, 1, 2, 2, 3, 4, 6, 10, and one unknown count marked ?); for example, the window of width 16 has 6 1s]
We can reconstruct the count of the last N bits, except we are not sure how many of the last 6 1s are included in the N
What’s Good?

What’s Not So Good?
◼ As long as the 1s are fairly evenly distributed, the error due to the unknown region is small – no more than 50%
◼ But it could be that all the 1s are in the unknown area at the end
◼ In that case, the error is unbounded!
Fixup: DGIM method [Datar, Gionis, Indyk, Motwani]
◼ Idea: Instead of summarizing fixed-length blocks, summarize blocks with a specific number of 1s:
▪ Let the block sizes (number of 1s) increase exponentially
◼ When there are few 1s in the window, block sizes stay small, so errors are small
[Figure: bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the last N bits marked as the window]

DGIM: Timestamps

DGIM: Buckets
◼ A bucket in the DGIM method is a record
consisting of:
▪ (A) The timestamp of its end [O(log N) bits]
▪ (B) The number of 1s between its beginning and
end [O(log log N) bits]

◼ Constraint on buckets:
Number of 1s must be a power of 2
▪ That explains the O(log log N) in (B) above
Representing a Stream by Buckets
◼ Either one or two buckets with the same power-of-2 number of 1s
◼ Buckets do not overlap in timestamps
◼ Buckets are sorted by size
▪ Earlier buckets are not smaller than later buckets
◼ Buckets disappear when their end-time is > N time units in the past

Example: Bucketized Stream
[Figure: bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the window of the last N bits. Buckets, from oldest to newest: at least 1 of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, 2 of size 1]

Three properties of buckets that are maintained:
- Either one or two buckets with the same power-of-2 number of 1s
- Buckets do not overlap in timestamps
- Buckets are sorted by size
Updating Buckets (1)
◼When a new bit comes in, drop the last
(oldest) bucket if its end-time is prior to N
time units before the current time
◼2 cases: Current bit is 0 or 1
◼If the current bit is 0:
no other changes are needed


Updating Buckets (2)
◼ If the current bit is 1:
▪ (1) Create a new bucket of size 1, for just this bit
▪ End timestamp = current time
▪ (2) If there are now three buckets of size 1,
combine the oldest two into a bucket of size 2
▪ (3) If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
▪ (4) And so on …


Example: Updating Buckets
(The bucket colors from the slide are not reproduced in the bit strings below.)

Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
Bit of value 1 arrives
0010101100010110101010101010110101010101011101010101110101000101100101
Two orange buckets get merged into a yellow bucket
0010101100010110101010101010110101010101011101010101110101000101100101
Next bit 1 arrives, new orange bucket is created, then 0 comes, then 1:
0101100010110101010101010110101010101011101010101110101000101100101101
Buckets get merged…
0101100010110101010101010110101010101011101010101110101000101100101101
State of the buckets after merging
0101100010110101010101010110101010101011101010101110101000101100101101
How to Query?
◼ To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last, i.e., the oldest (note “size” means the number of 1s in the bucket)
2. Add half the size of the last bucket
◼ Remember: We do not know how many 1s of the last bucket are still within the wanted window
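The update rules and the query rule can be combined into a compact Python sketch. It is a simplified illustration under stated assumptions: the class name `DGIM` and its methods are made up, timestamps are kept as absolute integers rather than modulo N, and buckets are stored in a plain list with the newest bucket first.

```python
class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream."""

    def __init__(self, window_size: int):
        self.n = window_size
        self.time = 0        # absolute timestamp (the slides store it modulo N)
        self.buckets = []    # (end_timestamp, size) pairs, newest bucket first

    def add(self, bit: int) -> None:
        self.time += 1
        # Drop the oldest bucket once its end time falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit == 0:
            return           # a 0 needs no other changes
        self.buckets.insert(0, (self.time, 1))
        # While three buckets share a size, merge the oldest two of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            newer_end, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(newer_end, 2 * size)]
            i += 1

    def estimate(self, k=None) -> float:
        """Estimated number of 1s in the last k bits (k defaults to N)."""
        k = self.n if k is None else k
        cutoff = self.time - k
        sizes = [size for end, size in self.buckets if end > cutoff]
        if not sizes:
            return 0.0
        # All buckets but the oldest count fully; the oldest counts half.
        return sum(sizes[:-1]) + sizes[-1] / 2


dgim = DGIM(window_size=16)
for ch in "1001010110001011010101010101011010101010101110101010111010100010110010":
    dgim.add(int(ch))
print(dgim.buckets)     # (end timestamp, size) pairs, newest first
print(dgim.estimate())  # approximate number of 1s in the last 16 bits
```

Passing a smaller k to `estimate` applies the same rule to the buckets that overlap the last k bits. The estimate can be off by up to half the size of the oldest overlapping bucket.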


Extensions
◼ Can we use the same trick to answer queries: How many 1’s in the last k? where k < N?
▪ A: Find the earliest bucket B that overlaps with the last k bits. Number of 1s is the sum of sizes of more recent buckets + ½ size of B
[Figure: bit stream 1001010110001011010101010101011010101010101110101010111010100010110010 with the last k bits marked]
◼ Can we handle the case where the stream is not bits, but integers, and we want the sum of the last k elements?
Extensions
◼ c_i … estimated count for i-th bit
◼ Idea: Sum in each bucket is at most 2^b (unless bucket has only 1 integer)
[Figure: integer stream 2 5 7 1 3 8 4 6 7 9 1 3 7 6 5 3 5 7 1 3 3 1 2 2 6, shown again as the elements 3, 2, and 5 arrive one at a time; bucket sizes: 16 8 4 2 1]
Summary
◼Sampling a fixed proportion of a stream
▪ Sample size grows as the stream grows
◼Sampling a fixed-size sample
▪ Reservoir sampling
◼Counting the number of 1s in the last N
elements
▪ Exponentially increasing windows
▪ Extensions:
▪ Number of 1s in any last k (k < N) elements
▪ Sums of integers in the last N elements

