
Chapter 2

The document discusses streaming algorithms and data. It describes streaming as processing data records that come and go rapidly without being stored. Some key points are:
- Streaming algorithms are useful when data is too large to store, cannot be accessed repeatedly, or is arriving rapidly and needs updated results. Examples include search queries, satellite imagery, and click streams.
- The streaming model involves records entering a processor, being processed, and then leaving. It allows for both standing queries that are always running and ad-hoc one-time queries. Memory is limited.
- Sampling techniques in streaming include simple random sampling of individual records and hierarchical sampling of record attributes to create statistical samples for analysis.

Uploaded by

SANG VÕ NGỌC
Copyright
© All Rights Reserved

Streaming Algorithms: Data without a disk

Big Data Analytics, The Class


Goal: Generalizations
A model or summarization of the data.

Data Frameworks:
● Hadoop File System
● Spark
● Streaming
● MapReduce
● Tensorflow

Algorithms and Analyses:
● Similarity Search
● Hypothesis Testing
● Graph Analysis
● Recommendation Systems
● Deep Learning
What is Streaming?
Broadly:

RECORD IN -> Process -> RECORD GONE

Why Streaming?

(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading takes too long)
● … are rapidly arriving (need rapidly updated "results")


Examples:
● Google search queries
● Satellite imagery data
● Text messages, status updates
● Click streams


(2) Indirect: The constraints of streaming data force one toward solutions that are often efficient even when the data is stored.
Often translates into O(N) or strictly N algorithms.

(Diagram labels: Streaming, Approx, Random Sample, Distributed IO (MapReduce, Spark))

Streaming Topics

● General Stream Processing Model
● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion

RECORD IN -> Process for stream queries -> RECORD GONE

Standing Queries: Stored and permanently executing.
Ad-Hoc: One-time questions -- must store expected parts / summaries of streams


E.g. How would you handle:
What is the mean of values seen so far?
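
One possible answer to the question above, as a minimal sketch rather than the course's own solution: keep only a count and a sum, so each record can be discarded once it has been folded in. The class name `RunningMean` and the sample values are assumptions made for illustration.

```python
class RunningMean:
    """Standing query: mean of all values seen so far, in O(1) memory."""

    def __init__(self):
        self.count = 0    # number of records seen
        self.total = 0.0  # sum of the records seen

    def update(self, value):
        """Fold one record into the summary; the record can then be dropped."""
        self.count += 1
        self.total += value

    def mean(self):
        """Current answer; None before any record has arrived."""
        return self.total / self.count if self.count else None


stream = [4, 1, 8, 5, 0, 2, 11, 3, 4]  # illustrative values only
query = RunningMean()
for record in stream:
    query.update(record)
print(query.mean())
```

The two stored numbers are the "summary of the stream": the mean is available at any time without revisiting earlier records.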

Important difference from typical database management:
● Input is not controlled by system staff.
● Input timing/rate is often unknown, controlled by users.

General Stream Processing Model

Input stream (…, 4, 3, 11, 2, 0, 5, 8, 1, 4) -> Processor -> Output (Generalization, Summarization)

● ad-hoc queries -- one-time questions
● standing queries -- asked at all times
● limited memory
● archival storage -- not suitable for fast queries
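
The model above can be sketched roughly as follows. This is an illustration under assumptions, not the slides' implementation: the `StreamProcessor` class, its method names, and the choice of "maximum so far" as the standing query are all hypothetical. The bounded `deque` stands in for limited memory, and the order in which the sample values arrive is arbitrary.

```python
from collections import deque


class StreamProcessor:
    """Hypothetical processor: each record updates summaries, then is gone."""

    def __init__(self, window_size=3):
        self.count = 0
        self.maximum = None                      # standing query: max so far
        self.recent = deque(maxlen=window_size)  # limited memory: last few records

    def process(self, record):
        self.count += 1
        if self.maximum is None or record > self.maximum:
            self.maximum = record
        self.recent.append(record)  # the oldest stored record falls out automatically

    def standing_max(self):
        """Standing query: kept up to date on every record."""
        return self.maximum

    def adhoc_recent_mean(self):
        """Ad-hoc query: answerable only because recent records were stored."""
        return sum(self.recent) / len(self.recent) if self.recent else None


processor = StreamProcessor(window_size=3)
for record in [4, 1, 8, 5, 0, 2, 11, 3, 4]:
    processor.process(record)
print(processor.standing_max(), processor.adhoc_recent_mean())
```

An ad-hoc question about data that was neither stored nor summarized (for example, the mean of the first three records) cannot be answered after the fact, which is why the expected parts of the stream must be kept in advance.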

Sampling
Create a random sample for statistical analysis.

RECORD IN -> Process (Keep?) -> RECORD GONE

Kept records go to limited memory; sometime in the future, run statistical analysis on them.
Sampling: 2 Versions
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
2. Hierarchical Sampling: Sample an attribute of a record (e.g. records are tweets, but you wish to sample users).

(Figure: a stream of tweets.)
Sampling
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
    from random import random
    record = stream.next()
    if random() < .05:  # keep: true 5% of the time
        memory.write(record)

RECORD IN -> random() < .05? -> RECORD GONE

Kept records go to limited memory; sometime in the future, run statistical analysis on them.

Problem: records/rows often are not the units of analysis for statistical analyses.
E.g. user_ids for searches, tweets; location_ids for satellite images
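
A runnable version of the simple-sampling test, as a sketch under assumptions: the function name `simple_sample`, the `seed` parameter, and the use of a Python list as "memory" are illustrative choices, not part of the slides.

```python
import random


def simple_sample(stream, rate=0.05, seed=None):
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)  # seed is only for repeatable demos
    memory = []
    for record in stream:
        # rng.random() lies in [0, 1), so this test is true `rate` of the time
        if rng.random() < rate:
            memory.append(record)
    return memory


kept = simple_sample(range(100_000), rate=0.05, seed=0)
print(len(kept))  # close to 5,000, but the exact count depends on the seed
```

Note that the size of the sample grows with the stream; the rate bounds the fraction kept, not the memory used.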

Sampling
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but you wish to sample users)

Problem: the random() test decides tweet by tweet, so it keeps only some of each user's tweets instead of selecting whole users.

Solution: instead of checking a random digit, hash the attribute being sampled.
-- streaming: only need to store the hash function; may be part of a standing query

    record = stream.next()
    if hash(record['user_id']) == 1:  # keep
        memory.write(record)

How many buckets to hash into?
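
A sketch of hash-based sampling by user, under stated assumptions. Keeping `keep_buckets` out of `num_buckets` buckets retains roughly that fraction of users, so 1 of 20 buckets approximates a 5% sample of users; this is one way to approach the bucket question, not necessarily the answer given in the course. The names `bucket` and `sample_by_user`, the `< keep_buckets` test (in place of the slide's `== 1`), and the synthetic tweets are hypothetical. `hashlib` is used because Python's built-in `hash()` on strings is salted per process and would not select the same users across runs.

```python
import hashlib


def bucket(user_id, num_buckets):
    """Map a user_id to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_by_user(stream, num_buckets=20, keep_buckets=1):
    """Keep every record of the users whose bucket is below keep_buckets."""
    memory = []
    for record in stream:
        if bucket(record["user_id"], num_buckets) < keep_buckets:
            memory.append(record)
    return memory


# Hypothetical stream: 1,000 users with 5 tweets each, interleaved.
tweets = [{"user_id": u, "text": f"tweet {t}"} for t in range(5) for u in range(1000)]
kept = sample_by_user(tweets)
kept_users = {r["user_id"] for r in kept}
print(len(kept_users), len(kept))
```

Because the decision depends only on the user_id, a user is either fully in the sample or fully out of it, and no list of chosen users has to be stored.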

Counting Moments
Moments:
● Suppose $m_i$ is the count of distinct element $i$ in the data
● The kth moment of the stream is $\sum_i (m_i)^k$
● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
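
To make the definition concrete, here is an exact computation of the first three moments on a tiny stream. It is a baseline for checking intuition only: it stores one count per distinct element, so it does not respect the limited-memory constraint of the streaming model. The function name `moment` and the sample stream are assumptions for illustration.

```python
from collections import Counter


def moment(stream, k):
    """Exact kth moment: sum over distinct elements i of m_i ** k."""
    counts = Counter(stream)  # m_i for every distinct element i
    return sum(m ** k for m in counts.values())


stream = ["a", "b", "a", "c", "a", "b"]  # illustrative: m_a = 3, m_b = 2, m_c = 1
print(moment(stream, 0), moment(stream, 1), moment(stream, 2))  # 3 6 14
```

Here the 0th moment (3) is the number of distinct elements, the 1st moment (6) is the stream length, and the 2nd moment (9 + 4 + 1 = 14) grows as the counts become more uneven.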
