Chapter 2
[Topic map: Similarity Search, Hadoop File System, Spark, Hypothesis Testing, Streaming, Graph Analysis, MapReduce, Recommendation Systems, TensorFlow, Deep Learning]
What is Streaming?
Broadly:
RECORD IN → Process → RECORD GONE
Why Streaming?
(1) Direct: Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading takes too long)
● … are arriving rapidly (need rapidly updated "results")
Streaming constraints often translate into O(N) algorithms, or algorithms that take strictly N steps (one per record).
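As a rough illustration of the "record in, process, record gone" idea, here is a minimal sketch of a single-pass computation that keeps only a constant amount of state. The function name `running_stats` and the input values are made up for illustration; they are not from the slides.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, maximum) after each record, in a single pass."""
    count = 0
    total = 0.0
    maximum = float("-inf")
    for record in stream:               # RECORD IN
        count += 1                      # process: update O(1) state
        total += record
        maximum = max(maximum, record)
        yield count, total / count, maximum
        # RECORD GONE: only the three summary values are retained


# Illustrative input; any iterable of numbers works.
results = list(running_stats(iter([4, 1, 8, 5, 0, 2, 11, 3, 4])))
final_count, final_mean, final_max = results[-1]
```

Each record is touched exactly once, so the work is N steps for N records, and memory use does not grow with the stream.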
Streaming Topics
RECORD IN → Process for stream queries → RECORD GONE
• Standing Queries: Stored and permanently executing.
• Ad-Hoc: One-time questions; must store expected parts / summaries of streams.
Important difference from typical database management:
● Input is not controlled by system staff.
General Stream Processing Model
[Figure: general stream processing model]
Input stream (…, 4, 3, 11, 2, 0, 5, 8, 1, 4) → Processor → Output (Generalization, Summarization)
Queries to the processor: ad-hoc queries, standing queries
Processor storage: limited memory; archival storage (not suitable for fast queries)
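The model above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `StreamProcessor`, the threshold-based standing query, the window size, and the reading of the slide's stream as arriving right to left are all hypothetical choices, not part of the slides.

```python
from collections import deque


class StreamProcessor:
    """Toy processor: one standing query, one ad-hoc query, bounded memory."""

    def __init__(self, window_size: int, threshold: float):
        self.window = deque(maxlen=window_size)  # limited memory: last few records
        self.threshold = threshold
        self.alerts = []    # output produced by the standing query
        self.archive = []   # stand-in for archival storage; queries never read it

    def process(self, record: float) -> None:
        # Standing query: stored once, evaluated on every arriving record.
        if record > self.threshold:
            self.alerts.append(record)
        self.window.append(record)   # oldest record is dropped automatically
        self.archive.append(record)

    def recent_mean(self) -> float:
        # Ad-hoc query: a one-time question, answered only from what was kept.
        return sum(self.window) / len(self.window)


processor = StreamProcessor(window_size=3, threshold=7)
for record in [4, 1, 8, 5, 0, 2, 11, 3, 4]:   # assumed arrival order
    processor.process(record)
```

The ad-hoc query can only be answered because the processor chose in advance to keep a window of recent records; a question about older records would have to go to the slow archive.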
Sampling
Create a random sample for statistical analysis.
Keep? → limited memory → sometime in the future, run statistical analysis
Sampling: 2 Versions
Create a random sample for statistical analysis.
1. Simple Sampling: Individual records are what you wish to sample.
from random import random
record = stream.next()  # stream and memory are placeholders for the input and the limited memory
if random() <= 0.05:  # keep: true about 5% of the time
    memory.write(record)
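A runnable version of simple sampling might look like the sketch below. The function name `simple_sample`, the list standing in for limited memory, the fixed seed, and the synthetic tweets are assumptions made for illustration.

```python
import random
from typing import Iterable, List


def simple_sample(stream: Iterable[str], rate: float = 0.05, seed: int = 0) -> List[str]:
    """Keep each record independently with probability `rate`."""
    rng = random.Random(seed)   # seeded only so the example is repeatable
    memory = []                 # stand-in for the limited memory
    for record in stream:
        if rng.random() < rate:
            memory.append(record)
    return memory


# Synthetic stream of 10,000 records.
tweets = [f"tweet {i}" for i in range(10_000)]
sample = simple_sample(tweets)
```

The sample size is not fixed: it is roughly `rate` times the number of records seen so far, so memory still grows slowly with the stream.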
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but you wish to sample users)
record = stream.next()
if hash(record['user_id']) == 1:  # keep: user_id hashes to the chosen bucket
    memory.write(record)
Solution: instead of checking a random number, hash the attribute being sampled.
– streaming: only need to store hash functions; may be part of a standing query
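One way to make the hashing idea concrete is sketched below. The helper names `user_bucket` and `sample_users`, the choice of 20 buckets (so about 5% of users are kept), and the synthetic tweets are assumptions for illustration. The sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in hash of a string is salted per process and would select different users on each run.

```python
import hashlib
from typing import Dict, Iterable, List


def user_bucket(user_id: str, num_buckets: int) -> int:
    """Map a user id to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def sample_users(stream: Iterable[Dict[str, str]],
                 num_buckets: int = 20,
                 keep_bucket: int = 1) -> List[Dict[str, str]]:
    """Keep every record whose user falls in the chosen bucket."""
    memory = []   # stand-in for the limited memory
    for record in stream:
        if user_bucket(record["user_id"], num_buckets) == keep_bucket:
            memory.append(record)
    return memory


# Synthetic stream: 200 users with 5 tweets each.
tweets = [{"user_id": f"user{u}", "text": f"tweet {n}"}
          for u in range(200) for n in range(5)]
sample = sample_users(tweets)
kept_users = {t["user_id"] for t in sample}
```

Because the decision depends only on the user id, a sampled user contributes all of their tweets and an unsampled user contributes none, which is what per-user statistics need. No per-user state is stored, only the hash function.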
Counting Moments
Moments: