Chapter 2

The document discusses streaming algorithms and data. It describes streaming as processing data records that come and go rapidly without being stored. Some key points are:
- Streaming algorithms are useful when data is too large to store, cannot be accessed repeatedly, or is arriving rapidly and needs updated results. Examples include search queries, satellite imagery, and click streams.
- The streaming model involves records entering a processor, being processed, and then leaving. It allows for both standing queries that are always running and ad-hoc one-time queries. Memory is limited.
- Sampling techniques in streaming include simple random sampling of individual records and hierarchical sampling of record attributes to create statistical samples for analysis.

Uploaded by

SANG VÕ NGỌC


Streaming Algorithms:

Data without a disk

Big Data Analytics, The Class

Goal: Generalizations
A model or summarization of the data.

Data Frameworks: Hadoop File System, Spark, Streaming, MapReduce, TensorFlow

Algorithms and Analyses: Similarity Search, Hypothesis Testing, Graph Analysis, Recommendation Systems, Deep Learning

What is Streaming?
Broadly:

RECORD IN → Process → RECORD GONE

Why Streaming?

(1) Direct: Often, data …

● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading takes too long)
● … are rapidly arriving (need rapidly updated "results")

Examples: Google search queries; satellite imagery data; text messages and status updates; click streams

(2) Indirect: The constraints for streaming data force one to solutions that are often efficient even when storing data.

Often translates into O(N) or strictly N algorithms.

[Diagram labels: Streaming, Approx, Random Sample, Distributed IO (MapReduce, Spark)]

Streaming Topics

● General Stream Processing Model

● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion

RECORD IN → Process for stream queries → RECORD GONE

Standing Queries: Stored and permanently executing.

Ad-Hoc: One-time questions -- must store expected parts / summaries of streams

Important difference from typical database management:

● Input is not controlled by system staff.
● Input timing/rate is often unknown, controlled by users.

E.g. How would you handle:

What is the mean of values seen so far?
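
One possible answer to the question above, as a minimal sketch: keep only a count and a running sum, which is enough to report the mean at any moment without storing the records themselves. The class name `RunningMean` and its method names are illustrative and do not come from the slides; the input values are just sample numbers.

```python
class RunningMean:
    """Standing-query state for 'mean of values seen so far' (illustrative name)."""

    def __init__(self):
        self.count = 0    # number of records seen so far
        self.total = 0.0  # sum of the values seen so far

    def update(self, value):
        # Called once per record; the record itself is not kept.
        self.count += 1
        self.total += value

    def mean(self):
        if self.count == 0:
            raise ValueError("no records seen yet")
        return self.total / self.count


rm = RunningMean()
for value in [4, 1, 8, 5, 0, 2, 11, 3, 4]:  # sample values, assumed arrival order
    rm.update(value)
print(round(rm.mean(), 3))  # 38 / 9, prints 4.222
```

The state is two numbers regardless of stream length, which is what makes this usable as a standing query under limited memory.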

General Stream Processing Model

Input stream (…, 4, 3, 11, 2, 0, 5, 8, 1, 4) → Processor → Output (Generalization, Summarization)

The processor works with:

● ad-hoc queries -- one-time questions
● standing queries -- asked at all times
● limited memory
● archival storage -- not suitable for fast queries
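
The model above can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's implementation: `StreamProcessor`, `standing_max`, and `ad_hoc_recent_mean` are hypothetical names, the standing query is assumed to be "maximum so far", the limited memory is modeled as a fixed-size window of recent records, and the stream is assumed to arrive right to left as drawn.

```python
from collections import deque


class StreamProcessor:
    """Toy processor: one standing query plus a bounded buffer for ad-hoc queries."""

    def __init__(self, window_size=5):
        self.maximum = None                      # state of the standing query
        self.recent = deque(maxlen=window_size)  # limited memory: last few records

    def process(self, record):
        # Update the standing query, then remember the record only briefly.
        if self.maximum is None or record > self.maximum:
            self.maximum = record
        self.recent.append(record)  # the oldest record is dropped when full

    def standing_max(self):
        # Standing query: always up to date, answered from stored state.
        return self.maximum

    def ad_hoc_recent_mean(self):
        # Ad-hoc query: can only be answered from what is still in memory.
        return sum(self.recent) / len(self.recent)


processor = StreamProcessor(window_size=5)
for record in [4, 1, 8, 5, 0, 2, 11, 3, 4]:  # assumed arrival order
    processor.process(record)

print(processor.standing_max())        # prints 11
print(processor.ad_hoc_recent_mean())  # mean of 0, 2, 11, 3, 4: prints 4.0
```

The ad-hoc answer covers only the last five records, which shows why ad-hoc queries depend on having stored the right parts or summaries of the stream in advance.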

Sampling
Create a random sample for statistical analysis.

RECORD IN → Process (Keep?) → RECORD GONE

Kept records go to limited memory; sometime in the future, run statistical analysis on them.

Sampling: 2 Versions
Create a random sample for statistical analysis.

1. Simple Sampling: Individual records are what you wish to sample.

2. Hierarchical Sampling: Sample an attribute of a record. (e.g. records are tweets, but we wish to sample users)

[Figure: a stream of "tweet!" records]

Sampling
1. Simple Sampling: Individual records are what you wish to sample.

from random import random

record = next(stream)  # read the next record from the stream
if random() < .05:  # keep: true 5% of the time
    memory.write(record)  # store the kept record in limited memory

RECORD IN → random() < .05? → RECORD GONE

Kept records go to limited memory; sometime in the future, run statistical analysis on them.

Problem: records/rows often are not units-of-analysis for statistical analyses

E.g. user_ids for searches, tweets; location_ids for satellite images

Sampling
2. Hierarchical Sampling: Sample an attribute of a record.
(e.g. records are tweets, but we wish to sample users)

Solution: instead of checking a random digit, hash the attribute being sampled.
– streaming: only need to store hash functions; may be part of standing query

record = next(stream)  # read the next record from the stream
if hash(record['user_id']) == 1:  # keep: hash maps each user_id to a bucket number
    memory.write(record)  # every record of a kept user ends up in the sample

How many buckets to hash into?
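
The slide leaves the bucket question open. One common answer, stated here as an assumption rather than as the slide's own: to keep a fraction a/b of users, hash each user_id into b buckets and keep the records whose bucket number is below a (for about 5% of users, 1 of 20 buckets). The sketch below illustrates this; the function names and the sample records are hypothetical, and SHA-256 is used only because Python's built-in `hash` for strings changes between runs.

```python
import hashlib


def bucket(user_id, num_buckets):
    """Map a user_id to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def keep(record, keep_buckets=1, num_buckets=20):
    """Keep a record if its user falls in one of the first keep_buckets buckets."""
    return bucket(record["user_id"], num_buckets) < keep_buckets


def sample_by_user(stream, keep_buckets=1, num_buckets=20):
    """Single pass over the stream; only the kept records are stored."""
    return [record for record in stream if keep(record, keep_buckets, num_buckets)]


# Hypothetical stream: 200 users, 3 tweets each, interleaved.
tweets = [
    {"user_id": f"user{u}", "text": f"tweet {t}"}
    for t in range(3)
    for u in range(200)
]
sample = sample_by_user(tweets, keep_buckets=1, num_buckets=20)
print(len({record["user_id"] for record in sample}), "users sampled")
```

Because the decision depends only on the user_id, a user is either entirely in the sample or entirely out of it, so per-user statistics on the sample are not distorted by missing tweets. Only the hash function and the two bucket parameters need to be stored.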

Streaming Topics

● General Stream Processing Model

● Sampling
● Counting Distinct Elements
● Filtering data according to a criterion

Counting Moments
Moments:

● Suppose m_i is the number of times the ith distinct element occurs in the data

● The kth moment of the stream is $\sum_{i} (m_i)^k$, summed over the distinct elements i

● 0th moment: count of distinct elements
● 1st moment: length of stream
● 2nd moment: sum of squares (measures unevenness; related to variance)
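
As a worked illustration of these definitions, the sketch below computes the moments exactly by counting every element. It is a reference computation only, not a streaming algorithm: it stores one counter per distinct element, which is exactly what limited memory rules out on a large stream. The function name `moment` and the sample values are illustrative.

```python
from collections import Counter


def moment(stream, k):
    """Exact kth moment: sum of (m_i ** k) over the distinct elements i."""
    counts = Counter(stream)  # m_i for every distinct element i
    return sum(m ** k for m in counts.values())


values = [4, 1, 8, 5, 0, 2, 11, 3, 4]  # 4 occurs twice, the other 7 values once

print(moment(values, 0))  # distinct elements: prints 8
print(moment(values, 1))  # length of stream: prints 9
print(moment(values, 2))  # sum of squares, 2**2 + 7 * 1**2: prints 11
```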
