Distributed computing systems allow for the processing of massive amounts of data across computer clusters in a parallel and distributed manner. There are different types of distributed systems based on how data is processed, including batch processing systems like MapReduce which process all data at once and streaming systems which process data continuously as it arrives. Apache Beam provides a unified programming model for both batch and streaming data processing across different distributed computing frameworks like Spark, Flink, and Google Dataflow. It separates the processing logic from the specific runtime system to allow applications to work across different distributed environments.


Distributed Computing Systems

Shen Li @ IBM Research


Agenda
• Overview

• Stream Computing Systems

• Apache Beam (VLDB’15 Dataflow): A Unified Model for Batch and Stream Processing

• MapReduce (OSDI’04) presented by Huh & Cline

• Spark (NSDI’12) presented by Lin & Chang

• Spark Streaming (SOSP’13) presented by Murali & Zhang


Overview
• Motivation?
  - Handle massive data
  - Lower cost
  - Reduce complexity


Overview
• Applications?
Overview
• History

  [Timeline figure, 2004–2018: MapReduce, Dryad, System S, Pig, Hive, S4, MillWheel, Stream, ...]

  Trend? Why? See the problem?


Overview
• Categorization: based on granularity

  [Figure: a spectrum from Batch (e.g., MapReduce, Dryad) through Micro-batch to Streaming]


Overview

[Figure: a stream split into bundles/micro-batches of 100 events each]


Stream Computing Systems

• Fusion into PEs: communications within a PE become function calls
• Execution: a master node coordinates slave nodes
• Parallelism in a PE
Apache Beam
• A unified programming model for both batch and stream computing
applications.

[Stack figure] A user app written in Beam sits on top of the Beam SDK, which runs on interchangeable runners:
  - Dataflow Runner -> Google Dataflow
  - Flink Runner (data Artisans) -> Flink
  - Streams Runner -> IBM Streams
  - Spark Runner (Databricks) -> Spark
  - Apex Runner (DataTorrent) -> Apex
  - Gearpump Runner (Intel) -> Gearpump

• Why adopt Beam?

  - Beam may become a language standard for streaming applications
  - Applications no longer need to commit to a specific engine (see the sketch below)
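
To make the engine independence concrete, here is a minimal, hypothetical sketch using the Beam Java SDK (the word-count logic and class name are illustrative, not from the slides). The pipeline code never names an engine; the runner is chosen at launch time through pipeline options.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BeamWordCount {
  public static void main(String[] args) {
    // The engine is picked at launch time, e.g. --runner=FlinkRunner,
    // --runner=SparkRunner, or --runner=DataflowRunner; the code below is unchanged.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<String> words =
        p.apply(Create.of(Arrays.asList("beam", "spark", "beam", "flink")));

    // The same transform works for bounded (batch) and unbounded (streaming) inputs.
    PCollection<KV<String, Long>> counts = words.apply(Count.perElement());

    p.run().waitUntilFinish();
  }
}

Switching engines is then a matter of changing the --runner flag and the runner dependency on the classpath, not the application code.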
Beam API
• Separate data processing logic from runtime requirements. To write an app, users need to answer four questions:

• What is being computed? (the Computation, i.e., the transforms applied)

• Where in event time (when the event occurs) are windows created? (the Window)

• When in processing time (when the tuple is processed) is the computation carried out? (the Trigger)

• How do refinements of a window's result relate to each other? (Discard/Accumulate)

  [Figure: a pipeline from input to output annotated with Window, Trigger, Computation, and Discard/Accumulate]
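
As an illustrative sketch of how the four questions map onto the Beam Java windowing API (the input collection, key/value types, and durations are assumptions, not from the slides):

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "input" is an assumed PCollection<KV<String, Integer>> whose elements carry event-time timestamps.
static PCollection<KV<String, Integer>> windowedSums(PCollection<KV<String, Integer>> input) {
  return input
      // Where in event time: fixed 1-minute windows.
      .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
          // When in processing time: emit a result once the watermark passes the
          // end of the window, then re-fire for every late tuple.
          .triggering(AfterWatermark.pastEndOfWindow()
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardMinutes(10))
          // How refinements relate: later firings accumulate earlier results
          // (the alternative is discardingFiredPanes()).
          .accumulatingFiredPanes())
      // What is being computed: a per-key sum.
      .apply(Sum.integersPerKey());
}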


Primitive Transforms
Source      - Creates a stream

Window      - Defines windowing, triggering, and retracting schemes

Flatten     - Merges multiple input streams with the same tuple type into a single output stream

View        - Converts tuples/windows of the input stream into user-defined data structures, which can be consumed by ParDo as side inputs

ParDo       - Applies a user-defined DoFn to each tuple in the main input stream and emits one main output stream; it may also take multiple side input streams and generate multiple side output streams

GroupByKey  - Groups input values with the same key in the same window (pane) into the same output tuple, e.g., (k, v1), (k, v2) -> (k, [v1, v2])
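
A small hypothetical sketch of the two most commonly combined primitives, ParDo and GroupByKey (the element format and parsing logic are assumptions for illustration):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Parse "userId,pageUrl" lines with ParDo, then collect each user's pages with GroupByKey.
static PCollection<KV<String, Iterable<String>>> pagesPerUser(PCollection<String> lines) {
  PCollection<KV<String, String>> visits = lines.apply("ParseLine",
      ParDo.of(new DoFn<String, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String[] parts = c.element().split(",", 2);
          if (parts.length == 2) {                 // drop malformed lines
            c.output(KV.of(parts[0], parts[1]));
          }
        }
      }));

  // GroupByKey gathers all values for a key within the same window/pane.
  return visits.apply(GroupByKey.create());
}

On an unbounded stream, the input would first need a non-global window (set via the Window transform) before GroupByKey can fire.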
Window Model Comparison
Discrete Windows
• Each window belongs to a fixed interval in event time
• After creation, a window never moves
• Mobility is achieved by creating, discarding, and merging windows

Continuous Windows
• Maintains a single window that moves along the time axis
• Mobility is achieved by receiving and evicting tuples

[Figure: discrete windows laid out along event time vs. a single continuous window sliding along processing time]
Pro? Con?
Lateness
• Time concepts:
1. Event Time: the time when the event occurs, recorded by the timestamp in the tuple

2. Processing Time: the time when the tuple gets processed at the operator in the pipeline

3. Low Watermark: a local estimate of an operator's progress in event time

• It is up to the app's source operator and the runner to design the watermark algorithm. Usually, the watermark at an operator is the minimum watermark of all upstream operators.
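
For concreteness, one hypothetical way event-time timestamps get attached in the Beam Java SDK (the LogEvent type and its fields are assumptions for illustration; coder registration is omitted). The source and runner then derive the watermark from these timestamps:

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

// Hypothetical event type; eventTimeMillis records when the event actually occurred.
class LogEvent implements java.io.Serializable {
  long eventTimeMillis;
  String payload;
}

// Event time is taken from the tuple itself, independent of when it is processed.
static PCollection<LogEvent> stampEventTime(PCollection<LogEvent> events) {
  return events.apply(
      WithTimestamps.of((LogEvent e) -> new Instant(e.eventTimeMillis)));
}

Tuples whose attached timestamp is older than the current watermark are exactly the late arrivals discussed next.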
Lateness
• What is late arrival?

- Tuples that arrive with timestamps older than the watermark are considered late arrivals

  [Figure: tuples 1-7 plotted against event time (x-axis) and processing time (y-axis); tuples that land behind the watermark line are late]
Join Example
Goal: jointly process these data streams

• The WindowFn needs to identify the tuples that fall in the target window

[Figure: the side input stream (tuples 1-6) passes through Window and View to become a side-input view; the main input stream (tuples a-f) passes through Window, GroupByKey, and ParDo, which consumes the side input; tuples are partitioned into Window1, Window2, and Window3]
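
A hypothetical Beam Java sketch of this windowed side-input join (the stream names, element types, and window size are illustrative assumptions, not from the slide):

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Main input: (key, value) tuples. Side input: (key, label) tuples.
// Both streams are windowed the same way so the runner can match each
// main-input window with the corresponding side-input window.
static PCollection<String> windowedJoin(
    PCollection<KV<String, String>> mainInput,
    PCollection<KV<String, String>> sideSource) {

  Window<KV<String, String>> intoMinutes =
      Window.into(FixedWindows.of(Duration.standardMinutes(1)));

  // View: materialize the windowed side stream as a per-window map
  // (assumes at most one side tuple per key per window; otherwise use View.asMultimap()).
  PCollectionView<Map<String, String>> sideView =
      sideSource.apply("WindowSide", intoMinutes).apply(View.asMap());

  return mainInput
      .apply("WindowMain", intoMinutes)
      // GBK: gather each key's values within the window.
      .apply(GroupByKey.create())
      // ParDo: read the side input while emitting joined results.
      .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String label = c.sideInput(sideView).get(c.element().getKey());
          c.output(c.element().getKey() + " -> " + label + " : " + c.element().getValue());
        }
      }).withSideInputs(sideView));
}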
