Distributed computing systems allow for the processing of massive amounts of data across computer clusters in a parallel and distributed manner. There are different types of distributed systems based on how data is processed, including batch processing systems like MapReduce which process all data at once and streaming systems which process data continuously as it arrives. Apache Beam provides a unified programming model for both batch and streaming data processing across different distributed computing frameworks like Spark, Flink, and Google Dataflow. It separates the processing logic from the specific runtime system to allow applications to work across different distributed environments.


Distributed Computing Systems

Shen Li @ IBM Research


Agenda
• Overview

• Stream Computing Systems

• Apache Beam (VLDB’15 Dataflow): A Unified Model for Batch and Stream Processing

• MapReduce (OSDI’04) presented by Huh & Cline

• Spark (NSDI’12) presented by Lin & Chang

• Spark Streaming (SOSP’13) presented by Murali & Zhang


Overview
• Motivation?
  - Handle massive data
  - Lower cost
  - Reduce complexity


Overview
• Applications?
Overview
• History

  [Timeline figure, 2004–2018: MapReduce, Dryad, System S, Pig, Hive, S4, MillWheel, Stream, ...]

  Trend? Why? See the problem?


Overview
• Categorization: based on granularity

  [Figure: a spectrum from Batch (e.g., MapReduce, Dryad) through Micro-batch to Streaming]


Overview

[Figure: a stream split into bundles/micro-batches of 100 events each]


Stream Computing Systems

• Fusion into PEs: communications within a PE become function calls
• Execution: a master node coordinates slave nodes
• Parallelism in a PE
Apache Beam
• A unified programming model for both batch and stream computing
applications.

[Stack figure] A user app written in Beam sits on top of the Beam SDK, which runs on interchangeable runners:
  - Dataflow Runner -> Google Dataflow
  - Flink Runner (data Artisans) -> Flink
  - Streams Runner -> IBM Streams
  - Spark Runner (Databricks) -> Spark
  - Apex Runner (DataTorrent) -> Apex
  - Gearpump Runner (Intel) -> Gearpump

• Why adopt Beam?

  - Beam may become a language standard for streaming applications
  - Applications no longer need to commit to a specific engine (see the sketch below)
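
To make the engine independence concrete, here is a minimal, hypothetical sketch using the Beam Java SDK (the word-count logic and class name are illustrative, not from the slides). The pipeline code never names an engine; the runner is chosen at launch time through pipeline options.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BeamWordCount {
  public static void main(String[] args) {
    // The engine is picked at launch time, e.g. --runner=FlinkRunner,
    // --runner=SparkRunner, or --runner=DataflowRunner; the code below is unchanged.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<String> words =
        p.apply(Create.of(Arrays.asList("beam", "spark", "beam", "flink")));

    // The same transform works for bounded (batch) and unbounded (streaming) inputs.
    PCollection<KV<String, Long>> counts = words.apply(Count.perElement());

    p.run().waitUntilFinish();
  }
}

Switching engines is then a matter of changing the --runner flag and the runner dependency on the classpath, not the application code.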
Beam API
• Separate data processing logic from runtime requirements. To write an app, users need to answer four questions:

• What is being computed? (the Computation, i.e., the transforms applied)

• Where in event time (when the event occurs) are windows created? (the Window)

• When in processing time (when the tuple is processed) is the computation carried out? (the Trigger)

• How do refinements of a window's result relate to each other? (Discard/Accumulate)

  [Figure: a pipeline from input to output annotated with Window, Trigger, Computation, and Discard/Accumulate]
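
As an illustrative sketch of how the four questions map onto the Beam Java windowing API (the input collection, key/value types, and durations are assumptions, not from the slides):

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "input" is an assumed PCollection<KV<String, Integer>> whose elements carry event-time timestamps.
static PCollection<KV<String, Integer>> windowedSums(PCollection<KV<String, Integer>> input) {
  return input
      // Where in event time: fixed 1-minute windows.
      .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
          // When in processing time: emit a result once the watermark passes the
          // end of the window, then re-fire for every late tuple.
          .triggering(AfterWatermark.pastEndOfWindow()
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardMinutes(10))
          // How refinements relate: later firings accumulate earlier results
          // (the alternative is discardingFiredPanes()).
          .accumulatingFiredPanes())
      // What is being computed: a per-key sum.
      .apply(Sum.integersPerKey());
}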


Primitive Transforms
Source      - Creates a stream

Window      - Defines windowing, triggering, and retracting schemes

Flatten     - Merges multiple input streams with the same tuple type into a single output stream

View        - Converts tuples/windows of the input stream into user-defined data structures, which can be consumed by ParDo as side inputs

ParDo       - Applies a user-defined DoFn to each tuple in the main input stream and emits one main output stream; it may also take multiple side input streams and generate multiple side output streams

GroupByKey  - Groups input values with the same key in the same window (pane) into the same output tuple, e.g., (k, v1), (k, v2) -> (k, [v1, v2])
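
A small hypothetical sketch of the two most commonly combined primitives, ParDo and GroupByKey (the element format and parsing logic are assumptions for illustration):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Parse "userId,pageUrl" lines with ParDo, then collect each user's pages with GroupByKey.
static PCollection<KV<String, Iterable<String>>> pagesPerUser(PCollection<String> lines) {
  PCollection<KV<String, String>> visits = lines.apply("ParseLine",
      ParDo.of(new DoFn<String, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String[] parts = c.element().split(",", 2);
          if (parts.length == 2) {                 // drop malformed lines
            c.output(KV.of(parts[0], parts[1]));
          }
        }
      }));

  // GroupByKey gathers all values for a key within the same window/pane.
  return visits.apply(GroupByKey.create());
}

On an unbounded stream, the input would first need a non-global window (set via the Window transform) before GroupByKey can fire.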
Window Model Comparison
Discrete Windows
• Each window belongs to a fixed interval in event time
• After creation, a window never moves
• Mobility is achieved by creating, discarding, and merging windows

Continuous Windows
• Maintains a single window that moves along the time axis
• Mobility is achieved by receiving and evicting tuples

[Figure: discrete windows laid out along event time vs. a single continuous window sliding along processing time]
Pro? Con?
Lateness
• Time concepts:
1. Event Time: the time when the event occurs, recorded by the timestamp in the tuple

2. Processing Time: the time when the tuple gets processed at the operator in the pipeline

3. Low Watermark: a local estimate of an operator's progress in event time

• It is up to the app's source operator and the runner to design the watermark algorithm. Usually, the watermark at an operator is the minimum watermark of all upstream operators.
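
For concreteness, one hypothetical way event-time timestamps get attached in the Beam Java SDK (the LogEvent type and its fields are assumptions for illustration; coder registration is omitted). The source and runner then derive the watermark from these timestamps:

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

// Hypothetical event type; eventTimeMillis records when the event actually occurred.
class LogEvent implements java.io.Serializable {
  long eventTimeMillis;
  String payload;
}

// Event time is taken from the tuple itself, independent of when it is processed.
static PCollection<LogEvent> stampEventTime(PCollection<LogEvent> events) {
  return events.apply(
      WithTimestamps.of((LogEvent e) -> new Instant(e.eventTimeMillis)));
}

Tuples whose attached timestamp is older than the current watermark are exactly the late arrivals discussed next.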
Lateness
• What is late arrival?

- Tuples that arrive with timestamps older than the watermark are considered late arrivals

  [Figure: tuples 1-7 plotted against event time (x-axis) and processing time (y-axis); tuples that land behind the watermark line are late]
Join Example
Goal: jointly process these data streams

• The WindowFn needs to identify the tuples that fall in the target window

[Figure: the side input stream (tuples 1-6) passes through Window and View to become a side-input view; the main input stream (tuples a-f) passes through Window, GroupByKey, and ParDo, which consumes the side input; tuples are partitioned into Window1, Window2, and Window3]
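
A hypothetical Beam Java sketch of this windowed side-input join (the stream names, element types, and window size are illustrative assumptions, not from the slide):

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

// Main input: (key, value) tuples. Side input: (key, label) tuples.
// Both streams are windowed the same way so the runner can match each
// main-input window with the corresponding side-input window.
static PCollection<String> windowedJoin(
    PCollection<KV<String, String>> mainInput,
    PCollection<KV<String, String>> sideSource) {

  Window<KV<String, String>> intoMinutes =
      Window.into(FixedWindows.of(Duration.standardMinutes(1)));

  // View: materialize the windowed side stream as a per-window map
  // (assumes at most one side tuple per key per window; otherwise use View.asMultimap()).
  PCollectionView<Map<String, String>> sideView =
      sideSource.apply("WindowSide", intoMinutes).apply(View.asMap());

  return mainInput
      .apply("WindowMain", intoMinutes)
      // GBK: gather each key's values within the window.
      .apply(GroupByKey.create())
      // ParDo: read the side input while emitting joined results.
      .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String label = c.sideInput(sideView).get(c.element().getKey());
          c.output(c.element().getKey() + " -> " + label + " : " + c.element().getValue());
        }
      }).withSideInputs(sideView));
}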
