0% found this document useful (0 votes)
6 views

Module 07 Streaming - Distributed Stream Computing Engine

The document provides a comprehensive overview of Streaming, a distributed real-time computing framework, detailing its system architecture, key features, and application scenarios such as real-time analysis and recommendation. It also introduces StreamCQL, a query language designed for stream processing, and compares Streaming with Spark Streaming in terms of performance and processing modes. Additionally, it covers basic concepts, message delivery policies, reliability mechanisms, and integration with other components like HDFS and Kafka.

Uploaded by

Lucas Oliveira
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Module 07 Streaming - Distributed Stream Computing Engine

The document provides a comprehensive overview of Streaming, a distributed real-time computing framework, detailing its system architecture, key features, and application scenarios such as real-time analysis and recommendation. It also introduces StreamCQL, a query language designed for stream processing, and compares Streaming with Spark Streaming in terms of performance and processing modes. Additionally, it covers basic concepts, message delivery policies, reliability mechanisms, and integration with other components like HDFS and Kafka.

Uploaded by

Lucas Oliveira
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 33

Technical Principles of

Streaming

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
 Upon completion of this course, you will be able to know:
 Real-time stream processing
 System architecture of Streaming
 Key features of Streaming
 Basic CQL concepts

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Contents
1. Introduction to Streaming

2. System Architecture

3. Key Features

4. Introduction to StreamCQL

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 3
Streaming Overview
 Streaming is a distributed real-time computing framework based on
the open source Storm.with the following features:
 Real-time response with low delay

 Data computing before storing

 Continuous query
No waiting; Results delivered in-flight
 Event-driven Event Alerts
Data Actions

Queries

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 4
Application Scenarios of Streaming
 Streaming is applicable to the following scenarios:
 Real-time analysis: real-time log processing and vehicle traffic analysis

 Real-time statistics: real-time website access statistics and sorting

 Real-time recommendation: real-time advertisement positioning and event


marketing

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Position of Streaming in FusionInsight
Application service layer
OpenAPI/SDK REST/SNMP/Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager
System
management
Hadoop API Plugin API
Service
governance
HIVE M/R Spark Streaming Flink
Hadoop LibrA
YARN/ ZooKeeper Security
management
HDFS/HBase

Streaming is a distributed real-time computing framework, widely used in real-


time business.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Comparison with Spark Streaming

Micro-batch processing by Spark Streaming Stream processing by Streaming

Spark Streaming Streaming


Task execution Instant execution logic startup and Execution logic startup before execution, and
mode reclamation upon completion logic persistence
Event processing Processing started upon accumulation of a Real-time event processing
mode certain number of event batches
Delay Second-level Millisecond-level
Throughput High (2 to 5 times that of Streaming) Average

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Comparison of Application Scenario
Real-time Performance

Streaming

Spark Streaming
Time
milliseconds seconds

 Streaming applies to delay-sensitive services.


 Spark Streaming applies to delay-insensitive services.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Contents
1. Introduction to Streaming

2. System Architecture

3. Key Features

4. Introduction to StreamCQL

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 9
Basic Concepts (1)
 Topology: a real-time application in Streaming.
 Nimbus: assigns resources and schedules tasks.
 Supervisor: receives tasks assigned by Nimbus, and starts/stops
Worker processes.
 Worker: runs component logic processes.
 Spout: generates source data flows in a topology.
 Bolt: receives and processes data in a topology.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 10
Basic Concepts (2)
 Task: a Spout or Bolt thread of Worker.
 Tuple: core data structure of Streaming. It is basic message delivery
unit in key-value pairs, which can be created and processed in a
distributed way.
 Stream: an infinite continuous Tuple sequence.
 Zookeeper: provides distributed collaboration services for processes.
Active/Standby Nimbus, Supervisor, and Worker register their
information in ZooKeeper. This enables Nimbus to detect the health
status of all roles.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 11
System Architecture
Submits a Monitors the heartbeat
topology. and assigns tasks.
Client Nimbus
ZooKeeper

Downloads a JAR package.

ZooKeeper

Obtains tasks.
Supervisor Supervisor
ZooKeeper
Starts Worker.
Worker
Executor
Worker
Executor Reports the heartbeat.
Worker

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Topology
 A topology is a directed acyclic graph (DAG) consisting of Spout (data source) and Bolt
(for logical processing). Spout and Bolt are connected through Stream Groupings.
 Service processing logic is encapsulated in topologies in Streaming.

Filters data.

Obtains stream data


from external data
sources Bolt A Bolt B

Spout
Triggers external
messages.
Bolt C

Persistent archiving

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 13
Worker
 Worker: A Worker is a JVM process and Worker Process
a topology runs in one or more Workers. Executor
Executor
A started Worker runs all the way Task

unless manually stopped. The number of Task

Worker processes depends on the Executor


Task Task
topology setting, and has no upper

limit. The number of Worker processes

that can be scheduled and started

depends on the number of slots configured in Supervisor.

 Executor: In a Worker process runs one or more Executor threads. Each Executor can run one
or more task instances of either Spout or Bolt.

 Task: a unit that processes data

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Task
Both Spout and Bolt in a topology support concurrent running. In the topology, you can
specify the number of concurrently running tasks on each node. Streaming assigns tasks
in the cluster to enable simultaneous calculation and enhance processing capability of
the system.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Message Delivery Policies
Grouping Mode Description
Delivers messages in groups to tasks of the target
fieldsGrouping (field grouping)
Bolt according to message hash values.
Delivers all messages to a fixed task of the target
globalGrouping (global grouping)
Bolt.
Delivers messages to a random task of the target
shuffleGrouping (shuffle grouping)
Bolt.
Delivers messages randomly to tasks if one or more
localOrShuffleGrouping (local or shuffle grouping) tasks exist in the target Bolt process, or delivers
messages in shuffle grouping mode.
allGrouping (broadcast grouping) Delivers messages to all tasks of the target Bolt.
Delivers messages to the task of the target Bolt
specified by the data producer. The task ID needs
directGrouping (direct grouping)
to be specified by using the emitDirect (taskID,
tuple) interface.
partialKeyGrouping (partial field grouping) Balanced field grouping.
noneGrouping (no grouping) Same as shuffle grouping.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 16
Contents
1. Introduction to Streaming

2. System Architecture

3. Key Features

4. Introduction to StreamCQL

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 17
Nimbus HA
ZooKeeper cluster

Streaming cluster
Active Standby
Nimbus Nimbus

Supervisor Supervisor Supervisor



worker worker worker worker worker worker

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Disaster Recovery
 Services are automatically migrated from faulty nodes to normal ones, preventing
service interruptions.

Node1 Node2 Node3


Topo1 Topo1 Topo1

Topo2 Topo3 Topo4

Zero
manual
operation
Node1 Node2 Node3
Topo1 Topo1 Topo1

Topo2 Topo3 Topo4

Topo1 Topo3

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Message Reliability
Reliability Processing
Description
Level Mechanism
This mode involves the highest throughput and applies to
At Most Once None
messages with low reliability requirements.
This mode involves low throughput and applies to messages
At Least Once Ack with high reliability problems. All data must be completely
processed.
Trident is a special transactional API provided by Storm and
Exactly Once Trident
involves the lowest throughput.

When a tuple is completely processed in Streaming, the tuple and all its derived tuples are successfully
processed. A tuple fails to be processed if the processing is not complete within the timeout period.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Ack Mechanism
 When Spout sends a tuple, it notifies Acker
that a new root message is generated. Acker Spout Ack6

will create a tuple tree and initializes the Ac


k1
checksum to 0.
Bolt1 Bolt2 Ack
 When Bolt sends a message, it sends an 2 Acker
anchor tuple to Acker to refresh the tuple Ack3
tree, and reports the result to Acker after k4
Bolt3 Bolt4 Ac
the message is sent successfully. If the
message is sent successfully, Acker refreshes
the checksum. If the message fails to be sent, Ack5

Acker immediately notifies Spout of the failure.


 When a tuple tree is completely processed (checksum = 0), Acker notifies Spout of the
result.
 Spout provides ack() and fail() functions to process Acker results. The fail() function
implements message resending logic.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Reliability Level Setting
 If not every message is required to be processed (allowing some
message loss), the reliability mechanism can be disabled to ensure
better performance.

 The reliability mechanism can be disabled in the following ways:


 Setting Config.TOPOLOGY_ACKERS to 0.

 Using Spout to send messages through interfaces that do not restrict


message IDs.

 Using Bolt to send messages in Unanchor mode.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Streaming and Other Components
 HDFS, HBase, Kafka…
Streaming
HDFS
Kafka
Topology1
Topic1 Redis
Topic2 Topology2
HBase
Topic N
Topology N Kafka

……

 External components such as HDFS and HBase are integrated to


facilitate real-time offline analysis.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Contents
1. Introduction to Streaming

2. System Architecture

3. Key Features

4. Introduction to StreamCQL

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 25
StreamCQL Overview
 StreamCQL(Stream Continuous Query Language) is a query language based on the
distributed stream processing platform based on and can be built on various stream
processing engines (mainly Apache Storm).

 Currently, most stream processing platforms provide only distributed processing


capabilities but involve complex service logic development and poor stream computing
capabilities. The development efficiency is low due to low reuse and repeated
development. StreamCQL provides various distributed stream computing functions,
including traditional SQL functions such as filtering and conversion, and new functions
such as stream-based time window computing, window data statistics, and stream
data splitting and merging.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 26
StreamCQL Easy to Develop
//Def Input:
public void open(Map conf,
TopologyContext context,
SpoutOutputCollector collector) {…} --Def Input:
public void nextTuple() {…} CREATE INPUT STREAM S1 …
public void ack(Object id) { …}
public void
declareOutputFields(OutputFieldsDeclar
er declarer) {…} --Def logic:
//Def logic: INSERT INTO STREAM filterstr SELECT *
public void execute(Tuple tuple, FROM S1 WHERE name="HUAWEI";
BasicOutputCollector collector) {…}
public void
declareOutputFields(OutputFieldsDeclar --Def Output:
er ofd) {…} CREATE OUTPUT STREAM S2…
//Def Output:
public void execute(Tuple tuple,
BasicOutputCollector collector) {…} --Def Topology:
public void SUBMIT APPLICATION test;
declareOutputFields(OutputFieldsDeclar
er ofd) {…} StreamCQL
//Def Topology:
public static void main(String[] args)
throws Exception {…}
Native Storm API

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 28
StreamCQL and Stream Processing
Platform

Service interface CQL IDE

Function
Join Aggregate Split Merge Pattern Matching

Stream Window

Engine
Other stream
Storm processing engines

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Summary
 This module describes the following information about
Streaming:
 Definition
 Application Scenarios
 Position of Streaming in FusionInsight
 System architecture of Streaming
 Key features of Streaming
 Introduction to StreamCQL

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 30
Quiz
1. How is message reliability guaranteed in Streaming?

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Quiz
1. Which of the following statements about Supervisor is CORRECT? ( )

A. Nimbus HA supports hot failover and eliminates single points of


failure.

B. Supervisor faults can be automatically recovered without affecting


running services.

C. Worker faults can be automatically recovered.

D. Tasks on a faulty node of the cluster will be re-assigned to other


normal nodes.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Quiz
2. Which of the following statements about Supervisor is

CORRECT? ( )

A. Supervisor assigns resources and schedules tasks.

B. Supervisor receives tasks assigned by Nimbus, and starts/stops


Worker processes.

C. Supervisor runs component logic processes.

D. Supervisor receives and processes data in a topology.

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 33
More Information
 Training materials:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
 Exam outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node10
00011797
 Mock exam:
 http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node10000
11798
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 34
Thank You
www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 35

You might also like