Module 04 Spark2x - In-memory Distributed Computing Engine
FusionInsight Spark2x
www.huawei.com
Contents
1. Spark Overview
Spark Introduction
Application Scenarios
Batch processing can be used for extracting, transforming, and
loading (ETL).
Machine learning can be used to automatically determine
whether comments of Taobao customers are positive or
negative.
Interactive analysis can be used to query the Hive data
warehouse.
Stream processing can be used for real-time businesses such as
page-click stream analysis, recommendation systems, and
public opinion analysis.
Spark Highlights
Light: the Spark core code has 30,000 lines.
Fast: latency for small datasets reaches the sub-second level.
Flexible: Spark offers different levels of flexibility.
Smart: Spark smartly leverages existing big data components.
Spark Ecosystem
[Diagram: the Spark ecosystem, with applications layered on top of the Spark components.]
Spark vs MapReduce (1)
[Diagram: MapReduce vs. Spark for iterative queries. With MapReduce, each query (Query 1, Query 2) reads the input from HDFS again before producing its result. With Spark, the input is read once ("one-time processing") and Query 1 and Query 2 produce Result 1 and Result 2 from data held in memory.]
Spark vs MapReduce (2)
[Chart: performance comparison between Hadoop MapReduce and Spark.]
Contents
1. Spark Overview
Spark System Architecture
Spark Core
Core Concepts of Spark - RDD
Resilient Distributed Datasets (RDDs) are resilient, read-only, partitioned datasets distributed across the cluster.
RDDs are stored in memory by default and are written to disks when the memory is
insufficient.
RDD data is stored in the cluster as partitions.
RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery
when data loss occurs.
[Diagram: lines such as "Hello Spark", "Hello Hadoop", and "China Mobile" are read from HDFS into RDD1 in the Spark cluster, transformed into RDD2 ("Hello, Spark", "Hello, Hadoop", "China, Mobile"), and finally written to external storage.]
RDD Dependencies
[Diagram: RDD dependencies. Operators such as map and filter produce narrow dependencies, whereas groupByKey produces a wide (shuffle) dependency.]
Stage Division of RDD
[Diagram: stage division of a DAG with RDDs A–G connected by map, groupBy, and join operations; shuffle (wide) dependencies mark the boundaries between stages such as Stage 1.]
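As a hedged sketch of how dependencies drive stage division (assuming a SparkContext sc), narrow operators stay within a stage while a shuffle operator starts a new one:

```scala
// map and filter create narrow dependencies: no shuffle, same stage.
val words   = sc.parallelize(Seq("an", "apple", "an", "orange"))
val pairs   = words.map(w => (w, 1)).filter(_._1.nonEmpty)
// groupByKey creates a wide dependency: a shuffle marks a stage boundary.
val grouped = pairs.groupByKey()
// toDebugString prints the lineage, showing where the shuffle splits the stages.
println(grouped.toDebugString)
```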
RDD Operators
Transformation
Transformation operators generate a new RDD from one or more existing RDDs. Transformations are lazy: they trigger a job only when an action operator is invoked.
Typical operators: map, flatMap, filter, and reduceByKey
Action
A job is immediately started when action operators are invoked.
Typical operators: take, count, and saveAsTextFile
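A minimal sketch of the lazy-evaluation behavior described above, assuming a SparkContext sc and a hypothetical HDFS path:

```scala
// Transformations only build the lineage; nothing runs yet.
val lines    = sc.textFile("hdfs://ns1/input/words.txt")
val filtered = lines.filter(_.contains("Spark"))
// Actions launch jobs: take and count each trigger execution.
val first5 = filtered.take(5)
println(filtered.count())
```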
Major Roles of Spark (1)
Driver
Responsible for the application business logic and operation
planning (DAG).
ApplicationMaster
Manages application resources and applies for resources from
ResourceManager as needed.
Client
Demand side that submits requirements (applications).
Major Roles of Spark (2)
ResourceManager
Resource management department that centrally schedules and
distributes resources in the entire cluster
NodeManager
Resource management of the current node
Executor
Actual executor of tasks. An application is split into tasks that are computed by multiple executors.
Spark on Yarn - Client Operation Process
[Diagram: Spark on Yarn-client. The Driver and its YarnClientSchedulerBackend run on the client side; ResourceManager, the ApplicationMaster (ExecutorLauncher), NodeManagers, Containers, and Executors (Cache, Task) run in the cluster.]
1. Submit an application.
2. Submit the ApplicationMaster.
3. Apply for a container.
4. Start the container.
5. Schedule tasks.
Spark on Yarn - Cluster Operation Process
[Diagram: Spark on Yarn-cluster. The ApplicationMaster (including the Driver, DAGScheduler, and YarnClusterScheduler) runs in a container on a NodeManager, and Executors (Cache, Task) run in containers on other NodeManagers.]
1. Submit an application.
2. Allocate resources for the application.
3. Apply for Executors from ResourceManager.
4. The Driver assigns tasks.
5. Executors report task statuses.
Differences Between Yarn-Client and Yarn-Cluster
Differences
Differences between Yarn-client and Yarn-cluster lie in
ApplicationMaster.
Yarn-client is suitable for testing, whereas Yarn-cluster is suitable
for production.
In Yarn-client mode, if the task submission node goes down, the entire task fails; in Yarn-cluster mode, such a failure does not affect the task.
Typical Case - WordCount
[Diagram: WordCount data flow. The input lines "An apple", "A pair of shoes", and "Orange apple" are read from HDFS, split into the words An, apple, A, pair, of, shoes, Orange, and apple, mapped to (word, 1) pairs, reduced by key to (An, 1), (A, 1), (apple, 2), (pair, 1), (of, 1), (shoes, 1), (Orange, 1), and written back to HDFS.]
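The same flow expressed as a small, self-contained Spark application (a sketch; the application name and HDFS paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("hdfs://ns1/input/words.txt")        // read lines from HDFS
      .flatMap(_.split("\\s+"))                      // split lines into words
      .map(word => (word, 1))                        // emit (word, 1) pairs
      .reduceByKey(_ + _)                            // sum the counts per word
      .saveAsTextFile("hdfs://ns1/output/wordcount") // write results back to HDFS

    spark.stop()
  }
}
```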
Contents
1. Spark Overview
Spark SQL Overview
Spark SQL is the module used in Spark for structured data
processing. In Spark applications, you can seamlessly use SQL
statements or DataFrame APIs to query structured data.
[Diagram: the Catalyst pipeline. A DataFrame/Dataset query is parsed into an Unresolved Logical Plan, resolved against the Catalog into a Logical Plan, rewritten into an Optimized Logical Plan, and expanded into candidate Physical Plans, from which the Cost Model selects the Physical Plan that is executed as RDDs.]
Introduction to Dataset
A Dataset is a strongly typed collection of objects in a particular domain that can be transformed in parallel using functional or relational operations.
A Dataset is represented by a Catalyst logical execution plan. Its data is stored in an encoded binary form, so operations such as sort, filter, and shuffle can be performed without deserialization.
A Dataset is lazy and triggers computation only when an action operation is performed. At that point, Spark uses the query optimizer to optimize the logical plan and generate an efficient, parallel, distributed physical plan.
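A minimal Dataset sketch of these points; the Person case class and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DatasetSketch").getOrCreate()
import spark.implicits._

// A strongly typed Dataset[Person].
val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
// Lazy: this only builds a Catalyst logical plan.
val adults = ds.filter(p => p.age >= 30)
// Action: the optimized, parallel physical plan runs now.
println(adults.count())
```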
Introduction to DataFrame
A DataFrame is a Dataset organized into named columns.
In Spark 2.x, DataFrame is simply an alias for Dataset[Row].
[Diagram: comparison of an RDD[Person] with a DataFrame holding the same data in named columns.]
RDD, DataFrame, and Datasets (1)
RDD:
Advantages: type-safe and object-oriented.
Disadvantages: high serialization and deserialization overhead; high GC overhead caused by frequent object creation and destruction.
DataFrame:
Advantages: schema information reduces serialization and deserialization overhead.
Disadvantages: not object-oriented; no compile-time type checking.
RDD, DataFrame, and Datasets (2)
Characteristics of Dataset:
Fast: in most scenarios, performance is better than that of RDDs. Encoders outperform Kryo or Java serialization and avoid unnecessary format conversions.
Type-safe: like RDDs, functions are checked at compile time as far as possible.
Dataset, DataFrame, and RDD can be converted to each other.
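A sketch of converting between the three abstractions, reusing the hypothetical Person case class from the Dataset example:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ConversionSketch").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
val df  = rdd.toDF()    // RDD -> DataFrame: schema is inferred from the case class
val ds  = df.as[Person] // DataFrame -> Dataset[Person]: adds compile-time typing
val back = ds.rdd       // Dataset -> RDD[Person]
```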
Spark SQL and Hive
Differences
The execution engine of Spark SQL is Spark Core, and the default execution
engine of Hive is MapReduce.
Spark SQL typically executes 10 to 100 times faster than Hive.
Spark SQL does not support buckets, but Hive does.
Dependencies
Spark SQL depends on the metadata of Hive.
Spark SQL is compatible with most syntax and functions of Hive.
Spark SQL can use the custom functions of Hive.
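A hedged sketch of how Spark SQL reuses Hive metadata: enabling Hive support lets Spark query Hive tables directly (the table and column names below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLOnHive")
  .enableHiveSupport()   // reuse the Hive metastore for table metadata
  .getOrCreate()

// Query a hypothetical Hive table through Spark Core instead of MapReduce.
val df = spark.sql("SELECT dept, COUNT(*) AS cnt FROM employee GROUP BY dept")
df.show()
```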
Contents
1. Spark Overview
Structured Streaming Overview (1)
Structured Streaming is a stream processing engine built on the Spark SQL engine. You can express a streaming computation in the same way you would express a batch computation on static data. As streaming data keeps arriving, Spark SQL processes it incrementally and continuously, and updates the result table.
Structured Streaming Overview (2)
Programming Model for Structured Streaming
[Diagram: Structured Streaming programming model. With a trigger of every 1 second, new data arriving at times 1, 2, and 3 is processed by the query, and the output is produced in complete mode after each trigger.]
Example Programming Model of Structured Streaming
[Diagram: streaming word count example. Words such as "cat", "dog", and "owl" arrive at times 1, 2, and 3, and the result table of word counts is updated after each trigger.]
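A minimal Structured Streaming word count in the spirit of the example above (a sketch; the socket source, host, and port are hypothetical and used only for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Read an unbounded stream of lines from a hypothetical socket source.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split lines into words and count them incrementally.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Complete mode: the full result table is rewritten after every trigger.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```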
Contents
1. Spark Overview
Overview of Spark Streaming
[Diagram: Spark Streaming receives data streams from sources such as HDFS and Kafka and writes results to sinks such as Kafka, HDFS, and databases.]
Mini-Batch Processing of Spark Streaming
Spark Streaming programming is based on DStreams, which decompose a streaming computation into a series of short batch jobs.
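A DStream-based word count sketch illustrating the mini-batch model; the 1-second batch interval and the socket source are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamWordCount")
// Every 1 second of input becomes one short batch job.
val ssc = new StreamingContext(conf, Seconds(1))

val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```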
Fault Tolerance Mechanism of Spark Streaming
Spark Streaming performs computation based on RDDs. If some partitions of an RDD are lost, they can be recovered using the RDD lineage mechanism.
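Lineage recovery can be complemented by checkpointing, which persists DStream data and metadata to HDFS so that long lineages do not have to be fully recomputed. A one-line sketch, reusing the StreamingContext ssc from the previous example (the checkpoint directory is hypothetical):

```scala
ssc.checkpoint("hdfs://ns1/spark/streaming-checkpoint")
```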
Contents
1. Spark Overview
Spark WebUI
Permanent Processes of Spark
JDBCServer
JDBCServer is a permanent Spark application that provides the Java
database connectivity (JDBC) service.
Users can connect to JDBCServer to execute SQL statements by running
Beeline or JDBC scripts.
JDBCServer is deployed in active/standby mode, so there is no single point of failure.
JobHistory
JobHistory provides the HistoryServer page, which displays the historical execution information of Spark applications.
JobHistory is deployed in two-node load-sharing mode, so there is no single point of failure.
Spark and Other Components
In the FusionInsight cluster, Spark interacts with the following
components:
HDFS: Spark reads or writes data in the HDFS (mandatory).
Yarn: Yarn schedules and manages resources to support the running of
Spark tasks (mandatory).
Hive: Spark SQL shares metadata and data files with Hive (mandatory).
ZooKeeper: HA of JDBCServer depends on the coordination of ZooKeeper
(mandatory).
Kafka: Spark can receive data streams sent by Kafka (optional).
HBase: Spark can perform operations on HBase tables (optional).
Summary
The background, application scenarios, and characteristics of
Spark are briefly introduced.
Basic concepts, technical architecture, task running processes,
Spark on Yarn, and application scheduling of Spark are
introduced.
Integration of Spark in FusionInsight HD is introduced.
Quiz
1. What are the characteristics of Spark?
Quiz
1. RDD operators are classified into: ________ and _________.
More Information
Download training materials:
http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807
eLearning course:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term10000254 50&id=Node1000011796
Outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en
Thank You
www.huawei.com