Module 04 Spark2x - In-memory Distributed Computing Engine

The document outlines the technical principles of FusionInsight Spark2x, focusing on its architecture, integration, and various application scenarios. It highlights Spark's capabilities in big data processing, including batch processing, machine learning, and real-time stream processing. Additionally, it compares Spark with MapReduce, emphasizing its efficiency and flexibility in handling large datasets.


Technical Principles of

FusionInsight Spark2x

www.huawei.com

Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
 Upon completion of this course, you will be able to:
 Understand the application scenarios and highlights of Spark.
 Master the computing capabilities and technical framework of Spark.
 Master the integration of Spark components in FusionInsight HD.

Contents
1. Spark Overview

2. Spark Principles and Architecture

3. Spark Integration in FusionInsight HD

Spark Introduction

 Apache Spark was developed in the UC Berkeley AMP lab in 2009.
 Apache Spark is a fast, versatile, and scalable in-memory big data
computing engine.
 Apache Spark is a one-stop solution that integrates batch processing,
real-time stream processing, interactive query, graph computing, and
machine learning.

Application Scenarios
 Batch processing can be used for extracting, transforming, and
loading (ETL).
 Machine learning can be used to automatically determine
whether comments of Taobao customers are positive or
negative.
 Interactive analysis can be used to query the Hive data
warehouse.
 Stream processing can be used for real-time businesses such as
page-click stream analysis, recommendation systems, and
public opinion analysis.

Spark Highlights

 Light: the Spark core code has about 30,000 lines.
 Fast: latency for small datasets reaches the sub-second level.
 Flexible: Spark offers different levels of flexibility.
 Smart: Spark smartly leverages existing big data components.

Spark Ecosystem

(Figure: the Spark ecosystem, spanning applications, environments, and
data sources.)

Spark vs MapReduce (1)

(Figure: data sharing in MapReduce vs. data sharing in Spark. MapReduce
writes each iteration's output to HDFS and reads it back for the next
iteration or query, so data sharing goes through slow, replicated
storage. Spark keeps intermediate results in distributed memory, so
iterative jobs and repeated queries over the same input need only
one-time processing of the input.)

Spark vs MapReduce (2)
                      Hadoop        Spark         Spark
Data volume           102.5 TB      102 TB        1000 TB
Time required (min)   72            23            234
Number of nodes       2100          206           206
Number of cores       50,400        6592          6592
Rate                  1.4 TB/min    4.27 TB/min   4.27 TB/min
Rate/node             0.67 GB/min   20.7 GB/min   22.5 GB/min
Daytona Gray Sort     Yes           Yes           Yes

Contents
1. Spark Overview

2. Spark Principles and Architecture


 Spark Core
 Spark SQL and Dataset
 Spark Structured Streaming
 Spark Streaming

3. Spark Integration in FusionInsight HD

Spark System Architecture

(Figure: Spark system architecture. Spark SQL, Structured Streaming,
Spark Streaming, MLlib, GraphX, and SparkR sit on top of Spark Core,
which runs on the Standalone, Yarn, or Mesos cluster managers. The
figure distinguishes existing functions of Spark 1.0 from new functions
of Spark 2.0.)

Core Concepts of Spark - RDD
 Resilient Distributed Datasets (RDDs) are resilient, read-only,
partitioned distributed datasets.
 RDDs are stored in memory by default and are written to disks when the memory is
insufficient.
 RDD data is stored in the cluster as partitions.
 RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery
when data loss occurs.
(Figure: an RDD example. Lines such as "Hello Spark", "Hello Hadoop",
and "China Mobile" are read from HDFS into RDD1 in the Spark cluster; a
transformation produces RDD2, which is written to external storage.)
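The lineage mechanism can be sketched in a few lines of plain Python (an illustrative analogue, not Spark's implementation): each dataset records its parent and the function that derives it, so a lost partition is recomputed from its parent rather than restored from a replica.

```python
# Minimal plain-Python sketch of RDD lineage (not real Spark): each
# "RDD" remembers its parent and the deriving function, so a lost
# partition can be recomputed instead of being read from a replica.

class MiniRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions   # cached data; entries may be lost
        self.parent = parent            # lineage: where the data came from
        self.fn = fn                    # lineage: how to derive it

    def map(self, f):
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i):
        # Serve from cache when present; otherwise recompute via lineage.
        if self._partitions is not None and self._partitions[i] is not None:
            return self._partitions[i]
        return self.fn(self.parent.compute(i))

base = MiniRDD(partitions=[["Hello Spark"], ["Hello Hadoop"]])
words = base.map(str.upper)
words._partitions = [["HELLO SPARK"], None]   # partition 1 is "lost"

print(words.compute(1))  # → ['HELLO HADOOP'], recovered via lineage
```

Real Spark tracks the same information per RDD (parents plus a compute function) and replays only the lost partitions, which is why recovery is fast when dependencies are narrow.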
RDD Dependencies

 Narrow dependencies: map, filter, union, and join with co-partitioned
inputs.
 Wide dependencies: groupByKey, and join with inputs that are not
co-partitioned.
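The difference can be illustrated with a small plain-Python sketch (an analogue, not Spark code): a narrow dependency computes each child partition from a single parent partition, while a wide dependency such as groupByKey needs rows from every parent partition, forcing a shuffle.

```python
# Narrow vs. wide dependencies, sketched in plain Python (not Spark).

partitions = [[("a", 1), ("b", 1)], [("a", 2), ("c", 3)]]

# Narrow (map): each partition is transformed independently, no movement.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (groupByKey): re-partition by hash(key) so equal keys land
# together ...
def shuffle(parts, n):
    buckets = [[] for _ in range(n)]
    for part in parts:
        for k, v in part:
            buckets[hash(k) % n].append((k, v))
    return buckets

# ... then group within each post-shuffle partition.
grouped = []
for part in shuffle(mapped, 2):
    d = {}
    for k, v in part:
        d.setdefault(k, []).append(v)
    grouped.append(d)

merged = {k: v for d in grouped for k, v in d.items()}
print(merged["a"])  # → [10, 20]: values from both parents met after the shuffle
```

The shuffle step is exactly where Spark cuts the DAG into stages, since it requires data from all parent partitions before any child partition is complete.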
Stage Division of RDD

(Figure: stage division of RDDs A-G. The DAG is cut at wide dependencies
such as groupby and non-co-partitioned join; narrow dependencies such as
map and union stay within a stage, yielding Stage 1, Stage 2, and
Stage 3.)

RDD Operators
 Transformation
 Transformation operators are invoked to generate a new RDD
from one or more existing RDDs. Such an operator initiates a job
only when an Action operator is invoked.
 Typical operators: map, flatMap, filter, and reduceByKey

 Action
 A job is immediately started when action operators are invoked.
 Typical operators: take, count, and saveAsTextFile

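The lazy-transformation vs. eager-action contrast can be mimicked with Python generators (an analogue of the semantics, not PySpark code): building the pipeline touches no data, and only the final "action" drives execution.

```python
# Lazy "transformations" vs. eager "actions", mimicked with generators.

log = []  # records every element actually read from the source

def source():
    for x in [1, 2, 3, 4]:
        log.append(f"read {x}")
        yield x

# "Transformations" (filter + map): build a pipeline; nothing runs yet.
pipeline = (x * x for x in source() if x % 2 == 0)
assert log == []          # no data has been touched so far

# "Action" (count): forces the whole pipeline to execute.
count = sum(1 for _ in pipeline)
print(count, len(log))    # → 2 4: two squares produced, four reads done
```

This mirrors why `map`/`filter`/`reduceByKey` on an RDD return instantly, while `take`, `count`, or `saveAsTextFile` actually launch a job.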
Major Roles of Spark (1)
 Driver
 Responsible for the application business logic and operation
planning (DAG).

 ApplicationMaster
 Manages application resources and applies for resources from
ResourceManager as needed.

 Client
 Demand side that submits requirements (applications).

Major Roles of Spark (2)
 ResourceManager
 Resource management department that centrally schedules and
distributes resources in the entire cluster

 NodeManager
 Resource management of the current node

 Executor
 Actual executor of a task. An application is split for multiple
executors to compute.

Spark on Yarn - Client Operation Process
(Figure: Spark on Yarn-client operation process.)
1. The Driver (with YarnClientSchedulerBackend) submits an application
to ResourceManager.
2. ResourceManager submits the ApplicationMaster (ExecutorLauncher).
3. The ApplicationMaster applies for a container.
4. NodeManager starts the container, which runs an Executor.
5. The Driver schedules tasks on the Executors.
Spark on Yarn - Cluster Operation Process
(Figure: Spark on Yarn-cluster operation process.)
1. The client submits an application to ResourceManager.
2. ResourceManager allocates resources for the application and starts
the ApplicationMaster, which includes the Driver (DAGScheduler and
YarnClusterScheduler).
3. The ApplicationMaster applies for Executors from ResourceManager.
4. The Driver assigns tasks to the Executors.
5. The Executors report task statuses to the Driver.
Differences Between Yarn-Client and Yarn-Cluster
 Differences
 The differences between Yarn-client and Yarn-cluster lie in the
ApplicationMaster.
 Yarn-client is suitable for testing, whereas Yarn-cluster is suitable
for production.
 If the task submission node goes down in Yarn-client mode, the entire
task fails; in Yarn-cluster mode, such a failure does not affect the
task.

Typical Case - WordCount

 The WordCount pipeline: textFile -> flatMap -> map -> reduceByKey ->
saveAsTextFile.
(Figure: the lines "An apple", "A pair of shoes", and "Orange apple"
are read from HDFS; flatMap splits them into words; map emits
(word, 1) pairs; reduceByKey sums the pairs into counts such as
(apple, 2), (pair, 1), and (Orange, 1); saveAsTextFile writes the
result back to HDFS.)
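The same pipeline maps onto plain Python step by step (an analogue of the RDD operators, not PySpark code):

```python
# WordCount mirrored in plain Python, one step per RDD operator.

lines = ["An apple", "A pair of shoes", "Orange apple"]   # textFile

words = [w for line in lines for w in line.split()]       # flatMap
pairs = [(w, 1) for w in words]                           # map

counts = {}                                               # reduceByKey
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts["apple"])  # → 2
```

In Spark, only the final save (the action) triggers execution; everything before it just extends the lineage graph.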
Contents
1. Spark Overview

2. Spark Principles and Architecture


 Spark Core
 Spark SQL and Dataset
 Spark Structured Streaming
 Spark Streaming

3. Spark Integration in FusionInsight HD

Spark SQL Overview
 Spark SQL is the module used in Spark for structured data
processing. In Spark applications, you can seamlessly use SQL
statements or DataFrame APIs to query structured data.

(Figure: the Catalyst query-execution pipeline. A SQL AST, DataFrame,
or Dataset is parsed into an unresolved logical plan; analysis against
the catalog yields a logical plan; logical optimization produces an
optimized logical plan; physical planning generates candidate physical
plans, from which a cost model selects one; code generation finally
produces RDDs.)

Introduction to Dataset
 A dataset is a strongly typed collection of objects in a particular
domain that can be converted in parallel by a function or relationship
operation.
 A dataset is represented by a Catalyst logical execution plan, and the
data is stored in encoded binary form, and the sort, filter, and shuffle
operations can be performed without deserialization.
 A dataset is lazy and triggers computing only when an action
operation is performed. When an action operation is performed,
Spark uses the query optimizer to optimize the logical plan and
generate an efficient parallel distributed physical plan.

Introduction to DataFrame
 DataFrame is a dataset with specified column names.
DataFrame is a special case of Dataset[Row].

(Figure: an RDD[Person] of typed objects compared with a DataFrame of
named columns.)
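The distinction can be illustrated in plain Python (the `Person` type and the data here are made up for illustration; this is an analogue, not the Spark API): a Dataset holds typed objects, while a DataFrame holds generic rows plus a schema.

```python
# Dataset[Person] vs. DataFrame (= Dataset[Row]), plain-Python analogue.
from typing import NamedTuple

class Person(NamedTuple):   # "Dataset[Person]": typed objects, so a bad
    name: str               # field name is caught by tooling/compiler
    age: int

dataset = [Person("Ann", 30), Person("Bob", 25)]

# "DataFrame": untyped rows plus a separate schema of column names.
schema = ("name", "age")
dataframe = [("Ann", 30), ("Bob", 25)]

# Both answer the same query; the Dataset version is type-checked.
adults_ds = [p.name for p in dataset if p.age >= 30]
adults_df = [row[schema.index("name")] for row in dataframe
             if row[schema.index("age")] >= 30]

print(adults_ds, adults_df)  # → ['Ann'] ['Ann']
```

This is why a DataFrame misspelling like `row["age "]` only fails at runtime, while the Dataset equivalent fails before the job runs.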
RDD, DataFrame, and Datasets (1)
 RDD:
 Advantages: type-safe and object-oriented.
 Disadvantages: high performance overhead for serialization and
deserialization; high GC overhead due to frequent object creation
and deletion.

 DataFrame:
 Advantages: schema information reduces serialization and
deserialization overhead.
 Disadvantages: not object-oriented; not type-safe at compile time.

RDD, DataFrame, and Datasets (2)
 Characteristics of Dataset:
 Fast: in most scenarios, performance is superior to RDD. Encoders
outperform Kryo or Java serialization, avoiding unnecessary format
conversion.
 Type-safe: similar to RDD, functions are checked as far as possible
at compile time.
 Dataset, DataFrame, and RDD can be converted to each other.

 Dataset combines the advantages of RDD and DataFrame while avoiding
their disadvantages.

Spark SQL and Hive
 Differences
 The execution engine of Spark SQL is Spark Core, and the default execution
engine of Hive is MapReduce.
 The execution speed of Spark SQL is 10 to 100 times faster than Hive.
 Spark SQL does not support buckets, but Hive does.

 Dependencies
 Spark SQL depends on the metadata of Hive.
 Spark SQL is compatible with most syntax and functions of Hive.
 Spark SQL can use the custom functions of Hive.

Contents
1. Spark Overview

2. Spark Principles and Architecture


 Spark Core
 Spark SQL and Dataset
 Spark Structured Streaming
 Spark Streaming

3. Spark Integration in FusionInsight HD

Structured Streaming Overview (1)
 Structured Streaming is a stream-processing engine built on the
Spark SQL engine. You can write a streaming computation the same way
you would a query over static data. As streaming data keeps arriving,
Spark SQL processes it incrementally and continuously updates the
result set.

Structured Streaming Overview (2)

(Figure: a data stream viewed as an unbounded table. New data arriving
in the stream corresponds to new rows appended to the table.)

Programming Model for Structured
Streaming
(Figure: the Structured Streaming programming model with a trigger
every 1 second. At each trigger t = 1, 2, 3, the input covers all data
up to t; the query runs over it; and the result up to t is emitted in
complete mode.)

Example Programming Model of
Structured Streaming
(Figure: a word-count example of the model. Rows entering the unbounded
table per trigger: t=1: "cat dog", "dog dog"; t=2: "owl cat"; t=3:
"dog", "owl". Computing results in complete mode:
t=1: cat 1, dog 3
t=2: cat 2, dog 3, owl 1
t=3: cat 2, dog 4, owl 2)
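The word-count example can be replayed in plain Python (an analogue of complete-mode output, not the Structured Streaming API): each trigger feeds only the new rows into a running count, yet the emitted result always covers all data so far.

```python
# Incremental word count over micro-batches, plain-Python analogue of
# Structured Streaming's complete output mode.
from collections import Counter

batches = [["cat", "dog", "dog", "dog"],   # new rows up to t=1
           ["owl", "cat"],                 # new rows at t=2
           ["dog", "owl"]]                 # new rows at t=3

result = Counter()
for batch in batches:
    result.update(batch)   # only the new rows are processed per trigger

print(dict(result))  # → {'cat': 2, 'dog': 4, 'owl': 2}
```

The key property mirrored here is incrementality: the engine never rescans the whole unbounded table, it folds each batch of new rows into the standing result.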
Contents
1. Spark Overview

2. Spark Principles and Architecture


 Spark Core
 Spark SQL and Dataset
 Spark Structured Streaming
 Spark Streaming

3. Spark Integration in FusionInsight HD

Overview of Spark Streaming

 Spark Streaming is an extension of the Spark core API: a real-time
computing framework featuring scalability, high throughput, and fault
tolerance.
(Figure: Spark Streaming ingests data from sources such as Kafka and
HDFS and writes results to sinks such as Kafka, HDFS, and databases.)

Mini Batch Processing of Spark
Streaming
 Spark Streaming programming is based on DStream, which
decomposes streaming programming into a series of short batch jobs.

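Mini-batching can be sketched in plain Python (an analogue of DStream behavior, not the Spark Streaming API): the continuous stream is cut into fixed-size batches, and each batch is handled as a small job.

```python
# DStream-style mini-batching: slice a stream into short batches and
# run one small "job" per batch (plain-Python sketch, not Spark).
import itertools

def micro_batches(stream, batch_size):
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

stream = range(7)                                  # a stand-in source
jobs = [sum(b) for b in micro_batches(stream, 3)]  # one job per batch

print(jobs)  # → [3, 12, 6] for batches [0,1,2], [3,4,5], [6]
```

In Spark Streaming the batch boundary is time-based (the batch interval) rather than count-based, but the decomposition into a series of short batch jobs is the same idea.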
Fault Tolerance Mechanism of Spark
Streaming
 Spark Streaming performs computing based on RDDs. If some partitions
of an RDD are lost, they can be recovered using the RDD lineage
mechanism.

Contents
1. Spark Overview

2. Spark Principles and Architecture

3. Spark Integration in FusionInsight HD

Spark WebUI

Permanent Processes of Spark
 JDBCServer
 JDBCServer is a permanent Spark application that provides the Java
database connectivity (JDBC) service.
 Users can connect to JDBCServer to execute SQL statements by running
Beeline or JDBC scripts.
 JDBCServer is deployed in active/standby mode, and no single point of
failure occurs.
 JobHistory
 JobHistory provides the HistoryServer page, which displays historical
execution information of Spark applications.
 JobHistory is deployed in two-node load sharing mode, and no single
point of failure occurs.

Spark and Other Components
 In the FusionInsight cluster, Spark interacts with the following
components:
 HDFS: Spark reads or writes data in the HDFS (mandatory).
 Yarn: Yarn schedules and manages resources to support the running of
Spark tasks (mandatory).
 Hive: Spark SQL shares metadata and data files with Hive (mandatory).
 ZooKeeper: HA of JDBCServer depends on the coordination of ZooKeeper
(mandatory).
 Kafka: Spark can receive data streams sent by Kafka (optional).
 HBase: Spark can perform operations on HBase tables (optional).

Summary
 The background, application scenarios, and characteristics of
Spark are briefly introduced.
 Basic concepts, technical architecture, task running processes,
Spark on Yarn, and application scheduling of Spark are
introduced.
 Integration of Spark in FusionInsight HD is introduced.

Quiz
1. What are the characteristics of Spark?

2. What are the advantages of Spark in comparison with MapReduce?

3. What are the differences between wide dependencies and narrow
dependencies of Spark?

4. What are the application scenarios of Spark?

Quiz
1. RDD operators are classified into: ________ and _________.

2. The ___________ module is the core module of Spark.

3. RDD dependency types include ___________ and ___________.

More Information
 Download training materials:
 http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094
&courseId=Node1000011807
 eLearning course:
 http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term100002
5450&id=Node1000011796
 Outline:
 http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node100
0011797
 Authentication process:
 http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en

Thank You
www.huawei.com

