Module 04 Spark2x - In-memory Distributed Computing Engine
FusionInsight Spark2x
www.huawei.com
Contents
1. Spark Overview
Spark Introduction
Application Scenarios
Batch processing can be used for extracting, transforming, and
loading (ETL).
Machine learning can be used to automatically determine
whether comments of Taobao customers are positive or
negative.
Interactive analysis can be used to query the Hive data
warehouse.
Stream processing can be used for real-time businesses such as
page-click stream analysis, recommendation systems, and
public opinion analysis.
Spark Highlights
Light: the Spark core code has 30,000 lines.
Fast: latency for small datasets reaches the sub-second level.
Flexible: Spark offers different levels of flexibility.
Smart: Spark smartly leverages existing big data components.
Spark Ecosystem
[Diagram: the Spark ecosystem, with applications layered on top of the Spark components.]
Spark vs MapReduce (1)
[Diagram: MapReduce vs. Spark for iterative queries. With MapReduce, each query (Query 1, Query 2) reads the input from HDFS again before producing its result. With Spark, the input is read once ("one-time processing") and Query 1 and Query 2 produce Result 1 and Result 2 from data held in memory.]
Spark vs MapReduce (2)
[Chart: performance comparison between Hadoop MapReduce and Spark.]
Contents
1. Spark Overview
Spark System Architecture
Spark Core
Core Concepts of Spark - RDD
Resilient Distributed Datasets (RDDs) are resilient, read-only, partitioned datasets distributed across the cluster.
RDDs are stored in memory by default and are written to disks when the memory is
insufficient.
RDD data is stored in the cluster as partitions.
RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery
when data loss occurs.
[Diagram: lines such as "Hello Spark", "Hello Hadoop", and "China Mobile" are read from HDFS into RDD1 in the Spark cluster, transformed into RDD2 ("Hello, Spark", "Hello, Hadoop", "China, Mobile"), and finally written to external storage.]
RDD Dependencies
[Diagram: RDD dependencies. Operators such as map and filter produce narrow dependencies, whereas groupByKey produces a wide (shuffle) dependency.]
Stage Division of RDD
[Diagram: stage division of a DAG with RDDs A–G connected by map, groupBy, and join operations; shuffle (wide) dependencies mark the boundaries between stages such as Stage 1.]
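As a hedged sketch of how dependencies drive stage division (assuming a SparkContext sc), narrow operators stay within a stage while a shuffle operator starts a new one:

```scala
// map and filter create narrow dependencies: no shuffle, same stage.
val words   = sc.parallelize(Seq("an", "apple", "an", "orange"))
val pairs   = words.map(w => (w, 1)).filter(_._1.nonEmpty)
// groupByKey creates a wide dependency: a shuffle marks a stage boundary.
val grouped = pairs.groupByKey()
// toDebugString prints the lineage, showing where the shuffle splits the stages.
println(grouped.toDebugString)
```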
RDD Operators
Transformation
Transformation operators generate a new RDD from one or more existing RDDs. Transformations are lazy: they trigger a job only when an action operator is invoked.
Typical operators: map, flatMap, filter, and reduceByKey
Action
A job is immediately started when action operators are invoked.
Typical operators: take, count, and saveAsTextFile
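A minimal sketch of the lazy-evaluation behavior described above, assuming a SparkContext sc and a hypothetical HDFS path:

```scala
// Transformations only build the lineage; nothing runs yet.
val lines    = sc.textFile("hdfs://ns1/input/words.txt")
val filtered = lines.filter(_.contains("Spark"))
// Actions launch jobs: take and count each trigger execution.
val first5 = filtered.take(5)
println(filtered.count())
```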
Major Roles of Spark (1)
Driver
Responsible for the application business logic and operation
planning (DAG).
ApplicationMaster
Manages application resources and applies for resources from
ResourceManager as needed.
Client
Demand side that submits requirements (applications).
Major Roles of Spark (2)
ResourceManager
Resource management department that centrally schedules and
distributes resources in the entire cluster
NodeManager
Resource management of the current node
Executor
Actual executor of tasks. An application is split into tasks that are computed by multiple executors.
Spark on Yarn - Client Operation Process
[Diagram: Spark on Yarn-client. The Driver and its YarnClientSchedulerBackend run on the client side; ResourceManager, the ApplicationMaster (ExecutorLauncher), NodeManagers, Containers, and Executors (Cache, Task) run in the cluster.]
1. Submit an application.
2. Submit the ApplicationMaster.
3. Apply for a container.
4. Start the container.
5. Schedule tasks.
Spark on Yarn - Cluster Operation Process
[Diagram: Spark on Yarn-cluster. The ApplicationMaster (including the Driver, DAGScheduler, and YarnClusterScheduler) runs in a container on a NodeManager, and Executors (Cache, Task) run in containers on other NodeManagers.]
1. Submit an application.
2. Allocate resources for the application.
3. Apply for Executors from ResourceManager.
4. The Driver assigns tasks.
5. Executors report task statuses.
Differences Between Yarn-Client and Yarn-Cluster
Differences
Differences between Yarn-client and Yarn-cluster lie in
ApplicationMaster.
Yarn-client is suitable for testing, whereas Yarn-cluster is suitable
for production.
In Yarn-client mode, if the task submission node goes down, the entire task fails; in Yarn-cluster mode, such a failure does not affect the task.
Typical Case - WordCount
[Diagram: WordCount data flow. The input lines "An apple", "A pair of shoes", and "Orange apple" are read from HDFS, split into the words An, apple, A, pair, of, shoes, Orange, and apple, mapped to (word, 1) pairs, reduced by key to (An, 1), (A, 1), (apple, 2), (pair, 1), (of, 1), (shoes, 1), (Orange, 1), and written back to HDFS.]
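The same flow expressed as a small, self-contained Spark application (a sketch; the application name and HDFS paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile("hdfs://ns1/input/words.txt")        // read lines from HDFS
      .flatMap(_.split("\\s+"))                      // split lines into words
      .map(word => (word, 1))                        // emit (word, 1) pairs
      .reduceByKey(_ + _)                            // sum the counts per word
      .saveAsTextFile("hdfs://ns1/output/wordcount") // write results back to HDFS

    spark.stop()
  }
}
```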
Contents
1. Spark Overview
Spark SQL Overview
Spark SQL is the module used in Spark for structured data
processing. In Spark applications, you can seamlessly use SQL
statements or DataFrame APIs to query structured data.
[Diagram: the Catalyst pipeline. A DataFrame/Dataset query is parsed into an Unresolved Logical Plan, resolved against the Catalog into a Logical Plan, rewritten into an Optimized Logical Plan, and expanded into candidate Physical Plans, from which the Cost Model selects the Physical Plan that is executed as RDDs.]
Introduction to Dataset
A Dataset is a strongly typed collection of objects in a particular domain that can be transformed in parallel using functional or relational operations.
A Dataset is represented by a Catalyst logical execution plan. Its data is stored in an encoded binary form, so operations such as sort, filter, and shuffle can be performed without deserialization.
A Dataset is lazy and triggers computation only when an action operation is performed. At that point, Spark uses the query optimizer to optimize the logical plan and generate an efficient, parallel, distributed physical plan.
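A minimal Dataset sketch of these points; the Person case class and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DatasetSketch").getOrCreate()
import spark.implicits._

// A strongly typed Dataset[Person].
val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
// Lazy: this only builds a Catalyst logical plan.
val adults = ds.filter(p => p.age >= 30)
// Action: the optimized, parallel physical plan runs now.
println(adults.count())
```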
Introduction to DataFrame
A DataFrame is a Dataset organized into named columns.
In Spark 2.x, DataFrame is simply an alias for Dataset[Row].
[Diagram: comparison of an RDD[Person] with a DataFrame holding the same data in named columns.]
RDD, DataFrame, and Datasets (1)
RDD:
Advantages: type-safe and object-oriented.
Disadvantages: high serialization and deserialization overhead; high GC overhead caused by frequent object creation and destruction.
DataFrame:
Advantages: schema information reduces serialization and deserialization overhead.
Disadvantages: not object-oriented; no compile-time type checking.
RDD, DataFrame, and Datasets (2)
Characteristics of Dataset:
Fast: in most scenarios, performance is better than that of RDDs. Encoders outperform Kryo or Java serialization and avoid unnecessary format conversions.
Type-safe: like RDDs, functions are checked at compile time as far as possible.
Dataset, DataFrame, and RDD can be converted to each other.
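A sketch of converting between the three abstractions, reusing the hypothetical Person case class from the Dataset example:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ConversionSketch").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
val df  = rdd.toDF()    // RDD -> DataFrame: schema is inferred from the case class
val ds  = df.as[Person] // DataFrame -> Dataset[Person]: adds compile-time typing
val back = ds.rdd       // Dataset -> RDD[Person]
```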
Spark SQL and Hive
Differences
The execution engine of Spark SQL is Spark Core, and the default execution
engine of Hive is MapReduce.
Spark SQL typically executes 10 to 100 times faster than Hive.
Spark SQL does not support buckets, but Hive does.
Dependencies
Spark SQL depends on the metadata of Hive.
Spark SQL is compatible with most syntax and functions of Hive.
Spark SQL can use the custom functions of Hive.
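A hedged sketch of how Spark SQL reuses Hive metadata: enabling Hive support lets Spark query Hive tables directly (the table and column names below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLOnHive")
  .enableHiveSupport()   // reuse the Hive metastore for table metadata
  .getOrCreate()

// Query a hypothetical Hive table through Spark Core instead of MapReduce.
val df = spark.sql("SELECT dept, COUNT(*) AS cnt FROM employee GROUP BY dept")
df.show()
```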
Contents
1. Spark Overview
Structured Streaming Overview (1)
Structured Streaming is a stream processing engine built on the Spark SQL engine. You can express a streaming computation in the same way you would express a batch computation on static data. As streaming data keeps arriving, Spark SQL processes it incrementally and continuously, and updates the result table.
Structured Streaming Overview (2)
Programming Model for Structured Streaming
[Diagram: Structured Streaming programming model. With a trigger of every 1 second, new data arriving at times 1, 2, and 3 is processed by the query, and the output is produced in complete mode after each trigger.]
Example Programming Model of Structured Streaming
[Diagram: streaming word count example. Words such as "cat", "dog", and "owl" arrive at times 1, 2, and 3, and the result table of word counts is updated after each trigger.]
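A minimal Structured Streaming word count in the spirit of the example above (a sketch; the socket source, host, and port are hypothetical and used only for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Read an unbounded stream of lines from a hypothetical socket source.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split lines into words and count them incrementally.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Complete mode: the full result table is rewritten after every trigger.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```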
Contents
1. Spark Overview
Overview of Spark Streaming
[Diagram: Spark Streaming receives data streams from sources such as HDFS and Kafka and writes results to sinks such as Kafka, HDFS, and databases.]
Mini-Batch Processing of Spark Streaming
Spark Streaming programming is based on DStreams, which decompose a streaming computation into a series of short batch jobs.
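A DStream-based word count sketch illustrating the mini-batch model; the 1-second batch interval and the socket source are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamWordCount")
// Every 1 second of input becomes one short batch job.
val ssc = new StreamingContext(conf, Seconds(1))

val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```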
Fault Tolerance Mechanism of Spark Streaming
Spark Streaming performs computation based on RDDs. If some partitions of an RDD are lost, they can be recovered using the RDD lineage mechanism.
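Lineage recovery can be complemented by checkpointing, which persists DStream data and metadata to HDFS so that long lineages do not have to be fully recomputed. A one-line sketch, reusing the StreamingContext ssc from the previous example (the checkpoint directory is hypothetical):

```scala
ssc.checkpoint("hdfs://ns1/spark/streaming-checkpoint")
```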
Contents
1. Spark Overview
Spark WebUI
Permanent Processes of Spark
JDBCServer
JDBCServer is a permanent Spark application that provides the Java
database connectivity (JDBC) service.
Users can connect to JDBCServer to execute SQL statements by running
Beeline or JDBC scripts.
JDBCServer is deployed in active/standby mode, so there is no single point of failure.
JobHistory
JobHistory provides the HistoryServer page, which displays the historical execution information of Spark applications.
JobHistory is deployed in two-node load-sharing mode, so there is no single point of failure.
Spark and Other Components
In the FusionInsight cluster, Spark interacts with the following
components:
HDFS: Spark reads or writes data in the HDFS (mandatory).
Yarn: Yarn schedules and manages resources to support the running of
Spark tasks (mandatory).
Hive: Spark SQL shares metadata and data files with Hive (mandatory).
ZooKeeper: HA of JDBCServer depends on the coordination of ZooKeeper
(mandatory).
Kafka: Spark can receive data streams sent by Kafka (optional).
HBase: Spark can perform operations on HBase tables (optional).
Summary
The background, application scenarios, and characteristics of
Spark are briefly introduced.
Basic concepts, technical architecture, task running processes,
Spark on Yarn, and application scheduling of Spark are
introduced.
Integration of Spark in FusionInsight HD is introduced.
Quiz
1. What are the characteristics of Spark?
Quiz
1. RDD operators are classified into: ________ and _________.
More Information
Download training materials:
http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807
eLearning course:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term10000254 50&id=Node1000011796
Outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en
Thank You
www.huawei.com