
Article

Optimization of the Join between Large Tables in the Spark Distributed Framework
Xiang Wu and Yueshun He *

School of Information Engineering, East China University of Technology, Nanchang 330013, China;
[email protected]
* Correspondence: [email protected]

Abstract: The Join task between large Spark tables takes a long time to run and produces a large amount of disk I/O, network I/O and disk occupation in the Shuffle process. This paper proposes a lightweight distributed data filtering model that combines broadcast variables and accumulators using RoaringBitmap. When the data in the two tables are not exactly matched, the dimension table Keys are collected through the accumulator, compressed by RoaringBitmap and distributed to each node using broadcast variables. The distributed fact table data can then be pre-filtered on the local server, which effectively reduces the data transmission and disk reading and writing in the Shuffle phase. Experimental results show that this optimization method can reduce disk usage, shorten the running time and reduce network I/O and disk I/O for Spark Join tasks in the case of massive data, and the effect is more obvious when the two tables have a higher incomplete matching degree or a fixed matching degree but a larger amount of data. This optimization scheme is easy to use, easy to maintain and clearly effective, and it can be applied to many development scenarios.

Keywords: Join; Spark; Shuffle; optimization method; RoaringBitmap

1. Introduction
With the rapid development of the Internet in recent years, the era of big data has arrived. After years of development, a large number of new high-performance technologies have emerged in the field of big data, such as Apache Spark [1–3] and Apache Flink [4], which are stronger than MapReduce [5,6] in terms of query and computation performance and which have become powerful tools for big data acquisition, storage, analysis and presentation. Big data analysis technology plays a key role in various industries. Asad et al. [7,8] studied the importance of big data analysis technology in enterprises.
Spark is a fast, universal, scalable and highly available big data analysis engine developed in Scala. It upgrades the performance of the MapReduce model. Developers can deploy Spark on a large number of servers to form clusters that efficiently process data. The core technology of Spark is the use of resilient distributed datasets (RDDs) [9]. The data are distributed in the form of RDDs on each server for management, to achieve data parallelization and distributed processing. During the data repartitioning process of a Spark task, if data are moved across nodes, Shuffle is generated, as shown in Figure 1. Shuffle is a bridge between Map and Reduce. It maps the Map output to the Reduce input and involves serialization and deserialization, cross-node network I/O and disk read/write I/O. If a complex service logic has a Shuffle, the next stage can be executed only after the previous stage produces a result. In the mass data Join task of a distributed architecture, the data interaction between servers will inevitably generate Shuffle, which means a large number of serialization–deserializations, cross-node network I/Os and disk read and write I/Os.


Figure 1. The Spark RDD dependency displayed in a Shuffle structure diagram.
Shuffle consists of Shuffle write and Shuffle read phases, as shown in Figure 2. During Shuffle, a large amount of intermediate data is migrated to disks for a long time, and a large amount of network I/O is generated, affecting the overall performance of the Spark job.

Figure 2. Interaction between Shuffle write and Shuffle read.
When optimizing the Spark performance, we should pay attention not only to the execution time of tasks but also to the network I/O and disk reads and writes. Proper optimization not only reduces the uptime but also reduces the network I/O load, disk usage and so on. In actual development, the number of Spark tasks ranges from as few as 100 to as many as thousands. In this case, the performance optimization of key tasks is extremely important. Proper performance optimization can ensure running efficiency and save resources, and it helps to avoid negative effects caused by excessive operation data.
In order to solve the problems of a long running time, an excessive network I/O load and the high disk occupancy of Spark Join tasks between large tables, this paper proposes a lightweight distributed data filtering model using RoaringBitmap [10] to combine broadcast variables and accumulators when the data in two tables are not completely matched. This optimization is theoretically analyzed and experimentally verified. The implementation results show that this method effectively reduces the running time of Spark Join and effectively reduces the data transfer and disk reads and writes in the Shuffle phase. The overall performance of Spark Join tasks is improved.
2. Related Work
The performance of most Spark jobs is mainly consumed by the Shuffle process. This process involves a large number of disk I/O operations, serialization and deserialization operations and network data transmission operations. Therefore, to improve the performance of Spark jobs, it is necessary to optimize the Shuffle process.
In terms of load balancing, Ren et al. [11] studied the cross-network reading of Shuffle and the aggregation of partition data among tasks with data dependence. They adopted heuristic prescheduling through SCache, combined with Shuffle size prediction, and balanced the load of each node through load balancing to achieve Shuffle optimization. Li
et al. [12] studied the data skew in the Shuffle stage and proposed a Shuffle phase dynamic
balance partitioning method based on reservoir sampling to sample and preprocess the
intermediate data, predict the overall data skew and provide the overall partitioning strat-
egy for application implementation, thus reducing the impact of data skew on the Spark
performance. Kumar et al. [13] studied the search space partitioning strategy of data paral-
lelism. Based on the communication cost-effectiveness pattern mining algorithm, tasks can
be allocated fairly and effectively among cluster nodes to reduce the communication cost
generated during Shuffle. Choi et al. [14] used SSD to make up for the lack of main memory
bandwidth and applied RDD cache strategies with different proportions of Shuffle and
storage space to improve the overall performance of the system. Tang et al. [15] proposed
an initial adaptive task concurrency estimation algorithm combined with known task input
information and actuator memory, realized dynamic memory-aware task scheduling and
used two typical benchmarks, light Shuffle-light and heavy Shuffle-heavy, to evaluate
the performance, which significantly improved the resource utilization. Zeidan et al. [16]
proposed a new spatial divider for the spatial query of large spatial data sets. KNN spatial
join query, based on Spark, is used to reduce the spatial query skew and task running time.
Zhao et al. [17] studied the cache management strategy in DAG-aware task scheduling and proposed a new cache management strategy called long-run phase set priority, which makes full use of task dependencies to optimize cache management performance in DAG-aware scheduling algorithms. Tang et al. [18] studied partitioning methods in
the Spark framework, considering the partition balance of the intermediate data and the
partition balance after the Shuffle operator. The range-based key segmentation algorithm
realized skew mitigation in Shuffle and effectively reduced task execution time. Based on
the new operators and some new logical and physical rules, they extended the Spark query
to achieve task optimization.
Regarding the rational use of resources, Jiang et al. [19] proposed a data management algorithm for the data mixing stage that effectively reduces the resource occupation and computing response delay of Spark, which is otherwise prone to problems such as insufficient utilization of cluster resources, high computation delay and high task processing delay in the Shuffle stage. The partition-weighted adaptive cache replacement algorithm based on
RDD can make full use of memory resources and reduce resource waste effectively. Bazai
et al. [20] proposed an RDD-based data anonymization method built on subtrees, which provides effective management of RDD partitions, improves memory usage, caches frequently referenced intermediate values and enhances iteration support. Modi et al. [21] studied the execution of
big data queries to realize the sorting and hash aggregation of intermediate data in memory,
the exchange of intermediate data to disks and the network transmission of data. Chen
et al. [22] addressed the problem that the computing capacity of distributed systems is limited when processing large-scale temporal event data and cannot meet the requirements of low delay and high throughput; their new temporal data processing method for large events effectively realizes temporal data management, operation and real-time response. Shen
et al. [23] studied the scalability of Shuffle and designed a new Shuffle mechanism, Magnet, which effectively improves the data locality of Shuffle operations and further improves the efficiency and reliability of Shuffle in Spark.
Nowadays, many optimization schemes lack out-of-the-box methods; that is, when the performance of a big data cluster reaches a bottleneck, a method that is simple, practical and easy to maintain is needed to break through the performance bottleneck. Although many optimization methods improve some aspect of the performance, they add a lot of unstable factors to the big data cluster. They may not be able to achieve a stable equilibrium state in the actual development process, which may require additional maintenance of the algorithm model and complicate the development. Many practical problems can be solved by using appropriate algorithm models. Qalati et al. [24] used a partial least squares structural equation model to analyze data and obtained the influencing factors of energy saving intention and actual behavior. The optimization scheme used in this paper can achieve an out-of-the-box effect in Spark Join tasks between large tables; the effect is obvious, the stability is strong and the maintenance is easy, and it can be applied to many development scenarios.

3. Related Technologies
3.1. RoaringBitmap Algorithm
RoaringBitmap is composed of a binary data structure, using the bit as the unit to store data, so the data compression rate is very high. To store 4 billion data of type int, the data size is 14.9 GB for normal storage and 512 MB for RoaringBitmap storage. RoaringBitmap storage is about 30 times smaller than normal storage.
2is16 about
buckets. Thesmaller
30 times first 16 bits
than of binary
normal data are used as bucket numbers. Each bucke
storage.
Container for storing
RoaringBitmap uses athe lastmechanism
bucket 16 bits oftobinary data.
save space. TheAintRoaringBitmap
data are divided into is a collec
16
2 buckets. The first 16 bits of binary data first
are used as bucket numbers. Each bucketto hasfind the
Containers. When storing data, the 16 bits of data are numbered
a Container for storing the last 16 bits of binary data. A RoaringBitmap
sponding Container. If the corresponding Container is not found, the correspondin is a collection
of Containers. When storing data, the first 16 bits of data are numbered to find the
tainer is created and the last 16 bits of data are put into the Container. As shown in
corresponding Container. If the corresponding Container is not found, the corresponding
3,Container
the value of 20 isand
is created saved into16RoaringBitmap,
the last bits of data are putand
intothe
thevalue of the
Container. Asfirst 16 bits
shown in is 0 th
calculation. Therefore, the corresponding Container number is 0. After obtaining
Figure 3, the value of 20 is saved into RoaringBitmap, and the value of the first 16 bits is 0 t
through calculation. Therefore, the corresponding Container number
responding Container, the calculated value of the last 16 bits of 20 is set into the is 0. After obtaining
the corresponding Container, the calculated value of the last 16 bits of 20 is set into the
sponding Container.
corresponding Container.

3.RoaringBitmap
Figure 3.
Figure RoaringBitmapstorage mode.mode.
storage
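To make the storage mechanism concrete, here is a minimal Scala sketch using the org.roaringbitmap library described in [10]; the inserted values and the printed size check are illustrative assumptions, not measurements from this paper's experiment.

import org.roaringbitmap.RoaringBitmap

object RoaringBitmapDemo {
  def main(args: Array[String]): Unit = {
    val bitmap = new RoaringBitmap()
    bitmap.add(20)           // first 16 bits = 0 -> Container 0 holds the low 16 bits (20)
    bitmap.add(65536)        // first 16 bits = 1 -> a new Container 1 is created
    bitmap.add(1L, 1000000L) // add the whole range [1, 1000000) in one call

    println(bitmap.contains(20))      // true
    println(bitmap.contains(1000000)) // false: the range end is exclusive

    bitmap.runOptimize()                    // compress runs of consecutive values
    println(bitmap.serializedSizeInBytes()) // far below 4 bytes per stored int
  }
}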
3.2. Spark Accumulator
The Spark accumulator summarizes data about variables on the Executor side of a cluster to the Driver side. As shown in Figure 4, the accumulator of the Driver side is first serialized and sent to the Executor. Then, the accumulator is used in the Executor to collect data. Finally, the accumulator of each Executor is obtained at the Driver end and the accumulators are merged by the Merge function to obtain the final result.

Figure 4. Spark components related to the accumulator.
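The Merge step in Figure 4 can be sketched with Spark's public AccumulatorV2 API. The class below is our own minimal illustration (the class name is hypothetical, and the Keys are assumed to fit in 32-bit integers); each Executor adds Keys to its local RoaringBitmap, and the Driver ORs the partial bitmaps together.

import org.apache.spark.util.AccumulatorV2
import org.roaringbitmap.RoaringBitmap

// IN = one Int Key observed on an Executor; OUT = the merged bitmap of all Keys.
class RoaringBitmapAccumulator extends AccumulatorV2[Int, RoaringBitmap] {
  private var bitmap = new RoaringBitmap()

  override def isZero: Boolean = bitmap.isEmpty
  override def copy(): RoaringBitmapAccumulator = {
    val acc = new RoaringBitmapAccumulator()
    acc.bitmap = bitmap.clone()
    acc
  }
  override def reset(): Unit = bitmap = new RoaringBitmap()
  override def add(v: Int): Unit = bitmap.add(v) // runs on the Executors
  override def merge(other: AccumulatorV2[Int, RoaringBitmap]): Unit =
    bitmap.or(other.value)                       // the Merge step on the Driver
  override def value: RoaringBitmap = bitmap
}

It would be registered once on the Driver, e.g. sc.register(new RoaringBitmapAccumulator(), "dimensionKeys"), before the stage that scans the dimension table.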
3.3. Spark Broadcast Variable
The broadcast variable means that one variable is sent to the memory of each Executor node associated with the cluster task, as shown in Figure 5. Data information is broadcast to each Executor node, serialized as it is fetched and deserialized as it is used. Spark tasks can directly read data information from the Executor memory of the local node, preventing the data interaction between different tasks from generating a large cross-node network I/O.

Figure 5. Spark data exchange of broadcast variables.
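A minimal sketch of the mechanism (the variable names are illustrative) is a single sc.broadcast call on the Driver, after which every task reads the value from the memory of its local Executor rather than fetching it again:

import org.apache.spark.sql.SparkSession
import org.roaringbitmap.RoaringBitmap

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext

    val keys = RoaringBitmap.bitmapOf(1, 2, 3, 5, 8) // e.g., dimension table Keys
    val bcKeys = sc.broadcast(keys)                  // shipped once to each Executor

    // Each task reads bcKeys.value locally; no per-task cross-node transfer.
    val hits = sc.parallelize(1 to 10).filter(k => bcKeys.value.contains(k)).collect()
    println(hits.mkString(",")) // 1,2,3,5,8
  }
}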
4. Optimization of Project Analysis
4.1. Cost Optimization Estimation
The execution efficiency of Spark Join tasks is affected by the CPU, memory, disk, runtime configuration and execution code. In the process of evaluating the cost of executing tasks, it is difficult for us to calculate the exact cost of tasks. In the case of a fixed configuration, we only need to estimate the cost of Spark Join tasks before and after optimization to obtain a comparative result, so as to reflect the rationality of the optimization scheme.
The physical execution plan of the Spark Join task based on cost-based optimization (CBO) is a tree structure, the cost of which is equal to the sum of the costs of each execution node, as shown in Figure 6:

Figure 6. Spark execution node cost.

The cost is equal to the sum of the costs of each execution node, of which the highest cost is the Join procedure. The cost estimation formula of CBO is shown in Formula (1):

Cost = Rows × Weight + Size × (1 − Weight) (1)

Rows is the number of rows, Size is the size of the data and Weight is the weight, which is determined by the spark.sql.cbo.joinReorder.card.weight configuration. In Spark Join, when the data in the two tables are not exactly matched, the weight is fixed; if the fact table is pre-filtered using the optimization scheme before joining, then the Rows and Size are reduced. According to the CBO estimation formula, the cost before optimization is greater than the cost after optimization: the more data you filter, the lower the cost will be. Lim et al. [25] studied all possible query execution paths in the computation overhead of grouping subqueries and selected effective query execution paths through efficient query algorithms to reduce the cost. Path analysis technology is also applied in all walks of life. Hammami et al. [26] used path analysis technology to test the hypothesis of the dimensions of organizational knowledge ability, reveal the various knowledge abilities of the enterprise and establish the relationships between them. The experimental optimization purpose of this paper is to reduce the Rows and Size before the Shuffle of the Join so as to reduce the cost in the maximum-cost link. Since the filtering model used in our experiment is very lightweight, it has little impact on the overall cost. After a large amount of irrelevant data have been filtered out in the pre-filtering phase, Spark Join tasks may degrade from complex types to simple ones, as shown in Figure 7, greatly reducing the overall cost.

Figure 7. Spark Join degradation.
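As a quick numerical illustration of Formula (1) — with hypothetical row counts, sizes and an assumed weight of 0.7 for spark.sql.cbo.joinReorder.card.weight — pre-filtering shrinks both terms of the cost linearly:

// Formula (1): Cost = Rows × Weight + Size × (1 − Weight)
def cboCost(rows: Long, sizeInBytes: Long, weight: Double): Double =
  rows * weight + sizeInBytes * (1 - weight)

val weight = 0.7 // assumed configuration value, for illustration only
val before = cboCost(120000000L, 12L * 1024 * 1024 * 1024, weight) // ≈ 3.95 × 10^9
val after  = cboCost( 40000000L,  4L * 1024 * 1024 * 1024, weight) // exactly before / 3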

This paper studied the cost of Shuffle write and Shuffle read in Shuffle of Spark Join.
The cost estimate of the Shuffle write workflow is shown in Formula (2):

Cost_shuffle_write = Cost_cache + Cost_sort + ∑(Cost_buffer + Cost_spill) + Cost_merge (2)

Cost_cache represents the cost of reading data into the cache, Cost_sort represents the cost of sorting according to the marked partition, Cost_buffer + Cost_spill represents the cost of each save to the cache and spill to disk and Cost_merge represents the cost of small file
consolidation on disk.
It should be noted that the operation processes of Shuffle write and Shuffle read are
similar, but Shuffle read needs to establish a network connection and data transfer. When
the running memory is sufficient, there will be no spill operation, so no disk file will be
generated. When the memory is insufficient, it will also generate sort and spill operations to
generate disk files, so the cost calculation of Shuffle read is different in the case of sufficient
memory and insufficient memory.
The cost estimate of the Shuffle read workflow when memory is sufficient is shown in the following equation:

Cost_shuffle_read = Cost_net + Cost_cache (3)

The cost estimation of the Shuffle read workflow when memory is insufficient is shown as follows:

Cost_shuffle_read = Cost_net + Cost_cache + ∑(Cost_buffer + Cost_sort + Cost_spill) + Cost_merge (4)

Cost_net represents the cost of obtaining data transmitted over the network; Cost_cache represents the cost of reading data into the cache; Cost_buffer + Cost_sort + Cost_spill represents the cost of obtaining cache data each time for sorting and then spilling to disk; Cost_merge
represents the cost of small file consolidation on disk.
The Shuffle write workflow first fetches the data and caches it in memory, then sorts
the data, and finally writes the data to disk to generate small files and merges the small files.
The size of the data acquired by Shuffle write affects the final size of the data written to disk.
The larger the amount of data acquired by Shuffle write, the more data will be written to
the disk. Cost estimation involves each step of Shuffle write, but the data cache in the first
stage is the key to the cost size. If only a small amount of data are cached, the subsequent
cost consumption will be small; if the amount of cached data is large, the subsequent cost
will also be large.
Shuffle read mainly involves data network transmission and data caching. The Shuffle
read cost is also strongly determined by the size of the data read, but the data read is derived
from the data written to disk by the Shuffle write. When the memory is not sufficient, it is
also necessary to write data to the disk for temporary storage, increasing the cost.
In this optimization scheme, the amount of data read by Shuffle write is reduced by
pre-filtering, so that the overall cost of Shuffle write is reduced, and the amount of data
written to disk by Shuffle write is also reduced. When Shuffle write writes fewer data to
the disk, Shuffle read needs to read fewer data for network transfer and data caching. At
the same time, it also reduces the cache of Shuffle write and Shuffle read data in memory,
reduces the utilization of memory and largely avoids the memory shortage in the Shuffle read workflow that would otherwise result in a high cost.

4.2. Optimization of Work Content


In the Shuffle process of Spark Join, each node of the cluster writes data to the local disk
file through Shuffle write, and Shuffle read obtains the disk file of each node through the
network transmission. There are a lot of data interactions, network transfers and file read
and write operations, which is why the Shuffle phase is very time and resource consuming.
The optimization scheme in this paper is to preprocess the two tables of Join based on
the fact table and dimension table data not completely matching, clean the fact table data
before Shuffle, deal with unnecessary data and only let the data that need to be joined enter the Shuffle phase, which saves resources and reduces the running time to a greater extent.
How to clean the data of each node has become the key to the experiment. Firstly, it should complete the data cleaning task under the condition of limited resources. Secondly, it should have a good cleaning effect on many data sets and should keep the task stable during operation and easy to maintain. According to the requirements of the optimization scheme, the lightweight and highly compressible storage component RoaringBitmap was selected. The accumulator and broadcast variable are used to ensure that RoaringBitmap has high stability, maintainability and efficiency in the process of data loading and data transmission. Therefore, the accumulator, broadcast variable and RoaringBitmap were selected for the Spark pre-filtering task in the experiment.
The execution flow of the Join for the optimization scheme in this paper is shown in Figure 8. We first create an accumulator and load the RoaringBitmap into it, and then collect the dimension table data Keys into the accumulator of the RoaringBitmap type. The RoaringBitmap is broadcast to each node as a Spark broadcast variable and stored in memory. In the filtering phase, each cluster node reads the Keys stored in the RoaringBitmap and matches them against the Keys of the current fact table. If a Key of the fact table does not match any stored value, the data are deleted. Through the above method, a fact table without redundant data is obtained, and then we Join the data. Since no redundant fact table data enter the Shuffle phase, the unnecessary data interaction, network transmission and disk reading and writing generated by Shuffle write and Shuffle read are avoided. Thus, efficient and energy-efficient Spark Join tasks can be achieved.

Figure 8. Experiment to optimize specific execution steps.
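Putting the three components together, the flow of Figure 8 can be condensed into the following sketch. It is written against the TPC-H table and column names used in Section 5, assumes integer Join Keys (RoaringBitmap stores 32-bit values) and reuses the hypothetical RoaringBitmapAccumulator sketched in Section 3.2; it is an illustration of the method, not the authors' released code.

import org.apache.spark.sql.SparkSession

object PreFilteredJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("prefiltered-join")
      .enableHiveSupport().getOrCreate()
    val sc = spark.sparkContext

    // Step 1: collect the dimension table Keys into the bitmap accumulator.
    val keyAcc = new RoaringBitmapAccumulator()
    sc.register(keyAcc, "dimensionKeys")
    val dim = spark.table("orders").select("o_orderkey")
    dim.foreach(row => keyAcc.add(row.getInt(0)))

    // Step 2: broadcast the compressed Key set to every node.
    val bcKeys = sc.broadcast(keyAcc.value)

    // Step 3: pre-filter the fact table locally, before any Shuffle happens.
    val fact = spark.table("lineitem")
    val filteredFact = fact.filter(row => bcKeys.value.contains(row.getAs[Int]("l_orderkey")))

    // Step 4: only the matching rows enter the Shuffle phase of the Join.
    val joined = filteredFact.join(dim, filteredFact("l_orderkey") === dim("o_orderkey"))
    println(joined.count())
  }
}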

The storage engine of the task-running infrastructure in this experiment is based on Spark on Hive, and the query engine is based on Spark on Yarn. We first store the dimension table and fact table data required for the experiment in the Hive [27] database, and then submit the Spark Join job request from the Spark client, which will submit the job to Yarn [28]. Finally, Yarn reads the dimension and fact tables to be joined from Hive and performs a distributed Spark Join. Figure 9 shows the Spark task execution architecture.

Figure 9. Spark task execution architecture.
In this paper, the comparison is mainly based on three aspects: the task running time before and after optimization, the data size written to disk by Shuffle write and the data size read by Shuffle read. The larger the amounts of data in the Shuffle write and Shuffle read phases, the higher the disk footprint, the network I/O and the disk I/O. If the task running time is shortened after optimization, and the data sizes of Shuffle write to disk and Shuffle read from disk are reduced, the optimization scheme is very feasible for Spark Join tasks.

5. Experiment
5.1. System Configuration
This optimization experiment is based on Cloudera's Distribution Including Apache Hadoop (CDH) big data platform. The Spark, Hive, Hadoop [29], Zookeeper [30] and Hue components were installed on the CDH big data platform. Spark was used to execute parallel Join tasks, Hive was used to build a data warehouse on Hadoop's HDFS [31] storage engine, Hadoop's Yarn was used to manage resources and schedule tasks on Spark, Zookeeper was used to coordinate components and manage metadata, and Hue was used to build visual queries on Hive to check whether Spark Join data were lost or incorrect. In order to achieve the effect of distributed computing, this experiment involved the setting up of a big data cluster on three Linux servers.
The cluster configuration is shown in Table 1. Altogether, there was a 24-core CPU, 192 GB of memory and 600 GB of hard disk.

Table 1. Cluster configuration.

Server Name    CPU       Memory    Hard Disk
Hadoop201      8-core    64 GB     200 GB
Hadoop202      8-core    64 GB     200 GB
Hadoop203      8-core    64 GB     200 GB
The versions of development tools used by the cluster are shown in Table 2.

Table 2. Development tool versions.

Tool Versions
Operating System CentOS 7.5
CDH 6.3.2
JDK 1.8.0_181
Hadoop 3.0.0 + cdh6.3.2
Hive 2.1.1 + cdh6.3.2
Zookeeper 3.4.5 + cdh6.3.2
Spark 2.4.0 + cdh6.3.2
Hue 4.2.0 + cdh6.3.2

5.2. Testing Dataset


In this paper, we used the TPC-H [32] data set, which is a test set of the TPC-H business
intelligence computing test used to simulate decision support applications. At present, this
data set is widely used in academia and industry to evaluate performance related to the
application of decision support technology.
The first round of experimental data we used was the official data set of TPC-H.
The number of data used in the fact table lineitem was 120 million, and the numbers of
data used in the dimension table orders were 100,000, 1 million, 5 million, 10 million and
30 million, respectively. The numbers of data after Join were 400,000, 4 million, 20 million,
40 million and 120 million. The orders table has a one-to-many data association with the
lineitem table.
In the second round of experimental data, we also used the official data set of TPC-H.
In order to realize the complex Join scenario of the many-to-many data association mode,
we tested the optimization scheme through different amounts of data when the matching
degree was determined. We processed the data of the TPC-H dataset, obtained the orders
table with a 1-million-data volume, and copied the data in the table five times to become the
orders table with a 5-million-data volume. The lineitem table with a 10-million-data volume
was obtained, and the lineitem table with a 10-million-data volume was replicated 5, 10,
50, 100 and 150 times, respectively, to obtain lineitem tables with 50 million, 100 million,
500 million, 1 billion and 1.5 billion-data volumes. The numbers of data in the orders
table Join lineitem table are 140 million, 280 million, 1.4 billion, 2.8 billion and 4.2 billion,
respectively. The orders table is used as the dimension table and the lineitem table is used
as the fact table in the experiment. The orders table is many-to-many with the lineitem
table. The orders table matches 14.28% of the data in each lineitem table.
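A hedged sketch of the Join query itself (the projected columns are our own illustration; the join key follows the TPC-H schema) is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val joined = spark.sql(
  """SELECT o.o_orderkey, l.l_extendedprice
    |FROM orders o
    |JOIN lineitem l ON o.o_orderkey = l.l_orderkey""".stripMargin)
joined.count() // an action that materializes the Join and its Shuffle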

5.3. Experimental Results and Analysis


In the experiment, the configuration resources applied for when submitting tasks to
Spark were executor-cores 2, num-executors 3, and executor-memory 1 g.
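Expressed in code rather than on the spark-submit command line, the same resource request looks as follows (the application name is an illustrative assumption):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-optimization-experiment")
  .config("spark.executor.cores", "2")     // executor-cores 2
  .config("spark.executor.instances", "3") // num-executors 3
  .config("spark.executor.memory", "1g")   // executor-memory 1 g
  .enableHiveSupport()
  .getOrCreate()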

5.3.1. First Round of Experiments


The data volume of the fact table lineitem used in the experiment was 120 million, and
the data volume of the dimension table orders was 100,000, 1 million, 5 million, 10 million
and 30 million; the data volume of the Join result was 400,000, 4 million, 20 million,
40 million and 120 million, respectively. The orders table has a one-to-many data association
with the lineitem table. The Spark distributed computing query framework is used to read
Hive data and run it on Yarn for Join operation. The data of each group were tested five
times and the average value was obtained. In the data sets of 100,000, 1 million, 5 million
and 10 million, the pre-filtering was carried out under the condition of incomplete matching,
and the Join time was shortened correspondingly, with the proportion of shortening time
being 30.0%, 29.6%, 23.7% and 19.7%, respectively, and the average shortening time was
68.75 s. The average shortening time ratio was 25.75%. The experimental results are shown
in Figure 10.
Figure 10. Task run time for one-to-many data.

5.3.2. Second Round of Experiments
The data volume of the orders table used in the experiment was 5 million, and the data volume of the lineitem table was 50 million, 100 million, 500 million, 1 billion and 1.5 billion, respectively. The orders table has a many-to-many data association with the lineitem table. The Spark distributed computing query framework is still used to read Hive data and run it on Yarn for the Join operation. Five experiments were conducted to obtain the average value of each group's data. After pre-filtering, the execution time of the Join task decreased more with an increasing amount of data. The rate of time reduction was 15.1%, 17.0%, 19.7%, 22.0% and 25.2%, respectively. The experimental results are shown in Figure 11.

Figure 11. Task run time for many-to-many data.
In addition to the task running speed, we also recorded the Shuffle write data in the Join process. The data volume before optimization was 133 MB, 232 MB, 1024 MB, 2013 MB and 3096 MB, and the data volume after optimization was 74 MB, 114 MB, 434 MB, 834 MB and 1234 MB, respectively. In the Join process, as the data volume of the optimized Shuffle write task increased, the data volume written to the disk decreased by 44.3%, 50.8%, 57.6%, 58.5% and 60.1%, respectively. The experimental results are shown in Figure 12.

Figure 12. Amount of Shuffle write data in a many-to-many data Join.

During Shuffle write, there is also a corresponding Shuffle read. Shuffle read data were recorded in the experiment. Before optimization, Shuffle read 133 MB, 232 MB, 1024 MB, 2013 MB and 3096 MB from the disk. After optimization, Shuffle read 74 MB, 114 MB, 434 MB, 834 MB and 1234 MB, respectively. In the Join process, as the data volume of the optimized task increased, the data volume read by Shuffle read decreased by more. The data read from the disk decreased by 44.3%, 50.8%, 57.6%, 58.5% and 60.1%, respectively. The experimental results are shown in Figure 13.

Figure 13. Amount of Shuffle read data in a many-to-many data Join.

5.3.3. Summary and Analysis of Experiments
An experimental comparison between the Spark Join task before optimization and the optimized Spark Join task was carried out. In the first round of experiments, it can be seen from the experimental results that as the data matching degree of the two tables continues to decrease, the running time of more tasks can be shortened by our optimized scheme, and the proportion of shortened time increases. In the second round of experiments, when the matching degree of the two tables is fixed and the amount of data in the lineitem table increases so that the amount of data after the Join increases, the optimized tasks can shorten the running time more. After optimization, as the amount of data increases, the amount of data written to disk in the Shuffle write phase is reduced more and the reduction proportion increases. The amount of data read from disk in the Shuffle read phase is reduced more and the reduction proportion increases.

Experiments show that the optimization scheme reduces the running time and reduces
the resource consumption of Spark tasks when the data of the two big tables are not exactly
matched; it also reduces the amount of operation data in the Shuffle write and Shuffle read
phases to reduce network I/O, disk I/O and disk consumption.

6. Conclusions
The Join process between Spark large tables consumes a lot of resources. This paper
proposes a data filtering model with RoaringBitmap as the main component and the Spark accumulator and broadcast variables as auxiliaries. Using this filtering model eliminates the irrelevant data
in the process of distributed interaction with a very small storage cost, avoiding unnecessary
data processing in the Shuffle phase leading to resource consumption. Compared with
other optimization schemes, this optimization scheme pays more attention to the simplicity,
maintainability and versatility of the optimization method, considers the running time, disk
I/O, disk occupation and network I/O, and pays more attention to the overall performance
of the Join task. Therefore, a lightweight, maintainable, and extensible combination of
RoaringBitmap, accumulator and broadcast variable is adopted. In the experiments on this
optimization scheme, the Spark Join task completes in less time, with less disk consumption,
lower disk I/O and lower network I/O. The optimization scheme can be applied in many
development scenarios; when the two tables have a higher degree of incomplete matching
or a fixed degree of matching but a larger amount of data, the effect is more obvious.

Author Contributions: Conceptualization, X.W.; methodology, X.W.; validation, X.W.; investigation,


X.W.; data curation, X.W.; writing, X.W.; supervision, Y.H.; project administration, Y.H. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Key Research Projects of Jiangxi Province, grant No. 20224BBC41001, and the Jiangxi Key Laboratory of Cybersecurity Intelligent Perception, grant Nos. JKLCIP202203 and JKLCIP202204.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The original data used in this study can be downloaded from the
Transaction Processing Performance Council website (http://www.tpc.org), accessed on 5 March 2023.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Salloum, S.; Dautov, R.; Chen, X.; Peng, P.X.; Huang, J.Z. Big data analytics on Apache Spark. Int. J. Data Sci. Anal. 2016, 1, 145–164.
[CrossRef]
2. Zaharia, M.; Chowdhury, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Spark: Cluster computing with working sets. In Proceedings of
the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10), Boston, MA, USA, 22–25 June 2010; USENIX Association: Berkeley,
CA, USA, 2010.
3. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al.
Apache spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [CrossRef]
4. Carbone, P.; Katsifodimos, A.; Ewen, S.; Markl, V.; Haridi, S.; Tzoumas, K. Apache flink: Stream and batch processing in a single
engine. Bull. Tech. Comm. Data Eng. 2015, 38, 28–38.
5. Dean, J.; Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 2008, 51, 107–113. [CrossRef]
6. Dean, J.; Ghemawat, S. MapReduce: A flexible data processing tool. Commun. ACM 2010, 53, 72–77. [CrossRef]
7. Asad, M.; Asif, M.U.; Khan, A.A.; Allam, Z.; Satar, M.S. Synergetic effect of entrepreneurial orientation and big data analytics for
competitive advantage and SMEs performance. In Proceedings of the 2022 International Conference on Decision Aid Sciences
and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022.
8. Asad, M.; Asif, M.U.; Bakar, L.J.; Altaf, N. Entrepreneurial orientation, big data analytics, and SMEs performance under the
effects of environmental turbulence. In Proceedings of the 2021 International Conference on Data Analytics for Business and
Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2021.
9. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauly, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient distributed
datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Symposium on
Networked Systems Design and Implementation (NSDI), San Jose, CA, USA, 25–27 April 2012; pp. 15–28.
10. Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. Better bitmap performance with roaring bitmaps. Softw. Pract. Exp. 2016, 46, 709–719.
[CrossRef]
11. Ren, R.; Wu, C.; Fu, Z.; Song, T.; Liu, Y.; Qi, Z.; Guan, H. Efficient shuffle management for DAG computing frameworks based on
the FRQ model. J. Parallel Distrib. Comput. 2021, 149, 163–173. [CrossRef]
12. Li, C.; Cai, Q.; Luo, Y. Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark
environment. J. Supercomput. 2022, 78, 3561–3604. [CrossRef]
13. Kumar, S.; Mohbey, K.K. A Utility-Based Distributed Pattern Mining Algorithm with Reduced Shuffle Overhead. IEEE Trans.
Parallel Distrib. Syst. 2022, 34, 416–428. [CrossRef]
14. Choi, J.; Lee, J.; Kim, J.S.; Lee, J. Optimization Techniques for a Distributed In-Memory Computing Platform by Leveraging SSD.
Appl. Sci. 2021, 11, 8476. [CrossRef]
15. Tang, Z.; Zeng, A.; Zhang, X.; Yang, L.; Li, K. Dynamic memory-aware scheduling in spark computing environment. J. Parallel
Distrib. Comput. 2020, 141, 10–22. [CrossRef]
16. Zeidan, A.; Vo, H.T. Efficient spatial data partitioning for distributed kNN joins. J. Big Data 2022, 9, 77. [CrossRef]
17. Zhao, Y.; Dong, J.; Liu, H.; Wu, J.; Liu, Y. Performance improvement of dag-aware task scheduling algorithms with efficient cache
management in spark. Electronics 2021, 10, 1874. [CrossRef]
18. Tang, Z.; Lv, W.; Li, K.; Li, K. An intermediate data partition algorithm for skew mitigation in spark computing environment.
IEEE Trans. Cloud Comput. 2018, 9, 461–474. [CrossRef]
19. Jiang, K.; Du, S.; Zhao, F.; Huang, Y.; Li, C.; Luo, Y. Effective data management strategy and RDD weight cache replacement
strategy in Spark. Comput. Commun. 2022, 194, 66–85. [CrossRef]
20. Bazai, S.U.; Jang-Jaccard, J.; Alavizadeh, H. Scalable, high-performance, and generalized subtree data anonymization approach
for Apache Spark. Electronics 2021, 10, 589. [CrossRef]
21. Modi, A.; Rajan, K.; Thimmaiah, S.; Jain, P.; Mann, S.; Agarwal, A.; Shetty, A.; Gosalia, A.; Partho, P. New query optimization
techniques in the Spark engine of Azure synapse. Proc. VLDB Endow. 2021, 15, 936–948. [CrossRef]
22. Chen, Z.; Yao, B.; Wang, Z.J.; Zhang, W.; Zheng, K.; Kalnis, P.; Tang, F. ITISS: An efficient framework for querying big temporal
data. GeoInformatica 2020, 24, 27–59. [CrossRef]
23. Shen, M.; Zhou, Y.; Singh, C. Magnet: Push-based shuffle service for large-scale data processing. Proc. VLDB Endow. 2020,
13, 3382–3395. [CrossRef]
24. Qalati, S.A.; Qureshi, N.A.; Ostic, D.; Sulaiman, M.A. An extension of the theory of planned behavior to understand factors
influencing Pakistani households’ energy-saving intentions and behavior: A mediated–moderated model. Energy Effic. 2022,
15, 40. [CrossRef]
25. Lim, J.; Kim, B.; Lee, H.; Choi, D.; Bok, K.; Yoo, J. An Efficient Distributed SPARQL Query Processing Scheme Considering
Communication Costs in Spark Environments. Appl. Sci. 2021, 12, 122. [CrossRef]
26. Hammami, S.M.; Ahmed, F.; Johny, J.; Sulaiman, M.A. Impact of knowledge capabilities on organizational performance in the
private sector in Oman: An SEM approach using path analysis. Int. J. Knowl. Manag. (IJKM) 2021, 17, 15–18. [CrossRef]
27. Thusoo, A.; Sarma, J.S.; Jain, N.; Shao, Z.; Chakka, P.; Anthony, S.; Murthy, R. Hive: A Warehousing Solution over A Map-Reduce
Framework. Proc. VLDB Endow. 2009, 2, 1626–1629. [CrossRef]
28. Vavilapalli, V.K.; Murthy, A.C.; Douglas, C.; Agarwal, S.; Konar, M.; Evans, R.; Graves, T.; Lowe, J.; Shah, H.; Seth, S.; et al. Apache
Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara,
CA, USA, 1–3 October 2013.
29. Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The hadoop distributed file system. In Proceedings of the 2010 IEEE 26th
Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010.
30. Hunt, P.; Konar, M.; Junqueira, F.P.; Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the
USENIX Annual Technical Conference (USENIX ATC’10), Boston, MA, USA, 23–25 June 2010.
31. Borthakur, D. HDFS architecture guide. Hadoop Apache Proj. 2008, 53, 2.
32. Ivanov, T.; Rabl, T.; Poess, M.; Queralt, A.; Poelman, J.; Poggi, N.; Buell, J. Big data benchmark compendium. In Proceedings of
the 7th TPC Technology Conference, Kohala Coast, HI, USA, 31 August–4 September 2015.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.