

Predicting SQL Query Execution Time
with a Cost Model for Spark Platform

Aleksey Burdakov1, Viktoria Proletarskaya1, Andrey Ploutenko2, Oleg Ermakov1 and Uriy Grigorev1
1 Informatics and Control Systems, Bauman Moscow State Technical University, Moscow, Russia
2 Mathematics and Informatics, Amur State University, Blagoveschensk, Russia
[email protected], [email protected], [email protected], [email protected], [email protected]

Keywords: SQL, Apache Spark, Bloom Filter, TPC-H Test, Big Data, Cost Model.

Abstract: The paper proposes a cost model for predicting SQL query execution time in a distributed parallel system. Such an estimate is paramount for running a DaaS environment or for building an optimal query execution plan. The model represents a SQL query as nested stars; each star includes dimension tables, a fact table, and a Bloom filter. Bloom filters can substantially reduce network traffic in the Shuffle phase and cut join time in the Reduce stage of query execution in Spark. We also propose an algorithm for generating a query implementation program. The developed model was calibrated and its adequacy evaluated on 50 points. The obtained coefficient of determination R2=0.966 demonstrates good model accuracy even with imprecise intermediate table cardinalities, and 77% of the points with modelled time over 10 seconds have a modelling error below 30%. A theoretical evaluation of the model supports the modelling and experimental results for large databases.

1 INTRODUCTION

Database query execution forecasting has always been an important task. It has become even more valuable in the Database as a Service (DaaS) context (Wu, 2013). A DaaS provider has to manage infrastructure costs and Service Level Agreements (SLA). Query execution time estimates can help system management (Wu, 2013) in:
1. Access Control: by evaluating whether a query can be executed (Tozer et al., 2010; Xiong et al., 2011).
2. Query Planning: by planning for delays and query execution time limits (Chi et al., 2011; Guirguis et al., 2009).
3. Progress Monitoring: by eliminating abandoned large queries that overload the system (Mishra et al., 2009).
4. System Calibration: by designing and tuning the system based on the dependency of query execution time on hardware resources (Wasserman et al., 2004).
There are two major approaches to forecasting database query execution time:
1) Machine Learning (ML) methods that treat the DBMS as a black box and attempt to build a prognostic model (Tozer et al., 2010; Xiong et al., 2011; Akdere et al., 2012; Ganapathi et al., 2009);
2) cost models (Wu, 2013; Leis et al., 2015).
ML methods give a significant error, as shown in (Wu, 2013). This is potentially caused by the assumption that test queries are similar to the queries used for model training. The assumption does not hold for real dynamic database loads, where query execution plans can differ dramatically and execution times change radically.
Using exact table row counts in cost models allows building a precise linear correlation between query execution time and query cost for real databases (Leis et al., 2015). Calibration of model parameters together with exact row counts gives the lowest query execution time error for the cost model (Wu, 2013).
Sources (Wu, 2013; Leis et al., 2015) consider the predictive cost model only for relational databases. At the same time, MapReduce (Dean et al., 2004) is widely used to implement queries over big databases. It assumes parallel execution of the queries against data fragments distributed over many nodes (workers). Several data access platforms use this technology (Mistrík et al., 2017; Armbrust et al., 2015). The source (Armbrust et al., 2015) shows that Apache Spark SQL has advantages here. The original query is split into stages, and stages into tasks; each stage usually includes Map and Reduce execution.
The paper discusses a new cost model for SQL query execution time prediction on the Spark platform. The model accounts for Bloom filters and for duplication of small tables over the nodes. These aspects significantly reduce the original query execution time (Burdakov et al., 2019). The developed model also helps in making an optimal SQL query execution plan in a distributed environment.
In Paragraph 2, we illustrate how source queries can be represented as sub-queries and where Bloom filters can be connected and used. We then extend this approach to the general case (Table 1). Details of the developed method for SQL query implementation and its comparison with traditional tools are given in (Burdakov et al., 2019). Paragraph 3 develops a cost model of the query execution processes, which can be represented in the form of nested structures with a “star” scheme (Fig. 6). Paragraph 4 shows the results of model calibration and its adequacy assessment with the Q3 and Q17 queries and their stages.


2 REPRESENTATION OF AN ORIGINAL QUERY WITH SUBSEQUENT SUB-QUERIES

Let us start with examples of transforming an original query into a sequence S of sub-queries {Zi} and joins {Ji} of their execution results.

Example 1: Fig. 1 shows the Q3 query from the TPC-H test (TPC, 2019). The Q3 query execution schema is shown in Fig. 2. Each box Zi provides a source table identifier along with a filter condition shown in round brackets.

select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue,
       o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = '[SEGMENT]' and c_custkey = o_custkey
  and l_orderkey = o_orderkey and o_orderdate < date '[DATE]'
  and l_shipdate > date '[DATE]'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate;

Figure 1: Q3 query from TPC-H test.

The following TPC-H source table identifiers are provided: D1 – customer, F1 – orders, F2 – lineitem. Fig. 2 has two join stars: Z1, Z2 - J1, and J1, Z3 - J2. Each star has one dimension and one fact table (separated with a comma). The join result of the first star (J1) becomes a dimension in the second star.
Fig. 2 shows that each star can have a Bloom filter applied (Bloom, 1970; Tarkoma, 2012). The Bloom filter is generated at the creation of a dimension table (see Fig. 2). During the creation of the fact table (usually large), its records are additionally filtered with that Bloom filter (see the squares in Fig. 2). This significantly reduces the volume of data transmitted over the network at the Shuffle phase and cuts the table join time at the Reduce phase (Burdakov et al., 2019).

Figure 2: Q3 query execution schema.

Example 2: Fig. 3 shows the Q17 query with a correlated sub-query from the TPC-H test. Please note that Spark SQL cannot execute this query in its original form; it has to be decomposed into sub-queries. Fig. 4 presents the Q17 query execution schema. The following identifiers denote the source tables from the TPC-H database schema: D1 – part, F1 – lineitem.

select sum(l_extendedprice)/7.0 as avg_yearly from lineitem, part
where p_partkey = l_partkey and p_brand = '[BRAND]'
  and p_container = '[CONTAINER]' and l_quantity < (
    select 0.2 * avg(l_quantity) from lineitem where
    l_partkey = p_partkey );

Figure 3: Q17 query from TPC-H test.

We can identify here the following two stars: Z1, Z2 - J1, and J1, Z3 - J2. Each star has an enabled Bloom filter. Fig. 4 shows that for the first star a broadcast distribution is executed for the small dimension table Z1 (see the diamond) over the nodes that store the fragments of the fact table Z2(BF1). There the Z1 and Z2(BF1) tables are joined in RAM at the Map stage (without Shuffle and without Reduce task execution).
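As an illustration of this first star (not part of the original implementation), a minimal Spark (Scala) sketch is given below. The table paths, filter constants and Bloom filter parameters (expected number of items, false-positive rate) are assumptions, and the join key is assumed to fit a long integer.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, udf}

object FirstStarSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("q17-first-star-sketch").getOrCreate()

    // Z1: dimension sub-query over part (filter P1, projection of the join key)
    val z1 = spark.read.parquet("/data/tpch/part")
      .filter(col("p_brand") === "Brand#23" && col("p_container") === "MED BOX")
      .select("p_partkey")

    // BF1: Bloom filter over the dimension join key, assembled in the Driver
    // and broadcast to all Executors (the size/fpp parameters are illustrative).
    val bf1 = z1.stat.bloomFilter("p_partkey", 15000000L, 0.03)
    val bf1Bc = spark.sparkContext.broadcast(bf1)
    val inBf1 = udf((k: Long) => bf1Bc.value.mightContainLong(k))

    // Z2(BF1): fact sub-query over lineitem, pre-filtered with BF1 before any
    // shuffle, which is what reduces the Shuffle traffic and the Reduce-side join.
    val z2 = spark.read.parquet("/data/tpch/lineitem")
      .select(col("l_partkey").as("pr1"),
              col("l_extendedprice").as("e1"),
              col("l_quantity").as("q1"))
      .filter(inBf1(col("pr1").cast("long")))

    // J1: the small dimension Z1 is broadcast, so the join runs at the Map stage.
    val j1 = z2.join(broadcast(z1), col("pr1") === col("p_partkey"))
    j1.explain()
  }
}

The same pattern repeats for the second star, with the intermediate result J1 taking the place of a dimension table.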
Let us call the structure depicted in Fig. 2 and Fig. 4 a query structure. Representation of the source queries in the form of stars allows describing the source query as a sequence of sub-queries Zi and joins Jj, and connecting a Bloom filter or executing a broadcast distribution of small dimension tables.

Figure 4: Q17 query execution schema (Z1: SELECT p_partkey FROM part WHERE p_brand = '[BRAND]' and p_container = '[CONTAINER]'; Z2: SELECT l_partkey, l_extendedprice, l_quantity FROM lineitem, filtered with BF1; J1 = Z1 join Z2 on Z1.p_partkey = Z2.l_partkey, producing columns pr1, e1, q1; Z3: SELECT pr1, 0.2*avg(q1) as a1 from J1 GROUP BY pr1, filtered with BF2; J2: SELECT sum(J1.e1)/7.0 as avg_yearly from J1 join Z3 on J1.pr1 = Z3.pr1 and J1.q1 < Z3.a1).

This can be done for almost any SQL query. To do this, all “select” sub-queries have to be represented as intermediate tables and included into the “from” clause of the original query. Table 1 provides the intermediate table composition schemas for various SQL “select” sub-queries (Date et al., 1993). The corresponding TPC-H test query names are shown in round brackets. DataFrame/DataSet can implement intermediate tables in Spark.

Table 1: Intermediate table composition schemas.
– Sub-query in the “from” clause of the original query (Q7, Q8, Q9, Q13): represent the sub-query in the form of a new table after the “from” clause of the original query.
– Non-correlated sub-query (Q11, Q15, Q16, Q18, Q20, Q22): represent the sub-query in the form of a scalar (aggregate, EXISTS, NOT EXISTS) or a table with one column; use the table with IN, NOT IN, and use the scalar in comparison operations of the original query.
– Correlated sub-query (Q2, Q4, Q17, Q20, Q21, Q22): add the required attributes from the original query into the sub-query and perform a group by; represent the sub-query in table form; add the table name into the “from” clause of the original query; replace the condition with a sub-query after “where” of the original query with a condition with the required comparison operations.

The steps described in Table 1 are recursive: a “select” sub-query can itself be treated as an original query. To generate an original query execution program, a query schema shall be built (please see Fig. 2 and Fig. 4). The following language operator generation algorithm shall be applied in the next step (Fig. 5).
Fig. 5 has the following elements: Jjr – the j-th join (as a dimension) in the r-th tree of the query schema; Zjr – the j-th sub-query (as a dimension) in the r-th star of the query schema.

main:
  star( J ) for the outermost join of the query schema;
  delete join and sub-query duplicates (if any); duplicates will exist if the
  same joins or sub-queries are used as sub-queries in a few stars;
end main;

star( Jr ):
  CYCLE on j
    star( Jjr )
  END OF CYCLE
  CYCLE on j
    Select statement for sub-query Zjr
    [Bloom filter create or apply operators]
  END OF CYCLE
  Select statement for join Jr
  [Bloom filter creation operators]
end star;

Figure 5: Program generation algorithm for source code execution in Spark.
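The recursive structure of this algorithm can be illustrated with a small Scala sketch. The data model below (Node, SubQuery, Star) is a hypothetical simplification for illustration, not the authors' implementation.

// Hypothetical, simplified model of a query schema for the algorithm of Fig. 5:
// each star joins its dimensions (plain sub-queries or nested stars) with a fact table.
sealed trait Node
final case class SubQuery(select: String) extends Node
final case class Star(dims: Seq[Node], factSelect: String, joinSelect: String) extends Node

object ProgramGenerator {
  // star( ) from Fig. 5: first recurse into nested stars (joins used as dimensions),
  // then emit the sub-query Select statements together with Bloom filter
  // create/apply operators, and finally the Select statement for this star's join.
  def star(s: Star): Seq[String] = {
    val fromNestedStars = s.dims.collect { case inner: Star => inner }.flatMap(star)
    val fromSubQueries  = s.dims.collect { case SubQuery(sel) => sel + "  -- + Bloom filter create" }
    fromNestedStars ++ fromSubQueries ++
      Seq(s.factSelect + "  -- + Bloom filter apply", s.joinSelect)
  }

  // main from Fig. 5: generate operators for the outermost star and drop duplicates
  // that appear when the same join or sub-query is used in several stars.
  def generate(root: Star): Seq[String] = star(root).distinct
}

Calling generate on the outermost star of a schema such as the one in Fig. 4 emits the operator sequence bottom-up, nested stars first.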


Each star has one dimension table and one fact table in the provided examples 1 and 2. One can derive from the algorithm shown in Fig. 5 that, in general, each Jr join corresponds to a star with a join of several dimension tables {Di} and a fact table F (the cycle on j). This is true for the “snowflake” query schema.
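Continuing the first-star sketch given earlier, and assuming its result has been materialized as a DataFrame j1 with columns pr1, e1, q1 as in Fig. 4, the second star of Q17 and the correlated sub-query rewritten per Table 1 could look as follows (an illustration, not the authors' code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, sum}

object Q17SecondStar {
  // j1 is the result of the first star (columns pr1, e1, q1).
  def run(j1: DataFrame): DataFrame = {
    // Z3: the correlated sub-query of Q17 rewritten per Table 1 - group by the
    // correlation key so the scalar comparison becomes an ordinary join condition.
    val z3 = j1.groupBy("pr1").agg((avg("q1") * 0.2).as("a1"))
    // J2: second star - join J1 with Z3, filter, and compute the final aggregate.
    j1.join(z3, Seq("pr1"))
      .filter(col("q1") < col("a1"))
      .agg((sum("e1") / 7.0).as("avg_yearly"))
  }
}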
Fig. 6 shows the implementation schema of the dfDi sub-queries of an original query star and their join with the fact table F. The transformation and action sequence (see Fig. 6) forms a DAG (Directed Acyclic Graph) for the star implementation. It works like a conveyor processing df fragments in parallel through the graph nodes (the fragments are stored on the cluster nodes). This is a plan for the “star” schema (D1, ..., Dn – F), which can be used as a dimension in another star. The partition processing track is provided below (stages 1-4); a code sketch of these stages is given after Fig. 7.
1. Read the Di dimension tables, filter records with the Pi condition, obtain the projection (ki, wi).
2. Build Bloom filters for the dfDi table partitions in RAM (by the ki key for each dimension), assemble each Bloom filter in the Driver (logical OR), followed by broadcast distribution of the Bloom filter to all Executors, which perform the fact table (F) filtration.
3. Read the fact table, filter records with the PF condition and with the Bloom filters, obtain the projection ({fki}, kF, wF).
4. Join the filtered fact table with the filtered dimension tables (dfF Join dfDi). Group and sort if applicable to the particular “star” schema. Jump to a new star in the query schema.

Figure 6: Original query star implementation general schema. In the figure, dfDi = (Select ki, wi From Di Where Pi), BFi = broadcast(bloomFilter(ki)), dfF = (Select fk1, ..., fkn, kF, wF From F Where PF) filtered with filter(BF1, fk1) ... filter(BFn, fkn), and df = (Select kF, {si}, sF From {dfF Join dfDi On ki = fki} [Group By, Order By]).

Figure 7: Spark-created Processes Example (P1D, P2D: read a table split; P1P, P2P: filter the split records in a processor core; P1B, P2B: build a Bloom filter for the split record keys; tiD, tiP, tiB denote the corresponding durations).
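The sketch below renders stages 1-4 as a generic Spark (Scala) pipeline over n dimension tables. The Dim descriptor, storage paths, Bloom filter parameters and the assumption that keys fit a long integer are illustrative, not part of the paper.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, udf}

// Illustrative descriptor of one dimension table Di: its storage path, the filter
// predicate Pi, its join key ki and the matching foreign key fki of the fact table.
final case class Dim(path: String, predicate: String, key: String, factKey: String)

object StarPlan {
  def run(spark: SparkSession, dims: Seq[Dim], factPath: String, factPredicate: String): DataFrame = {
    // Stage 1: read the dimension tables and filter them with Pi.
    val dfDs = dims.map(d => spark.read.parquet(d.path).filter(d.predicate))

    // Stage 2: build a Bloom filter per dimension key and broadcast it to the Executors.
    val bfs = dims.zip(dfDs).map { case (d, dfD) =>
      spark.sparkContext.broadcast(dfD.stat.bloomFilter(d.key, 10000000L, 0.03))
    }

    // Stage 3: read the fact table, filter it with PF and with every Bloom filter.
    var dfF = spark.read.parquet(factPath).filter(factPredicate)
    dims.zip(bfs).foreach { case (d, bf) =>
      val inBf = udf((k: Long) => bf.value.mightContainLong(k))
      dfF = dfF.filter(inBf(col(d.factKey).cast("long")))
    }

    // Stage 4: join the filtered fact table with the filtered dimension tables;
    // small dimensions are broadcast so the join happens on the Map side.
    // A Group By / Order By of the particular star would follow here if needed.
    dims.zip(dfDs).foldLeft(dfF) { case (acc, (d, dfD)) =>
      acc.join(broadcast(dfD), col(d.factKey) === col(d.key))
    }
  }
}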

3 COST MODEL DEVELOPMENT

A cost model has the following features (Leis et al., 2015):
1. Uniform distribution: it is assumed that all values of an attribute are uniformly distributed in a given interval.
2. Independence: attribute values are considered independent (whether in the same table or in different tables).
3. Inclusion principle: join key domains overlap in such a way that the smaller domain’s keys are present in the larger domain.
A dataset that adheres to features 1-3 is called synthetic; e.g., the TPC-H database is synthetic.
Spark creates one or several parallel processes at each stage. Fig. 7 demonstrates an example. The lines in Fig. 7 denote process intervals (their beginning and end). The duration of the intervals (tix), i.e. the resource consumption time, is shown above the lines. We will consider the average values. A process can create another process (at the end of tix). For example, the chain of created processes for node 1 looks as follows: P1D→P1P→P1B. Fig. 7 shows that two processes with the same name executed on different nodes (or on different processor cores of the same node) form a group of parallel similar processes (PSP). The following three groups can be identified: (P1D, P2D), (P1P, P2P), (P1B, P2B).
Let us denote a PSP group with a line corresponding to the group interval definition (see Fig. 8). The parent PSP creates descendant PSPs, e.g. the descendant PSP PP is created by the parent PSP PD, and so on. Let us call a set of PSP groups connected parallel processes (CPP). For example, the PD, PP, PB PSPs form a CPP with the identifier P (see Fig. 8).

Figure 8: PSP Group (PD, PP, PB) and CPP (P) Notation.

Let us represent a CPP for simplicity with one line that corresponds to a CPP interval. The interval duration equals the time between the beginning and the end of all activities of the PSPs included in the set. Let us call it the duration of the execution of connected parallel processes. Each CPP interval is provided with an identifier (e.g. P in Fig. 8). A CPP instance frequently corresponds to a task that is executed in an Executor slot.
Based on the analysis of the Spark processes (see Fig. 6), we developed a mathematical model. Fig. 9 provides a description of the process execution for the sub-queries and joins related to one star of the original query. The lines in Fig. 9 correspond to CPPs:
1. Ri CPP. Reading and processing of the dimension tables, creation of Bloom filters for a key (BFi = bloomfilter(ki)).
2. A CPP. Assembly of the Bloom filters in the Driver program, OR join, broadcast distribution throughout the nodes.
3. RF CPP. Reading and processing of the fact table, record filtration with the Bloom filters (fki is a foreign key of the fact table).
A “group by” operation on the fact table sometimes precedes the Bloom filter application.
Items 4 and 5 are executed next if L > 0 (L is the number of small tables).
4. B CPP. Broadcast distribution of the filtered dimension tables whose size does not exceed the VM threshold.
5. C CPP. Hash join of the fact table with the 1…L dimension tables in RAM.
Items 6 and 7 are executed further if L < n.
6. Xi (i = 1…n-L+1) CPP. Sorting on the Map side of each dimension table partition (dfDL+1, ..., dfDn) or of the fact table (dfFH) by the join key, and storage of the sorted partitions in the local file system (Shuffle Write).
7. Yi (i = 1…n-L) CPP. Pairwise join of the fact and dimension tables.
If the original query has a “group by” (Z1) or “order by” (Z2) part, then items 8 and 9 are also executed.
8. Z1 CPP. Grouping at the end of the query execution.
9. Z2 CPP. Sorting at the end of the query execution (if there is an “order by” part).
The table obtained as the result of the sub-query execution and star joins of the original query can serve as a dimension in the next star.
The original query execution time is estimated as the sum of the CPP intervals 1-9 over all stars.
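Read literally, the estimate is a sum of interval durations. The following Scala sketch (hypothetical types, not the authors' Python implementation) records the CPP intervals of one star and sums them over the stars of a query; intervals that are absent for a given star contribute zero.

// Schematic container for the CPP interval durations of one star (seconds).
final case class StarIntervals(
  r: Seq[Double],   // 1. Ri  - dimension read/filter/Bloom filter build
  a: Double,        // 2. A   - Bloom filter assembly and broadcast
  rF: Double,       // 3. RF  - fact table read/filter
  b: Double,        // 4. B   - broadcast of small dimension tables (if L > 0)
  c: Double,        // 5. C   - map-side hash join (if L > 0)
  x: Seq[Double],   // 6. Xi  - map-side sort / shuffle write (if L < n)
  y: Seq[Double],   // 7. Yi  - pairwise reduce-side joins (if L < n)
  z1: Double,       // 8. Z1  - final grouping (if "group by" present)
  z2: Double)       // 9. Z2  - final sorting (if "order by" present)

object QueryTimeEstimate {
  def starTime(s: StarIntervals): Double =
    s.r.sum + s.a + s.rF + s.b + s.c + s.x.sum + s.y.sum + s.z1 + s.z2

  // Original query time: sum of the CPP intervals 1-9 over all stars.
  def queryTime(stars: Seq[StarIntervals]): Double = stars.map(starTime).sum
}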
The limited volume of the paper does not allow providing all formulas for the calculation of the CPP 1-9 intervals; let us provide the formulas for the Ri CPP interval calculation. Formulas (1)-(7) use the following elements: N – the number of cluster nodes (workers); NC – the total CPU quantity in the cluster (the number of Executor slots, i.e. the quantity of physical cores); VS – the split block size (bytes); VDi, QDi – the compressed volume and the number of records of the i-th dimension table (i = 1…n); Pi – the probability that a record satisfies the search condition for the i-th dimension table; RH – the data read intensity from the HDFS file system (bytes/sec); τd – the processor time for task deserialization in an Executor slot; τf – the processor time for filtration per record; τb – the processor time for a record read/write from the Bloom filter.
The number of slots (tasks) required to process the i-th dimension table is equal to

    Ni = ⌈VDi / VS⌉.   (1)

The following formula gives the number of records per slot:

    Qi = QDi / Ni.   (2)

The split volume for the i-th dimension table equals

    Vi = VS, if Ni ≥ 2;  Vi = VDi, if Ni = 1.   (3)

One task execution time connected with record processing in the i-th dimension table is

    ri = τd + Vi/RH + τf·Qi + τb·Pi·Qi.   (4)

An Executor slot may process several tasks consecutively. Let a task related to the m-th dimension table be planned for a slot:

    m: rm = max_i {ri}.   (5)

The split blocks of a table are distributed over the cluster nodes uniformly. Hence the probability that the same slot also gets the remaining dimension table tasks, where each task processes one table split, equals

    (6)

The dimension table processing time is then equal to

    (7)
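For concreteness, formulas (1)-(4) translate directly into code. The sketch below follows the parameter list above; the use of RH as the read-time denominator in (4) and all variable names are assumptions of the sketch.

// Sketch of formulas (1)-(4) for the i-th dimension table.
// vDi - compressed table volume in bytes, qDi - number of records,
// vS  - split block size in bytes, pI - filter selectivity (probability Pi),
// rH  - HDFS read intensity in bytes/s, tauD/tauF/tauB - calibrated times in s.
object RiInterval {
  def taskTime(vDi: Double, qDi: Double, vS: Double, pI: Double,
               rH: Double, tauD: Double, tauF: Double, tauB: Double): Double = {
    val nI = math.ceil(vDi / vS)                 // (1) number of tasks (splits)
    val qI = qDi / nI                            // (2) records per task
    val vI = if (nI >= 2) vS else vDi            // (3) split volume per task
    tauD + vI / rH + tauF * qI + tauB * pI * qI  // (4) one task execution time r_i
  }
}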
Figure 9: Implementation process description for the sub-queries and joins corresponding to one star of the original query (CPP lines 1. R1…Rn, 2. A, 3. RF, 4. B, 5. C, 6. X1…Xn-L+1, 7. Y1…Yn-L, 8. Z1, 9. Z2, with durations r1…rn, a, rF, b, c, xi, yi, z1, z2).

4 MODEL CALIBRATION AND ADEQUACY EVALUATION

The mathematical model of the processes shown in Fig. 9 was implemented in Python. A test stand that included a virtual cluster was used to calibrate the model. The cluster had 8 nodes with HDFS, Hive, Spark and Yarn (Vavilapalli et al., 2013). Each node had a dual-core processor, a 200 GB SSD disk and Ubuntu 14.16 OS. The results of the experiments were split into two subsets.
The first subset contained the execution of ten queries: five Q3 queries with the SF=500 (NBF=40 million), SF=250 (NBF=50 million), SF=100, SF=50 and SF=10 (NBF=15 million) database population parameters, where SF is the scale factor for the TPC-H database size (TPC, 2019) and NBF is the anticipated number of elements in the BF1 and BF2 Bloom filters; and five Q17 queries with SF=500, SF=250, SF=100, SF=50 and SF=10 (NBF=15 million).
The corresponding query execution times (10 points) were used to calibrate the model parameters with the Least Squares Method (LSM) with gradient descent. Table 2 provides the calibrated parameters of the model, their variation ranges and optimal values.

The optimal values of the calibrated model parameters were found in the following way: an outer cycle randomly selected a point inside a 9-dimensional parallelepiped (9 is the number of calibrated parameters), while an inner cycle performed error minimization by numerical methods with gradient descent. The error function equals the sum of squared differences between the execution times of the ten experimental and modelled queries. Since the error function may have multiple minima and there is a possibility of going beyond the ranges, the outer cycle was repeated 100 times. The sum of squared deviations of the modelled times from the experimental measurements equalled 27079 for the 10 queries.
We developed a universal interface allowing an arbitrary model to be set up and calibrated.
The second subset of experimental results contained the query execution times for 40 query stages: stages 0, 1, 2, 3, 6, 7, 8 of the Q3 query with the SF=500 (NBF=40 million), SF=250 (NBF=50 million), SF=100, SF=50 and SF=10 (NBF=15 million) database population parameters; and stages 0, 2, (3+5), 4, (6+7) of the Q17 query with the SF=500 (NBF=15 million) database population parameters. The stage execution times (40 measurements) were used for the model adequacy evaluation.

Table 2: Model Calibration Parameters and their Optimal Values (lower limit; upper limit; optimal value).
– τf – processor time of filtration per record, s: 1.0E-06; 1.0E-05; 1.14E-06
– τb – processor time for a record read/write from the Bloom filter, s: 1.0E-08; 1.0E-07; 2.07E-08
– τs – record sorting time per record, s: 1.0E-08; 1.0E-07; 2.11E-08
– τd – deserialization processor time for a task in the Executor slot, s: 1.0E-06; 1.0E-04; 7.59E-05
– τh – hashing time per record (for further comparison and aggregation), s: 1.0E-08; 1.0E-06; 4.27E-07
– KS – coefficient corresponding to the serialization effect on the transmitted data volume during shuffle execution: 0.5; 1.5; 0.81
– RH – HDFS file system data read intensity, MBps: 20.0; 50.0; 44.1
– WL – LFS local file system data write intensity, MBps: 50.0; 200; 61.4
– N1 – network switch data transmission intensity, MBps: 100; 500; 186
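A minimal sketch of the calibration loop described above is given below (random restarts inside the parameter box of Table 2, numerical gradient descent inside, the sum of squared errors as the objective). The model function signature, step sizes and the absence of parameter scaling are simplifying assumptions.

import scala.util.Random

object Calibration {
  // Sum of squared differences between measured and modelled execution times.
  def sse(model: Array[Double] => Array[Double], measured: Array[Double])(p: Array[Double]): Double = {
    val predicted = model(p)
    measured.zip(predicted).map { case (t, m) => (t - m) * (t - m) }.sum
  }

  // Outer cycle: random points inside the parameter box (the limits of Table 2);
  // inner cycle: plain gradient descent with a finite-difference gradient.
  // A real implementation would scale the parameters to comparable ranges first.
  def calibrate(model: Array[Double] => Array[Double], measured: Array[Double],
                lower: Array[Double], upper: Array[Double],
                restarts: Int = 100, steps: Int = 200, lr: Double = 1e-3): Array[Double] = {
    val f = sse(model, measured) _
    val rnd = new Random(42)
    var best = lower.clone(); var bestErr = Double.MaxValue
    for (_ <- 0 until restarts) {
      var p = Array.tabulate(lower.length)(i => lower(i) + rnd.nextDouble() * (upper(i) - lower(i)))
      for (_ <- 0 until steps) {
        val base = f(p)
        val grad = Array.tabulate(p.length) { i =>
          val h = 1e-6 * math.max(math.abs(p(i)), 1e-12)
          val pp = p.clone(); pp(i) += h
          (f(pp) - base) / h
        }
        p = Array.tabulate(p.length)(i => math.min(upper(i), math.max(lower(i), p(i) - lr * grad(i))))
      }
      val err = f(p)
      if (err < bestErr) { bestErr = err; best = p }
    }
    best
  }
}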
A model vs. experiment scatter plot in Fig. 10 (50 points) shows all the modelled and experimental execution times of the queries and their stages from the two subsets.

Figure 10: Query and stage modelling execution time (x) vs. experimental measurements (y); the fitted regression is y = 0.990x + 4.05 with R² = 0.966.

The logarithmic scale is used for both axes: x1 = lg x, y1 = lg y. The regression dependency between y and x is expressed as y = 0.99x + 4.05. On the logarithmic scale it becomes y1 = lg(0.99·10^x1 + 4.05). For a large enough x1 we get y1 ≈ x1 and hence y ≈ x; for x1 → -∞, y1 → lg 4.05 (the horizontal asymptote in Fig. 10). The coefficient of determination of the regression approximation of the experimental data is close to 1 (R2 = 0.966), which shows a very high modelling accuracy for large modelling time values (y = x in this case). Fig. 10 demonstrates that for values over 10 seconds the modelling accuracy is good (the dots are close to the y = x line). The relative modelling error δ = 100·|TExperiment − TModelling| / TExperiment for the points to the right of x = 10 (31 points) has the following distribution: 35% of the points have an error δ ≤ 10%, 19% have 10% < δ ≤ 20%, 23% have 20% < δ ≤ 30%, 13% have 30% < δ ≤ 40%, and 10% have δ > 40%.
Fig. 10 shows that model parameter calibration allows building a good prognostic cost model for query execution time estimation for large databases even with imprecise cardinality values of the intermediate tables (the cardinality values are estimated via the probabilities Pi in formula (4)). Further, we provide a theoretical justification for this finding.

5 MODEL ADEQUACY THEORETICAL EVALUATION FOR LARGE DATABASES

Let us represent the random time of the i-th query execution as

    ti = Σ_{j=1..Ji} Σ_{k=1..|Rj|} ξijk,   (8)

here Ji is the number of tables taking part in the i-th query execution, |Rj| is the number of records of the j-th table, and ξijk ≥ 0 is the random time of processing the k-th record from the j-th table during execution of the i-th query.
Let us further assume for simplicity that the database is synthetic. Then we can derive from the synthetic dataset characteristics 1-3 (see Paragraph 3) that the probability distribution function (PDF) of a random variable ξijk does not depend on k, and the number of records in the tables is proportional to the SF factor (even for the intermediate tables produced by joins). The number of records resulting from a join of some tables m and n into the j-th table will be equal to

    SF·|Rj1| = (SF·|Rm1| · SF·|Rn1|) / max(SF·I(Rm1, a), SF·I(Rn1, a)),   (9)

here |Rz1| is the number of records in table z given SF=1, z ∈ (m, n, j), and I(Rz1, a) is the cardinality (unique values count) of the join attribute a in the Rz1 table, z ∈ (m, n).
Formula (8) can then be expressed in the following way:

    ti = Σ_{j=1..Ji} Σ_{k=1..SF·|Rj1|} ξijk.   (10)
Based on characteristic 2 of the synthetic databases, let us consider the random variables ξij independent. These variables are bounded on both sides, so the conditions of the Lyapunov theorem are satisfied (Zukerman, 2019). Given the numerous addends in (10), the PDF of ti will be close to the normal distribution. The mathematical expectation and variance of the query execution time can be derived from (10) in the following form:

    E(ti) = SF · Σ_j |Rj1| · E(ξij) = SF·E1(ti),   (11)

    Var(ti) = SF · Σ_j |Rj1| · Var(ξij) = SF·Var1(ti),   (12)

here E1(ti) and Var1(ti) are the mathematical expectation and variance of the query execution time for SF=1.
The confidence interval for an arbitrary query execution time t can be calculated with the following formula:

    |ti − E(ti)| ≤ kγ · sqrt(Var(ti)),   (13)

here γ is the confidence level and kγ is the γ quantile: the 0.95 quantile = 1.645, the 0.99 quantile = 2.326, the 0.999 quantile = 3.090.
From (11), (12), (13) we derive:

    E(ti)·(1 − kγ·sqrt(Var1(ti)) / (sqrt(SF)·E1(ti))) ≤ ti ≤ E(ti)·(1 + kγ·sqrt(Var1(ti)) / (sqrt(SF)·E1(ti))).   (14)

An arbitrary query set is used for the model calibration, so the regression formula obtained with the Least Squares Method (LSM) is E(t) = y = x + c1, here x is the modelling value and c1 is some constant. If the time t has a normal distribution, then LSM and MLE (maximum likelihood estimation) give the same result (Seber et al., 2012).
From (14) we derive:

    x·(1 + c1/x)·(1 − c2/sqrt(SF)) ≤ t ≤ x·(1 + c1/x)·(1 + c2/sqrt(SF)),   (15)

here c2 = max_i [kγ·sqrt(Var1(ti)) / E1(ti)].
Provided SF and x are large, we derive from (15) that the query execution time t corresponds well to the modelling value x. This confirms the distribution of the “experiment vs. model” points in Fig. 10.
Real datasets have many correlations and uneven data distributions. The developed model, though, should not lose its adequacy on real data. The query execution time (8) has a normal distribution even if the ξijk random variables correlate, provided the maximum correlation coefficient tends to 0 as the distance between elements increases (Seber et al., 2012). We can also relax the uniform distribution requirement for the data and use SF = min_i Σ_j |Rj|, which is determined by the data stored in the database.
The overall point distribution in Fig. 10 corresponds to the results described in (Leis et al., 2015) for query execution on a real database (please see the left column of Fig. 8 in (Leis et al., 2015)). Please note that those diagrams were plotted for non-calibrated cost models.

6 CONCLUSION

We developed a mathematical model of the Spark processes based on sub-models of connected parallel processes (Fig. 9). The model can help predict SQL query execution time based on the query schema. Fig. 2 and Fig. 4 provide schema construction examples, and Table 1 shows how to do it for other queries.
Based on the experimental results (50 points overall), the model parameters were calibrated and the model adequacy was evaluated. The coefficient of determination of the linear regression approximation is R2=0.966, which shows good model accuracy for high modelling time values. It was shown that for modelling times over 10 seconds the points are concentrated close to the y=x line (Fig. 10), and 77% of these points have a relative modelling error below 30%. This is satisfactory for predicting query execution time in a distributed parallel system which requires time estimation, e.g. for a DaaS environment, or which performs comparison and selection of query implementation options, i.e. for query optimization. The model gives acceptable accuracy even with imprecise intermediate table cardinalities. This is important since, unlike with relational databases, calculation of the precise cardinality in a distributed environment requires a complete table analysis.

REFERENCES

Akdere, M. et al. (2012). Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, IEEE, pp. 390-401.
Armbrust, M. et al. (2015). Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1383-1394.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, Vol. 13, No. 7, pp. 422-426.
Burdakov, A., Ermakov, E., Panichkina, A., Ploutenko, A., Grigorev, U., Ermakov, O., & Proletarskaya, V. (2019). Bloom Filter Cascade Application to SQL Query Implementation on Spark. In 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), IEEE, pp. 187-192.
Chi, Y., Moon, H. J. and Hacigümüş, H. (2011). iCBS: incremental cost-based scheduling under piecewise linear SLAs. Proceedings of the VLDB Endowment, Vol. 4, No. 9, pp. 563-574.
Date, C. J., and Darwen, H. (1993). A Guide to the SQL Standard (Vol. 3). Reading: Addison-Wesley.
Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Conference on Operating System Design and Implementation, Berkeley, CA.
Ganapathi, A. et al. (2009). Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, IEEE, pp. 592-603.
Guirguis, S. et al. (2009). Adaptive scheduling of web transactions. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on, IEEE, pp. 357-368.
Mishra, C. and Koudas, N. (2009). The design of a query monitoring system. ACM Transactions on Database Systems (TODS), Vol. 34, No. 1.
Leis, V. et al. (2015). How good are query optimizers, really? Proceedings of the VLDB Endowment, Vol. 9, No. 3, pp. 204-215.
Mistrík, I., Bahsoon, R., Ali, N., Heisel, M., & Maxim, B. (Eds.). (2017). Software Architecture for Big Data and the Cloud. Morgan Kaufmann.
Odersky, M., Spoon, L., & Venners, B. (2008). Programming in Scala. Artima Inc.
Seber, G. A., and Lee, A. J. (2012). Linear Regression Analysis (Vol. 329). John Wiley & Sons.
Tarkoma, S., Rothenberg, C. and Lagerspetz, E. (2012). Theory and practice of Bloom filters for distributed systems. IEEE Communications Surveys and Tutorials, Vol. 14, No. 1, pp. 131-155.
Tozer, S., Brecht, T. and Aboulnaga, A. (2010). Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, IEEE, pp. 397-408.
TPC org. (2019). Documentation on TPC-H performance tests, tpc.org. [Online]. Available: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf. [Accessed: Sept. 22, 2019].
Vavilapalli, V. K., et al. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, p. 5.
Wasserman, T. J. et al. (2004). Developing a characterization of business intelligence workloads for sizing new database systems. In Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, ACM, pp. 7-13.
Wu, W. et al. (2013). Predicting query execution time: Are optimizer cost models really unusable? In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, IEEE, pp. 1081-1092.
Xiong, P. et al. (2011). ActiveSLA: a profit-oriented admission control framework for database-as-a-service providers. In Proceedings of the 2nd ACM Symposium on Cloud Computing, ACM, p. 15.
Zukerman, M. (2019). Introduction to Queueing Theory and Stochastic Teletraffic Models. [Online]. Available: http://www.ee.cityu.edu.hk/~zukerman/classnotes.pdf. [Accessed: Sept. 22, 2019].
