Predicting SQL Query Execution Time With A Cost Model For Spark Platform
All content following this page was uploaded by Aleksey Burdakov on 10 May 2020.
Aleksey Burdakov1, Viktoria Proletarskaya1, Andrey Ploutenko2, Oleg Ermakov1 and Uriy Grigorev1
1 Informatics and Control Systems, Bauman Moscow State Technical University, Moscow, Russia
2 Mathematics and Informatics, Amur State University, Blagoveschensk, Russia
[email protected], [email protected], [email protected], [email protected], [email protected]
Keywords: SQL, Apache Spark, Bloom Filter, TPC-H Test, Big Data, Cost Model.
Abstract: The paper proposes a cost model for predicting query execution time in a distributed parallel system that requires time estimation. The estimation is paramount for running a DaaS environment or for building an optimal query execution plan. The model represents a SQL query as nested stars; each star includes dimension tables, a fact table, and a Bloom filter. Bloom filters can substantially reduce network traffic in the Shuffle phase and cut join time in the Reduce stage of query execution in Spark. We propose an algorithm for generating a query implementation program. The developed model was calibrated and its adequacy evaluated (50 points). The obtained coefficient of determination R2=0.966 demonstrates good model accuracy even with non-precise intermediate table cardinalities. For modelling times over 10 seconds, 77% of points have a modelling error below 30%. A theoretical evaluation of the model supports the modelling and experimental results for large databases.
Figure 6: Original query star implementation general schema.
Figure 7: Spark-created Processes Example.
3. Read the fact table, filter records with the PF condition and with Bloom filters, obtain the projection ({fki}, kF, wF).
4. Join the filtered fact table with the filtered dimension tables (dfF Join dfDi). Group and sort if applicable to the particular “star” schema. Jump to a new star in the query schema.

Let us denote the PSP group with a line corresponding to the group interval definition (see Fig. 8). A parent PSP creates descendant PSPs, e.g. descendant PSP PP is created by parent PSP PD, and so on. Let us call a set of PSP groups connected parallel processes (CPP). For example, the PD, PP and PB PSPs form a CPP with identifier P (see Fig. 8).
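Steps 3 and 4 above can be sketched outside Spark as plain Python. The toy `BloomFilter` class and the one-dimension star data below are hypothetical illustrations, not the paper's implementation: a Bloom filter built on the filtered dimension keys pre-filters the fact table before the join, which is what cuts shuffle traffic in the real pipeline.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a fixed-size bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False positives are possible, false negatives are not.
        return all(self.bits[p] for p in self._positions(item))

# Hypothetical star: one fact table and one dimension table.
dim = [{"d_key": k, "region": r} for k, r in
       [(1, "EU"), (2, "EU"), (3, "US"), (4, "ASIA")]]
fact = [{"f_key": k, "amount": a} for k, a in
        [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]]

# Filter the dimension table and build a Bloom filter on its join keys.
df_dim = [row for row in dim if row["region"] == "EU"]
bf = BloomFilter()
for row in df_dim:
    bf.add(row["d_key"])

# Step 3: read the fact table, drop records whose key cannot match
# (in Spark this happens before the Shuffle phase).
df_fact = [row for row in fact if bf.might_contain(row["f_key"])]

# Step 4: hash-join the pre-filtered fact table with the dimension table.
dim_index = {row["d_key"]: row for row in df_dim}
joined = [{**f, **dim_index[f["f_key"]]}
          for f in df_fact if f["f_key"] in dim_index]
print(sorted(j["f_key"] for j in joined))
```

The join itself re-checks key membership, so a Bloom-filter false positive only costs wasted work, never a wrong result.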
3. RF (rF): read fact table F({fki}, kF, wF) split; filter(PF), filter(BF1, fk1), …, filter(BFn, fkn); produces dfF.
4. B (b): DR1 = broadcast(dfD1), …, DRL = broadcast(dfDL) and hashing.
5. C (c): dfF.filter(DR1)…filter(DRL) – HashJoin; produces dfFH.
6. X1 (x1): dfDL+1 – sort, shuffle write; …; Xn−L (xn−L): dfDn – sort, shuffle write; Xn−L+1 (xn−L+1): dfFH – sort, shuffle write.
7. Y1 (y1): shuffle read (X1, Xn−L+1), sort (and join), sort, shuffle write; Y2 (y2): shuffle read (X2, Y1), sort (and join), sort, shuffle write; …; Yn−L (yn−L): shuffle read (Xn−L, Yn−L−1), sort (and join).
8. Z1 (z1, if GROUP BY): agg, or (shuffle write (Yn−L), shuffle read, agg).
9. Z2 (z2, if ORDER BY): sort, shuffle write (Z1), shuffle read, sort.

Figure 9: Implementation processes description for sub-queries and joins corresponding to the original query star.
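The chained Y stages above (each Y reads one shuffled dimension and the previous Y's output, then sort-merge joins them) can be sketched as follows. This is a simplified standalone illustration with hypothetical key-sorted inputs and one-to-one keys, not the Spark code itself.

```python
def merge_join(left, right):
    """Sort-merge join of two lists of (key, payload) tuples, both key-sorted."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl == kr:
            # Matching keys: concatenate payloads, advance both sides.
            out.append((kl, left[i][1] + right[j][1]))
            i += 1
            j += 1
        elif kl < kr:
            i += 1
        else:
            j += 1
    return out

# Hypothetical shuffled, key-sorted inputs: the hash-joined fact table
# (X_{n-L+1}) and two remaining dimension tables (X_1, X_2).
fact_sorted = [(1, ["f1"]), (2, ["f2"]), (3, ["f3"])]
dims_sorted = [
    [(1, ["dA1"]), (3, ["dA3"])],                 # X_1
    [(1, ["dB1"]), (2, ["dB2"]), (3, ["dB3"])],   # X_2
]

# Stage chain: Y_1 joins X_1 with the fact table, Y_2 joins X_2 with Y_1, ...
result = merge_join(dims_sorted[0], fact_sorted)   # Y_1
for dim in dims_sorted[1:]:
    result = merge_join(dim, result)               # Y_2, ...
print(result)
```

Each stage only needs its two inputs in key order, which is why every Y in the figure is preceded by a sort and a shuffle write/read pair.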
The optimal values of the calibrated model parameters were found in the following way: the outer cycle randomly selected a point inside a 9-dimensional parallelepiped (9 is the number of calibrated parameters), while the inner cycle performed error minimization by numerical methods with gradient descent. The error function equals the sum of squared differences between the execution times of ten experimental and modelled queries. Since the error function may have multiple minima and there is a possibility of going beyond the parameter ranges, the outer cycle was repeated 100 times. The sum of squared deviations of modelled time from the experimental measurements equalled 27079 for the 10 queries.

We developed a universal interface allowing setting up and calibrating an arbitrary model.

The second subset of experimental results contained query execution times for 40 query stages: stages 0, 1, 2, 3, 6, 7, 8 of query Q3 with database population parameters SF=500 (NBF=40 million), SF=250 (NBF=50 million), SF=100, SF=50, SF=10 (NBF=15 million); and stages 0, 2, (3+5), 4, (6+7) of query Q17 with database population parameter SF=500 (NBF=15 million).
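The calibration scheme above (random restarts inside a bounded box, inner gradient descent on the sum of squared errors) can be sketched as follows. Everything here is an assumed stand-in: a hypothetical two-parameter cost model and synthetic "measurements" replace the paper's nine-parameter Spark model and real timings.

```python
import random

# Hypothetical two-parameter cost model: a per-record term and a per-MB
# shuffle term (the paper calibrates nine such parameters).
def model_time(params, query):
    tau_rec, tau_mb = params
    return tau_rec * query["records"] + tau_mb * query["shuffle_mb"]

def error(params, experiments):
    # Sum of squared deviations of modelled time from measured time.
    return sum((model_time(params, q) - t) ** 2 for q, t in experiments)

def calibrate(experiments, bounds, restarts=100, steps=200, lr=2e-3):
    """Outer cycle: random restarts inside the parameter box.
    Inner cycle: gradient descent using a numerical gradient."""
    rng = random.Random(0)
    best, best_err = None, float("inf")
    for _ in range(restarts):
        p = [rng.uniform(lo, hi) for lo, hi in bounds]
        for _ in range(steps):
            base = error(p, experiments)
            grad = []
            for i, (lo, hi) in enumerate(bounds):
                h = (hi - lo) * 1e-6           # forward-difference step
                q = list(p)
                q[i] += h
                grad.append((error(q, experiments) - base) / h)
            # Descent step, projected back into the box so the search
            # never goes beyond the parameter ranges.
            p = [min(max(pi - lr * gi, lo), hi)
                 for pi, gi, (lo, hi) in zip(p, grad, bounds)]
        e = error(p, experiments)
        if e < best_err:
            best, best_err = p, e
    return best, best_err

# Synthetic "experiments": measured times generated by true parameters (2.0, 0.5).
experiments = [({"records": r, "shuffle_mb": s}, 2.0 * r + 0.5 * s)
               for r, s in [(1, 9), (3, 2), (5, 7), (7, 1), (9, 5)]]
bounds = [(0.0, 5.0), (0.0, 2.0)]
params, err = calibrate(experiments, bounds, restarts=10)
print(params, err)
```

With multiple restarts the best local minimum found is kept, which is how the scheme copes with a multi-minimum error surface.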
Stage execution times (the 40 measurements) were used for model adequacy evaluation.

Table 2: Model Calibration Parameters and their Optimal Values.

Calibrated Model Parameter | Lower Limit | Upper Limit | Optimal Value
τf – processor filtration time per record, s | 1.0E-06 | 1.0E-05 | 1.14E-06
τb – processor time for a record read/write from a Bloom filter, s | 1.0E-08 | 1.0E-07 | 2.07E-08
τs – sorting time per record, s | 1.0E-08 | 1.0E-07 | 2.11E-08
τd – deserialization processor time for a task in the Executor slot, s | 1.0E-06 | 1.0E-04 | 7.59E-05
τh – hashing time per record (for further comparison and aggregation), s | 1.0E-08 | 1.0E-06 | 4.27E-07
KS – coefficient corresponding to the serialization effect on the transmitted data volume during shuffle execution | 0.5 | 1.5 | 0.81
RH – HDFS file system data read intensity (MBps) | 20.0 | 50.0 | 44.1
WL – LFS local file system data write intensity (MBps) | 50.0 | 200 | 61.4
N1 – network switch data transmission intensity (MBps) | 100 | 500 | 186

A model vs. experiment scatter plot in Fig. 10 (50 points) shows all modelled and experimental query and stage execution times from the two subsets.

Figure 10: Modeling vs. Experiment (y – experiment, seconds).

The logarithmic scale is used for both axes: x1 = lg x, y1 = lg y. The regression dependency between y and x is expressed as y = 0.99x + 4 ≈ x + 4. On the logarithmic scale it becomes y1 = lg(10^x1 + 4). For a large enough x1 we get y1 = x1 and hence y = x; for x1 → −∞, y1 → lg 4 (the horizontal asymptote in Fig. 10). The coefficient of determination of this regression approximation of the experimental data is close to 1 (R2 = 0.966), which shows a very high modelling accuracy for large modelling time values (y = x in this case). Fig. 10 demonstrates that for values over 10 seconds the modelling accuracy is good (the dots are close to the y = x line). The relative modelling error (δ = 100·|TExperiment − TModeling| / TExperiment) for the points to the right of x = 10 (31 points) has the following distribution: 35% of the points have error δ ≤ 10%, 19% have 10% < δ ≤ 20%, 23% have 20% < δ ≤ 30%, 13% have 30% < δ ≤ 40%, and 10% have δ > 40%.

Fig. 10 shows that model parameter calibration allows building a good prognostic cost model for query execution time estimation for large databases even with non-precise cardinality values of intermediate tables (cardinality values are estimated on probability Pi in formula (4)). Below we provide a theoretical justification for this finding.

5 MODEL ADEQUACY THEORETICAL EVALUATION FOR LARGE DATABASES

Let us represent the random time of the i-th query execution as

$t_i = \sum_{j=1}^{J_i} \sum_{k=1}^{|R_j|} \xi_{ijk}$, (8)

here Ji is the number of tables taking part in the i-th query execution, |Rj| is the number of records in the j-th table, and ξijk ≥ 0 is the random processing time of the k-th record of the j-th table; |Rz1| is the number of records in table z given SF = 1, z ∈ (m, n, j), and I(Rz1, a) is the cardinality (unique values count) of join attribute a in table Rz1, z ∈ (m, n).

Formula (8) can be expressed in the following way:

$t_i = \sum_{j=1}^{J_i} \sum_{k=1}^{SF\,|R_{j1}|} \xi_{ij}$, (10)
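The additive per-record form of (8) and (10) is what makes the large-database argument work: the mean of the sum grows linearly with SF while its relative spread shrinks as 1/√SF. A small simulation illustrates this scaling; all numbers are assumed for illustration only (200 records at SF = 1, exponential per-record times with mean 10⁻⁴ s), not taken from the paper.

```python
import random

rng = random.Random(42)

def simulate_query_time(sf, records_sf1=200, mean_xi=1e-4):
    """One query run: sum of SF*|R_1| non-negative per-record times.
    The exponential per-record distribution is an assumption for
    illustration; the argument only needs bounded, non-negative terms."""
    return sum(rng.expovariate(1.0 / mean_xi) for _ in range(sf * records_sf1))

results = {}
for sf in (1, 10, 100):
    runs = [simulate_query_time(sf) for _ in range(30)]
    mean = sum(runs) / len(runs)
    var = sum((r - mean) ** 2 for r in runs) / (len(runs) - 1)
    results[sf] = (mean, var ** 0.5 / mean)  # (mean, relative spread)
    print(f"SF={sf:4d}  mean={mean:.4f}s  relative spread={results[sf][1]:.4f}")
```

The printed means grow roughly proportionally to SF, while the relative spread falls by about √10 per step, matching the 1/√SF behaviour used in the derivation below.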
Based on characteristic 2 of the synthetic databases, let us consider the random variables ξij independent. These variables are limited on both sides, so the conditions of the Lyapunov theorem are satisfied (Zukerman, 2019). Given the numerous summands in (10), the PDF of ti will be close to the normal distribution. The mathematical expectation and variance of query execution time can be derived from (10) in the following form:

$E(t_i) = SF \sum_{j=1}^{J_i} |R_{j1}|\, E(\xi_{ij}) = SF \cdot E_1(t_i)$, (11)

$Var(t_i) = SF \sum_{j=1}^{J_i} |R_{j1}|\, Var(\xi_{ij}) = SF \cdot Var_1(t_i)$, (12)

here E1(ti) and Var1(ti) are the mathematical expectation and variance of query execution time for SF = 1.

The confidence interval for an arbitrary query execution time t can be calculated with the following formula:

$|t - E(t)| \le k_\gamma \sqrt{Var(t)}$, (13)

here γ is the confidence level and kγ is the corresponding quantile: the 0.95 quantile equals 1.645, the 0.99 quantile equals 2.326, and the 0.999 quantile equals 3.090.

From (11), (12) and (13) we derive:

$E(t)\Bigl(1 - \frac{k_\gamma \sqrt{Var_1(t)}}{\sqrt{SF}\, E_1(t)}\Bigr) \le t \le E(t)\Bigl(1 + \frac{k_\gamma \sqrt{Var_1(t)}}{\sqrt{SF}\, E_1(t)}\Bigr)$, (14)

An arbitrary query set is used for model calibration, so the regression formula obtained with the Least Squares Method (LSM) is E(t) = y = x + c1, here x is the modelling value and c1 is some constant. If time t has a Normal Distribution, then LSM and MLE (maximum likelihood estimation) give the same result (Seber et al., 2012).

From (14) we derive:

$x\Bigl(1 + \frac{c_1}{x}\Bigr)\Bigl(1 - \frac{c_2}{\sqrt{SF}}\Bigr) \le t \le x\Bigl(1 + \frac{c_1}{x}\Bigr)\Bigl(1 + \frac{c_2}{\sqrt{SF}}\Bigr)$, (15)

here $c_2 = \max_i \bigl(k_\gamma \sqrt{Var_1(t_i)}\,/\,E_1(t_i)\bigr)$.

Provided SF and x are large, we derive from (15) that the query execution time t corresponds well with the modelling value x. This confirms the distribution of the “experiment vs. model” points in Fig. 10.

Real datasets have many correlations and uneven data distribution. The developed model, though, should not lose its adequacy with real data. Query execution time (8) has a Normal Distribution even if the ξijk random variables correlate, provided the maximum correlation coefficient tends to 0 as the distance between elements increases (Seber et al., 2012). We can also relax the uniform distribution requirement for the data and use $SF = \min_z (|R_z| / |R_{z1}|)$, which is determined by the data stored in the database.

The overall point distribution in Fig. 10 corresponds to the results described in (Leis et al., 2015) for query execution in a real database (see the left column of Fig. 8 in (Leis et al., 2015)). Please note that those diagrams were plotted for non-calibrated cost models.

6 CONCLUSION

We developed a mathematical model for Spark processes based on the sub-models of connected parallel processes (Fig. 9). The model can help predict SQL query execution time based on the query schema. Fig. 2 and Fig. 4 provide schema construction examples, and Table 1 shows how to do it for other queries.

Based on the experimental results (50 points overall), the model parameters were calibrated and its adequacy evaluated. The coefficient of determination for the linear regression approximation is R2 = 0.966, which shows good model accuracy for high modelling time values. It was shown that for modelling times over 10 seconds the points are concentrated close to the y = x line (Fig. 10); 77% of these points have relative modelling error < 30%. This is satisfactory for predicting query execution time in a distributed parallel system that requires time estimation, e.g. for a DaaS environment, or that performs comparison and selection of query implementation options, i.e. for query optimization. The model gives acceptable accuracy even with non-precise intermediate table cardinalities. This is important since, unlike with relational databases, calculating the precise cardinality in a distributed environment requires complete table analysis.

REFERENCES

Akdere, M. et al. (2012) Learning-based query performance modeling and prediction // Data Engineering (ICDE), 2012 IEEE 28th International Conference on. – IEEE, 2012. – pp. 390-401.
Armbrust, M. et al. (2015) Spark SQL: Relational data processing in Spark // Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. – ACM, 2015. – pp. 1383-1394.
Bloom, B. H. (1970) Space/time trade-offs in hash coding with allowable errors // Communications of the ACM. – 1970. – Vol. 13. – No. 7. – pp. 422-426.
Burdakov, A., Ermakov, E., Panichkina, A., Ploutenko, A., Grigorev, U., Ermakov, O., & Proletarskaya, V. (2019) Bloom Filter Cascade Application to SQL Query Implementation on Spark // 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). – IEEE, 2019. – pp. 187-192.
Chi, Y., Moon, H. J. and Hacigümüş, H. (2011) iCBS: incremental cost-based scheduling under piecewise linear SLAs // Proceedings of the VLDB Endowment. – 2011. – Vol. 4. – No. 9. – pp. 563-574.
Date, C. J. and Darwen, H. (1993) A Guide to the SQL Standard (Vol. 3). Reading: Addison-Wesley.
Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified data processing on large clusters // Proceedings of the Sixth Conference on Operating System Design and Implementation. – Berkeley, CA, 2004.
Ganapathi, A. et al. (2009) Predicting multiple metrics for queries: Better decisions enabled by machine learning // Data Engineering, ICDE'09, IEEE 25th International Conference on. – IEEE, 2009. – pp. 592-603.
Guirguis, S. et al. (2009) Adaptive scheduling of web transactions // Data Engineering, ICDE'09, IEEE 25th International Conference on. – IEEE, 2009. – pp. 357-368.
Leis, V. et al. (2015) How good are query optimizers, really? // Proceedings of the VLDB Endowment. – 2015. – Vol. 9. – No. 3. – pp. 204-215.
Mishra, C. and Koudas, N. (2009) The design of a query monitoring system // ACM Transactions on Database Systems (TODS). – 2009. – Vol. 34. – No. 1.
Mistrík, I., Bahsoon, R., Ali, N., Heisel, M., & Maxim, B. (Eds.) (2017) Software Architecture for Big Data and the Cloud. Morgan Kaufmann.
Odersky, M., Spoon, L., & Venners, B. (2008) Programming in Scala. Artima Inc.
Seber, G. A. and Lee, A. J. (2012) Linear Regression Analysis (Vol. 329). John Wiley & Sons.
Tarkoma, S., Rothenberg, C. and Lagerspetz, E. (2012) Theory and practice of Bloom filters for distributed systems // IEEE Communications Surveys and Tutorials. – 2012. – Vol. 14. – No. 1. – pp. 131-155.
Tozer, S., Brecht, T. and Aboulnaga, A. (2010) Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads // Data Engineering (ICDE), 2010 IEEE 26th International Conference on. – IEEE, 2010. – pp. 397-408.
TPC org. (2019) "Documentation on TPC-H performance tests", tpc.org. [Online]. Available: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf. [Accessed: Sept. 22, 2019].
Vavilapalli, V. K. et al. (2013) Apache Hadoop YARN: Yet another resource negotiator // Proceedings of the 4th Annual Symposium on Cloud Computing. – ACM, 2013. – p. 5.
Wasserman, T. J. et al. (2004) Developing a characterization of business intelligence workloads for sizing new database systems // Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP. – ACM, 2004. – pp. 7-13.
Wu, W. et al. (2013) Predicting query execution time: Are optimizer cost models really unusable? // Data Engineering (ICDE), 2013 IEEE 29th International Conference on. – IEEE, 2013. – pp. 1081-1092.
Xiong, P. et al. (2011) ActiveSLA: a profit-oriented admission control framework for database-as-a-service providers // Proceedings of the 2nd ACM Symposium on Cloud Computing. – ACM, 2011. – p. 15.
Zukerman, M. (2019) Introduction to Queueing Theory and Stochastic Teletraffic Models. [Online]. Available: http://www.ee.cityu.edu.hk/~zukerman/classnotes.pdf. [Accessed: Sept. 22, 2019].