[Figure 1: DeepO system overview — Users, Web UI (New Query, Estimated Cost), Query Optimizer with Result Ranker with Confidence, Query Executor, Query Plan Dataset, PostgreSQL, Execution Plans.]
our demo shows (i) how the cost learner embeds the query plan and learns the cost, and (ii) what the optimization options are and how to get them.
(3) We offer a web UI to interact with users and demonstrate the improvements brought by DeepO. Our preliminary results show that DeepO is able to reduce query execution time compared to the baseline optimizer. We have open-sourced DeepO; it is available at https://github.com/RUC-AIDB/DeepO.

2 SYSTEM DESCRIPTION
We first demonstrate the system workflow of DeepO, then introduce its core component, the Cost Learner.

2.1 DeepO Workflow
As shown in Figure 1, DeepO consists of two phases: an offline cost learning phase, in which historical query processing logs are collected and the learning-based Cost Learner is trained, and an online optimization phase, in which DeepO provides users with query optimization hints and corresponding estimated costs.

During the offline cost learning phase, DeepO collects query execution plan logs from PostgreSQL. For example, given a SQL statement, the EXPLAIN ANALYZE command in PostgreSQL can be used to retrieve the tree-structured query execution plan as well as the actual running cost. DeepO takes these plan-cost pairs as training data for later cost learning.
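A minimal sketch of this collection step, assuming the psycopg2 driver and PostgreSQL's JSON EXPLAIN format (the function is illustrative, not DeepO's actual implementation):

```python
import psycopg2  # assumed driver; any PostgreSQL client works

def collect_plan_cost_pair(conn, sql):
    """Run EXPLAIN (ANALYZE, FORMAT JSON) and return (plan_tree, latency_ms)."""
    with conn.cursor() as cur:
        cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {sql}")
        explain_output = cur.fetchone()[0]  # JSON document with one entry
    plan_tree = explain_output[0]["Plan"]             # tree-structured plan
    latency = explain_output[0]["Execution Time"]     # actual running cost (ms)
    return plan_tree, latency
```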
We train the Cost Learner on the collected data in a supervised way. This process takes in executed query plans and gives out a trained model. It involves three steps: operator embedding, plan embedding, and cost learning. In the first step, all operators of the query plans are embedded into dense and representative feature vectors. Different embedding strategies are designed for leaf operators and non-leaf ones in order to capture underlying and complicated query characteristics. In the second step, we transform the embedded operators into a tree-structured format according to the query execution plan. In the cost learning step, the embedded plans are used as input to train the neural network. We adopt a tree-structured network [7] to process the query plan vectors, and we modify the network with a Bayesian neural network [1]. By doing so, we can offer confidence-aware cost estimation for query plans.

In the online optimization phase, DeepO first generates a set of candidate optimization hints for new queries. At present, DeepO offers three kinds of optimizations: scan methods (i.e., Sequential Scan and Index Scan), join methods (i.e., different kinds of join methods such as Hash Join), and join orders (including the order of different tables and the order of inner and outer tables). We utilize the pg_hint_plan extension (https://pghintplan.osdn.jp/pg_hint_plan.html) to control execution plans with hint phrases describing our optimization methods.
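As an illustration, candidate hints of these three kinds could be enumerated as pg_hint_plan comment blocks along these lines (a sketch: the aliases t1/t2 and the enumeration strategy are assumptions, though SeqScan, IndexScan, HashJoin, MergeJoin, NestLoop, and Leading are actual pg_hint_plan hint names):

```python
from itertools import product

# Candidate hint phrases for the three optimization kinds DeepO considers.
SCAN_HINTS = ["SeqScan(t1)", "IndexScan(t1)"]
JOIN_HINTS = ["HashJoin(t1 t2)", "MergeJoin(t1 t2)", "NestLoop(t1 t2)"]
ORDER_HINTS = ["Leading((t1 t2))", "Leading((t2 t1))"]

def candidate_hints():
    """Enumerate hint combinations as pg_hint_plan comment blocks."""
    for scan, join, order in product(SCAN_HINTS, JOIN_HINTS, ORDER_HINTS):
        yield f"/*+ {scan} {join} {order} */"

# A hinted query is the hint block prepended to the original SQL, e.g.
# "/*+ SeqScan(t1) HashJoin(t1 t2) Leading((t1 t2)) */ SELECT ..."
```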
After DeepO gets the optimization hint candidates, it generates multiple query plans in parallel using the candidates. The plans are sent to the Cost Learner to obtain an estimated cost and the confidence of the estimation. The Result Ranker then returns the top-K optimization hints that have the smallest estimated cost with high confidence. In the web UI, users can apply the hint with the minimal cost directly, or check the recommended hints and decide which one to use. The web UI also allows users to compare the real performance of query execution plans under different optimization options. Every time a query is executed by PostgreSQL, DeepO again collects the run-time performance of the optimized plan and adds it to the plan dataset for further model updates. In case of a change of database schema (e.g., new tables or attributes), DeepO automatically falls back to PostgreSQL's optimizer and adds the resulting plan to the plan dataset to retrain the model. In this way, as DeepO optimizes more queries, the Cost Learner continues to learn and constantly improves its performance.
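A simplified sketch of the Result Ranker's selection rule (the uncertainty threshold is an assumed hyper-parameter, not a value from the paper):

```python
def rank_hints(scored, k=3, max_uncertainty=0.2):
    """Keep confident estimates, then order by estimated cost.

    `scored` is a list of (hint, est_cost, uncertainty) triples produced by
    the Cost Learner for the candidate plans.
    """
    confident = [s for s in scored if s[2] <= max_uncertainty]
    confident.sort(key=lambda s: s[1])  # smallest estimated cost first
    return confident[:k]                # top-K hints shown in the web UI
```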
2.2 Cost Learner Internals
DeepO treats the cost estimation task as a supervised learning problem. During the offline training process, it takes query plans as input and the actual execution latency as the label.
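In outline, this supervised training could look as follows (a sketch assuming an MSE objective on latency and the Adam optimizer; the paper does not specify these choices):

```python
import torch

def train_cost_learner(model, plan_dataset, epochs=10, lr=1e-3):
    """Supervised training sketch: embedded plan in, actual latency as label.

    `model` is the BNN tree network described below; `plan_dataset` yields
    (embedded_plan, latency_tensor) pairs built from the collected logs.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for plan, latency in plan_dataset:
            optimizer.zero_grad()
            predicted = model(plan)
            loss = torch.nn.functional.mse_loss(predicted, latency)
            loss.backward()
            optimizer.step()
```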
[Figure 2: Embedding process and network structure of the Cost Learner — (a) Plan Tree; (b) Operator & Plan Embedding; (c) BNN Tree Network.]
The Cost Learner is the core of DeepO, and Figure 2 shows an example of the embedding process and network structure of the learner.
Operator embedding & Plan embedding. An abstract representation of a query execution plan is a physical operator tree, as illustrated in Figure 2(a). We divide the operators of a query plan into two categories according to their characteristics and position in the tree: leaf operators and non-leaf ones. The leaf operators are Scan operators, usually with multiple predicates; the non-leaf operators are Join, Sort, etc. We design different embedding strategies for them in order to capture underlying and complicated query characteristics. For leaf operators, the Scan predicates can be regarded as a set of words, so we treat them as a sentence and represent each non-numeric word using one-hot encoding. Since the range filter conditions on columns vary largely, we scale the numbers into the range [0, 1] using min-max normalization. Then we feed these sequential embedded vectors into an embedding network. After that, leaf operators are represented as dense, information-rich vectors of low dimension (green and blue in Figure 2(b)).
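A simplified sketch of this leaf encoding (the vocabulary handling and per-column min/max bookkeeping are assumptions; DeepO's actual featurization may differ):

```python
import torch

def encode_leaf(predicate_tokens, vocab, col_min, col_max):
    """Encode a Scan operator's predicate, e.g. ["a1", ">", 100].

    Non-numeric tokens are one-hot encoded against a vocabulary; numeric
    literals are min-max normalized into [0, 1].
    """
    vectors = []
    for tok in predicate_tokens:
        if isinstance(tok, (int, float)):  # numeric literal
            vectors.append(torch.tensor([(tok - col_min) / (col_max - col_min)]))
        else:  # column name, comparison operator, or string constant
            one_hot = torch.zeros(len(vocab))
            one_hot[vocab[tok]] = 1.0
            vectors.append(one_hot)
    return vectors  # fed to the embedding network for a dense representation
```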
For non-leaf operators, the encoded vectors consist of two parts. First, the operator kind is encoded into a unique one-hot vector. Then, the additional information about the operator (e.g., the Join condition) is encoded into a vector of length T, where T is the number of columns in the database, and the corresponding bit is set to 1 when the column appears in the conditions (see Figure 2(b)). In order to offer dense and coordinated input to the cost learning network, we add a fully connected layer that encodes the vectors of non-leaf operators into the same length as the leaf embeddings. After the operator embedding phase, an encoded tree with the same structure as the input query plan is obtained, which we call the plan embedding.
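The two-part non-leaf encoding, followed by the fully connected projection, could look roughly like this (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

NUM_OPERATORS, NUM_COLUMNS, EMB_DIM = 16, 64, 32  # illustrative sizes

to_dense = nn.Linear(NUM_OPERATORS + NUM_COLUMNS, EMB_DIM)  # FC layer

def encode_non_leaf(op_id, condition_columns):
    """Two-part encoding: operator one-hot ++ column bitmap of length T."""
    op_vec = torch.zeros(NUM_OPERATORS)
    op_vec[op_id] = 1.0
    col_vec = torch.zeros(NUM_COLUMNS)      # T = number of columns in the DB
    col_vec[condition_columns] = 1.0        # set bits for condition columns
    return to_dense(torch.cat([op_vec, col_vec]))  # same length as leaf emb.
```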
BNN-based Tree Network. We adopt a tree-structured network [7] in order to capture the propagation process of the query plan, and we employ a Bayesian neural network (BNN) [1] to deploy the network. BNN serves as a bridge between deep learning and Bayesian inference: in a BNN, the network weights are sampled from a probability distribution, and we optimize the distribution parameters rather than keeping deterministic weights. With a BNN, it is therefore possible to give prediction intervals and measure the confidence of the predictions. To do so, we replace the deterministic parameters of the fully-connected layers by sampling the parameters of the layer on each feed-forward operation with the following equation:

ŷ = f_w(x),  w ∼ p_ω(ω)    (1)

where x is the layer input, ŷ is the layer output, and ω is the hyper-parameter of the probability distribution p.
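A minimal PyTorch sketch of such a layer, in the style of Bayes-by-Backprop [1] (the Gaussian parameterization and initialization values are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    """Fully connected layer with weights sampled per Equation (1).

    Weights are drawn from a learned Gaussian on every forward pass; the
    distribution parameters (mu, rho) are optimized instead of point weights.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.w_rho = nn.Parameter(torch.full((out_dim, in_dim), -3.0))

    def forward(self, x):
        sigma = torch.log1p(torch.exp(self.w_rho))       # softplus keeps sigma > 0
        w = self.w_mu + sigma * torch.randn_like(sigma)  # w ~ p_omega(omega)
        return x @ w.t()

# Repeated forward passes yield a distribution of outputs, from which a
# prediction interval (and hence a confidence estimate) can be derived.
```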
The BNN-based Tree Network unit (Figure 2(c)) has gates and a memory cell, and it can process tree-structured input. It makes full use of information from lower layers to upper layers by transferring the cell memory and letting hidden states affect the gate weights. For each operator of the encoded plan, a tree unit is applied to perform a local children-to-parent aggregation. As the figure shows, the hidden vectors of all children (H_l and H_r) as well as the embedded vector of the current parent V_p are combined and transformed into a new hidden vector. The hidden vector generated by the unit is finally fed into a BNN-based fully connected layer to produce the estimated cost and the estimation confidence for the query plan.
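A recursive sketch of this bottom-up aggregation (the `tree_unit` interface, the `zero_state` helper, and the child-state summation are assumptions for illustration, in the spirit of the Tree-LSTM cell [7]):

```python
import torch

def aggregate(node, tree_unit):
    """Bottom-up pass: combine the children's hidden states with the parent's
    embedded vector V_p to produce the parent's hidden state.

    `node` is assumed to expose `embedding` and `children`; `tree_unit` is a
    gated cell returning a (hidden, cell) pair.
    """
    child_states = [aggregate(child, tree_unit) for child in node.children]
    h_children = (torch.stack([h for h, _ in child_states]).sum(dim=0)
                  if child_states else tree_unit.zero_state())
    c_children = (torch.stack([c for _, c in child_states]).sum(dim=0)
                  if child_states else tree_unit.zero_state())
    return tree_unit(node.embedding, h_children, c_children)  # (H_p, C_p)
```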
[Figure 3: Demonstration of Plan Embedding]

3 EXPERIMENTS AND DEMONSTRATION
In this section, we introduce the user interface of our demonstration and show the preliminary performance of DeepO. We implement the Cost Learner of DeepO using PyTorch (https://pytorch.org/). Our experiments are carried out on the IMDB dataset [2]. We use queries from MSCN (https://github.com/andreaskipf/learnedcardinalities); after the offline cost learning process, we set the connection configuration of PostgreSQL and launch the optimization service with the learned model. Users can use the web UI to optimize queries interactively with DeepO and compare the performance between DeepO and other optimizers.
[Figure 4: Candidate optimization options]

In the SQL Query Tool page, users can set the connection configurations of PostgreSQL and enter the SQL they would like to optimize. As Figure 4 demonstrates, after a query is submitted, several optimization options are shown on the page, and users can select the default or any option they want.

After several queries are optimized and executed, we can compare their execution performance in the Plan Optimization Analysis page. As demonstrated in Figure 5, besides execution time, we also visualize the plans using pev2 (https://github.com/dalibo/pev2).

5 ACKNOWLEDGEMENT
This work was partially supported by the National Key Research and Development Plan under Grant 2018YFB1004401, the National Natural Science Foundation of China under Grants 62072460, 62076245, and 62172424, and the Beijing Natural Science Foundation under Grant 4212022.

REFERENCES
[1] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight Uncertainty in Neural Network. In International Conference on Machine Learning. PMLR, 1613–1622.
[2] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? Proc. VLDB Endow. 9, 3 (2015), 204–215.
[3] Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, and Tim Kraska. 2021. Bao: Making Learned Query Optimization Practical. In Proceedings of the 2021 International Conference on Management of Data.
[4] Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd C. Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-Driving Database Management Systems. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017.
[5] Luming Sun, Cuiping Li, Tao Ji, and Hong Chen. 2021. MOSE: A Monotonic Selectivity Estimator Using Learned CDF. IEEE Transactions on Knowledge and Data Engineering (2021).
[6] Ji Sun and Guoliang Li. 2019. An End-to-End Learning-Based Cost Estimator. In VLDB 2019.
[7] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL.