Project 2

The document describes Project 2 which involves writing a program to automatically generate user-friendly explanations of how query execution plans change during data exploration using SQL queries. Students are asked to design algorithms to identify changes in query plan trees and explain them to users using natural language and visuals. They must implement the program in Python using PostgreSQL and submit code files, a report, and peer assessments. The task aims to help non-technical users understand how their SQL queries are executed.

PROJECT 2: UNDERSTANDING QUERY PLANS DURING DATA EXPLORATION
SC3020 DATABASE SYSTEM PRINCIPLES
TOTAL MARKS: 100

Due Date: April 16, 2023; 11:59 PM

Real-world users may write a sequence of SQL queries to explore the underlying
relational database for a specific task. For instance, a user may start with a SQL query
Q, execute it, browse the results and then modify Q to Q’ (e.g., by modifying certain
predicates in the WHERE clause), re-execute it, and view the refined results. Such
exploration can go on iteratively by executing a sequence of related SQL queries and
browsing the corresponding results repeatedly. The following example shows Q and Q’,
where Q’ adds a predicate on the customer name to the WHERE clause of Q.

Query Q:
select *
from customer C, orders O
where C.c_custkey = O.o_custkey

Query Q’:
select *
from customer C, orders O
where C.c_custkey = O.o_custkey
and customer.name like '%cheng'

The RDBMS query optimizer will generate a query execution plan (QEP) to process each
such SQL query during exploration. For instance, there will be two QEPs, P and P’,
associated with Q and Q’, respectively, in the above example. These QEPs are typically
displayed as a tree structure by DBMS software (e.g., PostgreSQL).
Unfortunately, to an end user who is not proficient in database technology, this may not
be the best way to understand how each of her queries has been executed during
data exploration.
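
To make the tree structure concrete, here is a minimal sketch that prints a plan as indented text. It assumes the plan is a nested dictionary in the shape PostgreSQL produces for EXPLAIN (FORMAT JSON), with the keys "Node Type", "Relation Name" and "Plans". The example plan is hand-written for illustration and is not real optimizer output. The function name render_plan is hypothetical.

```python
def render_plan(node, depth=0):
    """Return the plan rooted at `node` as a list of indented text lines."""
    label = node["Node Type"]
    if "Relation Name" in node:
        label += " on " + node["Relation Name"]
    lines = ["  " * depth + "-> " + label]
    # Child operators are stored under the "Plans" key, if there are any.
    for child in node.get("Plans", []):
        lines.extend(render_plan(child, depth + 1))
    return lines


# Hand-written example (an assumption, not output from a real database).
example_plan = {
    "Node Type": "Hash Join",
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "orders"},
        {"Node Type": "Hash", "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "customer"},
        ]},
    ],
}

print("\n".join(render_plan(example_plan)))
```

A display like this is still the raw tree view that the paragraph above describes as hard for end users to read. It is only a starting point for the explanations the project asks for.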

Your task is to write a program that automatically generates a user-friendly explanation
(e.g., natural and visual language description) of the changes to the query execution
plans that take place during data exploration. Specifically, let P1, P2, …, Pn be the QEPs
generated by the DBMS for executing a sequence of queries Q1, Q2, …, Qn,
respectively, during data exploration. Note that the queries are related as they have
evolved from the original query Q1. Hence, the QEPs may also share common content
among themselves. Your task is to generate a user-friendly description of the way the
plans have evolved during data exploration (e.g., a hash join in P1 has now evolved to a
sort-merge join in P2 due to changes in the WHERE clause in Q2).
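
As an illustration of what "evolved" can mean at the level of plan trees, the sketch below compares two plans position by position and lists the operators that differ. This is a deliberately naive baseline and not a prescribed method: it assumes plans are nested dictionaries with "Node Type" and "Plans" keys (the shape of PostgreSQL's JSON plan output), and it misreports changes when a subtree merely moves. The function name diff_plans and both example plans are hypothetical.

```python
def diff_plans(old, new, path="root"):
    """List (path, old_type, new_type) for every position where the plans differ."""
    changes = []
    if old["Node Type"] != new["Node Type"]:
        changes.append((path, old["Node Type"], new["Node Type"]))
    old_children = old.get("Plans", [])
    new_children = new.get("Plans", [])
    # Compare children pairwise by position.
    for i, (o, n) in enumerate(zip(old_children, new_children)):
        changes.extend(diff_plans(o, n, f"{path}/{i}"))
    # Children that exist in only one plan are reported as removed or added.
    for i in range(len(new_children), len(old_children)):
        changes.append((f"{path}/{i}", old_children[i]["Node Type"], None))
    for i in range(len(old_children), len(new_children)):
        changes.append((f"{path}/{i}", None, new_children[i]["Node Type"]))
    return changes


# Hand-written plans for illustration only.
p1 = {"Node Type": "Hash Join", "Plans": [
    {"Node Type": "Seq Scan"},
    {"Node Type": "Hash", "Plans": [{"Node Type": "Seq Scan"}]},
]}
p2 = {"Node Type": "Merge Join", "Plans": [
    {"Node Type": "Sort", "Plans": [{"Node Type": "Seq Scan"}]},
    {"Node Type": "Sort", "Plans": [{"Node Type": "Seq Scan"}]},
]}

for change in diff_plans(p1, p2):
    print(change)
```

A more robust design would match subtrees by content (for example, by the relations they scan) before comparing them, so that a reordered join is not reported as many unrelated changes.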

Project 2/SC3020
Hint: Design an algorithm to efficiently identify the parts of a plan that have evolved in the
query plan trees, and explain those to the end user using a combination of visual and
natural language forms, connecting them with the changes to the SQL.
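
One possible way to find the SQL changes that the hint refers to is a token-level diff of the two query strings. The sketch below uses Python's standard difflib module. Splitting on whitespace is a simplifying assumption (a real solution would likely use a SQL parser), and the function name sql_changes is hypothetical.

```python
import difflib


def sql_changes(old_sql, new_sql):
    """Return (removed, added): lists of token runs that differ between the queries."""
    old_tokens = old_sql.split()
    new_tokens = new_sql.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.append(" ".join(old_tokens[i1:i2]))
        if tag in ("insert", "replace"):
            added.append(" ".join(new_tokens[j1:j2]))
    return removed, added


# The Q and Q' example from the start of this document.
q1 = "select * from customer C, orders O where C.c_custkey = O.o_custkey"
q2 = q1 + " and customer.name like '%cheng'"

removed, added = sql_changes(q1, q2)
print("removed:", removed)
print("added:", added)
```

The added and removed fragments can then be shown next to the plan differences, so the user sees which edit to the query accompanied which change in the plan.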

To this end, your tasks are as follows:

• Design and implement an algorithm that takes the following as input:

a. Old query Q1 and its QEP P1
b. New query Q2 and its QEP P2

It generates a user-friendly description of what has changed from P1 to P2, and
why. Your goal is to ensure generality of the solution (i.e., it can handle a wide
variety of query plans on different database instances), and the user-friendly
explanation should be concise without sacrificing important information related
to the plan. The better the algorithm design for the task, the more credit you
will receive. Similarly, the more functionalities you support, the more credit you
will receive.

• Design and implement a user-friendly graphical user interface (GUI) that enables the
aforementioned goals.
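
The description step can start from simple sentence templates. The sketch below is one hypothetical shape for it. The function name describe_change, its parameters, and the wording of the sentences are all assumptions. Note that it only says a SQL change "may be related" to a plan change, because a template alone cannot establish why the optimizer chose differently.

```python
def describe_change(path, old_type, new_type, added_sql=None):
    """Turn one plan difference into a plain-English sentence."""
    if old_type is None:
        sentence = f"A new {new_type} step appears at {path}"
    elif new_type is None:
        sentence = f"The {old_type} step at {path} is no longer needed"
    else:
        sentence = f"The {old_type} at {path} has been replaced by a {new_type}"
    if added_sql:
        # Hedged wording: the link to the SQL edit is a guess, not a proof.
        sentence += f', which may be related to the added SQL "{added_sql}"'
    return sentence + "."


print(describe_change("root", "Hash Join", "Merge Join",
                      added_sql="and customer.name like '%cheng'"))
```

Explaining the "why" convincingly would need more information from the plan itself, such as the estimated costs and row counts that PostgreSQL reports for each node.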

You should use Python as the host language on the Windows platform for your project.
Students using the Mac platform can install Windows on their Mac by following the
instructions at https://support.apple.com/en-sg/HT201468. The DBMS allowed in this
project is PostgreSQL. The example dataset you should use for this project is TPC-H
(see Appendix). You are free to use any off-the-shelf toolkits for your project.
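
As a sketch of how a Python program might obtain a QEP from PostgreSQL, the function below runs EXPLAIN (FORMAT JSON) through a DB-API connection and returns the root plan node. It assumes the caller supplies an open connection, for example one created with the psycopg2 toolkit; the handout does not prescribe a driver, so that choice is an assumption. The function name fetch_plan is hypothetical.

```python
import json


def fetch_plan(conn, sql):
    """Run EXPLAIN (FORMAT JSON) on `sql` and return the root plan node as a dict.

    `conn` is assumed to be an open DB-API connection to PostgreSQL.
    """
    cur = conn.cursor()
    try:
        # EXPLAIN without ANALYZE only plans the query; it does not execute it.
        cur.execute("EXPLAIN (FORMAT JSON) " + sql)
        raw = cur.fetchone()[0]
    finally:
        cur.close()
    # Some drivers return the JSON column already parsed, others as text.
    doc = json.loads(raw) if isinstance(raw, str) else raw
    # The result is a one-element array whose object holds the tree under "Plan".
    return doc[0]["Plan"]
```

With psycopg2 installed, usage might look like conn = psycopg2.connect(dbname="TPC-H", user=..., password=...) followed by fetch_plan(conn, "select * from customer"). The connection parameters are placeholders for your own setup.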

Note that several parts of the project are intentionally left open-ended (e.g., what
should the GUI look like? Which functionalities should it support? How should you explain
the changes to an end user?) so that the project does not curb a group’s creative
endeavors. You are free to make realistic assumptions to achieve these tasks.

SUBMISSION REQUIREMENTS

Your submission should include the following:

• You should submit three program files: interface.py, explain.py, and project.py.
The file interface.py contains the code for the GUI. The file explain.py contains the
code for generating the explanation. The file project.py is the main file that invokes
all the necessary procedures. Note that we shall be running the project.py file
(either from the command prompt or using the PyCharm IDE) to execute the
software. Make sure your code follows good coding practice: sufficient
comments, proper variable/function naming, etc. We will execute the software
using different query sets and datasets to check its correctness and the
generality of the solution. We will also check the quality of the algorithm design w.r.t.
the processing of the query plans.
• A softcopy report containing details of the software, including formal descriptions of
the key algorithms with examples. You should also discuss limitations of the
software (if any).
• A peer assessment report from each member of the team. Each individual
member of a team needs to assess the contributions of the group members. Details
of the peer assessment form will be provided closer to the submission date.
• You must submit a document containing instructions to run your software
successfully. You will not receive any credit if your software fails to execute
based on your instructions.
• All submissions will be through NTU Learn.

Note: Late submission will be penalized.

Appendix

I. Creating TPC-H database in PostgreSQL


Follow these steps to generate the TPC-H data:

1) Go to
http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
and download TPC-H Tools v2.18.0.zip. Note that the version may differ as the tool
may have been updated by the developer.
2) Unzip the package. You will find a folder “dbgen” in it.
3) To generate an instance of the TPC-H database:
• Open tpch.vcproj using Visual Studio.
• Build the tpch project. When the build is successful, a command prompt will
appear with “TPC-H Population Generator <Version 2.17.3>” and several *.tbl
files will be generated. You should expect the following .tbl files: customer.tbl,
lineitem.tbl, nation.tbl, orders.tbl, part.tbl, partsupp.tbl, region.tbl, supplier.tbl
• Save these .tbl files as .csv files.
• These .csv files contain an extra “|” character at the end of each line. These
“|” characters are incompatible with the format that PostgreSQL
expects. Write a small piece of code to remove the last “|” character in
each line. Now you are ready to load the .csv files into PostgreSQL.
• Open PostgreSQL. Add a new database “TPC-H”.
• Create new tables for “customer”, “lineitem”, “nation”, “orders”, “part”,
“partsupp”, “region” and “supplier”.
• Import the relevant .csv file into each table. Note that pgAdmin4 for PostgreSQL
(Windows version) allows you to perform the import easily. You can select to view
the first 100 rows to check whether the import has been done correctly.
If you encounter an error (e.g., ERROR: extra data after last expected column)
while importing, create the columns of each table first before importing. Note
that the type of each column has to be set appropriately. You may use the
SQL commands in Appendix II to create the tables.
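
The "small piece of code" for removing the trailing “|” could look like the sketch below. The function names and the file names in the usage note are hypothetical. The sketch only removes the final “|” on each line, so the fields in the output are still separated by “|” and that character needs to be chosen as the delimiter when importing.

```python
def strip_trailing_pipe(line):
    """Remove the line ending and one trailing '|' (if present) from a line."""
    body = line.rstrip("\r\n")
    return body[:-1] if body.endswith("|") else body


def convert_tbl(src_path, dst_path):
    """Copy a dbgen .tbl file to dst_path without the trailing '|' on each line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(strip_trailing_pipe(line) + "\n")
```

For example, convert_tbl("region.tbl", "region.csv") would write the cleaned region data, assuming region.tbl is in the current directory.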

Alternatively, you can also refer to https://docs.verdictdb.org/tutorial/tpch/ for
additional help on creating the TPC-H database.

II. SQL commands for creating TPC-H data tables

Region table

1) Nation table

2) Part table

3) Supplier table

4) Partsupp table

5) Customer table

6) Orders table

7) Lineitem table
