Project 2
EXPLORATION
SC3020 DATABASE SYSTEM PRINCIPLES
TOTAL MARKS: 100
Real-world users may write a sequence of SQL queries to explore the underlying
relational database for a specific task. For instance, a user may start with a SQL query
Q, execute it, browse the results and then modify Q to Q’ (e.g., by modifying certain
predicates in the WHERE clause), reexecute it, and view the refined results. Such
exploration can go on iteratively by executing a sequence of related SQL queries and
browsing the corresponding results repeatedly. The following example shows Q and Q’;
the only change is the additional predicate in the WHERE clause of Q’.
Query Q:
    select *
    from customer C, orders O
    where C.c_custkey = O.o_custkey
Query Q’:
    select *
    from customer C, orders O
    where C.c_custkey = O.o_custkey
    and C.c_name like '%cheng'
The RDBMS query optimizer will select a query execution plan (QEP) to process each
such SQL query during exploration. For instance, there will be two QEPs, P and P’,
associated with Q and Q’, respectively, in the above example. These QEPs are typically
displayed as a tree structure by DBMS software (e.g., PostgreSQL).
Unfortunately, to an end user who is not proficient in database technology, this may not
be the best way to understand how each of her queries has been executed during
data exploration.
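As an illustration of the tree-structured QEP mentioned above, the following minimal sketch obtains a plan from PostgreSQL with EXPLAIN (FORMAT JSON) and prints it as an indented tree. The function names fetch_plan and render_tree are hypothetical, the connection is assumed to be a DB-API connection (for example from psycopg2), and the sample plan is hand-written in the shape PostgreSQL emits rather than taken from a real run.

```python
import json


def fetch_plan(conn, query):
    """Return the root plan node of `query` as a dict.

    Assumption: `conn` is a DB-API connection to PostgreSQL, e.g. one
    created with psycopg2.connect(). EXPLAIN plans the query without
    executing it.
    """
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + query)
        raw = cur.fetchone()[0]
    # Depending on the driver, the JSON column arrives as text or parsed.
    doc = json.loads(raw) if isinstance(raw, str) else raw
    return doc[0]["Plan"]


def render_tree(node, depth=0):
    """Return the plan as a list of indented lines, one per operator."""
    label = node["Node Type"]
    if "Relation Name" in node:
        label += " on " + node["Relation Name"]
    lines = ["  " * depth + label]
    for child in node.get("Plans", []):
        lines.extend(render_tree(child, depth + 1))
    return lines


# Hand-written sample plan (an assumption, not real optimizer output).
sample = {
    "Node Type": "Hash Join",
    "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "orders"},
        {"Node Type": "Hash",
         "Plans": [{"Node Type": "Seq Scan", "Relation Name": "customer"}]},
    ],
}
print("\n".join(render_tree(sample)))
```

With a live database, render_tree(fetch_plan(conn, Q)) would print the plan for a query Q in the same form.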
Project 2/SC3020
Hint: Design an algorithm that efficiently identifies the parts of the query plan trees
that have changed, and explain those changes to the end user in a combination of visual
and natural-language form, connecting them with the changes to the SQL.
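One possible starting point for the hint above is sketched below: it walks two plan trees in parallel and emits one sentence per difference. This is only a sketch under stated assumptions, not the expected solution. Children are paired by position, so an operator inserted in the middle of a plan would be reported as a series of replacements; a tree-matching algorithm would handle that case better. The names diff_plans, describe and CONDITION_KEYS are hypothetical, and both sample plans are hand-written in the shape of EXPLAIN (FORMAT JSON) output.

```python
# Condition attributes compared at each node. The keys are ones PostgreSQL
# emits in EXPLAIN (FORMAT JSON); choosing exactly these is an assumption.
CONDITION_KEYS = ("Filter", "Index Cond", "Hash Cond", "Merge Cond", "Join Filter")


def describe(node):
    """Short human-readable name of a plan node."""
    text = node["Node Type"]
    if "Relation Name" in node:
        text += " on " + node["Relation Name"]
    return text


def diff_plans(old, new, path="root"):
    """Return sentences describing where plan `new` departs from plan `old`."""
    changes = []
    if describe(old) != describe(new):
        changes.append(f"At {path}: {describe(old)} was replaced by {describe(new)}.")
    for key in CONDITION_KEYS:
        before, after = old.get(key), new.get(key)
        if before == after:
            continue
        if before is None:
            changes.append(f"At {path}: {describe(new)} now applies {key} {after}.")
        elif after is None:
            changes.append(f"At {path}: {describe(new)} no longer applies {key} {before}.")
        else:
            changes.append(f"At {path}: {key} changed from {before} to {after}.")
    old_kids, new_kids = old.get("Plans", []), new.get("Plans", [])
    # Positional pairing of children: a deliberate simplification.
    for i, (a, b) in enumerate(zip(old_kids, new_kids)):
        changes.extend(diff_plans(a, b, f"{path}/{i}"))
    for gone in old_kids[len(new_kids):]:
        changes.append(f"Under {path}: subtree {describe(gone)} was removed.")
    for added in new_kids[len(old_kids):]:
        changes.append(f"Under {path}: subtree {describe(added)} was added.")
    return changes


# Hand-written plans standing in for P and P' (assumed, not real output).
p_old = {"Node Type": "Hash Join", "Plans": [
    {"Node Type": "Seq Scan", "Relation Name": "orders"},
    {"Node Type": "Hash", "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "customer"}]}]}
p_new = {"Node Type": "Hash Join", "Plans": [
    {"Node Type": "Seq Scan", "Relation Name": "orders"},
    {"Node Type": "Hash", "Plans": [
        {"Node Type": "Seq Scan", "Relation Name": "customer",
         "Filter": "((c_name)::text ~~ '%cheng'::text)"}]}]}

for sentence in diff_plans(p_old, p_new):
    print(sentence)
```

The path in each sentence (for example root/1/0) identifies the node in the tree, which a GUI could use to highlight the changed operator next to the modified predicate in the SQL.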
You should use Python as the host language on the Windows platform for your project.
Students using a Mac can install Windows on it by following the
instructions at https://support.apple.com/en-sg/HT201468. The DBMS allowed in this
project is PostgreSQL. The example dataset you should use for this project is TPC-H
(see Appendix). You are free to use any off-the-shelf toolkits for your project.
Note that several parts of the project are intentionally left open-ended (e.g., what
should the GUI look like? Which functionalities should it support? How should you
explain the changes to an end user?) so that the project does not curb a group’s
creative endeavors. You are free to make realistic assumptions to achieve these tasks.
SUBMISSION REQUIREMENTS
• You should submit three program files: interface.py, explain.py, and project.py.
The file interface.py contains the code for the GUI. The explain.py contains the
code for generating the explanation. The project.py is the main file that invokes
all the necessary procedures. Note that we shall be running the project.py file
(either from the command prompt or using the PyCharm IDE) to execute the
software. Make sure your code follows good coding practice: sufficient
comments, proper variable/function naming, etc. We will execute the software
to check its correctness using different query sets and datasets, which also tests the
generality of the solution. We will also check the quality of the algorithm design with
respect to the processing of the query plans.
• Softcopy report containing details of the software including formal descriptions of
the key algorithms with examples. You should also discuss limitations of the
software (if any).
• Peer assessment report from each member of the team. Each individual
member of a team needs to assess the contributions of the other group members.
Details of the peer assessment form will be provided closer to the submission date.
• You must submit a document containing instructions to run your software
successfully. You will not receive any credit if your software fails to execute
based on your instructions.
• All submissions will be through NTU Learn.
Appendix
1) Go to
http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
and download TPC-H Tools v2.18.0.zip. Note that the version may differ, as the tool
may have been updated by the developer.
2) Unzip the package. You will find a folder “dbgen” in it.
3) To generate an instance of the TPC-H database:
Open tpch.vcproj in Visual Studio.
Build the tpch project. When the build is successful, a command prompt will
appear with “TPC-H Population Generator <Version 2.17.3>” and several *.tbl
files will be generated. You should expect the following .tbl files: customer.tbl,
lineitem.tbl, nation.tbl, orders.tbl, part.tbl, partsupp.tbl, region.tbl, supplier.tbl
Save these .tbl files as .csv files.
These .csv files contain an extra “|” character at the end of each line. These
“|” characters are incompatible with the format that PostgreSQL
expects. Write a small piece of code to remove the last “|” character in
each line. Now you are ready to load the .csv files into PostgreSQL.
Open up PostgreSQL. Add a new database “TPC-H”.
Create new tables for “customer”, “lineitem”, “nation”, “orders”, “part”,
“partsupp”, “region” and “supplier”
Import the relevant .csv file into each table. Note that pgAdmin4 for PostgreSQL
(Windows version) allows you to perform the import easily. You can select to view
the first 100 rows to check whether the import has been done correctly.
If you encounter an error (e.g., ERROR: extra data after last expected column)
while importing, create the columns of each table before importing. Note
that the type of each column has to be set appropriately. You may use the
SQL commands in Appendix II to create the tables.
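The step above that asks for a small piece of code to remove the last “|” character could look like the following minimal sketch. The function names strip_trailing_pipe and convert_all are hypothetical, and the sketch assumes the .tbl files sit in one folder and writes a cleaned .csv file next to each of them, which combines the "save as .csv" and "remove the last |" steps.

```python
from pathlib import Path


def strip_trailing_pipe(src, dst):
    """Copy `src` to `dst`, removing the final "|" of each line.

    Returns the number of lines written. Fields remain separated by "|",
    so the import into PostgreSQL still needs "|" as its delimiter.
    """
    count = 0
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            line = line.rstrip("\r\n")
            if line.endswith("|"):
                line = line[:-1]
            fout.write(line + "\n")
            count += 1
    return count


def convert_all(folder):
    """Write a cleaned .csv beside every .tbl file in `folder`."""
    for tbl in sorted(Path(folder).glob("*.tbl")):
        strip_trailing_pipe(tbl, tbl.with_suffix(".csv"))
```

Calling convert_all on the folder that holds the generated files (for example the dbgen folder) leaves the original .tbl files untouched, so the conversion can be repeated safely.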
II. SQL commands for creating TPC-H data tables
Region table
1) Nation table
2) Part table
3) Supplier table
4) Partsupp table
5) Customer table
6) Orders table
7) Lineitem table