Project 2
Database Design and ETL
Out: October 3rd, 2018
Due: October 23rd, 2019, 11:59 P.M.
1 Introduction: What Is This Project All About?
This project is about putting it all together. Given a large amount of unstructured airline
data, we want you to create a working database of that data.
ETL stands for Extract, Transform, Load, a process for unifying multiple sources of
complementary data stored in different formats. Companies spend a lot of money on this every
year, because the problem is just slippery enough to escape the grasp of most algorithms,
leaving the problem to us, the DBAs.
2 Goal
Before we jump into explaining the individual components, here’s a broad overview of what
we’d like you to do for this project:
• Model the data in the system as both an E-R diagram and as SQL. There are a few
caveats:
1. Your model must contain all of the data we supply you with (unless otherwise
specified). You are not allowed to omit any fields, and any actual data we provide
should be reflected in your database. The one exception is in cases of data
integrity issues, which will be discussed in more depth later.
2. The resulting schema must be in BCNF or 3NF (this shouldn’t be too difficult,
as the few FDs are pretty clear).
3. For the SQL, we will look for more than naive table creation: this means labeling
your primary keys, foreign keys, constraints, etc.
4. You will be importing your schema into a SQLite database using standard SQL
constructs.
• Write an application, import, which will use the various CSV files to populate a
SQLite database.
• Write an application, query, which will make pre-defined queries against the SQLite
database and print the results to the console.
To help you out, what follows is a basic overview of the data contained in the various
files you will be working with. Read it over carefully and be on the lookout for structural
elements to incorporate into your design.
Note: This overview may not fully explain all of the nuances of the data: you are
encouraged to look at the files yourselves (CSVs are human-readable) to better understand
them. You should be able to take all of this data and enter it into a database of your design,
avoiding redundancies.
3.1 airlines.csv
This file contains basic information on all of the airlines. There are two fields: the first is
a code that is unique to the airline (e.g., YX) and the second is the name of the airline (e.g.,
Republic Airlines). Note that not all airlines may have flight data associated with them.
3.2 airports.csv
This file contains information on all of the airports. There are two fields: the first is a code
that corresponds uniquely to a particular airport and the second is the full, canonical name
of the airport. Note that not all airports may have flight data associated with them.
3.3 flights.csv
This file contains information on every flight, limited to a single month of data (note that
your design should still be able to accommodate data from other months and/or years!).
• A code that corresponds uniquely to a particular airport (in this case, the origin)
Some other constraints to consider are that the arrival time of a flight should be strictly
later than its departure time, and that certain variables such as delay times should
not be negative.
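One way to apply these constraints during import is to check each row before it is inserted. The following is a minimal sketch only: the class name, the method name, and the choice of LocalDateTime values and whole-minute delays as parameters are assumptions for illustration, and deciding which fields must be non-negative is still up to you.

```java
import java.time.LocalDateTime;

// Hypothetical helper: names and parameter types are assumptions, not stencil code.
class FlightRowCheck {

    // True when arrival is strictly after departure and no supplied delay is negative.
    static boolean isValid(LocalDateTime departure, LocalDateTime arrival, int... delayMinutes) {
        if (!arrival.isAfter(departure)) {
            return false; // arrival is equal to or earlier than departure
        }
        for (int delay : delayMinutes) {
            if (delay < 0) {
                return false; // negative delay
            }
        }
        return true;
    }

    public static void main(String[] args) {
        LocalDateTime dep = LocalDateTime.of(2012, 1, 15, 9, 30);
        System.out.println(isValid(dep, dep.plusHours(2), 0, 15, 0, 0)); // prints true
        System.out.println(isValid(dep, dep, 0, 0, 0, 0));               // prints false
    }
}
```

The same rules could also be expressed as CHECK constraints in the table definition; doing the check in the application as well lets the importer skip a bad row without failing a whole batch.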
3.6 Schema
Your database should be modeled after the following schema. We’ve only labeled primary
keys, so make sure to include other constraints in your SQL table creation:
airlines(airline_id, airline_code, airline_name)
airports(airport_id, airport_code, airport_name, city, state)
flights(flight_id, airline_id, flight_num, origin_airport_id,
dest_airport_id, departure_dt, depart_diff, arrival_dt,
arrival_diff, cancelled, carrier_delay, weather_delay, air_traffic_delay, security_delay)
Note: The schema we provide here uses the additional airline_id and airport_id columns as
primary keys. Each primary key is just a unique integer that is automatically
assigned to each tuple using the AUTOINCREMENT keyword in the DDL. In addition
to this, you need to ensure that airline_code and airport_code are unique.
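As an illustration of the note above, the definition of the airlines table might look something like the sketch below. Only the column names, the AUTOINCREMENT key and the uniqueness requirement come from this handout; the column types, the NOT NULL constraints, and the class and method names are assumptions. Obtaining the Connection is left to your own code.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: class name, column types and NOT NULL are assumptions.
class AirlinesTable {

    static final String CREATE_AIRLINES =
        "CREATE TABLE IF NOT EXISTS airlines ("
        + " airline_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        + " airline_code TEXT NOT NULL UNIQUE,"
        + " airline_name TEXT NOT NULL"
        + ")";

    // Runs the DDL on an already-open connection to the SQLite database.
    static void create(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(CREATE_AIRLINES);
        }
    }

    public static void main(String[] args) {
        // Print the statement so it can be pasted into the sqlite3 shell for testing.
        System.out.println(CREATE_AIRLINES);
    }
}
```

The other tables would follow the same pattern, with foreign keys and any further constraints added where your design calls for them.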
4 The Applications
4.1 import
This script is designed to load data from CSV files for flights, airports, and airlines,
normalize it, and create a SQL database containing the information. It should be callable from
the command line as ./import.
./import \
/course/cs1270/pub/etl/airports.csv \
/course/cs1270/pub/etl/airlines.csv \
/course/cs1270/pub/etl/flights.csv \
~/course/cs1270/etl/data.db
The script is not strictly required to output any information. However, verbose error
messages are encouraged to aid in debugging.
If you find yourself repeatedly typing this long command, you can run:
alias im="./import \
/course/cs1270/pub/etl/airports.csv \
/course/cs1270/pub/etl/airlines.csv \
/course/cs1270/pub/etl/flights.csv \
~/course/cs1270/etl/data.db"
As long as you remain in the same shell session, you can now run im, and it will be equivalent
to running the long-form command.
4.2 query
This script should make pre-defined queries against a specified SQL database. The query
to be executed will be specified via the command line. If the query requires input from
the user (e.g., a name, a start/end date, etc.), that information will also be passed in via the
command line.
./query \
~/course/cs1270/etl/data.db \
query1
Given those inputs, the application should execute Query #1 (queries are defined and
numbered below) against the SQLite database at ~/course/cs1270/etl/data.db.
A more complex call for the program might look like this:
./query \
~/course/cs1270/etl/data.db \
query8 \
"Southwest Airlines Co." \
2101 \
01/01/2012 \
01/31/2012
The script is expected to output the results of the query in CSV format (omitting any
“header row”). The expected input and output columns for each query are described in
more detail below.
4.3.2 query
To test your query application, cd to your ETL directory and run /usr/bin/ant test.
Your application will be run using various inputs and compared to outputs from the TA
solution code. The test cases can be found in /course/cs1270/pub/etl/tests. Your
application should pass all of the test cases: this is one of the major ways your handin will be
evaluated.
If you believe that the script is returning incorrect results, please feel free to contact the
TAs. Be sure to provide relevant lines of code so the TAs can evaluate your objection.
You can also test your SQL code outside of the Java application using the test database.
See Appendix for details.
5 Queries
You will need to design SQL queries for your database that answer the following questions.
Unless otherwise noted, all queries should be composed of a single SQL statement.
Input: N/A
Output: One column. Number of airport codes.
Note: You can check the correct output at /course/cs1270/pub/etl/tests/0001/output
Input: N/A
Output: One column. Number of airline codes.
Note: You can check the correct output at /course/cs1270/pub/etl/tests/0002/output
Input: N/A
Output: One column. Number of flights.
8. If I had wanted to get from one city to another on a specific day (flight must have
taken off and landed on the specified day), what were my options if I limited myself
to one hop (aka: a direct flight)? Results should be sorted by total flight duration,
lowest to highest, and then sorted alphabetically by airline code, A-Z. Remember that
we’re looking at historical data: as such, we’re interested in actual departure/arrival
times, inclusive of delays.
9. Same as above, but for two hops. Results should be sorted by total duration, then
sorted alphabetically by airline code for each hop, and then sorted by the actual
depart time of the first hop, from the earliest to the latest.
10. Same as above, but for three hops. Results should be sorted by total duration, then
sorted alphabetically by airline code for each hop, and then sorted by the actual
depart time of the first hop, from the earliest to the latest.
The final column should indicate the total travel time in minutes, from departure
of the first flight to arrival of the last.
Note: The city, state restriction from Query 10 still holds.
The directory contains the build file build.xml. This enables automation in compiling
your project. To compile, while in that directory type /usr/bin/ant. This automatically
includes the support code in your classpath when compiling.
The directory is also an Eclipse project. That means students using Eclipse as their IDE
should be able to import the project into their workspace using Eclipse’s File → Import
functionality.
Libraries are included as JARs in the lib/ directory. Your code should go in src/.
2. Open Eclipse. From the top menu bar, navigate to File → Import.
3. From there, expand the “General” tab, and select “Existing Projects into Workspace.”
4. Click the “Browse” button next to “Select root directory” and browse to the etl
directory inside your course directory. Click OK.
5. Check the box next to the project (if it isn’t already checked) and click Finish.
8 Tips
8.1 INSERT OR IGNORE in SQLite
The stencil code suggests that students enable foreign key constraint checking by calling
PRAGMA foreign_keys = ON. This is important for ensuring the correctness of your code
and we highly recommend that students do it. After executing that statement, SQLite will
enforce foreign key constraints across all future statements using the same connection.
However, there is a cost associated with that constraint checking. If you are using batch
inserts and any row in the batch violates a foreign key constraint, every row in the batch
will fail to be inserted into the table. We suggested using INSERT OR IGNORE as a
workaround: ideally, that would mean bad rows would be ignored and the rest of the rows
would be inserted. However, it turns out that INSERT OR IGNORE does not work with
foreign key constraints (see http://www.sqlite.org/lang_conflict.html if you’re interested).
So what is a CS127 student to do? Well, you can validate your foreign key constraints at the
application level! Before adding a new row to be inserted, make sure that any foreign key
constraints are satisfied (either via a SQL query to the corresponding table or via a lookup
data structure in your application). If you’ve done that properly, the database should never
complain about a foreign key violation.
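The lookup-data-structure approach could be sketched as follows. This is only one possible shape: the class and method names are assumptions for illustration, and it assumes airports are inserted (and their generated ids recorded) before any flights are processed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class and method names are assumptions, not stencil code.
class AirportLookup {

    // Maps airport_code -> airport_id for every airport already inserted.
    private final Map<String, Integer> idsByCode = new HashMap<>();

    // Record an airport once its row has been inserted into the airports table.
    void register(String airportCode, int airportId) {
        idsByCode.put(airportCode, airportId);
    }

    // Returns the airport_id for a code, or null when the code is unknown.
    Integer idFor(String airportCode) {
        return idsByCode.get(airportCode);
    }

    // A flight row is safe to add to the batch only if both airports are known.
    boolean canInsertFlight(String originCode, String destCode) {
        return idsByCode.containsKey(originCode) && idsByCode.containsKey(destCode);
    }

    public static void main(String[] args) {
        AirportLookup airports = new AirportLookup();
        airports.register("JFK", 1);
        airports.register("LGA", 2);
        System.out.println(airports.canInsertFlight("JFK", "LGA")); // prints true
        System.out.println(airports.canInsertFlight("JFK", "XXX")); // prints false
    }
}
```

A second map keyed on airline code would cover the airline foreign key in the same way, and idFor supplies the integer ids needed for the flights rows.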
Note: Think about what datatype is most appropriate for the given field,
e.g., the pros and cons of using TEXT as opposed to CHAR(n) or VARCHAR(n).
The strftime() function returns the date formatted according to the format string given as
its first argument. The second argument is the time string, which can be followed by one or
more modifiers to get a different result.
For example, SELECT strftime('%Y-%m-%d %H:%M:%S', 'now') returns the current date and
time as a formatted text string. And here is a complete list of valid strftime() substitutions:
%d day of month: 01-31
%f fractional seconds: SS.SSS
%H hour: 00-24
%j day of year: 001-366
%J Julian day number
%m month: 01-12
%M minute: 00-59
%s seconds since 1970-01-01
%S seconds: 00-59
%w day of week 0-6 with Sunday==0
%W week of year: 00-53
%Y year: 0000-9999
%% %
Also, you might need to do string concatenation, and “||” is the string concat operator
in SQLite instead of “+”, which is more commonly seen in other languages.
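SQLite's date functions, strftime() included, only understand time strings in ISO-8601 style formats such as YYYY-MM-DD, so a date that arrives in another form has to be converted before it is stored or compared. A minimal sketch of that conversion in Java is shown below. The class and method names are assumptions, and the MM/dd/yyyy input pattern is inferred from the example arguments in section 4.2 (e.g., 01/31/2012); check the actual data before relying on it.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: the class name and the MM/dd/yyyy input pattern are assumptions.
class DateArgs {

    private static final DateTimeFormatter INPUT = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // Converts e.g. "01/31/2012" to "2012-01-31", a form SQLite date functions accept.
    static String toIso(String mmddyyyy) {
        return LocalDate.parse(mmddyyyy, INPUT).format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(toIso("01/31/2012")); // prints 2012-01-31
    }
}
```

LocalDate.parse throws a DateTimeParseException on malformed input, which is one place a data integrity problem can surface during import.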
9 Handin
We expect the following components to be included in your handin (this is a reiteration of
the Goal section of the handout):
• An E-R Diagram of your design.
• Your import application.
• Your query application.
• A README file, describing any bugs in your code (or asserting their absence)
You can hand in your project by running the following command from the directory
containing all your files:
/course/cs1270/bin/cs127_handin etl
10 Q&A
Here are some FAQs. If you have any questions, check this section first to see whether there
is an answer here.
• Are we being graded on coding style, efficiency and commenting for our
importer code for the ETL?
No, but if your importer is incorrect then having comments and neat code would help
your grader in allotting partial credit where it is due.
• When creating the airports db, should we use a JOIN to add the city and
state to the airports table or is it preferable to do that within Java code?
You can use Java data structures or SQL to do that; it’s up to you! However, it may be
easier to use Java data structures such as HashMaps or Sets.
• I find I don’t have the permission to read output files in the test.
That is intentional. Running “/usr/bin/ant test” should give you a good indication
of whether your queries are correct but will not reveal what is missing/extra in your
query results if they do not match up - that’s for you to find!
• For some airports, we cannot find their cities and states. What’s their
expected value in the table?
NULL or empty string.
• For query 7, if the airport name doesn’t exist, such as ‘ABCD’, should we
output a line like (‘ABCD’, 0, 0) for it, or just omit it?
Omit it.
• For Query 6, you are expected to get ALL airlines, along with the number of flights
for that airline for the given day. The question is not “get the airlines that have a
flight on the given day”. Therefore, an airline could have 0 flights for a day and you
would want to have that in your output.
• Should we check if the user inputs the correct number of args for our
queries?
Not necessary. You can assume that the correct number of args is always given.
• For the middle hop in query 10, would the hop for XXX → JFK then LGA
→ YYY be a valid path from XXX to YYY since JFK and LGA are both
in New York, NY?
No. You always have to depart from the airport you just arrived at.
11 Appendix
11.1 Test database
In addition to the test scripts mentioned above, there is a test database that contains the
results of all the queries given certain parameters. This will allow you to test your SQL
queries outside of the Java application using those parameters, which may be helpful in
isolating bugs. The test database is located at
/course/cs1270/pub/etl/test.db
1. N/A
2. N/A
3. N/A
4. N/A
5. Month: 2
Day: 1
Year: 2012
6. Month: 2
Day: 1
Year: 2012
Airports: ‘LaGuardia’, ‘Washington Dulles International’, ‘Logan International’
12 Final Words
Good luck, and as always, feel free to ask the TAs any questions!