Lecture 8 - CS50's Computer Science For Lawyers
Lecture 8
Spreadsheets
Databases
Data Types
SQL
Executing SQL Queries
Race Conditions
SQL Injection Attacks
Scalability
Spreadsheets
Let’s create a sample spreadsheet for users in a web application. The first row of a spreadsheet
generally includes a header.
Each column has a data type, like a number or some text. For the username and email fields, the
values are additionally unique.
We can also relate spreadsheets with one another by having some sort of common identifier; perhaps
we would like spreadsheets with customer information, invoices, products, or any number of other
types of data. To relate the data in these spreadsheets, we can use a customer ID.
We might start needing larger and larger spreadsheets, and eventually, we might have more rows of
data than these programs, like Microsoft Excel or Google Sheets, can handle. In that case, we might
want to use a database instead.
Databases
A database is a piece of software that can run on our operating systems. Commonly, however, it runs
on a server or somewhere else in the cloud to which our software connects.
Relational databases mimic the design of spreadsheets—data is stored in rows and columns, and we
get more and more rows as we have more and more data.
Instead of having spreadsheets, we have a database, and instead of calling individual tabs
sheets, we call them tables.
Additionally, we have to choose between various data types for each column. This will make
searching and sorting the data much more efficient.
Examples include Oracle, Microsoft Access, SQL Server, MySQL, Postgres, and SQLite.
Data Types
Some of the data types we can use in SQLite include integer, real, numeric, text, and blob; a small example using each appears after this list.
We can use an integer if we would like to represent something like 1, 2, 3, or a negative number.
We can use real if we’d like to store a floating point value.
We can use numeric if we would like to represent dates, times, or other types that are
numbers but have more formal structures.
We can use text if we have words, phrases, or even whole paragraphs.
We can use blob if we would like to store binary large objects, or binary data, like actual files!
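As a small illustration, here is how a table might declare one column of each of these types (the table and column names here are just illustrative):

CREATE TABLE files (       -- hypothetical table, just to show the types
    id INTEGER,            -- whole numbers, including negatives
    size REAL,             -- floating-point values
    created NUMERIC,       -- dates, times, and similar structured numbers
    name TEXT,             -- words, phrases, even whole paragraphs
    contents BLOB          -- binary large objects: raw binary data, like files
);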
Other databases support more than these data types, which allows the database to make smarter
decisions when storing this data, so when we query the data, the database can respond quickly.
In some databases, the integer type can be subdivided into smallint, integer, and bigint, where the smallint type takes up 2 bytes, the integer type takes up 4 bytes, and the bigint type takes up 8 bytes.
The real data type can be subdivided into real and double precision, where a real type takes up 4 bytes and double precision takes up 8 bytes.
The numeric data type can include boolean, date, datetime, numeric(precision, scale), time, and timestamp.
Note that text columns are often declared as varchar(255). Historically, many databases had a limit of 255 characters for a varchar, so this length is used very commonly.
These data types are not the only ones that can be used. Based on the data we’re storing, these
types can be different.
SQL
SQL, a database language, has some fundamental operations that allow us to work with data inside
databases. These operations include the following (a brief sketch of each appears after the list):
CREATE
SELECT
UPDATE, INSERT
DELETE, DROP
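As a rough sketch of what each of these looks like in practice (the users table and its columns here are hypothetical, just for illustration):

CREATE TABLE users (id INTEGER, username TEXT, email TEXT);                        -- create a new table
INSERT INTO users (id, username, email) VALUES (1, 'alice', '[email protected]');  -- add a row
SELECT * FROM users;                                                               -- read rows back out
UPDATE users SET email = '[email protected]' WHERE id = 1;                         -- change an existing row
DELETE FROM users WHERE id = 1;                                                    -- remove rows
DROP TABLE users;                                                                  -- remove the table entirely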
We will use SQLite, an implementation of SQL, which stores its tables locally, as a file. Additionally, a
number of programs can be used to communicate with a SQL database. We will use DB Browser,
which can also be installed on a Mac or PC.
When we open DB Browser, we’ll see a list of the database’s tables.
If we open a few tables, we can see the columns inside each.
We might note that we see a lot of columns with various IDs. This will help us avoid redundancy.
For example, let us reconsider our spreadsheet. We currently have the address of the user
“malan” as “33 Oxford Street, Cambridge, MA 02138.”
It would be rather inefficient for us to perform simple queries on that address column, like
searching for the zip code “02138.” We would have to search all of the values in the column and
also ignore the extraneous information stored in that address.
To fix this, we can change the address field to multiple fields. For example, we can have
this table instead:
username | name        | email             | street           | city      | state | zip
malan    | David Malan | [email protected] | 33 Oxford Street | Cambridge | MA    | 02138
Additionally, it turns out that multiple people have that address (we leave them unnamed here).
Then, this address would appear repeatedly in our table.
username | name        | email             | street           | city      | state | zip   | phone
malan    | David Malan | [email protected] | 33 Oxford Street | Cambridge | MA    | 02138 | 617-495-5000
…        | …           | …                 | 33 Oxford Street | Cambridge | MA    | 02138 | …
…        | …           | …                 | 33 Oxford Street | Cambridge | MA    | 02138 | …
…        | …           | …                 | 33 Oxford Street | Cambridge | MA    | 02138 | …
…        | …           | …                 | 33 Oxford Street | Cambridge | MA    | 02138 | …
To avoid this redundancy, we can create a separate table called “cities.” Inside this table,
we’ll have to include an ID to be able to link the previous spreadsheet to this spreadsheet.
We might format the cities table like this:
id | city      | state | zip
1  | Cambridge | MA    | 02138
The users table, now referring to each city by its ID, might look like this:

username | name        | email             | street           | city_id | phone        | age
malan    | David Malan | [email protected] | 33 Oxford Street | 1       | 617-495-5000 | …
…        | …           | …                 | 33 Oxford Street | 1       | …            | …
…        | …           | …                 | 33 Oxford Street | 1       | …            | …
…        | …           | …                 | 33 Oxford Street | 1       | …            | …
…        | …           | …                 | 33 Oxford Street | 1       | …            | …
It’s much more efficient for the computer to perform operations on and relate these small
numbers we’ve used for IDs.
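With this design, we can still recover a full address whenever we need one by joining the two tables on that ID. A sketch, assuming the tables are named users and cities and have the columns shown above:

SELECT username, street, city, state, zip
FROM users
JOIN cities ON users.city_id = cities.id
WHERE cities.zip = '02138';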
Back to our music database, we’ll notice that the Album table is related to the Artist table via an
ArtistId. This way, multiple albums can have the same artist without storing the artist’s name multiple
times.
Additionally, if an artist were to change their name, we can simply change their name in the Artist
table, without having to go through the entire Album table, searching for that artist.
For example, one of the rows in the Album table looks like this:

AlbumId | Title    | ArtistId
5       | Big Ones | 3
If we’re curious about who wrote "For Those About To Rock We Salute You," we’ll take note of the
ArtistId for this album, which is 1. Then, we can execute another SQL query:

SELECT * FROM Artist WHERE ArtistId = 1;

This is our result:
ArtistId | Name
1        | AC/DC
In this case, we’re joining the Album table and the Artist table. To join them, we’re matching
the ArtistId column of the Album table with the ArtistId column of the Artist table.
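A query along these lines performs that join, using the ArtistId columns just described:

SELECT * FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId;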
The first five rows of the result include every column from both tables. We might not want to see
all of the columns in the middle, though. Instead, we can write a query that returns just the
album’s Title and the artist’s Name.
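A sketch of such a query, reusing the join from above (Title is a column of the Album table, and Name a column of the Artist table):

SELECT Title, Name FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId;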
To return a table with album counts for each artist, we can join the two tables and group the joined rows by artist.
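A sketch of such a query, again reusing the join from above:

SELECT Name, COUNT(Name)
FROM Album
JOIN Artist ON Album.ArtistId = Artist.ArtistId
GROUP BY Name;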
We use the GROUP BY keyword to group albums with the same artist name together.
We use the COUNT(Name) function to return the number of times each artist’s name appears.
The first five rows look like this:

Name           | COUNT(Name)
AC/DC          | 2
…              | …
Aaron Goldberg | 1
…              | …
…              | …
To return the same table as above, but only for artists with counts greater than 1, we can add a HAVING clause.
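A sketch, building on the previous query:

SELECT Name, COUNT(Name)
FROM Album
JOIN Artist ON Album.ArtistId = Artist.ArtistId
GROUP BY Name
HAVING COUNT(Name) > 1;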
Name          | COUNT(Name)
AC/DC         | 2
Accept        | 2
Amy Winehouse | 2
…             | …
Audioslave    | 3
There are 56 rows in this table, and they were returned in 1 millisecond!
We’ve provided some specifications that help the computer respond to these queries quickly.
If we designate a column as a primary key, then this column uniquely identifies every row.
Generally, the primary key is an integer, since it is much easier to ensure that there are no
duplicates with integers than with characters, for example.
For example, the id column in the cities table is a primary key.
If we define a column to be a primary key in one table, in another table, a column with the same
value is called a foreign key.
For example, we had an Album table and an Artist table earlier. The Artist table
had an ArtistId column. Within the Artist table, that column is the primary key. In
the Album table, there was also an ArtistId column, and that column, in this context,
is a foreign key.
These two keys in conjunction help the database know how to link the two tables efficiently.
We can also specify that a column is unique. This doesn’t have to be the primary key; for
example, for the users table we created, we would like the usernames and emails to be
unique.
We can index a column, which will allow future queries on that column to be much quicker. By
indexing a column, we’re telling the database beforehand that we plan to search and sort this
column frequently. The database will then create a 2D tree structure from the data in that
column, thereby ensuring that it doesn’t take as many steps for searching or sorting in the
future.
We can also autoincrement a column. Instead of manually assigning Cambridge the id of 1 and
the next city the id of 2, the autoincrement function will, each time we add a row, increment
the previous value. This provides us with unique values that we can use as our primary key.
The not null keyword will ensure that a column must be given a value whenever a row is added, so
the rows in our database aren’t only sparsely filled with real data.
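Putting several of these features together, table definitions for the cities and users tables described earlier might look roughly like this (the exact columns are illustrative):

CREATE TABLE cities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,    -- unique ID, assigned automatically
    city TEXT NOT NULL,
    state TEXT NOT NULL,
    zip TEXT NOT NULL
);

CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL UNIQUE,                -- usernames must be unique
    email TEXT NOT NULL UNIQUE,                   -- emails must be unique
    city_id INTEGER,
    FOREIGN KEY (city_id) REFERENCES cities (id)  -- links each user to a city
);

CREATE INDEX zip_index ON cities (zip);           -- make future searches by zip faster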
SQL also provides functions like AVG, COUNT, MAX, MIN, and SUM, among many others. We can simply
pass data to these functions to return the value we would like.
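For instance, against the Album table from earlier, queries along these lines would use a couple of them:

SELECT COUNT(*) FROM Album;                  -- how many albums the table holds
SELECT MIN(Title), MAX(Title) FROM Album;    -- alphabetically first and last album titles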
Race Conditions
We’ll introduce race conditions with an example. Suppose Alice and Bob exist on this planet, and both
have just seen David’s latest Instagram post.
First, we’ll look at what Instagram can do with a “like.” When someone likes a post, Instagram can use
a SELECT statement to see how many likes this post currently has. Then, an UPDATE statement can
be executed to increment that number.
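In SQL terms, those two steps might look roughly like this (the posts table and likes column are hypothetical names, used only for illustration):

SELECT likes FROM posts WHERE id = 1;        -- hypothetical schema; read the current count (say, 2141)
UPDATE posts SET likes = 2142 WHERE id = 1;  -- write back the incremented count (2141 + 1)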
Now, suppose that Alice and Bob both like David’s Instagram post at the same time. Suppose David’s
latest post has 2141 likes.
Separately, the SELECT statements are processed. Since the likes were sent at the same time, both
SELECT statements return 2141, for the current number of likes. Then, both UPDATE statements will
change David’s post’s number of likes to 2141 + 1 = 2142, instead of 2143.
In this case, Alice and Bob’s inputs triggered a “race” to update data.
Now, let’s consider a different example. Suppose Alice and Bob are roommates, and Alice comes home,
opens the fridge and finds that there is no milk. She goes to the store to buy some milk. Bob comes
home while Alice is gone and finds that there is no milk. He, then, goes to the store and buys some
milk. Now, they have twice as much milk.
This problem occurred because both Alice and Bob inspected the state of some value and made
a decision on it before the other person was done acting on that information.
To fix this problem, Alice could write a note on the fridge, notifying Bob that she is gone for
milk. Or, more dramatically, Alice could simply lock the fridge, so Bob can’t even inspect that
state.
Databases provide atomicity, or the ability to do multiple things together as a unit, or not at all.
This means that the sequence of operations cannot be interrupted by someone else’s update.
SQL databases also support transactions, whereby one or more commands can be grouped together and
executed back to back, as a unit. All of a transaction’s commands have to go through before another
person’s commands are allowed to execute.
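A sketch of such a transaction, using the same hypothetical posts table, with the read-and-increment collapsed into a single statement so that no one else’s update can slip in between:

BEGIN TRANSACTION;
UPDATE posts SET likes = likes + 1 WHERE id = 1;   -- hypothetical schema; increment atomically
COMMIT;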
Scalability
If we have only one database, then we have a single point of failure. We could fix this by having two
databases, but note that if we store half of our data in one database and half in the other, we’ll
actually have two single points of failure. The second database will have to be a duplicate of the first
database to resolve this issue.
Note that the second database is a duplicate of the first, not a backup of the first. We don’t want
to waste time restoring from backup, as we’d like to maintain uptime. As a duplicate, whenever
we write data to one database, we write this data to the other as well.
If we have more and more data to store, we’ll require more than one primary database. Then, we’ll
have to duplicate this additional database as well. Eventually, we might hit a ceiling on vertical
scaling—there is only so much space on our servers. We might employ a different method of scaling.
Currently, our databases are doing both reading and writing. To read data from a database means to
retrieve data it has stored. To write data into a database means to save new or changed data into
it.
If we have two primary databases, we can connect additional databases to each of them as replicas.
However, instead of giving these replicas both read and write capabilities, we can give them read
capabilities only. Then, these databases exist solely to be read from again and again.
An example of when this might be useful is on social media. On Facebook, there are likely more
reads than there are writes. If the data for a business follows this pattern, where reads are way
more common than writes, we can use this model instead.