Lec2 Notes
Today we're going to talk about tabular data representations ("relations") and operations
over them ("relational algebra" + SQL)
Example:
bandfan.com
members
id
names
birthdays
addresses
emails
bands
id
name
genre
...
examples:
members are fans of bands
bands play in shows
...
employees in departments working on projects
musicians in bands signed with labels
students in classes in universities
cars made by manufacturers bought by customers
parents with children who attend school
patients of doctors in different hospitals
...
Member-band-fans
What's wrong with this representation?
Duplicate info - why is that bad?
Inconsistency
Wasted space
No ability to represent missing data
Add NULL?
Try 2:
Still redundant information
Try 3:
Eliminates redundancy
This is a general approach: for many to many relationships, create a relationship table
to eliminate redundancy
This generally works, but it can get complicated when you start adding complex restrictions;
for example, what if we wanted to allow each member to be a fan of just one band per
genre?
It's not possible to represent this in a single table without either duplicating information or
connecting several tables together
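A minimal sketch of the relationship-table layout from Try 3, using Python's built-in sqlite3 module. The table names, column names (member_band_fans, member_id, band_id) and sample rows are assumptions for illustration, not the schema from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bands   (id INTEGER PRIMARY KEY, name TEXT, genre TEXT);
    -- Relationship table: one row per (member, band) pair and nothing else.
    CREATE TABLE member_band_fans (
        member_id INTEGER REFERENCES members(id),
        band_id   INTEGER REFERENCES bands(id),
        PRIMARY KEY (member_id, band_id)
    );
    INSERT INTO members VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO bands   VALUES (10, 'Creed', 'rock'), (11, 'Justin Bieber', 'pop');
    INSERT INTO member_band_fans VALUES (1, 10), (1, 11), (2, 10);
""")

# Each band's genre is stored once, so a correction touches exactly one row,
# no matter how many fans the band has.
cur = conn.execute("UPDATE bands SET genre = 'post-grunge' WHERE name = 'Creed'")
rows_changed = cur.rowcount

# Every fan of the band sees the corrected genre through the join.
fans_and_genres = conn.execute("""
    SELECT m.name, b.genre
    FROM member_band_fans f
    JOIN members m ON m.id = f.member_id
    JOIN bands b   ON b.id = f.band_id
    WHERE b.name = 'Creed'
    ORDER BY m.name
""").fetchall()
```

The composite primary key also keeps the same (member, band) pair from being recorded twice.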
What about one to many relationships? Show slide -- can add a reference column to
the original table
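A sketch of the reference-column idea for a one-to-many relationship, again with sqlite3. It assumes, purely for illustration, that each show features exactly one band; the shows table, its band_id column and the venue names are made up here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces REFERENCES constraints when this pragma is turned on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE bands (id INTEGER PRIMARY KEY, name TEXT);
    -- The "many" side carries a reference column pointing at the "one" side.
    CREATE TABLE shows (
        id      INTEGER PRIMARY KEY,
        venue   TEXT,
        band_id INTEGER NOT NULL REFERENCES bands(id)
    );
    INSERT INTO bands VALUES (10, 'Creed');
    INSERT INTO shows VALUES (100, 'Garden', 10), (101, 'Paradise', 10);
""")

shows_for_creed = conn.execute(
    "SELECT venue FROM shows WHERE band_id = 10 ORDER BY venue"
).fetchall()

# A show that points at a band that does not exist is refused.
try:
    conn.execute("INSERT INTO shows VALUES (102, 'Nowhere', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

No separate relationship table is needed here, because each show row can hold its single band reference directly.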
How to devise a schema? Most common way is to write down the nature of the
relationships (one to many, many to one), as well as the attributes, and then the tables
that represent it. Sometimes people use what's called an entity relationship diagram.
Study break
We're going to study lots of different ways to manipulate tables -- and of course it's
possible to perform arbitrary transformations over them with programs.
Suppose we just want to focus on the problem of extracting a set of records of interest
from a collection of tables.
We need to find a way to extract columns and rows of interest, and a way to follow
paths from one table to another. A fancy name for this is a relational algebra.
Here, a relation is just a table with a schema, with unordered rows and no duplicates.
Algebra just refers to the fact that we have a set of operations over relations that is
closed, i.e., each operation on a relation (or pair of relations) produces another relation.
Main operations:
Example showing how join & select work -- find Creed shows
Notice that basic ops are all set oriented -- i.e., they produce another valid relation
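A small sketch of the closure property: selection, projection and join written as plain Python functions over sets of rows, so every result is again a duplicate-free, unordered relation. The function names, the column names and the sample bands/shows data are assumptions for illustration, not the lecture's slide example.

```python
def row(**cols):
    """A row is an immutable set of (column, value) pairs, so it can live in a set."""
    return frozenset(cols.items())

def select(rel, pred):
    """Keep the rows that satisfy pred."""
    return frozenset(r for r in rel if pred(dict(r)))

def project(rel, cols):
    """Keep only the named columns; duplicates collapse because the result is a set."""
    return frozenset(frozenset((c, dict(r)[c]) for c in cols) for r in rel)

def join(left, right, pred):
    """Pair up rows from both relations and keep the pairs that satisfy pred."""
    out = set()
    for a in left:
        for b in right:
            merged = {**dict(a), **dict(b)}
            if pred(merged):
                out.add(frozenset(merged.items()))
    return frozenset(out)

bands = frozenset({
    row(band_id=1, band_name="Creed"),
    row(band_id=2, band_name="Justin Bieber"),
})
shows = frozenset({
    row(show_id=100, show_band=1, venue="Garden"),
    row(show_id=101, show_band=2, venue="Paradise"),
    row(show_id=102, show_band=1, venue="Orpheum"),
})

# Find Creed shows: join bands to shows, select Creed, project the venue.
joined = join(bands, shows, lambda t: t["band_id"] == t["show_band"])
creed_shows = project(select(joined, lambda t: t["band_name"] == "Creed"), ["venue"])
venues = sorted(dict(r)["venue"] for r in creed_shows)
```

Because each function takes relations and returns a relation, the calls nest freely, which is what makes the algebra composable.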
Although we won't go into it much, one of the cool properties of these operations is that
they obey interesting algebraic identities that allow a system that executes relational
algebra expressions to choose the order in which it does work, for example:
sel reordering
Sel1(Sel2(A)) = Sel2(Sel1(A))
sel push down
Sel(A join B, pred) = Sel(A, pred) join B (when pred only uses columns of A; a pred that only uses columns of B can be pushed to B the same way)
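A quick check of both identities on toy data. This is a sanity check on one example, not a proof. The relations A and B, the join condition and the predicates are made up for illustration, and the push-down predicate looks only at columns that come from A.

```python
def sel(rel, pred):
    """Selection: keep the tuples that satisfy pred."""
    return {t for t in rel if pred(t)}

def join(a, b):
    """Join A(id, name) with B(a_id, venue) on A.id = B.a_id."""
    return {ra + rb for ra in a for rb in b if ra[0] == rb[0]}

A = {(1, "Creed"), (2, "Justin Bieber"), (3, "Creedence")}
B = {(1, "Garden"), (2, "Paradise"), (1, "Orpheum")}

# Selection reordering: the two filters can be applied in either order.
p1 = lambda t: t[1].startswith("Creed")
p2 = lambda t: t[0] < 3
reorder_holds = sel(sel(A, p2), p1) == sel(sel(A, p1), p2)

# Selection push-down: pred_on_a only reads positions 0 and 1, which come
# from A, so it can run before the join instead of after it.
pred_on_a = lambda t: t[1] == "Creed"
filter_late = sel(join(A, B), pred_on_a)
filter_early = join(sel(A, pred_on_a), B)
pushdown_holds = filter_late == filter_early

# Filtering first means fewer rows from A reach the join.
rows_into_join_late = len(A)
rows_into_join_early = len(sel(A, pred_on_a))
```

Both orders give the same answer; the point of push-down is that the early filter hands the join less work.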
Mbf = Member-band-fans
Note that SQL is "Declarative" - we say what we want, not how to achieve it
Even for a simple selection, the system may be:
1) Iterating over the rows
2) Keeping the table sorted by primary key and doing a binary search
3) Keeping the data in some kind of tree structure and doing a logarithmic search
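To make the options above concrete, here is a sketch of strategies 1 and 2 for the same lookup by id; the tree-based option is not shown. The sample rows and function names are assumptions for illustration.

```python
from bisect import bisect_left

rows = [(3, "Cara"), (1, "Ann"), (2, "Bob"), (5, "Eve")]  # (id, name), unordered

def lookup_scan(table, key):
    """Strategy 1: look at each row in turn until the key matches."""
    for r in table:
        if r[0] == key:
            return r
    return None

# Strategy 2 keeps the rows ordered by id so a binary search can be used.
sorted_rows = sorted(rows)
sorted_keys = [r[0] for r in sorted_rows]

def lookup_binary(key):
    """Strategy 2: binary search over the sorted keys."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_rows[i]
    return None

# Both strategies return the same answer for present and missing keys.
results = [(lookup_scan(rows, k), lookup_binary(k)) for k in (1, 2, 3, 4, 5)]
```

The caller asks the same question either way; only the amount of work done to answer it differs.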
Note that as a user of a SQL database, you don't need to know how the system is
evaluating the query, or even what the physical representation of the data is.
This can be both a blessing and a curse -- cool because as a user you don't have to
worry about it, but bad because it can make understanding bad performance hard.
SELECT fans.name
FROM bands
JOIN band_likes bl ON bl.bandid = bands.id
JOIN fans ON fans.id = bl.fanid
WHERE bands.name = 'Justin Bieber'
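A runnable version of the query above on a toy database, using sqlite3. The table definitions and sample rows are assumptions; only the query text comes from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fans  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bands (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE band_likes (fanid INTEGER, bandid INTEGER);
    INSERT INTO fans  VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cara');
    INSERT INTO bands VALUES (10, 'Creed'), (11, 'Justin Bieber');
    INSERT INTO band_likes VALUES (1, 11), (2, 10), (3, 11);
""")

query = """
    SELECT fans.name
    FROM bands
    JOIN band_likes bl ON bl.bandid = bands.id
    JOIN fans ON fans.id = bl.fanid
    WHERE bands.name = 'Justin Bieber'
"""

# The query says which rows we want; the join order and access method
# are left to the database.
bieber_fans = sorted(name for (name,) in conn.execute(query))
```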
Look at physical plan chosen
Note effect of creating an index on bands.name
For small bands table, has no effect
For larger table, will choose to use index
Depends on clustering
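A sketch of inspecting the plan before and after creating an index, using SQLite's EXPLAIN QUERY PLAN on a simplified single-table lookup by band name. SQLite, the index name idx_bands_name and the generated rows are assumptions here. The exact plan text varies between SQLite versions, and SQLite's planner may use the index even on a small table, unlike the behavior noted above for the system used in lecture.

```python
import sqlite3

def plan(conn, sql):
    """Return SQLite's plan for sql as one string (the detail is the last column)."""
    return "; ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bands (id INTEGER PRIMARY KEY, name TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO bands (name, genre) VALUES (?, ?)",
    [(f"band {i}", "rock") for i in range(1000)],
)

sql = "SELECT genre FROM bands WHERE name = 'band 42'"

plan_before = plan(conn, sql)   # no index on name yet, so the table is scanned
conn.execute("CREATE INDEX idx_bands_name ON bands(name)")
plan_after = plan(conn, sql)    # the plan now mentions idx_bands_name

print(plan_before)
print(plan_after)
```

The query text is identical in both calls; only the physical plan chosen for it changes.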