Databases Practical Notes
Null values; we can leave table cells empty with 'null', which is not part of the domain of the column. It is neither the number 0 nor an empty string; it is different from all values of any data type. Nulls can occur when a value does not exist, is not applicable, is not (yet) known, or when any value would do and the information simply does not matter. Without null values we would have to split a relation into many more specific relations and subclasses. Do not use fake values instead; different users would invent their own conventions (their own strings, just a dash, etcetera). SQL uses a three-valued logic → true, false, unknown. Any comparison with null returns unknown, so A = null is a different query from A IS NULL. Annoying, but that is how it works. SQL lets you control whether an attribute value may be null or not (NOT NULL constraint), which leads to simpler application programs and fewer surprises during query evaluation.
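For example (a sketch, assuming a Students table with a nullable email column):

  -- Returns no rows: '= null' is a comparison with null and always yields unknown.
  SELECT * FROM Students WHERE email = NULL;

  -- Correct: IS NULL explicitly tests for the null marker.
  SELECT * FROM Students WHERE email IS NULL;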
Integrity constraints;
The task is to model the relevant part of the world, but plain tables allow far too many meaningless states.
Integrity constraints → conditions that the values in certain cells have to satisfy, so that only states that are realistic in the real world are allowed.
Kinds: not null; key constraints (a value combination may only appear once); foreign key constraints → values in a column must also appear as key values in another table; check constraints → values must satisfy a given predicate. Constraints can protect against data input errors, enforce company standards and prevent inconsistency. Application programming and querying may also become easier with these restrictions.
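A sketch of how these constraints look in SQL (the table and columns are made up for illustration; foreign keys are shown further below):

  CREATE TABLE Students (
    sid   INTEGER NOT NULL,                      -- not null constraint
    name  VARCHAR(50) NOT NULL,
    year  INTEGER CHECK (year BETWEEN 1 AND 6),  -- check constraint: value must satisfy a predicate
    PRIMARY KEY (sid)                            -- key constraint: each sid appears only once
  );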
Keys;
Relational model: a key is a set of attributes that uniquely identifies the tuples in R (a relation). If two tuples agree on all attributes, they are the same tuple; they must disagree in at least one attribute to be different.
A key constraint refers to the combination of values within a row, not necessarily to each column on its own. If {A,B} is a key, two rows may agree in A or in B, but not in both. Every relation has a key. If key A is contained in key B, then B is the weaker constraint, because more states satisfy it. A key is minimal if no proper subset of it is a key (you can't drop any attribute and still have a valid key).
A relation may have more than one minimal key. The primary key cannot be null; all other keys are called alternate/secondary keys. The primary key is usually a single attribute that is never updated.
Keys are constraints; they refer to all possible database states, not just the current one. Constraining the right thing is important. If there is no good choice → add an extra column with a simple artificial key (surrogate key).
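As a sketch (assuming a Results table identified by student and exercise):

  CREATE TABLE Results (
    sid      INTEGER,
    exercise INTEGER,
    points   INTEGER,
    PRIMARY KEY (sid, exercise)   -- {sid, exercise} is the key: two rows may agree on sid
                                  -- or on exercise, but never on both
  );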
Foreign keys;
Basically references to information in other tables. The relational model does not provide explicit relationships, links or pointers; the key attributes of a tuple are used to reference it.
Matching column names often indicate columns that refer to each other. Foreign keys are not keys of their own table → they do not uniquely identify rows in the referencing table.
The foreign key constraint ensures (in the example) that for every tuple t in Results where t.sid is not null, there exists a tuple u in Students such that t.sid = u.sid.
Foreign keys may be null.
Deleting rows that are referenced → options: reject the deletion, cascade (the tuples that reference the deleted row are also deleted), or set the foreign key to null.
In schema diagrams, foreign keys are denoted with an arrow to the referenced table.
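A sketch of the reference plus the deletion options (Students/Results are the assumed example tables):

  CREATE TABLE Results (
    sid      INTEGER,              -- the foreign key; may be null here
    exercise INTEGER,
    points   INTEGER,
    FOREIGN KEY (sid) REFERENCES Students (sid)
      ON DELETE CASCADE            -- alternatives: NO ACTION (reject the delete) or SET NULL
  );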
Data modelling
Database design phases.
A formal model serves as a measure of correctness. Database design needs expertise, has to stay flexible, and the size of real schemas is enormous.
Phase 1: conceptual: think about what you want to store, the relationships and the constraints.
Phase 2: logical: transformation of the conceptual schema into the schema supported by the database system, e.g. the relational model.
Phase 3: physical: design indexes, table distribution, buffer sizes; maximize the performance of the final system.
IS-A Inheritance.
A subclass entity (e.g. Employee) inherits the attributes of the superclass it is connected to via an IS-A.
Lower-level entity sets are subgroups of the top-level entity set. They inherit all attributes and also the relationship sets. Hierarchies can be created both top-down (specialization) and bottom-up (generalization).
There can be membership constraints (value-based assignment); the default is user-defined, i.e. manual assignment to subclasses. Disjointness → an entity can belong to at most one subclass; otherwise overlapping is possible (the base assumption). Completeness → total specialization constraint: each superclass entity must belong to at least one subclass.
Aggregation
Example: the works-on relationship.
Sometimes a relationship itself needs to take part in another connection. We do not fix this by connecting to all participating entities; it is solved by treating the relationship set as an abstract entity (aggregation), which allows relationships between relationships.
Notation summary.
Entity set → rectangle; a weak entity set gets a double border.
An ellipse is an attribute, a double ellipse is a multi-valued attribute, and a dashed ellipse is an attribute derived from others.
A relationship set is a diamond; a double diamond is the identifying relationship set of a weak entity set. Just watch the video.
5. Advanced SQL
Self joins:
The same table can be queried more than once. A query might have to consider more than one tuple of the same relation → homework marks example. You get a self join by simply listing the same table again in the FROM clause under a different tuple variable (alias).
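A sketch of a self join on the assumed Results(sid, exercise, points) table: students who scored better on exercise 1 than on exercise 2.

  SELECT r1.sid
  FROM   Results r1, Results r2    -- two tuple variables over the same table
  WHERE  r1.sid = r2.sid           -- same student: the equality condition that is easy to forget
    AND  r1.exercise = 1
    AND  r2.exercise = 2
    AND  r1.points > r2.points;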
Duplicate elimination.
Duplicates have to be eliminated explicitly in SQL. The DISTINCT modifier can be applied in the SELECT clause to request explicit duplicate row elimination.
Superfluous DISTINCT → the result rows are already uniquely determined, so nothing is eliminated.
Algorithm to check this:
1. let K be the set of attributes in the select clause
2. if A = c appears in the where clause and c is a constant, add A to K
3. if A = B appears in the where clause and B is in K, add A to K
4. if K contains a key of a tuple variable X, add all attributes of X to K
5. repeat until stable.
If K contains a key of every tuple variable listed under FROM, then DISTINCT is superfluous.
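A sketch of the rule, assuming sid is the key of Students and (sid, exercise) the key of Results:

  SELECT DISTINCT s.sid, r.exercise
  FROM   Students s, Results r
  WHERE  s.sid = r.sid;
  -- K starts as {s.sid, r.exercise}; the condition s.sid = r.sid adds r.sid.
  -- K now contains a key of s and a key of r, so the rows are unique and DISTINCT is superfluous.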
Common mistakes: missing join conditions, unnecessary joins, self joins with missing equality conditions, unexpected duplicates, unnecessary DISTINCT. Some of these only make the query slow, not necessarily wrong.
Inner and outer joins; left and right outer joins are the two one-sided variants.
NATURAL → the join predicate is an equality comparison of all columns with the same name.
USING(A1, ..., An): the listed columns must appear in both tables; the join predicate is equality on them.
Join predicates determine when rows match, i.e. for which combinations they are true.
INNER [JOIN]: the base form of join; the cartesian product restricted to the row combinations that satisfy the join predicate.
LEFT [OUTER] JOIN: preserves all rows of the left table; the right table's attributes become null where there is no match.
RIGHT [OUTER] JOIN: the same thing, but with the roles reversed.
FULL [OUTER] JOIN: preserves the rows of both tables.
CROSS JOIN: cartesian product (all combinations of the rows of the two tables).
An inner join eliminates the tuples without a join partner.
A left outer join preserves all tuples of its left argument, filling the attributes of the second table with null. The cartesian product is simply all possible combinations.
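A sketch of the different join forms on the assumed Students/Results tables:

  SELECT * FROM Students s JOIN Results r ON s.sid = r.sid;       -- inner join, explicit predicate
  SELECT * FROM Students JOIN Results USING (sid);                -- equality on the listed shared column
  SELECT * FROM Students NATURAL JOIN Results;                    -- equality on all equally named columns
  SELECT * FROM Students s LEFT JOIN Results r ON s.sid = r.sid;  -- keeps students without results,
                                                                  -- with r's attributes set to null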
COUNT is an aggregation in SQL that counts how many rows (or attribute values) occur in the result.
Rows that are preserved by an outer join stay in the final table (filled with nulls) even when they do not satisfy further conditions, which can cause confusion. To restrict the non-preserved table, filter it before or within the join (in a subquery or in the ON condition); filtering it in the WHERE clause after the join does not work, because that would also throw away the preserved rows, so it has to be done beforehand.
Joining with ON / USING / NATURAL requires slightly different syntax for the same idea.
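A sketch of the pitfall with the assumed tables: the exercise-1 results per student, keeping students who did not submit anything.

  -- Keeps every student; only exercise 1 results are joined in:
  SELECT s.sid, r.points
  FROM   Students s LEFT JOIN Results r
         ON r.sid = s.sid AND r.exercise = 1;   -- filter the right table inside the join

  -- Writing "WHERE r.exercise = 1" instead would drop the rows where r is all null,
  -- i.e. silently turn the left join back into an inner join.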
Part 2
Non-monotonic queries.
An example would be finding the students who have not submitted any homework. If you then add a new row (a submission), the answer can shrink → fewer rows in the result.
With the constructs so far we cannot formulate such non-monotonic behaviour; we need "negated existential quantification". It boils down to → the ability to test whether a query yields a non-empty result (and to negate that test).
NOT IN
Checks that an attribute value does not appear in the result of the subquery. The subquery is evaluated before the main query.
NOT EXISTS
True if the result of the subquery is empty. The outer query and the subquery are correlated, i.e. the subquery is parameterized by the outer tuple; otherwise it could simply be replaced by true or false.
Non-correlated subqueries under NOT EXISTS are almost always an indication of an error.
EXISTS without negation is true if the subquery result is not empty.
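A sketch of both forms for "students who have not submitted anything" (assumed tables again):

  -- NOT IN: non-correlated subquery, evaluated once before the outer query.
  -- (The IS NOT NULL guard avoids NOT IN's problem with nulls in the subquery result.)
  SELECT s.sid
  FROM   Students s
  WHERE  s.sid NOT IN (SELECT r.sid FROM Results r WHERE r.sid IS NOT NULL);

  -- NOT EXISTS: correlated subquery, parameterized with s.
  SELECT s.sid
  FROM   Students s
  WHERE  NOT EXISTS (SELECT * FROM Results r WHERE r.sid = s.sid);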
For all
In logic there are existential and universal quantifiers: "there exists" and "for all". SQL only has EXISTS, but a restricted form of "for all" is available via double negation:
∀X (P) ≡ ¬∃X (¬P)
"all cars are red" == "there exists no car that is not red".
Implication does not exist in SQL either; ∀X (α → β)
becomes ¬∃X (α ∧ ¬β).
Translating natural language to SQL can be difficult.
"For all" and the implication form can sometimes be interchanged; they are logically equivalent in the example at 7:27.
The translation from there is then almost trivial.
Nested Subqueries.
If we want to find everyone who has solved all assignments → we need the "for all" pattern: a negated subquery nested inside the outer query. Such subqueries can be nested repeatedly, giving loops inside of loops (see the sketch below).
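A sketch for "students who solved all assignments" using the double negation pattern (Assignments is an assumed table listing every exercise):

  -- There is no assignment for which this student has no result.
  SELECT s.sid
  FROM   Students s
  WHERE  NOT EXISTS (
           SELECT * FROM Assignments a
           WHERE NOT EXISTS (
                   SELECT * FROM Results r
                   WHERE  r.sid = s.sid AND r.exercise = a.exercise));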
Aggregation functions.
We go from a set or multiset of values to a single value. The input is the set of values in an entire column; the output is a single value, depending on which aggregation we apply.
COUNT(*): counts all rows of a result.
Some aggregations are sensitive to duplicates (COUNT, SUM, AVG), while others are insensitive (MIN, MAX).
Simple aggregations feed the value set of an entire column into an aggregation function; without grouping, the whole table acts as a single group. If we do not want to count a value in a column twice, add DISTINCT inside the function. AVG calculates the average of the values of the attribute mentioned in its brackets. Simple aggregations may not be nested; aggregating a single value makes no sense anyway. Aggregations cannot be used in a WHERE clause, because that is a condition on a single row, not on a whole column. If an aggregation function is used without GROUP BY, no plain attributes may appear in the SELECT clause. Null values are filtered out before the aggregation is applied, the exception being COUNT(*), which counts rows. If the input set is empty (or all input values are null), the result is null; the COUNT of an empty input, however, is zero. Note the difference between null and zero here!! It is a bit confusing, pay attention.
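A sketch of the null/zero behaviour on the assumed Results table (points may be null for ungraded submissions):

  SELECT COUNT(*)            AS all_rows,     -- counts rows, nulls included
         COUNT(points)       AS graded,       -- null values are filtered out first
         COUNT(DISTINCT sid) AS students,     -- duplicates removed before counting
         AVG(points)         AS avg_points    -- null if there are no non-null points at all
  FROM   Results
  WHERE  exercise = 1;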
Aggregation with GROUP BY and HAVING
GROUP BY partitions the rows of a table into disjoint groups, based on value equality on the group-by attributes. Aggregation functions are then applied to each group separately. The groups are formed after evaluation of the FROM and WHERE clauses. GROUP BY never produces empty groups. In the SELECT clause only group-by attributes and aggregation results make sense, so be specific about what you group on to avoid issues. Aggregations may not be used in the WHERE clause, but in the HAVING clause aggregation functions (and group-by attributes) are allowed. A HAVING clause can drop entire groups from the query result.
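A sketch with the assumed tables: total points per student, keeping only students with at least two submissions.

  SELECT r.sid, SUM(r.points) AS total
  FROM   Results r
  GROUP BY r.sid            -- one group per student
  HAVING COUNT(*) >= 2;     -- drops entire groups; aggregations are allowed here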
Introduction
Reasoning about good or bad database design.
Functional dependencies (FDs) → a generalization of keys, the central notion of relational database design theory. They define when a relation is in normal form.
Violation of a normal form → a sign of bad database design; data is often stored redundantly.
3NF and BCNF are the most commonly used normal forms; 3NF is the one required in practice.
BCNF requires that the left-hand side of every non-trivial FD is a key.
Normalization algorithms → construct good relation schemas; the derived tables are automatically in BCNF.
First normal form → all table entries are atomic (no lists, sets, records, nested relations).
The further normal forms build on this one.
Functional dependencies
Whenever two rows agree on certain attributes, they must also agree on certain other attributes.
A functional dependency A1...An → B1...Bm holds in a relation if and only if any two tuples that agree on A1...An also agree on B1...Bm. FDs are similar to partial keys: they uniquely determine some attributes, but not all of them in general.
A,B → C,D implies A,B → C and A,B → D, but not A → C,D or B → C,D.
Keys vs. functional dependencies: keys really are functional dependencies; a key uniquely determines all attributes of its relation.
Functional dependencies are partial keys, since we restrict the determined attributes to the ones listed on the right-hand side. We want to turn FDs into keys, because the database can enforce keys directly. Pay attention to which attribute actually uniquely determines other attributes; this can also be a combination of different attributes.
Determining keys
Finding a minimal key is done via attribute closures (covers). You can get different minimal keys depending on the algorithm/order. Essentially, remove attributes as long as what you remove is still implied by what remains.
To find all minimal keys → start from the attributes that never occur on a right-hand side (they must be in every key). If they do not determine everything, extend them with further attributes, test all the candidates, and keep those that determine everything with nothing superfluous. All such covering sets that are minimal together make up all minimal keys. (A small worked example follows below.)
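A small worked example under assumed FDs: R(A,B,C) with A → B and B → C.
A never occurs on a right-hand side, so A must be part of every key. The closure of {A} is {A,B,C} (A → B gives B, B → C gives C), i.e. all attributes, so {A} is a key; it is minimal and it is the only minimal key. Equivalently, start from {A,B,C} and remove implied attributes: drop C (implied by B), then drop B (implied by A); {A} remains.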
Determinants
A set of attributes is a determinant of another set if the functional dependency between them holds, the left-hand side is minimal, and the dependency is not trivial (the right-hand side is not a subset of the left). Minimal means: if you drop any attribute from the left-hand side, the right-hand side is no longer determined. Basically work through all the options as for keys and see what determines what, except that a determinant does not have to determine everything.
3NF
3NF is slightly more general (weaker) than BCNF. A key attribute is an attribute that appears in some minimal key; this notion is needed here.
If a relation is in BCNF, it is also in 3NF. 3NF adds one extra option to the BCNF condition: an FD is also allowed if "B is a key attribute of R", i.e. the right-hand side is an attribute of a minimal key. Converting real-life examples to letters can be complicated; make sure to think about combinations of attributes forming a key.
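A classic illustration (an assumed example, not from the notes): Addresses(Street, City, Zip) with {Street, City} → Zip and Zip → City. The minimal keys are {Street, City} and {Street, Zip}. Zip → City violates BCNF because Zip is not a key, but City is a key attribute (it appears in the minimal key {Street, City}), so the relation is still in 3NF.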
Splitting relations
Splitting relations: make sure the split is lossless, i.e. the original relation can be reconstructed with a join. Why split? Good question.
A split is lossless if the set of shared attributes is a key of at least one of the two parts. We can always transform a relation into BCNF by lossless splitting (split on a violating FD: its left-hand side plus its right-hand side form one relation, its left-hand side plus the remaining attributes the other). Splitting sometimes makes it possible to store data that applies to some of the columns but not to the others. Think about the usability of the result; it can also waste storage space.
Preservation of FDs is handy and nice, but it is not guaranteed.
Transformation to BCNF
1. compute a canonical set of FDs
2. maximise the right-hand sides of the FDs
3. split off violating FDs one by one. Just practice a lot; the theoretical rules are confusing.
Example: A → D, B → C, B → D and D → E,
with {A,B} as the minimal key; the relation is not in BCNF.
Write down everything that occurs on a left-hand side and maximise what it determines. Then split off relations so that the left-hand side of each violating FD becomes the key of the relation that is split off.
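One possible run for the example above (the exact result depends on the order in which violating FDs are split off):
Maximised right-hand sides: A → D,E; B → C,D,E; D → E.
Split off A → D,E from R(A,B,C,D,E): gives R1(A,D,E) and R2(A,B,C).
In R1, D → E still violates BCNF (D is not a key of R1): split into (D,E) and (A,D).
In R2, B → C violates BCNF: split into (B,C) and (A,B).
Result: (D,E), (A,D), (B,C), (A,B), all in BCNF; each split was on a violating FD whose left-hand side is the key of the split-off part, so it is lossless. Note that B → D can no longer be checked within a single table, so FD preservation is indeed not guaranteed.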
Denormalization.
The process of adding redundant columns in order to improve performance; this avoids otherwise required joins. Insertion and deletion anomalies are avoided, although there will be update anomalies. It can also be used to create entirely new, separate redundant tables (or aggregated summary tables).
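A sketch of the idea with the assumed tables: copy the student's name into Results so the join with Students can be skipped, at the price of keeping the copy consistent.

  ALTER TABLE Results ADD COLUMN student_name VARCHAR(50);  -- redundant copy of Students.name

  UPDATE Results r
  SET    student_name = (SELECT s.name FROM Students s WHERE s.sid = r.sid);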
Concurrency anomalies.
What happens when multiple users access the database at the same time?
What happens if a transaction fails before finishing? Atomicity is needed: either all of it finishes or nothing at all. You can also end up with an inconsistent database state if operations run at the same time or interleave with other processes. Problems during a transaction can cause trouble for others, even if all the actions done before are undone.
The anomalies are called lost update, inconsistent read, dirty read and unrepeatable read.
ACID properties: atomicity → fully or not at all; consistency → the database moves from one consistent state to another; isolation → transactions modify data without seeing each other's actions; durability → once committed, changes are persistent regardless of crashes.
Transactions
A transaction is a list of actions that ends with either a commit or an abort.
The scheduler is responsible for the execution order of concurrent database accesses.
The order in which two actions of the same transaction appear in a schedule must be the same as in the original transaction. A schedule is serial if the transactions are not interleaved: one runs after another. Serializable: the effect on the database is the same as that of some serial schedule. If a schedule is not serializable, we may get the anomalies mentioned before. Two actions conflict if they are from different transactions, involve the same data item, and at least one of them is a write. WR/RW/WW conflicts are what can make a schedule non-serializable.
WR → dirty read (t2 reads what t1 wrote before t1 commits)
RW conflict → unrepeatable read (t1 reads y, then t2 writes y)
WW → overwriting uncommitted data (t1 writes y, then t2 writes y)
We can swap adjacent actions without changing the overall effect if the actions are non-conflicting.
Conflict equivalent → reachable by swaps of non-conflicting adjacent actions.
Conflict serializable → conflict equivalent to some serial schedule.
Precedence graph → one node per transaction, and an edge from t1 to t2 if there is a conflicting pair of actions between them in which t1's action occurs first. A conflict is simply a pair of actions you cannot swap without changing the behaviour. If there are no cycles, an equivalent serial schedule can be constructed by topological sort. Blind writes → don't matter in the end.
So: the schedule is conflict serializable if there is no cycle; the required serial order is then clear, and the schedule behaves as if executed in that order.
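A small made-up example:
S: r1(A) w1(A) r2(A) w2(A) r2(B) w2(B) r1(B) w1(B)
w1(A) before r2(A)/w2(A) gives an edge T1 → T2; w2(B) before r1(B)/w1(B) gives an edge T2 → T1. The precedence graph has the cycle T1 → T2 → T1, so this schedule is not conflict serializable.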
Two phase locking
Two-phase locking is a concurrency-control check that ensures serializability at runtime.
This is hard because you do not know in advance which transactions will arrive. You need a strategy: pessimistic / optimistic / multi-version. Transactions must lock objects before using them.
Shared lock / exclusive lock → required before reading / before writing. Only one transaction can hold an exclusive lock on an object. A transaction can hold both kinds of lock on the same object at the same time, so when unlocking you write US(A) or UX(A) to release them individually. A transaction cannot acquire new locks once it has released any lock (first a growing phase, then a shrinking phase). This is pretty easy to check: if an S or X request appears after any U of the same transaction, the protocol is violated. The plain 2-phase lock protocol does not work in the wife money example → solved with an x-lock.
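A tiny made-up illustration of the growing/shrinking check, in the notes' lock notation:
T1: S1(A) r1(A) X1(B) w1(B) US1(A) UX1(B)   (all locks acquired before the first unlock: obeys 2PL)
T2: S2(A) r2(A) US2(A) X2(B) w2(B) UX2(B)   (X2(B) is requested after an unlock: violates 2PL)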
Cascading rollbacks.
Problems caused by aborting transactions: a transaction cannot commit if it has read something written by a transaction that aborts; it needs to abort as well (otherwise we get dirty reads). Such aborts can cascade across multiple transactions. A schedule is recoverable if, whenever t2 reads from t1, the commit of t2 comes after the commit of t1. Cascadeless schedule: additionally delay every read until the writing transaction t1 has committed, which avoids all dirty reads and therefore all cascading rollbacks. Recoverable is a less drastic requirement than cascadeless.
Granularity of locking.
At what level are we locking? It is a trade-off between concurrency and overhead. If we lock the entire database, we get low concurrency but very low overhead; if we only lock rows, we get high concurrency but also high overhead. Multi-granularity locking is better: the granularity can be chosen per transaction. A row lock is used for transactions that only edit a single row; a table lock for transactions that select all rows of a table or a large part of it. Intention locks → intention shared and intention exclusive are separate lock modes. Before a granule g can be locked in some mode, the transaction first has to obtain the corresponding intention lock on all coarser levels of granularity (e.g. the table and the database), and only then the actual lock. The intention locks thus announce at the bigger surrounding levels what is locked further down.
Isolation levels.
Some degree of inconsistency is acceptable in exchange for increased concurrency and performance. The levels are:
- read uncommitted
- read committed
- repeatable read
- serializable
Each level makes certain anomalies possible: dirty reads, non-repeatable reads, phantom rows.
Which isolation levels are supported depends on the database management system being used.
Phantom rows: using multi-granularity locking solves the phantom row problem.
SQL offers some statements to control this: set autocommit on/off, START TRANSACTION, COMMIT, ROLLBACK, SET TRANSACTION ISOLATION LEVEL.
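A sketch of the statements (exact syntax varies per DBMS), using the assumed Results table:

  SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  START TRANSACTION;
  SELECT points FROM Results WHERE sid = 1 AND exercise = 1;
  UPDATE Results SET points = points + 1 WHERE sid = 1 AND exercise = 1;
  COMMIT;   -- or ROLLBACK to undo the whole transaction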