STAT 624 Computing Tools For Data Science: Module 1: Relational Databases
• Each column must have a defined data type (i.e. its
attribute/structure). Examples: strings, dates, integers, decimals.
• Orders can have many items, and dishes can be included in many
orders.
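A many-to-many relationship like this is commonly modeled with a third "junction" table. The sketch below assumes Python's built-in sqlite3 module; the table and column names (orders, dishes, order_items, quantity) and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all names and rows here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT NOT NULL);
CREATE TABLE dishes (dish_id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL);
-- Junction table: one row per (order, dish) pair.
CREATE TABLE order_items (
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    dish_id  INTEGER NOT NULL REFERENCES dishes(dish_id),
    quantity INTEGER NOT NULL,
    PRIMARY KEY (order_id, dish_id)
);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-05"), (2, "2024-01-06")])
conn.executemany("INSERT INTO dishes VALUES (?, ?, ?)",
                 [(1, "soup", 6.0), (2, "pasta", 12.0)])
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                 [(1, 1, 2), (1, 2, 1), (2, 2, 3)])
conn.commit()

# One order can contain many dishes...
dishes_in_order_1 = [row[0] for row in conn.execute(
    "SELECT d.name FROM order_items oi JOIN dishes d ON d.dish_id = oi.dish_id "
    "WHERE oi.order_id = 1 ORDER BY d.name")]

# ...and one dish can appear in many orders.
orders_with_pasta = [row[0] for row in conn.execute(
    "SELECT oi.order_id FROM order_items oi JOIN dishes d ON d.dish_id = oi.dish_id "
    "WHERE d.name = 'pasta' ORDER BY oi.order_id")]
```

The composite primary key on order_items keeps each (order, dish) pair unique, and the declared column types tie back to the requirement that every column has a defined type.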
• ACID:
• Atomic: transaction must happen completely or not at all.
Reason for failure is irrelevant.
• Consistent: transaction must leave DB in a consistent state (i.e.
there can be no violation of rules/structure).
• Isolated: concurrent transactions must not interfere with one
another. Typically a data element is locked for the duration of the
transaction, so only one transaction changes it at a time.
• Durable: transaction must be robust. A "success" must
guarantee the transaction happened correctly and will not be
lost due to service outages, crashes, or other causes.
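The atomicity property can be illustrated with a small sketch. It assumes Python's built-in sqlite3 module; the accounts table, the transfer helper, and the balances are hypothetical. The second update violates a CHECK rule, so the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rule for this example: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or neither."""
    try:
        # 'with conn' commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

failed = transfer(conn, "alice", "bob", 500)   # debit would break the CHECK rule
after_failure = balances(conn)                 # credit to bob was rolled back too
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

The credit is deliberately issued before the debit so that the failure occurs mid-transaction, which is the case atomicity is meant to cover.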
• Methods for modeling databases have been tested since the 1970s
(i.e. 45+ years of experience).
Planning your database
• What are you trying to accomplish?
• Be careful about simple answers.
• Entities: be granular.
• LastName, FirstName, Suffix, ZipCode, etc.
• Easier to get to data: no need to ‘extract’ pieces from a combined field.
First normal form (1NF)
• Remove repeating groups and create another table that satisfies 1NF
to hold the values.
• 1NF often extended to include idea that there aren’t duplicate rows in
a table.
• Also suggests order of rows and columns is not important to the data.
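As a rough illustration of removing a repeating group, the sketch below assumes Python's built-in sqlite3 module and a hypothetical customer list in which each row carries two phone columns. The phones move to their own table, one value per row.

```python
import sqlite3

# Hypothetical un-normalized rows: (id, name, phone1, phone2).
# The phone columns are a repeating group.
wide_rows = [
    (1, "Ada", "555-0100", "555-0101"),
    (2, "Grace", "555-0200", None),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One phone per row; the composite key also rules out duplicate rows.
CREATE TABLE customer_phones (
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    phone       TEXT NOT NULL,
    PRIMARY KEY (customer_id, phone)
);
""")

for customer_id, name, *phones in wide_rows:
    conn.execute("INSERT INTO customers VALUES (?, ?)", (customer_id, name))
    for phone in phones:
        if phone is not None:  # empty slots simply produce no row
            conn.execute("INSERT INTO customer_phones VALUES (?, ?)",
                         (customer_id, phone))
conn.commit()

phone_rows = conn.execute(
    "SELECT customer_id, phone FROM customer_phones "
    "ORDER BY customer_id, phone").fetchall()
```

A customer with a third phone number now needs only another row, not another column.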
Second normal form (2NF)
• 2NF: No value in a table should depend on only part of a key that
can be used to uniquely identify a row.
• Create new table reflecting fact that each event is held at just one
place. Now both tables have values dependent on full keys.
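The split described above might look like the following sketch, which assumes Python's built-in sqlite3 module. The flat table, its (event, attendee) key, and the sample rows are hypothetical; the point is only that venue depends on event alone, so it moves to a table keyed by event.

```python
import sqlite3

# Hypothetical flat rows keyed by (event, attendee).
# venue depends only on event, i.e. on part of the key.
flat = [
    ("Gala", "Ann", "Town Hall"),
    ("Gala", "Bob", "Town Hall"),
    ("Picnic", "Ann", "City Park"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event TEXT PRIMARY KEY, venue TEXT NOT NULL);
CREATE TABLE attendance (
    event    TEXT NOT NULL REFERENCES events(event),
    attendee TEXT NOT NULL,
    PRIMARY KEY (event, attendee)
);
""")

for event, attendee, venue in flat:
    # Each event's venue is stored once; repeats are ignored.
    conn.execute("INSERT OR IGNORE INTO events VALUES (?, ?)", (event, venue))
    conn.execute("INSERT INTO attendance VALUES (?, ?)", (event, attendee))
conn.commit()

events = conn.execute("SELECT event, venue FROM events ORDER BY event").fetchall()

# A join reconstructs the original flat rows when they are needed.
joined = conn.execute(
    "SELECT a.event, a.attendee, e.venue "
    "FROM attendance a JOIN events e ON e.event = a.event "
    "ORDER BY a.event, a.attendee").fetchall()
```

Changing a venue is now a single-row update in events instead of an edit to every attendance row for that event.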
Third normal form (3NF)
• 3NF: No non-key field is dependent on any other non-key field (i.e.
“Can I figure out one value in a row from another non-key value in that row?”)
• Risk: someone could edit the Lunch Price and not the Price, which
would violate the 50% discount rule.
• Solution: drop lunch prices from the table (they can be calculated) and
possibly create a separate table containing lunch prices.
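One way to apply the "it can be calculated" option is to store only the regular price and derive the lunch price on demand. The sketch below assumes Python's built-in sqlite3 module; the menu table, the menu_with_lunch view, and the sample dish are hypothetical, while the 50% factor comes from the discount rule above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Only the regular price is stored.
CREATE TABLE menu (dish TEXT PRIMARY KEY, price REAL NOT NULL);
-- The lunch price is derived from the 50% discount rule, never stored.
CREATE VIEW menu_with_lunch AS
    SELECT dish, price, price * 0.5 AS lunch_price FROM menu;
""")
conn.execute("INSERT INTO menu VALUES ('pasta', 12.0)")

# Editing the price cannot leave a stale lunch price behind.
conn.execute("UPDATE menu SET price = 14.0 WHERE dish = 'pasta'")
conn.commit()

row = conn.execute(
    "SELECT dish, price, lunch_price FROM menu_with_lunch").fetchone()
```

If lunch prices ever stop following a fixed rule, the separate lunch-price table mentioned above becomes the better fit.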
Denormalization
• Sometimes we choose not to normalize for convenience or
performance reasons.