STAT 624 Computing Tools for Data Science

Module 1: Relational Databases

Instructor: Scott A. Bruce

© 2022 Scott A. Bruce. Do not distribute.


This module contains material from the LinkedIn Learning course
Programming Foundations: Databases by Scott Simpson and is accessible via
https://linkedinlearning.tamu.edu/

What are databases?

• A database is an organized collection of data stored and accessed
electronically from a computer system.

• Can’t we already do this using spreadsheets, files, folders and such?

• Data problems: 1) size, 2) ease of updating, 3) accuracy, 4) security, 5)
redundancy, 6) importance.

• Database solutions: 1) scalable, 2) accessible, 3) accurate, 4) secure,
5) consistent, 6) permanent.

Databases give us structure.

Databases impose your rules on the data!


Database management systems

• A database is an organized collection of data stored and accessed
electronically from a computer system.

• This is different from a database management system, or DBMS,
which manages databases and guarantees your rules and structure
are applied.

• A single DBMS usually manages many different databases.

• Most common: relational DBMS.

• Other DBMSs typically assume knowledge of relational DBMSs and
use similar vocabulary.

Features of a relational database
• A database has one or more tables. Tables are the fundamental
building blocks of a relational database.

• All data are stored in tables and are often represented as a
“spreadsheet”.

• Each column must be defined as to the type of data it contains (i.e. the
attribute/structure). Examples: strings, date, integers, decimals.

• Each row is a 'record' or 'tuple' – a single data item.
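
The table, column, and row vocabulary above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the Student table, its columns, and the sample values are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory database; the Student table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Student (
        UIN        INTEGER,  -- whole number
        LastName   TEXT,     -- string
        EnrollDate TEXT,     -- date stored as ISO-8601 text
        GPA        REAL      -- decimal
    )
    """
)
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?)",
    [(1001, "Nguyen", "2022-08-24", 3.7), (1002, "Garcia", "2022-08-24", 3.9)],
)

# Each row comes back as a Python tuple: one record of the table.
rows = conn.execute("SELECT * FROM Student ORDER BY UIN").fetchall()
for row in rows:
    print(row)
```
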


Unique values and primary keys

• Every table will have a key. It is a way to identify a particular row in
a table:

• A key must have unique values for each record. No exceptions!

• Sometimes we have a natural key (e.g. UIN).
• Every table will have a primary key, and the DBMS enforces its uniqueness.
• Many DBMSs can generate a primary key (a synthetic key) for your tables
if you do not specify one.
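
A minimal sketch of the DBMS enforcing key uniqueness, assuming SQLite and a hypothetical Student table keyed on UIN. The second insert reuses an existing key value and is rejected by the database itself, not by application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UIN is used here as a natural primary key (hypothetical table).
conn.execute("CREATE TABLE Student (UIN INTEGER PRIMARY KEY, LastName TEXT)")
conn.execute("INSERT INTO Student VALUES (1001, 'Nguyen')")

try:
    # Same UIN again: violates the primary key's uniqueness rule.
    conn.execute("INSERT INTO Student VALUES (1001, 'Garcia')")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("Rejected by the DBMS:", exc)

student_count = conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0]
```
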
Defining table relationships
• Keys are used to define relationships between tables:

• This is a one-to-many relationship (most common).
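
As a rough sketch of a one-to-many relationship, assume a hypothetical Customer table and a CustomerOrder table whose CustomerID column refers back to Customer's primary key. One customer can then appear on many orders, while each order belongs to exactly one customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    -- The "many" side stores the key of the "one" side (a foreign key).
    CREATE TABLE CustomerOrder (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID),
        OrderDate  TEXT NOT NULL
    );
    INSERT INTO Customer VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO CustomerOrder VALUES
        (10, 1, '2022-01-05'), (11, 1, '2022-02-11'), (12, 2, '2022-02-12');
    """
)

# Follow the key from each order back to its customer and count the orders.
orders_per_customer = conn.execute(
    """
    SELECT c.Name, COUNT(o.OrderID)
    FROM Customer AS c
    JOIN CustomerOrder AS o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID, c.Name
    ORDER BY c.CustomerID
    """
).fetchall()
print(orders_per_customer)
```
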


Many-to-many relationships
• Most RDBMSs cannot create a direct many-to-many relationship.
See this example:

• Orders can have many items, and dishes can be included in many
orders.

• If multiple columns are used to characterize this relationship, 1) it is
not clear how many columns should be used and 2) most fields will
be blank.

• In such cases, we can create a linking (or junction) table:
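
A linking table for the orders-and-dishes case might look like the sketch below. The table and column names (CustomerOrder, Dish, OrderDish, Quantity) and the sample rows are assumptions made for illustration. Each row of the junction table pairs one order with one dish, so the many-to-many relationship becomes two one-to-many relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CustomerOrder (OrderID INTEGER PRIMARY KEY, OrderDate TEXT NOT NULL);
    CREATE TABLE Dish (DishID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    -- Junction table: one row per (order, dish) pairing.
    CREATE TABLE OrderDish (
        OrderID  INTEGER NOT NULL REFERENCES CustomerOrder (OrderID),
        DishID   INTEGER NOT NULL REFERENCES Dish (DishID),
        Quantity INTEGER NOT NULL,
        PRIMARY KEY (OrderID, DishID)
    );
    INSERT INTO CustomerOrder VALUES (1, '2022-03-01'), (2, '2022-03-02');
    INSERT INTO Dish VALUES (1, 'Soup'), (2, 'Salad'), (3, 'Pasta');
    INSERT INTO OrderDish VALUES (1, 1, 2), (1, 3, 1), (2, 3, 4);
    """
)

# An order can have many dishes...
dishes_in_order_1 = [name for (name,) in conn.execute(
    "SELECT d.Name FROM OrderDish AS od JOIN Dish AS d ON d.DishID = od.DishID "
    "WHERE od.OrderID = 1 ORDER BY d.Name"
)]
# ...and a dish can appear in many orders.
orders_with_pasta = [order_id for (order_id,) in conn.execute(
    "SELECT od.OrderID FROM OrderDish AS od JOIN Dish AS d ON d.DishID = od.DishID "
    "WHERE d.Name = 'Pasta' ORDER BY od.OrderID"
)]
```
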
One-to-one relationships
• Less common, since it usually implies the two tables should just be one
table.

• Some use cases exist (e.g. security):


Referential integrity constraints
• Databases that are aware of relationships won’t allow a user to
modify data in a way that violates those relationships.

• Helps maintain consistency and accuracy of database.
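
A small sketch of a referential integrity check, assuming SQLite (where foreign key enforcement has to be switched on per connection) and hypothetical Customer and CustomerOrder tables. Both attempted changes would leave an order pointing at a customer that does not exist, so the DBMS refuses them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(
    """
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE CustomerOrder (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID)
    );
    INSERT INTO Customer VALUES (1, 'Ana');
    INSERT INTO CustomerOrder VALUES (10, 1);
    """
)

def is_rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# An order for a customer that does not exist.
orphan_order_rejected = is_rejected("INSERT INTO CustomerOrder VALUES (11, 99)")
# Deleting a customer who still has orders (no cascade configured here).
delete_rejected = is_rejected("DELETE FROM Customer WHERE CustomerID = 1")
```
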


Referential integrity for deletions
• Example: with a cascading delete, deleting a customer automatically
deletes the corresponding orders for that customer.
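
A cascading delete could be sketched as follows, again assuming SQLite and hypothetical Customer and CustomerOrder tables. The cascade is not automatic: it is requested with ON DELETE CASCADE when the foreign key is declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to act on foreign keys
conn.executescript(
    """
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE CustomerOrder (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL
            REFERENCES Customer (CustomerID) ON DELETE CASCADE
    );
    INSERT INTO Customer VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO CustomerOrder VALUES (10, 1), (11, 1), (12, 2);
    """
)

# Deleting customer 1 also removes orders 10 and 11.
conn.execute("DELETE FROM Customer WHERE CustomerID = 1")
remaining_orders = conn.execute(
    "SELECT OrderID, CustomerID FROM CustomerOrder ORDER BY OrderID"
).fetchall()
print(remaining_orders)
```
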
Transactions and the ACID test
• Transactions group queries or statements into a block of activities
(e.g. sending money from account X to account Y).

• ACID:
• Atomic: the transaction must happen completely or not at all.
The reason for failure is irrelevant.
• Consistent: the transaction must leave the DB in a consistent state
(i.e. there can be no violation of rules/structure).
• Isolated: concurrent transactions must not interfere with one
another. This is commonly achieved by locking the data for the
transaction, so only one transaction works on a data element at a time.
• Durable: the transaction must be robust. A "success" must
guarantee the transaction happened correctly and will not be
lost due to service outages, crashes, or other causes.

• The DBMS enforces ACID. Not the programmer’s job!
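
The money-transfer example can be sketched as one transaction. The Account table, the CHECK rule against negative balances, and the transfer helper are hypothetical. The program only marks where the transaction starts and ends. The database undoes the partial work when the second transfer breaks the rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account ("
    "Name TEXT PRIMARY KEY, "
    "Balance INTEGER NOT NULL CHECK (Balance >= 0))"  # rule: no overdrafts
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("X", 100), ("Y", 50)])
conn.commit()

def transfer(source, target, amount):
    """Credit target and debit source as a single all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE Account SET Balance = Balance + ? WHERE Name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE Account SET Balance = Balance - ? WHERE Name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT Name, Balance FROM Account"))

first_ok = transfer("X", "Y", 30)     # allowed
after_first = balances()
second_ok = transfer("X", "Y", 500)   # would overdraw X, so nothing is applied
after_second = balances()
```

Note that the credit to Y in the failed transfer is rolled back too, which is the atomicity property in action.
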


Structured Query Language (SQL)

• SQL is a language that has been around since the 1970s.

• SQL is a declarative query language, not a procedural or imperative
language.
• You tell the DB what you want and let the DBMS worry about
how to do it.
• You do not specify the steps or algorithms needed to
accomplish a task (as you would in an imperative language).

• "I want all the books that cost more than $40":

SELECT * FROM Books WHERE ListPrice > 40;

• SQL -> CRUD (Create, Read, Update, Delete) data.
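
The four CRUD operations map onto INSERT, SELECT, UPDATE, and DELETE. The sketch below reuses the Books table and ListPrice column from the query above. The BookID and Title columns and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Books (BookID INTEGER PRIMARY KEY, Title TEXT, ListPrice REAL)"
)

# Create: add rows.
conn.executemany(
    "INSERT INTO Books (Title, ListPrice) VALUES (?, ?)",
    [("Book A", 35.0), ("Book B", 45.0), ("Book C", 60.0)],
)

# Read: "I want all the books that cost more than $40".
expensive = [title for (title,) in conn.execute(
    "SELECT Title FROM Books WHERE ListPrice > 40 ORDER BY Title"
)]

# Update: change a value in place.
conn.execute("UPDATE Books SET ListPrice = 39.0 WHERE Title = 'Book B'")

# Delete: remove rows.
conn.execute("DELETE FROM Books WHERE Title = 'Book C'")

remaining = conn.execute(
    "SELECT Title, ListPrice FROM Books ORDER BY Title"
).fetchall()
```
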


Introduction to database modeling
• Database modeling means creating the formal description of our
database (i.e. the schema): the tables, columns, keys, relationships, etc.

• Database modeling requires planning. Agile (i.e. iterative)
development is not well suited to database design.

• The point of database design is to impose structure on your data.
This requires thought and planning.

• Adding features is easier than modifying or changing fundamental
data structures and relationships.

• Methods for modeling databases have been tested since the 1970s
(i.e. 45+ years of experience).

Planning your database
• What are you trying to accomplish?
• Be careful about simple answers.

• What do you already have?
• Review the existing data and structure.
• Examine your existing databases.

• What are your entities?
• Texas A&M: students, faculty, courses, buildings, departments,
colleges, etc. Should the names be singular or plural?

• What are the relationships between the entities?

• ERM -> Entity Relationship Modeling


Entity Relationship Modeling Example

[Figure: entity relationship modeling example]

Identifying columns and data types

• Entities -> Tables; Attributes -> Columns

• Be granular when breaking entities into attributes.
• LastName, FirstName, Suffix, ZipCode, etc.
• Easier to get to the data: no need to ‘extract’ it.

• Avoid spaces in entity & attribute names.

• Specify the data type of each column.
• Character + length (ASCII, Unicode), date, integer (size),
decimal, binary, etc.
• Other characteristics: Allow null? Pattern match (e.g. email,
Social Security number, etc.)?
• Precise data types allow the DBMS to enforce structure and be
more efficient.
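
Column definitions with null rules and simple pattern checks might look like this sketch. The Person table, its constraints, and the try_insert helper are hypothetical. One caveat about the engine used here: SQLite treats declared types as loose "affinities" unless a table is created as STRICT, whereas the NOT NULL and CHECK constraints shown are enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Person (
        PersonID  INTEGER PRIMARY KEY,
        LastName  TEXT NOT NULL,
        FirstName TEXT NOT NULL,
        Suffix    TEXT,                                       -- null allowed
        ZipCode   TEXT NOT NULL CHECK (length(ZipCode) = 5),  -- exactly 5 characters
        Email     TEXT CHECK (Email LIKE '%_@_%')             -- crude pattern match
    )
    """
)

def try_insert(last, first, zip_code, email):
    """Return True if the DBMS accepts the row, False if a rule is violated."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Person (LastName, FirstName, ZipCode, Email) "
                "VALUES (?, ?, ?, ?)",
                (last, first, zip_code, email),
            )
        return True
    except sqlite3.IntegrityError:
        return False

valid_row = try_insert("Doe", "Jane", "77843", "jane@example.edu")
missing_first_name = try_insert("Doe", None, "77843", "jane@example.edu")
short_zip = try_insert("Doe", "Jane", "7784", "jane@example.edu")
bad_email = try_insert("Doe", "Jane", "77843", "not-an-email")
```
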
Example: DB2 data types

[Figure: table of DB2 data types]

Choosing primary keys
• Choose a primary key (PK) for each entity (i.e. table).
• If there is none, we must make one.
• The DBMS will have some mechanism to create a column as a
primary key (e.g. CustomerID). Generally this is an
incrementing number.

• Sometimes we can combine two or more columns and make them a
composite primary key.
• It is often more useful to generate a synthetic primary key.
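
Both options can be sketched in SQLite, with hypothetical Course and Enrollment tables. Course gets a synthetic, incrementing key generated by the DBMS. Enrollment uses a composite key, where only the combination of the three columns has to be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Synthetic key: the DBMS hands out incrementing numbers.
    CREATE TABLE Course (
        CourseID INTEGER PRIMARY KEY AUTOINCREMENT,
        Title    TEXT NOT NULL
    );
    -- Composite key: the three columns together identify a row.
    CREATE TABLE Enrollment (
        UIN      INTEGER NOT NULL,
        CourseID INTEGER NOT NULL,
        Semester TEXT NOT NULL,
        PRIMARY KEY (UIN, CourseID, Semester)
    );
    """
)

course_ids = [
    conn.execute("INSERT INTO Course (Title) VALUES (?)", (title,)).lastrowid
    for title in ("Databases", "Statistics")
]

def enroll(uin, course_id, semester):
    """Return True if the enrollment row is accepted by the DBMS."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Enrollment VALUES (?, ?, ?)", (uin, course_id, semester)
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = enroll(1001, course_ids[0], "Fall 2022")
repeat = enroll(1001, course_ids[0], "Fall 2022")      # same combination again
retake = enroll(1001, course_ids[0], "Spring 2023")    # differs in one column
```
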
Database normalization

• Database normalization is the process of organizing the columns
(attributes) and tables (relations) of a relational database to reduce
data redundancy and improve data integrity.

• Edgar F. Codd introduced three rules for organizing data in a
database (1NF, 2NF & 3NF) in the 1970s.

• Informally, a relational database table is often described as
normalized if it meets Third Normal Form.

• Most 3NF tables are free of insertion, update, and deletion
anomalies.

First normal form (1NF)
• 1NF: 1) Values in each cell should be atomic (i.e. only one value)
and 2) tables should have no repeating groups.

• These tables violate 1NF. What to do?


• Remove repeating groups and create another table that satisfies 1NF
to hold the values:

• 1NF is often extended to include the idea that there are no duplicate
rows in a table.

• It also implies that the order of rows and columns is not important to
the data.
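
One way to bring data into 1NF is sketched below with made-up data: a Phones cell that packs several values into one string is split so that each value gets its own row in a separate table. The Customer and CustomerPhone tables are hypothetical and are not the tables from the slides.

```python
import sqlite3

# Hypothetical non-1NF data: the third field holds several values at once.
unnormalized = [
    (1, "Ana", "555-0101, 555-0102"),
    (2, "Ben", "555-0199"),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    -- One atomic phone number per row; the key also rules out duplicate rows.
    CREATE TABLE CustomerPhone (
        CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID),
        Phone      TEXT NOT NULL,
        PRIMARY KEY (CustomerID, Phone)
    );
    """
)

for customer_id, name, phones in unnormalized:
    conn.execute("INSERT INTO Customer VALUES (?, ?)", (customer_id, name))
    for phone in phones.split(","):
        conn.execute(
            "INSERT INTO CustomerPhone VALUES (?, ?)", (customer_id, phone.strip())
        )

phone_rows = conn.execute(
    "SELECT CustomerID, Phone FROM CustomerPhone ORDER BY CustomerID, Phone"
).fetchall()
```
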
Second normal form (2NF)
• 2NF: No value in a table should depend on only part of a key that
can be used to uniquely identify a row.

• Only an issue when using composite primary keys.

• Location is not dependent on the full candidate key (it depends only on
the name). Changing an event name could leave the DB in an inconsistent
state, since there is no guarantee the location will also be changed.

• Create new table reflecting fact that each event is held at just one
place. Now both tables have values dependent on full keys.
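
The 2NF split can be sketched under an assumption about the original table, which is not reproduced here: suppose its composite key was (event name, date) and Location depended on the name alone. The Event and EventDate tables below are hypothetical. After the split, the location is stored once per event, so a single update is seen by every date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Location depends on the whole key of this table (EventName).
    CREATE TABLE Event (
        EventName TEXT PRIMARY KEY,
        Location  TEXT NOT NULL
    );
    -- The composite key now carries no partially dependent columns.
    CREATE TABLE EventDate (
        EventName TEXT NOT NULL REFERENCES Event (EventName),
        HeldOn    TEXT NOT NULL,
        PRIMARY KEY (EventName, HeldOn)
    );
    INSERT INTO Event VALUES ('Data Science Seminar', 'Room 101');
    INSERT INTO EventDate VALUES
        ('Data Science Seminar', '2022-09-01'),
        ('Data Science Seminar', '2022-09-08');
    """
)

# One update in one place...
conn.execute(
    "UPDATE Event SET Location = 'Room 202' WHERE EventName = 'Data Science Seminar'"
)
# ...is reflected for every date of the event.
schedule = conn.execute(
    """
    SELECT d.HeldOn, e.Location
    FROM EventDate AS d
    JOIN Event AS e ON e.EventName = d.EventName
    ORDER BY d.HeldOn
    """
).fetchall()
```
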
Third normal form (3NF)
• 3NF: No non-key field is dependent on any other non-key field (e.g.
“Can I figure out values of a row from any other value of that row?”)

• Table is in 1NF and 2NF, but violates 3NF (why?):


• Risk: someone could edit the Lunch Price and not the Price, which
would violate the 50% discount rule.

• Solution: drop lunch prices from the table (they can be calculated) and
possibly create a separate table containing lunch prices.
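
One way to "calculate" the lunch price instead of storing it is a view. The sketch below assumes the 50% discount rule means LunchPrice = 0.5 × Price. The Dish table, the view name, and the sample dish are hypothetical. Because the lunch price is derived on demand, it cannot drift out of sync with the regular price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dish (
        DishID INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL,
        Price  REAL NOT NULL
    );
    -- LunchPrice is computed from Price, never stored (assumed rule: 50% off).
    CREATE VIEW DishWithLunchPrice AS
        SELECT DishID, Name, Price, Price * 0.5 AS LunchPrice
        FROM Dish;
    INSERT INTO Dish VALUES (1, 'Pasta', 12.0);
    """
)

query = "SELECT Price, LunchPrice FROM DishWithLunchPrice WHERE DishID = 1"
before = conn.execute(query).fetchone()

# Editing Price is the only change needed; the lunch price follows.
conn.execute("UPDATE Dish SET Price = 14.0 WHERE DishID = 1")
after = conn.execute(query).fetchone()
```
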
Denormalization
• Sometimes we choose not to normalize for convenience or
performance reasons.

• Risk: someone could update the quantity in the orders table, leaving the
data inconsistent.

• Trade-off between speed and the risk of inconsistent or inaccurate data.


• Sometimes data appear denormalized but actually are not.

• ZIP code does NOT uniquely identify city and state.
