01-relationalmodel

The document outlines the course logistics and structure for the Intro to Database Systems class, including policies, communication channels, and assignment submission methods. It introduces key concepts such as the relational model, database management systems, and data integrity, emphasizing the importance of structured data and relationships between entities. The lecture also covers various data models and their applications in database systems, highlighting the evolution of DBMS and the significance of constraints and data manipulation languages.

Intro to Database Systems (15-445/645)

Lecture #01: Relational Model & Algebra
Fall 2023, Prof. Andy Pavlo, Prof. Jignesh Patel
15-445/645 (Fall 2023)

COURSE LOGISTICS

Course Policies + Schedule: Course Web Page
Discussion + Announcements: Piazza
Homeworks + Projects: Gradescope
Final Grades: Canvas

Non-CMU students can complete all assignments using Gradescope (Code: KK5DVJ).
→ Do not post your solutions on GitHub.
→ Do not email instructors / TAs for help.
→ Discord Channel: https://discord.gg/YF7dMCg
→ Somebody needs to finish Andy's Wikipedia article.

LECTURE RULES

Do interrupt us for the following reasons:
→ We are speaking too fast.
→ You don't understand what we are talking about.
→ You have a database-related question.

Do not interrupt us for the following reasons:
→ Whether you can use the bathroom.
→ Questions about blockchains.

We will not answer questions about the lecture immediately after class.

TODAY'S AGENDA

Database Systems Background
Relational Model
Relational Algebra
Alternative Data Models

Databases

DATABASE

Organized collection of inter-related data that models some aspect of the real world.

Databases are the core component of most computer applications.

DATABASE EXAMPLE

Create a database that models a digital music store to keep track of artists and albums.

Things we need for our store:
→ Information about Artists
→ What Albums those Artists released

FLAT FILE STRAWMAN

Store our database as comma-separated value (CSV) files that we manage ourselves in our application code.
→ Use a separate file per entity.
→ The application must parse the files each time it wants to read/update records.

Artist(name, year, country)
"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"

Album(name, artist, year)
"Enter the Wu-Tang","Wu-Tang Clan",1993
"St.Ides Mix Tape","Wu-Tang Clan",1994
"Liquid Swords","GZA",1990

FLAT FILE STRAWMAN

Example: Get the year that GZA went solo.

Artist(name, year, country)
"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"

for line in file.readlines():
    record = parse(line)
    if record[0] == "GZA":
        print(int(record[1]))
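
A minimal runnable sketch of the flat-file scan, assuming Python's standard csv module stands in for the slide's parse() and an in-memory string stands in for the Artist file. The names ARTIST_CSV and find_year are hypothetical.

```python
import csv
import io

# Hypothetical in-memory stand-in for the Artist CSV file from the slide.
ARTIST_CSV = '''"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"
'''


def find_year(csv_text, artist_name):
    """Scan every record and return the year of the first matching artist."""
    for record in csv.reader(io.StringIO(csv_text)):
        if record and record[0] == artist_name:
            return int(record[1])
    return None  # no artist with that name


print(find_year(ARTIST_CSV, "GZA"))  # 1990
```

Every lookup reads the file from the top, and nothing stops a record from holding a non-numeric year; the questions on the following slides are about exactly these gaps.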


FLAT FILES: DATA INTEGRITY

How do we ensure that the artist is the same for each album entry?

What if somebody overwrites the album year with an invalid string?

What if there are multiple artists on an album?

What happens if we delete an artist that has albums?

FLAT FILES: IMPLEMENTATION

How do you find a particular record?

What if we now want to create a new application that uses the same database? What if that application is running on a different machine?

What if two threads try to write to the same file at the same time?

FLAT FILES: DURABILITY

What if the machine crashes while our program is updating a record?

What if we want to replicate the database on multiple machines for high availability?

DATABASE MANAGEMENT SYSTEM

A database management system (DBMS) is software that allows applications to store and analyze information in a database.

A general-purpose DBMS supports the definition, creation, querying, update, and administration of databases in accordance with some data model.

DATA MODELS

A data model is a collection of concepts for describing the data in a database.

A schema is a description of a particular collection of data, using a given data model.

DATA MODELS

Relational ← Most DBMSs / This Course
Key/Value ← NoSQL
Graph ← NoSQL
Document / XML / Object ← NoSQL
Wide-Column / Column-family ← NoSQL
Array / Matrix / Vectors ← Machine Learning
Hierarchical ← Obsolete / Legacy / Rare
Network ← Obsolete / Legacy / Rare
Multi-Value ← Obsolete / Legacy / Rare

EARLY DBMSs

Early database applications were difficult to build and maintain on available DBMSs in the 1960s.
→ Examples: IDS, IMS, CODASYL
→ Computers were expensive, humans were cheap.

Tight coupling between logical and physical layers.

Programmers had to (roughly) know what queries the application would execute before they could deploy the database.

EARLY DBMSs

Ted Codd was a mathematician at IBM Research in the late 1960s.

Codd saw IBM's developers rewriting database programs every time the database's schema or layout changed.

Devised the relational model in 1969.

Edgar F. Codd

RELATIONAL MODEL

The relational model defines a database abstraction based on relations to avoid maintenance overhead.

Key tenets:
→ Store database in simple data structures (relations).
→ Physical storage left up to the DBMS implementation.
→ Access data through a high-level language; the DBMS figures out the best execution strategy.

RELATIONAL MODEL

Structure: The definition of the database's relations and their contents.

Integrity: Ensure the database's contents satisfy constraints.

Manipulation: Programming interface for accessing and modifying a database's contents.

RELATIONAL MODEL

A relation is an unordered set that contains the relationship of attributes that represent entities.

A tuple is a set of attribute values (also known as its domain) in the relation.
→ Values are (normally) atomic/scalar.
→ The special value NULL is a member of every domain (if allowed).

n-ary Relation = Table with n columns

Artist(name, year, country)
name           year  country
Wu-Tang Clan   1992  USA
Notorious BIG  1992  USA
GZA            1990  USA

RELATIONAL MODEL: PRIMARY KEYS

A relation's primary key uniquely identifies a single tuple.

Some DBMSs automatically create an internal primary key if a table does not define one.

A DBMS can auto-generate unique primary keys via an identity column:
→ IDENTITY (SQL Standard?)
→ SEQUENCE (PostgreSQL / Oracle)
→ AUTO_INCREMENT (MySQL)

Artist(name, year, country)
name           year  country
Wu-Tang Clan   1992  USA
Notorious BIG  1992  USA
GZA            1990  USA

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA
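
As a rough illustration of auto-generated keys, the sketch below uses SQLite through Python's sqlite3 module. SQLite is not one of the systems listed above; it is assumed here only because it ships with Python, and its INTEGER PRIMARY KEY column is assigned automatically when the insert omits it. The table follows the Artist example, but the generated ids start at 1 rather than 101.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An INTEGER PRIMARY KEY column in SQLite is filled in automatically
# when an INSERT does not supply a value for it.
conn.execute(
    "CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, country TEXT)"
)
rows = [
    ("Wu-Tang Clan", 1992, "USA"),
    ("Notorious BIG", 1992, "USA"),
    ("GZA", 1990, "USA"),
]
conn.executemany("INSERT INTO artist (name, year, country) VALUES (?, ?, ?)", rows)

artists = conn.execute("SELECT id, name FROM artist ORDER BY id").fetchall()
print(artists)  # [(1, 'Wu-Tang Clan'), (2, 'Notorious BIG'), (3, 'GZA')]
```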

RELATIONAL MODEL: FOREIGN KEYS

A foreign key specifies that an attribute from one relation maps to a tuple in another relation.

RELATIONAL MODEL: FOREIGN KEYS

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA

Album(id, name, artists, year)
id  name               artists  year
11  Enter the Wu-Tang  101      1993
22  St.Ides Mix Tape   ???      1994
33  Liquid Swords      103      1995

RELATIONAL MODEL: FOREIGN KEYS

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA

ArtistAlbum(artist_id, album_id)
artist_id  album_id
101        11
101        22
103        22
102        22

Album(id, name, year)
id  name               year
11  Enter the Wu-Tang  1993
22  St.Ides Mix Tape   1994
33  Liquid Swords      1995
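
The three-relation design can be sketched with SQLite via Python's sqlite3 module (an assumption for illustration; any DBMS with referential constraints would do). The lowercase table names and the link() helper are hypothetical, and the (103, 33) link row is added here only to show a valid insert; it is not on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, country TEXT);
CREATE TABLE album  (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
CREATE TABLE artist_album (
    artist_id INTEGER REFERENCES artist(id),
    album_id  INTEGER REFERENCES album(id),
    PRIMARY KEY (artist_id, album_id)
);
INSERT INTO artist VALUES
    (101, 'Wu-Tang Clan', 1992, 'USA'),
    (102, 'Notorious BIG', 1992, 'USA'),
    (103, 'GZA', 1990, 'USA');
INSERT INTO album VALUES
    (11, 'Enter the Wu-Tang', 1993),
    (22, 'St.Ides Mix Tape', 1994),
    (33, 'Liquid Swords', 1995);
INSERT INTO artist_album VALUES (101, 11), (101, 22), (103, 22), (102, 22);
""")


def link(artist_id, album_id):
    """Insert a link row; return False if the DBMS rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO artist_album VALUES (?, ?)", (artist_id, album_id))
        return True
    except sqlite3.IntegrityError:
        return False


ok = link(103, 33)        # both referenced tuples exist
rejected = link(999, 33)  # there is no artist with id 999
print(ok, rejected)       # True False
```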

RELATIONAL MODEL: CONSTRAINTS

User-defined conditions that must hold for any instance of the database.
→ Can validate data within a single tuple or across entire relation(s).
→ DBMS prevents modifications that violate any constraint.

Unique key and referential (fkey) constraints are the most common.

SQL:92 supports global asserts but these are rarely used (too slow).

CREATE ASSERTION myAssert
CHECK ( <SQL> );

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA
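
A small sketch of single-tuple and unique-key constraints, again assuming SQLite through Python's sqlite3 module. The specific rules (year must be an integer after 1900, name must be unique) and the try_insert helper are invented for illustration. The typeof() test is needed because SQLite would otherwise store a non-numeric string in an INTEGER column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE artist (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,
    year    INTEGER CHECK (typeof(year) = 'integer' AND year > 1900),
    country TEXT
)""")


def try_insert(row):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO artist VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert((101, "Wu-Tang Clan", 1992, "USA"))  # satisfies every constraint
duplicate = try_insert((102, "Wu-Tang Clan", 1992, "USA"))  # violates UNIQUE(name)
bad_year = try_insert((103, "GZA", "not a year", "USA"))  # violates the CHECK
print(first, duplicate, bad_year)  # True False False
```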

DATA MANIPULATION LANGUAGES (DML)

Methods to store and retrieve information from a database.

Procedural: ← Relational Algebra
→ The query specifies the (high-level) strategy to find the desired result based on sets / bags.

Non-Procedural (Declarative): ← Relational Calculus
→ The query specifies only what data is wanted and not how to find it.

RELATIONAL ALGEBRA

Fundamental operations to retrieve and manipulate tuples in a relation.
→ Based on set algebra (unordered lists with no duplicates).

Each operator takes one or more relations as its inputs and outputs a new relation.
→ We can "chain" operators together to create more complex operations.

σ Select
π Projection
∪ Union
∩ Intersection
– Difference
× Product
⋈ Join

RELATIONAL ALGEBRA: SELECT

Choose a subset of the tuples from a relation that satisfies a selection predicate.
→ Predicate acts as a filter to retain only tuples that fulfill its qualifying requirement.
→ Can combine multiple predicates using conjunctions / disjunctions.

Syntax: σ_{predicate}(R)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

σ_{a_id='a2'}(R)
a_id  b_id
a2    102
a2    103

σ_{a_id='a2' ∧ b_id>102}(R)
a_id  b_id
a2    103

SELECT * FROM R
 WHERE a_id='a2' AND b_id>102;
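
One way to picture the select operator is as a filter over a set of tuples. The sketch below models R as a Python set of (a_id, b_id) pairs; the select helper is a hypothetical name, not part of any library.

```python
# Relation R from the slide, modeled as a set of (a_id, b_id) tuples.
R = {("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)}


def select(relation, predicate):
    """Select (sigma): keep only the tuples for which predicate(tuple) is true."""
    return {t for t in relation if predicate(t)}


only_a2 = select(R, lambda t: t[0] == "a2")
a2_over_102 = select(R, lambda t: t[0] == "a2" and t[1] > 102)

print(sorted(only_a2))      # [('a2', 102), ('a2', 103)]
print(sorted(a2_over_102))  # [('a2', 103)]
```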

RELATIONAL ALGEBRA: PROJECTION

Generate a relation with tuples that contain only the specified attributes.
→ Rearrange attributes' ordering.
→ Remove unwanted attributes.
→ Manipulate values to create derived attributes.

Syntax: Π_{A1,A2,…,An}(R)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

Π_{b_id-100,a_id}(σ_{a_id='a2'}(R))
b_id-100  a_id
2         a2
3         a2

SELECT b_id-100, a_id
  FROM R WHERE a_id = 'a2';
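
Projection can be sketched the same way: each output attribute is an expression evaluated against an input tuple. The project helper below is a hypothetical name, and R is again modeled as a set of (a_id, b_id) pairs.

```python
# Relation R from the slide, modeled as a set of (a_id, b_id) tuples.
R = {("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)}


def project(relation, *exprs):
    """Projection (pi): build each output tuple by applying every expression to an input tuple."""
    return {tuple(expr(t) for expr in exprs) for t in relation}


# Select a_id = 'a2' first, then project (b_id - 100, a_id), as on the slide.
a2_rows = {t for t in R if t[0] == "a2"}
result = project(a2_rows, lambda t: t[1] - 100, lambda t: t[0])

print(sorted(result))  # [(2, 'a2'), (3, 'a2')]
```

Because the result is a set, two input tuples that project to the same values would collapse into one output tuple.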

RELATIONAL ALGEBRA: UNION

Generate a relation that contains all tuples that appear in either only one or both input relations.

Syntax: (R ∪ S)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R ∪ S)
a_id  b_id
a1    101
a2    102
a3    103
a4    104
a5    105

(SELECT * FROM R)
UNION
(SELECT * FROM S);
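
With relations modeled as Python sets of tuples (an assumption made for illustration), union maps directly onto the built-in set operator. The union helper is a hypothetical name.

```python
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}


def union(r, s):
    """Union: every tuple that appears in r, in s, or in both, without duplicates."""
    return r | s


result = union(R, S)
print(sorted(result))
# [('a1', 101), ('a2', 102), ('a3', 103), ('a4', 104), ('a5', 105)]
```

The tuple ('a3', 103) appears in both inputs but only once in the output, matching the five-row result on the slide.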

RELATIONAL ALGEBRA: INTERSECTION

Generate a relation that contains only the tuples that appear in both of the input relations.

Syntax: (R ∩ S)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R ∩ S)
a_id  b_id
a3    103

(SELECT * FROM R)
INTERSECT
(SELECT * FROM S);
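
A minimal sketch of intersection over relations modeled as Python sets of tuples. The intersection helper is a hypothetical name, written as an explicit membership test to mirror the definition.

```python
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}


def intersection(r, s):
    """Intersection: only the tuples that appear in both r and s."""
    return {t for t in r if t in s}


result = intersection(R, S)
print(result)  # {('a3', 103)}
```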

RELATIONAL ALGEBRA: DIFFERENCE

Generate a relation that contains only the tuples that appear in the first and not the second of the input relations.

Syntax: (R – S)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R – S)
a_id  b_id
a1    101
a2    102

(SELECT * FROM R)
EXCEPT
(SELECT * FROM S);
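
Unlike union and intersection, difference depends on operand order. The sketch below, using Python sets of tuples and a hypothetical difference helper, shows both directions.

```python
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}


def difference(r, s):
    """Difference: the tuples that appear in r but not in s."""
    return {t for t in r if t not in s}


r_minus_s = difference(R, S)
s_minus_r = difference(S, R)

print(sorted(r_minus_s))  # [('a1', 101), ('a2', 102)]
print(sorted(s_minus_r))  # [('a4', 104), ('a5', 105)]
```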

RELATIONAL ALGEBRA: PRODUCT

Generate a relation that contains all possible combinations of tuples from the input relations.

Syntax: (R × S)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R × S)
R.a_id  R.b_id  S.a_id  S.b_id
a1      101     a3      103
a1      101     a4      104
a1      101     a5      105
a2      102     a3      103
a2      102     a4      104
a2      102     a5      105
a3      103     a3      103
a3      103     a4      104
a3      103     a5      105

SELECT * FROM R CROSS JOIN S;

SELECT * FROM R, S;
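
The product can be sketched with itertools.product from Python's standard library, concatenating each pair of tuples. The cross helper is a hypothetical name.

```python
from itertools import product

R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}


def cross(r, s):
    """Product: concatenate every tuple of r with every tuple of s."""
    return {rt + st for rt, st in product(r, s)}


result = cross(R, S)
print(len(result))                          # 9
print(("a1", 101, "a3", 103) in result)     # True
```

The output size is |R| × |S|, which is why a product followed by a filter is usually a poor way to evaluate a join.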

RELATIONAL ALGEBRA: JOIN

Generate a relation that contains all tuples that are a combination of two tuples (one from each input relation) with a common value(s) for one or more attributes.

Syntax: (R ⋈ S)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id,val)
a_id  b_id  val
a3    103   XXX
a4    104   YYY
a5    105   ZZZ

(R ⋈ S)
a_id  b_id  val
a3    103   XXX

SELECT * FROM R NATURAL JOIN S;

SELECT * FROM R JOIN S USING (a_id, b_id);

SELECT * FROM R JOIN S
    ON R.a_id = S.a_id AND R.b_id = S.b_id;
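
A natural join can be sketched as a nested loop that pairs tuples agreeing on every shared attribute. Here each tuple is modeled as a Python dict keyed by attribute name; natural_join is a hypothetical helper, and a real DBMS would normally pick a faster join algorithm.

```python
R = [
    {"a_id": "a1", "b_id": 101},
    {"a_id": "a2", "b_id": 102},
    {"a_id": "a3", "b_id": 103},
]
S = [
    {"a_id": "a3", "b_id": 103, "val": "XXX"},
    {"a_id": "a4", "b_id": 104, "val": "YYY"},
    {"a_id": "a5", "b_id": 105, "val": "ZZZ"},
]


def natural_join(left, right):
    """Natural join: merge pairs of tuples that agree on all attributes they share."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[k] == r[k] for k in shared):
                out.append({**l, **r})  # shared attributes appear once
    return out


result = natural_join(R, S)
print(result)  # [{'a_id': 'a3', 'b_id': 103, 'val': 'XXX'}]
```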

RELATIONAL ALGEBRA: EXTRA OPERATORS

Rename (ρ)
Assignment (R←S)
Duplicate Elimination (δ)
Aggregation (γ)
Sorting (τ)
Division (R÷S)

OBSERVATION

Relational algebra defines an ordering of the high-level steps of how to compute a query.
→ Example: σ_{b_id=102}(R⋈S) vs. (R⋈(σ_{b_id=102}(S)))

A better approach is to state the high-level answer that you want the DBMS to compute.
→ Example: Retrieve the joined tuples from R and S where b_id equals 102.
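
The two orderings in the example return the same answer but do different amounts of work. The sketch below uses small made-up relations (the contents of R and S are hypothetical, chosen so that b_id = 102 has a match) and a hypothetical join helper to compare them.

```python
# Hypothetical data: R(a_id, b_id) and S(a_id, b_id, val).
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a2", 102, "XXX"), ("a3", 103, "YYY"), ("a4", 104, "ZZZ")}


def join(r, s):
    """Natural join of R and S on (a_id, b_id)."""
    return {(ra, rb, val) for (ra, rb) in r for (sa, sb, val) in s if (ra, rb) == (sa, sb)}


# Plan 1: join everything, then filter on b_id = 102.
joined = join(R, S)
plan1 = {t for t in joined if t[1] == 102}

# Plan 2: filter S on b_id = 102 first, then join the smaller input.
filtered_s = {t for t in S if t[1] == 102}
plan2 = join(R, filtered_s)

print(plan1 == plan2)                # True
print(len(joined), len(filtered_s))  # 2 1
```

Plan 2 hands the join a one-tuple input instead of three; on large relations that gap is what a query optimizer is meant to find without the user spelling it out.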

RELATIONAL MODEL: QUERIES

The relational model is independent of any query language implementation.

SQL is the de facto standard (many dialects).

for line in file.readlines():
    record = parse(line)
    if record[0] == "GZA":
        print(int(record[1]))

SELECT year FROM artists
 WHERE name = 'GZA';
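
To try the declarative version, the sketch below loads the three Artist rows into an in-memory SQLite database through Python's sqlite3 module (an assumed choice of DBMS) and runs the same SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT, year INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO artists VALUES (?, ?, ?)",
    [
        ("Wu-Tang Clan", 1992, "USA"),
        ("Notorious BIG", 1992, "USA"),
        ("GZA", 1990, "USA"),
    ],
)

# The query states what is wanted; how to find the row is left to the DBMS.
row = conn.execute("SELECT year FROM artists WHERE name = ?", ("GZA",)).fetchone()
year = row[0]
print(year)  # 1990
```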

DATA MODELS

Relational
Key/Value
Graph
Document / XML / Object ← Leading Alternative
Wide-Column / Column-family
Array / Matrix / Vectors ← Current Hotness
Hierarchical
Network
Multi-Value

DOCUMENT DATA MODEL

A collection of record documents containing a hierarchy of named field/value pairs.
→ A field's value can be either a scalar type, an array of values, or another document.
→ Modern implementations use JSON. Older systems use XML or custom object representations.

Avoid "relational-object impedance mismatch" by tightly coupling objects and database.

DOCUMENT DATA MODEL

Artist       R1(id,…)
ArtistAlbum  R2(artist_id,album_id)
Album        R3(id,…)

DOCUMENT DATA MODEL

Application Code

class Artist {
  int id;
  String name;
  int year;
  Album albums[];
}
class Album {
  int id;
  String name;
  int year;
}

Artist document with nested Album documents

{
  "name": "GZA",
  "year": 1990,
  "albums": [
    {
      "name": "Liquid Swords",
      "year": 1995
    },
    {
      "name": "Beneath the Surface",
      "year": 1999
    }
  ]
}
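
A short sketch of how application code reads such a document, assuming Python's standard json module. The albums come back nested inside the artist, so no join is needed to list them.

```python
import json

# The artist document from the slide, with its albums embedded.
artist_doc = json.loads("""
{
  "name": "GZA",
  "year": 1990,
  "albums": [
    {"name": "Liquid Swords", "year": 1995},
    {"name": "Beneath the Surface", "year": 1999}
  ]
}
""")

album_names = [album["name"] for album in artist_doc["albums"]]
print(album_names)  # ['Liquid Swords', 'Beneath the Surface']
```

The trade-off is that an album with several artists would have to be copied into each artist's document, which brings back the integrity questions raised for flat files.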

VECTOR DATA MODEL

One-dimensional arrays used for nearest-neighbor search (exact or approximate).
→ Used for semantic search on embeddings generated by ML-trained transformer models (think ChatGPT).
→ Native integration with modern ML tools and APIs (e.g., LangChain, OpenAI).

At their core, these systems use specialized indexes to perform NN searches quickly.

VECTOR DATA MODEL

Album(id, name, year)
id  name               year
11  Enter the Wu-Tang  1993
22  St.Ides Mix Tape   1994
33  Liquid Swords      1995

Album → Transformer → Embeddings
Id1 → [0.32, 0.78, 0.30, ...]
Id2 → [0.99, 0.19, 0.81, ...]
Id3 → [0.01, 0.18, 0.85, ...]
⋮

Embeddings → Vector Index
HNSW, IVFFlat
Meta Faiss, Spotify Annoy

Query: Find albums similar to "Liquid Swords"
Query → Transformer → [0.02, 0.10, 0.24, ...] → Vector Index → Ranked List of Ids
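
The search step can be sketched as an exact, brute-force nearest-neighbor scan. This is not HNSW or IVFFlat; it is the baseline those indexes approximate. The three-component vectors are toy values that reuse only the leading components visible on the slide (real embeddings have many more dimensions), the mapping of Id1..Id3 to album ids 11, 22, 33 is assumed, and Euclidean distance is an assumed metric.

```python
import math

# Toy embeddings keyed by album id (assumed mapping, truncated to three components).
embeddings = {
    11: [0.32, 0.78, 0.30],
    22: [0.99, 0.19, 0.81],
    33: [0.01, 0.18, 0.85],
}
query = [0.02, 0.10, 0.24]


def rank_by_distance(vectors, q):
    """Return ids ordered from nearest to farthest by Euclidean distance to q."""
    return sorted(vectors, key=lambda vid: math.dist(vectors[vid], q))


ranked = rank_by_distance(embeddings, query)
print(ranked)  # [33, 11, 22]
```

This scan touches every stored vector on each query, which is the cost a vector index is built to avoid.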

CONCLUSION

Databases are ubiquitous.

Relational algebra defines the primitives for processing queries on a relational database.

We will see relational algebra again when we talk about query optimization + execution.

NEXT CLASS

Modern SQL
→ Make sure you understand basic SQL before the lecture.