
Overview RDBMS

Traditional relational database management systems (RDBMSs) are used to store and manage structured data in centralized databases. They provide robust and efficient access to large amounts of persistent data through transactions and queries. Some key aspects of traditional RDBMSs include storing data in tables with rows and columns, supporting ACID transactions, and providing SQL for declarative querying and updates. RDBMS software like Oracle, SQL Server, DB2, and MySQL allow applications to interact with relational databases.

Overview of traditional RDBMS

Traditional ?
Structured data (relational), centralized, disk-based, OLTP workloads, ACID

COMP7104 - DASC7104 2
What is a database
A database is a collection of data, typically
related and describing activities of an
organization in terms of entities (real-world
objects) and relationships between them.

What is an RDBMS
A big piece of software that allows us to efficiently
manage a large database and allows it to
persist over long periods of time
• Quick / robust / safe / simple access

Examples: DB2 (IBM), SQL Server (MS), Oracle, Postgres, MySQL, etc.

Example: IMDB
• The Internet Movie Database
• http://www.imdb.com

Entities: Actors (1.5M), Movies (1.8M), Directors

Relationships: who played where, who directed what, …

Tables (or relations)
Actor
ActorID  FirstName  LastName
45601    Clint      Eastwood
...

Casting
ActorID  MovieID
45601    10230
...

Movie
MovieID  Title      Year
10230    Star Wars  1977
...

Queries – SQL (declarative!)
SELECT *
FROM Movie
WHERE Title LIKE 'Star %'

SELECT count(*)
FROM Actor

SELECT count(*)
FROM Movie

SELECT *
FROM Movie
LIMIT 50

SELECT *
FROM Actor
WHERE FirstName = 'Clint'

SELECT MovieID, count(*)
FROM Casting
GROUP BY MovieID

SELECT min(Year)
FROM Movie

Queries – SQL

SELECT *
FROM Movie M, Actor A, Casting C
WHERE A.FirstName = 'Clint'
  AND M.MovieID = C.MovieID
  AND A.ActorID = C.ActorID
  AND M.Year = 1995

This query has selections and joins
• 1.8M actors, 11M castings, 1.5M movies
• How do we make it fast?
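
To make the query concrete, here is a minimal sketch that runs it with Python's built-in sqlite3 module on a toy database. Only the first row of each table comes from the slides; the other rows are invented, and the sketch selects named columns instead of `*` so the output stays readable.

```python
import sqlite3

# In-memory database holding the three tables from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Actor   (ActorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Movie   (MovieID INTEGER PRIMARY KEY, Title TEXT, Year INTEGER);
    CREATE TABLE Casting (ActorID INTEGER, MovieID INTEGER);
""")

# Toy rows: the first row of each table is from the slides, the rest are invented.
conn.executemany("INSERT INTO Actor VALUES (?, ?, ?)",
                 [(45601, "Clint", "Eastwood"), (45602, "Meryl", "Streep")])
conn.executemany("INSERT INTO Movie VALUES (?, ?, ?)",
                 [(10230, "Star Wars", 1977),
                  (10231, "The Bridges of Madison County", 1995)])
conn.executemany("INSERT INTO Casting VALUES (?, ?)",
                 [(45601, 10230), (45601, 10231), (45602, 10231)])

# Two selections (FirstName, Year) and two joins (through Casting).
join_query = """
    SELECT A.FirstName, A.LastName, M.Title, M.Year
    FROM Movie M, Actor A, Casting C
    WHERE A.FirstName = 'Clint'
      AND M.MovieID = C.MovieID
      AND A.ActorID = C.ActorID
      AND M.Year = 1995
"""
rows = conn.execute(join_query).fetchall()
print(rows)
```

The query only states which rows are wanted; how the three tables are scanned and joined is left to the database engine.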

How can we evaluate the query?

Classical query execution
• Index-based selection
• Hash-join
• Merge-join
• Index-join

Classical query optimizations
• Pushing selections down
• Join reorder

Classical statistics
• Table cardinalities
• # distinct values
• Histograms

[Figure: two alternative plans joining (⨝) Actor, Casting and Movie, which differ in where the selections σFirstName='Clint' and σYear=1995 are applied]
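
The effect of pushing selections down can be sketched in plain Python. The rows and the helper functions `select` and `join` below are invented for illustration; the point is only that both plans return the same answer while the second one builds smaller intermediate results.

```python
# Toy relations as lists of dicts (invented rows, same column names as the slides).
actors = [{"ActorID": 1, "FirstName": "Clint"},
          {"ActorID": 2, "FirstName": "Meryl"},
          {"ActorID": 3, "FirstName": "Tom"}]
castings = [{"ActorID": 1, "MovieID": 10}, {"ActorID": 1, "MovieID": 11},
            {"ActorID": 2, "MovieID": 11}, {"ActorID": 3, "MovieID": 12}]
movies = [{"MovieID": 10, "Year": 1977},
          {"MovieID": 11, "Year": 1995},
          {"MovieID": 12, "Year": 1995}]


def select(rows, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]


def join(left, right, key):
    """Equi-join on a shared column name, using naive nested loops."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def is_clint(row):
    return row["FirstName"] == "Clint"


def is_1995(row):
    return row["Year"] == 1995


# Plan A: join everything first, apply both selections at the end.
ac = join(actors, castings, "ActorID")
acm = join(ac, movies, "MovieID")
plan_a = select(select(acm, is_clint), is_1995)
plan_a_intermediate = len(ac) + len(acm)

# Plan B: push the selections below the joins.
clint_castings = join(select(actors, is_clint), castings, "ActorID")
plan_b = join(clint_castings, select(movies, is_1995), "MovieID")
plan_b_intermediate = len(clint_castings) + len(plan_b)

print(plan_a == plan_b, plan_a_intermediate, plan_b_intermediate)
```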

Query optimization
• Rule-based query optimization, heuristics
• Dynamic programming, cost-based query optimization

Role of an RDBMS
1. Creation and storage of data (DML)
2. Queries and updates (DML)
3. Change / evolution of structure (DDL)
4. Concurrency control (multiple users) (Transactions, ACID)
5. Recovery (Transactions, ACID)
6. Data integrity and security (Grant, Revoke, roles)
7. Reliability (99.99999%)
8. Reduced application development time
9. Efficiency (thousands of queries per second)

… and behind the curtains, our focus...

Data independence

Physical schema
• Storage as files, row vs.
column store, indexes

Data independence
Application programs are insulated from changes
in the way the data is structured and stored
• Key property of a DBMS
– Logical independence
– Physical independence

Logical data independence
Users isolated from changes in the logical structure of data

• Initially one table for actors:


Actor(ActorID: String, FirstName: String, LastName:String)

• Then divided into two relations:

AmericanActor(ActorID:String,FirstName:String, LastName:String)

WorldActor(ActorID: String, FirstName: String, LastName:String)

• A “view” Actor can still be obtained from the two new relations
by merging them

• Users/applications querying the view Actor get the same answer as before
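
A minimal sketch of this idea with Python's sqlite3 module: the two new relations become the base tables, and a view named Actor merges them with UNION ALL. The WorldActor row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- After the schema change: two base relations instead of one Actor table.
    CREATE TABLE AmericanActor (ActorID TEXT, FirstName TEXT, LastName TEXT);
    CREATE TABLE WorldActor    (ActorID TEXT, FirstName TEXT, LastName TEXT);

    -- The view restores the old logical schema by merging the two relations.
    CREATE VIEW Actor AS
        SELECT ActorID, FirstName, LastName FROM AmericanActor
        UNION ALL
        SELECT ActorID, FirstName, LastName FROM WorldActor;
""")
conn.execute("INSERT INTO AmericanActor VALUES ('45601', 'Clint', 'Eastwood')")
conn.execute("INSERT INTO WorldActor VALUES ('45700', 'Toshiro', 'Mifune')")  # invented row

# Queries written against the original Actor table keep working unchanged.
clints = conn.execute(
    "SELECT LastName FROM Actor WHERE FirstName = 'Clint'").fetchall()
actor_count = conn.execute("SELECT count(*) FROM Actor").fetchone()[0]
print(clints, actor_count)
```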


Physical data independence (1)
Physical data independence: applications should
be isolated from
• changes to the physical organization
– add/drop index
– different storage organization

(Actor,Movie*)*
(Movie,Actor*)*

(Movie*, Casting*, Actor*)


Physical data independence (2)
The logical schema insulates users from changes in
physical storage details
• how the data is stored on disk
• the file structure
• the choice of indexes
The application remains unaltered
• Performance may be affected by such changes !
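
The following sketch illustrates physical independence with SQLite: adding an index changes the plan the engine picks, but not the query text or its answer. The table contents, the index name `idx_actor_firstname` and the helper `plan` are invented for this example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Actor (ActorID INTEGER, FirstName TEXT, LastName TEXT)")
# 1000 invented actors, every 100th one is named Clint.
conn.executemany(
    "INSERT INTO Actor VALUES (?, ?, ?)",
    [(i, "Clint" if i % 100 == 0 else f"First{i}", f"Last{i}") for i in range(1000)])

query = "SELECT ActorID FROM Actor WHERE FirstName = 'Clint' ORDER BY ActorID"


def plan(sql):
    """Return SQLite's textual plan (last column of EXPLAIN QUERY PLAN)."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


rows_before, plan_before = conn.execute(query).fetchall(), plan(query)

# Physical change only: the application query above is not touched.
conn.execute("CREATE INDEX idx_actor_firstname ON Actor(FirstName)")

rows_after, plan_after = conn.execute(query).fetchall(), plan(query)
print(plan_before)
print(plan_after)
```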

Physical data independence (3)

Query processor → Translates WHAT into HOW

• SQL = WHAT we want = declarative

• Relational algebra = HOW to get it = algorithm

• RDBMS are about translating WHAT into HOW


SQL – Structured Query Language

Language for computing on relations
• that separates the WHAT from the HOW
• and enables the system to choose the best HOW given the data and its layout

HW1: SQL

Data model: relational model
A data model is a collection of high-level data description
constructs (for structure, operations, constraints) that
hide many low-level storage details.

In this course:
– structured data: all elements have a fixed format
(relational, nested relational, semi-structured)
– relational model: tables

Other important data models
• Semi-structured data model (XML, JSON)
– some structure but not fixed
– hierarchically nested tagged-elements in tree
structure
• Nested relational model: nested tables
• Graph model
• Unstructured data: text, image, audio, video

Applications on DBMS
Any compute service that maintains state today is an
application on top of some kind of DBMS
– Uber
– Cathay Pacific
– Amazon
– HSBC
– SCMP

Applications want something from the DBMS

• Queries and updates
• Real applications are composed of many sorts of
statements being generated by user behaviors
• Many users work with the database at the same time

[Figure: many SQL statements arriving at the Database Management System at the same time]

Concurrency control and recovery
• Concurrency control (RDBMS’s Transaction Manager)
– Correct and fast data access in the presence of concurrent
work by many users
– Disorderly processing that provides the illusion of order !

• Recovery (RDBMS’s Recovery Manager)


– Ensures database is fault tolerant, and not corrupted by
software, system, or media failure
– Storage (hard disk) guarantees for mission-critical data

Why run multiple transactions concurrently?

Throughput: increased processor and disk utilization leads to
more transactions per second (TPS) completed
– Single core: e.g., one transaction uses the CPU while
another is reading from or writing to disk
– Multi-core: ideally, throughput scales with the number of
processors
Latency: multiple transactions can run at the same time, so
(with ample resources) one transaction’s latency need not
depend on another, unrelated transaction (hopefully)

Example

Group 1:

UPDATE Budget
SET balance = balance - 500
WHERE uid = 1

UPDATE Budget
SET balance = balance + 200
WHERE uid = 2

UPDATE Budget
SET balance = balance + 300
WHERE uid = 3

Group 2:

SELECT sum(balance)
FROM Budget

Would like to treat each group of instructions as an atomic unit!

What could go wrong?
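
One thing that can go wrong is sketched below in plain Python: if the sum is computed between two of the updates, it sees money that has left one account but not yet arrived in the others. The starting balances are invented, and a dictionary stands in for the Budget table.

```python
# Toy Budget table: uid -> balance (invented starting balances).
budget = {1: 1000, 2: 0, 3: 0}
total_before = sum(budget.values())

# The three updates of the first group, as (uid, delta) steps.
transfer_steps = [(1, -500), (2, +200), (3, +300)]

observed_totals = []
for uid, delta in transfer_steps:
    budget[uid] += delta
    # A concurrent "SELECT sum(balance)" that slips in after each single update.
    observed_totals.append(sum(budget.values()))

total_after = sum(budget.values())
print(total_before, observed_totals, total_after)
```

The total is the same before and after the whole group, but a reader that runs in the middle observes a total that never existed in any consistent state.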


Transactions
• Major component of database systems
• Critical for most applications
• Turing Awards to database researchers:
– Charles Bachman (1973) for pioneering early DBMS,
including IDS
– Edgar Codd (1981) for inventing relational DBs
– Jim Gray (1998) for inventing transaction processing
– Michael Stonebraker (2014) for pioneering relational
DBMSs, including Ingres and Postgres

What is a transaction?
• Sequence of many actions considered to be one atomic
unit of work (one logical unit)
• Usage:
1. Begin transaction
2. Set of SQL statements
3. End transaction
• Examples:
– Transfer balance between accounts
– Book a flight, a hotel, and a car together on Expedia
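
The begin / statements / end pattern can be sketched with Python's sqlite3 module. The CHECK constraint, the starting balances and the helper names `transfer` and `balances` are assumptions made for this example; the second call fails on its last statement, and the rollback undoes the two updates that had already run.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the sketch issues BEGIN / COMMIT / ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Budget ("
             "uid INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO Budget VALUES (?, ?)", [(1, 600), (2, 0), (3, 0)])


def transfer(amount_out, to_uid2, to_uid3):
    """Run the three updates as one unit: all of them commit, or none does."""
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE Budget SET balance = balance + ? WHERE uid = 2", (to_uid2,))
        conn.execute("UPDATE Budget SET balance = balance + ? WHERE uid = 3", (to_uid3,))
        conn.execute("UPDATE Budget SET balance = balance - ? WHERE uid = 1", (amount_out,))
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # undo the updates already applied
        return False


def balances():
    return conn.execute("SELECT balance FROM Budget ORDER BY uid").fetchall()


first_ok = transfer(500, 200, 300)    # succeeds: 600 - 500 is not negative
after_first = balances()
second_ok = transfer(500, 200, 300)   # violates the CHECK on uid 1: 100 - 500 < 0
after_second = balances()
print(first_ok, after_first, second_ok, after_second)
```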

Transaction model in RDBMS
• Transaction
– sequence of reads and writes of database objects
– batch of work that must commit or abort as an atomic unit
• RDBMS’s transaction manager controls execution of transactions
• Program logic is invisible to the RDBMS
– The DBMS only sees data read/written from/to the DB
– Arbitrary computation possible on data fetched from the DB


ACID guarantees
• Atomicity: All actions in the transaction happen, or none
happen (Recovery Manager)
• Consistency: If the DB starts out consistent, it ends up
consistent at the end of the transaction (Transaction Manager)
• Isolation: Execution of each transaction is isolated from
that of other transactions (Transaction Manager)
• Durability: If a transaction commits, its effects persist
(Recovery Manager)

Quick quiz – which is which?
• Atomicity
• Consistency
• Isolation
• Durability

1. Maintain integrity constraints
2. All or nothing
3. Committed data survives failures
4. No worry of race conditions

WHAT WE’LL LEARN ABOUT RDBMS

RDBMS anatomy
We will unpack a database system and explore its modular design.

SQL Client (completed)
Database Management System:
  Query Parsing & Optimization
  Relational Operators
  Files and Index Management
  Buffer Management
  Disk Space Management (you are here)
Database (storage)

Abstraction at each level
Layers of the Database Management System:
  Query Parsing & Optimization: What → How
  Relational Operators: How → Dataflow on files
  Files and Index Management: Files → Blocks in memory
  Buffer Management: Memory blocks → Disk pages
  Disk Space Management (you are here): Pages on disk → Bytes
Database (storage)

Each level hides the complexity of the next

We’ll study data layout and indexes
[Figure: data layout, from records to files. A table of records with columns SSN, Last Name, First Name, Age, Salary (rows: 123 Adams Elmo 31 $400; 443 Grouch Oscar 32 $300; 244 Oz Bert 55 $140; 134 Sanders Ernie 55 $400). The byte representation of a record (Bob, Harmon, M, 32, 94703, typed Varchar, Varchar, Char, Int, Int) with a header. A slotted page with a page header. A file made of Pages 1 to 6, some of them held in frames.]

• Knowledge of data and access patterns can affect choice of
data-layout and caching strategies
• Choice → Challenges → Motivated query optimization
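
As a rough illustration of a byte representation of a record, the sketch below serializes the (Bob, Harmon, M, 32, 94703) record with Python's struct module. The layout chosen here (fixed-length fields first, then two string lengths, then the string bytes) and the function names are assumptions for this example, not the layout drawn on the slide.

```python
import struct

# '<' = little-endian without padding; c = 1-byte char, i = 4-byte int,
# H = 2-byte unsigned length of a variable-length string.
FIXED_FORMAT = "<ciiHH"


def encode_record(first, last, sex, age, zipcode):
    """Lay out one record as bytes: fixed-length part, then the two strings."""
    first_b, last_b = first.encode("utf-8"), last.encode("utf-8")
    fixed = struct.pack(FIXED_FORMAT, sex.encode("ascii"), age, zipcode,
                        len(first_b), len(last_b))
    return fixed + first_b + last_b


def decode_record(buf):
    """Inverse of encode_record: read the fixed part, then slice the strings."""
    sex, age, zipcode, n_first, n_last = struct.unpack_from(FIXED_FORMAT, buf, 0)
    start = struct.calcsize(FIXED_FORMAT)
    first = buf[start:start + n_first].decode("utf-8")
    last = buf[start + n_first:start + n_first + n_last].decode("utf-8")
    return first, last, sex.decode("ascii"), age, zipcode


record = encode_record("Bob", "Harmon", "M", 32, 94703)
print(len(record), decode_record(record))
```

Storing the lengths in the fixed part lets a reader jump to either string without scanning for a terminator, which is the usual motivation for this kind of layout.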
We’ll learn how to index stored data

HW2: Indexing

We’ll learn about query execution
• Simple closed set of operators
– σ (selection)
– Π (projection)
– ρ (renaming)
– ⋈ (join)
• Combined together via iterators into a data flow
– Iterator
– Materialization
– Vectorization

[Figure: query plan over Actor, Casting and Movie. B+Tree IndexedScans at the leaves, on-the-fly select operators σFirstName='Clint' and σYear=1995, and indexed nested-loop joins (⨝); every operator is an iterator.]
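
The iterator style of execution can be sketched with Python generators: each operator pulls rows from its child one at a time, so the plan forms a data flow. This is a simplification of the plan on the slide, using plain scans and a naive nested-loop join instead of B+Tree index scans; the rows other than Clint Eastwood and Star Wars are invented.

```python
def scan(table):
    """Leaf operator: yield the rows of a stored table one at a time."""
    for row in table:
        yield row


def select(child, predicate):
    """σ: pass through only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    """Π: keep only the named columns."""
    for row in child:
        yield {c: row[c] for c in columns}


def nested_loop_join(outer, inner_table, key):
    """⋈: for each outer row, rescan the inner table for matching rows."""
    for o in outer:
        for i in scan(inner_table):
            if o[key] == i[key]:
                yield {**o, **i}


actors = [{"ActorID": 45601, "FirstName": "Clint", "LastName": "Eastwood"},
          {"ActorID": 45602, "FirstName": "Meryl", "LastName": "Streep"}]
castings = [{"ActorID": 45601, "MovieID": 10230},
            {"ActorID": 45601, "MovieID": 10231},
            {"ActorID": 45602, "MovieID": 10231}]
movies = [{"MovieID": 10230, "Title": "Star Wars", "Year": 1977},
          {"MovieID": 10231, "Title": "The Bridges of Madison County", "Year": 1995}]

# Build the plan bottom-up; nothing runs until rows are pulled from the top.
clint = select(scan(actors), lambda r: r["FirstName"] == "Clint")
with_casting = nested_loop_join(clint, castings, "ActorID")
with_movie = nested_loop_join(with_casting, movies, "MovieID")
in_1995 = select(with_movie, lambda r: r["Year"] == 1995)
plan = project(in_1995, ["LastName", "Title"])

result = list(plan)
print(result)
```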

We’ll bridge the WHAT with the HOW
• Query optimization!
• Three stages
– Plan space
– Cost estimation
– Search algorithm
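
The three stages can be sketched on the Actor / Casting / Movie query. The plan space here is the set of left-deep join orders, the cost of a plan is the sum of its estimated intermediate result sizes, and the search is a brute-force enumeration (a real optimizer would typically use dynamic programming instead). The cardinalities are the ones quoted in the slides; every selectivity is a made-up number for illustration.

```python
from itertools import permutations

# Table cardinalities quoted in the slides.
cardinality = {"Actor": 1_800_000, "Casting": 11_000_000, "Movie": 1_500_000}

# Hypothetical selectivities of FirstName = 'Clint' and Year = 1995.
filter_selectivity = {"Actor": 0.001, "Casting": 1.0, "Movie": 0.01}

# Hypothetical join selectivities; pairs without a predicate are cross products.
join_selectivity = {
    frozenset({"Actor", "Casting"}): 1 / cardinality["Actor"],
    frozenset({"Casting", "Movie"}): 1 / cardinality["Movie"],
}


def estimate(order):
    """Cost estimation: sum of estimated intermediate result sizes."""
    joined = [order[0]]
    size = cardinality[order[0]] * filter_selectivity[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= cardinality[table] * filter_selectivity[table]
        # Apply every join predicate that links the new table to a joined one.
        for other in joined:
            size *= join_selectivity.get(frozenset({table, other}), 1.0)
        joined.append(table)
        cost += size
    return cost


# Plan space: all left-deep orders. Search algorithm: exhaustive enumeration.
plans = {order: estimate(order) for order in permutations(cardinality)}
best_order = min(plans, key=plans.get)

for order, cost in sorted(plans.items(), key=lambda item: item[1]):
    print(" ⨝ ".join(order), round(cost))
```

Under these assumed numbers, orders that start by joining Actor with Casting come out cheapest, and orders that start with the Actor × Movie cross product come out most expensive.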

PA1: Query optimisation (Oracle)

We’ll reason about transaction
ordering and concurrency control
• Correct (ideal): serially-ordered
• Desire: interleave to maximize performance
• Risk: disorder may lead to data anomalies
• Allowable orders: (conflict) serializability
• Implementation: (Strict) 2PL
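
Conflict serializability can be tested with a precedence graph: add an edge Ti → Tj whenever an action of Ti conflicts with a later action of Tj (same item, different transactions, at least one write), and check that the graph has no cycle. The sketch below is one way to write that test; the schedule encoding and the function name are choices made for this example.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, action, item) with action 'R' or 'W'.

    Returns True when the precedence graph of the schedule is acyclic.
    """
    edges = {}
    for i, (t1, a1, x1) in enumerate(schedule):
        edges.setdefault(t1, set())
        for t2, a2, x2 in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and "W" in (a1, a2):
                edges[t1].add(t2)

    # Depth-first search for a cycle (white = unseen, grey = on the stack).
    white, grey, black = 0, 1, 2
    colour = {t: white for t in edges}

    def has_cycle(node):
        colour[node] = grey
        for nxt in edges[node]:
            if colour[nxt] == grey or (colour[nxt] == white and has_cycle(nxt)):
                return True
        colour[node] = black
        return False

    return not any(colour[t] == white and has_cycle(t) for t in edges)


# T1 finishes with A before T2 touches it: equivalent to the serial order T1, T2.
ok_schedule = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
# T2 writes A between T1's read and T1's write: edges T1 → T2 and T2 → T1.
bad_schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

print(conflict_serializable(ok_schedule), conflict_serializable(bad_schedule))
```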

PA2: Transactions (Oracle)

We’ll learn about recovery

• Write-Ahead Logging (WAL)
• ARIES

[Figure: log sequence numbers kept in the DB and in RAM: LSNs, pageLSNs, flushedLSN]
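
The write-ahead rule can be sketched with a toy model: every update is logged before the in-memory page changes, and a page may only be written to disk once the log has been flushed at least up to that page's pageLSN. The class below is an invented illustration of that single rule; it is not ARIES and contains no recovery logic.

```python
class TinyWAL:
    """Toy model of the write-ahead rule using log sequence numbers (LSNs)."""

    def __init__(self):
        self.log = []          # log records; the LSN is the 1-based position
        self.flushed_lsn = 0   # highest LSN known to be on stable storage
        self.page_lsn = {}     # page id -> LSN of the last update to that page
        self.buffer = {}       # pages in RAM
        self.disk = {}         # pages on disk

    def update(self, page, value):
        """Log first, then change the in-memory page and stamp it with the LSN."""
        self.log.append((page, self.buffer.get(page), value))  # (page, old, new)
        lsn = len(self.log)
        self.buffer[page] = value
        self.page_lsn[page] = lsn
        return lsn

    def flush_log(self, up_to_lsn):
        """Pretend to force the log to stable storage up to the given LSN."""
        self.flushed_lsn = max(self.flushed_lsn, min(up_to_lsn, len(self.log)))

    def write_page(self, page):
        """WAL rule: flush the log up to pageLSN before the page reaches disk."""
        if self.page_lsn[page] > self.flushed_lsn:
            self.flush_log(self.page_lsn[page])
        self.disk[page] = self.buffer[page]


wal = TinyWAL()
lsn1 = wal.update("P1", "a")
lsn2 = wal.update("P2", "b")
wal.write_page("P2")  # forces the log up to LSN 2 first
print(lsn1, lsn2, wal.flushed_lsn, wal.disk)
```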

We’ll relay some key messages
• Query optimization is good (omnipresent): most of
today’s popular systems have a query optimizer of
some kind
• Declarative languages are good: SQL in DBMS, SQL
on Big Data (more on that later)
• Schema is good
• Secondary indexes are good

But we’ll also chart and explore the
limitations of RDBMS
Relational databases have been around for decades
and they are very well designed for…
• … structured data
• … frequent, concurrent reads and writes, integrity (OLTP)

Their core concepts were also designed decades ago
– Sequential access to disk is slow, RAM is scarce
– Data / schema normalization
• Removing redundancy / duplication and optimizing storage
– Originally designed for single machines
• Scaling usually means buying a new and “bigger” machine

What’s changed since RDBMS inception?
• Dropping cost of disks (more on that soon)
– Cheaper to store everything than to figure out what
we really need !
• Types of data collected
– From data that’s obviously valuable to data whose
value is less apparent
• Rise of social media and user-generated content
– Large increase in data volume, need for data analytics
• Growing maturity of data mining techniques
– Demonstrates value of data analytics