
DATABASE SYSTEMS

WEEK 1
- Database:
A large, integrated, structured collection of data, usually intended to model some
real-world enterprise
Example: A social media site
1. Entities e.g. posts, users, photos
2. Relationships e.g. a user likes a post

- Database Management System(DBMS):


A database management system is a software system designed to store, manage and
facilitate access to databases.

- Manage data in a structured way

- Database advantages:
1. Data independence:
Separation of data and program/application logic, central management

2. Minimal data redundancy:
Redundancy can be controlled

3. Improved data consistency:
Single store: no disagreements, fewer update problems, less storage space

4. Improved data sharing:
Data is shared as a corporate resource, not tied to a single application; external
users can be allowed access; multiple, arbitrary views of data

5. Reduced program maintenance:
Data structure can change without the application programs changing

6. Less coding - SQL

- Database design:
1. Conceptual Design:
Construction of a model of the data used in the database, independent of all
physical considerations (irrespective of the DBMS, e.g. MySQL).
Results in an ER diagram.
Example: investment banking - an investment bank has a number of branches,
within each branch a number of departments operate and are structured in a
hierarchical manner. The bank employs staff who are assigned to work
- Need a database to record staff details including which department and
branch they are assigned to
2. Logical Design:
Construction of a relational model of data based on the conceptual design - data
organised in relations.
Involves arranging data into a series of logical relationships called entities and
attributes.
Independent of any particular database management system (e.g. MySQL)

3. Physical Design:
A description of the implementation of the logical design for a specific DBMS
(e.g. MySQL).
Defines data types and file organisation

Describes:
- Base relations (data types)
- File organisation
- Indexes

- Choosing Data Types

1. Prevents users from entering invalid values (e.g. overly long character strings)
2. Selection of data types improves data integrity
3. Reduces wasted storage
4. Data types help the DBMS store and use information efficiently

- Database Development Lifecycle


- Examinable
1. Can you discuss the database development lifecycle?
2. What is done at each stage of design?

WEEK 2

- Conceptual Design: Objectives:

1. Entities and relationships in the enterprise (e.g. entity - student, identified by ID number)
2. Information about entities and relationships we store in the database (e.g. for a
student: address, full name, etc.)
3. Integrity constraints (e.g. two students having the same ID number is not allowed)

1. ER Model: Entity and its attributes:

- Entity: Real-world object distinguishable from other objects. It is described using
a set of attributes
E.g. if the entity is a student, its attributes will be student ID, name, address, number

- Entity Set: A collection of entities that have the same properties. All entities in an
entity set have the same set of attributes; each entity has a key
2. ER Model: Relationship:
- Relationship: Association among two or more entities. Relationships can have
their own attributes
Eg: Student enrols in INFO20003, student gets assessed for assignments
Fred works in the Pharmacy Department

- Relationship Set: Collection of relationships of the same type.


Eg: Employees work in departments.

3. ER Model: Relationship Roles


Same entity can participate in different relationship sets or even different “roles” in
the same set.
Eg: an employee can be a supervisor and a subordinate to a supervisor(different roles
in the same set)

- Constraints
1. Key Constraints: Types
Key constraints determine the number of objects taking part in the relationship
set (how many from each side)

E.g. One-to-One - a car can be owned by one person only; one person, one vote
One-to-Many - one person can own many cars
Many-to-Many - an employee can work in many departments and a department can
have many employees

- Participation Constraints
Explores whether all entities of one entity set take part in a relationship

If yes: this is total participation: each entity takes part in at least one relationship

E.g. each department is managed by an employee and every employee manages a
department

If no: this is partial participation

E.g. each department is managed by an employee, but every employee doesn't need to
manage a department

EXAMPLE:

1. Every employee must work in a department. Each department has at least one
employee - "must" and "at least one" both mean mandatory, so participation is
total on both sides
2. Each department must have a manager (but not everyone is a manager) - total
participation for departments, partial participation for employees

- Weak vs Strong Entities:

1. Weak Entity: Can be identified uniquely only by considering another
entity. Represented as a "bold" rectangle. Cannot be uniquely
identified by its attributes alone, so it uses the key of another entity (the owner) in
conjunction with its own attributes to form a primary key.
Any time we see a FK becoming part of a PK, it is a weak entity.

On Chen’s Notation:
- Weak entity set must have total participation in this
relationship set. Such relationship is called identifying and is
represented as “bold”
- Weak entities have only a "partial key" (dashed underline) and
they are identified uniquely only when considering the primary key
of the owner entity.
- EXAMPLE
1. Strong - “has a key which may be defined without reference to
other entities” eg: Character - CharID = C001, CharName =
“Tracer”

2. Weak - "has a key [that] requires the existence of one or more other
entities" - e.g. CharacterSkin - PK: CharSkinID = T0004,
CharID = C001. CharSkinID depends on CharID

- Ternary Relation

A single relationship set involves three entities - supplier, part and department

- Special Attribute Type : Multi-valued attributes:


Can have multiple(finite set of) values of the same type - attributes with multiple values
Eg: for employees we need to capture their home phone number and work phone
number
- Special attribute type: Composite attributes
Have a structure hidden inside (each element can be of different type)
Eg: we need to capture an address consisting of a postcode, street name and number

EXERCISE
- Entities: [Subjects], [Professors]
- Each subject has ID (eg INFO20003), title, time
- Make up suitable attributes for [Professors]

Basics
0. Any number of professors teach any number of subjects (many-to-many)
1. Every professor teaches exactly one subject (no more, no less). A subject can be taught
by multiple professors (e.g. INFO20003), but in some cases there might be no teacher
(e.g. a virtual class)

CONCEPTUAL DESIGN

- Design Choices
- Should a concept be modelled as an entity or an attribute?
- Should a concept be modelled as an entity or a relationship ?
- Should we model relationships as binary, ternary, n-ary?

- Constraints in the ER Model:


A lot of data semantics can and should be captured

- Entity VS Attribute
- Example: Should "address" be an attribute of Employees or an entity in its own right?

Consider:
Depends upon how we want to use address information, and the semantics of
the data:
1. If we have several addresses per employee, address must be an entity.
2. What if an address links to both employee and say
WorkFromHomeContract?

- Summary of Conceptual Design:


Follows requirements analysis:
- Yields a high-level description of data to be stored

ER model popular for conceptual design.


- Constructs are expressive, close to the way people think about their applications
- Originally proposed by Peter Chen, 1976

Basic constructs: entities, relationships and attributes

Other constructs: Weak entities

What's Examinable?
1. Draw conceptual diagrams yourself
2. Given a problem: determine entities/attributes/relationships
3. What is a key constraint, participation constraint, weak entity?
4. Determine constraints for the given entities and their relationships

The primary key is underlined in a Chen diagram


Relationships are drawn as diamonds
Entities are drawn as rectangles

- Relational Data Model


Data model allows us to translate real world things into structures that a computer can
store. Many models - Relational, ER, O-O, Network, Hierarchical etc

- Relational Model:
1. Rows and Columns ( Tuples/records and Attributes/fields)
2. Key and Foreign Key to link relations
Definitions
- Relational Database: a set of relations.
- Relation: Made up of 2 parts
1. Schema:
Specifies name of relation, plus name and type of each column (attribute).
Example: Students(sid: string, name: string, login: string, age: integer, gpa: real)
Analogy: function definition in programming…

2. Instance: a table, with rows and columns.


#rows = cardinality == number of rows counting down
#fields = degree or arity == number of columns counting left to right

A relation can be thought of as a set of rows or tuples


All rows are distinct and there is no order among rows - we can't have two rows with the
same student id, name, etc.

Logical Design: ER to Relational Model


Chen/Conceptual -> Logical

- In logical design an entity set becomes a relation


- Attributes become attributes of a relation
- Add data types to attributes of the entity

The Entire Cycle:

1. Conceptual :
Chen Diagram

2. Logical:
Employee (ssn, name, age) - pseudo code

3. Physical (add data types):


Employee (ssn CHAR(11), name VARCHAR(20), age INT)

4. Implementation (actual SQL code):


CREATE TABLE Employee ( ssn CHAR(11), name VARCHAR(20), age INTEGER,
PRIMARY KEY (ssn))

5. Instance:
- Creating Relations in SQL
CREATE TABLE RelationName (attr1 CHAR(20), attr2 FLOAT, attr3 INTEGER, ...)

E.g. Students (sid: string, name: string, login: string, age: integer, gpa: real)

CREATE TABLE Students (sid VARCHAR(20), name VARCHAR(20), gpa FLOAT, ...)

- KEY:
Keys are a way to associate tuples in different relations.
Keys are one form of integrity constraint (IC). E.g. if a student dropped out of university,
it wouldn't make sense for the student to still be enrolled in a unit; likewise an employee
enrolled (rather than teaching) in a unit would be wrong

Example: Only students can be enrolled in subjects.

- Primary Keys (PK):
- A set of fields is a superkey if no two distinct tuples can have the same values in
all key fields.
E.g. two people named John - {name} won't be a superkey; but if John 1 likes Kendrick
and John 2 likes Cole, {name, favourite artist} would be a superkey, as the combined
values are distinct

- A set of fields is a key for a relation if it is a superkey and no subset of the fields
is a superkey (minimal subset).

- Out of all keys one is chosen to be the primary key of the relation. Other keys
are called candidate keys.

- Each relation has a primary key.

- Foreign Keys (FK):


- Foreign key - A set of fields in one relation that is used to ‘refer’ to a tuple in
another relation
- Foreign Key must correspond to the primary key of the other relation.
- If all foreign key constraints are enforced in a DBMS, we say referential integrity
is achieved.

Referential Integrity (RI)


Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students.
- ADD: What should be done if an Enrolled tuple with a non-existent student id is
inserted? (REJECT IT)

- DELETE: What should be done if a Student tuple is deleted (e.g. the student changed
universities)? Options (sketched in SQL below):
- Delete all Enrolled tuples that refer to it (cascading delete to all their
enrolments)
- Disallow deletion of a Student tuple that is referred to
- Set sid in Enrolled tuples that refer to it to a default sid
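A minimal MySQL sketch of how these options are declared (table and column names
assumed from the example; note SET DEFAULT is parsed but not supported by InnoDB,
so SET NULL is shown instead):

CREATE TABLE Students (
    sid INT NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY (sid)
);

CREATE TABLE Enrolled (
    sid INT,
    subject VARCHAR(20) NOT NULL,
    FOREIGN KEY (sid) REFERENCES Students(sid)
        ON DELETE CASCADE   -- or RESTRICT (disallow), or SET NULL
);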

Integrity Constraints
IC: condition that must be true for any instance of the database; eg: domain constraints.
- ICs are specified when schema is defined.
- ICs are checked when relations are modified.

A legal instance of a relation is one that satisfies all specified ICs


- DBMS should not allow illegal instances.

Handling multi-valued Attributes


Multi-valued attributes need to be unpacked (flattened) when converting to logical design.

Example: for employees we need to capture their home phone number and work phone number

Composite attributes need to be unpacked and expanded as well.
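As a sketch in the logical notation used below, the phone example might unpack either
into extra attributes (when the set is small and fixed) or into a new relation (names
assumed):

Employee (ssn, name, homePhone, workPhone)
    or
Employee (ssn, name)
EmployeePhone (ssn, phoneType, phoneNumber) - ssn is a FK; key is (ssn, phoneType)

Similarly, a composite address expands into postcode, streetName and streetNumber
attributes.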

ER -> Logical Design : Many to Many

- Conceptual Design

- Logical Design
In translating a many-to-many relationship set to a relation, attributes of a new relation
must include:
1. Keys for each participating entity set (as FK)
2. All descriptive attributes

Logical Design (*Works_in is a new associative entity)


Employee (ssn, name, age)
Department (did, dname, budget)
Works_In (ssn, did, since)

Logical -> Physical Design: Many to Many
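A possible physical-design translation of the associative entity in MySQL (data types
assumed for illustration):

CREATE TABLE Works_In (
    ssn CHAR(11) NOT NULL,
    did INT NOT NULL,
    since DATE,
    PRIMARY KEY (ssn, did),        -- the FKs of both participating entities form the PK
    FOREIGN KEY (ssn) REFERENCES Employee(ssn),
    FOREIGN KEY (did) REFERENCES Department(did)
);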


Modelling with MySQL Workbench

- Identifier or key: Fully identifies an instance

- Partial Identifier: Identifies an instance in conjunction with one or more other identifiers

- Attribute types:
1. Mandatory - NOT NULL (blue diamond)
2. Optional - NULL (empty diamond)
3. [DERIVED] - e.g. [YearsEmployed]
4. {Multivalued} - e.g. {Skill}
5. Composite - e.g. Name (First, Middle, Last)
- Derived Attributes (Chen/ Workbench)
Derived attributes imply that their values can be derived from some other attributes in a
database - Anything that can be calculated.
They do not need to be stored physically - they disappear at the physical design.
Conventions of ER Modelling (Workbench)
Cardinality:
- One to One:
Each entity will have exactly {0 or 1} related entities.

- One to Many:
One of the entities will have {0, 1 or more} related entities, the other will have {0 or 1}

- Many to Many:
Each of the entities will have {0, 1 or more} related entities

Cardinality Constraints
1. Optional Many: Partial participation without key constraint

The O and crow's foot mean minimum zero, maximum infinity

2. Mandatory Many: Total participation without key constraint

The | means minimum one and the crow's foot means maximum infinity

3. Optional One: Partial participation with key constraint

The O means minimum zero and the | means maximum one

4. Mandatory One: Total participation with key constraint

Minimum one and maximum one

Customer Demo: Single Entity (Conceptual Model, Workbench)

- Empty diamond means the attribute is not mandatory
- Filled diamond means the attribute is mandatory
- A customer can have no first and last name

Dealing with {Multi-Valued Attributes}

If staff have only 2-3 roles you may decide to keep these within the employee table at physical
design to save on "JOIN" time
E.g. ENUM('administration', 'logistics', 'transport')
Dealing with Weak Entities
Map the same way: the Foreign Key goes into the relation at the crow's foot end. The only
difference, as seen, is that the Foreign Key becomes part of the Primary Key.

Dealing with Many-to-Many


Depends on business case

V1.0 has just ‘one’ address for all.


Address demo - Dealing with Many-to-Many: Conceptual

V2.0 establishes a many-to-many Customer - Address.


We create an Associative Entity between the other two entities.

Each customer has one or many address-book entries, and (second relationship) each
address can belong to one or many address-book entries

Dealing with One-to-One (Binary Relationship)

One nurse is in charge of zero or one care centre.

- The optional side of the relationship gets the foreign key.

- Depends on business rules


Unary: One-to-One

- Conceptual

- Logical
Person (ID, Name, DateOfBirth, SpouseID)

- Implementation
CREATE TABLE Person (
ID INT NOT NULL,
Name VARCHAR(50) NOT NULL,
DateOfBirth DATE NOT NULL,
SpouseID INT,
PRIMARY KEY (ID),
FOREIGN KEY (SpouseID) REFERENCES Person(ID)
ON DELETE RESTRICT
ON UPDATE CASCADE);
Unary: One-to-Many
- Conceptual

- Logical
Employee(ID, Name, DateOfBirth, ManagerID)

- Implementation
CREATE TABLE Employee(
ID INT NOT NULL,
Name VARCHAR(50) NOT NULL,
DateOfBirth DATE NOT NULL,
ManagerID INT,
PRIMARY KEY (ID),
FOREIGN KEY (ManagerID) REFERENCES Employee(ID)
ON DELETE RESTRICT
ON UPDATE CASCADE);
Unary: Many-to-Many

- Conceptual

- Logical
Create Associative Entity like usual; generate logical model
Item (ID, Name, UnitCost)
Component (ID, ComponentID, Quantity)
- Implementation

CREATE TABLE Item (


ID INT NOT NULL,
Name VARCHAR(50) NOT NULL,
UnitCost DOUBLE NOT NULL,
PRIMARY KEY (ID)
);

CREATE TABLE Component (


ID INT NOT NULL,
ComponentID INT NOT NULL,
Quantity INT NOT NULL,
PRIMARY KEY (ID, ComponentID),
FOREIGN KEY (ID) REFERENCES Item(ID),
FOREIGN KEY (ComponentID) REFERENCES Item(ID)
);

What’s examinable?
- Need to be able to draw conceptual, logical and physical diagrams
- Assignment 1: Conceptual Chen’s pen and paper
- Assignment 1: Physical Crow’s foot with MySQL Workbench
- CREATE TABLE SQL statements
WEEK 4

Relational Algebra: 5 Basic Operations:


1. Selection (𝜎, sigma): Selects a subset of rows from a relation (horizontal filtering).
2. Projection (𝜋, pi): Retains only wanted columns from a relation (vertical filtering).
3. Union (U): Tuples in one relation and/or in the other (set theory).
4. Set-difference (-): Tuples in one relation, but not in the other.
5. Cross-product (x): Allows us to combine two relations.

Projection:
Retains only attributes that are in the projection list

Schema of result:
- Only the fields in the projection list, with the same names that they had in the input
relation

*The projection operator has to eliminate duplicates (in relational algebra)


'Real' SQL doesn't - you need to explicitly ask using the DISTINCT keyword.

- Example: if all rows have the same rating, the output of
𝜋rating(Movies) is a single row

Selection (𝜎):
Selects rows that satisfy the selection condition.

Result is a relation
Schema of the result is same as that of the input relation

No need to remove duplicate rows: they can't exist, because by definition a relation
has no duplicate rows.

Selection (𝜎) Conditions:


- Conditions are standard arithmetic expressions: >, <, >=, <=, =, !=
- Conditions are combined with AND/OR clauses:
- And: ⋀
- Or: V

Example: Find sailors whose rating is above 8 and who are younger than 50

Command to do it - 𝜎rating>8 ⋀ age<50 (S2)

Selection (𝜎) and Projection (𝜋):


Operations can be combined:
select rows that satisfy the selection condition and retain only certain attributes (columns)

Example: Find names and ratings of sailors whose rating is above 8:

𝜋sname, rating(𝜎rating>8(S2)) - brackets are evaluated first
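The equivalent query in SQL (relation and attribute names from the example; DISTINCT is
added because projection in relational algebra removes duplicates):

SELECT DISTINCT sname, rating
FROM S2
WHERE rating > 8;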

Union (U) and Set-Difference (-)


- Union: Combines both relations together
Eg : BTS {Jin, Suga, J-Hope, RM, Jimin, V, Jungkook} and Coldplay {Jonny, Chris, Guy,
Will, Phil}

Union: {Jin, Suga, J-Hope, RM, Jimin, V, Jungkook, Jonny, Chris, Guy, Will, Phil}

- Set Difference: Retains rows of one relation that do not appear in the other relation
Eg: Samsung phones {Fold 5, Flip 5, S23, A53}; Flip phones {Fold 5, Flip 5, Razr, Razr+}

Set-difference: Samsung phones which cannot flip {S23, A53}

These operations take two input relations, which must be union-compatible:


- Same number of fields
- Corresponding fields have the same type

Duplicates get removed when using union

Compound Operator: Intersection


- Intersection: Retains rows that appear in both relations; takes two input relations,
which must be union-compatible.

Cross Product

Cross product combines two relations:


- Each row of one input is merged with each row from another input
- Output is a new relation with all attributes of both inputs

X is used to denote cross-product


- Example: S1 x R1
- Each row of S1 paired with EACH row of R1 - note ALL possible combos, not just the
matching ones.
Joins
Joins are compound operators involving cross product, selection, and sometimes projections.
Most common type of join is a natural join (often just called join)
- R ⨝ S is conceptually a cross product that keeps rows where attributes that appear in
BOTH relations have EQUAL values (and we omit duplicate columns)

To obtain the natural join R ⨝ S a DBMS must:


1. Compute R x S
2. Select rows where attributes that appear in both relations have equal values.
3. Project all unique attributes and one copy of each of the common ones.

EXAMPLE: Find all sailors (from relation S1) who have reserved a boat

- Step 1: S1 x R1
- Step 2: Select rows where attributes that appear in both relations have equal values.

- Step 3: Project all unique attributes and one copy of each of the common ones

Other types of Joins

1. Condition Join (or theta-join) is a cross product with a condition c (sometimes denoted
by theta)
R⨝c S = 𝜎c(RxS)
e.g. S1⨝(S1.sid < R1.sid)R1

Result schema is the same as that of cross- product

2. Equi- Join is a special case of condition join, where condition c contains only equalities
Eg: S1.sid = R1.sid hence
S1⨝(S1.sid = R1.sid)R1

What's Examinable?
1. Relational Algebra operations: the 5 basics, intersection and joins
2. Design queries with Relational algebra operations
3. Apply Relational Algebra Operations on tables (relations)

What is SQL?
SQL or SEQUEL is a language used in relational databases
Supports CRUD - Create, Read, Update and Delete commands

Data Definition Language(DDL)


To define and set up the database
CREATE, ALTER, DROP

Data Control Language (DCL)


To control access to the database
GRANT, REVOKE

Data Manipulation Language (DML)


To maintain and use the database
SELECT, INSERT, DELETE, UPDATE

Other commands
- Administer the database (eg CHECK TABLE)
- Transactional Control (eg COMMIT)

INSERT COMMAND

1. INSERT INTO Customer


(CustomerFirstName, CustomerLastName, CustType)
VALUES (“Peter”, “Smith”, ‘Personal’);

2. INSERT INTO Customer


VALUES (DEFAULT, “James”, NULL, “Jones”, “JJ Enterprises”, ‘Company’);
No columns specified, which means a value for every column needs to be supplied, in
table order

3. INSERT INTO Customer


VALUES (DEFAULT, “ “, NULL, “Smythe”, “ “, ‘Company’);

If MySQL complains, use ‘single quotes’

SELECT COMMAND
SELECT * FROM TableName;
*(star): Allows us to obtain all columns from a table
= Give me all information you have about TableName

SELECT and Projection (𝛱) with conditions


𝛱CustLastName(𝜎CustLastName='Smith'(Customer))
SELECT column FROM tablename WHERE ... ;
SELECT CustLastName FROM Customer WHERE CustLastName = 'Smith';

𝜎a⋀b V c - a and b, or c

SELECT and LIKE


SELECT CustLastName FROM Customer WHERE CustLastName LIKE 'Sm%';
% : Represents zero, one, or multiple characters
_ : Represents a single character
Example
LIKE 'Ta%a%'
Will this match:
1. Taylor Swift - No, because there is no second 'a' after the initial 'Ta'
2. Taytay - Yes, because there is a second 'a'
3. #TaylorSwiftErasTour - No, because it starts with #

SELECT and aggregate Functions


Aggregate functions operate on the (sub)set of values in a column of a relation (table) and
return a single value.
- AVG() - Average Value
- COUNT() - Number of values
- MIN() - Minimum value
- MAX() - Maximum value
- SUM() - Sum of Values

SELECT SUM(sname) FROM S2


Won't be possible, as you can't SUM names (COUNT(sname) would return 4 instead)

SELECT and renaming columns


SELECT COUNT(CustomerID) AS Count FROM Customer WHERE … ;
COUNT - original name
Count - renamed
SELECT and GROUP BY
SELECT AGGREGATEFUNCTION(colname) FROM tablename [WHERE ...] GROUP BY ... ;

1. SELECT AVG(OutstandingBalance) FROM Account;

2. SELECT AVG(OutstandingBalance) FROM Account GROUP BY CustomerID;

3. GROUP BY with a condition on the groups:


List the number of customers of each business, but ONLY include businesses with more
than 5 customers.

SELECT COUNT(CustomerID), BusinessName FROM Customers
GROUP BY BusinessName
HAVING COUNT(CustomerID) > 5;

SELECT and ORDER BY


SELECT and LIMIT | OFFSET
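Sketches of typical MySQL forms of these two clauses (table and column names assumed
from earlier examples):

SELECT * FROM Customer ORDER BY CustLastName DESC;
SELECT * FROM Customer LIMIT 10 OFFSET 20;   -- rows 21-30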

INNER JOIN = Equi - Join


NATURAL JOIN

OUTER JOIN
- Joins the tables over keys
- Can be LEFT or RIGHT (they differ in which table's unmatched rows are kept)
- Includes records that don't have a match in the other table

1. SELECT * FROM Rel1 LEFT OUTER JOIN Rel2 ON ... ;


2. SELECT * FROM Rel1 RIGHT OUTER JOIN Rel2 ON ... ;

Things to remember
1. SQL keywords are not case sensitive
2. Table names are case sensitive (on some platforms)
3. Field names are not case sensitive
4. You can do maths in SQL
SELECT 1*1+1/1-1   -- returns 1
5. For SELECTs: no row order unless you ORDER BY

WEEK 5
Comparison and Logic
- Logic:
● AND
● OR
● NOT

- Comparison
● =
● <
● >
● <=
● >=
● <> OR != – Not equal to

String Functions
- UPPER () : Changes to uppercase
- LOWER() : Changes to lowercase
- LEFT() : Take the left X characters from a string
- RIGHT() : Take the X right characters from a string

Set Operators
- UNION:
Shows all rows returned from the queries (or tables)

- UNION ALL:
If you want duplicate rows shown in the results you need to use the ALL keyword
E.g. UNION ALL ...

- INTERSECT:
Shows only rows that are common to the queries

- INTERSECT ALL
Same as INTERSECT, but keeps duplicate rows

Sub-Query Nesting
- A nested query is simply another SELECT query you write to produce a table set
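A minimal sketch of a nested query (table and column names assumed from earlier
examples):

SELECT CustLastName
FROM Customer
WHERE CustomerID IN (SELECT CustomerID
                     FROM Account
                     WHERE OutstandingBalance > 1000);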
DDL: TRUNCATE/DROP
1. TRUNCATE
Same effect as DELETE FROM TableName; (removes all rows)
Faster, but you cannot ROLLBACK a TRUNCATE command

TRUNCATE TableName;

2. DROP
Kills a relation - removes the data, removes the relation. Entire table is gone

DROP TABLE TableName;

Storage and Indexing

Record:
A record refers to an individual row of a table and has a unique rid. The rid has the property that
we can identify the disk address of the page containing the record by using the rid. The rid
consists of the page ID and the offset within that page, for example, an rid of (3, 7) refers to the
seventh record from the beginning of the third page.

Page:
A page is an allocation of space on disk or in memory containing a collection of records.
Typically, every page is the same size.
File:
A file consists of a collection of pages containing records. In simple database scenarios, a file
corresponds to a single table

Index files and Indexes


- An index is a data structure built on top of data pages used for efficient search
- The index is built over specific fields called search key fields. The index speeds up
selections on the search key fields .
- Any subset of the fields of a relation can be the search key for an index on the relation
- Search key doesn’t have to be unique

Index Classification

1. Clustered
If the order of data records is the same as the order of index data entries, then the index
is called a clustered index.

- A data file can have a clustered index on at most one search key combination
(i.e. we cannot have multiple clustered indexes over a single table).
- Cost of retrieving data records through an index varies greatly based on whether
the index is clustered (cheaper for clustered).
- Clustered indexes are more expensive to maintain (require file reorganisation),
but are really efficient for range search
- Best for cases where the query has a condition that checks a range; less of a
win for pure equality conditions

2. Unclustered
If the data records are not in the same order as the index entries, it is called an unclustered index
● IMPORTANT: (Approximate) cost of retrieving records found in a range scan:
Clustered: cost ≈ # pages in data file with matching records
Unclustered: cost ≈ # of matching index data entries (data records)

Hash Index:
Hash indexes are best suited to support equality selections (queries where the WHERE clause
has an equality condition)

B-Tree Index:
A B-tree index is created by sorting the data on the search key and maintaining a hierarchical
search data structure (B+ tree) that will direct the search to the respective page of the data
entry. Insertion in such a structure is costly as the tree is updated with every insertion or
deletion. There will be situations in which a major part of the tree will be re-written if a particular
node is overfilled or under-filled. Ideally the tree will automatically maintain an appropriate
number of levels, as well as optimal space usage in blocks

Hash vs. tree indexes


Similar to clustered and unclustered indexes, if we want to perform many range queries,
creating a B-tree index for the relation is the best choice. On the other hand, if there are
equality queries such as driving licence numbers, then hash indexes are the best choice
as they allow faster retrieval than a B-tree in such cases. (B-trees will still increase the
speed of most equality queries, but hash indexes are even faster.)
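A sketch of how the choice looks in MySQL (index and table names assumed; note that
USING HASH is only honoured by certain storage engines such as MEMORY - InnoDB
indexes are B-trees):

CREATE INDEX idx_age ON Students (age) USING BTREE;          -- good for range queries
CREATE INDEX idx_licence ON Drivers (licenceNo) USING HASH;  -- equality lookups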

Heap Files:
- Simplest file structure
- Contains records in no specific order
- As a file grows and shrinks, disk pages are allocated and de-allocated.
- Heaps are fastest for inserts compared to other alternatives.
- Suitable when typical access is a file scan retrieving all records
- Cost = Number of Pages

Sorted Files:
- Similar structure to heap files
- Pages and records are ordered
- Best for retrieval in some order
- Cheapest search cost for a sorted file is log2(Pages) (binary search)

WEEK 6

Query processing overview


Some database operations are expensive
- A DBMS can greatly improve performance by being smart
- Can speed up 100000x over naive approach

Main weapons are:


1. Clever implementation techniques for operators
2. Exploiting ‘equivalencies’ of relational operators
3. Using cost models to choose among alternatives

Query Processing: Selections


Simple Selections: Estimate result size (reduction factor)
Size of result is approximated as:

ResultSize ≈ NTuples(R) * RF

Reduction factor is usually called selectivity. It estimates what portion of the relation will qualify
for the given predicate, i.e., satisfy the given condition.

This is estimated by the optimizer.

E.g. 30% of records qualify, or 5% of records qualify, etc.

Simple Selections, no index:


1. With no index, unsorted:
Cost = All pages (worst-case)
Must scan the whole relation i.e. performs Heap Scan

Cost = Npages(R)

Example : Reserves cost = NPages(R) = 1000 IO

2. With no index, but file is sorted:


Cost = binary search cost + number of pages containing results

Cost = log2(NPages(R)) + (RF*NPages(R))

Example, sorted, assume RF = 20%


Reserves cost = log2(1000 pages) + (RF*1000 pages)
= 10 IO + (0.2*1000)
= 220 IO

- Cost depends on the number of qualifying tuples.


Recall: Clustering is important when calculating the total cost.

- Steps to perform:
1. Find qualifying data entries:
● Go through the index: height is typically small (FYI: 2-4 I/O for a B+ tree, 1.2
I/O for a hash index, i.e., negligible if many records are retrieved)
● Once data entries are reached, go through them one by one and
look up the corresponding data records (in the data file)

2. Retrieve data records (in the data file)

- Formulas for Cost:


1. Clustered Cost = (Npages(I) + NPages(R)) * RF /*reduction factor*/
2. Unclustered Cost = (NPages(I) + NTuples(R)) * RF

Example
Lets say that 10% of Reserves tuples qualify: RF = 0.1

Let’s say that index occupies 50 pages:


NPages(I) = 50

Given:
NPages(R) = 1000
NTuplesPerPage(R) = 100
NTuples(R) = 1000* 100 = 100,000

Calculate the cost of:


1. Clustered Index = (NPages(I) + NPages(R)) * RF
2. Unclustered Index = (NPages(I) + NTuples(R)) * RF
3. Heap Scan, Unsorted? = NPages(R)

Answers:
1. Clustered Index = (50 + 1000) * 0.1 = 105
2. Unclustered Index = (50 + 100,000) * 0.1 = 10,005
3. Heap Scan, unsorted = 1000

General Selection Conditions


- Typically queries have multiple predicates (conditions)
Example: day < 8/9/94 AND rname = 'Paul' AND bid = 5 AND sid = 3

A B-tree index matches (a combination of) predicates that involve only attributes in
a prefix of the search key.

● Index on <a,b,c> matches predicates on: (a,b,c), (a,b) and (a)


● Index on <a,b,c> matches a = 5 AND b = 3, but will not be used to answer b = 3
● This implies that only reduction factors of predicates that are part of the prefix will
be used to determine the cost (they are called matching predicates, or primary
conjuncts)

Selection approach

1. Find the cheapest access path


The index or file scan with the least estimated page I/O

2. Retrieve tuples using it


Predicates that match this index reduce the number of tuples retrieved and impact the
cost.

3. Apply the predicates that don't match the index (if any) later on
These predicates are used to discard some retrieved tuples, but do not affect the number
of tuples/pages fetched (nor the total cost)

B+ tree index, hash index

Example : day <8/9/94 AND bid = 5 AND sid = 3

- A B+ tree index on day can be used;


● RF = RF(day)
● Then, bid = 5 and sid = 3 must be checked for each retrieved tuple on the fly
- Similarly, a hash index on <bid, sid> could be used:


● Then, day < 8/9/94 must be checked on the fly
● Horrible for range queries

Query Processing : Projections

The issue with projection is removing duplicates.

Example:

SELECT DISTINCT R.sid, R.bid FROM Reserves R

Projection can be done based on hashing or sorting.

Sorting - Basic approach


1. Scan R, extract only the needed attributes

2. Sort the result set (typically using external merge sort)


3. Remove adjacent duplicates
EXTERNAL MERGE SORT

- Rationale: an algorithm used for sorting; remember that if the data does not fit in memory, we
need several passes.

Projection : External Sort Operation Cost

1. Scan R, extract only the needed attributes


2. Sort the result set using EXTERNAL SORT
3. Remove adjacent duplicates

- PROJECTION FACTOR (PF)


Says how much we are projecting, as a ratio with respect to all attributes (e.g. keeping 25% of
attributes, or 10%, etc.)

Cost = ReadTable + WriteProjectedPages + SortingCost + ReadProjectedPages

ReadTable = NPages(R) Cost to read entire table, keep only projected attributes
WriteProjectedPages = NPages(R) * PF
SortingCost = 2*NumPasses*ReadProjectedPages
ReadProjectedPages = NPages(R) * PF
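A worked example under assumed numbers (NPages(R) = 1000, PF = 0.25, and an external
sort taking 2 passes):

ReadTable = 1000
WriteProjectedPages = 1000 * 0.25 = 250
SortingCost = 2 * 2 * 250 = 1000
ReadProjectedPages = 250

Cost = 1000 + 250 + 1000 + 250 = 2500 I/Os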

JOINS
Are very common and can be very expensive (time/ processing)
Cross product in the worst case

Many implementation techniques for join operations


1. Nested-loops Join
2. Sort-merge Join
3. Hash Join

Joins: Equality Join, One Column


SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid = S1.sid

- Join is associative and commutative


AxB == BxA
Ax(BxC) == (AxB)xC

Simple Nested Loop Join (SNJL)


Cost: Cost (SNLJ) = NPages(Outer) + NTuples(Outer)*NPages(Inner)

Blocked Nested Loops Join (BNLJ)


Page-oriented NL doesn't exploit extra memory buffers.

Cost (BNLJ) = NPages(Outer) + NBlocks(Outer) * NPages(Inner)

NBlocks(Outer) = ⌈NPages(Outer) / (B-2)⌉
B = # pages of space in memory, i.e., blocks
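A worked comparison using the Sailors/Reserves statistics from the sort-merge example
below, with Reserves (1000 pages, 100 tuples/page) as the outer relation and an assumed
B = 102 buffer pages:

Cost(SNLJ) = 1000 + 100,000 * 500 = 50,001,000 I/Os (NTuples(Reserves) = 100 * 1000)
Cost(BNLJ) = 1000 + ⌈1000/100⌉ * 500 = 1000 + 10 * 500 = 6000 I/Os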

Sort-Merge Join
- Sort R and S on the join column, then scan them to do a merge (on join column), and
output result tuples.

- Sorted R is scanned once; Each S group of the same key values is scanned once per
matching R tuple (typically means Sorted S is scanned once too).

- Useful when:
1. One or both inputs are already sorted on join attribute(s)
2. Output is required to be sorted on join attribute(s)

- Cost(SMJ) = Sort(Outer) + Sort(Inner) + NPages(Outer) + NPages(Inner)

Assuming the following


● Sailors (S): 80 tuples per page, 500 pages
● Reserves (R) : 100 tuples per page, 1000 pages

Answer: 4000 + 2000 + 1000 + 500 = 7500
(sorting each relation in two passes costs 4 * NPages: 4 * 1000 = 4000 and 4 * 500 = 2000)


Hash Join

- Partition both relations using hash function h: R tuples in a partition will only match S
tuples in the corresponding partition

- Read in a partition of R and hash it using a second function h2 (≠ h). Scan the matching
partition of S, probing the hash table for matches.

- Cost(HJ) = 2*NPages(Outer) + 2*NPages(Inner) + NPages(Outer) + NPages(Inner)


= 3 * (1000 + 500) = 4500

WEEK 7

Query Optimization I

Query plan is at the core

- Is a tree, with relational algebra operators as nodes and access paths as leaves
- Each operator labelled with a choice of an algorithm.

- From bottom to top

Query Optimization Steps


1. Query is first broken into "blocks"
2. Each block is converted to relational algebra
3. Then, for each block, several alternative query plans are considered
4. The plan with the lowest estimated cost is selected

● Step 1
- A query block is any statement starting with SELECT
- Query block = unit of optimization/execution
- Typically the innermost block is optimized first, then moving towards the outer ones

● Step 2 : Convert query block into relational algebra expression

● Step 3 : Relational Algebra Equivalences


- These equivalences allow us to ‘push’ selections and projections ahead of joins.

Example

- Selections: cascade and commute - applying them in any order gives the same result

- Projection:

- A projection commutes with a selection that only uses attributes retained by the
projection
Equivalences Involving Joins
- These equivalences allow us to choose different join orders

Mixing Join with Selection & Projections

● Converting Selection + cross-product to join

● Selection on just the attributes of S commutes with the join

● We can also “push down” projection
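As an illustration of pushing a selection ahead of a join (relation names from the earlier
sailors example, assumed here):

𝜋sname(𝜎bid=100(Reserves ⨝ Sailors)) ≡ 𝜋sname(𝜎bid=100(Reserves) ⨝ Sailors)

Filtering Reserves first means the join processes far fewer tuples.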


COST BASED QUERY OPTIMIZATION

Cost Estimation

- Must estimate cost of each operation in plan tree


● Depends on input cardinalities

- Must estimate size of result for each operation in tree


● Use information about input relations (from the system catalogs), and apply
certain well established rules

Statistics and Catalogues

- To decide on the cost, the optimizer needs information about the relations and indexes
involved. This information is stored in the system catalogs

- Catalogs typically contain at least: NTuples and NPages for each relation, plus NKeys,
NPages, height and high/low key values for each index

- Statistics in the catalogs are updated periodically

Result Size Estimation

- Consider a query block

- The maximum number of tuples in the result is the product of the cardinalities of the relations
in the FROM clause

- The reduction factor (RF) associated with each predicate reflects the impact of the predicate
in reducing the result size.
RF is also called selectivity.

Result size estimation calculations

- Single table selection:
ResultSize = NTuples(R) * RF(p1) * RF(p2) * ...

- Joins (over k tables):
ResultSize = NTuples(R1) * ... * NTuples(Rk) * RF(p1) * RF(p2) * ...

- If there are no selections (no predicates), reduction factors are simply ignored - equal to
1.

Calculation Reduction Factor (RF)

- Depends on the type of the predicate:


1. Col = value
RF = 1/NKeys(Col)

2. Col > value


RF = (High(Col) - value) / (High(Col) - Low(Col))

3. Col < value


RF = (value - Low(Col)) / (High(Col) - Low(Col))

4. Col_A = Col_B (for joins)


RF = 1/ (Max (Nkeys(Col_A), NKeys(Col_B)))
5. If no information about Nkeys or interval, use a “magic number” 1/10
RF = 1/10

EXAMPLE
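A worked illustration under assumed statistics: suppose rating has NKeys = 10 and age
ranges from Low = 20 to High = 60.

RF(rating = 5) = 1/NKeys = 1/10 = 0.1
RF(age > 50) = (60 - 50) / (60 - 20) = 0.25
RF(rating = 5 AND age > 50) = 0.1 * 0.25 = 0.025 (RFs of ANDed predicates multiply)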
Enumeration of Alternative Plans

● When enumerating alternative plans, there are two main cases:


- Single-Relation plans
- Multiple-relation plans (joins)

● For queries over a single relation:


- Each available access path (file scan / index) is considered, and the one with the lowest
estimated cost is chosen
- Heap scan is always one alternative
- Each index can be another alternative (if it matches the selection predicates)

- Other operations can be performed on top of access paths, but they typically do not incur
additional cost since they are done on the fly (projections, additional non-matching
predicates)

Cost Estimates for Single-Relation Plans

1. Sequential (heap) scan of data file:


Cost = NPages(R)

2. Index selection over a primary key (just a single tuple):


Cost(B+ tree) = Height(I) + 1, where Height(I) is the index height
Cost(HashIndex) = ProbeCost(I) + 1, where ProbeCost(I) ≈ 1.2

3. Clustered index matching one or more predicates:
Cost = (NPages(I) + NPages(R)) * RF

4. Non-clustered index matching one or more predicates:
Cost = (NPages(I) + NTuples(R)) * RF

EXAMPLE
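A small worked case under assumed numbers: a primary-key lookup through a B+ tree of
height 2 costs 2 + 1 = 3 I/Os; through a hash index it costs about 1.2 + 1 ≈ 2.2 I/Os.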
Plan Enumeration for multi-relation plans

Queries Over Multiple Relations


- As the number of joins increases, the number of alternative plans grows rapidly - need to
restrict the search space
- Fundamental decision in System R (the first relational DBMS): only left-deep join trees are considered
- Left-deep trees allow us to generate all fully pipelined plans

Plan Enumeration Example


1. Enumerate relation orderings:

Don't take Cartesian products - only natural joins

2. Enumerate join algorithm choices

3. Enumerate access method choices:


● CASE 1

SxR
Cost (SxR) = (NLJ) NPages(S) + NPages(S) * NPages(R) = 500 + 500*1000 = 500500

(SxR)xB
- NTuples(S) = 500 (NPages) * 80 (NTuplesPerPage) = 40,000
Result size (SxR) = 40,000 * 100,000 * 1/40,000 = 100,000 tuples
So, 100,000 / 100 tuples per page = 1000 pages

● CASE 2

● CASE 3 and CASE 4


Anomalies in Denormalized Data
- Consider the following table (relation):

Why is this not a good design?


1. Insertion Anomaly: A new course cannot be added until at least one student has enrolled (which
comes first student or course?)

2. Deletion Anomaly: If student 425 withdraws, we lose all record of course C400 and its fee

3. Update Anomaly: If the fee for course C200 changes, we have to change it in multiple records
(rows), else the data will be inconsistent.

Normalisation
- A technique used to remove undesired redundancy from databases.
- Break one large table into several smaller tables.
- A relation is normalised if all determinants are candidate keys.

Normalized relations and ER Diagram

More Formal Concept: Functional Dependency


- A functional dependency concerns values of attributes in a relation
- A set of attributes X determines another set of attributes Y (each value of X is associated with
one value of Y)
- If I know X, I also know Y
Functional dependencies: Definitions
- Determinants: (X,Y -> Z) in A(X,Y,Z,D)
The attributes on the left-hand side of the arrow

- Key and Non-Key attributes


Each attribute is either part of the primary key or it is not

- Partial functional dependency (Y->Z):


A PFD of one or more non-key attributes upon part (but not all) of the primary key

- Transitive Dependency (Z->D)


A functional dependency between 2 or more non-key attributes

Armstrong’s Axioms
Functional dependencies can be identified using Armstrong’s axioms
Let A = (X1, X2, … Xn) and B = (Y1, Y2…. Yn)

1. Reflexivity:
Example - Student_ID, name -> name

2. Augmentation:
Example - Student_ID -> name => Student_ID, surname -> name, surname

3. Transitivity:
Example - ID -> birthdate and birthdate -> age, then ID -> age

Steps in Normalisation
First Normal Form: Remove Repeating Groups
- Repeating groups of attributes cannot be represented in a flat, two dimensional table
- Removing cells with multiple values (keep atomic data)

- Break them into two tables and use Primary key or Foreign keys to connect them
Second Normal Form: Remove Partial Dependencies
- A non-key attribute cannot be identified by part of a composite key

Partial Dependency Anomalies


Solution

Third Normal Form: Remove Transitive Dependencies


- A non-key attribute cannot be identified by another non-key attribute

Example: Employee(Emp#, Ename, Dept#, Dname)
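Here Dname depends transitively on Emp# (Emp# -> Dept# and Dept# -> Dname). A sketch
of the 3NF decomposition in the logical notation used earlier:

Employee (Emp#, Ename, Dept#) - Dept# is now a FK
Department (Dept#, Dname)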


Boyce-Codd Normal Form and Anomalies
- Every determinant must be a candidate key

Solution
Normalisation vs Denormalization

1. Normalisation
- Normalised relations contain a minimum amount of redundancy and allow users to insert,
modify and delete rows in tables without errors or inconsistencies (anomalies).

2. Denormalization
- The pay off: query speed
- The price: extra work on updates to keep redundant data consistent
- Denormalization may be used to improve performance of time-critical operations.

What is a database transaction?


- Also called the atomic unit of work.
- A logical unit of work that must either be entirely completed or aborted .
- DML or SQL statements, if nothing else is stated, are commonly assumed to be atomic in most
DBMSs
- They are not the common “unit” of execution though
- DBMSs allow for user-defined units: transactions
- A successful transaction changes the database from one consistent state to another: all integrity
constraints satisfied

Transaction Properties (ACID)


● Atomicity
- A transaction is treated as a single, indivisible, logical unit of work. All operations in a
transaction must be completed: if not, the transaction is aborted and everything is
then undone
- Either it completes entirely or it is aborted
● Consistency
- Constraints that hold before a transaction must also hold after it
- Multiple users accessing the same data see the same value (at the same time)

● Isolation
- Changes made during execution of a transaction cannot be seen by other transactions
until this one is completed
- If multiple people are accessing the database, everyone will see the same data

● Durability
- When a transaction is complete, the changes made to the database are permanent, even
if the system fails

Why do we need transactions?


● Transactions solve mainly TWO problems:
1. Users' need for the ability to define a unit of work
2. Concurrent access to data by more than one user or program

Problem 1: Unit of work


- Single SQL, DML or even DDL command (implicit transaction)
- Changes are "all or none"
- Example: update 700 records, but the DBMS crashes after 200 records are processed;
when the server restarts, no changes have been made to any records

- Multiple Statements (user-defined transaction)
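A minimal MySQL sketch of a user-defined transaction (table and column names assumed
from the earlier examples):

START TRANSACTION;
UPDATE Account SET OutstandingBalance = OutstandingBalance - 100 WHERE AccountID = 1;
UPDATE Account SET OutstandingBalance = OutstandingBalance + 100 WHERE AccountID = 2;
COMMIT;   -- or ROLLBACK; on error - either both updates persist or neither does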

Business case for transaction as units of work


- Each transaction consists of several statements, embedded within a larger application program
- Transactions need to be treated as an indivisible unit of work
- "Indivisible" means that either the whole job gets done, or none of it gets done: if an error occurs, we
don't leave the database with the job half done, in an inconsistent state
- In the case of an error:
1. Any SQL statements already completed must be reversed
2. Show an error message to the user
3. When ready, the user can try the transaction again
4. This is briefly annoying - but inconsistent data is disastrous

Problem 2: Concurrent access


- What happens if we have multiple users accessing the database at the same time?
- E.g. concurrent execution of DML against a shared database
- Note that the sharing of data among multiple users is where much of the benefit of databases
comes from - users communicate and collaborate via shared data

- But what could possibly go wrong?


- Lost updates
- Uncommitted data
- Inconsistent retrievals

Serializability
- Transaction ideally should run in a schedule that is “serializable”
- Multiple, concurrent transactions appear as if they were executed one after another
- Ensures that the concurrent execution of several transactions yields consistent results

Concurrency Control Methods


- To achieve efficient execution of transactions, the DBMS creates a schedule of read and write
operation for concurrent transactions
- Interleaves the execution of operations, based on concurrency control algorithms such as locking
or time stamping

Concurrency Control with Locking


● Locks:
Guarantee exclusive use of a data item to a transaction
- T1 acquires a lock prior to data access; the lock is released when the transaction is
complete
- T2 does not have access to data item currently being used by T1
- T2 has to wait until T1 releases the lock

Required to prevent another transaction from reading inconsistent data

● Lock manager
Responsible for assigning and policing the locks used by the transactions

Lock Granularity Options


● Database-level lock
- Entire database is locked
- Good for batch processing but unsuitable for multi-user DBMSs
- T1 and T2 cannot access the same database concurrently even if they use different
tables

● Table-level lock
- Entire table is locked - as above but not quite as bad
- T1 and T2 can access the same database concurrently, as long as they use different tables
- Can cause bottlenecks, even if transactions want to access different parts of the table
and would not interfere with each other.
- Not suitable for highly multi-user DBMSs

● Page-level lock
- An entire disk page is locked
- Not commonly used now
● Row-level lock
- Allows concurrent transactions to access different rows of the same table, even if the
rows are located on the same page
- Improves data availability but with high overhead (each row has a lock that must be
read and written)
- Currently the most popular approach (MySQL, Oracle)

● Field-level lock
- Allows concurrent transactions to access the same row, as long as they access different
attributes within that row
- Most flexible lock but requires an extremely high level of overhead
- Not commonly used

Types of Locks

● Binary Locks
- Has only two states: locked or unlocked
- Eliminates “Lost update” problem
Lock is not released until the statement is completed
- Considered too restrictive to yield optimal concurrency, as it locks even for two READs
(when no update being done)

● The alternative is to allow both shared and exclusive locks


- Often called Read and Write locks

Shared and Exclusive locks

● Exclusive Lock
- Access is reserved for the transaction that locked the object
- Must be used when transaction intends to WRITE
- Granted if and only if no other locks are held on the data item

● Shared Lock
- Other transactions are also granted Read access
- Issued when a transaction wants to READ data, and no Exclusive lock is held on that
data item
Multiple transactions can each have a shared lock on the same data item if they are all
just reading it
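A sketch of how the two lock types are requested explicitly in MySQL (FOR SHARE is the
MySQL 8.0 syntax; older versions use LOCK IN SHARE MODE):

START TRANSACTION;
SELECT * FROM Account WHERE AccountID = 1 FOR SHARE;   -- shared (read) lock
SELECT * FROM Account WHERE AccountID = 1 FOR UPDATE;  -- exclusive (write) lock
COMMIT;   -- locks are released when the transaction completes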

Deadlock handling:
- Prevention: look at the schedule and try to stop deadlocks before they occur

- Detection: maintain a graph (or similar data structure) of who is locking what; when a
cycle appears we have detected a deadlock, so we kill one of the transactions

Alternative concurrency control methods


● Timestamp
- Assigns a globally unique timestamp to each transaction
- Each data item accessed by a transaction gets a timestamp
- Thus for every data item, the DBMS knows which transaction performed the last read or
write on it
- When a transaction wants to read or write, the DBMS compares its timestamp with the
timestamps already attached to the item and decides whether to allow access

● Optimistic
- Based on the assumption that the majority of database operations do not conflict
- The transaction is executed without restrictions or checking
- Then, when it is ready to commit, the DBMS checks whether any of the data it read has
been altered - if so, roll back

Logging transactions
- Allows us to restore the database to a previous consistent state
- If a transaction cannot be completed, it must be aborted and any changes rolled back
- To enable this, the DBMS tracks all updates to data

Transaction log
- Also provides the ability to restore a crashed database
- If a system failure occurs, the DBMS will examine the log for all uncommitted or incomplete
transactions and it will restore the database to a previous state
DATABASE ADMINISTRATION

CAPACITY PLANNING
- The process of predicting when future load levels will saturate the system, and determining the
most cost-effective way of delaying system saturation as much as possible.

- When implementing a database, need to consider:


1. Disk space requirements
2. Transaction throughput
3. At go-live and throughout the life of the system

ESTIMATING DATABASE USAGE


Capacity Planning in the dev life cycle
Estimating Disk Space Requirements

● Treat database size as the sum of all table sizes


- Table size = number of rows * average row width

Calculating row widths


- These sizes are for MySQL and are slightly different for other vendors

- For VARCHAR/BLOB we use the average size (from the catalogue)

Estimate growth of tables

Estimating transaction load


● Consider each business transaction
- How often will each transaction be run?
- For each transaction, what SQL statements are being run

BACKUP AND RECOVERY

What is a Backup
● A backup is a copy of your data
- However there are several types of backup

● If data becomes corrupted or deleted or held to ransom it can be restored from the backup copy

● A backup and recovery strategy is needed


- To plan how data is backed up
- To plan how it will be recovered

Protect data from different errors

● Human error: accidental drop or delete

● Hardware or software malfunction


- Bug in application
- Hard drive
- CPU
- Memory

Protect against
1. Malicious activity: security compromise
2. Natural or man-made disasters
3. Government regulations
- Historical archiving rules
- Metadata collection
- Privacy rules
Categories of Failures

1. Statement Failure: Syntactically incorrect


2. User process failure: The process doing the work fails (error occurs, dies)
3. Network Failure: Network failure between the user and the database
4. User Error: User accidentally drops the rows, tables, database
5. Memory Failure: Memory fails, become corrupt
6. Media Failure: Disk failure, corruption, deletion
Types of Backups
● Physical vs Logical
● Online vs Offline
● Full vs Incremental
● Onsite vs Offsite

Physical Backup
- Raw copies of files or directories
- Suitable for large databases that need fast recovery
- Database is preferably offline when backup occurs
- Backup = exact copies of the database directories and files
- Backup should include logs
- Backup is only portable to machines with a similar configuration
- To restore: copy the files and logs back into place and restart the server

Logical Backup
- Backup completed through SQL queries
- Slower than physical: SQL SELECTs rather than OS copy
- Output is larger than physical
- Doesn't include log or config files
- Machine independent
- Server is available during the backup
- In MySQL the backup can be made using:
- mysqldump
- SELECT ... INTO OUTFILE

- To restore: use mysqlimport, or LOAD DATA INFILE within the mysql client
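A typical mysqldump backup/restore pair (database name assumed):

mysqldump -u root -p mydb > mydb_backup.sql
mysql -u root -p mydb < mydb_backup.sql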

Online (or HOT) backup


- Backups occur while the database is "live"
- Clients don't realise a backup is in progress
- Need appropriate locking to ensure the integrity of the data

Offline (or COLD) backup


- Backups occur when the database is stopped
- To maximise availability to users, take the backup from a replication server, not the live server
- Simpler to perform
- Cold backup is preferable, but not available in all situations: e.g. applications without downtime

Full Backup
- A full backup is where the complete database is backed up: may be Physical or Logical, Online or
Offline
- It includes everything you need to get the database operational in the event of a failure

Incremental Backup
- Only the changes since the last backup are backed up
- For most databases this means only backup log files
- To restore
- Stop the database, copy backed up log files to disk
- Start the database and tell it to redo the log files

Offsite Backup
- Enables disaster recovery (because backup is not physically near the disaster site)
- Example solutions:
- Backup tapes transported to underground vault
- Remote mirror database maintained via replication
- Backup to cloud

WEEK 10

Introduction of Data Warehousing

Relational Databases for Operational Processing


- Used to run day to day business operations
- Automation of routine business processes
1. Accounting
2. Inventory
3. Purchasing
4. Sales
- Created huge efficiencies

Due to multiple database providers

- This recreated problems in databases


1. Duplicated data
2. Inaccessible data
3. Inconsistent data

What can be done?


- Need an integrated way of getting at the ENTIRE organisation's data
- This is really an informational database, rather than a transactional database
- A single database that allows all of the organisation's data to be stored in a form that can
be used to support organisational decision processes

Warehouse: An informational Database

● Data Warehouse:
- A single repository of organisational data
- Integrates data from multiple sources
- Extracts data from source systems, transforms, loads into the warehouse
- Makes data available to managers/users
- Supports analysis and decision-making

● Involve a large data store (often several Terabytes, Petabytes of data)

Difference between Transactional and Informational Systems


Data Warehouse Supports Analytical Queries

● One is interested in numerical aggregations


- How many?
- What is the average?
- What is the total cost?

● One is interested in understanding dimensions


- Sales by state by customer type
- Sales by product by store by quarter

Characteristics of a Data Warehouse

● Subject oriented
- Data warehouses are organised around particular subjects (sales, customers, products)

● Validated, integrated data


- Data from different systems converted to a common format: allows comparison and
consolidation of data from different sources
- Data from various sources validated before storing it in a data warehouse

● Time variant
- Historical data
- Trend analysis crucial for decision support: requires historical data
- Data consists of a series of “snapshots” which are time stamped

● Non-volatile
- Users have read access only- all updating done automatically by ETL process and
periodically by a DBA

A Data Warehouse Architecture

Business Analyst World


- Identifying facts and dimensions from the requirements the client provides in order to make a
database

Dimensional Modelling

● A dimensional model consists of:


- Fact table
- Several dimensional tables
- (Sometimes) hierarchies in the dimensions

● Essentially a simple and restricted type of ER model

Fact Table

● A fact table contains the actual business measures (additive, aggregates), called facts
● The fact table also contains foreign keys pointing to dimensions
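A minimal star-schema sketch in SQL showing a fact table with its dimension FKs (all
table and column names assumed for illustration):

CREATE TABLE DimStore (StoreKey INT PRIMARY KEY, State VARCHAR(30));
CREATE TABLE DimProduct (ProductKey INT PRIMARY KEY, Category VARCHAR(30));
CREATE TABLE FactSales (
    StoreKey INT,
    ProductKey INT,
    SalesAmount DECIMAL(10,2),   -- additive measure (the fact)
    FOREIGN KEY (StoreKey) REFERENCES DimStore(StoreKey),
    FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey)
);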
Star Schema - dimensional model
Dimension Hierarchies

Dimension Table - Example


- Captures a factor by which a fact can be described or classified
- Actual data might look like this
- Hierarchy evident in data

Dimensional model as an ER model

Designing a Dimensional Model


1. Choose a Business Process
2. Choose the measured facts (usually numeric, additive quantities)
3. Choose the granularity of the fact table
4. Choose the dimensions
5. Complete the dimension tables

Distributed database
- A single logical database physically spread across multiple computers in multiple locations that
are connected by a data communications link
- Appears to users as though it is one database
Decentralised Database
- A collection of independent databases that are not networked together as one logical database
- Appears to users as many separate databases

We are connected by Distributed Databases

Advantages of Distributed DBMS


- Good fit for geographically distributed organisations/ users: Utilise the internet
- Data located near site with greatest demand
Eg: ESPN weekend sports scores
Horizontal scaling is needed here: fans of a sport that is popular in another country (e.g. the EPL)
would otherwise have to fetch scores from that country's site, so the data is placed near the sites
with the greatest demand

- Faster data access (to local data)


- Faster data processing: Workload split amongst physical servers

Advantages of distributed DBMS

● Allows modular growth


- Add new servers as load increases (horizontal scalability)

● Increased reliability and availability


- Less danger of a single-point of failure (SPOF), IF data is replicated
● Supports database recovery
- When data is replicated across multiple sites

Disadvantages of distributed DBMS

● Complexity of management and control


- The database and/or application must stitch together data across sites
- Which site holds the current version of the record (row and column)?
- How does the logic display this to the web and application servers?
- Who is waiting to update the information, and where are they?

● Data Integrity
- Additional exposure to improper updating
- If two users in two locations update the same record at the exact same time, who decides
which update should “win”?
- Solution: Transaction Manager or Master-slave design

● Security
- Many server sites -> higher chance of breach
- Multiple access sites require protection, including network security and strong physical
infrastructure, against both cyber and physical attacks

● Lack of standards
- Different Relational DDBMS vendors use different protocols

● Increased training and maintenance costs


- A more complex IT infrastructure
- Increased Disk Storage
- Fast intra- and inter-site network infrastructure
- Clustering software
- Network speed
For example, a stock exchange tolerates no network delay, so a direct line may be
needed to minimise latency

● Increased storage requirements


- In a full replication model, copies of the data are stored everywhere, which increases disk requirements
Objectives of distributed DBMS

● Location transparency
- A user does not need to know where particular data are stored

● Local autonomy
- A node can continue to function for local users if connectivity to the network is lost

Location Transparency
- A user (or program) accessing data does not need to know the location of the data in the network of
DBMSs
- Requests to retrieve or update data from any site are automatically forwarded by the system to
the site or sites related to the processing request
- A single query can join data from tables in multiple sites
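
As one hedged illustration, PostgreSQL's foreign data wrapper can make a remote table queryable as
if it were local, so a single query joins data from two sites (the server, host, and table names
are all invented for the sketch):

    -- Make a table at a remote site look local (PostgreSQL-specific)
    CREATE EXTENSION postgres_fdw;

    CREATE SERVER sydney_site
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'sydney.example.com', dbname 'sales');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER sydney_site
        OPTIONS (user 'report_user', password 'secret');

    CREATE FOREIGN TABLE sydney_orders (
        order_id    INT,
        customer_id INT,
        amount      DECIMAL(10,2)
    ) SERVER sydney_site OPTIONS (schema_name 'public', table_name 'orders');

    -- One query joining local and remote data; the user never sees the location
    SELECT c.name, SUM(o.amount)
    FROM   customers c                -- assumed local table
    JOIN   sydney_orders o ON o.customer_id = c.customer_id
    GROUP BY c.name;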

Local Autonomy
- Being able to operate locally when connections to other databases fail
- Users can administer their local database
● Control local data
● Administer security
● Log transactions
● Recover when local failures occur
● Provide full access to local data

Functions of a distributed DBMS


- Locate data with a distributed catalogue (metadata)
- Determine the location from which to retrieve data and process query components
- DBMS translation between nodes with different local DBMSs
- Data consistency
- Global primary key control
- Scalability
- Security, concurrency, query optimisation, failure recovery

Distribution Options
● When distributing data around the world- the data can be partitioned or replicated
● Data replication is a process of duplicating data to different nodes
● Data partitioning is a process of partitioning data into subsets that are shipped to different nodes
● Many real-life systems use a combination of the two (partition the data and keep some replicas around)
Data Replication - Advantages
- High reliability due to redundant copies of data
- Fast access to data at the location where it is most accessed
- May avoid complicated distributed integrity routines: replicated data is refreshed at scheduled
intervals
- Decoupled nodes don't affect data availability: If some nodes are down it doesn't affect the other
nodes where data is stored
- Reduced network traffic at prime time: If updates can be delayed
- This is currently popular as a way of achieving high availability for global systems: most SQL and
NoSQL databases offer replication (a setup sketch follows below)
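
For instance, PostgreSQL's built-in logical replication is one way of doing this (all names and the
connection string are invented): the primary node publishes changes and each replica subscribes:

    -- On the primary node: publish changes to the chosen table(s)
    CREATE PUBLICATION scores_pub FOR TABLE match_scores;

    -- On each replica node: subscribe, keeping a continuously-updated local copy
    CREATE SUBSCRIPTION scores_sub
        CONNECTION 'host=primary.example.com dbname=sports user=repl password=secret'
        PUBLICATION scores_pub;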

Data Replication - Disadvantages


- Need more storage: Each server stores a copy
- Data integrity: updates take time to propagate
● High tolerance for out-of-date data may be required
● Updates may cause performance problems for busy nodes
● Reads may retrieve incorrect data if updates have not yet arrived

- Network communication capabilities


● Updates can place heavy demand on telecommunications/ networks
● High speed networks are expensive

DATA PARTITIONING
- Split data into chunks, store chunks in different nodes
- A chunk can be a set of rows or columns
- Thus, two types of partitioning: horizontal and vertical
Horizontal partitioning
- Table rows distributed across nodes
- Different rows of table at different sites
- Advantages
● Data stored to where it is used : efficiency
● Local access optimisation: better performance
● Only relevant data is stored locally: security
● Unions across partitions: ease of query (combining rows)

- Disadvantages
● Accessing data across partitions: inconsistent access speed
● No data replication : backup vulnerability
- e.g. rows for EPL matches would be stored on the European node, since that is where most of the
viewing happens (see the sketch below)
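
A minimal sketch using PostgreSQL's declarative list partitioning (names invented; on a single
server the partitions are just local tables, and spreading them across physical nodes needs extra
machinery, but the idea is the same):

    -- Rows are split by region; each partition can live where it is most used
    CREATE TABLE match_views (
        view_id  BIGINT,
        region   TEXT,
        match_id INT,
        viewed   TIMESTAMP
    ) PARTITION BY LIST (region);

    CREATE TABLE match_views_eu PARTITION OF match_views
        FOR VALUES IN ('EU');   -- EPL traffic lands here
    CREATE TABLE match_views_us PARTITION OF match_views
        FOR VALUES IN ('US');

    -- A query across regions is, in effect, a union of the partitions
    SELECT region, COUNT(*) FROM match_views GROUP BY region;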

Vertical Partitioning
- Different columns of a table at different sites
- Advantages and disadvantages are the same except
● Combining data across partitions is more difficult because it requires joins (instead of
unions)
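
A vertical-partitioning sketch (invented names): the same rows are split column-wise into two
tables that share the primary key, so recombining them needs a join rather than a union:

    CREATE TABLE customer_core (        -- e.g. held at head office
        customer_id INT PRIMARY KEY,
        name        VARCHAR(50),
        email       VARCHAR(100)
    );

    CREATE TABLE customer_billing (     -- e.g. held at the billing site
        customer_id INT PRIMARY KEY,
        card_token  VARCHAR(50),
        balance     DECIMAL(10,2)
    );

    -- Recombining the full record requires a join on the shared key
    SELECT c.name, b.balance
    FROM   customer_core c
    JOIN   customer_billing b ON b.customer_id = c.customer_id;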
Trade-offs when dealing with DDBMS

- Trade-offs
● Availability vs Consistency
The CAP theorem says we need to decide whether to make data always available OR
always consistent

● Synchronous vs Asynchronous updates


Are changes immediately visible everywhere (great, but expensive) or propagated later (less
expensive, but users may see stale data)?

CAP THEOREM
● Consistency, Availability, and Partition tolerance: a distributed database cannot guarantee all
three at once
● Since network partitions must be tolerated in practice, the real choice is between consistency
and availability
The dominance of the relational model
● Pros of relational databases
- Simple, can capture any business use case
- Can integrate multiple applications via shared data store
- Standard interface language SQL
- Ad-hoc queries, across and within “data aggregates”
- Fast, reliable, concurrent, consistent

● Cons of relational databases


- Object Relational (OR) impedance mismatch
- Not good with big data
- Not good with clustered/replicated servers

● Adoption of NoSQL driven by “cons” of Relational


● But ‘polyglot persistence’ = Relational will not go away

Big data and its 3Vs


● Data that exists in very large volumes and many different varieties (data types) and that needs to be
processed at a very high velocity (speed)
- Volume: a much larger quantity of data than is typical for relational databases
- Variety: lots of different data types and formats
- Velocity: data arrives at a very fast rate (e.g. mobile sensors, web click streams)
Big Data Characteristics

● Schema on Read, rather than Schema on Write


- Schema on Write: a pre-existing data model; how traditional (relational) databases are
designed
- Schema on Read: the data model is determined later, depending on how you want to use
the data (e.g. XML, JSON)
- Capture and store the data now, and don't worry about how you will use it until later
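
Schema on read can be sketched even in SQL using a JSON column (PostgreSQL jsonb syntax; all names
invented): store the raw event as-is, and impose structure only when querying:

    CREATE TABLE raw_events (
        event_id SERIAL PRIMARY KEY,
        payload  JSONB              -- no fixed schema at write time
    );

    -- Capture everything now, without deciding how it will be used
    INSERT INTO raw_events (payload)
    VALUES ('{"type": "click", "user": "u42", "page": "/home"}');

    -- Impose a schema only at read time, per use case
    SELECT payload->>'user' AS user_id,
           payload->>'page' AS page
    FROM   raw_events
    WHERE  payload->>'type' = 'click';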

● Data Lake
- A large integrated repository for internal and external data that does not follow a
predefined schema
- Capture everything, dive in anywhere, flexible access

Schema on write vs schema on read

NoSQL database properties

● Features
- Does not use relational model or SQL language
- Runs well on distributed servers
- Most are open-source
- Built for the modern web
- Schema-less
- Supports schema on read
- Not ACID compliant
- Eventually consistent

● Goals
- To improve programmer productivity (OR mismatch)
- To handle large data volumes and throughput (big data)

Types of NoSQL databases

Types of NoSQL - key-value stores


- Key = primary key
- Value = anything (number, array, image, JSON) - the application is in charge of interpreting what
it means
- Operations: Put (for storing), Get and Update
- Examples: Riak, Redis, Memcached, BerkeleyDB, HamsterDB, Project Voldemort, Couchbase

Types of NoSQL: document databases

- Similar to a key-value store, except that the document is “examinable” by the database, so its
content can be queried and parts of it updated

- Document = JSON file

- Examples: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB

Types of NoSQL: column families


- Columns rather than rows are stored together on disk
- Makes analysis faster, as less data is fetched
- This is like automatic vertical partitioning
- Related columns grouped together into ‘families’
- Examples: Cassandra, Bigtable, HBase

Aggregate-oriented databases
● Pros:
- An entire aggregate of data is stored together (no need for multi-record transactions)
- Efficient storage on clusters/ distributed databases

● Cons
- Hard to analyse across subfields of aggregates
- E.g. sum over products instead of orders

Types of NoSQL: Graph Databases


- A category by itself
- Used by big companies
- A ‘graph’ is a node-and-arc network
- Good for representing social networks, road networks, and similar structures
- Graphs are difficult to program in a relational database (see the sketch below)
- A graph DB stores entities and their relationships
- Graph queries deduce knowledge from the graph
- Examples: Neo4J, InfiniteGraph, OrientDB, FlockDB, TAO
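
To see why graphs are awkward relationally, even "who is reachable within three hops?" needs a
recursive query in standard SQL (invented table names); a graph database expresses this far more
directly:

    -- Edges of a social graph stored relationally
    CREATE TABLE follows (
        follower_id INT,
        followee_id INT
    );

    -- Everyone user 1 can reach within 3 hops, via a recursive query
    WITH RECURSIVE reachable (user_id, depth) AS (
        SELECT followee_id, 1 FROM follows WHERE follower_id = 1
        UNION
        SELECT f.followee_id, r.depth + 1
        FROM   follows f
        JOIN   reachable r ON f.follower_id = r.user_id
        WHERE  r.depth < 3
    )
    SELECT DISTINCT user_id FROM reachable;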

ACID vs BASE

ACID (Atomic, Consistent, Isolated, Durable)


Vs
BASE (Basically Available, Soft State, Eventual Consistency)
● Basically Available: The constraint states that the system does guarantee the availability of the
data; there will be a response to any request. But data may be in an inconsistent or changing
state.

● Soft State: the state of the system can change over time; even during periods without input,
there may be changes going on due to ‘eventual consistency’.

● Eventual Consistency: the system will eventually become consistent once it stops receiving
input. The data will propagate everywhere it needs to go, sooner or later, but the system
continues to receive input and does not check the consistency of every transaction before
moving on to the next

Eventual Consistency Variations


● Causal Consistency
- Processes that have causal relationships will see consistent data
- Processes never see data that contradicts what they have already observed, so
communication makes sense
- It guarantees that if one process causally depends on the result of another process's
update, it will observe that update in the correct order

● Read-your-writes consistency


- A process always sees the most recent data after its own update operations and never
sees an older value

● Session Consistency
- As long as the session exists, the system guarantees read-your-writes consistency

● Monotonic read consistency


- If a process has seen a particular value of a data item, any subsequent reads will never
return older values

● Monotonic write consistency


- The system guarantees to serialise the writes by the same process

● In Practice:
- A number of these properties can be combined
- Monotonic reads and read-your-writes are most desirable
