Database Systems
WEEK 1
- Database :
A large, integrated, structured collection of data, usually intended to model some real-world
enterprise
Example: A social media site
1. Entities e.g posts, users, photos
2. Relationships
- Database advantages:
1. Data independence:
Separation of data and program, application logic, central management
- Database design:
1. Conceptual Design:
Construction of a model of the data used in the database, independent of all
physical considerations (irrespective of the DBMS, eg MySQL, Oracle)
Results in ER diagrams.
Example: investment banking - investment bank has a number of branches,
within each branch a number of departments operate and are structured in a
hierarchical manner. The bank employs staff who are assigned to work
- Need a database to record staff details including which department and
branch they are assigned to
2. Logical Design:
Construction of a relational model of data based on the conceptual design - data
organised in relations.
Involves arranging data into a series of logical relationships called entities and
attributes.
Independent of any particular database management system like SQL
3. Physical Design:
A description of the implementation of the logical design for a specific DBMS (eg MySQL).
Defines data types and file organisation
Describes:
- Basic relations(data types)
- File organisation
- Indexes
WEEK 2
- Entity Set: A collection of entities that have the same properties. All entities in an
entity set have the same set of attributes; each entity has a key
2. ER Model: Relationship:
- Relationship: Association among two or more entities. Relationships can have
their own attributes
Eg: Student enrols in INFO20003, student gets assessed for assignments
Fred works in the Pharmacy Department
- Constraints
1. Key Constraints: Types
Key constraints determine the number of objects taking part in the relationship
set (how many from each side)
Eg: One-to-One - A car can be owned by one person only, one person one vote
One-to-Many - One person can own many cars
Many-to-Many - An employee can work in many departments, and a department can
have many employees
- Participation Constraints
Explores whether all entities of one entity set take part in a relationship
If yes: this is a total participation: each entity takes part in at least one relationship
Eg: each department is managed by an employee, but not every employee needs to
manage a department
EXAMPLE:
1. Every employee must work in a department. Each department has at least one
employee - "must" makes participation mandatory (total) for Employee, and "at
least one" makes it total for Department too
2. Each department must have a manager (but not every employee is a manager) -
total participation for Department, partial for Employee
On Chen’s Notation:
- Weak entity set must have total participation in this
relationship set. Such relationship is called identifying and is
represented as “bold”
- Weak entities have only a “partial key”(dashes underline) and
they are identified uniquely only when considering the primary key
of the owner entity.
- EXAMPLE
1. Strong - “has a key which may be defined without reference to
other entities” eg: Character - CharID = C001, CharName =
“Tracer”
- Ternary Relation
EXERCISE
- Entities: [Subjects], [Professors]
- Each subject has ID (eg INFO20003), title, time
- Make up suitable attributes for [Professors]
Basics
0. Any number of professors teach any number of subjects (many-to-many)
1. Every professor teaches exactly one subject (no more, no less). A subject can be taught
by multiple professors (eg INFO20003), but in some cases there might be no teacher (eg a
virtual class)
CONCEPTUAL DESIGN
- Design Choices
- Should a concept be modelled as an entity or an attribute?
- Should a concept be modelled as an entity or a relationship ?
- Should we model relationships as binary, ternary, n-ary?
- Entity VS Attribute
- Example: Should “address” be an attribute of employees or an entity
Consider:
Depends upon how we want to use address information, and the semantics of
the data:
1. If we have several addresses per employee, address must be an entity.
2. What if an address links to both employee and say
WorkFromHomeContract?
What's examinable?
1. Draw conceptual diagrams yourself
2. Given a problem: determine entities/attributes/relationships
3. What is a key constraint and participation constraint, weak entity
4. Determine constraints for the given entities and their relationships
- Relational Model:
1. Rows and Columns ( Tuples/records and Attributes/fields)
2. Key and Foreign Key to link relations
Definitions
- Relational Database: a set of relations.
- Relation: Made up of 2 parts
1. Schema:
Specifies name of relation, plus name and type of each column (attribute).
Example: Students(sid: string, name: string, login: string, age: integer, gpa: real)
Analogy: function definition in programming…
1. Conceptual :
Chen Diagram
2. Logical:
Employee (ssn, name, age) - pseudo code
2. Instance:
The actual data in the relation - a set of tuples (rows) conforming to the schema
- Creating Relations in SQL
CREATE TABLE RelationName (attr1 CHAR(20), attr2 FLOAT, attr3 INTEGER, ...)
Eg: Students (sid: string, name: string, login: string, age: integer, gpa: real)
CREATE TABLE Students (sid VARCHAR(20), name VARCHAR(20), login VARCHAR(20), age INTEGER, gpa FLOAT)
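The mapping from the logical schema to a CREATE TABLE statement can be tried directly; a minimal sketch using Python's sqlite3 module (the sample row values are made up, and SQLite treats VARCHAR/FLOAT as type affinities):

```python
import sqlite3

# In-memory database so the example is fully self-contained
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        sid   VARCHAR(20),
        name  VARCHAR(20),
        login VARCHAR(20),
        age   INTEGER,
        gpa   FLOAT
    )
""")
conn.execute("INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4)")
rows = conn.execute("SELECT name, gpa FROM Students").fetchall()
print(rows)  # [('Jones', 3.4)]
```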
- KEY:
Keys are a way to associate tuples in different relations.
Keys are one form of integrity constraint (IC). Eg: if a student dropped out of university,
it wouldn't make sense for that student to still be enrolled in a unit; likewise, an
employee recorded as enrolled in a unit (rather than teaching it) would be wrong
- Primary Keys(PK):
- A set of fields is a superkey if no two distinct tuples can have the same values in
all key fields.
Eg: Two people named John - name alone won't be a superkey; but if John 1 likes
Kendrick and John 2 likes Cole, (name, favourite artist) together is a superkey since
the value combinations are distinct
- A set of fields is a key for a relation if it is a superkey and no subset of the fields
is a superkey (minimal subset).
- Out of all keys one is chosen to be the primary key of the relation. Other keys
are called candidate keys.
- DELETE: What should be done if a Student tuple is deleted (eg Student changed unis)?
Options?
- Delete all Enrolled tuples that refer to it? (cascading delete to all their
enrolments)
- Disallow deletion of a Student tuple that is referred to?
- Set sid in Enrolled tuples that refer to it to a default sid?
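These options correspond to ON DELETE actions on the foreign key. A small sketch of the cascading-delete option, using Python's sqlite3 (table and column names follow the Students/Enrolled example; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE Enrolled (
        sid     INTEGER REFERENCES Students(sid) ON DELETE CASCADE,
        subject TEXT
    )
""")
conn.execute("INSERT INTO Students VALUES (1, 'Ann')")
conn.execute("INSERT INTO Enrolled VALUES (1, 'INFO20003')")
conn.execute("DELETE FROM Students WHERE sid = 1")  # cascades to Enrolled
remaining = conn.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]
print(remaining)  # 0
```

Swapping CASCADE for RESTRICT would instead make the DELETE fail while enrolments exist (the second option above); SET DEFAULT covers the third.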
Integrity Constraints
IC: condition that must be true for any instance of the database; eg: domain constraints.
- ICs are specified when schema is defined.
- ICs are checked when relations are modified.
Example: for employees we need to capture their home phone number and work phone number
- Conceptual Design
- Logical Design
In translating a many-to-many relationship set to a relation, attributes of a new relation
must include:
1. Keys for each participating entity set (as FK)
2. All descriptive attributes
- Partial Identifier: Identifies an instance only in conjunction with the identifier(s) of one or more owner entities
- Attributes types:
1. Mandatory - NOT NULL (blue diamond)
2. Optional - NULL (empty diamond)
3. [DERIVED] - e.g [YearsEmployed]
4. {Multivalued} - e.g {Skill}
5. Composite - eg Name (First, Middle, Last)
- Derived Attributes (Chen/ Workbench)
Derived attributes imply that their values can be derived from some other attributes in a
database - Anything that can be calculated.
They do not need to be stored physically - they disappear at the physical design.
Conventions of ER Modelling (Workbench)
Cardinality:
- One to One:
Each entity will have exactly{0 or 1} related entities.
- One to Many:
One of the entities will have {0, 1 or more} related entities, the other will have {0 or 1}
- Many to Many:
Each of the entities will have {0, 1, or more related entities}
Cardinality Constraints
1. Optional Many: Partial participation without key constraint
If staff have only 2-3 roles you may decide to have these within the employee table at physical
design to save on "JOIN" time
Eg ENUM('administration', 'logistics', 'transport')
Dealing with Weak Entities
Map in the same way: the Foreign Key goes into the relation at the crow's foot end. The only
difference, as seen, is that the Foreign Key becomes part of the Primary Key.
One and only one customer has one or many addresses, and (second relation)
each address can belong to one or many address book items
- The rule is: the Optional side of the relationship gets the foreign key.
- Conceptual
- Logical
Person (ID, Name, DateOfBirth, SpouseID)
- Implementation
CREATE TABLE Person (
ID INT NOT NULL,
Name VARCHAR(50) NOT NULL,
DateOfBirth DATE NOT NULL,
SpouseID INT,
PRIMARY KEY (ID),
FOREIGN KEY (SpouseID) REFERENCES Person (ID)
ON DELETE RESTRICT
ON UPDATE CASCADE);
Unary: One-to-Many
- Conceptual
- Logical
Employee(ID, Name, DateOfBirth, ManagerID)
- Implementation
CREATE TABLE Employee (
ID INT NOT NULL,
Name VARCHAR(50) NOT NULL,
DateOfBirth DATE NOT NULL,
ManagerID INT,
PRIMARY KEY (ID),
FOREIGN KEY (ManagerID) REFERENCES Employee (ID)
ON DELETE RESTRICT
ON UPDATE CASCADE);
Unary: Many-to-Many
- Conceptual
- Logical
Create Associative Entity like usual; generate logical model
Item (ID, Name, UnitCost)
Component (ID, ComponentID, Quantity)
- Implementation
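A possible implementation, assuming both ID and ComponentID in Component are foreign keys back to Item (sketched via Python's sqlite3 so it runs as-is; the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Item (
        ID       INT NOT NULL,
        Name     VARCHAR(50) NOT NULL,
        UnitCost DECIMAL(10,2),
        PRIMARY KEY (ID)
    )
""")
conn.execute("""
    CREATE TABLE Component (
        ID          INT NOT NULL,   -- the assembled item
        ComponentID INT NOT NULL,   -- the item used as a part
        Quantity    INT,
        PRIMARY KEY (ID, ComponentID),
        FOREIGN KEY (ID)          REFERENCES Item (ID),
        FOREIGN KEY (ComponentID) REFERENCES Item (ID)
    )
""")
conn.execute("INSERT INTO Item VALUES (1, 'Bike', 200.00), (2, 'Wheel', 40.00)")
conn.execute("INSERT INTO Component VALUES (1, 2, 2)")  # a bike uses 2 wheels
parts = conn.execute(
    "SELECT ComponentID, Quantity FROM Component WHERE ID = 1").fetchall()
print(parts)  # [(2, 2)]
```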
What’s examinable?
- Need to be able to draw conceptual, logical and physical diagrams
- Assignment 1: Conceptual Chen’s pen and paper
- Assignment 1: Physical Crow’s foot with MySQL Workbench
- CREATE TABLE SQL statements
WEEK 4
Projection:
Retains only attributes that are in the projection list
Schema of result:
- Only the fields in the projection list, with the same names that they had in the input
relation
- Example: if all rows have the same rating, the output of
𝜋rating(movies) is a single row, since duplicates are eliminated
Selection (𝜎):
Selects rows that satisfy the selection condition.
Result is a relation
Schema of the result is same as that of the input relation
No need to remove duplicate rows - they cannot arise, because a relation by definition contains
no duplicate rows and selection only filters existing rows.
Example: Find sailors whose rating is above 9 and who are younger than 50
Union: {Jin, Suga, J-Hope, RM, Jimin, V, Jungkook, Jonny, Chris, Guy, Will, Phil}
- Set Difference: Retains rows of one relation that do not appear in the other relation
Eg: Samsung phones {Fold 5, Flip 5, S23, A53} - Flip phones {Fold 5, Flip 5, Razr, Razr+} = {S23, A53}
Cross Product
EXAMPLE: Find all sailors (from relation S1) who have reserved a boat
- Step 1: S1 x R1
- Step 2: Select rows where attributes that appear in both relations have equal values.
- Step 3: Project all unique attributes and one copy of each of the common ones
1. Condition Join (or theta-join) is a cross product with a condition c (sometimes denoted
by theta)
R⨝c S = 𝜎c(RxS)
S1⨝(S1.sid < R1.sid)R1
2. Equi- Join is a special case of condition join, where condition c contains only equalities
Eg: S1.sid = R1.sid hence
S1⨝(S1.sid = R1.sid)R1
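The operators above can be mimicked in plain Python to see how a query composes; this is an illustrative sketch (the relation contents are made up), not how a DBMS implements them:

```python
# Toy relations as lists of dicts, following the sailors/reserves example
S1 = [{"sid": 22, "sname": "Dustin", "rating": 7},
      {"sid": 58, "sname": "Rusty",  "rating": 10}]
R1 = [{"sid": 22, "bid": 101, "day": "10/10"},
      {"sid": 58, "bid": 103, "day": "11/12"}]

def select(rel, pred):              # sigma: keep rows satisfying the condition
    return [t for t in rel if pred(t)]

def project(rel, attrs):            # pi: keep listed attributes, drop duplicates
    seen = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in seen:
            seen.append(row)
    return seen

def equi_join(r, s, attr):          # equi-join on a shared attribute
    return [{**t, **u} for t in r for u in s if t[attr] == u[attr]]

# Names of sailors who reserved boat 103:
result = project(select(equi_join(S1, R1, "sid"),
                        lambda t: t["bid"] == 103), ["sname"])
print(result)  # [{'sname': 'Rusty'}]
```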
What's examinable?
1. Relational Algebra Operations: The 5 basics, intersection, and joins
2. Design queries with Relational algebra operations
3. Apply Relational Algebra Operations on tables (relations)
What is SQL?
SQL (or SEQUEL, its original name) is a language used in relational databases
Supports CRUD - Create, Read, Update and Delete commands
Other commands
- Administer the database (eg CHECK TABLE)
- Transactional Control (eg COMMIT)
INSERT COMMAND
SELECT COMMAND
SELECT * FROM TableName;
*(star): Allows us to obtain all columns from a table
= Give me all information you have about TableName
𝜎a⋀b V c – a and b or c
OUTER JOIN
- Joins the tables over keys
- Can be LEFT or RIGHT (the same idea, mirrored)
- Includes records from the preserved table that have no match in the other table
Things to remember
1. SQL keywords are not case sensitive
2. Table names can be case sensitive (platform-dependent, eg MySQL on Linux)
3. Field names are not case sensitive
4. You can do maths in SQL
SELECT 1*1+1/1-1
5. For SELECTs: no orders unless you ORDER BY
WEEK 5
Comparison and Logic
- Logic:
● AND
● OR
● NOT
- Comparison
● =
● <
● >
● <=
● >=
● <> OR != – Not equal to
String Functions
- UPPER () : Changes to uppercase
- LOWER() : Changes to lowercase
- LEFT() : Take the left X characters from a string
- RIGHT() : Take the right X characters from a string
Set Operators
- UNION:
Shows all rows returned from the queries (or tables)
- UNION ALL:
If you want duplicate rows shown in the results you need to use the ALL keyword
Eg: UNION ALL ….
- INTERSECT:
Shows only rows that are common in the queries
- INTERSECT ALL
Same as INTERSECT but keeps duplicate rows (analogous to UNION ALL)
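A quick check of UNION vs UNION ALL vs INTERSECT using Python's sqlite3 (SQLite supports these three but not INTERSECT ALL; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A(name TEXT);
    CREATE TABLE B(name TEXT);
    INSERT INTO A VALUES ('Jin'), ('Suga'), ('Jonny');
    INSERT INTO B VALUES ('Jonny'), ('Chris');
""")
union     = conn.execute("SELECT name FROM A UNION SELECT name FROM B").fetchall()
union_all = conn.execute("SELECT name FROM A UNION ALL SELECT name FROM B").fetchall()
inter     = conn.execute("SELECT name FROM A INTERSECT SELECT name FROM B").fetchall()
# UNION removes the duplicate 'Jonny'; UNION ALL keeps both copies
print(len(union), len(union_all), inter)  # 4 5 [('Jonny',)]
```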
Sub-Query Nesting
- A nested query is simply another SELECT query you write to produce a table set
DDL: TRUNCATE/DROP
1. TRUNCATE
Same effect as DELETE FROM TableName;
Faster, but you cannot ROLLBACK a TRUNCATE command
TRUNCATE TableName;
2. DROP
Kills a relation - removes the data, removes the relation. Entire table is gone
Record:
A record refers to an individual row of a table and has a unique rid. The rid has the property that
we can identify the disk address of the page containing the record by using the rid. The rid
consists of the page ID and the offset within that page, for example, an rid of (3, 7) refers to the
seventh record from the beginning of the third page.
Page:
A page is an allocation of space on disk or in memory containing a collection of records.
Typically, every page is the same size.
File:
A file consists of a collection of pages containing records. In simple database scenarios, a file
corresponds to a single table
Index Classification
1. Clustered
If the order of data records is the same as the order of index data entries, then the index
is called a clustered index.
- A data file can have a clustered index on at most one search key combination
(i.e. we cannot have multiple clustered indexes over a single table).
- Cost of retrieving data records through an index varies greatly based on whether
the index is clustered (cheaper for clustered).
- Clustered indexes are more expensive to maintain (require file reorganisation),
but are really efficient for range search.
- Cases where the query has a condition to check for a range
- Not good for equality conditions
2. Unclustered
If the data records are not in the same order as the index entries, it's called an unclustered index
● IMPORTANT: (Approximated) cost of retrieving records found in range scan:
Clustered: cost ≈ # pages in data file with matching records
Unclustered: cost ≈ # of matching index data entries (data records)
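The approximation can be turned into numbers; a sketch with assumed figures (100 tuples per page, 20,000 records matching the range predicate):

```python
tuples_per_pg = 100     # assumed tuples per data page
n_matching    = 20000   # assumed records matching the range predicate

# Clustered: matching records sit on consecutive pages of the data file
clustered_cost = n_matching / tuples_per_pg   # pages holding the matches
# Unclustered: each matching data entry may require its own page fetch
unclustered_cost = n_matching
print(clustered_cost, unclustered_cost)  # 200.0 20000
```

The two orders of magnitude between the results is exactly why clustering matters for range scans.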
Hash Index:
Hash indexes are best suited to support equality selections (queries where the WHERE clause
has an equality condition)
B-Tree Index:
A B-tree index is created by sorting the data on the search key and maintaining a hierarchical
search data structure (B+ tree) that will direct the search to the respective page of the data
entry. Insertion in such a structure is costly as the tree is updated with every insertion or
deletion. There will be situations in which a major part of the tree will be re-written if a particular
node is overfilled or under-filled. Ideally the tree will automatically maintain an appropriate
number of levels, as well as optimal space usage in blocks
Heap Files:
- Simplest file structure
- Contains records in no specific order
- As a file grows and shrinks, disk pages are allocated and de-allocated.
- Heaps are fastest for inserts compared to other alternatives.
- Suitable when typical access is a file scan retrieving all records
- Cost = Number of Pages
Sorted Files:
- Similar structure like heap files
- Pages and records are ordered
- Best for retrieval in some order
- Cheapest retrieval cost for a sorted file is log2(Pages) - binary search
WEEK 6
Reduction factor is usually called selectivity. It estimates what portion of the relation will qualify
for the given predicate, i.e., satisfy the given condition.
Cost = Npages(R)
- Steps to perform:
1. Find qualifying data entries:
● Go through the index: height typically small (FYI: 2-4 I/O = B+ tree, 1.2
I/O - hash index, i.e., negligible if many records retrieved)
● Once data entries are reached, go through data entries one by one and
look up corresponding data records (in the data file)
Example
Let's say that 10% of Reserves tuples qualify: RF = 0.1
Given:
NPages(R) = 1000
NTuplesPerPage(R) = 100
NTuples(R) = 1000* 100 = 100,000
Answers:
1. Clustered Index = (1000 + 50) * 0.1 = 105
2. Unclustered Index = (1000 + 100,000) * 0.1 = 10100
3. Heap Scan, unsorted = 1000
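The arithmetic in the example can be checked in a few lines (the 50 in the clustered formula is taken from the example above and presumably counts index pages):

```python
# Checking the worked example: RF = 0.1 on Reserves
n_pages, per_page, rf = 1000, 100, 0.1
n_tuples = n_pages * per_page               # 100,000

clustered   = (n_pages + 50) * rf           # (1000 + 50) * 0.1 = 105
unclustered = (n_pages + n_tuples) * rf     # (1000 + 100000) * 0.1 = 10100
heap_scan   = n_pages                       # full scan reads every page regardless of RF
print(clustered, unclustered, heap_scan)
```

Note the heap scan is cheaper than the unclustered index here: a high RF can make an index counterproductive.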
A B-tree index matches (a combination of) predicates that involve only attributes in
a prefix of the search key.
Selection approach
3. Apply the predicates that don’t match the index (if any) later on
These predicates are used to discard some retrieved tuples, but do not affect the number of
tuples/pages fetched (nor the total cost)
● Then, day < 8/9/94 must be checked on the fly
● Hash indexes are horrible for range queries
Example:
- Rationale: An algorithm used for sorting; remember that if the data does not fit in memory, we
need several passes.
ReadTable = NPages(R) - cost to read the entire table, keeping only projected attributes
WriteProjectedPages = NPages(R) * PF
ReadProjectedPages = NPages(R) * PF
SortingCost = 2 * NumPasses * ReadProjectedPages
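Plugging assumed numbers into these formulas (PF is the projection factor, the fraction of each tuple that survives the projection; all figures are illustrative):

```python
n_pages, pf, num_passes = 1000, 0.25, 2   # assumed values

read_table      = n_pages           # read the full table once
write_projected = n_pages * pf      # write out the projected pages
read_projected  = n_pages * pf      # read them back for sorting
sorting_cost    = 2 * num_passes * read_projected
print(read_table, write_projected, sorting_cost)  # 1000 250.0 1000.0
```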
JOINS
Are very common and can be very expensive (time/ processing)
Cross product in the worst case
NBlocks(Outer) = ⌈NPages(Outer) / (B-2)⌉
B = # pages of buffer space in memory, i.e., blocks
Sort-Merge Join
- Sort R and S on the join column, then scan them to do a merge (on join column), and
output result tuples.
- Sorted R is scanned once; Each S group of the same key values is scanned once per
matching R tuple (typically means Sorted S is scanned once too).
- Useful when:
1. One or both inputs are already sorted on join attribute(s)
2. Output is required to be sorted on join attribute(s)
Hash Join
- Partition both relations using hash function h: R tuples in a partition will only match S
tuples in the same partition
- Read in a partition of R, hash it using h2 (<> h!). Scan the matching partition of S, probe the
hash table for matches.
WEEK 7
Query Optimization I
- Is a tree, with relational algebra operators as nodes and access paths as leaves
- Each operator labelled with a choice of an algorithm.
● Step 1
- Query block is any statement starting with SELECT
- Query block = unit of optimization / execution
- Typically the innermost block is optimized first, then moving towards the outer ones
Example
- Projection:
- A projection commutes with a selection that only uses attributes retained by the
projection
Equivalences Involving Joins
- These equivalences allow us to choose different join orders
Cost Estimation
- To decide on the cost, the optimizer needs information about the relations and indexes
involved. This information is stored in the system catalogs
- Maximum number of tuples in the result is the product of the cardinalities of relations in
the FROM clause
- Reduction factor (RF) associated with each predicate reflects the impact of the predicate
in reducing the result size.
RF is also called selectivity.
- If there are no selections (no predicates), reduction factors are simply ignored - equal to
1.
EXAMPLE
Enumeration of Alternative Plans
- Other operations can be performed on top of access paths, but they typically do not incur
additional cost since they are done on the fly (projections, additional non-matching
predicates)
EXAMPLE
Plan Enumeration for multi-relation plans
SxR
Cost (SxR) = (NLJ) NPages(S) + NPages(S) * NPages(R) = 500 + 500*1000 = 500500
(SxR)xB
- NTuples(S) = 500 (NPages) * 80 (NTuplesPerPage) = 40000
Result size (SxR) = 40000 * 100000 * 1/40000 = 100000 tuples
So, 100000 tuples / 100 tuples per page = 1000 pages
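The arithmetic above can be verified directly (page-oriented nested-loop join cost, then result size with RF = 1/40000):

```python
# Nested-loop join cost for S x R: read the outer once, scan the inner per outer page
npages_S, npages_R = 500, 1000
nlj_cost = npages_S + npages_S * npages_R
print(nlj_cost)  # 500500

# Result size with reduction factor RF = 1/40000 on the join predicate
ntuples_S = 500 * 80                                # 40,000
ntuples_R = 100_000
result_tuples = ntuples_S * ntuples_R // 40_000     # 100,000 tuples
result_pages  = result_tuples // 100                # 100 tuples/page -> 1,000 pages
print(result_tuples, result_pages)  # 100000 1000
```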
● CASE 2
2. Deletion Anomaly: If student 425 withdraws, we lose all record of course C400 and its fee
3. Update Anomaly: If the fee for course C200 changes, we have to change it in multiple records
(rows), else the data will be inconsistent.
Normalisation
- A technique used to remove undesired redundancy from databases.
- Break one large table into several smaller tables.
- A relation is normalised if all determinants are candidate keys.
Armstrong’s Axioms
Functional dependencies can be identified using Armstrong’s axioms
Let A = (X1, X2, ..., Xn) and B = (Y1, Y2, ..., Yn)
1. Reflexivity:
Example - Student_ID, name -> name
2. Augmentation
Example - Student_ID -> name => Student_ID, surname -> name, surname
3. Transitivity:
Example: ID -> birthdate and birthdate -> age, then ID -> age
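The axioms can be packaged into an attribute-closure routine that repeatedly applies transitivity until nothing new is derived. A sketch (the FDs mirror the ID -> birthdate -> age example; attribute names are illustrative):

```python
# Attribute closure via repeated application of Armstrong's axioms:
# reflexivity gives the starting set, the loop applies transitivity.
def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs          # transitivity in action
                changed = True
    return result

fds = [({"ID"}, {"birthdate"}),
       ({"birthdate"}, {"age"})]
print(closure({"ID"}, fds))  # {'ID', 'birthdate', 'age'} in some order
```

An attribute set whose closure covers the whole relation is a superkey, which is how FDs connect back to normalisation.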
Steps in Normalisation
First Normal Form: Remove Repeating Groups
- Repeating groups of attributes cannot be represented in a flat, two dimensional table
- Removing cells with multiple values (keep atomic data)
- Break them into two tables and use Primary key or Foreign keys to connect them
Second Normal Form: Remove Partial Dependencies
- A non-key attribute cannot be identified by part of a composite key
Solution
Normalisation vs Denormalization
1. Normalisation
- Normalised relations contain a minimum amount of redundancy and allow users to insert,
modify and delete rows in tables without errors or inconsistencies (anomalies).
2. Denormalization
- The pay off: query speed
- The price: extra work on updates to keep redundant data consistent
- Denormalization may be used to improve performance of time-critical operations.
● Isolation
- Changes made during execution of a transaction cannot be seen by other transactions
until this one is completed
- If multiple people are accessing the database, everyone will see the same (consistent) data
● Durability
- When a transaction is complete, the changes made to the database are permanent, even
if the system fails
Serializability
- Transaction ideally should run in a schedule that is “serializable”
- Multiple, concurrent transactions appear as if they were executed one after another
- Ensures that the concurrent execution of several transactions yields consistent results
● Lock manager
Responsible for assigning and policing the locks used by the transactions
● Table-level lock
- Entire table is locked - as above but not quite as bad
- T1 and T2 can access the same database concurrently as they use different tables
- Can cause bottlenecks, even if transactions want to access different parts of the table
and would not interfere with each other.
- Not suitable for highly multi-user DBMSs
● Page-level lock
- An entire disk page is locked
- Not commonly used now
● Row-level lock
- Allows concurrent transactions to access different rows of the same table, even if the
rows are located on the same page
- Improves data availability but with high overhead (each row has a lock that must be read
and written)
- Currently the most popular approach (MySQL, Oracle)
● Field-level lock
- Allows concurrent transactions to access the same row, as long as they access different
attributes within that row
- Most flexible lock but requires an extremely high level of overhead
- Not commonly used
Types of Locks
● Binary Locks
- Has only two states: locked or unlocked
- Eliminates the "lost update" problem
The lock is not released until the statement is completed
- Considered too restrictive to yield optimal concurrency, as it locks even for two READs
(when no update being done)
● Exclusive Lock
- Access is reserved for the transaction that locked the object
- Must be used when transaction intends to WRITE
- Granted if and only if no other locks are held on the data item
● Shared Lock
- Other transactions are also granted Read access
- Issued when a transaction wants to READ data, and no Exclusive lock is held on that
data item
Multiple transactions can each have a shared lock on the same data item if they are all
just reading it
● Optimistic
- Based on the assumption that the majority of database operations do not conflict
- Transaction is executed without restrictions or checking
- Then when it is ready to commit, the DBMS checks whether any of the data it read has
been altered - if so, roll back
Logging transactions
- Allow us to restore the database to a previous consistent state
- If a transaction cannot be completed, it must be aborted and any changes rolled back
- To enable this, DBMS tracks all updates to data
Transaction log
- Also provides the ability to restore a crashed database
- If a system failure occurs, the DBMS will examine the log for all uncommitted or incomplete
transactions and it will restore the database to a previous state
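The abort-and-roll-back behaviour can be seen with Python's sqlite3, which wraps updates in an implicit transaction (the account data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account(id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO Account VALUES (1, 100)")
conn.commit()   # last consistent state

# Make a change inside a transaction, then abort it
conn.execute("UPDATE Account SET balance = balance - 40 WHERE id = 1")
conn.rollback()  # the log lets the DBMS restore the previous state

balance = conn.execute("SELECT balance FROM Account WHERE id = 1").fetchone()[0]
print(balance)  # 100
```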
DATABASE ADMINISTRATION
CAPACITY PLANNING
- The process of predicting when future load levels will saturate the system and determining the
most cost-effective way of delaying system saturation as much as possible.
What is a Backup
● A backup is a copy of your data
- However there are several types of backup
● If data becomes corrupted or deleted or held to ransom it can be restored from the backup copy
Protect against
1. Malicious activity: security compromise
2. Natural or man-made disasters
3. Government regulations
- Historical archiving rules
- Metadata collection
- Privacy rules
Categories of Failures
Physical Backup
- Raw copies of files or directories
- Suitable for large databases that need fast recovery
- Database is preferably offline when backup occurs
- Backup = exact copies of the database directories and files
- Backup should include logs
- Backup is only portable to machines with a similar configuration
- To restore
Logical Backup
- Backup completed through SQL queries
- Slower than physical: SQL SELECTs rather than OS file copy
- Output is larger than physical
- Doesn't include log or config files
- Machine independent
- Server is available during the backup
- In MySQL you can create the backup using
- mysqldump
- SELECT ... INTO OUTFILE
- To restore: use mysqlimport, or LOAD DATA INFILE within the mysql client
Full Backup
- A full backup is where the complete database is backed up: may be Physical or Logical, Online or
Offline
- It includes everything you need to get the database operational in the event of a failure
Incremental Backup
- Only the changes since the last backup are backed up
- For most databases this means only backing up the log files
- To restore
- Stop the database, copy backed up log files to disk
- Start the database and tell it to redo the log files
Offsite Backup
- Enables disaster recovery (because backup is not physically near the disaster site)
- Example solutions:
- Backup tapes transported to underground vault
- Remote mirror database maintained via replication
- Backup to cloud
WEEK 10
● Data Warehouse:
- A single repository of organisational data
- Integrates data from multiple sources
- Extracts data from source systems, transforms, loads into the warehouse
- Makes data available to managers/users
- Supports analysis and decision-making
● Subject oriented
- Data warehouses are organised around particular subjects (sales, customers, products)
● Time variant
- Historical data
- Trend analysis crucial for decision support: requires historical data
- Data consists of a series of “snapshots” which are time stamped
● Non-volatile
- Users have read access only - all updating is done automatically by the ETL process and
periodically by a DBA
Dimensional Modelling
Fact Table
● A fact table contains the actual business measures (additive, aggregates), called facts
● The fact table also contains foreign keys pointing to dimensions
Star Schema - dimensional model
Dimension Hierarchies
Distributed database
- A single logical database physically spread across multiple computers in multiple locations that
are connected by a data communications link
- Appears to users as though it is one database
Decentralised Database
- A collection of independent databases which are not networked together as one logical database
- Appears to users as though many databases
Horizontal scaling is needed because, for example, people who want to watch a sport that is
popular in another country would otherwise have to go to that country's site
● Data Integrity
- Additional exposure to improper updating
- If two users in two locations update the record at the exact same time who decides which
statement should “win”?
- Solution: Transaction Manager or Master-slave design
● Security
- Many server sites -> higher chance of breach
- Multiple access sites require protection of both the network and the physical
infrastructure, against both cyber and physical attacks
● Lack of standards
- Different Relational DDBMS vendors use different protocols
● Location transparency
- A user does not need to know where particular data are stored
● Local autonomy
- A node can continue to function for local users if connectivity to the network is lost
Location Transparency
- A user (or program) accessing data does not need to know the location of the data in the
network of DBMSs
- Requests to retrieve or update data from any site are automatically forwarded by the system to
the site or sites related to the processing request
- A single query can join data from tables in multiple sites
Local Autonomy
- Being able to operate locally when connections to other databases fail
- Users can administer their local database
● Control local data
● Administer security
● Log transactions
● Recover when local failures occur
● Provide full access to local data
Distribution Options
● When distributing data around the world- the data can be partitioned or replicated
● Data replication is a process of duplicating data to different nodes
● Data partitioning is a process of partitioning data into subsets that are shipped to different nodes
● Many real-life systems use a combination of the two (partition data and keep some replicas around)
Data Replication - Advantages
- High reliability due to redundant copies of data
- Fast access to data at the location where it is most accessed
- May avoid complicated distributed integrity routines
Replicated data is refreshed at scheduled intervals
- Decoupled nodes don't affect data availability: If some nodes are down it doesn't affect the other
nodes where data is stored
- Reduced network traffic at prime time: If updates can be delayed
- This is currently popular as a way of achieving high availability for global systems: most SQL
and NoSQL databases offer replication
DATA PARTITIONING
- Split data into chunks, store chunks in different nodes
- A chunk can be a set of rows or columns
- Thus, two types of partitioning: horizontal and vertical
Horizontal partitioning
- Table rows distributed across nodes
- Different rows of table at different sites
- Advantages
● Data stored to where it is used : efficiency
● Local access optimisation: better performance
● Only relevant data is stored locally: security
● Unions across partitions: ease of query (combining rows)
- Disadvantages
● Accessing data across partitions: inconsistent access speed
● No data replication : backup vulnerability
- E.g. row 1 will be stored in Europe because it is watched most there (EPL content)
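A minimal sketch of the idea in Python (node names, regions and rows are all made up): each row is routed to exactly one node by a partition key, and recombining the table is a union of the partitions:

```python
# Horizontal partitioning: route whole rows to nodes by a partition key (region)
rows = [
    {"match": "EPL: Arsenal v Spurs",  "region": "Europe"},
    {"match": "NBA: Lakers v Celtics", "region": "Americas"},
    {"match": "EPL: Liverpool v City", "region": "Europe"},
]

nodes = {"Europe": [], "Americas": []}
for row in rows:
    nodes[row["region"]].append(row)   # each row lives on exactly one node

# A query for European content touches only the Europe node (local access)
print(len(nodes["Europe"]))  # 2

# The full table is the union of the partitions
full = nodes["Europe"] + nodes["Americas"]
print(len(full))  # 3
```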
Vertical Partitioning
- Different columns of a table at different sites
- Advantages and disadvantages are the same except
● Combining data across partitions is more difficult because it requires joins (instead of
unions)
Trade-offs when dealing with DDBMS
- Trade-offs
● Availability vs Consistency
The CAP theorem says that, when a network partition occurs, we need to decide whether to
keep data always available OR always consistent
CAP THEOREM
● Cannot have all three (Consistency, Availability, Partition tolerance) in a distributed database
The dominance of the relational model
● Pros of relational databases
- Simple, can capture any business use case
- Can integrate multiple applications via shared data store
- Standard interface language SQL
- Ad-hoc queries, across and within “data aggregates”
- Fast, reliable, concurrent, consistent
● Data Lake
- A large integrated repository for internal and external data that does not follow a
predefined schema
- Capture everything, dive in anywhere, flexible access
● Features
- Does not use relational model or SQL language
- Runs well on distributed servers
- Most are open-source
- Built for the modern web
- Schema-less
- Supports schema on read
- Not ACID compliant
- Eventually consistent
● Goals
- To improve programmer productivity (OR mismatch)
- To handle large data volumes and throughput (big data)
Aggregate-oriented databases
● Pros:
- Entire aggregate of data is stored together (no need for transactions)
- Efficient storage on clusters/ distributed databases
● Cons
- Hard to analyse across subfields of aggregates
- E.g. sum over products instead of orders
ACID vs BASE
● Soft State: The state of the system could change over time - even during times without
there may be changes going on due to ‘eventual consistency’.
● Eventual Consistency: the system will eventually become consistent once it stops receiving
input. The data will propagate to everywhere it needs to, sooner or later, but the system will
continue to receive input and is not checking the consistency of every transaction before it moves
on to the next
● Session Consistency
- As long as session exists, system guarantees read-your-write consistency
● In Practice:
- A number of these properties can be combined
- Monotonic reads and read-your-writes are most desirable