RDBMS Notes
• DBMS:-
✓ A Database Management System manages a well-organized collection of
data so that operations (storage, retrieval, update) can be carried out on it.
• RDBMS:-
✓ Stands for "Relational Database Management System." An
RDBMS is a DBMS designed specifically for relational
databases. Therefore, RDBMSes are a subset of DBMSes.
✓ A relational database refers to a database that stores data
in a structured format, using rows and columns. This
makes it easy to locate and access specific values within
the database. It is "relational" because the values within
each table are related to each other. Tables may also be
related to other tables. The relational structure makes it
possible to run queries across multiple tables at once.
Types of DDBMS (Distributed DBMS):-
1. Homogeneous Database
2. Heterogeneous Database
1. Homogeneous Database:-
In a homogeneous distributed database, all sites store the database
identically. The operating system, database management system and the
data structures used are the same at all sites. Hence, such databases are
easy to manage.
2. Heterogeneous Database:-
In a heterogeneous distributed database, different sites can use different
schemas and software, which can lead to problems in query processing and
transactions. A particular site might also be completely unaware of the other
sites. Different computers may use different operating systems and different
database applications. They may even use different data models for the
database. Hence, translations are required for different sites to
communicate.
Distributed Data Storage:-
There are 2 ways in which data can be stored on different sites.
These are:
1. Replication
In this approach, the entire relation is stored redundantly at 2 or
more sites. If the entire database is available at all sites, it is a fully
redundant database. Hence, in replication, systems maintain
copies of data.
This is advantageous as it increases the availability of data at
different sites. Also, now query requests can be processed in
parallel.
However, it has certain disadvantages as well. Data needs to be constantly
updated: any change made at one site needs to be recorded at every site
where that relation is stored, or else it may lead to inconsistency. This is a
lot of overhead. Also, concurrency control becomes far more complex, as
concurrent access now needs to be checked across a number of sites.
2. Fragmentation
In this approach, the relations are fragmented (i.e., divided into smaller
parts) and each fragment is stored at the site where it is required. It must
be ensured that the fragments are such that the original relation can be
reconstructed from them (i.e., there is no loss of data).
Fragmentation is advantageous because it doesn't create copies of data, so
consistency is not a problem.
Fragmentation of relations can be done in two ways:
• Horizontal fragmentation – Splitting by rows – The relation is
fragmented into groups of tuples so that each tuple is assigned
to at least one fragment.
• Vertical fragmentation – Splitting by columns – The schema
of the relation is divided into smaller schemas. Each fragment
must contain a common candidate key so as to ensure lossless
join.
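As a rough illustration (made-up EMPLOYEE rows, not tied to any particular DBMS), the sketch below fragments a relation horizontally by a predicate and vertically by column groups, and checks that the original relation can be reconstructed:

# Sketch: horizontal and vertical fragmentation of a relation held as a list of dicts.
# The EMPLOYEE relation and its columns are made-up example data.

employee = [
    {"emp_id": 1, "name": "Asha",  "dept": "Sales", "city": "Delhi"},
    {"emp_id": 2, "name": "Ravi",  "dept": "HR",    "city": "Mumbai"},
    {"emp_id": 3, "name": "Meena", "dept": "Sales", "city": "Noida"},
]

# Horizontal fragmentation: split by rows using a predicate; every tuple lands
# in at least one fragment, so a union of the fragments rebuilds the relation.
sales_site = [row for row in employee if row["dept"] == "Sales"]
other_site = [row for row in employee if row["dept"] != "Sales"]
assert sorted(sales_site + other_site, key=lambda r: r["emp_id"]) == employee

# Vertical fragmentation: split by columns; each fragment keeps the candidate
# key (emp_id) so the fragments can be joined back losslessly.
personal_frag = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employee]
work_frag = [{"emp_id": r["emp_id"], "dept": r["dept"], "city": r["city"]} for r in employee]

# Lossless reconstruction: join the vertical fragments on the shared key.
rebuilt = [
    {**p, **w}
    for p in personal_frag
    for w in work_frag
    if p["emp_id"] == w["emp_id"]
]
assert sorted(rebuilt, key=lambda r: r["emp_id"]) == employee
print(rebuilt)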
Architectural Models:-
Concurrency Control:-
Concurrency means that many users can access the data at the same time.
Uncontrolled concurrent execution can cause the following problems:
1. Lost updates
2. Dirty read
3. Unrepeatable read
1. Lost Update:-
Here,
o At time t2, transaction-X reads A's value.
o At time t3, Transaction-Y reads A's value.
o At time t4, Transaction-X writes A's value on the basis of the value seen
at time t2.
o At time t5, Transaction-Y writes A's value on the basis of the value seen
at time t3.
o So at time t5, the update of Transaction-X is lost because Transaction-Y
overwrites it without looking at its current value.
o Such a problem is known as the Lost Update Problem, as the update
made by one transaction is lost here.
2. Dirty Read:-
o A dirty read occurs when one transaction updates an item of the
database, and then the transaction fails for some reason. The updated
database item is accessed by another transaction before it is changed
back to its original value.
o A transaction T1 updates a record which is read by T2. If T1 aborts, then
T2 now has values which have never formed part of the stable database.
Example:-
o At time t2, Transaction-Y writes A's value.
o At time t3, Transaction-X reads A's value.
o At time t4, Transaction-Y rolls back. So, it changes A's value back to what
it was prior to t1.
o So, Transaction-X now holds a value which has never become part of the
stable database.
o Such a problem is known as the Dirty Read Problem, as one transaction
reads a dirty value which has not been committed.
Example:-
1. Shared lock:
o In a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions because, while holding this lock, a
transaction cannot update the data item.
2. Exclusive lock:
o In an exclusive lock, the data item can be both read and written by the
transaction.
o This lock is exclusive; under it, multiple transactions cannot modify the
same data item simultaneously.
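To make the two lock modes concrete, here is a minimal compatibility sketch (illustrative names only, not a real DBMS lock manager): several transactions may share an S lock on the same item, while an X lock conflicts with every other lock:

# Minimal sketch of shared (S) / exclusive (X) lock compatibility per data item.
# Names and structure are illustrative, not an actual DBMS lock manager.

class LockTable:
    def __init__(self):
        # item -> list of (transaction_id, mode) currently granted
        self.granted = {}

    def request(self, txn, item, mode):
        """Grant the lock if compatible with existing holders, else refuse (caller would block)."""
        holders = self.granted.setdefault(item, [])
        for other_txn, other_mode in holders:
            if other_txn == txn:
                continue
            # S is compatible only with S; X conflicts with everything.
            if mode == "X" or other_mode == "X":
                return False
        holders.append((txn, mode))
        return True

locks = LockTable()
print(locks.request("T1", "A", "S"))   # True: first shared lock on A
print(locks.request("T2", "A", "S"))   # True: shared locks coexist
print(locks.request("T3", "A", "X"))   # False: exclusive lock conflicts with the S holders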
Growing phase: In the growing phase, a new lock on a data item may be
acquired by the transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the
transaction may be released, but no new locks can be acquired.
Example:
The following way shows how unlocking and locking work with 2-
PL.
Transaction T1:
Transaction T2:
Where,
In the above figure, T1's read of the data item precedes T1's write of the
same data item. This schedule is not conflict serializable.
The Thomas write rule checks that T2's write is never seen by any
transaction. If we delete that write operation in transaction T2, then a
conflict serializable schedule can be obtained, as shown in the figure below.
Multiple Granularity
Database Transaction:-
All database access operations which are carried out between the
begin-transaction and end-transaction statements are considered a
single logical transaction in a DBMS. During the transaction the
database may be inconsistent. Only once the transaction is committed does
the database move from one consistent state to another.
Facts about Database Transactions
States of Transactions
Schedule:-
2. Non-serial Schedule:-
3. Serializable schedule:-
o If a precedence graph contains a single edge Ti → Tj, then all the instructions of
Ti are executed before the first instruction of Tj is executed.
For example:
Explanation:
The precedence graph for schedule S1 contains a cycle that's why Schedule
S1 is non-serializable.
Explanation:
Conflict operations: -
The two operations are called conflicting operations, if all the following
three conditions are satisfied:
• Both operations belong to different transactions.
• Both operate on the same data item.
• At least one of them is a write operation.
Note: Conflict pairs for the same data item are:
Read-Write
Write-Write
Write-Read
Step 1: Check for a vertex in the precedence graph with indegree = 0. So,
take the vertex T2 from the graph and remove it from the graph.
Step 3: At last, take the vertex T1 and connect it with T3.
Precedence graph equivalent to schedule S2
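The precedence-graph test described above can be sketched as follows (the schedule encoding and helper names are assumptions, not any particular DBMS's code): add an edge Ti → Tj for every conflicting pair whose first operation belongs to Ti, then check the graph for a cycle:

# Sketch: conflict-serializability test via a precedence graph.
# A schedule is a list of (transaction, operation, data_item) tuples; names are illustrative.

from itertools import combinations

def precedence_edges(schedule):
    """Edge Ti -> Tj for every conflicting pair: same item, different txns, at least one write."""
    edges = set()
    for (ti, op_i, x_i), (tj, op_j, x_j) in combinations(schedule, 2):
        if ti != tj and x_i == x_j and "W" in (op_i, op_j):
            edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Simple DFS cycle detection over the precedence graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {v: "new" for v in graph}

    def dfs(v):
        state[v] = "active"
        for w in graph[v]:
            if state[w] == "active" or (state[w] == "new" and dfs(w)):
                return True
        state[v] = "done"
        return False

    return any(dfs(v) for v in graph if state[v] == "new")

# Example: R1(A) W2(A) W1(A) gives edges T1 -> T2 and T2 -> T1, i.e. a cycle.
s = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
edges = precedence_edges(s)
print(edges)
print("conflict serializable:", not has_cycle(edges))   # False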
View Serializability: -
View Serializability is a process to find out whether a given schedule is view
serializable or not.
Condition-01:
For each data item X, if transaction Ti reads X from the database initially in
schedule S1, then in schedule S2 also, Ti must perform the initial read of X
from the database.
Thumb Rule
“Initial readers must be same for all the data items”.
Condition-02:
If transaction Ti reads a data item that has been updated by the transaction
Tj in schedule S1, then in schedule S2 also, transaction Ti must read the
same data item that has been updated by the transaction Tj.
Thumb Rule
“Write-read sequence must be the same.”
Condition-03:
For each data item X, if transaction Ti performs the final write of X in
schedule S1, then in schedule S2 also, Ti must perform the final write of X.
Thumb Rule
“Final writers must be the same for all the data items.”
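These three conditions can be checked mechanically. Below is a rough sketch (with an assumed schedule encoding) that summarizes each schedule by its initial reads, its write-read pairs and its final writers, and compares the two summaries:

# Sketch: check the three view-equivalence conditions between two schedules.
# A schedule is a list of (transaction, operation, data_item) tuples; the encoding is illustrative.

def view_summary(schedule):
    initial_reads = set()   # condition 1: (txn, item) pairs that read the initial value
    reads_from = set()      # condition 2: (reader, writer, item) write-read pairs
    last_writer = {}        # most recent writer seen per item
    for txn, op, item in schedule:
        if op == "R":
            if item not in last_writer:
                initial_reads.add((txn, item))
            else:
                reads_from.add((txn, last_writer[item], item))
        elif op == "W":
            last_writer[item] = txn
    return initial_reads, reads_from, dict(last_writer)   # condition 3: final writer per item

def view_equivalent(s1, s2):
    return view_summary(s1) == view_summary(s2)

# Classic example: R1(A) W2(A) W1(A) W3(A) is view equivalent to the serial
# schedule T1, T2, T3 (same initial reader, same reads-from pairs, same final
# writer), even though it is not conflict serializable.
s      = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A"), ("T3", "W", "A")]
serial = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A"), ("T3", "W", "A")]
print(view_equivalent(s, serial))   # True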
Method-01:
Method-02:
Thumb Rule
If a schedule is not conflict serializable and contains no blind write, then it is not view serializable.
Method-03:
Example-01:
Step-01:
List all the conflicting operations and determine the dependency between
the transactions-
• W1(B) , W2(B) (T1 → T2)
• W1(B) , W3(B) (T1 → T3)
• W1(B) , W4(B) (T1 → T4)
• W2(B) , W3(B) (T2 → T3)
• W2(B) , W4(B) (T2 → T4)
• W3(B) , W4(B) (T3 → T4)
Step-02:
• Clearly,
there exists no cycle in the precedence graph.
• Therefore, the given schedule S is conflict serializable.
• Thus, we conclude that the given schedule is also view serializable.
Example-02:
Step-01:
List all the conflicting operations and determine the dependency between
the transactions-
• R1(A) , W3(A) (T1 → T3)
• R2(A) , W3(A) (T2 → T3)
• R2(A) , W1(A) (T2 → T1)
• W3(A) , W1(A) (T3 → T1)
Step-02:
Draw the precedence graph-
• Clearly,
there exists a cycle in the precedence graph.
• Therefore, the given schedule S is not conflict serializable.
Now,
• Since, the given schedule S is not conflict serializable, so, it may or
may not be view serializable.
• To check whether S is view serializable or not, let us use another
method.
• Let us check for blind writes.
Now,
• To check whether S is view serializable or not, let us use another
method.
• Let us derive the dependencies and then draw a dependency graph.
• Clearly,
there exists a cycle in the dependency graph.
• Thus, we conclude that the given schedule S is not view serializable.
Example-03:
List all the conflicting operations and determine the dependency between
the transactions-
• R1(A) , W2(A) (T1 → T2)
• R2(A) , W1(A) (T2 → T1)
• W1(A) , W2(A) (T1 → T2)
• R1(B) , W2(B) (T1 → T2)
• R2(B) , W1(B) (T2 → T1)
Step-02:
• Clearly,
there exists a cycle in the precedence graph.
• Therefore, the given schedule S is not conflict serializable.
Now,
• Since, the given schedule S is not conflict serializable, so, it may or
may not be view serializable.
• To check whether S is view serializable or not, let us use another
method.
• Let us check for blind writes.
Alternatively,
• You could directly declare that the given schedule S is not view
serializable.
• This is because there exists no blind write in the schedule.
• You need not check for conflict serializability.
Recoverability of schedule: -
Sometimes a transaction may not execute completely due to a software
issue, system crash or hardware failure. In that case, the failed transaction
has to be rolled back. But some other transaction may also have used a
value produced by the failed transaction, so we have to roll back those
transactions as well.
The above table 1 shows a schedule which has two transactions. T1 reads
and writes the value of A, and that value is read and written by T2. T2
commits, but later on T1 fails. Due to the failure, we have to roll back T1.
T2 should also be rolled back because it read the value written by T1, but
T2 can't be rolled back because it has already committed. So this type of
schedule is known as an irrecoverable schedule.
Irrecoverable schedule: The schedule will be irrecoverable if Tj reads
the updated value written by Ti and Tj commits before Ti commits.
To find where the problem has occurred, we generalize failures into
the following categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
2. System Crash
o System failure can occur due to power failure or other hardware or
software failure. Example: operating system error.
3. Disk Failure
o It occurs when hard-disk drives or storage drives fail. This was a common
problem in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, a disk head crash,
unreachability of the disk, or any other failure which destroys all or part
of the disk storage.
Log-Based Recovery: -
o The log is a sequence of records. Log of each transaction is
maintained in some stable storage so that if any failure occurs, then it
can be recovered from there.
o If any operation is performed on the database, then it will be recorded
in the log.
o But the process of storing the logs should be done before the actual
transaction is applied in the database.
When the system is crashed, then the system consults the log to find which
transactions need to be undone and which need to be redone.
1. If the log contains both the records <Ti, Start> and <Ti, Commit>, then
the Transaction Ti needs to be redone.
2. If the log contains the record <Ti, Start> but contains neither
<Ti, Commit> nor <Ti, Abort>, then the Transaction Ti needs to
be undone.
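A minimal sketch of these two rules (the tuple-based log encoding below is an assumption for illustration): scan the log once and classify every started transaction into a redo list or an undo list:

# Sketch: build redo/undo lists from a log, following the two rules above.
# A log record is (transaction, action) where action is "start", "commit" or "abort";
# the encoding is illustrative, not a real DBMS log format.

def classify_transactions(log):
    started, committed, aborted = set(), set(), set()
    for txn, action in log:
        if action == "start":
            started.add(txn)
        elif action == "commit":
            committed.add(txn)
        elif action == "abort":
            aborted.add(txn)
    redo = sorted(started & committed)             # rule 1: start and commit present
    undo = sorted(started - committed - aborted)   # rule 2: started but never finished
    return redo, undo

log = [("T1", "start"), ("T1", "commit"),
       ("T2", "start"), ("T3", "start"), ("T3", "commit")]
print(classify_transactions(log))   # (['T1', 'T3'], ['T2'])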
Checkpoint: -
o The checkpoint is a type of mechanism where all the previous logs are
removed from the system and permanently stored in the storage disk.
o The checkpoint is like a bookmark. During the execution of a transaction,
such checkpoints are marked, and as the transaction is executed, log files
are created using the steps of the transaction.
o When it reaches the checkpoint, the transaction's updates are applied to
the database, and up to that point the entire log file is removed from
storage. The log file is then updated with the new steps of the transaction
until the next checkpoint, and so on.
o The checkpoint is used to declare a point before which the DBMS was
in the consistent state, and all transactions were committed.
In the following manner, a recovery system recovers the database from this
failure:
o The recovery system reads log files from the end to start. It reads log
files from T4 to T1.
o Recovery system maintains two lists, a redo-list, and an undo-list.
o A transaction is put into the redo list if the recovery system sees a log
with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>. The
transactions in the redo list are redone so that their updates are reflected
in the database, and their old log records are then removed.
o For example: in the log file, transactions T2 and T3 will have <Tn, Start>
and <Tn, Commit>. The T1 transaction will have only <Tn, Commit> in
the log file, because its start record lies before the checkpoint; it
committed after the checkpoint was crossed. Hence the system puts
transactions T1, T2 and T3 into the redo list.
o A transaction is put into the undo list if the recovery system sees a log
with <Tn, Start> but no commit or abort record. All the transactions in
the undo list are undone, and their logs are removed.
o For example: transaction T4 will have only <Tn, Start>. So T4 will be put
into the undo list, since this transaction has not yet completed and failed
midway.
Blocking: -
Blocking occurs when one or more sessions request a lock on the same
resource, such as a row, page, or table.
When one connection holds a lock and a second connection requires a
conflicting lock, the second connection is blocked until the first connection
releases its lock.
Deadlock in DBMS: -
Deadlock Avoidance
Deadlock Detection
The wait-for graph for the above scenario is shown below:
Deadlock Prevention
Let's assume there are two transactions Ti and Tj, and let TS(T) be the
timestamp of any transaction T. If Tj holds a lock on some resource and Ti
requests a resource held by Tj, then the following actions are performed by
the DBMS:
Relational Algebra: -
Every database management system must define a query language to allow
users to access the data stored in the database. Relational Algebra is a
procedural query language used to query the database tables to access data
in different ways.
1. Select Operation:-
o The select operation selects tuples that satisfy a given predicate.
o It is denoted by sigma (σ).
Notation: σp(r)
Where σ denotes selection, p is the selection predicate and r is the relation.
Input:
σ BRANCH_NAME="perryride" (LOAN)
Output:
2. Project Operation:
o This operation shows the list of those attributes that we wish to
appear in the result. The rest of the attributes are eliminated from the
table.
o It is denoted by ∏.
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
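For illustration only, the select and project operations can be mimicked on a relation stored as a list of dictionaries; the rows below mirror the NAME/CITY output above:

# Sketch: select (σ) and project (∏) over a relation stored as a list of dicts.
# The CUSTOMER rows mirror the NAME/CITY output shown above; data is illustrative.

customer = [
    {"NAME": "Jones",   "CITY": "Harrison"},
    {"NAME": "Smith",   "CITY": "Rye"},
    {"NAME": "Hays",    "CITY": "Harrison"},
    {"NAME": "Curry",   "CITY": "Rye"},
    {"NAME": "Johnson", "CITY": "Brooklyn"},
    {"NAME": "Brooks",  "CITY": "Brooklyn"},
]

def select(relation, predicate):
    """σ_predicate(relation): keep only the tuples satisfying the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """∏_attributes(relation): keep only the listed attributes, removing duplicates."""
    seen, result = set(), []
    for row in relation:
        projected = tuple((a, row[a]) for a in attributes)
        if projected not in seen:
            seen.add(projected)
            result.append(dict(projected))
    return result

print(select(customer, lambda r: r["CITY"] == "Harrison"))  # Jones and Hays rows
print(project(customer, ["CITY"]))                          # Harrison, Rye, Brooklyn (no duplicates)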
3. Union Operation:
o Suppose there are two relations R and S. The union operation contains
all the tuples that are either in R or in S or in both R and S.
o It eliminates duplicate tuples. It is denoted by ∪.
Notation: R ∪ S
Example:
DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input:
∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
o Suppose there are two relations R and S. The set intersection operation
contains all tuples that are in both R and S.
o It is denoted by ∩.
Notation: R ∩ S
Input:
∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Smith
Jones
5. Set Difference:
o Suppose there are two relations R and S. The set difference operation
contains all tuples that are in R but not in S.
o It is denoted by minus (−).
Notation: R - S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) -
∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
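A short sketch of the three set operations on the CUSTOMER_NAME columns of the two relations above, using Python sets (which, like relational algebra, keep no duplicates):

# Sketch: the union, intersection and set-difference examples above, computed
# on the CUSTOMER_NAME columns of BORROW and DEPOSITOR.

depositor_names = {"Johnson", "Smith", "Mayes", "Turner", "Jones", "Lindsay"}
borrow_names    = {"Jones", "Smith", "Hayes", "Jackson", "Curry", "Williams"}

print(sorted(borrow_names | depositor_names))  # ∪ : every customer with an account or a loan
print(sorted(borrow_names & depositor_names))  # ∩ : ['Jones', 'Smith']
print(sorted(borrow_names - depositor_names))  # −  : ['Curry', 'Hayes', 'Jackson', 'Williams']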
6. Cartesian product
o The Cartesian product is used to combine each row in one table with
each row in the other table. It is also known as a cross product.
o It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
EMPLOYEE X DEPARTMENT
Output:
7. Rename Operation:
o The rename operation is used to rename the output relation. It is denoted
by rho (ρ).
Example: ρ(STUDENT1, STUDENT) returns the STUDENT relation under the
new name STUDENT1.
Join Operations:
Example:
EMPLOYEE
EMP_CODE EMP_NAME
101 Stephan
102 Jack
103 Harry
SALARY
EMP_CODE SALARY
101 50000
102 30000
103 25000
Result:
Example: Let's use the above EMPLOYEE table and SALARY table:
Input:
EMP_NAME SALARY
Stephan 50000
Jack 30000
Harry 25000
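As a rough sketch (not any particular DBMS's join implementation), a natural join of the EMPLOYEE and SALARY relations above combines tuples that agree on their common attribute EMP_CODE:

# Sketch: natural join of the EMPLOYEE and SALARY relations above on their
# common attribute EMP_CODE, written over lists of dicts (illustrative only).

employee = [
    {"EMP_CODE": 101, "EMP_NAME": "Stephan"},
    {"EMP_CODE": 102, "EMP_NAME": "Jack"},
    {"EMP_CODE": 103, "EMP_NAME": "Harry"},
]
salary = [
    {"EMP_CODE": 101, "SALARY": 50000},
    {"EMP_CODE": 102, "SALARY": 30000},
    {"EMP_CODE": 103, "SALARY": 25000},
]

def natural_join(r, s):
    common = set(r[0]) & set(s[0])   # attributes shared by both relations (assumes non-empty inputs)
    return [
        {**row_r, **row_s}
        for row_r in r
        for row_s in s
        if all(row_r[a] == row_s[a] for a in common)
    ]

for row in natural_join(employee, salary):
    print(row)   # combines EMP_CODE, EMP_NAME and SALARY for matching codes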
2. Outer Join:
Example:
FACT_WORKERS
Input:
(EMPLOYEE ⋈ FACT_WORKERS)
Output:
Input:
EMPLOYEE ⟕ FACT_WORKERS
Input:
EMPLOYEE ⟖ FACT_WORKERS
Output:
Input:
EMPLOYEE ⟗ FACT_WORKERS
Output:
Example:
CUSTOMER RELATION
CLASS_ID NAME
1 John
2 Harry
3 Jackson
PRODUCT RELATION
PRODUCT_ID CITY
1 Delhi
2 Mumbai
3 Noida
Input:
CUSTOMER ⋈ PRODUCT
Output:
Suppose a user executes a query. As we have learned, there are various
methods of extracting the data from the database. For example, in SQL a
user wants to fetch the records of the employees whose salary is greater
than or equal to 10000. For doing this, the following query is written:
select emp_name from Employee where salary >= 10000;
After translating the given query, we can execute each relational algebra
operation by using different algorithms. So, in this way, a query processing
begins its working.
Evaluation
Optimization
o The cost of query evaluation can vary for different types of queries. The
system is responsible for constructing the evaluation plan, so the user
does not need to write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan
which minimizes its cost. This task is performed by the database system
and is known as Query Optimization.
o For optimizing a query, the query optimizer should have an estimated
cost analysis of each operation. It is because the overall operation cost
depends on the memory allocations to several operations, execution
costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query
and produces the output of the query.
DEPARTMENT
DNo DName Location
Example 1
Let us consider the query as the following.
$$\pi_{EmpID}(\sigma_{EName = \text{"ArunKumar"}}(EMPLOYEE))$$
The corresponding query tree will be −
Example 2
Let us consider another query involving a join.
$$\pi_{EName, Salary}(\sigma_{DName = \text{"Marketing"}}(DEPARTMENT) \bowtie_{DNo = DeptNo} (EMPLOYEE))$$
Following is the query tree for the above query.
Step 2 − Query Plan Generation
After the query tree is generated, a query plan is made. A query plan is an
extended query tree that includes access paths for all operations in the
query tree. Access paths specify how the relational operations in the tree
should be performed. For example, a selection operation can have an
access path that gives details about the use of B+ tree index for selection.
Besides, a query plan also states how the intermediate tables should be
passed from one operator to the next, how temporary tables should be
used and how operations should be pipelined/combined.
Step 3− Code Generation
Code generation is the final step in query optimization. It is the executable
form of the query, whose form depends upon the type of the underlying
operating system. Once the query code is generated, the Execution
Manager runs it and produces the results.
1. Cost-based Optimization (Physical): -
This is based on the cost of the query. The query can use different paths
based on indexes, constraints, sorting methods etc. This method mainly
uses statistics like record size, number of records, number of records per
block, number of blocks, table size, whether the whole table fits in a block,
organization of tables, uniqueness of column values, size of columns etc.
Suppose we have a series of tables joined in a query:
T1 ⋈ T2 ⋈ T3 ⋈ T4 ⋈ T5 ⋈ T6
For the above query we can have any order of evaluation: we can start by
taking any two tables in any order and start evaluating the query. In
general, we can have join combinations in (2(n−1))! / (n−1)! ways. For
example, if we have 5 tables involved in the join, then we can have
8! / 4! = 1680 combinations. But when the query optimizer runs, it does not
always evaluate all of these orders. It uses dynamic programming, where it
generates the cost for the join order of any combination of tables. The cost
is calculated and generated only once, and this least cost for every table
combination is stored in the database and used in the future. That is, if we
have a set of tables T = {T1, T2, T3 … Tn}, it generates the least-cost
combination for all the tables and stores it.
• Dynamic Programming
As we learnt above, the least cost for the joins of any combination of tables
is generated here. These values are stored in the database, and when those
tables are used in a query, this combination is selected for evaluating the
query.
While generating the cost, it follows the steps below:
Suppose we have a set of tables, T = {T1, T2, T3 … Tn}, in a DB. It picks the
first table and computes the cost of joining it with each of the remaining
tables in set T. It calculates the cost for each of the tables and then chooses
the best (lowest) cost. It continues doing the same with the rest of the
tables in set T. It will generate 2^n − 1 cases, and it selects the lowest cost
and stores it. When a query uses those tables, it checks for the costs here
and that combination is used to evaluate the query. This is called dynamic
programming.
In this method, the time required to find the optimized query is of the order
of 3^n, where n is the number of tables. Suppose we have 5 tables; then the
time required is 3^5 = 243, which is less than finding all the combinations of
tables and then deciding the best combination (1680). Also, the space
required for computing and storing the cost is of the order of 2^n; in the
above example, it is 2^5 = 32.
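As a rough illustration of the dynamic-programming idea described above (the table cardinalities and the toy cost model below are made-up assumptions), this sketch keeps, for every subset of tables, the cheapest join plan found so far:

# Sketch: dynamic-programming join ordering over subsets of tables.
# Cardinalities and the cost model are made-up assumptions for illustration.

from itertools import combinations

rows = {"T1": 1000, "T2": 50, "T3": 2000, "T4": 10}   # assumed table cardinalities

def join_size(tables):
    """Toy estimate of an intermediate result size (illustrative only)."""
    size = 1
    for t in tables:
        size *= rows[t]
    return size // (10 ** (len(tables) - 1))

best = {}                                  # frozenset of tables -> (cost, plan text)
for t in rows:
    best[frozenset([t])] = (0, t)          # a single table costs nothing to "join"

tables = list(rows)
for k in range(2, len(tables) + 1):
    for subset in map(frozenset, combinations(tables, k)):
        # Try every split of the subset into two smaller, already-solved halves.
        candidates = []
        for r in range(1, k):
            for left in map(frozenset, combinations(sorted(subset), r)):
                right = subset - left
                cost = best[left][0] + best[right][0] + join_size(left) + join_size(right)
                candidates.append((cost, f"({best[left][1]} ⋈ {best[right][1]})"))
        best[subset] = min(candidates)     # remember only the cheapest plan per subset

cost, plan = best[frozenset(tables)]
print(plan, "with estimated cost", cost)

The best plan for the full set is assembled from the stored best plans of its subsets, which is why the cost table only has to be computed once and can be reused for later queries over the same tables.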
• Left Deep Trees
This is another method of determining the cost of the joins. Here, the tables
and joins are represented in the form of a tree. The joins always form the
root of the tree, and a table is kept on the right side of each join node. The
left-hand side of the root always points to the next join, so the tree gets
deeper and deeper on the left-hand side. Hence it is called a left deep tree.
Here, instead of calculating the best join cost for a set of tables, the best
join cost for joining with each table is calculated. In this method, the time
required to find the optimized query is of the order of n·2^n, where n is the
number of tables. Suppose we have 5 tables; then the time required is
5·2^5 = 160, which is less than with dynamic programming. Also, the space
required for computing and storing the cost is of the order of 2^n; in the
above example, it is 2^5 = 32, the same as dynamic programming.
• Interesting Sort Orders
This method is an enhancement to dynamic programming. Here, while
calculating the best join-order costs, it also considers sorted tables. It
assumes that calculating the join orders on sorted tables would be efficient.
That is, suppose we have unsorted tables T1, T2, T3 … Tn and we have a
join on these tables:
(T1 ⋈ T2) ⋈ T3 ⋈ … ⋈ Tn
This method uses the hash join or merge join method to calculate the cost.
A hash join will simply join the tables; with the merge join method we get
sorted output, but it is costlier than a hash join. Even though the merge
join is costlier at this stage, when it moves on to join with the third table,
that join will need less effort to sort the tables, because its first input is the
already sorted result of the first two tables. Hence it reduces the total cost
of the query.
But the number of tables involved in the join is usually relatively small, and
this cost/space difference will be hardly noticeable.
All these cost-based optimizations are expensive and are suitable for large
amounts of data. There is another method of optimization called heuristic
optimization, which is cheaper compared to cost-based optimization.
2. Heuristic Optimization (Logical): -
This method is also known as rule-based optimization. It is based on
equivalence rules on relational expressions; hence the number of query
combinations is reduced, and so the cost of the query is reduced too.
This method creates a relational tree for the given query based on the
equivalence rules. These equivalence rules, by providing an alternative way
of writing and evaluating the query, give a better path to evaluate it. A rule
need not help in all cases; it has to be examined after applying it. The most
important set of rules followed in this method is listed below:
• Perform all the selection operations as early as possible in the
query. This should be the first and foremost set of actions on the
tables in the query. By performing the selection operation first, we
can reduce the number of records involved in the query, rather
than using the whole tables throughout the query.
Suppose we have a query to retrieve the students with age 18 and studying
in class DESIGN_01. We can get all the student details from STUDENT
table, and class details from CLASS table. We can write this query in two
different ways.
Here both queries return the same result. But when we observe them
closely, we can see that the first query joins the two tables first and then
applies the filters. That means it traverses the whole tables to join, so the
number of records involved is larger. The second query applies the filters
on each table first. This reduces the number of records from each table (in
the CLASS table, the number of records reduces to one in this case!), and
only then joins these intermediary tables. Hence the cost in this case is
comparatively less, as the sketch below also shows.
Instead of writing the query directly, the optimizer creates the relational
algebra and tree for the above case.
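The two alternative query formulations themselves are not reproduced here, but the sketch below imitates them on made-up STUDENT and CLASS data and counts the rows reaching the join, which is where the saving comes from:

# Sketch: "push selections down" illustrated on made-up STUDENT and CLASS data.
# Counting intermediate rows shows why filtering before the join is cheaper.

student = [{"sid": i, "name": f"S{i}", "age": 17 + (i % 3), "class_id": i % 4} for i in range(100)]
class_ = [{"class_id": c, "class_name": name} for c, name in
          enumerate(["DESIGN_01", "MATH_02", "PHY_03", "CHEM_04"])]

def join(r, s, key):
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]

# Plan 1: join first, then filter -> the join works on every student row.
joined_all = join(student, class_, "class_id")
plan1 = [r for r in joined_all if r["age"] == 18 and r["class_name"] == "DESIGN_01"]

# Plan 2: filter each table first, then join -> far fewer rows reach the join.
young  = [r for r in student if r["age"] == 18]
design = [c for c in class_ if c["class_name"] == "DESIGN_01"]
plan2 = join(young, design, "class_id")

print(len(joined_all), "rows produced by the early join")
print(len(young), "x", len(design), "rows fed to the late join")
print(plan1 == plan2)   # both plans return the same result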
• Perform all the projections as early as possible in the query. This
is similar to selection, but reduces the number of columns in the
query.
Suppose for example, we have to select only student name, address and
class name of students with age 18 from STUDENT and CLASS tables.
Here again, both queries look alike and give the same results. But when we
compare the number of records and attributes involved at each stage, the
second query uses fewer records and is hence more efficient.
These rules need not always hold. Their benefit also depends on the table
size, column size, type of selection, projection, join, sort, constraints,
indexes, statistics etc. The above optimizations describe the best general
way of optimizing queries.
Case 1: Relations that are smaller than or of a size comparable to main
memory.
In Case 1, the small or medium-sized relations do not exceed the size of
main memory, so we can fit them in memory and use standard sorting
methods such as quicksort, merge sort, etc.
Case 2: Relations whose size exceeds main memory.
For Case 2, the standard algorithms do not work well. Thus, for relations
whose size exceeds the memory size, we use the External Sort-Merge
algorithm.
Sorting relations that do not fit in memory, because their size is larger than
the memory size, is known as external sorting. The external sort-merge is
the most suitable method used for external sorting.
i = 0;
repeat
    read M blocks of the relation, or the rest of the relation if it is smaller;
    sort the in-memory part of the relation;
    write the sorted data to run file Ri;
    i = i + 1;
until the end of the relation
In Stage 1, we can see that we are performing the sorting operation on the
disk blocks. After completing the steps of Stage 1, proceed to Stage 2.
However, if the size of the relation is larger than the memory size, then M
or more runs will be generated in Stage 1, and it is not possible to allocate a
single buffer block to each run while processing Stage 2. In such a case, the
merge operation proceeds in multiple passes. Since M−1 input buffer blocks
are available, each merge can take M−1 runs as its input.
So, the initial phase works in the following way:
o It merges the first M−1 runs to get a single run for the next pass.
o Similarly, it merges the next M−1 runs. This step continues until it has
processed all the initial runs. Here, the number of runs has been reduced
by a factor of M−1. Still, if this reduced value is greater than or equal to
M, another pass is created, whose input is the runs created by the first
pass.
o Each pass reduces the number of runs by a factor of M−1. This repeats as
many times as needed until the number of runs is less than M.
o A final pass then produces the sorted output.
Every pass reads and writes each block of the relation once, with two
exceptions:
o The final pass can produce the sorted output without writing its result to
disk.
o Some runs may not be read or written during a pass.
Ignoring these small exceptions, the total number of block transfers for
external sorting comes out to:
b_r (2⌈log_{M−1}(b_r / M)⌉ + 1)
where b_r is the number of blocks containing records of the relation.
We also need to add the disk-seek cost, because each run needs seeks for
reading and writing its data. If in Stage 2, i.e. the merge phase, each run is
allocated b_b buffer blocks (each run reads b_b blocks at a time), then each
merge needs ⌈b_r / b_b⌉ seeks for reading the data. The output is written
sequentially, so if it is on the same disk as the input runs, the head will
have to move between the writes of consecutive blocks. Therefore, add a
total of 2⌈b_r / b_b⌉ seeks for each merge pass, and the total number of
seeks comes out to:
2⌈b_r / M⌉ + ⌈b_r / b_b⌉ (2⌈log_{M−1}(b_r / M)⌉ − 1)
Thus, we need to calculate the total number of disk seeks for analyzing the
cost of the External merge-sort algorithm.
Let's understand the working of the external merge-sort algorithm and also
analyze the cost of the external sorting with the help of an example.
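The worked example itself is not reproduced here, but the following sketch imitates the two stages in memory (blocks are modeled as single integers, run files as lists, and M is an assumed buffer size): Stage 1 builds sorted runs of at most M blocks, and Stage 2 merges up to M−1 runs at a time:

# Sketch: external sort-merge simulated in memory.
# "Blocks" are single integers and run files are lists; M is an assumed buffer size.

import heapq

def create_runs(relation, M):
    """Stage 1: read M blocks at a time, sort them, and write each sorted run."""
    return [sorted(relation[i:i + M]) for i in range(0, len(relation), M)]

def merge_runs(runs, M):
    """Stage 2: repeatedly merge up to M-1 runs until a single sorted run remains."""
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), M - 1):
            group = runs[i:i + M - 1]
            merged.append(list(heapq.merge(*group)))   # one (M-1)-way merge
        runs = merged
    return runs[0]

relation = [27, 3, 19, 8, 31, 12, 5, 24, 16, 1, 40, 7]
M = 3                                  # pretend only 3 blocks fit in memory
runs = create_runs(relation, M)        # [[3, 19, 27], [8, 12, 31], [5, 16, 24], [1, 7, 40]]
print(merge_runs(runs, M))             # fully sorted relation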
Temporal Database: -
The two different notions of time - valid time and transaction time - allow
the distinction of different forms of temporal databases. A historical
database stores data with respect to valid time, a rollback database stores
data with respect to transaction time, and a bitemporal database stores
data with respect to both valid time and transaction time.
Commercial DBMS are said to store only a single state of the real world,
usually the most recent state. Such databases usually are called snapshot
databases. A snapshot database in the context of valid time and transaction
time is depicted in the following picture:
On the other hand, a bitemporal DBMS such as TimeDB stores the history
of data with respect to both valid time and transaction time. Note that the
history of when data was stored in the database (transaction time) is
limited to past and present database states, since it is managed by the
system directly which does not know anything about future states.
1. Valid Time
2. Transaction Time
3. Bitemporal Time
Valid Time: - Valid time is a time period during which a fact is true in the
real world.
Use-
Transaction Time: - It is the time period during which a fact stored in the
database was known.
Use-
• To roll back the database (more preferable for this purpose).
Name Department Salary Transaction Time Valid start time Valid end time
James PHP 10000 2016 2005 2006
Ben SEO 20000 2016 2000 2002
Eric HR 15000 2016 2003 2007
Multimedia Database: -
A multimedia database is a collection of interrelated multimedia data that
includes text, graphics (sketches, drawings), images, animations, video,
audio etc., and has vast amounts of multisource multimedia data. The
framework that manages different types of multimedia data which can be
stored, delivered and utilized in different ways is known as a multimedia
database management system. There are three classes of multimedia
database, which include static media, dynamic media and dimensional
media.
• Knowledge dissemination: -
Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example:
Electronic books.
Data Mining: -
Data warehouses:
A Data Warehouse is the technology that collects the data from various
sources within the organization to provide meaningful business insights.
The huge amount of data comes from multiple places such as Marketing
and Finance. The extracted data is utilized for analytical purposes and helps
in decision- making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.
Data Repositories:
Object-Relational Database:
Transactional Database:
These are the following areas where data mining is widely used:
Apprehending a criminal is not a big deal, but bringing out the truth from
him is a very challenging task. Law enforcement may use data mining
techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks
meaningful patterns in data, which is usually unstructured text. The
information collected from the previous investigations is compared, and a
model for lie detection is constructed.
Although data mining is very powerful, it faces many challenges during its
execution. Various challenges could be related to performance, data,
methods, and techniques, etc. The process of data mining becomes effective
when the challenges or problems are correctly recognized and adequately
resolved.
Incomplete and noisy data:
The process of extracting useful data from large volumes of data is data
mining. The data in the real-world is heterogeneous, incomplete, and noisy.
Data in huge quantities will usually be inaccurate or unreliable. These
problems may occur due to data measuring instrument or because of
human errors. Suppose a retail chain collects phone numbers of customers
who spend more than $ 500, and the accounting employees put the
information into their system. The person may make a digit mistake when
entering the phone number, which results in incorrect data. Even some
customers may not be willing to disclose their phone numbers, which
results in incomplete data. The data could also get changed due to human
or system error. All these consequences (noisy and incomplete data) make
data mining challenging.
Data Distribution:
Complex Data:
Performance:
Data Visualization:
Data mining includes the utilization of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets.
These tools can incorporate statistical models, machine learning
techniques, and mathematical algorithms, such as neural networks or
decision trees. Thus, data mining incorporates analysis and prediction.
In recent data mining projects, various major data mining techniques have
been developed and used, including association, classification, clustering,
prediction, sequential patterns, and regression.
1. Classification:
3. Regression:
4. Association Rules:
This data mining technique helps to discover a link between two or more
items. It finds a hidden pattern in the data set.
o Lift:
This measurement technique measures how much more often items A and
B are purchased together than expected if they were independent. It is the
confidence divided by the support of item B:
(Confidence) / (Support of item B)
o Support:
This measurement technique measures how often items A and B are
purchased together, compared to the overall dataset:
(Item A + Item B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is purchased
when item A is purchased as well.
(Item A + Item B)/ (Item A)
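A small sketch computing the three measures for a rule A → B over a made-up set of market-basket transactions (the item names and data are illustrative only):

# Sketch: support, confidence and lift for the rule A -> B over toy transactions.

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """How often B appears in transactions that contain A."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence relative to how common B is overall."""
    return confidence(a, b) / support(b)

A, B = {"bread"}, {"butter"}
print("support   :", support(A | B))    # (A + B) / entire dataset = 2/5
print("confidence:", confidence(A, B))  # (A + B) / A = 0.5
print("lift      :", lift(A, B))        # confidence / support of B = 1.25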
5. Outlier detection:
This type of data mining technique relates to the observation of data items
in the data set which do not match an expected pattern or expected
behavior. This technique may be used in various domains like intrusion
detection, fraud detection, etc. It is also known as Outlier Analysis or
Outlier mining. An outlier is a data point that diverges too much from the
rest of the dataset. The majority of real-world datasets have outliers.
Outlier detection plays a significant role in the data mining field. It is
valuable in numerous areas like network intrusion identification, credit or
debit card fraud detection, detecting outliers in wireless sensor network
data, etc.
6. Sequential Patterns:
7. Prediction:
Data Warehouse: -
A Data Warehouse is a group of data specific to the entire organization, not
only to a particular group of users.
It is not used for daily operations and transaction processing but for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
Subject-Oriented: -
Non-Volatile: -
The idea of data warehousing dates back to the late 1980s, when IBM
researchers Barry Devlin and Paul Murphy established the "Business Data
Warehouse."
4) For data consistency and quality: By bringing the data from different
sources to a common place, the user can effectively ensure uniformity
and consistency in the data.
5) High response time: Data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant
degree of flexibility and quick response time.
Flat Files
Meta Data
A set of data that defines and gives information about other data.
Meta data summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For example,
author, date created, date modified and file size are examples of very basic
document metadata.
The area of the data warehouse saves all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.
We must clean and process the operational data before putting it into the
warehouse.
The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
Properties of Data Warehouse Architectures
Single-Tier Architecture: -
Two-Tier Architecture: -
The requirement for separation plays an essential role in defining the two-
tier architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight a
separation between physically available sources and the data warehouse, it
in fact consists of four subsequent data flow stages:
Three-Tier Architecture: -
• The Data Source Layer is the layer where the data from the source is
encountered and subsequently sent to the other layers for desired
operations.
• The data can be of any type.
• The Source Data can be a database, a Spreadsheet or any other kinds
of a text file.
• The Source Data can be of any format. We cannot expect to get data
with the same format considering the sources are vastly different.
• In Real Life, Some examples of Source Data can be
• Log Files of each specific application or job or entry of employers in a
company.
• Survey Data, Stock Exchange Data, etc.
• Web Browser Data and many more.
The data received by the Source Layer is fed into the Staging Layer, where
the first process that takes place with the acquired data is extraction.
• The Data in Landing Database is taken and several quality checks and
staging operations are performed in the staging area.
• The Structure and Schema are also identified and adjustments are
made to data that are unordered thus trying to bring about a
commonality among the data that has been acquired.
• Having a place or set up for the data just before transformation and
changes is an added advantage that makes the Staging process very
important.
• It makes data processing easier.
• This is the layer where users get to interact with the data stored in the
data warehouse.
• Queries and several tools will be employed to get different types of
information based on the data.
• The information reaches the user through the graphical
representation of data.
• Reporting Tools are used to get Business Data and Business logic is
also applied to gather several kinds of information.
• Meta Data Information and System operations and performance are
also maintained and viewed in this layer.