Advanced Database Su NEW
Data mining
Methods for machine learning such as association, correlation, classification & clustering.
Employed in Market Basket analysis, Web usage mining, continuous production, etc.
Apriori Algorithm
- Used for Association Rule learning.
• Support: the support of an item x is the ratio of the number of transactions in which x appears to the total number of transactions.
- Rules in Apriori are good when the confidence in each rule is greater than or equal to the minimum
confidence given in the problem.
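As a minimal sketch of these two measures, the snippet below computes support and confidence over a small list of transactions. The transactions, the function names, and the thresholds are hypothetical and only serve as illustration; confidence is taken as support(A and B together) divided by support(A), the usual definition.

```python
# Hypothetical market-basket transactions (not taken from the notes).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def is_good_rule(antecedent, consequent, transactions, min_confidence):
    """A rule is kept when its confidence reaches the minimum confidence."""
    return confidence(antecedent, consequent, transactions) >= min_confidence
```

With these transactions, bread appears in 3 of 4 baskets and bread with milk in 2 of 4, so the rule bread -> milk has confidence 2/3 and passes a minimum confidence of 0.6 but not 0.7.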
Transactions
Single-User Database Systems
A system is single-user if at most one user at a time can use the system.
It does not have multiprogramming.
Multi-User Database Systems
Multiple users can use the system, and access & update the database concurrently.
This is possible because of the concept of multiprogramming.
• The operations performed in a transaction include one or more of database operations like insert,
delete, update or retrieve data.
• It is an atomic process that is either performed to completion or not performed at all.
• Transaction involving only data retrieval without any data update is called read-only transaction.
Transaction Operations
Low-level operations
begin_transaction: Marker that specifies the start of transaction execution.
read_item or write_item: Database operations that may be interleaved with main memory operations as a part of the transaction.
end_transaction: Marker that specifies the end of the transaction.
commit: Signal to specify that the transaction has been successfully completed in its entirety and will not be undone.
rollback: Signal to specify that the transaction has been unsuccessful, so all temporary changes in the database are undone.
A committed transaction cannot be rolled back.
High-level operations
Can be divided into a number of low-level tasks or operations.
A “Data Update Operation” is divided into three tasks:
read_item(): reads data from storage into main memory.
modify_item(): changes the value in main memory.
write_item(): writes the modified value from main memory to storage.
Database access is restricted to read_item() & write_item() operations.
Transaction States
A transaction may go through a subset of five states:
Active: The initial state of every transaction. A transaction remains in this state while it is
executing read, write or other operations.
Partially Committed: Enters this state after last statement of transaction has been executed.
Committed: Enters this state after successful completion of the transaction and system checks have
issued commit signal.
Failed: The transaction goes from partially committed state or active state to failed state when it is
discovered that normal execution can no longer proceed or system checks fail.
Aborted: This is the state after the transaction has been rolled back following a failure and the database
has been restored to the state it was in before the transaction began.
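The five states and the moves between them can be sketched as a small state machine. The transition table below is read off the descriptions above; the names State, ALLOWED, and advance are hypothetical.

```python
from enum import Enum

class State(Enum):
    ACTIVE = "active"
    PARTIALLY_COMMITTED = "partially committed"
    COMMITTED = "committed"
    FAILED = "failed"
    ABORTED = "aborted"

# Transitions as described in the notes; COMMITTED and ABORTED are terminal.
ALLOWED = {
    State.ACTIVE: {State.PARTIALLY_COMMITTED, State.FAILED},
    State.PARTIALLY_COMMITTED: {State.COMMITTED, State.FAILED},
    State.FAILED: {State.ABORTED},
    State.COMMITTED: set(),
    State.ABORTED: set(),
}

def advance(current, target):
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

A successful transaction walks ACTIVE, PARTIALLY_COMMITTED, COMMITTED; a failing one ends in ABORTED via FAILED. Any attempt to leave COMMITTED raises an error, which matches the rule that a committed transaction cannot be rolled back.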
2- Consistency: A transaction should take database from one consistent state to another consistent
state.
3- Isolation: There should not be any interference from other concurrent transactions that are
running simultaneously.
4- Durability: If a committed transaction brings about a change, that change should be durable in
database and not lost in case of any failure.
These updates now become permanent and are stored in nonvolatile memory.
Atomicity
• Transactions do not occur partially.
• Each transaction is considered as one unit and either runs to completion or is not executed at all.
Abort: If a transaction aborts, changes made to the database are not visible.
Commit: If a transaction commits, changes made are visible. Atomicity is also known as the ‘All or nothing
rule’.
• The following transaction T consisting of T1 and T2: Transfer of 100 from account X to account Y.
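A minimal sketch of such a transfer, using Python's built-in sqlite3 module. The table name, the starting balances, and the transfer helper are assumptions made for illustration. The debit and the credit run inside one transaction, so a failure between them rolls both back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Current balances as a dict, e.g. {'X': 500, 'Y': 200}."""
    return dict(conn.execute("SELECT name, balance FROM account"))

# Hypothetical starting balances for accounts X and Y.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("X", 500), ("Y", 200)])
conn.commit()
```

Calling transfer(conn, "X", "Y", 100) commits both updates. Calling it with an amount larger than X's balance aborts after the debit, and the rollback leaves both balances untouched, so the total of X and Y is the same before and after either call.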
Consistency
• This means that integrity constraints must be maintained so that the database is consistent before &
after the transaction.
• Referring to the example above, total amount before & after transaction must be maintained.
Isolation
• Transactions occur independently without interference.
• Changes occurring in a particular transaction will not be visible to any other transaction.
• This property ensures that executing transactions concurrently results in a state that is equivalent
to the state reached if they had been executed serially in some order.
• Suppose T has been executed till Read (Y) and then T’’ starts. As a result, interleaving of operations takes
place due to which T’’ reads the correct value of X but the incorrect value of Y and sum computed by :
Durability
• Ensures that once a transaction has completed execution, its updates & modifications to the database
are written to disk and persist even if a system failure occurs.
• These updates now become permanent & are stored in nonvolatile memory.
Types of Schedules
1- Serial Schedules: At any point of time, only one transaction is active, no overlapping of transactions.
Example: Take two transactions T1 & T2, each with some operations. With no interleaving of operations,
there are two possible outcomes: all of T1 runs and then all of T2, or all of T2 runs and then all of T1.
2- Parallel Schedules: More than one transaction is active simultaneously, and their operations
overlap in time.
Serializable schedule
• Used to find non-serial schedules that allow the transaction to execute concurrently without
interfering with one another.
• It identifies which schedules are correct when executions of transaction have interleaving of their
operations.
• A non-serial schedule will be serializable if its result is equal to result of its transactions executed
serially.
Testing of Serializability
• Serialization Graph is used to test Serializability of a schedule.
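• This graph is a pair G = (V, E), where V is a set of vertices (one per transaction) and E is a set of edges.
• The set of edges contains every edge Ti → Tj for which one of the three conditions holds:
1- Ti executes write_item(Q) before Tj executes read_item(Q).
2- Ti executes read_item(Q) before Tj executes write_item(Q).
3- Ti executes write_item(Q) before Tj executes write_item(Q).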
• If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti are executed before
the first instruction of Tj is executed.
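The test can be sketched in a few lines. A schedule is represented here as a list of (transaction, operation, item) steps with "R" for read_item and "W" for write_item; that encoding and the function names are assumptions. An edge Ti → Tj is added whenever an operation of Ti precedes an operation of a different transaction Tj on the same item and at least one of the two is a write. The schedule is conflict serializable exactly when the resulting graph has no cycle.

```python
def precedence_graph(schedule):
    """Edges (Ti, Tj) for every earlier/later pair of conflicting steps."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search with three colours to detect a directed cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: cycle found
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

def is_conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))
```

For example, the steps T1 reads A, T2 writes A, T1 writes A produce both T1 → T2 and T2 → T1, so that schedule is rejected.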
Conflicts in Schedules
• A conflict occurs when two active transactions perform incompatible operations. Two operations are said
to be in conflict when all of the following three conditions hold simultaneously:
• The two operations belong to different transactions.
• Both operations access the same data item.
• At least one of the operations is a write_item() operation, that is, it tries to modify the data item.
• A schedule is called conflict serializable if, after swapping non-conflicting operations, it can be
transformed into a serial schedule.
1- Conflict equivalence: Two schedules are said to be conflict equivalent if both contain the same set of
transactions and have the same order of conflicting pairs of operations.
2- Result equivalence: Two schedules producing identical results are said to be result equivalent.
3- View equivalence: Two schedules that perform similar action in a similar manner are said to be view
equivalent.
1- Conflict Equivalent
3- View Equivalent
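• Two schedules S1 and S2 are said to be view equivalent if they satisfy the following conditions:
1- Initial Read: An initial read of each data item must be the same in both schedules.
Suppose two schedules S1 and S2. In schedule S1, if a transaction T1 is reading the data item A, then in S2,
transaction T1 should also read A.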
2- Updated Read:
In schedule S1, if Ti is reading A which is updated by Tj then in S2 also, Ti should read A which is updated
by Tj.
3- Final Write: A final write must be the same between both the schedules.
In schedule S1, if a transaction T1 updates A at last then in S2, final writes operations should also be done
by T1.
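The view-equivalence conditions can be checked mechanically. The sketch below again assumes a schedule is a list of (transaction, operation, item) steps with "R" and "W"; the function names are hypothetical. For every read it records which transaction's write is being read (None for the initial value), and for every item it records the final writer.

```python
from collections import Counter

def view_profile(schedule):
    """Return (reads_from, final_writer) for a schedule."""
    last_writer = {}        # item -> transaction that wrote it most recently
    reads_from = Counter()  # (reader, item, writer or None) -> count
    for txn, op, item in schedule:
        if op == "R":
            reads_from[(txn, item, last_writer.get(item))] += 1
        elif op == "W":
            last_writer[item] = txn
    return reads_from, last_writer

def view_equivalent(s1, s2):
    """Same steps, same initial/updated reads, same final writes."""
    if Counter(s1) != Counter(s2):
        return False
    return view_profile(s1) == view_profile(s2)
```

A read whose recorded writer is None corresponds to the initial-read condition, a read with a named writer to the updated-read condition, and the last_writer mapping to the final-write condition.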
2- Result Equivalent Schedules
• Let X = 2 and Y = 5.
The main goal of recovery techniques is to ensure data integrity & consistency & prevent data loss.
• Rollback (undo) recovery: accomplished by undoing changes made by a transaction, using log records stored in the transaction log.
• Transaction log contains a record of all transactions that have been performed on database.
The system uses log records to undo the changes made by the failed transaction and restore the database
to its previous state.
• Redo recovery: accomplished by using log records stored in the transaction log to redo changes made by
a transaction that was in progress at the time of failure or error.
System uses log records to reapply changes made by transaction & restore database to its most recent
consistent state.
Checkpoint recovery
• Used to reduce recovery time by periodically saving state of database in checkpoint file.
• In event of failure, system can use checkpoint file to restore database to the most recent consistent
state before failure occurred, rather than going through entire log to recover database.
Recovery techniques
The transaction log contains information about the start & end of each transaction and any updates that
occur during the transaction.
• The log keeps track of all transaction operations that affect values of database items.
• The log is kept on disk.
• start_transaction(T): This log entry records that transaction T starts execution.
• read_item(T, X): This log entry records that transaction T reads the value of database item X.
• write_item(T, X, old_value, new_value): This log entry records that transaction T changes the value of
the database item X from old_value to new_value.
The old value is sometimes known as the before-image of X, and the new value is known as the after-image
of X.
• commit(T): This log entry records that transaction T has completed all accesses to the database
successfully and its effect can be committed (recorded permanently) to the database.
• checkpoint: A checkpoint is a mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point before which the DBMS was in a
consistent state, and all the transactions were committed.
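A minimal sketch of how such a log can drive recovery. The log is modelled as a list of tuples mirroring the entries above and the database as a plain dict; both are assumptions for illustration. Uncommitted transactions are undone with their before-images, committed ones are redone with their after-images. Real systems add checkpoints and many refinements that are left out here.

```python
def recover(db, log):
    """Undo writes of uncommitted transactions, redo writes of committed ones."""
    committed = {entry[1] for entry in log if entry[0] == "commit"}

    # Undo pass: walk the log backwards and restore before-images.
    for entry in reversed(log):
        if entry[0] == "write_item" and entry[1] not in committed:
            _, _, item, old_value, _ = entry
            db[item] = old_value

    # Redo pass: walk the log forwards and reapply after-images.
    for entry in log:
        if entry[0] == "write_item" and entry[1] in committed:
            _, _, item, _, new_value = entry
            db[item] = new_value
    return db

# Hypothetical log: T1 committed, T2 was still running at the crash.
log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 100, 50),
    ("commit", "T1"),
    ("start_transaction", "T2"),
    ("write_item", "T2", "Y", 20, 70),
]
```

If the crash left the stored values as X = 100 (T1's write never reached disk) and Y = 70 (T2's write did), recover restores X to 50 and Y to 20.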
Backup Techniques
Full Database Backup: Backs up the full database, including the data & the information needed to restore
the whole database, including full-text catalogs.
Differential Backup: Stores only data changes that have occurred since the last full database backup.
Transaction Log Backup: All events that have occurred in the database, like a record of every single
statement executed, are backed up.
Differential Backup:
When some data has changed many times since the last full database backup, a differential backup stores
most recent version of changed data.
Transaction Log Backup:
It is a backup of transaction log entries and contains all transactions that have happened to the database.
Through this, the database can be recovered to a specific point in time.
Starvation or Livelock: the situation in which a transaction has to wait for an indefinite period of time to
acquire a lock.
Starvation Reasons:
• Resource leak.
2- One transaction may hold a lock on a resource that another transaction needs, while the second
transaction may hold a lock on a resource that first transaction needs.
Both transactions are then blocked, waiting for the other to release the resource they need.
Deadlock detection
DBMSs often use various techniques to detect & resolve deadlocks automatically.
Timeout mechanisms: where a transaction is forced to release its locks after a certain period of time.
Deadlock detection algorithms: which periodically scan the transaction log for deadlock cycles and then
choose a transaction to abort to resolve the deadlock.
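One common way to look for such cycles is a wait-for graph, in which each transaction points at the transactions whose locks it is waiting for. The notes do not name a data structure, so the graph, the dict encoding, and the function name below are assumptions.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a wait cycle, or None if there is none.

    waits_for maps each transaction to the transactions it is waiting on.
    """
    visited = set()

    def visit(txn, path):
        if txn in path:
            return path[path.index(txn):]  # cycle closes back on txn
        if txn in visited:
            return None  # already explored, no cycle through here
        visited.add(txn)
        for other in waits_for.get(txn, ()):
            cycle = visit(other, path + [txn])
            if cycle:
                return cycle
        return None

    for txn in waits_for:
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None
```

When a cycle is returned, the DBMS would pick one of the listed transactions as the victim and abort it; how the victim is chosen (age, work done, priority) is a policy decision outside this sketch.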
Data visualization
Data visualization: Representation of data through use of common graphics, such as charts, plots,
infographics, & even animations.
• These visual displays of information communicate complex data relationships and data-driven insights
in a way that is easy to understand.
Common visualization techniques
1- Tables: This consists of rows and columns used to compare variables.
2- Pie charts and stacked bar charts: These graphs are divided into sections that represent parts of a
whole.
They provide a simple way to organize data and compare the size of each component to one another.
3- Line charts and area charts: These visuals show change in one or more quantities by plotting a series
of data points over time and are frequently used within predictive analytics.
4- Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the
bars).
5- Scatter plots: These visuals are beneficial in revealing the relationship between two variables.
6- Heat maps: These graphical displays are helpful in visualizing behavioral data by
location.
7- Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles.
There are many different algorithms used for cluster analysis, such as
k-means, hierarchical clustering, & density-based clustering.
The choice of algorithm will depend on the specific requirements of the analysis and the nature of the
data being analyzed.
- It helps marketers to find the distinct groups in their customer base and they can characterize their
customer groups by using purchasing patterns.
- It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with
the same capabilities.
• Inverted indexes are widely used in search engines, database systems, and other applications where
efficient text search is required.
• Inverted index is an index data structure storing a mapping from content, such as words or numbers, to
its locations in a document or a set of documents. In simple words, it is a HashMap-like data structure
that directs you from a word to a document or a web page.
Inverted Index
• Document 1: The quick brown fox jumped over the lazy dog.
• To create an inverted index for these documents, we first tokenize the documents into terms, as follows.
• Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Next, we create an index of the terms, where each term points to a list of documents that contain that
term, as follows.
The -> Document 1, Document 2
Lazy -> Document 1, Document 2
• To search for documents containing a particular term or set of terms, the search engine queries the
inverted index for those terms and retrieves the list of documents associated with each term. The search
engine can then use this information to rank the documents based on relevance to the query and present
them to the user in order of importance.
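The build-and-query steps above can be sketched as follows. Document 1 is the sentence from the notes; the text of Document 2 is a hypothetical stand-in, since it is not given here. The function names are also assumptions, and ranking by relevance is left out.

```python
import re
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "The lazy dog slept in the sun.",  # hypothetical second document
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Here a multi-term query is treated as an AND over the per-term document lists, which is one simple choice; a search engine would typically score and order the matches afterwards.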
There are two types of inverted indexes:
1- Record-Level Inverted Index: Record Level Inverted Index contains a list of references to
documents for each word.
2- Word-Level Inverted Index: Word Level Inverted Index additionally contains the positions of each
word within a document. The latter form offers more functionality but needs more processing power and
space to be created.
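A word-level index can be sketched by storing, for every word, the positions at which it occurs in each document. The layout (word to document id to list of positions) and the helper names are assumptions; the example uses Document 1 from above. The extra positional data is what allows adjacency or phrase queries.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map word -> {doc_id: [positions]} with positions counted from 0."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, raw in enumerate(text.lower().split()):
            word = raw.strip(".,!?")
            if word:
                index[word][doc_id].append(position)
    return index

def adjacent(index, first, second):
    """Ids of documents in which `second` directly follows `first`."""
    docs = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            docs.add(doc_id)
    return docs

docs = {1: "The quick brown fox jumped over the lazy dog."}
positional = build_positional_index(docs)
```

In this example "the" is recorded at positions 0 and 6 of Document 1, and a query for "lazy" followed by "dog" matches because their positions are 7 and 8.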
If we index by (text, word), the index with a location in the text is:
• Fetch the Document.
• Removing of Stop Words: Stop words are the most frequently occurring and least useful words in
documents, like “I”, “the”, “we”, “is”, and “an”.
• Stemming of Root Word: Whenever I want to search for “cat”, I want to see a document that has
information about it. But the word present in the document is called “cats” or “catty” instead of “cat”. To
relate both words, I’ll chop some part of every word I read so that I could get the “root word”. There are
standard tools for performing this like “Porter’s Stemmer”.
• Record Document IDs: If the word is already present, add a reference to the document to its index entry;
otherwise create a new entry. Add additional information like the frequency of the word, location of the word, etc.
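The indexing steps can be strung together as in the sketch below. The stop-word set holds only the examples listed above, and naive_stem is a deliberately crude suffix stripper written for this illustration; it is not Porter's Stemmer and will mangle many real words. All names are hypothetical.

```python
STOP_WORDS = {"i", "the", "we", "is", "an"}  # only the examples from the notes

def naive_stem(word):
    """Toy stemmer: strip a trailing 'ty' or 's' if at least 3 letters remain."""
    for suffix in ("ty", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def add_document(index, doc_id, text):
    """Drop stop words, stem the rest, and record positions per document."""
    for position, word in enumerate(text.lower().split()):
        if word in STOP_WORDS:
            continue
        term = naive_stem(word)
        # Create the entry on first sight, otherwise extend the existing one.
        index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index
```

The length of each position list gives the frequency of the term in that document, so the same structure carries both pieces of additional information mentioned above.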
Starvation Example:
• Suppose there are 3 transactions, namely T1, T2, and T3, in a database that are trying to acquire a lock on
data item ‘I’.
• Now, suppose the scheduler grants the lock to T1(maybe due to some priority), and the other two
transactions are waiting for the lock.
• As soon as the execution of T1 is over, another transaction T4 arrives and requests a lock on
data item I. This time the scheduler grants the lock to T4, and T2 and T3 have to wait again. In this way, if new
transactions keep on requesting the lock, T2 and T3 may have to wait for an indefinite period of time,
which leads to Starvation.
Simultaneously, Transaction T2 holds locks on those very rows (Which T1 needs to update) in the Grades
table but needs to update the rows in the Student table held by Transaction T1.
• Now, the main problem arises. Transaction T1 will wait for transaction T2 to give up the lock, and
similarly, transaction T2 will wait for transaction T1 to give up the lock. As a consequence, all activity
comes to a halt and remains at a standstill forever unless the DBMS detects the deadlock and aborts one
of the transactions.