Advanced Database Su NEW

The document discusses advanced database concepts, focusing on data mining methods such as association rule learning, particularly the Apriori algorithm, and transaction management in database systems. It outlines transaction properties (ACID), scheduling types, deadlock detection, recovery techniques, and data visualization methods. Additionally, it covers clustering algorithms and the use of inverted indexes for efficient information retrieval.

Uploaded by

salmaforbank
Copyright
© All Rights Reserved
Advanced Database

Data mining
Data mining draws on machine learning methods such as association, correlation, classification & clustering.

Association Rule Learning: a very important concept in machine learning.

It is employed in market basket analysis, web usage mining, continuous production, etc.

Apriori Algorithm
- Used for association rule learning.

- Searches for frequent itemsets (sets of items that often occur together) in a dataset of transactions.

- Builds on associations & correlations between itemsets.

- Assumes that any subset of a frequent itemset must also be frequent.

• Support: the support of item x is the ratio of the number of transactions in which x appears to the total number of transactions.

• Confidence: the confidence of a rule (x => y) is the likelihood that item y is purchased when item x is purchased, i.e. support(x and y together) / support(x).

This measure takes into account the popularity of item x.

- A rule in Apriori is accepted when its confidence is greater than or equal to the minimum confidence given in the problem.
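
The support and confidence definitions above can be sketched in a few lines of Python. This is only an illustration: the basket contents and the function names `support` and `confidence` are made up for the example and are not part of the notes.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule x => y: support(x and y) / support(x)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

bread_support = support({"bread"}, transactions)                   # 3 of 4 baskets
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)      # 2 of the 3 bread baskets

# A rule is kept only if it reaches the minimum confidence of the problem.
min_confidence = 0.6  # assumed threshold for the example
rule_is_good = bread_to_milk >= min_confidence
```

With these baskets, bread appears in 3 of 4 transactions and milk appears in 2 of those 3, so the rule bread => milk passes a minimum confidence of 0.6.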

Transactions
Single-User Database Systems
• A system is single-user if at most one user at a time can use it.
• It does not use multiprogramming: a single CPU can execute at most one process at a time.
• Data is neither integrated nor shared with any other user.
• Example: personal computers.

Multi-User Database Systems
• Many users can use the system and access & update the database concurrently.
• This is possible because of the concept of multiprogramming.
• Most DBMSs are multi-user.
• Data is integrated & shared among users.
• Example: databases of banks, insurance agencies, stock exchanges, supermarkets.

A transaction is a program comprising a collection of database operations, executed as a logical unit of data processing.

• The operations performed in a transaction include one or more database operations such as insert, delete, update or retrieve data.

• It is an atomic process that is either performed to completion in its entirety or not performed at all.

• A transaction that only retrieves data and does not update it is called a read-only transaction.

Transaction Operations
Low-level operations
• begin_transaction: Marker that specifies the start of transaction execution.
• read_item or write_item: Database operations that may be interleaved with main memory operations as part of the transaction.
• end_transaction: Marker that specifies the end of the transaction.
• commit: Signal that the transaction has been successfully completed in its entirety and will not be undone.
• rollback: Signal that the transaction has been unsuccessful, so all temporary changes in the database are undone.
A committed transaction cannot be rolled back.

High-level operations
• A high-level operation can be divided into a number of low-level tasks or operations.
• For example, a "data update operation" is divided into three tasks:
read_item(): reads data from storage into main memory.
modify_item(): changes the value in main memory.
write_item(): writes the modified value from main memory to storage.
• Database access is restricted to read_item() & write_item() operations.
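
The three tasks of a data update operation can be sketched as follows. The dictionaries `storage` and `memory` are assumed stand-ins for the database on disk and the main-memory buffer; a real DBMS does not expose them like this.

```python
storage = {"X": 500}   # assumption: stands in for the database on disk
memory = {}            # assumption: stands in for main-memory buffers

def read_item(name):
    """Copy the item from storage into main memory."""
    memory[name] = storage[name]

def modify_item(name, change):
    """Change the value in main memory only; storage is untouched."""
    memory[name] = change(memory[name])

def write_item(name):
    """Write the modified value from main memory back to storage."""
    storage[name] = memory[name]

read_item("X")
modify_item("X", lambda value: value - 100)
value_on_disk_before_write = storage["X"]   # storage still holds the old value here
write_item("X")
```

Only read_item() and write_item() touch storage, which mirrors the statement that database access is restricted to those two operations.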
Transaction States
A transaction may go through a subset of five states:

Active: The initial state. The transaction remains in this state while it is executing read, write or other operations.

Partially Committed: The transaction enters this state after its last statement has been executed.

Committed: The transaction enters this state after it has completed successfully and the system checks have issued the commit signal.

Failed: The transaction goes from the partially committed state or the active state to the failed state when it is discovered that normal execution can no longer proceed or the system checks fail.

Aborted: The state after the transaction has been rolled back following a failure and the database has been restored to the state it was in before the transaction began.

Desirable Properties of Transactions: Any transaction must maintain the ACID properties.

1- Atomicity: A transaction is an atomic unit of processing. No partial update should exist.

2- Consistency: A transaction should take the database from one consistent state to another consistent state.

3- Isolation: A transaction should be executed as if it were the only one in the system.

There should not be any interference from other transactions that are running concurrently.

4- Durability: If a committed transaction brings about a change, that change should be durable in the database and not lost in case of any failure.
These updates become permanent and are stored in nonvolatile memory.

Atomicity
• Transactions do not occur partially.

• Each transaction is considered as one unit and either runs to completion or is not executed at all.

It involves the following two operations:

Abort: If a transaction aborts, changes made to the database are not visible.

Commit: If a transaction commits, changes made are visible.

Atomicity is also known as the ‘all or nothing rule’.

• Example: consider a transaction T, consisting of two parts T1 and T2, that transfers 100 from account X to account Y. T1 deducts 100 from X and T2 adds 100 to Y.

Consistency
• Integrity constraints must be maintained so that the database is consistent before & after the transaction.

• Consistency refers to the correctness of a database.

• In the transfer example above (X = 500 and Y = 200 before T), the total amount must be the same before & after the transaction.

Total before T occurs = 500 + 200 = 700.

Total after T occurs = 400 + 300 = 700.

Therefore, the database is consistent.

Inconsistency occurs if T1 completes but T2 fails: T is then incomplete, and the total would be 400 + 200 = 600.
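
A minimal sketch of the transfer as an all-or-nothing unit is shown below. The `transfer` function, its `fail_after_debit` switch and the snapshot-based rollback are assumptions made for illustration; a real DBMS rolls back using its log rather than a full copy of the data.

```python
def transfer(accounts, src, dst, amount, fail_after_debit=False):
    """Move amount from src to dst; undo everything if any step fails."""
    snapshot = dict(accounts)            # before-image kept for rollback
    try:
        accounts[src] -= amount          # T1: deduct from the source account
        if fail_after_debit:
            raise RuntimeError("simulated failure between T1 and T2")
        accounts[dst] += amount          # T2: add to the destination account
    except Exception:
        accounts.clear()
        accounts.update(snapshot)        # rollback: restore the before-image
        return False
    return True                          # commit

accounts = {"X": 500, "Y": 200}
committed = transfer(accounts, "X", "Y", 100)
after_commit = dict(accounts)

aborted = transfer(accounts, "X", "Y", 100, fail_after_debit=True)
after_abort = dict(accounts)
```

After the successful transfer the balances are 400 and 300; after the simulated failure they are unchanged, so the total stays at 700 in both cases.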

Isolation
• Transactions occur independently without interference.

• Changes made by a transaction are not visible to any other transaction until that transaction has committed.

• This property ensures that executing transactions concurrently results in a state equivalent to one in which they were executed serially in some order.

Let X = 500, Y = 500.

Consider two transactions T and T''. The figures below imply that T multiplies X by 100 and subtracts 50 from Y, while T'' reads X and Y and computes their sum.

• Suppose T has executed up to Read(Y), so the new X has been written but Y has not yet been updated, and then T'' starts. Because of this interleaving, T'' reads the correct value of X but an incorrect (not yet updated) value of Y, and the sum it computes:

T'': (X + Y = 50,000 + 500 = 50,500)

is not consistent with the sum at the end of transaction T:

T: (X + Y = 50,000 + 450 = 50,450).
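
The interleaving can be replayed step by step. The operations of T used here (multiply X by 100, subtract 50 from Y) are an assumption inferred from the figures in the example, and the dictionary `db` is a stand-in for the database.

```python
db = {"X": 500, "Y": 500}

# T, first part: read X, multiply by 100, write X.
db["X"] = db["X"] * 100

# T'' runs here, before T has updated Y: it reads X and Y and sums them.
sum_seen_by_t2 = db["X"] + db["Y"]

# T, second part: read Y, subtract 50, write Y.
db["Y"] = db["Y"] - 50

# Sum that a transaction running after T would see.
sum_after_t = db["X"] + db["Y"]
```

T'' computes 50,500 although no serial order of T and T'' produces that value: running T'' first gives 1,000 and running it after T gives 50,450.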


Durability

• Ensures that once a transaction has completed execution, its updates & modifications to the database are written to disk and persist even if a system failure occurs.

• These updates become permanent & are stored in nonvolatile memory.

• The effects of a committed transaction are never lost.

Schedule: the total order in which the operations of a set of transactions are executed.

Types of Schedules

1- Serial Schedules: At any point in time only one transaction is active; transactions do not overlap.

Example: Take two transactions T1 & T2, each with some operations. If there is no interleaving of operations, there are two possible orders:

a. Execute all operations of T1, followed by all operations of T2.

b. Execute all operations of T2, followed by all operations of T1.

2- Parallel Schedules: More than one transaction is active simultaneously; they contain operations that overlap in time.

Serializable schedule

• Serializability is used to find non-serial schedules that allow transactions to execute concurrently without interfering with one another.

• It identifies which schedules are correct when the operations of the transactions are interleaved.

• A non-serial schedule is serializable if its result is equal to the result of its transactions executed serially.
Testing of Serializability
• A serialization graph (also called a precedence graph) is used to test the serializability of a schedule.

• The graph is a pair G = (V, E), where V is a set of vertices and E is a set of edges.

• The set of vertices contains all transactions participating in the schedule.

• The set of edges contains all edges Ti → Tj for which one of the following three conditions holds:

1- Create an edge Ti → Tj if Ti executes write(Q) before Tj executes read(Q).

2- Create an edge Ti → Tj if Ti executes read(Q) before Tj executes write(Q).

3- Create an edge Ti → Tj if Ti executes write(Q) before Tj executes write(Q).

• If the precedence graph contains an edge Ti → Tj, then in any equivalent serial schedule all instructions of Ti are executed before the first instruction of Tj.

• If the precedence graph for schedule S contains a cycle, then S is not (conflict) serializable.

• If the precedence graph has no cycle, then S is (conflict) serializable.
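
The test can be sketched as below. The schedule representation (a list of `(transaction, operation, item)` tuples with `"R"` for read and `"W"` for write) and the function names are assumptions chosen for the example.

```python
def precedence_graph(schedule):
    """Return the set of edges (Ti, Tj) given by the three rules above."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            # Edge when different transactions touch the same item
            # and at least one of the two operations is a write.
            if ti != tj and item_i == item_j and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True              # back edge: the graph has a cycle
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(node) for node in graph)

# T2 writes A between T1's read and T1's write: edges T1->T2 and T2->T1.
interleaved = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
# All of T1 runs before T2: the only edge is T1->T2.
serial_like = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]

interleaved_serializable = not has_cycle(precedence_graph(interleaved))
serial_like_serializable = not has_cycle(precedence_graph(serial_like))
```

The first schedule yields a cycle and is rejected; the second has the single edge T1 → T2 and is accepted.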

Conflicts in Schedules

• A conflict occurs when two active transactions perform incompatible operations. Two operations are said to be in conflict when all of the following three conditions hold simultaneously:

• The two operations belong to different transactions.

• Both operations access the same data item.

• At least one of the operations is a write_item() operation, i.e. it tries to modify the data item.

Conflict Serializable Schedule

• A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations.

• Equivalently, a schedule is conflict serializable if it is conflict equivalent to a serial schedule.


Types of Schedule Equivalence:

1- Conflict equivalence: Two schedules are said to be conflict equivalent if both contain the same set of transactions and have the same order of conflicting pairs of operations.

2- Result equivalence: Two schedules producing identical results are said to be result equivalent.

3- View equivalence: Two schedules that perform similar actions in a similar manner are said to be view equivalent.

1- Conflict Equivalent

• One schedule can be transformed into the other by swapping non-conflicting operations. S2 is conflict equivalent to S1 if S1 can be converted to S2 by swapping non-conflicting operations.

• Two schedules are said to be conflict equivalent if and only if:

- They contain the same set of transactions.

- Each pair of conflicting operations is ordered in the same way in both.

3- View Equivalent

• Two schedules S1 and S2 are said to be view equivalent if they satisfy the following conditions:

1- Initial Read: The initial read of each data item must be the same in both schedules.

Suppose there are two schedules S1 and S2. If in S1 transaction T1 reads the initial value of data item A, then in S2 transaction T1 should also read the initial value of A.

2- Updated Read:

If in S1 transaction Ti reads a value of A that was updated by Tj, then in S2 Ti should also read the value of A updated by Tj.

3- Final Write: The final write on each data item must be the same in both schedules.

If in S1 transaction T1 performs the last update of A, then in S2 the final write on A should also be done by T1.

2- Result Equivalent Schedules

• Two schedules are result equivalent if they generate the same result.

Example: Check whether three schedules S1, S2 and S3 are result equivalent.

• Let X = 2 and Y = 5.

• On substituting these values, the results produced by each schedule are:

• Results by Schedule S1: X = 21 and Y = 10

• Results by Schedule S2: X = 21 and Y = 10

• Results by Schedule S3: X = 11 and Y = 10

• S1 and S2 produce identical results, so they are result equivalent. S3 produces a different value of X, so it is not result equivalent to S1 or S2.

Deadlock & Recovery


• Database recovery techniques are used in a DBMS to restore a database to a consistent state after a failure or error has occurred.

The main goal of recovery techniques is to ensure data integrity & consistency & prevent data loss.

There are mainly two types of recovery techniques used in a DBMS:

1- Rollback/Undo Recovery Technique
• Based on the principle of backing out or undoing the effects of a transaction that has not completed successfully due to a system failure or error.

• Accomplished by undoing the changes made by the transaction using the log records stored in the transaction log.

• The transaction log contains a record of all transactions that have been performed on the database.

The system uses the log records to undo the changes made by the failed transaction and restore the database to its previous state.

2- Commit/Redo Recovery Technique

• Based on the principle of reapplying to the database the changes made by a transaction that has completed successfully.

• Accomplished by using the log records stored in the transaction log to redo the changes of transactions that had committed by the time of the failure or error but whose updates may not yet have reached the database on disk.

The system uses the log records to reapply the changes made by the transaction & restore the database to its most recent consistent state.
Checkpoint recovery

• Used to reduce recovery time by periodically saving the state of the database in a checkpoint file.

• In the event of a failure, the system can use the checkpoint file to restore the database to the most recent consistent state before the failure occurred, rather than going through the entire log to recover the database.

Recovery techniques

• Recovery techniques are heavily dependent upon the existence of a special file known as the system log.

It contains information about the start & end of each transaction and any updates which occur during the transaction.

• The log keeps track of all transaction operations that affect the values of database items.

This information is needed to recover from transaction failure.

• The log is kept on disk. It contains the following types of entries:

• start_transaction(T): This log entry records that transaction T starts execution.

• read_item(T, X): This log entry records that transaction T reads the value of database item X.

• write_item(T, X, old_value, new_value): This log entry records that transaction T changes the value of the database item X from old_value to new_value.

The old value is sometimes known as the before-image of X, and the new value is known as the after-image of X.

• commit(T): This log entry records that transaction T has completed all accesses to the database successfully and its effect can be committed (recorded permanently) to the database.

• abort(T): This log entry records that transaction T has been aborted.

• checkpoint: A checkpoint is a mechanism by which all previous log records are removed from the system and stored permanently on the storage disk. A checkpoint declares a point before which the DBMS was in a consistent state and all transactions were committed.
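
A minimal sketch of how such a log can drive both techniques (redo for committed transactions, undo for the rest) follows. The tuple layout of the log records and the `recover` function are assumptions for illustration; real recovery managers are considerably more involved and also use checkpoints to shorten the scan.

```python
def recover(db, log):
    """Bring db to a consistent state using before- and after-images in the log."""
    committed = {record[1] for record in log if record[0] == "commit"}

    # Redo pass (forward): reapply after-images of committed transactions.
    for record in log:
        if record[0] == "write_item" and record[1] in committed:
            _, _, item, old_value, new_value = record
            db[item] = new_value

    # Undo pass (backward): restore before-images of uncommitted transactions.
    for record in reversed(log):
        if record[0] == "write_item" and record[1] not in committed:
            _, _, item, old_value, new_value = record
            db[item] = old_value
    return db

log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 500, 400),
    ("write_item", "T1", "Y", 200, 300),
    ("commit", "T1"),
    ("start_transaction", "T2"),
    ("write_item", "T2", "X", 400, 350),
    # crash happens here: T2 never committed
]

# Assumed disk state at the crash: T1's write to Y was lost, T2's write to X got through.
disk_at_crash = {"X": 350, "Y": 200}
recovered = recover(dict(disk_at_crash), log)
```

Here T1's lost update to Y is redone and T2's uncommitted update to X is undone, leaving X = 400 and Y = 300.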
Backup Techniques
Full Database Backup: Backs up the full database, including the data and the information needed to restore the whole database, including full-text catalogs.

Differential Backup: Stores only the data changes that have occurred since the last full database backup.

Transaction Log Backup: Backs up all events that have occurred in the database, like a record of every single statement executed.

Differential Backup:

When some data has changed many times since the last full database backup, a differential backup stores only the most recent version of the changed data.

Transaction Log Backup:

It is a backup of the transaction log entries and contains all transactions that have happened to the database.
Through this, the database can be recovered to a specific point in time.

Starvation or Livelock: the situation in which a transaction has to wait for an indefinite period of time to acquire a lock.

Starvation Reasons:

• The waiting scheme for locked items is unfair (e.g. a priority queue).

• Victim selection (the same transaction is selected as a victim repeatedly).

• Resource leak.

• A denial-of-service attack.

Deadlock can happen in multi-user environments when:

1- Two or more transactions are running concurrently and try to access the same data in a different order.

2- One transaction holds a lock on a resource that another transaction needs, while the second transaction holds a lock on a resource that the first transaction needs.

Both transactions are then blocked, each waiting for the other to release the resource it needs.
Deadlock detection
DBMSs often use various techniques to detect & resolve deadlocks automatically.

Timeout mechanisms: a transaction is forced to release its locks after a certain period of time.

Deadlock detection algorithms: these periodically check which transactions are waiting for locks held by which others (the wait-for graph), look for cycles, and then choose a transaction to abort to resolve the deadlock.

Data visualization
Data visualization: the representation of data through the use of common graphics, such as charts, plots, infographics, & even animations.

• These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.
Common visualization techniques
1- Tables: These consist of rows and columns used to compare variables.

2- Pie charts and stacked bar charts: These graphs are divided into sections that represent parts of a whole.

They provide a simple way to organize data and compare the size of each component to one another.

3- Line charts and area charts: These visuals show change in one or more quantities by plotting a series of data points over time and are frequently used within predictive analytics.

4- Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the bars).

5- Scatter plots: These visuals are beneficial in revealing the relationship between two variables.

6- Heat maps: These graphical displays are helpful in visualizing behavioral data by location.

7- Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles.

There are many different algorithms used for cluster analysis, such as k-means, hierarchical clustering, & density-based clustering.

The choice of algorithm depends on the specific requirements of the analysis and the nature of the data being analyzed.

Applications Of Cluster Analysis:

- It is widely used in image processing, data analysis, & pattern recognition.

- It helps marketers find the distinct groups in their customer base and characterize those groups by their purchasing patterns.

- It is used in biology to derive animal and plant taxonomies and to identify genes with the same capabilities.

- It also helps in information discovery by classifying documents on the web.


Inverted Index
• An inverted index is a data structure used in information retrieval systems to efficiently retrieve documents or web pages containing a specific term or set of terms. In an inverted index, the index is organized by terms (words), and each term points to a list of documents or web pages that contain that term.

• Inverted indexes are widely used in search engines, database systems, and other applications where efficient text search is required.

• An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a HashMap-like data structure that directs you from a word to a document or a web page.

Inverted Index Example
• Document 1: The quick brown fox jumped over the lazy dog.

Document 2: The lazy dog slept in the sun.

• To create an inverted index for these documents, we first tokenize the documents into terms, as follows.

• Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.

Document 2: The, lazy, dog, slept, in, the, sun.

Next, we create an index of the terms, where each term points to a list of documents that contain that term, as follows.

The -> Document 1, Document 2
Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2

• To search for documents containing a particular term or set of terms, the search engine queries the
inverted index for those terms and retrieves the list of documents associated with each term. The search
engine can then use this information to rank the documents based on relevance to the query and present
them to the user in order of importance.
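
The two example documents can be indexed and queried with a short sketch like the one below. The function names `tokenize`, `build_index` and `search` are assumptions, and the query simply intersects the document lists of all terms without any relevance ranking.

```python
import re
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "The lazy dog slept in the sun.",
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(documents)
lazy_docs = search(index, "lazy")          # both documents
lazy_sun_docs = search(index, "lazy sun")  # only Document 2
```

A set per term keeps the lookup and the intersection simple; a ranking step, as described above, would be applied to the returned ids afterwards.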
There are two types of inverted indexes:
1- Record-Level Inverted Index: contains, for each word, a list of references to the documents in which it appears.

2- Word-Level Inverted Index: additionally contains the positions of each word within a document. This form offers more functionality but needs more processing power and space to be created.

Suppose we want to search the texts:

“hello everyone”

“this article is based on inverted index”

“which is hashmap-like data structure”

If we index by (text, word), the index with each word's location in the text is:

hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
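
A word-level index of this kind can be built as sketched below. The function name `build_positional_index` is an assumption, and both text numbers and word positions start at 1 to match the (text, word) pairs in the listing.

```python
import re
from collections import defaultdict

texts = [
    "hello everyone",
    # second text worded so that word positions match the listing above
    "this article is based on inverted index",
    "which is hashmap-like data structure",
]

def build_positional_index(texts):
    """Map each word to a list of (text number, word position) pairs."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        # "hashmap-like" splits into "hashmap" and "like", as in the listing
        words = re.findall(r"[a-z]+", text.lower())
        for word_no, word in enumerate(words, start=1):
            index[word].append((text_no, word_no))
    return index

positional = build_positional_index(texts)
is_locations = positional["is"]   # the word "is" occurs in texts 2 and 3
```

Storing positions is what makes phrase queries possible, at the cost of the extra space noted above.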

Steps to Build an Inverted Index

• Fetch the document.

• Remove stop words: Stop words are the most frequently occurring words in documents that carry little meaning, like “I”, “the”, “we”, “is”, and “an”.

• Stem to the root word: When searching for “cat”, we want to find documents that have information about it. But the word present in a document may be “cats” or “catty” instead of “cat”. To relate these words, part of every word read is chopped off to obtain the “root word”. There are standard tools for performing this, like “Porter’s Stemmer”.

• Record document IDs: If the word is already present in the index, add a reference to the document to its entry; otherwise create a new entry. Add additional information such as the frequency of the word, the location of the word, etc.
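
The steps above can be sketched as a small pipeline. The stemming here is a deliberately crude stand-in that only strips a trailing "s"; it is not Porter's algorithm, and the sample documents and function names are assumptions for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"i", "the", "we", "is", "an"}

def crude_stem(word):
    """Very rough stand-in for a real stemmer: drop one trailing 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def build_index(docs):
    """Map each root word to {document id: frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():                      # fetch the document
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:                        # remove stop words
                continue
            term = crude_stem(token)                       # stem to the root word
            # record the document id together with the word frequency
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "The cat is an animal",
    2: "We like cats and cats like us",
}
index = build_index(docs)
cat_entry = index["cat"]   # "cat" and "cats" end up under the same root
```

A production system would replace `crude_stem` with an established stemmer and would usually store word positions alongside the frequencies.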
Starvation Example:

• Suppose there are 3 transactions, T1, T2, and T3, in a database that are trying to acquire a lock on data item ‘I’.

• Now, suppose the scheduler grants the lock to T1 (maybe due to some priority), and the other two transactions wait for the lock.

• As soon as the execution of T1 is over, another transaction T4 arrives and requests a lock on data item I. This time the scheduler grants the lock to T4, and T2 and T3 have to wait again. In this way, if new transactions keep on requesting the lock, T2 and T3 may have to wait for an indefinite period of time, which leads to starvation.

Deadlock Example:

• Suppose transaction T1 holds a lock on some rows in the Students table and needs to update some rows in the Grades table.

Simultaneously, transaction T2 holds locks on those very rows in the Grades table (which T1 needs to update) but needs to update the rows in the Students table held by transaction T1.

• Now the main problem arises. Transaction T1 will wait for transaction T2 to give up its lock, and similarly transaction T2 will wait for transaction T1 to give up its lock. As a consequence, all activity comes to a halt and remains at a standstill forever unless the DBMS detects the deadlock and aborts one of the transactions.
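
One common way to detect this situation is to record which transaction is waiting for which and look for a cycle. The sketch below assumes a dictionary `wait_for` as that record and a hypothetical `find_deadlock` function; how a real DBMS tracks waits and picks its victim is system specific.

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a wait cycle, or None."""
    def dfs(node, path):
        if node in path:
            return path[path.index(node):]   # cycle found: slice it out of the path
        for blocker in wait_for.get(node, ()):
            cycle = dfs(blocker, path + [node])
            if cycle:
                return cycle
        return None

    for start in wait_for:
        cycle = dfs(start, [])
        if cycle:
            return cycle
    return None

# T1 waits for T2 (rows in Grades) and T2 waits for T1 (rows in Students).
wait_for = {"T1": {"T2"}, "T2": {"T1"}}
cycle = find_deadlock(wait_for)

# Resolve the deadlock by aborting one transaction in the cycle (choice is arbitrary here).
victim = cycle[-1]
remaining = {t: blockers - {victim} for t, blockers in wait_for.items() if t != victim}
still_deadlocked = find_deadlock(remaining)
```

Once the victim is aborted and its locks are released, the remaining transaction no longer waits on anyone and can proceed.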
