
CIT 401

DATABASE MANAGEMENT SYSTEM I


LECTURE NOTE

FILE ORGANISATION AND INDEXING

File organization is a method of arranging the records in a file when the file is stored on disk.
A relation is typically stored as a file of records.

A file is a collection of records. Using the primary key, we can access individual records. The type and frequency of access supported for a given set of records is determined by the file organization chosen for them.

File organization is a logical relationship among various records. This method defines how file
records are mapped onto disk blocks.

File organization also describes the way in which records are grouped into blocks and how those blocks are placed on the storage medium.

The first approach to mapping the database to files is to use several files and store records of only one fixed length in any given file. An alternative approach is to structure our files so that they can accommodate records of multiple lengths.

Files of fixed-length records are easier to implement than files of variable-length records.

FILE INDEXING

Indexing is a way to optimize the performance of a database by minimizing the number of disk accesses required when a query is processed. It is a data structure technique used to quickly locate and access data in a database. Indexes are created using one or more database columns.

Indexes are used to quickly locate data without having to search every row in a database table every time the table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Importance of Indexing

Indexing is very important for managing the files and documents of a large business organization.

Indexing helps to locate a specific document or record in a file within a short period of time. Cross-referencing means filing a record at its appropriate location under more than one name or number.

Two main types of indexing methods are:

1. Primary Indexing
2. Secondary Indexing.
A primary index is an ordered file of fixed-length records with two fields. Primary indexing is further divided into two types:

1. Dense Index
2. Sparse Index

Indexing is a data structure technique to efficiently retrieve records from the database files
based on some attributes on which the indexing has been done. Indexing in database
systems is similar to what we see in books. Indexing is defined based on
its indexing attributes.

Dense Index
In a dense index, an index record is created for every search-key value in the database. This helps you to search faster but needs more space to store the index records. In this indexing method, each index record contains the search-key value and a pointer to the actual record on the disk.

Sparse Index
A sparse index contains index records for only some of the search-key values in the file. It helps to resolve the space issues of dense indexing in a DBMS. In this indexing method, a range of index entries shares the same data block address, and when data needs to be retrieved, that block address is fetched and the block is searched.

Because a sparse index stores index records for only some search-key values, it needs less space and less maintenance overhead for insertions and deletions, but it is slower than a dense index for locating records.
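To make the difference concrete, the following Python sketch builds both a dense and a sparse index over a small sorted file of records and looks up a key through each; the blocks, keys and function names are invented for illustration.

```python
# Illustrative sketch: dense vs. sparse index over a sorted file of records.
# The "file" is a list of blocks; each block holds a few (key, row) records.

from bisect import bisect_right

blocks = [
    [(10, "row-10"), (20, "row-20")],      # block 0
    [(30, "row-30"), (40, "row-40")],      # block 1
    [(50, "row-50"), (60, "row-60")],      # block 2
]

# Dense index: one entry per search-key value -> (block number, slot).
dense_index = {key: (b, s)
               for b, block in enumerate(blocks)
               for s, (key, _) in enumerate(block)}

# Sparse index: one entry per block, holding only the block's first key.
sparse_keys = [block[0][0] for block in blocks]      # [10, 30, 50]

def lookup_dense(key):
    """Direct lookup: the index points straight at the record."""
    b, s = dense_index[key]
    return blocks[b][s][1]

def lookup_sparse(key):
    """Find the last indexed key <= search key, then scan that block."""
    b = bisect_right(sparse_keys, key) - 1
    for k, row in blocks[b]:
        if k == key:
            return row
    return None

print(lookup_dense(40))    # row-40, found directly from the index entry
print(lookup_sparse(40))   # row-40, found after a short scan of block 1
```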


Secondary Index in DBMS

A secondary index in a DBMS can be generated from a field which has a unique value for each record, and it should be a candidate key. It is also known as a non-clustering index.

This two-level database indexing technique is used to reduce the mapping size of the first level. For the first level, a large range of numbers is selected; because of this, the mapping size always remains small.

Secondary Index Example


Let’s understand secondary indexing with a database index example:

In a bank account database, data is stored sequentially by acc_no; you may want to find all accounts of a specific branch of ABC bank.

Here, you can have a secondary index in the DBMS for every search-key. Each index record points to a bucket that contains pointers to all the records with that specific search-key value.

Clustering Index in DBMS
In a clustered index, the records themselves are stored in the index rather than pointers. Sometimes the index is created on non-primary-key columns, which might not be unique for each record. In such a situation, you can group two or more columns to obtain unique values and create an index on them; this is called a clustered index. It also helps you to identify records faster.

Example:

Let’s assume that a company has recruited many employees in various departments. In this case, a clustering index should be created for all employees who belong to the same department.

All such employees are considered to be in a single cluster, and the index points to the cluster as a whole. Here, Department_no is a non-unique key.

What is Multilevel Index?

A multilevel index is created when a primary index does not fit in memory. In this indexing method, the number of disk accesses needed to locate any record is reduced: the primary index is kept on disk as a sequential file, and a sparse index is built on that file.

B-Tree Index
The B-tree index is the most widely used data structure for tree-based indexing in a DBMS. It is a multilevel form of tree-based indexing built on balanced search trees. All leaf nodes of the B-tree hold the actual data pointers.

Moreover, all leaf nodes are interlinked with a linked list, which allows a B-tree to support both random and sequential access.

• Leaf nodes must have between 2 and 4 values.
• Every path from the root to a leaf has (almost) the same length.
• Non-leaf nodes apart from the root node have between 3 and 5 child nodes.
• Every node which is not a root or a leaf has between ⌈n/2⌉ and n children.

Advantages of Indexing
Important pros/advantages of indexing are:

• It helps to reduce the total number of I/O operations needed to retrieve data, since a row can often be located from the index structure without scanning the whole table.
• It offers faster search and retrieval of data to users.
• Indexing also helps you to reduce tablespace, as there is no need to store the ROWID separately in the index; thus you are able to reduce the tablespace.
• Data in the leaf nodes does not need to be sorted separately, as the value of the primary key keeps it ordered.

Disadvantages of Indexing
Important drawbacks/cons of indexing are:

• To perform indexing in a database management system, you need a primary key on the table with unique values.
• You can't create any other indexes on the index-organized data.
• You are not allowed to partition an index-organized table.
• Indexing decreases performance of INSERT, DELETE, and UPDATE queries.

OBJECTIVE OF FILE ORGANIZATION

It provides optimal selection of records, i.e., records can be selected as fast as possible.

Insert, delete and update operations on the records should be quick and easy.

Duplicate records should not be introduced as a result of insert, update or delete operations.

Records should be stored efficiently so that the cost of storage is minimal.

Types of File Organization:

File organization comprises various methods. Each method has pros and cons with respect to access and selection, and the programmer decides the best-suited file organization method according to the application's requirements.

Types of file organization are as follows:

1. Sequential file organization
2. Heap file organization
3. Hash file organization
4. B+ file organization
5. Indexed sequential access method (ISAM)
6. Cluster file organization

1. Sequential File Organization

This method is the easiest method for file organization. In this method, files are stored
sequentially. This method can be implemented in two ways:

i. Pile File Method:

It is quite a simple method. In this method, we store the records in sequence, i.e., one after another; records are kept in the order in which they are inserted into the table.

When a record is updated or deleted, it is first searched for in the memory blocks. Once it is found, it is marked for deletion and the new record is inserted.

Suppose we have records R1, R3 and so on up to R9 and R8 stored in sequence; here each record is simply a row in the table. If we want to insert a new record R2 into the sequence, it is placed at the end of the file.

ii. Sorted File Method:

In this method, the new record is always inserted at the end of the file, and the file is then sorted in ascending or descending order. Sorting of records is based on the primary key or any other key.

When a record is modified, it is updated, the file is re-sorted, and the updated record ends up in the right place.

2. Heap File Organization

It is the simplest and most basic type of organization. It works with data blocks. In heap file organization, records are inserted at the end of the file, and no sorting or ordering of records is required when they are inserted.

When a data block is full, the new record is stored in some other block. This new data block need not be the very next data block; the DBMS can select any data block in memory to store new records. The heap file is also known as an unordered file.

In the file, every record has a unique id, and every page in the file is of the same size. It is the DBMS's responsibility to store and manage the new records.

Insertion of a new record

Suppose we have five records R1, R3, R6, R4 and R5 in a heap and we want to insert a new record R2. If data block 3 is full, then R2 will be inserted into any other data block selected by the DBMS, say data block 1.

If we want to search, update or delete data in heap file organization, then we need to traverse the data from the start of the file until we get the requested record.

If the database is very large then searching, updating or deleting of record will be time-
consuming because there is no sorting or ordering of records. In the heap file organization, we
need to check all the data until we get the requested record.

Advantages of Heap File Organization

It is a very good method of file organization for bulk insertion. If a large amount of data needs to be loaded into the database at one time, this method is best suited.

For a small database, fetching and retrieving records is faster than with sequential organization.

Disadvantages of Heap file organization

This method is inefficient for large databases because it takes time to search for or modify a record.

3. Hash File Organization

Hash file organization uses the computation of a hash function on some fields of the records. The hash function's output determines the location of the disk block where the record is to be placed.

When a record has to be retrieved using the hash key columns, the address is generated and the whole record is retrieved using that address. In the same way, when a new record has to be inserted, the address is generated using the hash key and the record is inserted directly at that address. The same process is applied in the case of delete and update.

In this method, there is no need to search or sort the entire file; each record is stored at an effectively random location in memory determined by the hash function.

4. B+ File Organization

B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like structure to store records in a file.

It uses the same concept of key-index where the primary key is used to sort the records. For
each primary key, the value of the index is generated and mapped with the record.

The B+ tree is similar to a binary search tree (BST), but it can have more than two children. In
this method, all the records are stored only at the leaf node. Intermediate nodes act as a pointer
to the leaf nodes. They do not contain any records.

The B+ tree shown above illustrates that:

There is one root node of the tree, i.e., 25.

There is an intermediary layer of nodes. They do not store the actual records; they hold only pointers to the leaf nodes.

The nodes to the left of the root contain values smaller than the root and the nodes to the right contain values larger than the root, i.e., 15 and 30 respectively.

The leaf level holds only the values, i.e., 10, 12, 17, 20, 24, 27 and 29.

Searching for any record is easier as all the leaf nodes are at the same level (the tree is balanced).

In this method, any record can be reached by traversing a single path and accessed easily.

Advantages of B+ Tree File Organization

In this method, searching becomes very easy as all the records are stored only in the leaf nodes and sorted in a sequential linked list.

Traversing through the tree structure is easier and faster.

The size of the B+ tree has no restrictions, so the number of records can increase or decrease
and the B+ tree structure can also grow or shrink.

It is a balanced tree structure, and any insert/update/delete does not affect the performance of the tree.

Disadvantages of B+ Tree File Organization

This method is inefficient for static tables.

HASHING IN DBMS
Hashing is a DBMS technique for searching for needed data on the disc without utilising an
index structure. The hashing method is basically used to index items and retrieve them in a
DB since searching for a specific item using a shorter hashed key rather than the original
value is faster.

In a huge database structure, it is very inefficient to search through all the index values to reach the desired data. The hashing technique is used to calculate the direct location of a data record on the disk without using an index structure.

In this technique, data is stored in data blocks whose addresses are generated by using the hash function. The memory locations where these records are stored are known as data buckets or data blocks.

Here, the hash function can use any column value to generate the address. Most of the time, the hash function uses the primary key to generate the address of the data block. The hash function can be a simple mathematical function or a complex one. We can even use the primary key itself as the address of the data block; in that case, each row is stored at the address given by its primary key.

What is Hashing in DBMS?

In a large database structure, it can be very inefficient to search through all index values at every level and then reach the target data block to obtain the needed data. Hashing is a method for calculating the direct position of a record on the disk without the use of an index structure. To generate the actual address of a data record, a hash function that takes the search key as a parameter is used.

Properties of Hashing in DBMS


Data is kept in data blocks whose addresses are produced using the hashing function in this
technique. Data buckets or data blocks are the memory locations where these records are
stored.
In this case, the hash function can produce the address from any column value. The primary key is frequently used by the hash function to generate the data block’s address. The hash function may be a basic mathematical function or a more complex one. The primary key can also be used directly as the data block’s address, i.e. each row is stored in the data block whose address equals its primary key.

In the simplest case, the data block addresses are the same as the primary key values. The hash function could alternatively be a simple mathematical function, such as exponential, mod, cos, sin, and so on. Assume we are using the mod(5) hash function to find the data block’s address. In this scenario, the primary keys are hashed with the mod(5) function, yielding 3, 3, 1, 4, and 2 respectively, and the records are saved at those data block locations.
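As a small illustration of this idea, the Python sketch below hashes primary keys with mod(5) and places each record in the corresponding data bucket. The sample employee IDs are invented (chosen so that they hash to 3, 3, 1, 4 and 2 as in the text), and bucket handling is deliberately simplified.

```python
# Static hashing sketch: bucket address = key mod 5.
NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def bucket_address(primary_key):
    return primary_key % NUM_BUCKETS          # the hash function h(k) = k mod 5

def insert(primary_key, record):
    buckets[bucket_address(primary_key)].append((primary_key, record))

def search(primary_key):
    # Only the one bucket given by the hash function is examined.
    for key, record in buckets[bucket_address(primary_key)]:
        if key == primary_key:
            return record
    return None

for emp_id in (103, 108, 101, 104, 107):      # hash to 3, 3, 1, 4, 2 respectively
    insert(emp_id, f"employee-{emp_id}")

print(bucket_address(103))   # 3
print(search(108))           # employee-108, found in bucket 3
```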

Hash Organization
Bucket – A bucket is a type of storage container. Data is stored in bucket format in a hash
file. Typically, a bucket stores one entire disc block, which can then store one or more
records.
Hash Function – A hash function, abbreviated as h, refers to a mapping function that
connects all of the search-keys K to that address in which the actual records are stored. From
the search keys to the bucket addresses, it’s a function.

Types of Hashing
Hashing is of the following types:

Static Hashing
Whenever a search-key value is given in static hashing, the hash algorithm always returns the same address. If the mod-5 hash function is employed, for example, only five bucket addresses (0 to 4) will ever be generated. For a given key, the output address is always the same, and the total number of buckets available remains constant at all times.

In static hashing, the resultant data bucket address will always be the same. That means that if we generate an address for EMP_ID = 103 using the hash function mod(5), it will always result in the same bucket address, 3. There will be no change in the bucket address.

Operations of Static Hashing

• Search a Record

When a record needs to be searched, the same hash function is used to retrieve the address of the bucket where the data is stored.

• Insert a Record

When a new record is inserted into the table, we generate an address for the new record based on the hash key, and the record is stored at that location.

• Delete a Record

To delete a record, we will first fetch the record which is supposed to be deleted. Then we will
delete the records for that address in memory.

• Update a Record

To update a record, we will first search it using a hash function, and then the data record is
updated.

If we want to insert a new record into the file but the data bucket address generated by the hash function is not empty (data already exists at that address), the situation is known as bucket overflow. This is a critical situation in static hashing.

To overcome this situation, there are various methods. Some commonly used methods are as
follows:

1. Open Hashing

When the hash function generates an address at which data is already stored, the next bucket is allocated to the new record. This mechanism is called linear probing.

For example, suppose R3 is a new record that needs to be inserted and the hash function generates address 112 for it. The bucket at that address is already full, so the system searches for the next available data bucket, 113, and assigns R3 to it.
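A minimal sketch of linear probing, using an invented fixed-size bucket table and the bucket numbers from the example above:

```python
# Open hashing (linear probing) sketch: if the target bucket is occupied,
# step forward to the next free bucket.

TABLE_SIZE = 200
table = [None] * TABLE_SIZE          # one record per bucket, for simplicity

def insert(address, record):
    slot = address
    while table[slot] is not None:   # bucket already full: probe the next one
        slot = (slot + 1) % TABLE_SIZE
    table[slot] = record
    return slot

def search(address, record):
    slot = address
    while table[slot] is not None:
        if table[slot] == record:
            return slot
        slot = (slot + 1) % TABLE_SIZE
    return None

insert(112, "R1")          # occupies bucket 112
print(insert(112, "R3"))   # 112 is full, so R3 is placed in 113
print(search(112, "R3"))   # 113
```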

2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash result and is linked after the previous one. This mechanism is known as overflow chaining.

Dynamic Hashing
The disadvantage of static hashing is that it doesn’t expand or contract dynamically as the database size grows or shrinks. Dynamic hashing is a technique that allows data buckets to be created and removed on the fly. Extended hashing is another name for dynamic hashing.

In dynamic hashing, the hash function is designed to output a huge number of values, but only a few are used at first.

o The dynamic hashing method is used to overcome the problems of static hashing, such as bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases. It is also known as the extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor performance.

How to search a key

o First, calculate the hash address of the key.
o Check how many bits are used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index into the directory.
o Using this index, go to the directory and find the bucket address where the record might be.

How to insert a new record

o Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.

For example:

Consider the following grouping of keys into buckets, based on the last bits of their hash addresses:

The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2. The last two bits of 7 are 11, so it goes into B3.

Insert key 9 with hash address 10001 into the above structure:

o Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket B1. But bucket B1 is full, so it will be split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries, because the last two bits of both entries are 11.
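The directory lookup described above can be sketched as follows; the 5-bit hash addresses of the keys other than 9 are assumed for illustration (only key 9's address, 10001, is given in the text), chosen to be consistent with the bucket assignments in the example.

```python
# Extendible (dynamic) hashing sketch: the directory is indexed by the
# least significant i bits of a key's hash address.

def last_bits(hash_address, i):
    """Return the least significant i bits of a binary hash-address string."""
    return hash_address[-i:]

# Global depth i = 2: four directory entries pointing to buckets.
directory = {"00": "B0", "01": "B1", "10": "B2", "11": "B3"}
buckets = {"B0": [2, 4], "B1": [5, 6], "B2": [1, 3], "B3": [7]}

# Assumed 5-bit hash addresses for the example keys (9's is given as 10001).
hash_address = {2: "10100", 4: "00000", 5: "11001", 6: "10101",
                1: "11010", 3: "00110", 7: "10111", 9: "10001"}

def find_bucket(key, i=2):
    return directory[last_bits(hash_address[key], i)]

print(find_bucket(9))   # last two bits '01' -> B1; B1 is full, so it would be split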

Advantages of dynamic hashing

In this method, the performance does not decrease as the data grows in the system. It simply
increases the size of memory to accommodate the data.

In this method, memory is well utilized because it grows and shrinks with the data; there is no unused memory lying idle.

This method is good for the dynamic database where data grows and shrinks frequently.

Disadvantages of dynamic hashing

In this method, if the data size increases, the number of buckets also increases, and the addresses of the data are maintained in a bucket address table. This is needed because data addresses keep changing as buckets grow and shrink. If there is a huge increase in data, maintaining the bucket address table becomes tedious.

A bucket overflow situation can also occur in this case, but it typically takes longer to reach that situation than with static hashing.

B+ Tree

The B+ tree is a balanced search tree in which a node can have more than two children. It follows a multi-level index format.

In the B+ tree, leaf nodes denote actual data pointers. B+ tree ensures that all leaf nodes remain
at the same height.

In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support random access as well as sequential access.

Structure of B+ Tree

In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.

It contains internal nodes and leaf nodes.

Internal node

An internal node of the B+ tree can contain at least ⌈n/2⌉ pointers, except for the root node.

At most, an internal node of the tree contains n pointers.

Leaf node

A leaf node of the B+ tree can contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.

At most, a leaf node contains n record pointers and n key values.

Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.

Searching a record in B+ Tree

Suppose we have to search for 55 in the example B+ tree structure. First, we go to the intermediary node, which directs us to the leaf node that may contain the record for 55.

In the intermediary node, we find the branch between 50 and 75, which redirects us to the third leaf node. There the DBMS performs a sequential search to find 55.
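The search just described can be sketched in Python as a simple two-level structure (an internal node over linked leaf nodes); the key values are taken from the example, and the classes are illustrative rather than a full B+ tree implementation.

```python
# Illustrative B+ tree search sketch: separator keys in the internal node,
# all records in the leaves, leaves chained left-to-right.
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # link to the next leaf (sequential access)

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1

def search(node, key):
    # Walk down internal nodes until a leaf is reached.
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    # Sequential search inside the leaf, as described above.
    return key in node.keys

leaves = [Leaf([10, 20]), Leaf([30, 40]), Leaf([50, 55, 60, 65, 70]), Leaf([80, 90])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b                    # chain the leaves
root = Internal([25, 50, 75], leaves)

print(search(root, 55))           # True: the branch between 50 and 75 leads to the third leaf
print(leaves[2].next.keys)        # [80, 90]: the leaf chain gives sequential access
```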

B+ Tree Insertion

Now suppose we want to insert a new record into the third leaf node, which holds the values (50, 55, 60, 65, 70) and whose first key is 50. Since the leaf node is full, we split it in the middle so that the tree's balance is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only at 50. It must have 60 added to it, and it then holds a pointer to the new leaf node.

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from the intermediate node as well as from the 4th leaf node. If we simply remove it from the intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we must modify it to keep the tree balanced.

After deleting 60 from the above B+ tree and re-arranging the nodes, the result is again a valid, balanced B+ tree.

Concurrency Control

Concurrency control is the mechanism required for controlling and managing the concurrent execution of database operations and thus avoiding inconsistencies in the database. To maintain the consistency of the database under concurrency, we have concurrency control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation, durability and serializability of the concurrent execution of database transactions. These protocols are categorized as:

i. Lock-Based Concurrency Control Protocol
ii. Timestamp Concurrency Control Protocol
iii. Validation-Based Concurrency Control Protocol

Concurrency control is thus the management procedure required for controlling the concurrent execution of the operations that take place on a database.

But before knowing about concurrency control, we should know about concurrent execution.

Concurrent Execution in DBMS

In a multi-user system, multiple users can access and use the same database at one time, which
is known as the concurrent execution of the database. It means that the same database is
executed simultaneously on a multi-user system by different users.

While working with database transactions, the database often needs to be used by multiple users performing different operations, and in that case concurrent execution of the database takes place.

The simultaneous execution should be performed in an interleaved manner such that no operation affects the other executing operations, thus maintaining the consistency of the database. However, concurrent execution of transaction operations gives rise to several challenging problems that need to be solved.

Concurrency control

In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.

Computer systems, both software and hardware, consist of modules, or components. Each
component is designed to operate correctly, i.e., to obey or to meet certain consistency rules.
When components that operate concurrently interact by messaging or by sharing accessed data
(in memory or storage), a certain component's consistency may be violated by another
component. The general area of concurrency control provides rules, methods, design
methodologies, and theories to maintain the consistency of components operating concurrently
while interacting, and thus the consistency and correctness of the whole system. Introducing
concurrency control into a system means applying operation constraints which typically result
in some performance reduction. Operation consistency and correctness should be achieved with
as good as possible efficiency, without reducing performance below reasonable levels.
Concurrency control can require significant additional complexity and overhead in a concurrent
algorithm compared to the simpler sequential algorithm.

For example, a failure in concurrency control can result in data corruption from torn read or
write operations.

Concurrency control in databases

To ensure correctness, a DBMS usually guarantees that only serializable transaction schedules
are generated, unless serializability is intentionally relaxed to increase performance, but only
in cases where application correctness is not harmed. For maintaining correctness in cases of

failed (aborted) transactions (which can always happen for many reasons) schedules also need
to have the recoverability (from abort) property. A DBMS also guarantees that no effect of
committed transactions is lost, and no effect of aborted (rolled back) transactions remains in
the related database. Overall transaction characterization is usually summarized by the ACID
rules below. As databases have become distributed, or needed to cooperate in distributed
environments (e.g., Federated databases in the early 1990, and Cloud computing currently), the
effective distribution of concurrency control mechanisms has received special attention.

Database transaction

The concept of a database transaction (or atomic transaction) has evolved in order to enable
both a well understood database system behavior in a faulty environment where crashes can
happen any time, and recovery from a crash to a well understood database state. A database
transaction is a unit of work, typically encapsulating a number of operations over a database
(e.g., reading a database object, writing, acquiring lock, etc.), an abstraction supported in
database and also other systems. Each transaction has well defined boundaries in terms of
which program/code executions are included in that transaction (determined by the
transaction's programmer via special transaction commands). Every database transaction obeys
the following rules (by support in the database system; i.e., a database system is designed to
guarantee them for the transactions it runs):

• Atomicity - Either the effects of all or none of its operations remain ("all or nothing" semantics) when a transaction is completed (committed or aborted respectively). In other words, to the outside world a committed transaction appears (by its effects on the database) to be indivisible (atomic), and an aborted transaction does not affect the database at all: either all of the transaction's operations are performed, or none of them are.
• Consistency - Every transaction must leave the database in a consistent (correct) state,
i.e., maintain the predetermined integrity rules of the database (constraints upon and
among the database's objects). A transaction must transform a database from one
consistent state to another consistent state (however, it is the responsibility of the
transaction's programmer to make sure that the transaction itself is correct, i.e.,
performs correctly what it intends to perform (from the application's point of view)
while the predefined integrity rules are enforced by the DBMS). Thus since a database
can be normally changed only by transactions, all the database's states are consistent.
• Isolation - Transactions cannot interfere with each other (as an end result of their
executions). Moreover, usually (depending on concurrency control method) the effects
of an incomplete transaction are not even visible to another transaction. Providing
isolation is the main goal of concurrency control.
• Durability - Effects of successful (committed) transactions must persist through crashes
(typically by recording the transaction's effects and its commit event in a non-volatile
memory).

The concept of an atomic transaction has been extended over the years to what have become business transactions, which actually implement types of workflow and are not atomic. However, such enhanced transactions typically still utilize atomic transactions as components.

Why is concurrency control needed?

If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction
concurrency exists. However, if concurrent transactions with interleaving operations are
allowed in an uncontrolled manner, some unexpected, undesirable results may occur, such as:

1. The lost update problem: A second transaction writes a second value of a data-item
(datum) on top of a first value written by a first concurrent transaction, and the first
value is lost to other transactions running concurrently which need, by their precedence,
to read the first value. The transactions that have read the wrong value end with
incorrect results.
2. The dirty read problem: Transactions read a value written by a transaction that has been
later aborted. This value disappears from the database upon abort, and should not have
been read by any transaction ("dirty read"). The reading transactions end with incorrect
results.
3. The incorrect summary problem: While one transaction takes a summary over the
values of all the instances of a repeated data-item, a second transaction updates some
instances of that data-item. The resulting summary does not reflect a correct result for
any (usually needed for correctness) precedence order between the two transactions (if
one is executed before the other), but rather some random result, depending on the
timing of the updates, and whether certain update results have been included in the
summary or not.

Most high-performance transactional systems need to run transactions concurrently to meet their performance requirements. Thus, without concurrency control such systems can neither provide correct results nor maintain their databases consistently.

Concurrency control mechanisms

Categories

The main categories of concurrency control mechanisms are:

• Optimistic - Delay the checking of whether a transaction meets the isolation and other
integrity rules (e.g., serializability and recoverability) until its end, without blocking
any of its (read, write) operations ("...and be optimistic about the rules being met..."),
and then abort a transaction to prevent the violation, if the desired rules are to be
violated upon its commit. An aborted transaction is immediately restarted and re-
executed, which incurs an obvious overhead (versus executing it to the end only once).
If not too many transactions are aborted, then being optimistic is usually a good
strategy.
• Pessimistic - Block an operation of a transaction, if it may cause violation of the rules,
until the possibility of violation disappears. Blocking operations is typically involved
with performance reduction.
• Semi-optimistic - Block operations in some situations, if they may cause violation of
some rules, and do not block in other situations while delaying rules checking (if
needed) to transaction's end, as done with optimistic.

Different categories provide different performance, i.e., different average transaction completion rates (throughput), depending on transaction types mix, computing level of parallelism, and other factors. If selection and knowledge about trade-offs are available, then category and method should be chosen to provide the highest performance.

The mutual blocking between two transactions (where each one blocks the other) or more
results in a deadlock, where the transactions involved are stalled and cannot reach completion.
Most non-optimistic mechanisms (with blocking) are prone to deadlocks which are resolved
by an intentional abort of a stalled transaction (which releases the other transactions in that
deadlock), and its immediate restart and re-execution. The likelihood of a deadlock is typically
low.

Blocking, deadlocks, and aborts all result in performance reduction, and hence the trade-offs
between the categories.

Methods

Many methods for concurrency control exist. Most of them can be implemented within either
main category above. The major methods,[1] each of which has many variants and which in some cases may overlap or be combined, are:

1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned
to the data. Access of a transaction to a data item (database object) locked by another
transaction may be blocked (depending on lock type and access operation type) until
lock release.
2. Serialization graph checking (also called Serializability, or Conflict, or Precedence
graph checking) - Checking for cycles in the schedule's graph and breaking them by
aborts.
3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or
checking access to data by timestamp order.
4. Commitment ordering (or Commit ordering; CO) - Controlling or checking
transactions' chronological order of commit events to be compatible with their
respective precedence order.

Other major concurrency control types that are utilized in conjunction with the methods above
include:

• Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations of several last relevant versions (of each object), depending on the scheduling method.
• Index concurrency control - Synchronizing access operations to indexes, rather than to
user data. Specialized methods provide substantial performance gains.
• Private workspace model (Deferred update) - Each transaction maintains a private
workspace for its accessed data, and its changed data become visible outside the
transaction only upon its commit (e.g., Weikum and Vossen 2001). This model provides
a different concurrency control behavior with benefits in many cases.

Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE. These two operations need to be managed during the concurrent execution of transactions, because if they are performed in an uncontrolled interleaved manner, the data may become inconsistent. The following problems occur with the concurrent execution of operations:

Problem 1: Lost Update Problem (W - W Conflict)

The problem occurs when two different database transactions perform read/write operations on the same database item in an interleaved manner (i.e., concurrently), and the update made by one transaction is overwritten by the other, so that it is lost.
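A minimal sketch of the lost update problem, with an invented data item X and two interleaved "transactions" running without any concurrency control:

```python
# Lost update sketch: two transactions interleave read and write on the
# same item X without any concurrency control, and one update is lost.

db = {"X": 100}

# T1 wants to add 50 to X, T2 wants to subtract 30 from X.
t1_x = db["X"]          # T1: read(X)  -> 100
t2_x = db["X"]          # T2: read(X)  -> 100 (before T1 has written)
db["X"] = t1_x + 50     # T1: write(X) -> 150
db["X"] = t2_x - 30     # T2: write(X) -> 70, overwriting T1's update

print(db["X"])          # 70, but any serial order would give 120: T1's update is lost
```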

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it. There are two types of lock:

1. Shared lock:

It is also known as a read-only lock. With a shared lock, the data item can only be read by the transaction.

The lock can be shared between transactions, because a transaction holding only a shared lock cannot update the data item.

2. Exclusive lock:

With an exclusive lock, the data item can be both read and written by the transaction.

This lock is exclusive, so multiple transactions cannot modify the same data simultaneously.

There are four types of lock protocols available:

1. Simplistic lock protocol

It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require every transaction to obtain a lock on the data before it inserts, deletes or updates it. The data item is unlocked after the transaction completes.

2. Pre-claiming Lock Protocol

Pre-claiming lock protocols evaluate the transaction to list all the data items on which it needs locks.

Before initiating execution of the transaction, it requests the DBMS for locks on all of those data items.

If all the locks are granted, this protocol allows the transaction to begin; when the transaction is completed, it releases all the locks.

If all the locks are not granted, the transaction rolls back and waits until all the locks can be granted.

3. Two-phase locking (2PL)

The two-phase locking protocol divides the execution phase of the transaction into three parts.

In the first part, when the execution of the transaction starts, it seeks permission for the lock it
requires.

In the second part, the transaction acquires all the locks. The third phase is started as soon as
the transaction releases its first lock.

In the third phase, the transaction cannot demand any new locks. It only releases the acquired
locks.

There are two phases of 2PL:

Growing phase: in the growing phase, a new lock on a data item may be acquired by the transaction, but none can be released.

Shrinking phase: in the shrinking phase, existing locks held by the transaction may be released, but no new locks can be acquired.

If lock conversion is allowed, then the following can happen:

Upgrading of a lock (from S(a) to X(a)) is allowed only in the growing phase.

Downgrading of a lock (from X(a) to S(a)) must be done in the shrinking phase.
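The two phases can be sketched as a small, illustrative lock manager in Python (not a full protocol implementation): once a transaction releases its first lock it enters the shrinking phase, and any further lock request is rejected.

```python
# Two-phase locking sketch: a transaction may acquire locks only while it is
# in the growing phase; releasing any lock moves it to the shrinking phase.

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False          # False = growing phase

    def lock(self, item, mode):         # mode: "S" (shared) or "X" (exclusive)
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot acquire new locks in shrinking phase")
        self.locks.add((item, mode))
        print(f"{self.name}: acquired {mode}-lock on {item}")

    def unlock(self, item, mode):
        self.locks.discard((item, mode))
        self.shrinking = True           # first release starts the shrinking phase
        print(f"{self.name}: released {mode}-lock on {item}")

t = TwoPhaseTransaction("T1")
t.lock("A", "S")        # growing phase
t.lock("B", "X")        # growing phase
t.unlock("A", "S")      # shrinking phase begins
try:
    t.lock("C", "S")    # violates 2PL: rejected
except RuntimeError as e:
    print(e)
```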

DATA SECURITY
Data security is the process of protecting files, databases and accounts on a network by adopting a set of controls, applications and techniques that identify the relative importance of different datasets, their sensitivity and their regulatory compliance requirements, and then applying appropriate protections to secure those resources.
Data security requirements
1. Confidentiality:
a. Access control: this is controlled by means of privileges, roles and user accounts.
b. Authenticated users: this is a way of implementing decisions of whom to trust with
data. Passwords, finger prints etc. can be used.
c. Secure storage of sensitive data: this is required to protect data from hackers who could damage or steal sensitive data.
d. Privacy of communication: the DBMS should be capable of preventing the spread of confidential personal information, such as credit card numbers, to unauthorized people.

2. Integrity: this contributes to maintaining a secure database by preventing the data from becoming invalid and giving incorrect results. It is made up of:
a. System and object privileges that control access to application tables and commands so that only authorized users can change the data.
b. Integrity constraints applied to maintain the correctness and authenticity of the data in the database.
c. Protection of the database from viruses, so firewalls and antivirus software should be used.
d. Ensuring that access to the network is controlled and that data is not vulnerable to attacks during transmission across the network.
3. Availability: data should always be available to authorized users from the secured system without interruption or delay. Denial-of-service attacks are attempts to block authorized users' ability to access and use the system when required.
Data security considerations
1. Where your sensitive data is located
2. Who has access to your data
3. Have you implemented continuous monitoring and real time alerting on your data?
Data security technologies
1. Data auditing
2. Data real time alert
3. Data risk assessment
4. Data minimization
5. Purge stale data

Types of data security control


1. Authenticity
2. Access control
3. Backups and recovery
4. Encryption
5. Data masking
6. Tokenization
7. Deletion and erasure
Data integrity

This is the overall accuracy, completeness and consistency of data. It can also be defined as the safety of data with regard to regulatory compliance. It is maintained by a collection of processes, rules and standards implemented during the design phase.
Types of data integrity
There are two major types of data integrity: physical integrity and logical integrity.
1. Physical integrity: this is the protection of data's wholeness and accuracy as it is stored and retrieved. Physical integrity is compromised by natural disasters, power interruptions or attacks by hackers on the database.
2. Logical integrity: this keeps data unchanged as it is used in different ways in a relational database. It protects data from human error and hackers as well, but in a different way from physical integrity. It is made up of four types:
• Entity integrity
• Referential integrity
• Domain integrity
• User-defined integrity
➢ Entity integrity: relies on the creation of primary keys, i.e., unique values that identify pieces of data, to ensure that data is not listed more than once and that the key field in a table is never null.
➢ Referential integrity: refers to the series of processes that ensure data is stored and used uniformly, for example that foreign keys always reference existing rows.
➢ Domain integrity: the collection of processes that ensure the accuracy of each piece of data in a domain.
➢ User-defined integrity: rules and constraints created by the user to meet a particular need.
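As an illustration of entity and referential integrity, the following sketch checks, over two invented tables, that primary keys are unique and non-null and that every foreign key value exists in the parent table.

```python
# Integrity-check sketch: entity integrity (unique, non-null primary keys)
# and referential integrity (every foreign key matches a parent row).

customers = [
    {"cust_id": 1, "name": "Ada"},
    {"cust_id": 2, "name": "Bayo"},
]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 3},   # violates referential integrity: no customer 3
]

def check_entity_integrity(rows, pk):
    keys = [row[pk] for row in rows]
    assert None not in keys, f"null {pk} found"
    assert len(keys) == len(set(keys)), f"duplicate {pk} found"

def referential_violations(child, fk, parent, pk):
    parent_keys = {row[pk] for row in parent}
    return [row for row in child if row[fk] not in parent_keys]

check_entity_integrity(customers, "cust_id")
print(referential_violations(orders, "cust_id", customers, "cust_id"))
# [{'order_id': 11, 'cust_id': 3}] -> rows that break referential integrity
```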
Data Integrity Risks
1. Human error
2. Transfer error
3. Bugs and viruses
4. Compromised hardware
Data Integrity Problems
1. Lost update
2. Uncommitted data
3. Inconsistent retrievals

Differences between data security and data integrity
1. Data security is a measure used to protect data from misuse or unauthorized access, while data integrity deals with the accuracy and completeness of data present in the database.
2. Data security deals with protection of data while integrity deals with the validation
of data
3. Authentication, encryption and masking are popular means of data security while
backup, designing a suitable user interface and error detection/correction in data are
popular means of preserving integrity
4. Data security refers to making sure that data is accessed only by its intended users, thereby ensuring the privacy and protection of data. Data integrity refers to the structure of the data and how it matches the schema of the database.

Data protection: this is the process of safeguarding important information from corruption, compromise or loss. Data protection grows in importance as the amount of data stored and created continues to increase at a very high rate.

Principles of data protection:


1. Lawfulness, fairness and transparency: for personal data to be processed, the processing must be lawful and fair. It should be visible to individuals that their personal data are being collected, used or consulted, and to what extent the data will be processed. The principle of transparency states that any information and communication relating to the processing of personal data must be easy to understand and easily accessible.
2. Purpose limitation: personal data should only be collected for specific, explicit and legitimate purposes.
3. Data minimization: personal data processing must be adequate, relevant and limited to what is necessary for the purpose of the processing. Personal data should be processed only if the purpose of the processing cannot be achieved through other means.
4. Accuracy: controllers must ensure that data is accurate and kept up to date, reviewing it regularly wherever necessary.
5. Storage limitation: personal data should be kept in a form that allows identification of the data subject for only as long as necessary for the purpose for which the data is processed. Controllers should establish time limits for erasure, or periodic reviews of personal data, so that data is not kept longer than necessary.
6. Integrity and confidentiality: personal data should be processed in a manner that
ensures adequate security and confidentiality of the data
7. Accountability: the controller is responsible for, and must be able to demonstrate compliance with, the other data protection principles named above.

Methods of data protection

1. Risk assessments: these allow controllers to plan, in advance, the steps required to control risks and ensure appropriate processing of personal data. Controllers must ensure that their data protection measures tackle the risks associated with data processing effectively. Risk assessments pursuant to the General Data Protection Regulation must be carried out from the perspective of the data subject, which means that controllers must assess

• which freedoms and rights of data subjects could be at risk and


• what kind of damage could be incurred by data subjects from the envisaged
processing of their personal data.

Damage can be physical, material or non-material.

2. Backups: this is a method of preventing data loss that can occur due to either user error or technical malfunction. Backups should be made selectively: low-importance documents need not be backed up, while sensitive ones should be. Highly sensitive data should not be stored in the cloud.
3. Encryption: this is the process that scrambles readable text so it can only be read by
the person who has the secret code or decryption key. It helps provide data security
for sensitive information. Encrypted data is commonly referred to as ciphertext
while unencrypted data is called plaintext. Currently encryption is one of the most
popular and effective data security methods used by organizations. Two main types of data encryption exist: asymmetric encryption, also known as public-key encryption, and symmetric encryption (see the sketch after this list).
4. Pseudonymization: this is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms (also illustrated in the sketch after this list).

5. Access control: access control is a method of guaranteeing that users are who they say they are and that they have the appropriate access to data. At a high level, access control is the selective restriction of access to data.
There are six types of access control: Attribute-based access control (ABAC),
Discretionary access control (DAC), Mandatory access control (MAC), Role-based access
control (RBAC), Rule-based access control and Break-Glass access control.
➢ Attribute-based access control (ABAC): Here access is granted not on the rights of a
user after authentication but based on attributes. An attribute-based access
control policy specifies which claims need to be satisfied to grant access to the resource.
The end user has to prove so-called claims about their attributes to the access control
engine.
➢ Discretionary access control (DAC): Owners or administrators of the protected
system, data or resource set the policies defining who or what is authorized to access
the resource. These systems rely on administrators to limit the propagation of access
rights. They are criticized for their lack of centralized control.
➢ Mandatory access control (MAC): Access rights are regulated by a central authority
based on multiple levels of security. MAC is common in government and military
environments.
➢ Role-based access control (RBAC): the system grants access based on a user's role rather than on the owner's discretion. RBAC is common in commercial and military systems, where multi-level security requirements may exist. RBAC differs from DAC in that access is controlled at the system level, outside of user control, while DAC allows users to control access. RBAC can be distinguished from MAC primarily by the way it handles permissions.
➢ Rule-based access control: A security model where an administrator defines rules that
govern access to the resource. These rules may be based on conditions such as time of
day and location.
➢ Break-glass access control: traditional access control has the purpose of restricting access, which is why most access control models follow the principle of least privilege and the default-deny principle. This behavior may conflict with the operation of a system in an emergency, so break-glass access control allows the usual restrictions to be bypassed in such situations.
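To make methods 3 (encryption) and 4 (pseudonymization) above concrete, here is a hedged Python sketch. It assumes the third-party cryptography package for symmetric (Fernet) encryption; the field values and secret are invented, and a keyed hash stands in for a pseudonym table.

```python
# Sketch of symmetric encryption (method 3) and pseudonymization (method 4).
# Assumes the third-party 'cryptography' package is installed.

import hashlib
import hmac
from cryptography.fernet import Fernet

# --- Encryption: plaintext -> ciphertext, reversible only with the key ---
key = Fernet.generate_key()
cipher = Fernet(key)
ciphertext = cipher.encrypt(b"4111 1111 1111 1111")   # invented sensitive card number
print(cipher.decrypt(ciphertext))                     # b'4111 1111 1111 1111'

# --- Pseudonymization: replace an identifier with an artificial identifier ---
secret = b"pseudonymization-secret"                   # kept separately from the data

def pseudonymize(identifier: str) -> str:
    # Keyed hash so the pseudonym cannot be reversed without the secret.
    return hmac.new(secret, identifier.encode(), hashlib.sha256).hexdigest()[:12]

record = {"patient_name": "Ada", "ward": "B2"}        # invented record
record["patient_name"] = pseudonymize(record["patient_name"])
print(record)
```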
Data protection trends
• Hyper-convergence: this is an IT framework that combines storage, computing and
network into a single system in an effort to reduce data center complexity and increase
scalability. With the emergence of this trend, vendors now offer appliances that provide
backup and recovery for physical and virtual environments that are hyper-converged,
non-hyper converged and mixed.
• Ransomware: this is a subset of malware in which data on a victim’s computer is
locked by encryption and payment is requested before the ransomed data is decrypted
and access is granted to the victim.
• Copy data management: this cuts down the number of copies of data an organization must save, reducing the overhead required to store and manage data and simplifying data protection. Through automation and centralized control, copy data management can speed up application release cycles, increase productivity and reduce administrative costs.
Database system failures and recovery
Database failure can be defined as the inability of the system to provide the required
functionality correctly.

Causes of database failures


1. System crash: this results in the loss of the contents of volatile memory (not persistent storage) due to hardware or software errors. Its causes include the following:
• Operating system failure
• Main memory crash
• Transaction failure or abortion
• System-generated errors such as integer overflow or divide-by-zero
• Failure of supporting software
• Power failure
2. Hard (disk) failure: this causes loss of data in persistent or non-volatile storage. It is caused by the following:
• Power failure
• Fault in media
• Read-write malfunction
• Corruption of information on the disk
• Read/ write head crash of disk

3. Network failure: failures of communication software or links interrupt normal database system operation. These include errors induced in the database system due to the distributed nature of the data and the transfer of data over the network. They are caused by the following:
• Communication link failure.
• Network congestion.
• Information corruption during transfer.
• Site failures.
• Network partitioning.
4. User error: this occurs when a user unintentionally deletes a row or drops a table.
5. Carelessness: damages done to data or facilities by operators or users due to lack of
concentration.
6. Application software error: these are errors in a program accessing the database, which can cause one or more transactions to fail.
7. Natural disasters: these are damages caused to data, hardware or software as a result of natural disasters such as fire, floods, earthquakes, power failure, etc.
8. Sabotage: this is damage done intentionally to data, hardware or software facilities.
Database recovery: this is the process of restoring the database to a correct state after a failure, e.g.
• System crashes
• Media failure
• Application software errors
• Natural disaster
• Carelessness
Concepts of database recovery
• Backup mechanism: makes periodic backups of the database
• Logging facilities: keep track of the current state of transactions and the changes made to the database
• Checkpoint mechanism: enables updates made so far to be written permanently to the database at regular intervals.
The choice of the best possible recovery strategy depends on the extent of the damage to the database:
• If there has been physical damage such as a disk crash, the last backup copy of the data is restored.
• If the database has become inconsistent but is not physically damaged, the changes that caused the inconsistency must be undone. It may also be required to redo some transactions so as to ensure that their updates are reflected in the database.

Some of the backup techniques are as follows:

I. Full database backup – the full database, including the data and the database meta-information needed to restore the whole database (including full-text catalogs), is backed up on a predefined schedule.
II. Differential backup – this stores only the data changes that have occurred since the last full database backup. When the same data has changed many times since the last full backup, a differential backup stores only the most recent version of the changed data. To restore from it, the full database backup must be restored first.
III. Transaction log backup – In this backup, all events that have occurred in the
database (a record of every single statement executed) are backed up. It is a backup of
the transaction log entries and contains all transactions that have happened to the
database. Through this, the database can be recovered to a specific point in time. It is
even possible to back up the transaction log when the data files are destroyed, so that
not a single committed transaction is lost.
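As a small, hedged illustration of the full-backup idea only (differential and transaction-log
backups are features of specific DBMS products and are not shown), Python's built-in sqlite3
module can copy a live database file to a backup file; the file names below are hypothetical:

```python
# Minimal full-backup sketch using the standard-library sqlite3 module.
import sqlite3

def full_backup(source_path: str, backup_path: str) -> None:
    """Copy every page of the source database into a backup file."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)   # online page-by-page copy; available since Python 3.7
    finally:
        src.close()
        dst.close()

if __name__ == "__main__":
    full_backup("shop.db", "shop_full_backup.db")   # hypothetical file names
```

A real deployment would schedule such full backups at predefined intervals and combine them
with the DBMS's own differential and log backup facilities.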

DATA WAREHOUSING VS DATA MINING

Data warehousing refers to the process of compiling and organizing data into one
common database, whereas:

Data mining refers to the process of extracting useful data from the database or
warehouse. The data mining process depends on the data compiled in the data warehousing
phase to recognize meaningful patterns. A data warehouse is created to support management
decision-making.

Data warehouse: refers to a place where data can be stored for useful mining.

Other names for data warehousing

➢ Decision support system


➢ Executive information system
➢ Management system with exceptionally huge data storage capacity
➢ Business intelligence solution

➢ Analytic application

Data from the organization's various systems is copied to the warehouse, where it can be
fetched and conformed to remove errors. Advanced queries can then be run against the data
stored in the warehouse.

DATA WAREHOUSING PROCESS

[Diagram: ETL flow – data is extracted from the source systems, transformed, and loaded
into the target warehouse.]

A data warehouse combines data from numerous sources into a consistent form. It boosts
system performance by separating analytical processing from the transactional databases.

Data flows into the data warehouse from the different databases and is sorted into a schema
that describes the format and types of the data; query tools then examine the data tables
using this schema. A minimal sketch of this extract-transform-load (ETL) flow is given
below.
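The sketch below is purely illustrative: the source file name, column names and cleaning
rules are assumptions made for this example, not part of any real warehouse product.

```python
# Minimal, hypothetical Extract-Transform-Load (ETL) sketch.
import csv
import sqlite3

def extract(path: str):
    """Extract raw rows from a source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Conform the data: drop incomplete rows and normalise formats."""
    clean = []
    for row in rows:
        if not row.get("customer_id") or not row.get("amount"):
            continue                          # discard rows with missing keys
        clean.append({
            "customer_id": row["customer_id"].strip(),
            "amount": float(row["amount"]),   # unify the numeric type
        })
    return clean

def load(rows, warehouse_path: str):
    """Load the conformed rows into a warehouse table."""
    con = sqlite3.connect(warehouse_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_source.csv")), "warehouse.db")  # hypothetical names
```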

DATABASE VS DATA WAREHOUSE

A database is made to store current transactions and allow quick access to specific
transactions for ongoing business processes; this is commonly known as on-line transaction
processing (OLTP).

A data warehouse is built to store a huge amount of historical data and empowers fast
queries over all the data, typically using on-line analytical processing (OLAP).
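To make the contrast concrete, here is a hedged sketch; the table, columns and data are
invented for illustration. An OLTP-style request touches specific current records, while an
OLAP-style request aggregates over the whole (historical) data set:

```python
# Hypothetical example contrasting OLTP- and OLAP-style queries.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id TEXT, sale_date TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("C001", "2023-01-05", 120.0),
     ("C002", "2023-01-06", 75.5),
     ("C001", "2023-02-10", 60.0)],
)

# OLTP-style request: quick access to one customer's specific transactions.
print(con.execute("SELECT * FROM sales WHERE customer_id = ?", ("C001",)).fetchall())

# OLAP-style request: an aggregate over all the data.
print(con.execute(
    "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
).fetchall())

con.close()
```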

IMPORTANT FEATURES OF DATA WAREHOUSE

1. Subject oriented
2. Time-variant
3. Integrated
4. Non-volatile

1. Subject oriented: A data warehouse is subject oriented. It provides useful data
about a subject such as customers, suppliers, marketing, products, promotions, etc.

2. Time variant: The data present in the data warehouse provides information for a
specific time period.

3. Integrated: A data warehouse is built by joining data from heterogeneous
sources, such as relational databases, flat files, etc.

4. Non-volatile: once data has been entered into the warehouse, it cannot be changed.

ADVANTAGES OF DATA WAREHOUSE

❖ More accurate data access


❖ Improved productivity and performance
❖ Cost efficient
❖ Consistent and quality data
❖ A data warehouse allows business users to quickly access critical data from several
sources all in one place, and it provides consistent information on various
cross-functional activities.
❖ It helps integrate many sources of data to reduce stress on the production system
DISADVANTAGES OF DATA WAREHOUSING
i. Creation and implementation of a data warehouse is a time-consuming affair.
ii. The data in a warehouse can become outdated relatively quickly.
iii. It is difficult to make changes in data types.
iv. It is too complex for the average user.
v. Sometimes warehouse users will develop different business rules.
vi. Organizations need to spend a lot of resources on training and implementation.

DATA WAREHOUSE TOOLS

• MarkLogic
• Oracle
• Amazon Redshift

WHO NEEDS DATA WAREHOUSING

Data warehousing is needed by all types of users, such as:

1. Decision makers who rely on massive amounts of data.
2. Users who use customized processes to obtain information from multiple data sources.
3. People who want simple technology to access data.
4. People who want a systematic approach for making decisions.

SECTORS THAT USES DATA WAREHOUSING

a. Airline: It is used for operational purposes such as crew assignment, analysis of route
profitability, frequent-flyer programme promotions, etc.
b. Banking sector: To manage the resources available on the desk effectively, and for
research and performance analysis of products and operations.
c. Healthcare: Uses it to strategize and predict outcomes and to generate patients'
treatment reports.
d. Public sector: It helps government agencies to maintain and analyse tax records and
health policy records for every individual.
e. Investment and insurance sector: The warehouse is used to analyse data patterns and
customer trends, and to track market movements.
f. Retail chain: It is used for distribution and marketing.

HOW A DATA WAREHOUSE WORKS

A data warehouse works as a central repository where information arrives from one or more
data sources.

Data flows into a data warehouse from the transactional systems and other relational databases.

Data may be:

❖ Structured
❖ Semi-structured
❖ Unstructured data

The data is processed, transformed, and ingested so that users can access the processed
data in the data warehouse through business intelligence tools, SQL clients, and
spreadsheets. Data warehousing makes data mining possible; data mining looks for patterns
in the data that may lead to higher sales and profits.

TYPES OF DATA WAREHOUSE

1. Enterprise data warehouse: a centralized warehouse that provides decision
support services across the whole enterprise.

2. Operational data store (ODS): the data in an ODS is refreshed in (near) real time.
Hence, it is widely preferred for routine activities such as storing records of the
employees.

3. Data mart: a data mart is a subset of the data warehouse. It is specially designed
for a particular line of business, such as sales or finance.

GENERAL STAGES OF A DATA WAREHOUSE

The following are the general stages of a data warehouse:

1. Offline operational database: data is simply copied from an operational system to
another server.

2. Offline data warehouse: it is regularly updated from the operational database.

3. Real-time data warehouse: it is updated continuously whenever the operational
system performs a transaction.

4. The components of a data warehouse are:

a. Load manager
b. Warehouse manager
c. Query manager
d. End-user access tools

End-user access tools are categorized into five different groups:

i. Data reporting tools
ii. Query tools
iii. Application development tools
iv. EIS (executive information system) tools
v. OLAP tools and data mining tools

DATA MINING

Data mining refers to the analysis of data and the extraction of useful information from it.
It is a computer-supported process of analysing huge sets of data that have either been
compiled by computer systems or been downloaded into the computer, in order to extract
useful information from them. It looks for hidden patterns within the data set and tries to
predict future behaviour.

Data mining is used primarily to discover and indicate relationships among the data sets.

DATA MINING PROCESS

1. Select a target data set: The data needed for the data mining process may be obtained
from many different and heterogeneous data sources. This first step obtains the data
from various databases, files, and non electronic sources. With the help of one or more
human experts and knowledge discovery tools, we choose an initial set of data to be
analyzed.

2. Data preprocessing: The data to be used by the process may have incorrect or missing
data. There may be anomalous data from multiple sources involving different data types
and metrics. There may be many different activities performed at this time. We use
available resources to deal with noisy data. We decide what to do about missing data
values and how to account for time-sequence information.

3. Data transformation: Attributes and instances are added and/ or eliminated from the
target data. Data from different sources must be converted into a common format for
processing. Some data may be encoded or transformed into more usable formats. Data
reduction may be used to reduce the number of possible data values being considered.

4. Data mining: A best model for representing the data is created by applying one or more
data mining algorithms. Based on the data mining task being performed, this step
applies algorithms to the transformed data to generate the desired results.

5. Interpretation/evaluation: The output from step 4 is examined to determine whether
what has been discovered is both useful and interesting. Decisions are made about
whether to repeat previous steps using new attributes and/or instances. How the data
mining results are presented to the users is extremely important, because the usefulness
of the results depends on it. Various visualization and GUI strategies are used at this
last step (a small illustrative sketch is shown after these steps).
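The sketch below is a deliberately tiny example of the mining step (step 4): it counts
frequently co-purchased item pairs in hypothetical transaction data, a much simplified form
of association-rule (market basket) mining. The data and the support threshold are invented
for illustration.

```python
# Minimal market-basket sketch: find item pairs bought together often.
from itertools import combinations
from collections import Counter

transactions = [                      # hypothetical, already-preprocessed data
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "butter"},
    {"bread", "milk", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items that appear in the same basket.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2                       # a pair must occur in at least two baskets
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)                 # e.g. {('bread', 'milk'): 3, ('butter', 'milk'): 2}
```

In a real setting the interpretation/evaluation step would then judge whether such patterns
are useful, for example for shelf placement or targeted promotions.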

IMPORTANT FEATURES OF DATA MINING

➢ It utilizes the automated discovery of patterns.
➢ It predicts the expected results.
➢ It focuses on large data sets and databases.
➢ It creates actionable information.

ADVANTAGES OF DATA MINING

I. Market analysis
II. Fraud detection
III. Financial market analysis
IV. Trend analysis

1. Market analysis: Data mining can predict market behaviour, which helps the business
to make decisions, e.g. predicting who is likely to purchase what type of product.

2. Fraud detection: Data mining methods can help to detect which cellular phone calls,
insurance claims, or credit/debit card purchases are likely to be fraudulent.

3. Financial market analysis: Data mining techniques are widely used to help model and
analyse financial markets.

4. Trend analysis: Data mining analyses the current trends in the marketplace.

DISADVANTAGES OF DATA MINING

a. Privacy issues

b. Security issues

WHERE IS DATA MINING USED TODAY?

Data mining is used in various fields such as business, healthcare, insurance,
transportation, and government.

WHY USE DATA MINING

Some main important reasons for using data mining are:

➢ Establish relevance and relationships amongst data, and use this information to
generate profitable insights.
➢ Businesses can make informed decisions quickly; for example, it helps to find unusual
shopping patterns in grocery stores.
➢ Optimize a website business by providing customized offers to each visitor.
➢ Helps to measure customers' response rates in business marketing.
➢ Create and maintain new customer groups for marketing purposes.
➢ Predict customer defection, i.e. which customers are more likely to switch to another
supplier in the near future.
➢ Differentiate between profitable and unprofitable customers.
➢ Identify all kinds of suspicious behaviour.

