ET Unit 01
UNIT-1
DATABASE SYSTEM FUNDAMENTALS
File system versus DBMS
A file system is basically a way of arranging files on a storage
medium such as a hard disk. The file system organizes the files and helps in the
retrieval of files when they are required. A file system consists of different files
grouped into directories; the directories may in turn contain other folders and files.
The file system performs basic operations such as file management, file naming, and
setting access rules.
1
EMERGING TECHNOLOGIES IN DATA PROCESSING
Data abstraction: A DBMS hides low-level storage details
through various data abstraction levels to allow users to easily interact with the
system.
Reduction in data redundancy: When working with a structured database, a
DBMS can prevent the entry of duplicate items into the database.
For example, if the same student appears in two different rows, one of the
duplicate rows can be removed.
Application development: A DBMS provides a foundation for developing
applications that require access to large amounts of data, reducing development
time and costs.
Data sharing: A DBMS provides a platform for sharing data across multiple
applications and users, which can increase productivity and collaboration.
Data organization: A DBMS provides a systematic approach to organizing data in
a structured way, which makes it easier to retrieve and manage data efficiently.
Increased end-user productivity: The available data, combined with tools that
transform it into useful information, helps end users make quick, informed, and
better decisions that can make the difference between success and failure in the
global economy.
Simple: A database management system (DBMS) gives a simple and clear logical
view of data. Many operations, such as insertion, deletion, or creation of files or
data, are easy to implement.
Describing and Storing Data in a DBMS
A data model is a collection of high-level data description constructs that hide
many low-level storage details.
A semantic data model is a more abstract, high-level data model that makes it
easier for a user to come up with a good initial description of the data in an
enterprise.
A database design in terms of a semantic model serves as a useful starting point
and is subsequently translated into a database design in terms of the data model
the DBMS actually supports.
A widely used semantic data model called the entity-relationship (ER) model
allows us to pictorially denote entities and the relationships among them.
Each row in the Students relation is a record that describes a student. Every row
follows the schema of the Students relation, and the schema can therefore be
regarded as a template for describing a student.
We can make the description of a collection of students more precise by
specifying integrity constraints, which are conditions that the records in a
relation must satisfy.
Other notable models: the hierarchical model, network model, object-oriented
model, and the object-relational model.
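The Students relation and its integrity constraints described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the text's own example: the column names and the CHECK rule are assumptions.

```python
# A minimal sketch of a Students relation whose schema acts as a template
# and whose integrity constraints restrict which records are admissible.
# Table and column names here are illustrative, not from any specific system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        sid  INTEGER PRIMARY KEY,              -- key constraint: unique id
        name TEXT NOT NULL,
        gpa  REAL CHECK (gpa BETWEEN 0 AND 10) -- integrity constraint
    )
""")
conn.execute("INSERT INTO Students VALUES (1, 'Asha', 8.5)")  # satisfies the schema

try:
    conn.execute("INSERT INTO Students VALUES (2, 'Ravi', 42.0)")  # violates CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Every accepted row follows the schema; rows that violate a constraint are rejected by the DBMS rather than stored.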
Transaction Management
A transaction is a set of operations used to perform a logical unit of work.
It bundles all the instructions of a logical operation, and it usually
means that the data in the database has changed. One of the major uses of a DBMS
is to protect the user's data from system failures; it does this by ensuring that all
data is restored to a consistent state when the computer is restarted after a crash.
A transaction is any one execution of the user program in a DBMS. An important
property of a transaction is that it contains a finite number of steps.
Executing the same program multiple times generates multiple transactions.
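The all-or-nothing behaviour described above can be sketched with sqlite3: a transfer between two hypothetical accounts either fully commits or is rolled back to the prior consistent state. Account names and amounts are illustrative.

```python
# A minimal sketch of transaction behaviour: a simulated crash mid-transaction
# triggers a rollback, restoring the database to its last consistent state.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, amount, fail=False):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'",
                     (amount,))
        if fail:
            raise RuntimeError("simulated crash mid-transaction")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'",
                     (amount,))
        conn.commit()                  # all steps succeeded: make them durable
    except RuntimeError:
        conn.rollback()                # undo the partial work

transfer(conn, 70, fail=True)          # crash: rollback restores consistency
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'A': 100, 'B': 50} -- neither half of the transfer is visible
```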
Structure of Database Management System
Database Management System (DBMS) is software that allows access to
data stored in a database and provides an easy and effective method of –
• Defining the information.
• Storing the information.
• Manipulating the information.
• Protecting the information from system crashes or data theft.
• Differentiating access permissions for different users.
DML Compiler: It processes DML statements into low-level instructions
(machine language) so that they can be executed.
DDL Interpreter: It processes DDL statements into a set of tables containing
metadata (data about data).
Embedded DML Pre-compiler: It processes DML statements embedded
in an application program into procedural calls.
Query Optimizer: It determines an efficient evaluation plan for the
instructions generated by the DML compiler before they are executed.
2. Storage Manager: The storage manager is a program that provides an interface
between the data stored in the database and the queries received. It is also known
as the Database Control System.
Authorization Manager: It ensures role-based access control, i.e., it checks
whether a particular user is privileged to perform the requested
operation.
Integrity Manager: It checks the integrity constraints when the database is
modified.
Transaction Manager: It controls concurrent access by scheduling the
operations of the transactions it receives. Thus, it
ensures that the database remains in a consistent state before and after the
execution of each transaction.
File Manager: It manages the file space and the data structure used to
represent information in the database.
Buffer Manager: It is responsible for cache memory and the transfer of data
between the secondary storage and main memory.
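The buffer manager's caching role can be illustrated with a small sketch: keep a fixed number of disk blocks in main memory and evict the least recently used block when the pool is full. This is a hypothetical simplification; real buffer managers also handle pinning, dirty pages, and write-back.

```python
# A toy buffer pool: cache hits avoid "disk" reads; misses fetch the block and
# may evict the least recently used frame. Block contents are stand-ins.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()   # block_id -> block contents (LRU order)
        self.disk_reads = 0

    def read_block(self, block_id):
        if block_id in self.frames:               # cache hit
            self.frames.move_to_end(block_id)
            return self.frames[block_id]
        self.disk_reads += 1                      # cache miss: go to "disk"
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)       # evict least recently used
        data = f"<contents of block {block_id}>"  # stand-in for a real disk read
        self.frames[block_id] = data
        return data

pool = BufferPool(capacity=2)
pool.read_block(1); pool.read_block(2); pool.read_block(1)
pool.read_block(3)            # evicts block 2, the least recently used
print(pool.disk_reads)        # 3 disk reads; the repeated read of block 1 was a hit
```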
colleges at the same time. Student entity has attributes such as Stu_Id, Stu_Name
& Stu_Addr and College entity has attributes such as Col_ID & Col_Name.
Here are the geometric shapes and their meanings in an E-R diagram. We
will discuss these terms in detail in the next section (Components of an ER
Diagram) of this guide, so don't worry too much about them now; just go
through them once.
Rectangle: Represents entity sets.
Ellipse: Represents attributes.
Diamond: Represents relationship sets.
Lines: Link attributes to entity sets and entity sets to relationship sets.
Double Ellipse: Represents multivalued attributes.
Dashed Ellipse: Represents derived attributes.
Double Rectangle: Represents weak entity sets.
Double Lines: Indicate total participation of an entity in a relationship set.
Components of an ER Diagram
An ER diagram has three main components:
1. Entity
2. Attribute
3. Relationship

1. Entity
An entity is an object or component of data. An entity is
represented as a rectangle in an ER diagram.
For example: in the following ER diagram we have two entities,
Student and College, and these two entities have a many-to-one
relationship, as many students study in a single college. We will
read more about relationships later; for now, focus on entities.
Weak Entity:
An entity that cannot be uniquely identified by its own attributes and
relies on a relationship with another entity is called a weak entity. A
weak entity is represented by a double rectangle.
For example, a bank account cannot be uniquely identified
without knowing the bank to which the account belongs, so a
bank account is a weak entity.
Weak Entities:
A weak entity is an entity that does not have a key attribute of its own. It
can be identified uniquely only by considering the primary key of another entity.
For that, the weak entity set must have total participation in the identifying relationship.
2. Attribute
An attribute describes a property of an entity. An attribute is
represented as an oval in an ER diagram. There are four types of
attributes:
1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute
3. Relationship
A relationship is represented by a diamond shape in an ER diagram; it
shows the relationship among entities.
Cardinality: Defines the numerical attributes of the relationship
between two entities or entity sets. There are four types of cardinal
relationships:
1. One to One
2. One to Many
3. Many to One
4. Many to Many
For example, many students can study in a single college, but a student cannot
study in many colleges at the same time.
RELATIONSHIP SETS
Types of Relationship Sets
On the basis of degree of a relationship set, a relationship set can be
classified into the following types
Three new concepts were added to the existing ER model:
• Generalization
• Specialization
• Aggregation
Specialization
Specialization is the opposite of generalization. In specialization, an entity is
divided into sub-entities based on their characteristics (distinguishing features).
It breaks an entity into multiple entities from a higher level to a lower level;
it is a top-down approach.
Aggregation
Aggregation refers to the process by which entities are combined to form a
single meaningful entity. The specific entities are combined because they do not
make sense on their own. To establish a single entity, aggregation creates a
relationship that combines these entities. The resulting entity makes sense because
it enables the system to function well.
KEY CONSTRAINTS
• Constraints are nothing but the rules to be followed while
entering data into the columns of a database table.
• Constraints ensure that the data entered by the user into the columns is
valid according to the rules defined on those columns.
Example
CREATE TABLE Orders (
OrderID int NOT NULL,
OrderNumber int NOT NULL,
PersonID int,
PRIMARY KEY (OrderID),
FOREIGN KEY (PersonID) REFERENCES Persons(PersonID));
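The Orders/Persons example above can be exercised with sqlite3. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; the Persons table's Name column is an assumption added for illustration.

```python
# Running the foreign-key example: a valid reference is accepted, while an
# order pointing at a non-existent person is rejected.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Persons (PersonID int PRIMARY KEY, Name text)")
conn.execute("""
    CREATE TABLE Orders (
        OrderID int NOT NULL,
        OrderNumber int NOT NULL,
        PersonID int,
        PRIMARY KEY (OrderID),
        FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
    )
""")
conn.execute("INSERT INTO Persons VALUES (1, 'Asha')")
conn.execute("INSERT INTO Orders VALUES (10, 77895, 1)")       # valid reference

try:
    conn.execute("INSERT INTO Orders VALUES (11, 44678, 99)")  # no such person
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```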
Unique
• Sometimes we need to maintain only unique data in a column of a
database table; the UNIQUE constraint makes this possible.
Example
CREATE TABLE Persons (
    ID int UNIQUE,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int
);
DEFAULT
• The DEFAULT clause in SQL is used to add a default value to a column.
• When a column is given a default value, rows that omit the column
automatically use that value, i.e., we need not enter it every time
while entering data.
• The default value can be overridden, i.e., a different value can be
supplied when inserting data for a row, based on the requirement.
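The DEFAULT behaviour above can be demonstrated in sqlite3; the table, column, and default value here are illustrative assumptions.

```python
# A column with a DEFAULT is filled automatically when omitted, and can still
# be overridden by supplying an explicit value.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        ID int PRIMARY KEY,
        Name text,
        City text DEFAULT 'Hyderabad'
    )
""")
conn.execute("INSERT INTO Students (ID, Name) VALUES (1, 'Asha')")  # default used
conn.execute("INSERT INTO Students VALUES (2, 'Ravi', 'Chennai')")  # overridden
print(conn.execute("SELECT ID, City FROM Students ORDER BY ID").fetchall())
# [(1, 'Hyderabad'), (2, 'Chennai')]
```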
Primary Key
A primary key is a constraint that uniquely identifies each row record
in a database table by designating one or more of the table's columns as the
primary key.
ID INT
AGE INT
COURSE VARCHAR(10)
Indexing in Database
Indexing improves database performance by minimizing the number of disk
visits required to fulfil a query. It is a data structure technique used to locate and
quickly access data in databases. Indexes are built over one or more database
fields. The primary key or a candidate key of the table is copied into the first
column of the index, the search key. To speed up data retrieval, these values are
kept in sorted order; note that the data file itself need not be sorted.
The second column is the data reference or pointer, which holds the address of
the disk block where that particular key value can be found.
Attributes of Indexing
• Access Types: This refers to the type of access such as value-based
search, range access, etc.
• Access Time: It refers to the time needed to find a particular data element
or set of elements.
• Insertion Time: It refers to the time taken to find the appropriate space
and insert new data.
• Deletion Time: Time taken to find an item and delete it as well as update
the index structure.
Dense Index
• For every search key value in the data file, there is an index record.
• This record contains the search key and also a reference to the first data
record with that search key value.
Dense Index
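A dense index can be sketched in a few lines of Python. The "data file" is just a list, and record positions stand in for disk addresses; the roll numbers and names are illustrative.

```python
# Dense index: one index entry per search-key value in the data file, each
# pointing at the first record with that key.
data_file = [                       # records ordered on the search key
    (101, "Asha"), (102, "Ravi"), (104, "Meena"), (107, "John"),
]

dense_index = {}
for pos, (key, _) in enumerate(data_file):
    dense_index.setdefault(key, pos)    # reference to the first record with that key

print(data_file[dense_index[104]])      # -> (104, 'Meena')
```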
Sparse Index
The index record appears for only some of the items in the data file;
each index record points to a block, as shown.
To locate a record, we find the index record with the largest
search key value less than or equal to the search key value we are looking
for.
We start at the record pointed to by that index record and
proceed along the pointers in the file (that is, sequentially) until we
find the desired record.
Number of accesses required = log₂(n) + 1, where n is the number of
blocks occupied by the index file.
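The sparse-index lookup described above can be sketched as follows: the index holds only the first key of each block, we binary-search for the largest index key not exceeding the target, then scan that block sequentially. Block contents are illustrative.

```python
# Sparse index: index entries for only some records (the first key per block).
import bisect

blocks = [                       # each inner list stands in for one disk block
    [(101, "Asha"), (102, "Ravi")],
    [(104, "Meena"), (107, "John")],
    [(110, "Sara"), (115, "Dev")],
]
index_keys = [blk[0][0] for blk in blocks]    # 101, 104, 110

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1   # largest index key <= key
    if i < 0:
        return None                   # key is smaller than every indexed key
    for rec_key, value in blocks[i]:  # sequential scan within the block
        if rec_key == key:
            return value
    return None

print(lookup(107))   # -> John
```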
then, since the last data block, i.e., data block 3, is full, the record will be
inserted into one of the data blocks selected by the DBMS, say data block 1.
Secondary indexes
A secondary index is an additional index that is not part of the primary
key. It can be created on any column or expression that is frequently used in
queries or joins. A secondary index can have duplicate or null values, and it can be
either unique or non-unique. A secondary index can improve the performance of
queries that filter, sort, or group by the indexed column or expression. For
example, you can use a secondary index on a last_name column to speed up
queries that search for customers by their last name.
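The last_name example can be tried in sqlite3. The exact wording of the query plan varies by SQLite version, but it names the secondary index rather than a full table scan; the sample rows are illustrative.

```python
# A non-unique secondary index on last_name; duplicates are allowed, and
# equality lookups on that column can use the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO customers (last_name) VALUES (?)",
                 [("Rao",), ("Khan",), ("Rao",)])        # duplicate values allowed
conn.execute("CREATE INDEX idx_last_name ON customers(last_name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE last_name = 'Rao'"
).fetchall()
print(plan)   # the plan mentions idx_last_name
```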
B+ tree Index
o The B+ tree is a balanced multiway search tree (not a binary tree). It follows
a multilevel index format.
o In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures
that all leaf nodes remain at the same level.
o In the B+ tree, the leaf nodes are connected in a linked list. Therefore, a B+
tree can support random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at an equal distance from the root node.
o The B+ tree is of order n, where n is fixed for every B+ tree.
o It contains internal nodes and leaf nodes.
Internal node
o An internal node of the B+ tree can contain at least n/2 record pointers,
except the root node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o A leaf node of the B+ tree can contain at least n/2 record pointers
and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P to point to
the next leaf node.
Searching a record in B+ Tree
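The search walks from the root down to a leaf, at each internal node choosing the child whose key range covers the search key, then scans within the leaf. A minimal hypothetical two-level sketch (the order, keys, and record names are illustrative):

```python
# A hand-built B+ tree fragment: one internal (root) node over three leaves.
# Keys >= a separator go to the child on its right.
import bisect

root = {
    "keys": [30, 60],
    "children": [
        {"leaf": True, "keys": [10, 20], "values": ["r10", "r20"]},
        {"leaf": True, "keys": [30, 50], "values": ["r30", "r50"]},
        {"leaf": True, "keys": [60, 70], "values": ["r60", "r70"]},
    ],
}

def bplus_search(node, key):
    while not node.get("leaf"):                      # descend internal nodes
        i = bisect.bisect_right(node["keys"], key)   # pick the covering child
        node = node["children"][i]
    if key in node["keys"]:                          # scan within the leaf
        return node["values"][node["keys"].index(key)]
    return None

print(bplus_search(root, 50))   # -> r50
```

Because all leaves sit at the same level, every search costs the same number of node visits, which is what keeps B+ tree lookups predictable.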
B+ Tree Insertion
Suppose we want to insert a record with key 60 into the structure below. It
should go into the 3rd leaf node, after 55. Since this is a balanced tree and that
leaf node is already full, we cannot insert 60 there.
In this case, we have to split the leaf node so that the record can be inserted
into the tree without affecting the fill factor, balance, and order.
After the insertion, the 3rd leaf node would hold the values (50, 55, 60, 65, 70),
and its parent key is 50. We split the leaf node in the middle so that the
balance is not altered: we group (50, 55) and (60, 65, 70) into two leaf nodes.
If these two are to be leaf nodes, the intermediate node cannot branch from
50 alone. It should have 60 added to it, and then we can have a pointer to the
new leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have
to remove 60 from the intermediate node as well as from the 4th leaf node. If
we simply remove it from the intermediate node, the tree will no longer satisfy
the rules of a B+ tree, so we must rearrange the nodes to keep the tree balanced.
After deleting node 60 from the above B+ tree and re-arranging the nodes, it
will appear as follows:
Hashing in DBMS
Hashing is a DBMS technique for searching for needed data on the disk
without utilising an index structure. The hashing method is mainly used to index
items and retrieve them in a database, since searching for a specific item via a
short hashed key is faster than searching using the original value.
In the picture above, the data block addresses are derived from the primary key
values. The hash function could be a simple mathematical function,
such as exponential, mod, cos, sin, and so on. Assume we are using the mod(5)
hash function to find the data block's address. In this scenario, the primary keys
are hashed with the mod(5) function, yielding 3, 3, 1, 4, and 2 respectively, and
the records are saved at those data block locations.
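The mod(5) mapping can be sketched directly; the key values below are illustrative stand-ins (the text's actual keys are in the figure), chosen so that they hash to 3, 3, 1, 4, and 2 as described.

```python
# Hashing primary keys to data block addresses with a mod(5) hash function.
def hash_mod5(key):
    return key % 5          # data block address = key mod 5

keys = [103, 108, 111, 104, 112]
for k in keys:
    print(k, "-> block", hash_mod5(k))
# 103 -> block 3, 108 -> block 3, 111 -> block 1, 104 -> block 4, 112 -> block 2
```

Note that 103 and 108 collide on block 3; handling such collisions (e.g. overflow chains) is what bucket organization addresses below.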
Hash Organization
Bucket – A bucket is a unit of storage. Data is stored in
bucket format in a hash file. Typically, a bucket stores one entire disk block,
which can in turn store one or more records.
Hash Function – A hash function, abbreviated h, is a
mapping function from the set of all search-key values K to the bucket
addresses in which the actual records are stored.
Types of Hashing
Hashing is of the following types:
Static Hashing
In static hashing, whenever a search-key value is given, the hash algorithm
always returns the same address. For example, if the mod(5) hash function is
employed, only 5 bucket addresses will ever be generated. For this function,
the output address for a given key is always the same, and the total number of
buckets available remains constant at all times.
Dynamic Hashing
The disadvantage of static hashing is that it does not expand or shrink
dynamically as the database grows or diminishes in size. Dynamic hashing is a
technique that allows data buckets to be created and removed on demand.
Extendible hashing is another name for dynamic hashing.
Multi-dimensional indexes
A multi-dimensional index maps multi-dimensional data, in the form of
multiple numeric attributes, to one dimension while mostly preserving locality,
so that values that are similar in all dimensions remain close to each other in
the one-dimensional mapping. Queries that filter by multiple value ranges at
once can be accelerated better with such an index than with a conventional
single-column index.
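One common way to build such a locality-preserving mapping is a Z-order (Morton) curve: interleaving the bits of the coordinates sends nearby (x, y) points to nearby positions on a single dimension. A minimal 2-D sketch (the bit width is an arbitrary choice):

```python
# Morton code: interleave the bits of x and y into a single integer, so a
# one-dimensional index over the codes can serve range queries on both axes.
def morton2d(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z

print(morton2d(3, 5))   # -> 39
```

Points whose x and y values are both close tend to get close Morton codes, which is exactly the locality property the paragraph above describes.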