Unit 4
File organization and database design are two important topics in database
management systems (DBMS).
Heap file organization is a simple and basic type of file organization in DBMS. It works
with data blocks, where records are inserted at the end of the file without any sorting
or ordering. When a data block is full, a new record is stored in any other available
block. This makes insertion very efficient, but searching, updating, or deleting records
can be slow and time-consuming, as the entire file has to be scanned until the
requested record is found. Heap file organization is also known as unordered or pile
file organization.
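The trade-off above (cheap inserts, expensive lookups) can be sketched in Python. This is an illustrative in-memory model, not an actual DBMS; the block size and record layout are assumptions made for the example:

```python
# Minimal in-memory sketch of heap (unordered) file organization.
# Blocks have a fixed capacity; new records go at the end of the file.

BLOCK_SIZE = 4  # records per block (illustrative)

class HeapFile:
    def __init__(self):
        self.blocks = [[]]  # list of data blocks

    def insert(self, record):
        # O(1): append to the last block, or open a new one when it is full
        if len(self.blocks[-1]) >= BLOCK_SIZE:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, key, value):
        # O(n): records are unordered, so every block must be scanned
        for block in self.blocks:
            for record in block:
                if record.get(key) == value:
                    return record
        return None

heap = HeapFile()
for i in range(10):
    heap.insert({"id": 100 + i})
print(heap.search("id", 107))  # found only after scanning earlier blocks
```

Note how `search` must visit blocks in file order; deleting or updating a record pays the same full-scan cost before the record is even located.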
Hash file organization is a method of storing and accessing records in a database using
a hash function. A hash function takes a value of an attribute or a set of attributes,
called the hash key, and maps it to the address of a disk block, called the hash bucket,
where the record is stored. This allows for direct and fast access to records without
using an index structure.
However, hash file organization also has some drawbacks, such as:
It is difficult to support range queries, as the records are not stored in any sorted order.
It may cause bucket overflow, when more records map to the same bucket than the bucket can hold. This can be handled by using overflow buckets, chaining, or rehashing.
It may suffer from poor space utilization, if the hash function does not distribute the
records evenly among the buckets.
We use the ID attribute as the hash key and apply a mod 5 hash function to generate the bucket address; bucket addresses 0 through 4 correspond to data blocks 1 through 5. For example, if ID = 104, then the bucket address is 104 mod 5 = 4, i.e., data block 5.
| Data Block 1 | Data Block 2 | Data Block 3 | Data Block 4 | Data Block 5 |
|--------------|--------------|--------------|--------------|--------------|
| ID = 100     | ID = 101     | ID = 102     | ID = 103     | ID = 104     |
| Name = Alice | Name = Bob   | Name = Carol | Name = David | Name = Eve   |
| Age = 20     | Age = 21     | Age = 19     | Age = 22     | Age = 18     |
| GPA = 3.5    | GPA = 3.2    | GPA = 3.8    | GPA = 3.4    | GPA = 3.6    |
If we want to insert a new record with ID = 105, Name = Frank, Age = 20, and
GPA = 3.7, then the bucket address is 105 mod 5 = 0. Since data block 1 is
not full, we can insert the record there.
| Data Block 1 | Data Block 2 | Data Block 3 | Data Block 4 | Data Block 5 |
|--------------|--------------|--------------|--------------|--------------|
| ID = 100     | ID = 101     | ID = 102     | ID = 103     | ID = 104     |
| Name = Alice | Name = Bob   | Name = Carol | Name = David | Name = Eve   |
| Age = 20     | Age = 21     | Age = 19     | Age = 22     | Age = 18     |
| GPA = 3.5    | GPA = 3.2    | GPA = 3.8    | GPA = 3.4    | GPA = 3.6    |
| ID = 105     |              |              |              |              |
| Name = Frank |              |              |              |              |
| Age = 20     |              |              |              |              |
| GPA = 3.7    |              |              |              |              |
If we want to search for the record with ID = 103, then the bucket address
is 103 mod 5 = 3. We can directly go to data block 4 and retrieve the record.
If we want to delete the record with ID = 102, then the bucket address is
102 mod 5 = 2. We can directly go to data block 3 and remove the record.
If we want to update the record with ID = 104, then the bucket address is
104 mod 5 = 4. We can directly go to data block 5 and modify the record.
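The mod-5 hashing walked through above can be sketched in Python. This is an illustrative in-memory model, not any particular DBMS; the bucket count and the use of chaining for overflow are assumptions made for the example:

```python
# Minimal sketch of hash file organization with a mod-5 hash function.
# Overflow is handled by chaining: each bucket is a list of records.

NUM_BUCKETS = 5

class HashFile:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def _bucket(self, key):
        return key % NUM_BUCKETS  # hash function: key mod 5

    def insert(self, record):
        self.buckets[self._bucket(record["ID"])].append(record)

    def search(self, key):
        # Direct access: hash once, then scan only that bucket's chain
        for record in self.buckets[self._bucket(key)]:
            if record["ID"] == key:
                return record
        return None

    def delete(self, key):
        b = self._bucket(key)
        self.buckets[b] = [r for r in self.buckets[b] if r["ID"] != key]

hf = HashFile()
for i, name in enumerate(["Alice", "Bob", "Carol", "David", "Eve"]):
    hf.insert({"ID": 100 + i, "Name": name})
hf.insert({"ID": 105, "Name": "Frank"})  # 105 mod 5 = 0: chains with ID = 100
print(hf.search(103)["Name"])  # David, found without scanning other buckets
```

Note that a range query such as "all IDs between 101 and 104" gets no help from the hash function: every bucket would have to be scanned, which is exactly the drawback listed earlier.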
B+ tree file organization
Advantages:
It makes searching easy and fast, as the records are sorted and can be accessed by traversing a single path in the tree.
It can grow or shrink dynamically as the number of records increases or decreases.
It is a balanced tree, so performance is not degraded by insertions, deletions, or updates.
Disadvantages:
It is inefficient for static files, where the records do not change frequently.
It requires extra space for storing the pointers and the index values.
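Why ordered storage makes searching fast can be illustrated with a sorted key list and binary search. This Python sketch uses the standard `bisect` module as a stand-in for tree traversal; a real B+ tree stores keys in pages with pointers, but the logarithmic search cost and the cheap range scan are the same idea:

```python
import bisect

# A sorted list of keys stands in for a B+ tree's ordered leaf level.
keys = [100, 101, 102, 103, 104, 105]

def search(key):
    # O(log n) binary search, analogous to following one root-to-leaf path
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None

def range_query(lo, hi):
    # Ordered storage makes range queries cheap, unlike hashing
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return keys[start:end]

print(search(103))            # position of key 103 in the sorted order
print(range_query(101, 104))  # [101, 102, 103, 104]
```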
Cluster file organization
Advantages:
It makes joining faster and easier, as the related records are stored together and the
key attributes are stored only once.
It can handle dynamic changes in the number and size of records.
Disadvantages:
It is not suitable for static files, where the records do not change often.
It requires extra space for storing the pointers and the index values.
Database design
File structure
A file structure is a combination of representations for data in files. It is also a
collection of operations for accessing the data. It enables applications to read, write,
and modify data. File structures may also help to find the data that matches certain
criteria. An improvement in file structure has a great role in making applications
hundreds of times faster.
It is relatively easy to develop file structure designs that meet these goals when the files
never change. However, as files change, grow, or shrink, designing file structures that
can have these qualities is more difficult.
Database design
1. Database designs provide the blueprints of how the data is going to be stored in
a system. A proper design of a database highly affects the overall performance
of any application.
2. The designing principles defined for a database give a clear idea of the
behaviour of any application and how the requests are processed.
3. A proper database design also ensures that all the requirements of the users
are met.
4. Lastly, the processing time of an application is greatly reduced if the constraints
of designing a highly efficient database are properly implemented.
Life Cycle
Requirement Analysis
First, the basic requirements of the project must be planned, since they determine
how the design of the database proceeds. This stage can be broken down as follows:
Planning - This stage is concerned with planning the entire DDLC (Database
Development Life Cycle). The strategic considerations are taken into account before
proceeding.
System definition - This stage covers the boundaries and scopes of the proper
database after planning.
Database Designing
The next step involves designing the database around the user requirements and
splitting the design into separate models, so that no single aspect carries a heavy
load or too many dependencies. This model-centric approach is where the logical and
physical models play a crucial role.
Logical Model - This stage is primarily concerned with developing a model based on
the proposed requirements. The entire model is designed on paper, without any
implementation or DBMS-specific considerations.
Physical Model - The physical model is concerned with the practices and
implementation of the logical model on a specific DBMS.
Implementation
The last step covers the implementation and checking whether the behaviour matches
the requirements. This is ensured by continuously testing the database against
different data sets and by converting the data into a machine-understandable form.
These steps focus on data manipulation: queries are run to check whether the
application behaves satisfactorily.
Data conversion and loading - This section is used to import and convert data
from the old to the new system.
Testing - This stage is concerned with error identification in the newly implemented
system. Testing is a crucial step because it checks the database directly and compares
the requirement specifications.
Objective of database
Removes Duplicity
If you have lots of data, duplication is bound to occur at some point. The DBMS
guarantees that there is no duplication among the records: while storing a new
record, it makes sure the same data was not inserted before.
It reduces data redundancy and inconsistency by storing data in a single place and
enforcing rules and constraints on the data.
It also supports data durability and recovery by creating backups and logs of the
data, and by restoring the data after system failures or crashes.
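The duplicate check described above can be demonstrated with Python's built-in `sqlite3` module; the table and data here are invented for the example, but the rejection behaviour is how any DBMS enforces a key constraint:

```python
import sqlite3

# In-memory SQLite database; a PRIMARY KEY constraint rejects duplicates.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO student VALUES (100, 'Alice')")
try:
    con.execute("INSERT INTO student VALUES (100, 'Alice')")  # duplicate key
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the DBMS refuses to store the record twice
```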
Integrity
Integrity means your data is authentic and consistent. The DBMS applies various
validity checks that keep your data accurate and consistent. It thereby ensures
data integrity and quality by maintaining the accuracy and validity of the data
and preventing data anomalies and errors.
Platform Independent
A DBMS can run on any platform; no particular platform is required to work with a
database management system.
Normalization
A large database defined as a single relation may result in data duplication. This
repetition of data may result in insertion, update, and deletion anomalies.
To handle these problems, we analyze and decompose the relations with redundant
data into smaller, simpler, well-structured relations that satisfy desirable
properties. Normalization is the process of decomposing relations into relations
with fewer attributes.
What is Normalization?
Normalization works through a series of stages called normal forms. The normal
forms apply to individual relations; a relation is said to be in a particular
normal form if it satisfies that form's constraints.
1NF A relation will be in 1NF if it contains only atomic (indivisible) attribute
values.
2NF A relation will be in 2NF if it is in 1NF and all non-key attributes are fully
functionally dependent on the primary key.
3NF A relation will be in 3NF if it is in 2NF and no non-key attribute is
transitively dependent on the primary key.
BCNF A relation will be in Boyce-Codd normal form (BCNF) if it is in 3NF and, for
every functional dependency X → Y, X is a super key.
4NF A relation will be in 4NF if it is in BCNF and has no multi-valued dependency.
5NF A relation will be in 5NF if it is in 4NF and contains no join dependency;
every join should be lossless.
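As a small illustration of the 2NF step, the following Python sketch decomposes a relation that has a partial dependency; the table, column names, and data are invented for the example:

```python
# Relation with composite key (student_id, course_id).
# student_name depends only on student_id: a partial dependency, so not 2NF.
enrollment = [
    {"student_id": 1, "course_id": "DB", "student_name": "Alice", "grade": "A"},
    {"student_id": 1, "course_id": "OS", "student_name": "Alice", "grade": "B"},
    {"student_id": 2, "course_id": "DB", "student_name": "Bob",   "grade": "A"},
]

# Decompose into two relations, each now in 2NF:
# students(student_id -> student_name) and grades(student_id, course_id, grade).
students = {r["student_id"]: r["student_name"] for r in enrollment}
grades = [{k: r[k] for k in ("student_id", "course_id", "grade")}
          for r in enrollment]

# Lossless join: the original relation can be rebuilt exactly.
rebuilt = [dict(g, student_name=students[g["student_id"]]) for g in grades]
assert sorted(rebuilt, key=lambda r: (r["student_id"], r["course_id"])) == \
       sorted(enrollment, key=lambda r: (r["student_id"], r["course_id"]))
# "Alice" is now stored once in students, removing the redundancy.
```

The decomposition stores each student name exactly once, so updating a name touches one row instead of every enrollment, which is precisely the anomaly normalization removes.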
Advantages of Normalization
o Normalization helps to minimize data redundancy.
o It maintains data consistency within the database.
o It produces a better overall database organization.
Disadvantages of Normalization
o You cannot start building the database before knowing what the user needs.
o The performance degrades when normalizing the relations to higher normal
forms, i.e., 4NF, 5NF.
o It is very time-consuming and difficult to normalize relations of a higher degree.
o Careless decomposition may lead to a bad database design, leading to serious
problems.