Unit 4
1. Normalization
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity.
It involves decomposing a table into smaller, more manageable tables while ensuring that the relationships
between the data are preserved.
Features of Good Relational Designs
Minimized Redundancy: Avoids duplicate data, which saves storage space and reduces
inconsistencies.
Data Integrity: Ensures accuracy and consistency of data through constraints like primary keys,
foreign keys, and unique constraints.
Ease of Maintenance: Simplifies updates, deletions, and insertions by reducing anomalies.
Scalability: Supports future growth and changes in the database structure.
Functional Dependencies
A functional dependency (FD) is a relationship between two sets of attributes in a relation. It is denoted
as X → Y, where X determines Y: any two rows that agree on X must also agree on Y.
Example: In a table Student(StudentID, Name, Age), StudentID → Name means that StudentID
uniquely determines Name.
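The definition can be checked against sample rows. The sketch below is illustrative only: the helper name holds and the sample data are assumptions, and a sample can refute an FD but never prove that it holds for every possible instance.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample instance of Student(StudentID, Name, Age)
students = [
    {"StudentID": 1, "Name": "Asha", "Age": 20},
    {"StudentID": 2, "Name": "Ravi", "Age": 21},
    {"StudentID": 3, "Name": "Asha", "Age": 22},
]

print(holds(students, ["StudentID"], ["Name"]))  # True
print(holds(students, ["Name"], ["StudentID"]))  # False: two students share a name
```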
Normal Forms
Normal forms are a series of guidelines to ensure that a database design is free from redundancy and
anomalies.
First Normal Form (1NF):
A table is in 1NF if:
o All attributes contain atomic (indivisible) values.
o Each column contains only one value per row.
Example: A table with a multi-valued attribute like PhoneNumbers is not in 1NF. It should be split into
separate rows.
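As a minimal sketch of that split, assuming the multi-valued attribute arrives as a Python list (the function name to_1nf and the sample rows are hypothetical):

```python
def to_1nf(rows, multi_attr):
    """Emit one row per value of the multi-valued attribute."""
    flat = []
    for row in rows:
        for value in row[multi_attr]:
            new_row = dict(row)          # copy the other attributes unchanged
            new_row[multi_attr] = value  # keep a single atomic value
            flat.append(new_row)
    return flat

contacts = [
    {"StudentID": 1, "PhoneNumbers": ["555-0101", "555-0102"]},
    {"StudentID": 2, "PhoneNumbers": ["555-0199"]},
]

flat = to_1nf(contacts, "PhoneNumbers")
for row in flat:
    print(row)
```

After the split, the key of the flattened table is the pair (StudentID, PhoneNumbers), since StudentID alone no longer identifies a row.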
Second Normal Form (2NF):
A table is in 2NF if:
o It is in 1NF.
o All non-key attributes are fully functionally dependent on the primary key (no partial
dependency).
Example: In a table Order(OrderID, ProductID, ProductName, Quantity) with composite key (OrderID,
ProductID), ProductName depends only on ProductID. This is a partial dependency, so the table violates
2NF. Split into Order(OrderID, ProductID, Quantity) and Product(ProductID, ProductName).
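The split is a pair of projections. The sketch below assumes the key of the original table is (OrderID, ProductID); the helper project and the sample rows are hypothetical.

```python
def project(rows, attrs):
    """Relational projection: keep attrs and drop duplicate tuples."""
    seen = set()
    result = []
    for row in rows:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            result.append(dict(zip(attrs, t)))
    return result

orders = [
    {"OrderID": 1, "ProductID": "P1", "ProductName": "Pen", "Quantity": 10},
    {"OrderID": 1, "ProductID": "P2", "ProductName": "Ink", "Quantity": 2},
    {"OrderID": 2, "ProductID": "P1", "ProductName": "Pen", "Quantity": 5},
]

order_items = project(orders, ["OrderID", "ProductID", "Quantity"])
products = project(orders, ["ProductID", "ProductName"])

print(len(order_items))  # 3
print(products)          # "Pen" is now stored once instead of twice
```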
Third Normal Form (3NF):
A table is in 3NF if:
o It is in 2NF.
o There are no transitive dependencies (no non-key attribute depends on another non-key attribute).
Example: In a table Employee(EmployeeID, DepartmentID, DepartmentLocation), if
DepartmentLocation depends on DepartmentID, then it depends on EmployeeID only transitively, which
violates 3NF. Split into Employee(EmployeeID, DepartmentID) and Department(DepartmentID,
DepartmentLocation).
Boyce-Codd Normal Form (BCNF):
A stricter version of 3NF.
A table is in BCNF if:
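o For every non-trivial functional dependency X → Y, X is a superkey.
Example: In a table Enrollment(StudentID, CourseID, Instructor) with key (StudentID, CourseID), if each
instructor teaches only one course, then Instructor → CourseID holds but Instructor is not a superkey,
so the table violates BCNF. Split into Teaches(Instructor, CourseID) and StudentInstructor(StudentID,
Instructor). Splitting on CourseID instead would not be a lossless decomposition.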
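A BCNF decomposition should be lossless: joining the pieces must return exactly the original rows. The sketch below assumes Instructor → CourseID and compares two splits on hypothetical sample data. Splitting on Instructor (the left side of the violating FD) is lossless, while splitting on CourseID produces spurious rows.

```python
def project(rows, attrs):
    """Projection as a set of tuples."""
    return {tuple(row[a] for a in attrs) for row in rows}

enrollment = [
    {"StudentID": 1, "CourseID": "DB", "Instructor": "Rao"},
    {"StudentID": 2, "CourseID": "DB", "Instructor": "Rao"},
    {"StudentID": 1, "CourseID": "OS", "Instructor": "Iyer"},
    {"StudentID": 3, "CourseID": "DB", "Instructor": "Sen"},
]
original = project(enrollment, ["StudentID", "CourseID", "Instructor"])

# Split on Instructor: Teaches(Instructor, CourseID), StudentInstructor(StudentID, Instructor)
teaches = project(enrollment, ["Instructor", "CourseID"])
student_instructor = project(enrollment, ["StudentID", "Instructor"])
lossless = {(s, c, i) for (s, i) in student_instructor
            for (i2, c) in teaches if i == i2}

# Split on CourseID: (StudentID, CourseID), (CourseID, Instructor)
student_course = project(enrollment, ["StudentID", "CourseID"])
course_instructor = project(enrollment, ["CourseID", "Instructor"])
lossy = {(s, c, i) for (s, c) in student_course
         for (c2, i) in course_instructor if c == c2}

print(lossless == original)        # True
print(len(lossy) - len(original))  # 3 spurious rows, e.g. (1, "DB", "Sen")
```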
Fourth Normal Form (4NF):
A table is in 4NF if:
o It is in BCNF.
o For every non-trivial multivalued dependency (MVD) X →→ Y, X is a superkey.
Example: In a table EmployeeSkills(EmployeeID, Skill, Language), if an employee can have multiple
skills and languages independently, it violates 4NF. Split into EmployeeSkills(EmployeeID, Skill) and
EmployeeLanguages(EmployeeID, Language).
Functional Dependency Theory
Closure of a Set of Functional Dependencies: The set of all functional dependencies that can be
inferred from a given set of FDs.
Armstrong's Axioms: A set of rules to derive all possible FDs from a given set:
o Reflexivity: If Y ⊆ X, then X → Y.
o Augmentation: If X → Y, then XZ → YZ.
o Transitivity: If X → Y and Y → Z, then X → Z.
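In practice the closure of a set of FDs is rarely enumerated directly. A common shortcut is the attribute closure X+, the set of attributes determined by X: an FD X → Y follows from the given set exactly when Y ⊆ X+. The sketch below is a minimal version of that idea; the function names and the sample FDs over R(A, B, C, D, E) are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, so is the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be derived from fds."""
    return set(rhs) <= closure(lhs, fds)

# Hypothetical FDs: A -> B, B -> C, CD -> E
fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]

print(sorted(closure({"A"}, fds)))       # ['A', 'B', 'C']
print(implies(fds, {"A"}, {"C"}))        # True, by transitivity
print(sorted(closure({"A", "D"}, fds)))  # all five attributes, so AD is a superkey
```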
Multivalued Dependencies
A multivalued dependency (MVD) occurs when an attribute determines a set of values independently
of other attributes.
Denoted as X →→ Y, meaning X multidetermines Y.
Example: In a table Employee(EmployeeID, Skill, Language), if an employee can have multiple skills
and languages independently, there is an MVD.
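One way to test an MVD on a sample instance is the join criterion: X →→ Y holds exactly when the relation equals the join of its projections onto X ∪ Y and onto X ∪ (all other attributes). The sketch below applies this to hypothetical rows; the function name mvd_holds is an assumption.

```python
def mvd_holds(rows, x, y, all_attrs):
    """Check X ->> Y by comparing the rows with the join of two projections."""
    z = [a for a in all_attrs if a not in x and a not in y]

    def pick(row, attrs):
        return tuple(row[a] for a in attrs)

    original = {(pick(r, x), pick(r, y), pick(r, z)) for r in rows}
    xy = {(kx, vy) for kx, vy, _ in original}
    xz = {(kx, vz) for kx, _, vz in original}
    # Natural join of the two projections on X
    rejoined = {(kx, vy, vz) for kx, vy in xy for kx2, vz in xz if kx == kx2}
    return rejoined == original

ATTRS = ["EmployeeID", "Skill", "Language"]
rows = [
    {"EmployeeID": 1, "Skill": "Python", "Language": "English"},
    {"EmployeeID": 1, "Skill": "Python", "Language": "Hindi"},
    {"EmployeeID": 1, "Skill": "SQL", "Language": "English"},
    {"EmployeeID": 1, "Skill": "SQL", "Language": "Hindi"},
]

print(mvd_holds(rows, ["EmployeeID"], ["Skill"], ATTRS))      # True
print(mvd_holds(rows[:3], ["EmployeeID"], ["Skill"], ATTRS))  # False: one combination missing
```

The four rows store only two skills and two languages, which is the redundancy that a split into (EmployeeID, Skill) and (EmployeeID, Language) removes.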
Database Design Process
1. Requirement Analysis: Understand the data and its relationships.
2. Conceptual Design: Create an Entity-Relationship (ER) model.
3. Logical Design: Convert the ER model into relational schemas.
4. Normalization: Apply normal forms to eliminate redundancy.
5. Physical Design: Implement the database with file organization, indexing, and hashing.
Database design is a structured process that involves converting requirements and conceptual models into a set
of relational schemas that can be implemented in a database system. The process ensures that the database is
efficient, accurate, and easy to maintain, and normalization is a key part of that process.
Overview of the Database Design Process
1. ER Diagram to Relational Schema:
o The design often starts with an Entity-Relationship (E-R) diagram. This is a conceptual
model of the data.
o The E-R diagram is then converted into relation schemas. This step involves creating tables
(relations) based on the entities and their relationships in the diagram.
2. Normalization:
o After creating a relation schema, normalization is applied. Normalization is a process that
removes redundancy and ensures the schema adheres to specific normal forms (e.g., 1NF, 2NF,
3NF, BCNF).
o The goal of normalization is to organize the schema in such a way that it reduces anomalies and
ensures data integrity.
3. Denormalization:
o Sometimes, to improve performance, a database designer might intentionally denormalize the
schema. Denormalization introduces some redundancy (i.e., storing the same data in multiple
places) to make certain queries faster.
o This is particularly useful when reads are far more frequent than writes. However, every update
must then keep all redundant copies consistent.
Specific Design Issues
1. E-R Diagram and Normalization:
When creating an E-R diagram, careful attention to entity design can often eliminate the need for
extensive normalization later.
Functional dependencies in an entity set (like dept_name → dept_address) need to be handled, or they
will lead to redundancy.
In cases of complex relationships or multivalued dependencies, normalization helps in breaking down
the relation schema into smaller, more manageable pieces.
2. Naming Conventions:
Unique Role Assumption: Each attribute name should have a unique meaning to avoid confusion. For
instance, naming a field "number" for both a room number and phone number in different tables is
problematic.
Consistency in naming conventions across entities and relationships is crucial for clarity. This makes
database management easier, especially in large systems.
3. Denormalization for Performance:
Denormalization is sometimes used to speed up query performance at the cost of data redundancy.
A normalized schema may require joins to answer certain queries. Denormalizing the schema (like
storing combined information in one relation) can avoid costly joins.
Materialized views are an alternative to full denormalization, where the database stores the result of a
join or a computation for quicker access.
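The trade-off can be sketched with Python's built-in sqlite3 module. SQLite has no materialized views, so the example imitates one by storing a join result in an ordinary table; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_name TEXT PRIMARY KEY, building TEXT);
    CREATE TABLE instructor (
        id INTEGER PRIMARY KEY,
        name TEXT,
        dept_name TEXT REFERENCES department(dept_name)
    );
    INSERT INTO department VALUES ('Physics', 'Watson'), ('Music', 'Packard');
    INSERT INTO instructor VALUES (1, 'Einstein', 'Physics'), (2, 'Mozart', 'Music');

    -- Denormalized copy: the join result is stored, so reads need no join.
    CREATE TABLE instructor_with_building AS
        SELECT i.id, i.name, i.dept_name, d.building
        FROM instructor i JOIN department d ON i.dept_name = d.dept_name;
""")

rows = conn.execute(
    "SELECT name, building FROM instructor_with_building ORDER BY id"
).fetchall()
print(rows)  # [('Einstein', 'Watson'), ('Mozart', 'Packard')]

# The cost: an update to the base table does not reach the stored copy.
conn.execute("UPDATE department SET building = 'Taylor' WHERE dept_name = 'Physics'")
stale = conn.execute(
    "SELECT building FROM instructor_with_building WHERE id = 1"
).fetchone()[0]
print(stale)  # still 'Watson' until the copy is rebuilt
```

Systems with real materialized views refresh the stored result automatically or on request, instead of leaving that work to the application.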
4. Other Design Issues:
Time-related Data: Storing time-series data can be tricky. For example, in a university database,
creating separate relations for every year (e.g., total_inst_2007, total_inst_2008, etc.) is problematic
because it requires new relations to be added every year and complicates queries.
Crosstab Representations: Sometimes data is represented in a cross-tab format (e.g., one column per
year for a department). While this may be useful for reports, it is not suitable for the underlying
database design.
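Both issues have the same usual remedy: store the year as an attribute in one relation, and build the cross-tab only when a report needs it. The sketch below assumes a single relation of (dept_name, year, total) rows; the data and the function name crosstab are hypothetical.

```python
from collections import defaultdict

# One relation with a year attribute instead of total_inst_2007, total_inst_2008, ...
total_inst = [
    ("Physics", 2007, 12),
    ("Physics", 2008, 14),
    ("Music", 2007, 5),
    ("Music", 2008, 6),
]

def crosstab(rows):
    """Pivot (dept, year, total) rows into {dept: {year: total}} for reporting."""
    table = defaultdict(dict)
    for dept, year, total in rows:
        table[dept][year] = total
    return dict(table)

report = crosstab(total_inst)
print(report["Physics"])  # {2007: 12, 2008: 14}

# A new year is just another row; no new relation or column is needed.
total_inst.append(("Physics", 2009, 15))
print(crosstab(total_inst)["Physics"][2009])  # 15
```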
Practical Steps in the Database Design Process
1. Start with Conceptual Design:
o Create the E-R diagram based on the requirements.
o Identify entities, relationships, and attributes.
2. Convert E-R Diagram to Relational Schema:
o Convert entities into tables.
o Define primary keys and foreign keys.
3. Normalize the Schema:
o Apply normalization rules (1NF, 2NF, 3NF, BCNF) to ensure the schema is well-structured and
free from redundancy.
4. Optimize for Performance:
o If performance is a concern, consider denormalization or materialized views to speed up query
execution.
5. Test and Refine:
o Test the schema with sample data and queries.
o Refine the design as necessary based on performance or new requirements.
By following these steps and considering design principles like normalization, naming conventions, and
performance optimizations, you can ensure a well-designed, efficient, and maintainable database.
2. File Organization
File organization refers to how data is stored in files and accessed in a database system.
Types of File Organization
Heap File Organization: Records are stored in no particular order. Insertion is fast, but searching and
deletion are slow because they may require scanning the whole file.
Sequential File Organization: Records are stored in a sorted order based on a key. Efficient for range
queries but slow for insertions and deletions.
Hash File Organization: Records are stored based on a hash function applied to a key. Fast for exact
match queries but inefficient for range queries.
Indexed Sequential Access Method (ISAM): Combines sequential and indexed file organization.
Uses an index to locate records quickly.
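The difference between heap and sequential organization can be sketched with in-memory lists standing in for files. The record layout and function names are assumptions; real systems work on disk blocks, not Python lists.

```python
import bisect

# Heap file: records in arrival order, (key, payload) pairs.
heap_file = [(17, "Q"), (3, "A"), (42, "Z"), (8, "C"), (25, "M")]

def heap_search(key):
    """Linear scan: every record may need to be examined."""
    for k, payload in heap_file:
        if k == key:
            return payload
    return None

# Sequential file: the same records kept sorted by key.
sorted_file = sorted(heap_file)
keys = [k for k, _ in sorted_file]

def sequential_search(key):
    """Binary search on the sorted keys."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sorted_file[i][1]
    return None

def range_query(lo, hi):
    """All records with lo <= key <= hi, read as one contiguous slice."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return sorted_file[start:end]

print(heap_search(8), sequential_search(8))  # C C
print(range_query(5, 30))                    # [(8, 'C'), (17, 'Q'), (25, 'M')]
```

The sorted layout pays for this on insertion, because a new record must be placed at its sorted position rather than appended.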
3. Indexing
Indexing is a data structure technique used to quickly locate and access data in a database.
Ordered Indices
Primary Index: Created on the attribute by which the data file is sorted, typically the primary key.
Secondary Index: Created on an attribute that does not determine the file's sort order, allowing faster
access to records by that attribute.
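A secondary index can be pictured as a map from attribute values to record positions. In the sketch below the file is sorted on id, and a secondary index is built on the department field; the records and the function name are hypothetical.

```python
from collections import defaultdict

# Data file sorted on id (field 0); the department is field 2.
instructors = [
    (101, "Rao", "Physics"),
    (102, "Sen", "Music"),
    (103, "Iyer", "Physics"),
    (104, "Das", "History"),
]

def build_secondary_index(records, field):
    """Map each value of the field to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index

dept_index = build_secondary_index(instructors, 2)
physics = [instructors[p] for p in dept_index.get("Physics", [])]
print(physics)  # [(101, 'Rao', 'Physics'), (103, 'Iyer', 'Physics')]
```

Because the file is not sorted on department, one index entry may point to several scattered positions, unlike an index on the sort attribute.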
B+ Tree Index Files
A balanced tree structure where all leaf nodes are at the same level.
Supports efficient insertion, deletion, and search operations.
Used in most database systems for indexing.
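The search path through a B+ tree can be sketched on a small hand-built tree. This is a read-only illustration: insertion and deletion with node splits and merges are omitted, and the class names and the link between sibling leaves are assumptions of the sketch.

```python
import bisect

class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None  # right sibling, used for range scans

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # always len(keys) + 1 children

def find_leaf(node, key):
    """Descend from the root to the leaf that would contain key."""
    while isinstance(node, Internal):
        # Keys equal to a separator are stored in the child to its right.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node

def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect.bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None

def range_scan(root, lo, hi):
    """Collect all entries with lo <= key <= hi by walking the leaf chain."""
    leaf = find_leaf(root, lo)
    out = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return out
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next
    return out

# Hand-built tree: one root over three leaves, all at the same depth.
l1 = Leaf([5, 10], ["a", "b"])
l2 = Leaf([15, 20], ["c", "d"])
l3 = Leaf([25, 30], ["e", "f"])
l1.next, l2.next = l2, l3
root = Internal([15, 25], [l1, l2, l3])

print(search(root, 20))          # d
print(search(root, 12))          # None
print(range_scan(root, 10, 25))  # [(10, 'b'), (15, 'c'), (20, 'd'), (25, 'e')]
```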
B Tree Index Files
Similar to B+ trees but stores data in both internal and leaf nodes.
Less commonly used in databases compared to B+ trees.
4. Hashing
Hashing is a technique that uses a hash function to map a search key to a bucket in a hash table for fast access.
Static Hashing
The size of the hash table is fixed.
Collision: Occurs when two keys hash to the same location.
Collision Resolution Techniques:
o Chaining: Store multiple items in the same location using linked lists.
o Open Addressing: Find another location within the hash table.
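A minimal sketch of static hashing with chaining follows. It assumes integer keys and a modulo hash function, and it uses Python lists in place of linked lists; the class name is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, num_buckets=4):
        # The number of buckets never changes: this is static hashing.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return key % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # a collision just extends the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

table = ChainedHashTable(num_buckets=4)
for key in (1, 5, 9, 2):  # 1, 5 and 9 all hash to bucket 1
    table.put(key, f"record-{key}")

print(table.get(9))           # record-9
print(len(table.buckets[1]))  # 3 entries chained in one bucket
```

As more keys arrive, the chains grow while the bucket count stays fixed, which is the weakness that dynamic hashing addresses.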
Dynamic Hashing
The size of the hash table can grow or shrink dynamically.
Extendible Hashing: Uses a directory to point to buckets, allowing the hash table to expand.
Linear Hashing: Gradually increases the number of buckets as needed.
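Of the two schemes, linear hashing is the shorter one to sketch. The version below assumes distinct integer keys, a modulo hash family, and a split triggered by a load-factor threshold; these choices and all names are illustrative, not a definitive implementation.

```python
class LinearHashTable:
    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets
        self.level = 0   # number of completed doubling rounds
        self.split = 0   # next bucket to be split in this round
        self.max_load = max_load
        self.count = 0
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        size = self.n0 * 2 ** self.level
        b = key % size
        if b < self.split:  # bucket already split in this round:
            b = key % (size * 2)  # use the next-level hash function
        return b

    def insert(self, key, value):
        self.buckets[self._address(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        size = self.n0 * 2 ** self.level
        old = self.buckets[self.split]
        keep, move = [], []
        for k, v in old:
            (keep if k % (size * 2) == self.split else move).append((k, v))
        self.buckets[self.split] = keep
        self.buckets.append(move)  # new bucket gets index split + size
        self.split += 1
        if self.split == size:     # round finished: table has doubled
            self.level += 1
            self.split = 0

    def get(self, key):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None

table = LinearHashTable()
for key in range(20):
    table.insert(key, key * key)

print(table.get(7))        # 49
print(len(table.buckets))  # grew from 2 buckets, one split at a time
```

Only one bucket is split per step, so no single insertion has to rehash the whole table.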
Summary
Normalization ensures a well-structured database by eliminating redundancy and anomalies.
File Organization determines how data is stored and accessed.
Indexing (e.g., B+ trees, B trees) improves query performance.
Hashing (static and dynamic) provides fast data retrieval but requires careful handling of collisions.
This unit is crucial for designing efficient, scalable, and maintainable database systems.