Unit I


Advanced Databases and Mining

UNIT-I

Introduction:
Concepts and Definitions:

An advanced database often refers to sophisticated database systems and technologies
designed to handle complex data management tasks beyond traditional relational databases.
Here are some key concepts and definitions related to advanced databases:

1. NoSQL Databases

• Definition: Non-relational databases designed to handle unstructured or
semi-structured data. They scale horizontally and provide flexible schema designs.
• Types:
o Document Stores: Store data as documents, typically JSON-like (e.g., MongoDB, CouchDB).
o Key-Value Stores: Store each value under a unique key (e.g., Redis,
DynamoDB).
o Column-Family Stores: Group related columns into column families and store
them together, rather than storing whole rows (e.g., Cassandra, HBase).
o Graph Databases: Manage data in graph structures with nodes and edges
(e.g., Neo4j, Amazon Neptune).

2. NewSQL Databases

• Definition: Modern relational databases that aim to match the horizontal scalability
and performance of NoSQL systems while keeping SQL-based querying and ACID transactions.
• Examples: Google Spanner, CockroachDB, NuoDB.

3. Distributed Databases

• Definition: Databases that spread data across multiple locations or nodes to ensure
reliability and scalability.
• Characteristics:
o Replication: Duplication of data across different nodes for redundancy.
o Sharding: Splitting data into smaller chunks and distributing them across
nodes to improve performance.
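
The two characteristics above can be sketched in a few lines. This is a minimal illustration, assuming simple hash-based sharding and "next node in the list" replica placement; the function names, node names, and keys are hypothetical, and real systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hash-based sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Pick the primary node by hash, then the following nodes as replicas."""
    start = shard_for(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical cluster and keys, for illustration only.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = {
    key: replicas_for(key, nodes, replication_factor=2)
    for key in ["user:1", "user:2", "user:3"]
}
for key, owners in placement.items():
    print(key, "->", owners)
```

Each key is stored on two distinct nodes, so losing one node does not lose the data (replication), while different keys land on different nodes (sharding).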

4. Data Warehousing

• Definition: A system used for reporting and data analysis, designed to handle large
volumes of data from different sources.
• Components:
o ETL (Extract, Transform, Load): Processes to gather, clean, and integrate
data.
o OLAP (Online Analytical Processing): Techniques for querying and
analyzing multidimensional data.

5. Big Data Technologies

• Definition: Technologies designed to process and analyze large volumes of data that
traditional databases cannot handle efficiently.
• Tools:
o Hadoop: An open-source framework for distributed storage and processing.
o Spark: A unified analytics engine for large-scale data processing.

6. Data Lakes

• Definition: Centralized repositories that allow you to store structured,
semi-structured, and unstructured data at scale.
• Features: Flexibility in data ingestion and storage, and the ability to perform
advanced analytics and machine learning.

7. Multi-Model Databases

• Definition: Databases that support multiple data models (e.g., relational, document,
graph) within a single system.
• Examples: ArangoDB, OrientDB.

8. Database as a Service (DBaaS)

• Definition: Cloud-based database management services that provide scalability,
management, and maintenance without the user having to provision or operate physical
hardware.
• Examples: Amazon RDS, Azure SQL Database, Google Cloud SQL.

Relational models:

Relational models in database management systems (DBMS) are based on the concept of
relations, which are essentially tables. Here's a brief overview of the key components and
concepts:

1. Tables (Relations):
o Rows (Tuples): Each row represents a single record or data entry.
o Columns (Attributes): Each column represents a field or property of the
record.
2. Primary Key:
o A unique identifier for each row in a table. It ensures that each record can be
uniquely identified. For example, a student ID in a student table.
3. Foreign Key:
o An attribute in one table that links to the primary key of another table. It
establishes a relationship between the two tables. For example, a course ID in
a student enrollment table that refers to the course ID in a courses table.
4. Schema:
o The structure of the database, including tables, columns, data types, and
relationships between tables.
5. Normalization:
o The process of organizing data to reduce redundancy and improve data
integrity. It involves dividing tables into smaller tables and defining
relationships between them.
6. SQL (Structured Query Language):
o A language used to interact with the relational database, allowing you to
query, insert, update, and delete data.
7. Integrity Constraints:
o Rules applied to ensure the accuracy and consistency of data. Examples
include primary key constraints, foreign key constraints, and unique
constraints.
8. Relationships:
o One-to-One: Each record in one table is related to one record in another table.
o One-to-Many: A record in one table can be related to multiple records in
another table.
o Many-to-Many: Records in one table can be related to multiple records in
another table, and vice versa. This often requires a junction table to manage
the relationships.

The relational model provides a systematic way to manage data and relationships, making it
easier to handle and query complex datasets.
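
The components listed above can be seen together in a small runnable sketch. It assumes Python's built-in sqlite3 module and an in-memory database; the table and column names (students, courses, enrollments) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,      -- primary key
    name       TEXT NOT NULL
);
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL UNIQUE       -- unique constraint
);
-- Junction table: resolves the many-to-many relationship
CREATE TABLE enrollments (
    student_id INTEGER NOT NULL REFERENCES students(student_id),  -- foreign key
    course_id  INTEGER NOT NULL REFERENCES courses(course_id),    -- foreign key
    PRIMARY KEY (student_id, course_id)
);
""")

conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO courses VALUES (?, ?)", [(101, "Math"), (102, "Science")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)", [(1, 101), (1, 102), (2, 101)])

# An enrollment for a student that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO enrollments VALUES (3, 101)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

rows = conn.execute("""
    SELECT s.name, c.title
    FROM enrollments e
    JOIN students s ON s.student_id = e.student_id
    JOIN courses  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)
print("foreign key violation rejected:", fk_rejected)
```

The junction table carries one row per (student, course) pair, which is how a many-to-many relationship is usually represented in the relational model.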

Data Modelling and Query Languages:

In database management systems (DBMS), data modeling and query languages are
fundamental for designing, managing, and interacting with databases. Here’s a breakdown of
both concepts:

Data Modelling

Data modeling is the process of creating a conceptual framework for organizing and
structuring data in a database. It involves defining how data will be stored, related, and
managed. The primary components of data modeling are:

1. Entities: Objects or things in the real world that are relevant to the database (e.g.,
customers, products).
2. Attributes: Properties or details about entities (e.g., customer name, product price).
3. Relationships: Associations between entities (e.g., customers place orders, orders
contain products).
4. Schema: The overall structure of the database, including tables, columns, and
relationships.

Common types of data models include:

• Entity-Relationship (ER) Model: Uses entities and relationships to represent data.
• Relational Model: Organizes data into tables (relations) with rows (tuples) and
columns (attributes).
• Object-Oriented Model: Integrates object-oriented programming principles into
database design.

Query Languages

Query languages are used to retrieve, manipulate, and manage data in a database. The most
common query languages are:

1. SQL (Structured Query Language): The standard language for relational databases.
It includes commands for querying (e.g., SELECT), updating (e.g., UPDATE), inserting
(e.g., INSERT), and deleting (e.g., DELETE) data.
o DML (Data Manipulation Language): Includes commands like SELECT,
INSERT, UPDATE, and DELETE.
o DDL (Data Definition Language): Includes commands like CREATE, ALTER,
and DROP to define and modify database structures.
o DCL (Data Control Language): Includes commands like GRANT and REVOKE
for managing permissions.
2. NoSQL Query Languages: Used in non-relational databases (NoSQL). These can
vary widely depending on the database type. Examples include:
o MongoDB Query Language (MQL): For querying MongoDB databases.
o CQL (Cassandra Query Language): For querying Cassandra databases.
o Gremlin: For querying graph databases.
3. SPARQL: Used for querying RDF (Resource Description Framework) data in
semantic web and linked data contexts.
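
To make the DDL/DML split concrete, here is a small sketch using Python's built-in sqlite3 module. The products table and its rows are hypothetical. DCL is left out because SQLite, unlike server-based systems, has no GRANT or REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure, then modify it.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("ALTER TABLE products ADD COLUMN in_stock INTEGER DEFAULT 1")

# DML: insert, update, delete, and query rows.
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(1, "pen", 1.5), (2, "notebook", 4.0), (3, "stapler", 9.0)],
)
conn.execute("UPDATE products SET price = price * 2 WHERE name = ?", ("pen",))
conn.execute("DELETE FROM products WHERE price > ?", (5.0,))
remaining = conn.execute(
    "SELECT name, price, in_stock FROM products ORDER BY id"
).fetchall()
print(remaining)  # [('pen', 3.0, 1), ('notebook', 4.0, 1)]
```

Parameters are passed with `?` placeholders rather than string formatting, which is the usual way to keep values separate from the SQL text.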

Integrating Data Modeling and Query Languages

Data modeling provides the structure and design for how data will be stored, while query
languages enable users to interact with and manipulate this data. Effective data modeling
ensures that the database is well-organized and optimized for querying, and understanding the
query language helps users to efficiently retrieve and manage the data they need.

Database Objects:

In a Database Management System (DBMS), database objects are various structures that
store and organize data. Here’s a rundown of some common types:

1. Tables: The fundamental building blocks where data is stored. They consist of rows
(records) and columns (fields).
2. Views: Virtual tables created by querying one or more tables. They present data in a
specific format or subset, without storing it separately.
3. Indexes: Structures that improve the speed of data retrieval operations on a table.
They work like book indexes, allowing quick lookups.
4. Sequences: Objects used to generate unique numbers, often for primary keys. They
provide a way to automatically generate sequential numbers.
5. Stored Procedures: Precompiled collections of SQL statements that can be executed
as a unit. They help encapsulate logic and improve performance.
6. Functions: Similar to stored procedures, but they return a single value. They can be
used in SQL statements like expressions.
7. Triggers: Procedures that are automatically executed in response to certain events on
a table, like insertions or updates.
8. Constraints: Rules applied to table columns to enforce data integrity, such as primary
keys, foreign keys, unique constraints, and check constraints.
9. Schemas: Collections of database objects that group together tables, views, indexes,
etc. They help organize and manage these objects.

These objects work together to store, manage, and retrieve data efficiently in a DBMS.
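
Several of these objects can be created and observed in a short sketch. It assumes Python's built-in sqlite3 module; the orders and order_log tables and all object names are hypothetical. SQLite does not provide sequences or stored procedures, so those are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   REAL NOT NULL CHECK (amount >= 0)   -- check constraint
);
CREATE TABLE order_log (order_id INTEGER, note TEXT);

-- Index: speeds up lookups by customer
CREATE INDEX idx_orders_customer ON orders(customer);

-- View: a stored query, not a stored copy of the data
CREATE VIEW large_orders AS
    SELECT order_id, customer, amount FROM orders WHERE amount >= 100;

-- Trigger: runs automatically after each insert into orders
CREATE TRIGGER log_new_order AFTER INSERT ON orders
BEGIN
    INSERT INTO order_log VALUES (NEW.order_id, 'created');
END;
""")

conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 250.0), (2, "bob", 40.0)],
)

large = conn.execute("SELECT order_id, customer FROM large_orders").fetchall()
log = conn.execute("SELECT order_id, note FROM order_log ORDER BY order_id").fetchall()
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

print(large)    # rows visible through the view
print(log)      # rows written by the trigger
print(objects)  # catalog listing of the objects created above
```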

Normalization Techniques:

Functional Dependency:

In database management systems (DBMS), normalization is a process used to organize a
database into tables and columns to reduce redundancy and improve data integrity. One key
concept in normalization is functional dependency.

Functional Dependency

A functional dependency (FD) describes a relationship between attributes in a relational
database. Specifically, it indicates that if two tuples (rows) of a relation (table) have the same
value for a set of attributes, then they must also have the same value for another set of
attributes.

Formally, if $X$ and $Y$ are sets of attributes in a relation, we say that $Y$ is
functionally dependent on $X$, denoted $X \rightarrow Y$, if and only if, for
any two tuples $t_1$ and $t_2$ in the relation, if $t_1[X] = t_2[X]$,
then $t_1[Y] = t_2[Y]$.
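
The definition translates directly into a check over a table's rows. The sketch below is a minimal illustration; the function name fd_holds and the sample rows are hypothetical. Note that a sample of rows can only refute a functional dependency: passing the check on one instance does not prove the dependency holds for every valid instance of the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds in this relation instance.

    rows is a list of dicts; lhs and rhs are lists of attribute names.
    """
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical sample data.
enrollment = [
    {"StudentID": 1, "CourseID": 101, "CourseName": "Math"},
    {"StudentID": 1, "CourseID": 102, "CourseName": "Science"},
    {"StudentID": 2, "CourseID": 101, "CourseName": "Math"},
]

print(fd_holds(enrollment, ["CourseID"], ["CourseName"]))  # True
print(fd_holds(enrollment, ["StudentID"], ["CourseID"]))   # False
```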

Types of Functional Dependencies

1. Trivial Functional Dependency: This is when $Y$ is a subset of $X$. For
example, if $X$ is {A, B} and $Y$ is {A}, then $\{A, B\} \rightarrow A$ is
trivial: A is already part of the left side, so the dependency holds in every relation.
2. Non-trivial Functional Dependency: This is when $Y$ is not a subset of $X$. For
example, if $X$ is {A} and $Y$ is {B}, and B depends on A, then
$A \rightarrow B$ is non-trivial.
3. Completely Non-trivial Functional Dependency: If $X$ and $Y$ are disjoint,
i.e., $X \cap Y = \emptyset$, then $X \rightarrow Y$ is
completely non-trivial.
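
The three cases depend only on how the two attribute sets overlap, so they can be told apart with set operations. This is a small sketch with a hypothetical function name; it returns the most specific label, so a dependency with disjoint sides is reported as "completely non-trivial" even though it is also non-trivial.

```python
def classify_fd(lhs, rhs):
    """Classify X -> Y by comparing the attribute sets X (lhs) and Y (rhs)."""
    lhs, rhs = set(lhs), set(rhs)
    if rhs <= lhs:
        return "trivial"                 # Y is a subset of X
    if lhs & rhs:
        return "non-trivial"             # Y overlaps X but is not contained in it
    return "completely non-trivial"      # X and Y are disjoint


print(classify_fd({"A", "B"}, {"A"}))       # trivial
print(classify_fd({"A", "B"}, {"B", "C"}))  # non-trivial
print(classify_fd({"A"}, {"B"}))            # completely non-trivial
```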

Normal Forms and Functional Dependency

Normalization involves decomposing tables into multiple tables based on functional
dependencies to achieve different normal forms:

1. First Normal Form (1NF): Ensures that each column contains only atomic
(indivisible) values and each column has a unique name.
2. Second Normal Form (2NF): Achieved when a table is in 1NF and all non-key
attributes are fully functionally dependent on the primary key. It addresses partial
dependency (where non-key attributes depend on part of a composite primary key).
3. Third Normal Form (3NF): Achieved when a table is in 2NF and all non-key attributes
are functionally dependent only on the primary key, not on other non-key attributes. It
addresses transitive dependency (where non-key attributes depend on other non-key
attributes).
4. Boyce-Codd Normal Form (BCNF): A stricter version of 3NF, it requires that for
every non-trivial functional dependency $X \rightarrow Y$, $X$ must be a superkey.
5. Fourth Normal Form (4NF): Addresses multi-valued dependencies. A relation is in
4NF if it is in BCNF and every non-trivial multi-valued dependency has a superkey on
its left side.
6. Fifth Normal Form (5NF): Deals with join dependencies. A relation is in 5NF if every
non-trivial join dependency is implied by its candidate keys, so it cannot be decomposed
further to remove redundancy while preserving the ability to reconstruct the
original table.
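
Several of these definitions hinge on whether a set of attributes is a superkey, which can be decided with the standard attribute-closure algorithm. The sketch below is illustrative only: it assumes FDs are given as (left, right) pairs of attribute tuples, and the function names and example schema are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure: every attribute determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    """attrs is a superkey if its closure covers the whole schema."""
    return closure(attrs, fds) >= set(schema)


def bcnf_violations(schema, fds):
    """Non-trivial FDs in fds whose left side is not a superkey of schema."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not is_superkey(lhs, schema, fds)
    ]


# Hypothetical schema: CourseName depends on only part of the key.
schema = {"StudentID", "CourseID", "CourseName"}
fds = [(("CourseID",), ("CourseName",))]

print(closure({"CourseID"}, fds))
print(is_superkey({"StudentID", "CourseID"}, schema, fds))  # True
print(bcnf_violations(schema, fds))
```

Here CourseID determines CourseName but is not a superkey, so the dependency is reported as a violation; the same dependency is also the partial dependency that 2NF rules out.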

Understanding functional dependencies is crucial in designing and normalizing database
schemas to ensure efficiency and consistency.

Normalization in database management systems (DBMS) is a process used to organize a
database to reduce redundancy and improve data integrity. Here’s a brief overview of the first
four normal forms (1NF, 2NF, 3NF, and BCNF):

1NF (First Normal Form)

A table is in 1NF if:

• All columns contain atomic (indivisible) values.
• Each column contains values of a single type.
• Each column must have a unique name.
• The order in which data is stored does not matter.

Example: Consider a table that records student information with subjects they are enrolled
in.

StudentID  Name   Subjects
1          Alice  Math, Science
2          Bob    English, History

This table is not in 1NF because the Subjects column contains multiple values. To convert it
to 1NF, you would split the subjects into separate rows.
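
The conversion can be sketched as follows, using the rows from the table above. The function name to_1nf and the output column name Subject are assumptions made for this illustration.

```python
unnormalized = [
    {"StudentID": 1, "Name": "Alice", "Subjects": "Math, Science"},
    {"StudentID": 2, "Name": "Bob", "Subjects": "English, History"},
]


def to_1nf(rows, multi_valued_column, atomic_column, separator=","):
    """Emit one row per value found in the multi-valued column."""
    flat = []
    for row in rows:
        for value in row[multi_valued_column].split(separator):
            new_row = {k: v for k, v in row.items() if k != multi_valued_column}
            new_row[atomic_column] = value.strip()
            flat.append(new_row)
    return flat


flat = to_1nf(unnormalized, "Subjects", "Subject")
for row in flat:
    print(row)
```

The result has four rows, one per (student, subject) pair, and every column now holds a single value. The key of the flattened table is (StudentID, Subject) rather than StudentID alone.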

2NF (Second Normal Form)

A table is in 2NF if:

• It is in 1NF.
• All non-key attributes are fully functionally dependent on the entire primary key (i.e.,
there are no partial dependencies).

Example: If you have a table where StudentID and CourseID together form the primary
key, but CourseName and Instructor depend only on CourseID, this is a partial dependency.

StudentID  CourseID  CourseName  Instructor
1          101       Math        Dr. Smith
1          102       Science     Dr. Jones

Here, CourseName and Instructor are dependent only on CourseID, not the entire composite
key (StudentID, CourseID). To convert this to 2NF, you would split the table into:

Students-Courses Table:

StudentID  CourseID
1          101
1          102

Courses Table:

CourseID  CourseName  Instructor
101       Math        Dr. Smith
102       Science     Dr. Jones

3NF (Third Normal Form)

A table is in 3NF if:

• It is in 2NF.
• There are no transitive dependencies (i.e., non-key attributes should not depend on
other non-key attributes).

Example: Suppose you have a table where StudentID is the primary key and there are
attributes AdvisorName and AdvisorOffice.

StudentID  Name   AdvisorName  AdvisorOffice
1          Alice  Dr. Smith    Room 101
2          Bob    Dr. Jones    Room 102

Here, AdvisorOffice is dependent on AdvisorName, which in turn is dependent on
StudentID. To convert this to 3NF, split the table into two, identifying each advisor by
an AdvisorID key:

Students Table:

StudentID  Name   AdvisorID
1          Alice  1
2          Bob    2

Advisors Table:

AdvisorID  AdvisorName  AdvisorOffice
1          Dr. Smith    Room 101
2          Dr. Jones    Room 102

BCNF (Boyce-Codd Normal Form)

A table is in BCNF if:

• It is in 3NF.
• For every one of its non-trivial functional dependencies, the left side is a superkey.

Example: Consider a table of StudentID, CourseID, and Instructor, where (StudentID,
CourseID) is the key and determines Instructor, and each instructor teaches only one
course, so Instructor determines CourseID.

StudentID  CourseID  Instructor
1          101       Dr. Smith
1          102       Dr. Jones

Here Instructor is not a superkey but functionally determines CourseID, so the table
violates BCNF. It is still in 3NF, because CourseID is part of a key. To fix this, ensure
that every non-trivial functional dependency’s left side is a superkey by decomposing the
table into two tables based on the functional dependencies: (Instructor, CourseID) and
(StudentID, Instructor).

Multi valued Dependency:

In Database Management Systems (DBMS), a Multi-Valued Dependency (MVD) is a type of
dependency in which one attribute determines a set of values of another attribute, rather
than a single value, and that set does not depend on the remaining attributes of the table.

Definition

A Multi-Valued Dependency occurs when, for a given value of an attribute $A$, a table
contains multiple values for another attribute $B$ independently of a third attribute $C$.
Formally, for a relation $R$ with attributes $A$, $B$, and $C$, a multi-valued
dependency $A \rightarrow\rightarrow B$ means that for each value of $A$,
the set of values of $B$ is independent of $C$. That is, for a given $A$ value, every
associated $B$ value appears in combination with every associated $C$ value.

Example

Consider a relation with attributes StudentID, Course, and Hobby:

StudentID  Course   Hobby
1          Math     Reading
1          Math     Cycling
1          Science  Reading
1          Science  Cycling
2          Math     Painting
2          Math     Drawing
2          Science  Painting
2          Science  Drawing

In this relation, StudentID →→ Hobby means that the set of hobbies for each student is
independent of the course. For each StudentID, the set of hobbies remains the same
regardless of which course they are taking.
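
Whether this dependency holds in a given set of rows can be checked mechanically: within each StudentID group, the (Hobby, Course) pairs must form the full cross product of the hobbies and the courses seen for that student. The sketch below uses the rows from the table above; the function name mvd_holds is hypothetical.

```python
from itertools import product


def mvd_holds(rows, x, y):
    """Check X ->> Y on a relation instance given as a list of dicts."""
    attrs = list(rows[0].keys())
    z = [a for a in attrs if a not in x and a not in y]  # remaining attributes
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in x)
        pair = (tuple(row[a] for a in y), tuple(row[a] for a in z))
        groups.setdefault(key, set()).add(pair)
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # Every Y value must occur with every Z value for the same X value.
        if pairs != set(product(ys, zs)):
            return False
    return True


data = [
    (1, "Math", "Reading"), (1, "Math", "Cycling"),
    (1, "Science", "Reading"), (1, "Science", "Cycling"),
    (2, "Math", "Painting"), (2, "Math", "Drawing"),
    (2, "Science", "Painting"), (2, "Science", "Drawing"),
]
rows = [{"StudentID": s, "Course": c, "Hobby": h} for s, c, h in data]

print(mvd_holds(rows, ["StudentID"], ["Hobby"]))       # True
print(mvd_holds(rows[:-1], ["StudentID"], ["Hobby"]))  # False
```

Dropping a single row breaks the cross product for student 2, which is why such tables are prone to update anomalies.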

Importance in Normalization

MVDs are significant in database normalization. They help in achieving the Fourth Normal
Form (4NF), which states that a relation is in 4NF if it is in Boyce-Codd Normal Form
(BCNF) and every non-trivial multi-valued dependency has a superkey on its left side.

To decompose a relation with multi-valued dependencies into 4NF, you split it into two
relations: one for the multi-valued dependency and one for the remaining attributes.

Decomposition

To decompose a relation $R(A, B, C)$ where $A \rightarrow\rightarrow B$, you create:

1. A relation $R1(A, B)$ to capture the multi-valued dependency.
2. A relation $R2(A, C)$ to capture the remaining attributes.

This decomposition ensures that the multi-valued dependency is preserved while eliminating
redundancy and potential anomalies.

Loss-less Join and Dependency Preservation:

In database management systems (DBMS), the concepts of loss-less join and dependency
preservation are crucial when decomposing a database schema into multiple relations. Here’s
a breakdown of these concepts:

1. Loss-less Join Property

Definition: A decomposition of a relation $R$ into $R1$ and $R2$ is said to be
loss-less if, when you join $R1$ and $R2$, you can recover the original relation $R$
without any loss of information.

Formally: For a decomposition of $R$ into $R1$ and $R2$ to be loss-less, the
following must hold true: $R = R1 \bowtie R2$, where $\bowtie$
denotes the natural join operation.

Condition for Loss-less Join:

• If $R$ is decomposed into $R1$ and $R2$, the decomposition is loss-less if
the common attributes $R1 \cap R2$ (also called the intersection) satisfy
at least one of the following:
1. $R1 \cap R2$ is a superkey for $R1$.
2. $R1 \cap R2$ is a superkey for $R2$.
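
The property can also be observed on concrete data by projecting a relation onto two attribute sets and joining the projections back. The sketch below is a minimal illustration with hypothetical helper names and sample rows; tuples are represented as frozensets of (attribute, value) pairs so that relations can be ordinary Python sets.

```python
def as_relation(dict_rows):
    """Turn a list of dicts into a set of hashable tuples."""
    return {frozenset(row.items()) for row in dict_rows}


def project(relation, attrs):
    """Projection: keep only attrs; duplicates disappear because it is a set."""
    return {frozenset((a, dict(t)[a]) for a in attrs) for t in relation}


def natural_join(r1, r2):
    """Natural join: combine tuples that agree on all shared attributes."""
    joined = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                joined.add(frozenset({**d1, **d2}.items()))
    return joined


r = as_relation([
    {"StudentID": 1, "CourseID": 101, "CourseName": "Math"},
    {"StudentID": 1, "CourseID": 102, "CourseName": "Science"},
    {"StudentID": 2, "CourseID": 101, "CourseName": "Math"},
])

# Shared attribute CourseID is a key of the second projection: loss-less.
lossless = natural_join(
    project(r, ["StudentID", "CourseID"]), project(r, ["CourseID", "CourseName"])
)
# Shared attribute StudentID is a key of neither projection: lossy.
lossy = natural_join(
    project(r, ["StudentID", "CourseID"]), project(r, ["StudentID", "CourseName"])
)

print(lossless == r)       # True
print(len(r), len(lossy))  # 3 5
```

The lossy join returns more tuples than the original, not fewer: the "loss" is the loss of the ability to tell real tuples from spurious ones.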

2. Dependency Preservation

Definition: A decomposition of a relation $R$ into $R1$ and $R2$ is said to be
dependency-preserving if the set of functional dependencies (FDs) that hold on $R$ can be
enforced by the FDs that hold on $R1$ and $R2$ individually.

Formally: Given a set of functional dependencies $F$ on $R$, the decomposition of $R$
into $R1$ and $R2$ is dependency-preserving if $(F_{R1} \cup F_{R2})^+ = F^+$,
where $F_{R1}$ and $F_{R2}$ are the projections of $F$ onto $R1$ and $R2$ (the
dependencies implied by $F$ that mention only attributes of $R1$ or of $R2$,
respectively) and $^+$ denotes closure. The two sets need not be literally equal; it is
enough that they imply the same dependencies.

Why It's Important: Dependency preservation ensures that all functional dependencies are
enforced directly on the decomposed relations without requiring a join operation to enforce
them.
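
Dependency preservation can be tested without enumerating every implied dependency, using the standard procedure that repeatedly closes the left side within each decomposed schema. The sketch below is illustrative: the function names are hypothetical, FDs are assumed to be (left, right) pairs of attribute tuples, and both example schemas are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def preserves(fd, decomposition, fds):
    """True if fd can be derived from the FDs projected onto the schemas."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in decomposition:
            gained = closure(result & set(schema), fds) & set(schema)
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


def is_dependency_preserving(decomposition, fds):
    return all(preserves(fd, decomposition, fds) for fd in fds)


# Hypothetical example 1: a BCNF decomposition that loses a dependency.
fds1 = [(("StudentID", "CourseID"), ("Instructor",)), (("Instructor",), ("CourseID",))]
parts1 = [{"Instructor", "CourseID"}, {"StudentID", "Instructor"}]

# Hypothetical example 2: a decomposition that keeps its only dependency.
fds2 = [(("CourseID",), ("CourseName",))]
parts2 = [{"StudentID", "CourseID"}, {"CourseID", "CourseName"}]

print(is_dependency_preserving(parts1, fds1))  # False
print(is_dependency_preserving(parts2, fds2))  # True
```

In the first example, the dependency from (StudentID, CourseID) to Instructor spans both decomposed tables, so it can only be checked by joining them.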

Balancing Both Properties

In practice, achieving both loss-less join and dependency preservation can be challenging.
Some decompositions might preserve dependencies but not be loss-less, or they might be
loss-less but not preserve all dependencies. Therefore, careful consideration is needed when
designing a schema.

For example, the BCNF (Boyce-Codd Normal Form) decomposition guarantees loss-less join
but does not necessarily preserve all functional dependencies. On the other hand, a relation
can always be decomposed into 3NF (Third Normal Form) in a way that is both loss-less and
dependency-preserving, at the cost of possibly keeping some redundancy that BCNF would
remove.

Practical Approach

1. Decompose the relation to eliminate redundancy and anomalies while ensuring
the loss-less join property.
2. Check if the decomposition preserves all functional dependencies. If not,
consider additional decompositions or design modifications to preserve
dependencies.

By ensuring both properties, you can design a well-structured schema that maintains data
integrity and avoids anomalies.
