
UGC NET – 87

Computer Science and Applications

DATABASE
MANAGEMENT
SYSTEM
(Unit 4)

Prepared by Arjun Baskar


UNIT 4 – DATABASE MANAGEMENT SYSTEMS

Database System Concepts and Architecture:

Data Models, Schemas and Instances; Three-Schema Architecture and Data Independence; Database
Languages and Interfaces; Centralized and Client/Server Architectures for DBMS

Data Modeling:

Entity-Relationship Diagram, Relational Model – Constraints, Languages, Design, and Programming, Relational
Database Schemas, Update Operations and Dealing with Constraint Violations; Relational Algebra and
Relational Calculus; Codd Rules.

SQL:

Data definition and Data Types; Constraints, Queries, Insert, Delete and Update Statements; View, Stored
Procedures and Functions; Database Triggers, SQL Injections.

Normalization for Relational Databases:

Functional Dependencies and Normalizations; Algorithms for Query Processing and Optimization; Transaction
Processing, Concurrency Control Techniques, Database Recovery Techniques, Object and Object-Relational
Databases; Database Security and Authorization.

Enhanced Data Models:

Temporal Database Concepts, Multimedia Databases, Deductive Databases, XML and Internet Databases;
Mobile Databases, Geographic Information Systems, Genome Data Management, Distributed Databases and
Client-Server Architectures.

Data Warehousing and Data Mining:

Data Modeling for Data Warehouses, Concept Hierarchy, OLAP and OLTP; Association Rules, Classification,
Clustering, Regression, Support Vector Machine, K-Nearest Neighbour, Hidden Markov Model, Summarization,
Dependency Modeling, Link Analysis, Sequencing Analysis, Social Network Analysis.

Big Data Systems:

Big Data Characteristics, Types of Big Data, Big Data Architecture, Introduction to Map-Reduce and Hadoop;
Distributed File System, HDFS.

NOSQL:

NOSQL and Query Optimization; Different NOSQL Products, Querying and Managing NOSQL; Indexing and
Ordering Data Sets; NOSQL in Cloud.
4.1. Database System Concepts and Architecture:

Data Models, Schemas and Instances; Three-Schema Architecture and Data Independence; Database
Languages and Interfaces; Centralized and Client/Server Architectures for DBMS

4.1.1 Data Models:

 A data model describes the structure of the data, the data semantics, and the consistency constraints on the data.
 It provides conceptual tools for describing the design of a database at each level of data abstraction.
 The following four data models are commonly used to understand the structure of a database:

1) Relational Data Model: This model organizes data in the form of rows and columns within tables.
Thus, the relational model uses tables to represent data and the relationships among them. Tables are also called
relations. This model was first described by Edgar F. Codd in 1969. The relational data model is the most widely
used model and is primarily used by commercial data processing applications.

2) Entity-Relationship Data Model: An ER model is a logical representation of data as objects and the
relationships among them. These objects are known as entities, and a relationship is an association among
these entities. The model was designed by Peter Chen and published in a 1976 paper. It is widely used in
database design. A set of attributes describes each entity; for example, student_name and student_id describe
the 'student' entity. A set of entities of the same type is known as an 'entity set', and a set of relationships of the
same type is known as a 'relationship set'.

3) Object-based Data Model: An extension of the ER model with notions of functions, encapsulation, and object
identity. This model supports a rich type system that includes structured and collection types. In the
1980s, various database systems following the object-oriented approach were developed. Here, the objects are
data items that carry their own properties.

4) Semistructured Data Model: This data model differs from the other three data models described
above. The semistructured data model allows data specifications in which individual data items
of the same type may have different attribute sets. The Extensible Markup Language, also known as XML, is
widely used for representing semistructured data. Although XML was initially designed for adding markup
information to text documents, it gained importance because of its application in the exchange of data.

4.1.2 Schemas and Instances

 The data stored in the database at a particular moment of time is called an instance of the database.
 The overall design of a database is called the schema.
 A database schema is the skeleton structure of the database. It represents the logical view of the entire
database.
 A schema contains schema objects such as tables, foreign keys, primary keys, views, columns, data types, stored
procedures, etc.
 A database schema can be represented using a visual diagram that shows the database
objects and their relationships with each other.
 A database schema is designed by the database designers to help programmers whose software will interact
with the database. The process of creating a database schema is called data modeling.

A schema diagram can display only some aspects of a schema, such as the names of record types, data types, and
constraints. Other aspects cannot be specified through the schema diagram; for example, a schema diagram typically
shows neither the data type of each data item nor the relationships among the various files.

In the database, the actual data changes quite frequently. For example, a university database changes
whenever a new student is added or a new grade is recorded. The data at a particular moment of time is called an
instance of the database.

4.1.3 Three-Schema Architecture:

 The three-schema architecture is also called ANSI/SPARC architecture or three-level architecture.


 This framework is used to describe the structure of a specific database system.
 The three-schema architecture is also used to separate the user applications and physical database.
 The three-schema architecture contains three levels. It breaks the database down into three different
categories.

In the three-schema architecture (external, conceptual, and internal levels):


 Mapping is used to transform requests and responses between the various levels of the architecture.
 Mapping adds processing overhead, so it is less suitable for small DBMSs.
 In external/conceptual mapping, the DBMS transforms a request expressed on an external schema into a request on the
conceptual schema.
 In conceptual/internal mapping, the DBMS transforms a request from the conceptual level to the internal level.

Objectives of Three schema Architecture

The main objective of three level architecture is to enable multiple users to access the same data with a
personalized view while storing the underlying data only once. Thus, it separates the user's view from the physical
structure of the database. This separation is desirable for the following reasons:

 Different users need different views of the same data.

 The way in which a particular user needs to see the data may change over time.

 The users of the database should not have to worry about the physical implementation and internal workings of the
database, such as data compression and encryption techniques, hashing, optimization of the internal
structures, etc.
 All users should be able to access the same data according to their requirements.
 The DBA should be able to change the conceptual structure of the database without affecting the users' external views.
 The internal structure of the database should be unaffected by changes to the physical aspects of storage.

1. Internal Schema:
 The internal level has an internal schema, which describes the physical storage structure of the database.
 The internal schema is also known as the physical schema.
 It uses the physical data model and defines how the data will be stored in blocks.
 The physical level is used to describe complex low-level data structures in detail.

The internal level is generally concerned with the following activities:

 Storage space allocation. For example: B-trees, hashing, etc.
 Access paths. For example: specification of primary and secondary keys, indexes, pointers and sequencing.
 Data compression and encryption techniques.
 Optimization of internal structures.
 Representation of stored fields.

2. Conceptual Level

 The conceptual schema describes the design of a database at the conceptual level. The conceptual level is also
known as the logical level.
 The conceptual schema describes the structure of the whole database.
 The conceptual level describes what data are to be stored in the database and what
relationships exist among those data.
 At the conceptual level, internal details such as the implementation of the data structures are hidden.
 Programmers and database administrators work at this level.

3. External Level

 At the external level, a database contains several schemas, sometimes called subschemas. A
subschema is used to describe a different view of the database.
 An external schema is also known as a view schema.
 Each view schema describes the part of the database that a particular user group is interested in and hides the
remaining database from that user group.
 The view schema describes the end-user interaction with the database system.
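As a minimal illustration (the table, column, and view names are assumed here, not taken from the text), an external schema can be realized in SQL as a view defined over a conceptual-level table:

-- Conceptual level: the logical structure of the stored data (assumed example)
CREATE TABLE Employee (
    EmpID   INT PRIMARY KEY,
    Name    VARCHAR(50),
    Salary  DECIMAL(10,2),
    DeptNo  INT
);

-- External level: a view schema for a user group that should not see salaries
CREATE VIEW EmployeeContact AS
SELECT EmpID, Name, DeptNo
FROM Employee;

Users of this view see only the part of the database relevant to them; the storage structures and indexes at the internal level remain hidden from both the view and the base table definition.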

Mapping between Views

The three levels of DBMS architecture do not exist independently of each other; there must be a correspondence
between them. The DBMS is responsible for maintaining the correspondence between the three types of schemas. This
correspondence is called mapping.

There are basically two types of mapping in the database architecture:

 Conceptual/ Internal Mapping


 External / Conceptual Mapping

Conceptual/ Internal Mapping

The Conceptual/ Internal Mapping lies between the conceptual level and the internal level. Its role is to define
the correspondence between the records and fields of the conceptual level and files and data structures of the
internal level.

External/ Conceptual Mapping

The external/conceptual mapping lies between the external level and the conceptual level. Its role is to define
the correspondence between a particular external view and the conceptual view.
4.1.4 Data Independence

 A database system normally contains a lot of data in addition to users' data. For example, it stores data about
data, known as metadata, to locate and retrieve data easily.
 It is rather difficult to modify or update a set of metadata once it is stored in the database.

 But as a DBMS expands, it needs to change over time to satisfy the requirements of the users. If all of this data
were interdependent, changing it would become a tedious and highly complex job.
 Metadata itself follows a layered architecture, so that when we change data at one layer, it does not affect
the data at another level. This data is independent but mapped to each other.

Logical Data Independence

 Logical data is data about the database; that is, it stores information about how the data is managed inside the
database. For example, a table (relation) stored in the database and all the constraints applied to that relation.
 Logical data independence is a mechanism that decouples the logical schema from the actual data stored on the
disk. If we make changes to the table format, it should not change the data residing on the disk.

Physical Data Independence

 All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data
independence is the power to change the physical data without impacting the schema or logical data.
 For example, in case we want to change or upgrade the storage system itself − suppose we want to replace
hard-disks with SSD − it should not have any impact on the logical data or schemas.
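A hedged sketch of the two kinds of independence, continuing the assumed Employee table from the earlier example (exact ALTER TABLE syntax varies slightly across DBMSs):

-- Logical data independence: the conceptual schema changes (a column is added),
-- but existing external views and programs that do not use the new column keep working.
ALTER TABLE Employee ADD Email VARCHAR(100);

-- Physical data independence: only the internal level changes (an index is created);
-- the conceptual schema and the existing queries are untouched.
CREATE INDEX idx_employee_deptno ON Employee (DeptNo);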

4.1.5 Database Languages

 A DBMS has appropriate languages and interfaces to express database queries and updates.
 Database languages can be used to read, store and update the data in the database.

Types of Database Language


1. Data Definition Language (DDL)

 DDL stands for Data Definition Language. It is used to define database structure or pattern.
 It is used to create schema, tables, indexes, constraints, etc. in the database.
 Using the DDL statements, you can create the skeleton of the database.
 Data definition language is used to store the information of metadata like the number of tables and schemas,
their names, indexes, columns in each table, constraints, etc.

Here are some tasks that come under DDL:

 Create: It is used to create objects in the database.


 Alter: It is used to alter the structure of the database.
 Drop: It is used to delete objects from the database.
 Truncate: It is used to remove all records from a table.
 Rename: It is used to rename an object.
 Comment: It is used to add comments to the data dictionary.

These commands are used to update the database schema, that's why they come under Data definition
language.

2. Data Manipulation Language (DML)

DML stands for Data Manipulation Language. It is used for accessing and manipulating data in a database. It
handles user requests.

Here are some tasks that come under DML:

 Select: It is used to retrieve data from a database.


 Insert: It is used to insert data into a table.
 Update: It is used to update existing data within a table.
 Delete: It is used to remove records from a table (all rows, or only those matching a condition).
 Merge: It performs an UPSERT operation, i.e., insert or update.
 Call: It is used to call a PL/SQL or Java subprogram.
 Explain Plan: It describes the access path (execution plan) that will be used for a query.
 Lock Table: It controls concurrency by locking a table.

3. Data Control Language (DCL)

 DCL stands for Data Control Language. It is used to control access to the data stored in the database.
 In many DBMSs, DCL execution is transactional and can be rolled back.

(In the Oracle database, however, the execution of data control language statements cannot be rolled back.)

Here are some tasks that come under DCL:


 Grant: It is used to give user access privileges to a database.
 Revoke: It is used to take back permissions from the user.

The following privileges can be granted and revoked:

CONNECT, INSERT, USAGE, EXECUTE, DELETE, UPDATE and SELECT.

4. Transaction Control Language (TCL)

TCL is used to manage the changes made by DML statements. DML statements can be grouped into a logical transaction and controlled with TCL commands.

Here are some tasks that come under TCL:

 Commit: It is used to save the transaction on the database.


 Rollback: It is used to restore the database to its state as of the last Commit.
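A short combined sketch with one statement from each language category (object names are assumed; transaction syntax varies slightly across DBMSs, e.g. BEGIN TRANSACTION vs START TRANSACTION):

-- DDL: define a schema object
CREATE TABLE Account (
    AccNo   INT PRIMARY KEY,
    Balance DECIMAL(10,2) NOT NULL
);

-- DML inside a transaction controlled by TCL
BEGIN TRANSACTION;
INSERT INTO Account (AccNo, Balance) VALUES (101, 5000.00);
UPDATE Account SET Balance = Balance - 500 WHERE AccNo = 101;
COMMIT;          -- or ROLLBACK to undo both statements

-- DCL: grant and revoke privileges
GRANT SELECT ON Account TO clerk_user;
REVOKE SELECT ON Account FROM clerk_user;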

4.1.6 Database Interfaces

 A database management system (DBMS) interface is a user interface that allows users to submit
queries to a database without using the query language itself.

User-friendly interfaces provided by a DBMS may include the following:

1. Menu-Based Interfaces for Web Clients or Browsing

 These interfaces present the user with lists of options (called menus) that lead the user through the formulation
of a request.
 The basic advantage of using menus is that they remove the burden of remembering the specific commands and
syntax of a query language; instead, the query is composed step by step by picking
options from the menus displayed by the system.
 Pull-down menus are a very popular technique in Web-based interfaces.
 They are also often used in browsing interfaces, which allow a user to look through the contents of a database
in an exploratory and unstructured manner.

2. Forms-Based Interfaces

 A forms-based interface displays a form to each user.


 Users can fill out all of the form entries to insert new data, or they can fill out only certain entries, in which
case the DBMS retrieves matching data for the remaining entries.
 Such forms are usually designed and programmed for users who have no expertise in
operating the system.
 Many DBMSs have forms specification languages, which are special languages that help specify such forms.
 Example: SQL*Forms is a forms-based language that specifies queries using a form designed in conjunction
with the relational database schema.

3. Graphical User Interface

 A GUI typically displays a schema to the user in diagrammatic form.

 The user can then specify a query by manipulating the diagram. In many cases, GUIs utilize both menus and
forms.
 Most GUIs use a pointing device, such as a mouse, to pick certain parts of the displayed schema diagram.

4. Natural language Interfaces


 These interfaces accept requests written in English or some other language and attempt to understand them.
 A natural language interface has its own schema, which is similar to the database conceptual schema, as
well as a dictionary of important words.
 The natural language interface refers to the words in its schema, as well as to the set of standard words in its
dictionary, to interpret the request.
 If the interpretation is successful, the interface generates a high-level query corresponding to the natural
language request and submits it to the DBMS for processing; otherwise, a dialogue is started with the user to clarify
the request.
 The main disadvantage is that the capabilities of this type of interface are still limited.

5. Speech Input and Output

 Limited use of speech, whether as an input query or as a spoken answer to a question, is
becoming commonplace.
 Applications with limited vocabularies, such as inquiries for telephone directories, flight arrival/departure, and
bank account information, allow speech for input and output so that ordinary users can access this
information.
 Speech input is detected using predefined words and used to set up the parameters that are supplied to
the queries. For output, a similar conversion from text or numbers into speech takes place.
6. Interfaces for DBA

 Most database systems contain privileged commands that can be used only by the DBA's staff.
 These include commands for creating accounts, setting system parameters, granting account authorization,
changing a schema, and reorganizing the storage structures of a database.

4.1.7 Centralized and Client/Server Architectures for DBMS

 Centralized Architecture is ideal for smaller-scale applications where the simplicity of management and
the lack of need for high scalability are important factors.
 Client/Server Architecture is suited for larger, more complex applications requiring greater scalability, fault
tolerance, and the ability to serve multiple users from different locations.

Choosing between these architectures depends on the size, complexity, and growth expectations of the system
being developed.

Centralized Architecture of DBMS:

In a Centralized DBMS architecture, the database is stored and managed in a single location (typically a central
server). All users and applications interact directly with this central system.

Key Features:

 Single Database Instance: The database is stored in a single system, and all users access it through this
central point.
 Simple Management: Since the database is centralized, managing backups, security, and maintenance is
simpler.
 Control: The central server has control over all database operations, making it easier to enforce data integrity,
consistency, and security.
 Performance Issues: As all users and applications rely on a single central server, performance may degrade
with a high volume of concurrent users or large datasets.
 Scalability Concerns: It can be difficult to scale because you would have to upgrade the central server or
increase its resources.

Advantages:

 Easier to maintain and control.


 Simplified backup, recovery, and security.
 No need for complex synchronization mechanisms.

Disadvantages:

 Single point of failure; if the central server goes down, all services are impacted.
 Limited scalability due to reliance on one server.
 High latency for users far from the central server, leading to slower response times.
Client-server Architecture of DBMS:

The Client/Server DBMS architecture divides the system into two primary components: the client and the server.
The database server stores and manages the data, while clients (often users or applications) interact with the
server to perform queries and transactions.

Key Features:

 Client-Side: The client is typically a user application or interface that makes requests to the server. The client
sends queries, updates, or data retrieval requests to the server.
 Server-Side: The server is responsible for managing the database, processing queries, and returning results
to the client. It also handles security, data integrity, and transactions.
 Communication: Clients and the server communicate over a network (e.g., TCP/IP), and the client typically
does not have direct access to the database.
 Separation of Concerns: The architecture separates the user interface and the database management,
making the system more modular and easier to manage.
 Multi-tier Systems: In more advanced client/server architectures, multiple layers (tiers) of servers can be
added for business logic, application processing, or data caching, which can increase scalability and
flexibility.

Two-Tier Client Server Architecture:

Here, the term "two-tier" refers to the architecture's two layers: the Client layer and the Data layer. There are a
number of client computers in the client layer that can contact the database server. The API on the client
computer uses JDBC or some other mechanism to connect to the database server, since clients and
database servers may be located at different physical sites.

Three-Tier Client-Server Architecture:


The Business Logic Layer is an additional layer that serves as a link between the Client layer and the Data layer
in this architecture. Unlike a two-tier architecture, where queries are processed in the database server, here the
application programs are processed in a separate application server, the business logic layer.

Comparison Between Centralized and Client/Server Architectures

 Architecture: Centralized DBMS - a single central server stores and manages the data. Client/Server DBMS - the server stores and manages the data, with clients accessing it.
 Data Access: Centralized DBMS - all users access the same central system. Client/Server DBMS - users access the database through a client-server communication model.
 Performance: Centralized DBMS - can degrade with high concurrent user loads. Client/Server DBMS - can handle more users through load balancing and optimized server infrastructure.
 Scalability: Centralized DBMS - limited scalability due to the single server. Client/Server DBMS - scalable by adding more servers or upgrading infrastructure.
 Maintenance: Centralized DBMS - easier to maintain and secure. Client/Server DBMS - requires more complex network and server management.
 Fault Tolerance: Centralized DBMS - single point of failure. Client/Server DBMS - fault tolerance through multiple clients and possibly redundant servers.
 Cost: Centralized DBMS - lower initial cost, but may become costly to scale. Client/Server DBMS - potentially higher setup cost due to multiple components, but better long-term scalability.
4.2 Data Modeling:

Entity-Relationship Diagram, Relational Model – Constraints, Languages, Design, and Programming, Relational
Database Schemas, Update Operations and Dealing with Constraint Violations; Relational Algebra and
Relational Calculus; Codd Rules.

4.2.1 Entity-Relationship Diagram

 ER model stands for Entity-Relationship model. It is a high-level data model used to define
the data elements and relationships for a specified system.
 It develops a conceptual design for the database and provides a simple, easy-to-understand view of the
data.
 In ER modeling, the database structure is portrayed as a diagram called an entity-relationship diagram.

For example, suppose we design a school database. In this database, the student will be an entity with attributes
like address, name, id, age, etc. The address can be another entity with attributes like city, street name, pin code,
etc., and there will be a relationship between them.

Component of ER Diagram

1. Entity:

An entity may be any object, class, person or place. In an ER diagram, an entity is represented as a rectangle.
Consider an organization as an example: manager, product, employee, department, etc. can be taken as
entities.

a. Weak Entity

An entity that depends on another entity is called a weak entity. A weak entity does not contain any key attribute
of its own. A weak entity is represented by a double rectangle.

2. Attribute

The attribute is used to describe a property of an entity. An ellipse is used to represent an attribute.

For example, id, age, contact number, name, etc. can be attributes of a student.

a. Key Attribute

The key attribute is used to represent the main characteristics of an entity. It represents a primary key. The key
attribute is represented by an ellipse with the text underlined.

b. Composite Attribute

An attribute that is composed of many other attributes is known as a composite attribute. A composite attribute
is represented by an ellipse, with its component attributes represented by ellipses connected to it.
c. Multivalued Attribute

An attribute that can have more than one value is known as a multivalued attribute. A double
oval is used to represent a multivalued attribute.

For example, a student can have more than one phone number.

d. Derived Attribute

An attribute that can be derived from other attributes is known as a derived attribute. It is represented by a
dashed ellipse.

For example, a person's age changes over time and can be derived from another attribute such as date of birth.

3. Relationship

A relationship is used to describe an association between entities. A diamond (rhombus) is used to represent a
relationship.

Types of relationship are as follows:

a. One-to-One Relationship

When a single instance of an entity is associated with a single instance of another entity, it is known as a one-to-one
relationship.

For example, a female can marry only one male, and a male can marry only one female.
b. One-to-many relationship

When one instance of the entity on the left is associated with more than one instance of the entity on the right,
this is known as a one-to-many relationship.

For example, a scientist can make many inventions, but each invention is made by only one specific scientist.

c. Many-to-one relationship

When more than one instance of the entity on the left is associated with only one instance of the entity on the
right, it is known as a many-to-one relationship.

For example, a student enrolls in only one course, but a course can have many students.

d. Many-to-many relationship

When more than one instance of the entity on the left is associated with more than one instance of the entity on
the right, it is known as a many-to-many relationship.

For example, an employee can be assigned to many projects, and a project can have many employees, as the sketch below shows.
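In a relational database, a many-to-many relationship such as this is typically implemented with a junction table; a minimal sketch (table and column names assumed):

CREATE TABLE Employee (
    EmpID INT PRIMARY KEY,
    Name  VARCHAR(50)
);

CREATE TABLE Project (
    ProjectID INT PRIMARY KEY,
    Title     VARCHAR(100)
);

-- Junction table: one row per (employee, project) assignment
CREATE TABLE WorksOn (
    EmpID     INT,
    ProjectID INT,
    PRIMARY KEY (EmpID, ProjectID),
    FOREIGN KEY (EmpID) REFERENCES Employee(EmpID),
    FOREIGN KEY (ProjectID) REFERENCES Project(ProjectID)
);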

Notation of ER diagram

A database can be represented using these notations. In an ER diagram, a number of standard notations are used to
express entities, attributes, relationships, and cardinality.
4.2.2 Relational Model

The relational model represents data as tables with columns and rows. Each row is known as a tuple. Each
column of a table has a name, called an attribute.

Domain: It contains a set of atomic values that an attribute can take.

Attribute: It contains the name of a column in a particular table. Each attribute Ai must have a domain, dom(Ai)

Relational instance: In the relational database system, the relational instance is represented by a finite set of
tuples. Relation instances do not have duplicate tuples.

Relational schema: A relational schema contains the name of the relation and name of all columns or
attributes.

Relational key: A set of one or more attributes whose values identify each row in the relation
uniquely.

Example: STUDENT Relation

NAME     ROLL_NO   PHONE_NO     ADDRESS     AGE
Ram      14795     7305758992   Noida       24
Shyam    12839     9026288936   Delhi       35
Laxman   33289     8583287182   Gurugram    20
Mahesh   27857     7086819134   Ghaziabad   27
Ganesh   17282     9028913988   Delhi       40

 In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the attributes.
 The instance of the schema STUDENT has 5 tuples.
 t3 = <Laxman, 33289, 8583287182, Gurugram, 20>

Properties of Relations

 The name of the relation is distinct from the names of all other relations.

 Each cell of the relation contains exactly one atomic (single) value.
 Each attribute has a distinct name.
 The order of attributes has no significance.
 Each tuple is distinct; there are no duplicate tuples.
 The order of tuples has no significance; they may be stored in any sequence.
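A minimal SQL sketch of the STUDENT relation above (the data types are assumed; the text gives only the attribute names):

CREATE TABLE STUDENT (
    ROLL_NO  INT PRIMARY KEY,   -- relational key: identifies each tuple uniquely
    NAME     VARCHAR(50),
    PHONE_NO VARCHAR(15),
    ADDRESS  VARCHAR(50),
    AGE      INT                -- dom(AGE): atomic integer values
);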

4.2.3 Relational Model – Constraints

The following can be guaranteed via constraints:

Data Accuracy − Constraints guarantee data accuracy by making sure that only valid data is entered into the
database. For example, a constraint may stop a user from entering a negative value into a field that only accepts
positive numbers.

Data Consistency − The consistency of data in a database can be upheld by using constraints. For example,
constraints can ensure that every foreign key value in one table matches a primary key value in the referenced table.

Data Integrity − The accuracy and completeness of the data in a database are ensured by constraints. For
example, a constraint can stop a user from putting a null value into a field that requires a value.

Types of Constraints in Relational Database Model

 Domain Constraints
 Key Constraints
 Entity Integrity Constraints
 Referential Integrity Constraints
 Tuple Uniqueness Constraints

Domain Constraints

In a database table, domain constraints are guidelines that specify the acceptable values for a certain property
or field. These restrictions guarantee data consistency and aid in preventing the entry of inaccurate or
inconsistent data into the database.

The following are some instances of domain restrictions in a Relational Database Model −

 Data type constraints − These limitations define the kinds of data that can be kept in a column. A column
created as VARCHAR can take string values, but a column specified as INTEGER can only accept integer
values.
 Length Constraints − These limitations define the largest amount of data that may be put in a column. For
instance, a column with the definition VARCHAR(10) may only take strings that are up to 10 characters long.
 Range constraints − The allowed range of values for a column is specified by range restrictions. A column
designated as DECIMAL(5,2), for example, may only take decimal values up to 5 digits long, including 2
decimal places.
 Nullability constraints − Constraints on a column's capacity to accept NULL values are known as nullability
constraints. For instance, a column that has the NOT NULL definition cannot take NULL values.
 Unique constraints − Constraints that require the presence of unique values in a column or group of columns
are known as unique constraints. For instance, duplicate values are not allowed in a column with the UNIQUE
definition.
 Check constraints − Constraints for checking data: These constraints outline a requirement that must hold
for any data placed into the column. For instance, a column with the definition CHECK (age > 0) can only
accept ages that are greater than zero.
 Default constraints − Constraints by default: Default constraints automatically assign a value to a column in
case no value is provided. For example, a column with a DEFAULT value of 0 will have 0 as its value if no other
value is specified.
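A minimal sketch gathering several of these domain constraints into one table definition (the table and column names are assumed; exact syntax varies slightly between DBMSs):

CREATE TABLE Person (
    PersonID INT NOT NULL,                 -- nullability constraint
    Name     VARCHAR(10),                  -- data type and length constraints
    Age      INT CHECK (Age > 0),          -- check constraint (a range-style rule)
    Salary   DECIMAL(5,2),                 -- range constraint via precision and scale
    Email    VARCHAR(100) UNIQUE,          -- unique constraint
    Country  VARCHAR(30) DEFAULT 'India'   -- default constraint
);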

Key Constraints
Key constraints are regulations that a Relational Database Model uses to ensure data accuracy and consistency
in a database. They define how the values in a table's one or more columns are related to the values in other
tables, making sure that the data remains correct.

In the relational database model, there are several kinds of key constraints, including −

 Primary Key Constraint − A primary key constraint designates a unique identifier for each record in a table. It
guarantees that each record is identified by a single, distinct value (or combination of values) that cannot be
null.
 Foreign Key Constraint − A foreign key constraint is a reference to the primary key in another table. It ensures
that the values of a column or set of columns in one table correspond to the primary key column(s) in another
table.
 Unique Constraint − In a database, a unique constraint ensures that no two values inside a column or
collection of columns are the same.

Entity Integrity Constraints

A database management system uses entity integrity constraints (EICs) to enforce rules that guarantee a table's
primary key is unique and not null. The consistency and integrity of the data in a database are maintained by
EICs, which are created to stop the formation of duplicate or incomplete entries.

Each item in a table in a relational database is uniquely identified by one or more fields known as the primary
key. EICs make a guarantee that every row's primary key value is distinct and not null. Take the "Employees" table,
for instance, which has the columns "EmployeeID" and "Name." The table's primary key is the EmployeeID
column. An EIC on this table would make sure that each row's unique EmployeeID value is there and that it is not
null.

If you try to insert an entry with a duplicate or null EmployeeID, the database management system will reject the
insertion and produce an error. This guarantees that the information in the table is correct and consistent.

EICs are a crucial component of database architecture and help guarantee the accuracy and dependability of
the data contained in a database.

Referential Integrity Constraints

A database management system will apply referential integrity constraints (RICs) in order to preserve the
consistency and integrity of connections between tables. By preventing links between entries that don't exist
from being created or by removing records that have related records in other tables, RICs guarantee that the data
in a database is always consistent.

In relational databases, links between tables are created by the use of foreign keys. A foreign key is a column or
collection of columns in one table that references the primary key of another table. RICs make sure
there are no referential errors and that these relationships are legitimate.

Consider the "Orders" and "Customers" tables as an illustration. The primary key column in the "Customers"
database corresponds to the foreign key field "CustomerID" in the "Orders" dataset. A RIC on this connection
requires that each value in the "CustomerID" column of the "Orders" database exist in the "Customers" table's
primary key column.

If an attempt was made to insert a record into the "Orders" table with a non-existent "CustomerID" value, the
database management system would reject the insertion and notify the user of an error.
Similar to this, the database management system would either prohibit the deletion or cascade the deletion in
order to ensure referential integrity if a record in the "Customers" table was removed and linked entries in the
"Orders" table.

In general, RICs are a crucial component of database architecture and assist guarantee that the information
contained in a database is correct and consistent throughout time.
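A minimal sketch of the Customers/Orders relationship described above, with the referential action made explicit (column types are assumed):

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(50)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT,
    Amount     DECIMAL(10,2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
        ON DELETE CASCADE   -- or ON DELETE RESTRICT to prohibit deleting referenced customers
);

-- Rejected by the RIC: customer 999 does not exist in Customers
-- INSERT INTO Orders (OrderID, CustomerID, Amount) VALUES (1, 999, 100.00);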

Tuple Uniqueness Constraints

A database management system uses constraints called tuple uniqueness constraints (TUCs) to make sure that
every entry or tuple in a table is distinct. TUCs impose uniqueness on the whole row or tuple, in contrast to Entity
Integrity Constraints (EICs), which only enforce uniqueness on certain columns or groups of columns.

TUCs, then, make sure that no two rows in a table have the same values for every column. Even if the individual
column values are not unique, this can be helpful in cases when it is vital to avoid the production of duplicate
entries.

Consider the "Sales" table, for instance, which has the columns "TransactionID," "Date," "CustomerID," and
"Amount." Even if individual column values could be duplicated, a TUC on this table would make sure that no two
rows have the same values in all four columns.

The database management system would reject the insertion and generate an error if an attempt was made to
enter a row with identical values in each of the four columns as an existing entry. This guarantees the uniqueness
and accuracy of the data in the table.

TUCs may be a helpful tool for ensuring data correctness and consistency overall, especially when it's vital to
avoid the generation of duplicate entries.
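One way to enforce the tuple-uniqueness constraint on the Sales example is a multi-column UNIQUE constraint, as in this sketch (column types are assumed; the "Date" column is renamed SaleDate here to avoid the SQL reserved word):

CREATE TABLE Sales (
    TransactionID INT,
    SaleDate      DATE,
    CustomerID    INT,
    Amount        DECIMAL(10,2),
    UNIQUE (TransactionID, SaleDate, CustomerID, Amount)   -- no two rows may agree on all four columns
);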

4.2.4 Relational Model – Languages

In the context of data modeling and specifically focusing on the Relational Model, languages refer to the tools
and languages used to define, query, and manipulate the data within relational databases.

1. Data Definition Language (DDL)

DDL is used to define the structure of database objects such as tables, indexes, views, and schemas. It is
responsible for creating and altering database objects.

Key commands in DDL:

 CREATE: Defines a new table, index, or schema.


 ALTER: Modifies an existing table or other database object.
 DROP: Removes a table or other database object.
 TRUNCATE: Deletes all rows in a table, but keeps the structure.

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    HireDate DATE
);

2. Data Manipulation Language (DML)


DML is used to query and manipulate the data in relational databases. It handles the insertion, updating,
deletion, and retrieval of data.

Key commands in DML:

 SELECT: Retrieves data from one or more tables.


 INSERT: Adds new rows into a table.
 UPDATE: Modifies existing rows in a table.
 DELETE: Removes rows from a table.

INSERT INTO Employees (EmployeeID, FirstName, LastName, HireDate)
VALUES (1, 'John', 'Doe', '2024-01-01');

SELECT * FROM Employees WHERE LastName = 'Doe';

3. Data Control Language (DCL)

DCL is used to control access to the data in a database. It includes commands that allow database
administrators to grant or revoke permissions.

Key commands in DCL:

 GRANT: Provides users with access privileges to database objects.


 REVOKE: Removes access privileges from users.

GRANT SELECT, INSERT ON Employees TO user1;


REVOKE INSERT ON Employees FROM user1;

4. Transaction Control Language (TCL)

TCL is used to manage transactions in the database. A transaction is a sequence of operations performed as a
single unit of work. TCL commands ensure the consistency of data in case of errors or failures.

Key commands in TCL:

 COMMIT: Saves the changes made by the transaction.


 ROLLBACK: Undoes the changes made by the transaction.
 SAVEPOINT: Sets a point in the transaction to which you can roll back.
 SET TRANSACTION: Configures the properties of the transaction.

BEGIN TRANSACTION;
UPDATE Employees SET LastName = 'Smith' WHERE EmployeeID = 1;
COMMIT;

5. Structured Query Language (SQL)

The most widely used language for relational databases is SQL (Structured Query Language). SQL allows users
to create, modify, and query relational databases. It is divided into the categories mentioned above (DDL, DML,
DCL, TCL).

SQL syntax follows the relational model of data, which is based on tables (relations), and it is used to interact
with data in these tables. SQL is declarative, meaning that you specify what data you want, not how to retrieve it.

Key Concepts in Relational Model:


 Table (Relation): A collection of rows and columns where each row represents an entity and each column
represents an attribute of that entity.
 Primary Key: A column (or set of columns) that uniquely identifies each row in a table.
 Foreign Key: A column (or set of columns) that creates a link between two tables by referencing the primary
key of another table.
 Normalization: The process of organizing data to reduce redundancy and improve data integrity.
 Integrity Constraints: Rules that ensure data accuracy and consistency (e.g., NOT NULL, UNIQUE, CHECK,
and foreign key constraints).

Example of Relational Model:

Consider a simple system with two tables: Employees and Departments.

--Departments Table (created first so that Employees can reference it):
CREATE TABLE Departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(100)
);

--Employees Table:
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    DepartmentID INT,
    HireDate DATE,
    FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)
);

Relationships:

The Employees table has a foreign key, DepartmentID, which references the DepartmentID primary
key of the Departments table. This forms a relationship between the two tables.

4.2.5 Relational Model - Design


It refers to the process of creating a relational database that efficiently stores data and supports various
operations such as querying, updating, and reporting. The goal is to design a set of tables (relations), their
attributes, and the relationships between them, ensuring that the design adheres to best practices such as data
integrity, minimal redundancy, and ease of querying.

Key Concepts in Relational Database Design

1. Entity-Relationship Model (ER Model):


 The ER model is often the starting point for designing a relational database. It describes the data in terms of
entities (objects or things of interest) and relationships (associations between entities).
 Entities are mapped to tables in the relational model.
 Attributes of entities are mapped to columns (fields) in the corresponding tables.
 Relationships between entities (e.g., one-to-many, many-to-many) are represented using foreign keys in
relational models.
2. Normalization:
 Normalization is the process of organizing data to reduce redundancy and improve data integrity. The goal is
to structure the data so that each piece of information is stored only once.
 The process involves decomposing tables into smaller, well-structured tables without losing any data.
 There are normal forms (1NF, 2NF, 3NF, BCNF, etc.) that help guide this process, with each form addressing
specific types of redundancy or anomalies.
3. Keys:
 Primary Key: A column (or set of columns) in a table that uniquely identifies each row in the table. A primary
key must contain unique values and cannot contain NULL values.
 Foreign Key: A column (or set of columns) in one table that refers to the primary key of another table,
establishing a link between the two tables.
 Candidate Key: A set of one or more columns that can uniquely identify a row in a table. There can be multiple
candidate keys, but one is chosen as the primary key.
4. Referential Integrity:
 This ensures that foreign keys correctly point to valid primary keys in related tables. If a foreign key refers to
a primary key in another table, that primary key must exist, or the foreign key reference must be NULL (if
allowed).
5. Relationships between Tables:
 One-to-One: One record in Table A is related to exactly one record in Table B.
 One-to-Many: One record in Table A is related to multiple records in Table B.
 Many-to-Many: Multiple records in Table A are related to multiple records in Table B. This usually requires a
junction (link) table.

Steps for Relational Model Design

1. Requirements Gathering:

 Understand the business requirements and determine the data that needs to be stored.
 Identify entities (objects) and their attributes (properties).
 Determine relationships between entities (e.g., one-to-many, many-to-many).

2. Create the Entity-Relationship Diagram (ERD):

 Model the entities and relationships using an ER diagram. The ERD will guide the design of the relational
schema.
 Use rectangles to represent entities, diamonds for relationships, and ovals for attributes.
 Relationships should be labeled with cardinalities (1:1, 1:M, M:N) and attributes should be identified.

3. Map Entities to Tables:

 Convert entities into tables. Each entity will become a table, with its attributes becoming columns.
 Assign a primary key to each table (usually a unique identifier for each entity, such as ID).

4. Define Relationships:

 One-to-Many: Place the foreign key in the "many" side of the relationship. For example, if one department
has many employees, the DepartmentID will be a foreign key in the Employees table.
 Many-to-Many: Create a junction table that contains foreign keys referencing both related tables. For
example, if students can enroll in many courses and courses can have many students, create a
StudentCourse table with StudentID and CourseID as foreign keys.

5. Normalize the Database:

 Apply the normalization process (up to 3NF or BCNF) to reduce data redundancy and avoid anomalies.
 Normalize tables by eliminating repeating groups, partial dependencies, and transitive dependencies.

6. Apply Integrity Constraints:

 Ensure that primary keys, foreign keys, and unique constraints are set to maintain referential integrity.
 Define any necessary check constraints to enforce business rules (e.g., age must be greater than 18).

7. Optimize for Performance:

 Consider indexing frequently queried columns (e.g., primary keys, foreign keys) for better performance.
 Decide on denormalization if necessary for read-heavy operations (though it should be used cautiously to
avoid redundancy).

Example of Relational Database Design

Let's design a small relational database for a University Management System.

1. Entities:

 Students: Each student has attributes like StudentID, FirstName, LastName, DOB.
 Courses: Each course has attributes like CourseID, CourseName, Credits.
 Enrollments: A relationship between Students and Courses, where students can enroll in many courses.

2. ER Diagram:

 Students (StudentID, FirstName, LastName, DOB)


 Courses (CourseID, CourseName, Credits)
 Enrollments (StudentID, CourseID) — This is a many-to-many relationship.

3. Relational Schema:

CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    DOB DATE
);

CREATE TABLE Courses (
    CourseID INT PRIMARY KEY,
    CourseName VARCHAR(100),
    Credits INT
);

CREATE TABLE Enrollments (
    StudentID INT,
    CourseID INT,
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);

4. Normalization:

 The tables are in 3NF:


 There is no repeating data.
 All non-key attributes depend on the primary key (no partial or transitive dependencies).

5. Relationships:

 One Student can enroll in many Courses (Many-to-Many).


 This is represented by the Enrollments table.

Considerations for Relational Model Design

 Data Redundancy: Minimize the duplication of data across tables to reduce storage costs and maintenance
overhead.
 Scalability: Design with scalability in mind, ensuring that the database can grow as more data is added.
 Data Integrity: Use constraints (e.g., foreign keys, unique constraints) to enforce data integrity and prevent
invalid data.
 Query Performance: Design tables and indexes to optimize query performance.

4.2.6 Relational Model - Programming


It refers to writing code to interact with relational databases, typically using SQL (Structured Query Language) or
a programming language (e.g., Python, Java, C#, etc.) that interacts with a relational database management
system (RDBMS). This interaction allows users to manage and manipulate data according to the principles of the
relational model, including operations such as querying, inserting, updating, and deleting data.

Key Concepts in Relational Model Programming

SQL (Structured Query Language):

 SQL is the primary language used for programming relational databases. It allows users to define the
database schema, manipulate data, and perform transactions.

Relational Operators:

 These are operations that are based on the relational model to retrieve and manipulate data. The primary
operations are:
 Selection (filtering rows based on conditions)
 Projection (selecting specific columns)
 Join (combining data from two or more tables)
 Union (combining results from multiple queries)
 Difference (finding records that exist in one set but not another)
 Intersection (finding common records between two sets)
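As a rough guide, these relational operations map to SQL as sketched below (table names beyond Employees and Departments are assumed, and EXCEPT/INTERSECT are not supported by every RDBMS; Oracle uses MINUS for difference):

-- Selection: filter rows by a condition
SELECT * FROM Employees WHERE DepartmentID = 101;

-- Projection: keep only chosen columns
SELECT FirstName, LastName FROM Employees;

-- Join: combine data from two tables
SELECT e.FirstName, d.DepartmentName
FROM Employees e
JOIN Departments d ON e.DepartmentID = d.DepartmentID;

-- Union, difference, and intersection of union-compatible queries
SELECT EmployeeID FROM CurrentEmployees
UNION
SELECT EmployeeID FROM FormerEmployees;

SELECT EmployeeID FROM CurrentEmployees
EXCEPT
SELECT EmployeeID FROM FormerEmployees;

SELECT EmployeeID FROM CurrentEmployees
INTERSECT
SELECT EmployeeID FROM FormerEmployees;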

Relational Programming Workflow:

 Database Connectivity: A program connects to a database using appropriate database drivers (e.g., JDBC
for Java, psycopg2 for Python with PostgreSQL).
 Query Execution: Once connected, SQL queries are executed against the database to perform operations
like retrieval, update, or delete.
 Result Handling: The results from SQL queries (like SELECT) are fetched and processed within the
programming language, whether it's displaying to a user, performing calculations, or generating reports.
 Transaction Management: Transaction control is crucial to maintain data integrity (e.g., using COMMIT,
ROLLBACK).

Database APIs:

 Programming languages often provide libraries or APIs to interact with relational databases, making it easier
to execute SQL queries from within the code. Common APIs include:
 JDBC (Java Database Connectivity)
 ODBC (Open Database Connectivity)
 ADO.NET (ActiveX Data Objects for .NET)
 SQLAlchemy (Python)

SQL in Relational Model Programming

Data Definition Language (DDL):

 CREATE, ALTER, DROP: Used to define, modify, and delete database structures such as tables, views, and
indexes.

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),   -- included so that the INSERT below matches the table definition
    DepartmentID INT,
    HireDate DATE
);
Data Manipulation Language (DML):

 SELECT, INSERT, UPDATE, DELETE: Used to query and modify the data in the database.

INSERT INTO Employees (EmployeeID, FirstName, LastName, DepartmentID, HireDate)
VALUES (1, 'John', 'Doe', 101, '2024-01-01');

SELECT * FROM Employees WHERE DepartmentID = 101;

Data Control Language (DCL):

 GRANT, REVOKE: Used for managing permissions and user access to data.

GRANT SELECT, INSERT ON Employees TO user1;

Transaction Control Language (TCL):

 COMMIT, ROLLBACK, SAVEPOINT: Used to manage transactions and ensure data consistency.

BEGIN TRANSACTION;
UPDATE Employees SET DepartmentID = 102 WHERE EmployeeID = 1;
COMMIT;

Example: Relational Model Programming in Python

Step 1: Setting Up the Database

import sqlite3

# Connect to a database (or create one if it doesn't exist)


conn = sqlite3.connect('company.db')
cursor = conn.cursor()

# Create Employees table


cursor.execute('''
CREATE TABLE IF NOT EXISTS Employees (
EmployeeID INTEGER PRIMARY KEY,
FirstName TEXT,
LastName TEXT,
DepartmentID INTEGER,
HireDate TEXT
)
''')

# Commit the changes (the connection stays open for the following steps)
conn.commit()

Step 2: Inserting Data

# Insert a new employee


cursor.execute('''
INSERT INTO Employees (EmployeeID, FirstName, LastName, DepartmentID, HireDate)
VALUES (?, ?, ?, ?, ?)
''', (1, 'John', 'Doe', 101, '2024-01-01'))

# Commit changes
conn.commit()
Step 3: Querying Data

# Query all employees from department 101


cursor.execute('SELECT * FROM Employees WHERE DepartmentID = ?', (101,))
rows = cursor.fetchall()

# Display the results


for row in rows:
print(row)

Step 4: Close the Connection

# Close the cursor and connection
cursor.close()
conn.close()
Key Considerations in Relational Model Programming

Database Connections:

 Ensure proper connection handling, such as opening and closing connections, to prevent resource leaks.
 Use connection pooling for better performance in high-concurrency environments.

Error Handling:

 Always handle database errors gracefully by using try-except blocks (in Python, for example) or database-
specific error-handling mechanisms.

Security:

 Use parameterized queries (like ? placeholders in SQL queries) to avoid SQL injection vulnerabilities.
 Ensure proper user authentication and authorization when accessing databases.
Transactions:

 Use transactions to ensure that multiple database operations are atomic, consistent, isolated, and durable
(ACID).
 Rollback transactions in case of errors to maintain data integrity.

4.2.7 Relational Database Schemas

 A relation schema defines the design and structure of a relation or table in the database.
 It is a way of representing relation states such that every relational database state satisfies the
integrity constraints (such as primary key, foreign key, not null, and unique constraints) defined on the relational schema.
 It consists of the relation name and the set of attributes/field names/column names. Every attribute has an
associated domain.

Components of a Relation Schema

 Relation Name: The name of the table that is stored in the database. It should be unique and related to the data
that is stored in the table. For example, a table named Employee can store the data of
employees.
 Attributes Name: Attributes specify the name of each column within the table. Each attribute has a specific
data type.
 Domains: The set of possible values for each attribute. It specifies the type of data that can be stored in each
column or attribute, such as integer, string, or date.
 Primary Key: The primary key is the key that uniquely identifies each tuple. It should be unique and not be
null.
 Foreign Key: The foreign key is the key that is used to connect two tables. It refers to the primary key of another
table.
 Constraints: Rules that ensure the integrity and validity of the data. Common constraints include NOT NULL,
UNIQUE, CHECK, and DEFAULT.

Example of Relation Schema

There is a student named Geeks; she is pursuing B.Tech., is in the 4th year, belongs to the IT department
(department no. 1), has roll number 1601347, and Mrs. S. Mohanty proctors her. If we want to represent this using a
database, we would have to create a student table with name, sex, degree, year, department, department
number, roll number, and proctor (adviser) as the attributes.

Student Table

student (rollNo, name, degree, year, sex, deptNo, advisor)

Note: If we create such a database, the details of other students can also be recorded.

Department Table

Similarly, we have the IT Department, with department Id 1, having Mrs. Sujata Chakravarty as the head of
department. And we can call the department on the number 0657 228662.

This and other departments can be represented by the department table, having department ID, name, hod and
phone as attributes.

department (deptId, name, hod, phone)

Course Table
The course that a student has selected has a course id, course name, credits and department number.

course (courseId, cname, credits, deptNo)

Professor Table

The professor would have an employee Id, name, sex, department no. and phone number.

professor (empId, name, sex, startYear, deptNo, phone)

Enrollment Table

We can have another table named enrollment, which has roll no, courseId, semester, year and grade as the
attributes.

enrollment (rollNo, courseId, sem, year, grade)

Teaching Table

Teaching can be another table, having employee id, course id, semester, year and classroom as attributes.

teaching (empId, courseId, sem, year, classRoom)

Prerequisite Table

Some courses require another course to be completed before the current course can be taken. This can
be represented by the prerequisite table, having the prerequisite course and course id as
attributes.

prerequisite (preReqCourse, courseId)

The relations between them is represented through arrows in the following Relation diagram,
 This represents that the deptNo in student table is same as deptId used in department table. deptNo in
student table is a foreign key. It refers to deptId in department table.
 This represents that the advisor in student table is a foreign key. It refers to empId in professor table.
 This represents that the hod in department table is a foreign key. It refers to empId in professor table.
 This represents that the deptNo in course table is same as deptId used in department table. deptNo in
course table is a foreign key. It refers to deptId in department table.
 This represents that the rollNo in enrollment table is same as rollNo used in student table.
 This represents that the courseId in enrollment table is same as courseId used in course table.
 This represents that the courseId in teaching table is same as courseId used in course table.
 This represents that the empId in teaching table is same as empId used in professor table.
 This represents that preReqCourse in prerequisite table is a foreign key. It refers to courseId in course table.
 This represents that the deptNo in professor table is same as deptId used in department table; it is a foreign
key referring to deptId in department table.

Note – startYear in the professor table is drawn from the same domain as year in the student table. A SQL sketch
of part of this schema is given below.
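
The following is a minimal, hedged SQL sketch of part of the schema above (column names follow the relation
schemas; the data types and sizes are assumptions), showing how the foreign keys described by the arrows are
declared:

CREATE TABLE department (
    deptId INT PRIMARY KEY,
    name   VARCHAR(50),
    hod    VARCHAR(10),   -- references professor(empId); the constraint can be added later with
    phone  VARCHAR(15)    -- ALTER TABLE to avoid a circular dependency at creation time
);

CREATE TABLE professor (
    empId     VARCHAR(10) PRIMARY KEY,
    name      VARCHAR(50),
    sex       CHAR(1),
    startYear INT,
    deptNo    INT,
    phone     VARCHAR(15),
    FOREIGN KEY (deptNo) REFERENCES department(deptId)
);

CREATE TABLE student (
    rollNo  VARCHAR(10) PRIMARY KEY,
    name    VARCHAR(50),
    degree  VARCHAR(10),
    year    INT,
    sex     CHAR(1),
    deptNo  INT,
    advisor VARCHAR(10),
    FOREIGN KEY (deptNo)  REFERENCES department(deptId),
    FOREIGN KEY (advisor) REFERENCES professor(empId)
);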

Operations And Constraint Violations In Relation Schema

Update and retrieval are the two categories of operations on a relational schema. The basic types of updates
are:

1. Insert: The insert operation is used to add a new tuple to the relation. It is capable of violating the following
constraints:

 Primary Key Constraint (attempt to add a duplicate or NULL value).
 Foreign Key Constraint (attempt to add a value that does not exist in its parent table).
 Unique Constraint (inserting a duplicate value in a column that has a unique constraint).
 Not Null Constraint (inserting a NULL value into a column defined with a NOT NULL constraint).
 Check Constraint (inserting a value that does not meet the condition defined in the check constraint).

2. Delete: The delete operation is used to delete existing tuples from the relation. It can only violate the
referential integrity constraint.

3. Modify: This operation is used to change the data or values of existing tuples based on the condition.

4. Retrieve: This operation is used to retrieve information or data from the relation. Retrieval operations do not
violate integrity constraints.

4.2.8 Update Operations and Dealing with Constraint Violations

There are mainly three operations that have the ability to change the state of relations, these modifications are
given below:

1. Insert - To insert new tuples in a relation in the database.


2. Delete - To delete some of the existing tuples from a relation in the database.
3. Update (Modify) - To make changes in the value of some existing tuples.

Whenever we apply the above modification to the relation in the database, the constraints on the relational
database should not get violated.

Insert operation:

Inserting tuples into a relation may violate constraints in the following ways:

1. Domain constraint :

The domain constraint is violated when a value assigned to an attribute does not appear in the corresponding
domain or is not of the appropriate data type.
Example:

Assume the domain constraint says that all values inserted into the relation must be greater than 10; inserting a
value less than 10 violates the domain constraint, so the insertion is rejected.

2. Entity Integrity constraint :

Inserting NULL into any part of the primary key of a new tuple violates the entity integrity constraint.

Example:

Insert (NULL, ‘Bikash’, ‘M’, ‘Jaipur’, ‘123456’) into EMP

The above insertion violates the entity integrity constraint since the primary key EID is NULL, which is not
allowed, so it is rejected.

3. Key Constraints :

Inserting a value into a new tuple when that value already exists for a key attribute in another tuple of the same
relation violates the key constraint.

Example:

Insert (’1200’, ‘Arjun’, ‘9976657777’, ‘Mumbai’) into EMPLOYEE

This insertion violates the key constraint if EID=1200 is already present in some tuple in the same relation, so it
gets rejected.

4. Referential Integrity Constraint:

Inserting a value into the foreign key of relation 1 for which there is no corresponding value in the primary key of
the referenced relation 2 violates referential integrity.

Example:

When we try to insert a value, say 1200, into EID (foreign key) of table 1 for which there is no corresponding EID
(primary key) in table 2, a violation occurs, so the insertion is rejected.

A possible way to handle such violations: if an insertion violates any of the constraints, the default action is to
reject the operation.

Deletion operation:

Deleting tuples from a relation can violate only referential integrity constraints.

Referential Integrity Constraint:

A violation occurs only if the deleted tuple of table 1 is referenced by foreign keys in tuples of table 2. If such a
deletion takes place, the foreign key values in table 2 no longer refer to an existing tuple, which violates the
referential integrity constraint.

Solutions that are possible to correct the violation to the referential integrity due to deletion are listed below:

 Restrict - Here we reject the deletion.
 Cascade - Here, if a record in the parent (referenced) table is deleted, the corresponding records in the child
(referencing) table are automatically deleted as well.
 Set null or set default - Here we modify the referencing attribute values that cause the violation, setting them
either to NULL or to another valid (default) value (see the SQL sketch below).
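
The following is a minimal sketch (hypothetical dept and emp tables) of how these referential actions are
expressed as SQL foreign key options:

CREATE TABLE dept (
    deptId INT PRIMARY KEY,
    dname  VARCHAR(50)
);

CREATE TABLE emp (
    empId  INT PRIMARY KEY,
    deptId INT,
    FOREIGN KEY (deptId) REFERENCES dept(deptId)
        ON DELETE CASCADE      -- delete employees when their department is deleted
        -- alternatives: ON DELETE RESTRICT (reject the deletion)
        --               ON DELETE SET NULL (keep the employee, clear deptId)
);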
Update Operation

The Update (or Modify) operation is used to change the values of one or more attributes in a tuple (or tuples) of
some relation R. It is necessary to specify a condition on the attributes of the relation to select the tuple (or
tuples) to be modified.

Here are some examples

Update the salary of the EMPLOYEE tuple with Ssn= ‘999887777’ to 28000.

 Acceptable.

Update the Dno of the EMPLOYEE tuple with Ssn= ‘999887777’ to 1.

 Acceptable.

Update the Dno of the EMPLOYEE tuple with Ssn= ‘999887777’ to 7.

 Unacceptable, because it violates referential integrity.

Update the Ssn of the EMPLOYEE tuple with Ssn= ‘999887777’ to ‘987654321’.

 Unacceptable, because it violates the primary key constraint by repeating a value that already exists as a
primary key in another tuple; it also violates referential integrity constraints because other relations refer
to the existing value of Ssn.

Updating an attribute that is neither part of a primary key nor of a foreign key usually causes no problems; the
DBMS need only check to confirm that the new value is of the correct data type and domain. Modifying a primary
key value is similar to deleting one tuple and inserting another in its place because we use the primary key to
identify tuples.

Dealing with constraint violations:

Similar options exist for dealing with referential integrity violations caused by the Update operation as those
discussed for the Delete operation.

4.2.9 Relational Algebra

Relational algebra is a procedural query language. It gives a step-by-step process to obtain the result of a query.
It uses operators to perform queries.

Types of Relational Operations

1. Select Operation:

 The select operation selects tuples that satisfy a given predicate.


 It is denoted by sigma (σ).

Notation: σ p(r)

Where:
σ denotes the selection operation,

r is the relation, and

p is a propositional logic formula (the selection predicate), which may use connectives such as AND, OR and
NOT, and comparison operators such as =, ≠, ≥, >, ≤, <.

For example: LOAN Relation

BRANCH_NAME LOAN_NO AMOUNT


Downtown L-17 1000
Redwood L-23 2000
Perryride L-15 1500
Downtown L-14 1500
Mianus L-13 500
Roundhill L-11 900
Perryride L-16 1300

Input: σ BRANCH_NAME="Perryride" (LOAN)

Output:

BRANCH_NAME LOAN_NO AMOUNT


Perryride L-15 1500
Perryride L-16 1300

2. Project Operation:

 This operation shows the list of those attributes that we wish to appear in the result; the rest of the attributes
are eliminated from the result.
 It is denoted by ∏.

Notation: ∏ A1, A2, ..., An (r)

Where:

A1, A2, ..., An are attribute names of relation r.

Example: CUSTOMER RELATION

NAME STREET CITY


Jones Main Harrison
Smith North Rye
Hays Main Harrison
Curry North Rye
Johnson Alma Brooklyn
Brooks Senator Brooklyn
Input: ∏ NAME, CITY (CUSTOMER)

Output:

NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn

3. Union Operation:

 Suppose there are two relations R and S. The union operation contains all the tuples that are either in R or S or
in both R and S.
 It eliminates duplicate tuples. It is denoted by ∪.

Notation: R ∪ S

A union operation must satisfy the following conditions:

 R and S must have the same number of attributes (with compatible domains).
 Duplicate tuples are eliminated automatically.

Example:

DEPOSITOR RELATION

CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284

BORROW RELATION

CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17

Input: ∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:

 Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R
and S.
 It is denoted by intersection ∩.

Notation: R ∩ S

Example: Using the above DEPOSITOR table and BORROW table


Input:

∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Smith
Jones
5. Set Difference:

 Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not
in S.
 It is denoted by the minus sign (−).

Notation: R - S

Example: Using the above DEPOSITOR table and BORROW table

Input:

∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
6. Cartesian product

 The Cartesian product is used to combine each row in one table with each row in the other table. It is also
known as a cross product.
 It is denoted by X.

Notation: E X D

Example:

EMPLOYEE RELATION

EMP_ID EMP_NAME EMP_DEPT
1 Smith A
2 Harry C
3 John B

DEPARTMENT RELATION

DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal

Input:

EMPLOYEE X DEPARTMENT

Output:

EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME


1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal

7. Rename Operation:

The rename operation is used to rename the output relation. It is denoted by rho (ρ).

Example: We can use the rename operator to rename STUDENT relation to STUDENT1.

ρ(STUDENT1, STUDENT)
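
For comparison, the algebra examples above have approximate SQL counterparts. The sketch below assumes
LOAN, CUSTOMER, BORROW and DEPOSITOR tables with the columns shown earlier; DISTINCT is used because
∏ and ∪ eliminate duplicates.

-- σ BRANCH_NAME="Perryride" (LOAN)
SELECT * FROM LOAN WHERE BRANCH_NAME = 'Perryride';

-- ∏ NAME, CITY (CUSTOMER)
SELECT DISTINCT NAME, CITY FROM CUSTOMER;

-- ∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
SELECT CUSTOMER_NAME FROM BORROW
UNION
SELECT CUSTOMER_NAME FROM DEPOSITOR;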

4.2.10 Relational Calculus

There is an alternative way of formulating queries known as relational calculus. Relational calculus is a non-
procedural query language: the user specifies what to retrieve but is not concerned with the details of how to
obtain the result. The relational calculus tells what to do but never explains how to do it. Most commercial
relational languages are based on aspects of relational calculus, including SQL, QBE and QUEL.

Why is it called Relational Calculus?

It is based on predicate calculus, a branch of symbolic logic. A predicate is a truth-valued function with
arguments. On substituting values for the arguments, the function yields an expression called a proposition,
which can be either true or false. Relational calculus is a tailored version of a subset of predicate calculus used
to communicate with the relational database.

Many calculus expressions involve the use of quantifiers. There are two types of quantifiers:

 Universal Quantifier: The universal quantifier, denoted by ∀ and read as "for all", means that every tuple in a
given set of tuples satisfies the given condition.
 Existential Quantifier: The existential quantifier, denoted by ∃ and read as "there exists", means that at least
one tuple in a given set of tuples satisfies the given condition.

Before using the concept of quantifiers in formulas, we need to know the concept of Free and Bound Variables.

A tuple variable t is bound if it is quantified, which means that it appears within the scope of a ∀ or ∃ quantifier; a
variable that is not bound is said to be free.

Free and bound variables may be compared with global and local variable of programming languages.

Types of Relational Calculus:

1. Tuple Relational Calculus (TRC)
2. Domain Relational Calculus (DRC)

1. Tuple Relational Calculus (TRC)

It is a non-procedural query language based on specifying tuple variables (also known as range variables) for
which a predicate holds true. It describes the desired information without giving a specific procedure for
obtaining that information. The tuple relational calculus is used to select tuples from a relation. In TRC, the
filtering variable ranges over the tuples of a relation. The result can contain one or more tuples.

Notation:
A query in the tuple relational calculus is expressed using the following notation:

{T | P (T)} or {T | Condition (T)}

Where

T is the resulting tuples

P(T) is the condition used to fetch T.

For example:

{ T.name | Author(T) AND T.article = 'database' }

Output: This query selects the tuples from the AUTHOR relation. It returns a tuple with 'name' from Author who
has written an article on 'database'.

TRC (tuple relation calculus) can be quantified. In TRC, we can use Existential (∃) and Universal Quantifiers (∀).

For example:

{ R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}

Output: This query will yield the same result as the previous one.
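
For comparison, a rough SQL counterpart of these TRC queries, assuming an Author table with name and article
columns, would be:

SELECT name FROM Author WHERE article = 'database';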

2. Domain Relational Calculus (DRC)

The second form of relational calculus is known as domain relational calculus. In domain relational calculus, the
filtering variables range over the domains of attributes. Domain relational calculus uses the same operators as
tuple calculus. It uses the logical connectives ∧ (and), ∨ (or) and ¬ (not), and the existential (∃) and universal (∀)
quantifiers to bind the variables. QBE (Query By Example) is a query language related to domain relational
calculus.

Notation: { a1, a2, a3, ..., an | P (a1, a2, a3, ... ,an)}

Where

a1, a2, ..., an are attributes of the relation, and P stands for a formula built from those attributes.

For example:

{< article, page, subject > | < article, page, subject > ∈ javatpoint ∧ subject = 'database'}

Output: This query will yield the article, page, and subject from the relation javatpoint, where the subject is
'database'.
4.2.11 Codd’s Rules in DBMS

Rule 1: The Information Rule

All information, whether it is user information or metadata, that is stored in a database must be entered as a
value in a cell of a table. It is said that everything within the database is organized in a table layout.

Rule 2: The Guaranteed Access Rule

Each data element is guaranteed to be accessible logically with a combination of the table name, primary key
(row value), and attribute name (column value).

Rule 3: Systematic Treatment of NULL Values

Every Null value in a database must be given a systematic and uniform treatment.

Rule 4: Active Online Catalog Rule

The database catalog, which contains metadata about the database, must be stored and accessed using the
same relational database management system.

Rule 5: The Comprehensive Data Sublanguage Rule

A crucial component of any efficient database system is its ability to offer an easily understandable data
manipulation language (DML) that facilitates defining, querying, and modifying information within the database.

Rule 6: The View Updating Rule

All views that are theoretically updatable must also be updatable by the system.

Rule 7: High-level Insert, Update, and Delete

A successful database system must possess the feature of facilitating high-level insertions, updates, and
deletions that can grant users the ability to conduct these operations with ease through a single query.

Rule 8: Physical Data Independence

Application programs and activities should remain unaffected when changes are made to the physical storage
structures or methods.

Rule 9: Logical Data Independence

Application programs and activities should remain unaffected when changes are made to the logical structure
of the data, such as adding or modifying tables.

Rule 10: Integrity Independence

Integrity constraints should be specified separately from application programs and stored in the catalog. They
should be automatically enforced by the database system.
Rule 11: Distribution Independence

The distribution of data across multiple locations should be invisible to users, and the database system should
handle the distribution transparently.

Rule 12: Non-Subversion Rule

If the system provides a low-level (record-at-a-time) interface, that interface must not be able to subvert the
system by bypassing security and integrity constraints.
4.3 SQL

Data definition and Data Types; Constraints, Queries, Insert, Delete and Update Statements; View, Stored
Procedures and Functions; Database Triggers, SQL Injections.

4.3.1 Data definition

The DDL Commands in Structured Query Language are used to create and modify the schema of the database
and its objects. The syntax of DDL commands is predefined for describing the data. The commands of Data
Definition Language deal with how the data should exist in the database.

Following are the five DDL commands in SQL:

1. CREATE Command
2. DROP Command
3. ALTER Command
4. TRUNCATE Command
5. RENAME Command

1. CREATE Command

It is a DDL command used to create databases, tables, triggers and other database objects.

Syntax to Create a Database:

CREATE Database Database_Name;

Syntax to create a new table:

CREATE TABLE table_name (


column_Name1 data_type ( size of the column ) ,
column_Name2 data_type ( size of the column) ,
...
column_NameN data_type ( size of the column ) ) ;
Syntax to Create a new index:

CREATE INDEX Name_of_Index ON Name_of_Table (column_name_1, column_name_2, ..., column_name_N);

Syntax to create a trigger:

CREATE TRIGGER [trigger_name]


[ BEFORE | AFTER ]
{ INSERT | UPDATE | DELETE }
ON [table_name] ;

2. DROP Command

It is a DDL command used to delete/remove database objects from the SQL database. We can easily remove the
entire table, view, or index from the database using this DDL command

Syntax to remove a database:

DROP DATABASE Database_Name;

Syntax to remove a table:


DROP TABLE Table_Name;

Syntax to remove an index:

DROP INDEX Index_Name;

3. ALTER Command

It is a DDL command which changes or modifies the existing structure of the database, and it also changes the
schema of database objects.

We can also add and drop constraints of the table using the ALTER command.

Example:

ALTER TABLE tableName ADD columnName column_definition;


ALTER TABLE tableName DROP COLUMN columnName;
ALTER TABLE tableName MODIFY columnName column_datatype(size);

4. TRUNCATE Command

It is another DDL command which deletes or removes all the records from the table.

This command also removes the space allocated for storing the table records.

TRUNCATE TABLE Table_Name;

5. RENAME Command

It is used to change the name of the database table.

RENAME TABLE oldTableName TO newTableName;
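
Taken together, a short illustrative sequence of these DDL commands (using a hypothetical Student table;
RENAME TABLE is MySQL-style syntax) might look like:

CREATE TABLE Student (
    rollNo INT PRIMARY KEY,
    name   VARCHAR(50)
);

ALTER TABLE Student ADD email VARCHAR(100);        -- add a column
CREATE INDEX idx_student_name ON Student (name);   -- create an index
TRUNCATE TABLE Student;                            -- remove all rows, keep the structure
RENAME TABLE Student TO StudentInfo;               -- rename the table
DROP TABLE StudentInfo;                            -- remove the table entirely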

4.3.2 Data Types

SQL Datatype is used to define the values that a column can contain.

Every column is required to have a name and data type in the database table.

1. Binary Datatypes

Data Type Description


binary It has a maximum length of 8000 bytes.
It contains fixed-length binary data.
varbinary It has a maximum length of 8000 bytes.
It contains variable-length binary data.
image It has a maximum length of 2,147,483,647 bytes.
It contains variable-length binary data.

2. Approximate Numeric Datatype :

Data type From To Description

float -1.79E+308 1.79E+308 It is used to specify a floating-point value, e.g. 6.2, 2.9.
real -3.40E+38 3.40E+38 It specifies a single-precision floating-point number.

3. Exact Numeric Datatype

Data type Description


int It is used to specify an integer value.
smallint It is used to specify small integer value.
bit It has the number of bits to store.
decimal It specifies a numeric value that can have a decimal
number.
numeric It is used to specify a numeric value.

4. Character String Datatype

Data type Description


char It has a maximum length of 8000 characters. It contains
Fixed-length non-unicode characters.
varchar It has a maximum length of 8000 characters. It contains
variable-length non-unicode characters.
text It has a maximum length of 2,147,483,647 characters. It
contains variable-length non-unicode characters.

5. Date and time Datatypes

Datatype Description
date It is used to store the year, month, and days value.
time It is used to store the hour, minute, and second values.

timestamp It stores the year, month, day, hour, minute, and the
second value.

4.3.3 SQL Constraints

 SQL constraints are used to specify rules for data in a table.


 Constraints in SQL means we are applying certain conditions or restrictions on the database

Categories:

1. Column Level Constraint: Column Level Constraint is used to apply a constraint on a single column.
2. Table Level Constraint: Table Level Constraint is used to apply a constraint on multiple columns.

Common Constraints:

 NOT NULL - Ensures that a column cannot have a NULL value


ALTER TABLE Persons MODIFY Age int NOT NULL;
 UNIQUE - Ensures that all values in a column are different
ALTER TABLE Persons ADD CONSTRAINT UC_Person UNIQUE (ID,LastName);
 PRIMARY KEY - A combination of a NOT NULL and UNIQUE. Uniquely identifies each row in a table
ALTER TABLE Persons ADD CONSTRAINT PK_Person PRIMARY KEY (ID,LastName);
 FOREIGN KEY - Prevents actions that would destroy links between tables
ALTER TABLE Orders ADD FOREIGN KEY (PersonID) REFERENCES Persons(PersonID);
 CHECK - Ensures that the value in a column satisfies a specific condition
ALTER TABLE Persons ADD CONSTRAINT CHK_PersonAge CHECK (Age>=18 AND City='Sandnes');
 DEFAULT - Sets a default value for a column if no value is specified
ALTER TABLE Persons ALTER City SET DEFAULT 'Sandnes';
 CREATE INDEX - Used to create and retrieve data from the database very quickly
CREATE INDEX idx_pname ON Persons (LastName, FirstName);
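
The two categories above can also be written directly in a CREATE TABLE statement. The following is a minimal
sketch (table and column names are hypothetical) contrasting column-level and table-level constraints:

CREATE TABLE Persons (
    ID       INT NOT NULL,                  -- column-level constraint
    LastName VARCHAR(50) NOT NULL,          -- column-level constraint
    Age      INT CHECK (Age >= 18),         -- column-level constraint
    City     VARCHAR(50) DEFAULT 'Sandnes', -- column-level default
    CONSTRAINT PK_Person PRIMARY KEY (ID, LastName)   -- table-level constraint over two columns
);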

4.3.5 Insert Statement

SQL INSERT statement is a SQL query. It is used to insert a single or multiple records in a table.

There are two ways to insert data in a table:

1. By SQL insert into statement


a. By specifying column names
b. Without specifying column names
2. By SQL insert into select statement

Inserting data directly into a table - With specifying column names

INSERT INTO table_name (column1, column2, column3....)


VALUES (value1, value2, value3.....);

Inserting data directly into a table - Without specifying column names

INSERT INTO table_name


VALUES (value1, value2, value3....);

Inserting data through SELECT Statement

INSERT INTO table_name


[(column1, column2, ..., columnN)]
SELECT column1, column2, ..., columnN
FROM table_name [WHERE condition];

Inserting multiple rows:

INSERT INTO table_name


VALUES (value1, value2, value3....), (value1, value2, value3....);

4.3.6 Delete Statement

The SQL Delete statement is used to remove rows from a table based on a specific condition.
Delete a specific row

DELETE FROM employees WHERE employee_id = 101;

This will delete the employee with employee_id 101 from the employees table.

Deleting rows based on a condition

DELETE FROM products WHERE price < 10 ;

This will delete all products from the products table where the price is less than 10.

Deleting all rows

DELETE from orders;

This will delete all rows from the orders table, but the structure of the table will remain intact.

Deleting rows using multiple conditions

DELETE FROM customers WHERE city = 'New York' AND last_purchase_date < '2023-01-01';

This deletes all customers in New York who have not made a purchase since January 1, 2023.

4.3.7 Update Statement

It is used to modify existing records in a table. You can update one or more rows depending on the condition you
provide. It allows us to set new values for one or more columns.

Updating a single column for specific rows based on condition:

UPDATE employees SET salary = 60000 WHERE employee_id = 101;


Updating multiple columns for specific rows based on condition:

UPDATE employees SET salary = 65000, department = 'Marketing' WHERE employee_id = 102;
Updating all rows:

UPDATE employees SET bonus = bonus * 15000;

4.3.8 SQL View

SQL views are powerful tools for abstracting and simplifying data queries, improving security, and making your
SQL code more maintainable.

 Virtual Table: A view acts like a table but doesn’t store data. It dynamically pulls data from underlying tables
each time it’s queried.
 Simplifies Complex Queries: A view can encapsulate complex joins, unions, and other queries, allowing
users to access data in a simplified manner.
 Data Security: Views can provide a limited view of the data, restricting access to certain columns or rows.
 Read-Only or Updatable: Some views can be updated directly, while others can only be read from, depending
on the complexity of the view.

Basic View

CREATE VIEW customers_from_city AS


SELECT customer_id, first_name, last_name, city
FROM customers
WHERE city = 'New York';
View with Joins

If you have two tables, orders and customers, and you want to create a view that shows each order with the
customer’s details:

CREATE VIEW ordWithCustDtl AS


SELECT o.order_id, o.order_date, o.total_amount, c.customer_id, c.first_name,
c.last_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

Querying a View

SELECT * FROM customers_from_city;

Dropping a View

DROP VIEW view_name;

4.3.9 Stored Procedures and Functions

Both stored procedures and functions are vital tools for improving database performance and maintainability.
The choice between them depends on the requirements of your specific application: whether you need to
perform complex actions or just return a calculated value.

4.3.9.1 Stored Procedures

It is a precompiled collection of SQL statements that can be executed as a single unit. It can perform various
operations such as querying data, modifying data, and controlling the flow of execution using conditional logic
and loops.

Key Characteristics of Stored Procedures:

 Side Effects: Stored procedures can modify the database state by inserting, updating, or deleting records.
 Execution: Stored procedures are executed using the EXECUTE or CALL statement.
 Return Value: They do not necessarily return a value, but they can return status codes or messages.
 Parameters: Stored procedures can accept input, output, or input/output parameters.
 Transactions: They can include transaction control (e.g., COMMIT, ROLLBACK).

CREATE PROCEDURE GetEmployeeDetails (IN emp_id INT)


BEGIN
SELECT * FROM Employees WHERE EmployeeID = emp_id;
END;

--call the procedure


CALL GetEmployeeDetails(101);

4.3.9.2 Functions

It is a routine that performs a calculation or returns a result, typically used to return a single value. Functions are
generally used within SQL queries.

Key Characteristics of Functions:

 No Side Effects: Functions cannot modify the database state (e.g., they cannot insert, update, or delete
records).
 Return Value: A function must return a value, and it returns exactly one value of a specific type (e.g., INT,
VARCHAR).
 Used in Queries: Functions are commonly used in SQL expressions, such as SELECT, WHERE, and ORDER BY.
 Parameters: Functions accept input parameters, but they do not accept output parameters.

CREATE FUNCTION GetSalary (emp_id INT)


RETURNS DECIMAL(10, 2)
BEGIN
DECLARE salary DECIMAL(10, 2);
SELECT Salary INTO salary FROM Employees WHERE EmployeeID = emp_id;
RETURN salary;
END;

-- To use the function in a query:


SELECT GetSalary(101);

Differences Between Stored Procedures and Functions:

Aspect | Stored Procedure | Function
Return Type | May or may not return a value. | Must return a value.
Side Effects | Can modify data (e.g., INSERT, UPDATE, DELETE). | Cannot modify data (no side effects).
Usage | Executed with EXEC or CALL. | Can be used in SQL expressions (e.g., in SELECT or WHERE clauses).
Parameters | Can have input, output, or input/output parameters. | Can only have input parameters.
Transaction Control | Can include transaction management (e.g., COMMIT, ROLLBACK). | Cannot include transaction management.
When to Use Stored Procedures and Functions

 Use a stored procedure when you need to perform operations that affect the database state, such as
updates, inserts, or deletes. They are also useful for encapsulating complex logic or a series of SQL
commands that you want to reuse across different parts of an application.
 Use a function when you need to return a value from the database, and you want to integrate it within a query.
Functions are best for calculations or retrieving specific data, especially if the logic is simple and does not
require changes to the database state.

4.3.10 Database Triggers

A trigger in SQL is a set of instructions that automatically execute or fire when a specified event occurs on a
specified table or view. Triggers are used to enforce business rules, data integrity, and auditing without requiring
explicit calls in the application logic.

Key Concepts of SQL Triggers:

Events: Triggers are fired based on database operations such as:

 INSERT: Triggered when new records are inserted into a table.


 UPDATE: Triggered when existing records are updated in a table.
 DELETE: Triggered when records are deleted from a table.

Trigger Timing: Triggers can be set to execute before or after the event occurs:

 BEFORE Trigger: Executes before the actual database modification (e.g., before an insert, update, or delete).
 AFTER Trigger: Executes after the database modification has been performed.

Trigger Type: Based on how they interact with the data, triggers can be:

 Row-Level Triggers: Fired once for each row affected by the operation. This is the most common type of
trigger.
 Statement-Level Triggers: Fired once for the entire SQL statement, regardless of how many rows are affected.

General Syntax:

CREATE TRIGGER trigger_name


{ BEFORE | AFTER | INSTEAD OF }
{ INSERT | UPDATE | DELETE }
ON table_name
[ FOR EACH ROW ]
BEGIN
-- SQL statements to be executed when the trigger fires
END;

 trigger_name: The name you give to the trigger.


 BEFORE | AFTER: Specifies whether the trigger should fire before or after the event.
 INSERT | UPDATE | DELETE: Specifies the event that triggers the action.
 table_name: The table on which the trigger is defined.
 FOR EACH ROW: Indicates that the trigger should execute for each affected row (row-level trigger).
 BEGIN ... END;: Block of SQL statements to execute when the trigger fires.

Types of Triggers in SQL:

BEFORE Triggers:
 It executes before the data modification statement is applied to the table.
 They allow for validation or modification of data before it is committed to the database.

CREATE TRIGGER before_insert_employee


BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
IF NEW.salary < 0 THEN
SET NEW.salary = 0; -- Ensure salary can't be negative
END IF;
END;

AFTER Triggers:

 It executes after the data modification statement is applied to the table.


 They are often used for logging, auditing, or cascading changes.

CREATE TRIGGER after_update_employee


AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
INSERT INTO audit_log (action, table_name, record_id, timestamp)
VALUES ('UPDATE', 'employees', OLD.employee_id, NOW());
END;

INSTEAD OF Triggers:

 It replaces the actual operation (e.g., INSERT, UPDATE, DELETE) with the logic defined in the trigger.
 They are typically used with views or for complex data modification scenarios where the default action needs
to be overridden.

CREATE TRIGGER instead_of_insert_employee


INSTEAD OF INSERT ON employees
FOR EACH ROW
BEGIN
-- Custom logic for handling insert
INSERT INTO employees (employee_id, name, salary)
VALUES (NEW.employee_id, NEW.name, NEW.salary);
END;

Considerations When Using Triggers

 Performance: Triggers add overhead to database operations. Each time an INSERT, UPDATE, or DELETE is
executed, the trigger must also run, which can impact performance, especially for complex triggers or those
that operate on large datasets.
 Trigger Nesting: Some databases support recursive triggers, where one trigger can fire another. Care should
be taken to avoid infinite loops of triggers firing each other.
 Complexity and Maintenance: Excessive use of triggers can make database logic harder to understand and
maintain. It’s important to document triggers well and ensure that they are necessary for the operation of the
system.
 Debugging: Debugging triggers can be challenging because they run automatically and may be hard to trace
unless logging is in place.

Example in MySQL:
DELIMITER $$
CREATE TRIGGER update_product_stock
AFTER UPDATE ON order_details
FOR EACH ROW
BEGIN
IF OLD.quantity <> NEW.quantity THEN
UPDATE products
SET stock = stock - (NEW.quantity - OLD.quantity)
WHERE product_id = NEW.product_id;
END IF;
END $$
DELIMITER ;

This trigger updates the stock column in the products table whenever the quantity of an item in the
order_details table is updated.

4.3.11 SQL Injections

SQL Injection (SQLi) is a web application vulnerability that occurs when an attacker manipulates an application's
SQL query to gain unauthorized access or perform malicious actions on a database. SQL injection happens when
user inputs are not properly sanitized and are directly included in SQL statements. This allows the attacker to
inject malicious SQL code that can alter the intended behavior of the query, potentially giving them control over
the database.

How SQL Injection Works

When an application takes user input and inserts it into a SQL query, if the input isn't properly validated or
escaped, an attacker can inject their own SQL commands. This can lead to unauthorized access, data
manipulation, or even complete compromise of the database.

Example of a Vulnerable Query:

Imagine an application asks users for their username and password and constructs the following query to
authenticate them:

SELECT * FROM users WHERE username = 'user_input' AND password = 'password_input';

If the user input is not sanitized, an attacker could enter malicious input like:

 username = '' OR '1'='1'

 password = '' OR '1'='1'

This would change the query to:

SELECT * FROM users WHERE username = '' OR '1'='1' AND password = '' OR '1'='1';
-- the query always returns true since '1'='1'

Types of SQL Injection Attacks

1. Classic SQL Injection:

Attackers inject SQL statements directly into the user input fields to alter the behavior of the SQL query. This is
the most common form of SQL injection.

SELECT * FROM users WHERE username = 'admin' AND password = '' OR 1=1; --

2. Blind SQL Injection:

In this type of attack, the application does not provide direct feedback about the query result. Attackers infer the
results by observing the application's behavior (e.g., page response time, changes in page content).
Types:

 Boolean-based Blind SQL Injection: The attacker sends a query that evaluates a true or false condition and
observes the response.
 Time-based Blind SQL Injection: The attacker sends a query that causes a delay in the database's response,
allowing them to deduce information from the timing.

SELECT * FROM users WHERE username = '' OR 1=1;

3. Union-based SQL Injection:

This attack combines the results of the original query with results from other SELECT queries, often used to
retrieve data from other tables.

SELECT * FROM users WHERE username = 'admin' UNION SELECT name, email FROM customers;

4. Error-based SQL Injection:

This involves forcing the database to generate an error, which can reveal information about the database
structure (e.g., table names, column names).

SELECT * FROM users WHERE username = '' AND 1=1 --;

The attacker may use the error messages to gain insights into the database schema.

5. Out-of-Band SQL Injection:

This method uses external channels (e.g., DNS or HTTP requests) to get data from the database. It is often used
when other methods like error-based or boolean-based injection aren't viable.

Consequences of SQL Injection

1. Unauthorized Access: Attackers can bypass authentication and access sensitive data, including
usernames, passwords, and other confidential information.
2. Data Manipulation: Attackers can insert, update, or delete records in the database. This can result in data
corruption, loss, or unauthorized changes.
3. Privilege Escalation: Attackers may exploit SQL injection to escalate their privileges and gain administrative
control of the database.
4. Remote Code Execution: In some cases, attackers can execute arbitrary commands on the database server,
potentially compromising the entire system.
5. Denial of Service (DoS): Attackers might execute queries that overload the database, causing slowdowns or
making the system unavailable.
6. Reputation Damage: If an application is vulnerable to SQL injection and data is compromised, the
organization may suffer reputational harm and legal consequences.

Preventing SQL Injection

1. Use Prepared Statements (Parameterized Queries): Prepared statements ensure that SQL code and user
input are processed separately. This eliminates the risk of SQL injection because user input is treated as data,
not executable code.

//Example in PHP (using MySQLi):


$stmt = $mysqli->prepare("SELECT * FROM users WHERE username = ? AND password = ?");
$stmt->bind_param("ss", $username, $password);
$stmt->execute();
2. Use Stored Procedures: Stored procedures are precompiled SQL statements that are stored in the
database. They separate SQL logic from user input, reducing the risk of injection.

CREATE PROCEDURE GetUser (IN in_username VARCHAR(255), IN in_password VARCHAR(255))

BEGIN
SELECT * FROM users WHERE username = in_username AND password = in_password;
END;

3. Input Validation and Sanitization: Always validate user input to ensure it matches the expected format (e.g.,
alphanumeric usernames, numeric IDs). Use a whitelist of acceptable inputs and reject any input that doesn't
conform.

4. Use ORM (Object-Relational Mapping): Many modern web frameworks use ORMs that automatically handle
parameterized queries, reducing the likelihood of SQL injection vulnerabilities.

5. Escape User Inputs: If you must directly include user input in SQL queries, make sure to escape special
characters (like single quotes, semicolons, etc.) to prevent malicious code injection.

$username = mysqli_real_escape_string($conn, $username);

6. Error Handling: Don't display detailed database errors to users. Instead, log errors server-side and display
generic error messages to the user. This prevents attackers from gaining information about the database
structure.

7. Use Web Application Firewalls (WAFs): A WAF can help detect and block common SQL injection patterns by
filtering incoming traffic before it reaches the web application.

8. Principle of Least Privilege: Limit the permissions of the database user account used by the application. This
minimizes the impact if an attacker exploits a vulnerability.

9. Regular Security Audits and Updates: Perform regular security audits, keep your software up-to-date, and
patch known vulnerabilities to reduce the chances of an attack.
Unit 4 - Normalization for Relational Databases

Normalization is the process of organizing data in a database to minimize redundancy and dependency by
dividing large tables into smaller ones and defining relationships between them. The goal is to remove
undesirable characteristics like update anomalies, insertion anomalies, and deletion anomalies, ensuring that
data is stored efficiently.

4.1 Functional Dependencies

A functional dependency is a relationship between two attributes (or sets of attributes) in a relational database.
It means that the value of one attribute (or set of attributes) determines the value of another attribute. In other
words, if you know the value of one attribute, you can determine the value of the other.

For example, consider the following:

 A → B: If attribute A uniquely determines attribute B, then B is functionally dependent on A.


 Example: In a Student table, if Student_ID → Name, it means that the student ID uniquely determines the
student's name.

4.2 Normalizations

Normalization involves decomposing a database schema into a series of "normal forms" (NF) to ensure the
elimination of redundancy and dependencies. The most common normal forms are:

Types of Normal Forms

First Normal Form (1NF):

 A table is in 1NF if it contains only atomic (indivisible) values and each record is unique (no repeating groups).
 Example: A table that stores multiple phone numbers in a single column is not in 1NF. It should be broken
down into separate rows for each phone number.

Second Normal Form (2NF):

 A table is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
 This eliminates partial dependency (when a non-key attribute is dependent on part of the primary key).
 Example: In a table with a composite primary key (Order_ID, Product_ID), if the Product_Price depends only on
Product_ID and not on Order_ID, then it violates 2NF. This can be resolved by splitting the table into two.

Third Normal Form (3NF):

 A table is in 3NF if it is in 2NF and no transitive dependency exists (i.e., non-key attributes are not dependent
on other non-key attributes).
 Example: If Employee_ID → Department_ID and Department_ID → Department_Manager, then Employee_ID
→ Department_Manager is a transitive dependency. This can be resolved by separating the department
information into another table (a decomposition sketch appears after this list of normal forms).

Boyce-Codd Normal Form (BCNF):

 A table is in BCNF if it is in 3NF and for every functional dependency X → Y, X is a superkey (a candidate key or
a superset of a candidate key).
 Example: If A → B and B → C, but B is not a superkey, the table violates BCNF. To achieve BCNF, we would
decompose the table.

Fourth Normal Form (4NF):


 A table is in 4NF if it is in BCNF and there are no multi-valued dependencies. A multi-valued dependency
occurs when one attribute in a table determines multiple independent values of another attribute.
 Example: In a table that records a person’s phone number and email address, if each person has multiple
phone numbers and multiple email addresses, this could be a multi-valued dependency. This can be
resolved by splitting the data into two separate tables.

Fifth Normal Form (5NF):

 A table is in 5NF if it is in 4NF and it contains no join dependency, meaning that all non-trivial join
dependencies are a result of candidate keys.
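
The following is a hedged SQL sketch of the 3NF example mentioned above: the transitive dependency
Employee_ID → Department_ID → Department_Manager is removed by moving the department data into its own
table (names and data types are assumptions):

-- Before decomposition (violates 3NF):
--   Employee(Employee_ID, Name, Department_ID, Department_Manager)

CREATE TABLE Department (
    Department_ID      INT PRIMARY KEY,
    Department_Manager VARCHAR(50)
);

CREATE TABLE Employee (
    Employee_ID   INT PRIMARY KEY,
    Name          VARCHAR(50),
    Department_ID INT,
    FOREIGN KEY (Department_ID) REFERENCES Department(Department_ID)
);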

4.3 Algorithms for Query Processing and Optimization

Query Processing involves translating a high-level query (like SQL) into a sequence of operations that can be
executed by the database management system (DBMS).

Query Optimization involves improving the performance of queries by minimizing the response time and
resource usage.

Steps in Query Processing:

1. Parsing: The SQL query is parsed to check for syntax correctness.


2. Translation: The query is translated into a query tree or relational algebra expression.
3. Optimization: Various algorithms are used to choose the most efficient execution plan. This includes
selecting the best access paths (e.g., using indexes, joins).
4. Execution: The DBMS executes the query based on the optimized plan.

Query Optimization Techniques:

 Selection Pushdown: Moving the filtering operation closer to the data retrieval, so that rows are discarded as
early as possible (see the sketch after this list).
 Join Reordering: Reordering the joins to minimize intermediate results.
 Index Usage: Leveraging indexes for faster access to specific data.
 Materialized Views: Using precomputed query results to reduce repeated computation.
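
A hedged illustration of two of these techniques, using hypothetical customers and orders tables:

-- Index usage: an index on orders(customer_id) lets the DBMS avoid a full table scan
-- when joining or filtering on that column.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Selection pushdown: a good optimizer applies the filter on c.city before (or while)
-- performing the join, so the intermediate result stays small.
SELECT o.order_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE c.city = 'Chennai';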

4.4 Transaction Processing

It refers to handling a sequence of database operations as a single unit, ensuring data consistency, and providing
ACID (Atomicity, Consistency, Isolation, Durability) properties.

 Atomicity: A transaction is all or nothing; it either completes fully or has no effect.
 Consistency: A transaction brings the database from one valid state to another.
 Isolation: The effects of a transaction are isolated from others until the transaction is complete.
 Durability: Once a transaction is committed, its changes are permanent.
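
A minimal sketch of a transaction (hypothetical accounts table): either both updates take effect together or, on
error, neither does.

START TRANSACTION;

UPDATE accounts SET balance = balance - 500 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 500 WHERE account_id = 2;

COMMIT;      -- make the transfer durable
-- ROLLBACK; -- would undo both updates if something went wrong before COMMIT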

4.5 Concurrency Control Techniques

Concurrency control ensures that transactions are executed in a manner that maintains the consistency and
isolation of the database, even when multiple transactions occur simultaneously.

 Locking: Locks are used to control access to data during transactions (illustrated in the sketch after this list).
Common types include:
 Exclusive Lock (X-lock): Prevents any other transaction from accessing the data.
 Shared Lock (S-lock): Allows other transactions to read but not modify the data.
 Timestamping: Each transaction is assigned a timestamp, and transactions are executed in the order of their
timestamps.
 Optimistic Concurrency Control: Transactions execute without locks and check for conflicts only at the
commit time.
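
A hedged sketch of explicit locking inside a transaction (syntax as in MySQL/PostgreSQL; the accounts table is
hypothetical). The selected row is locked exclusively until COMMIT, so concurrent transactions cannot modify it
in the meantime:

START TRANSACTION;

SELECT balance FROM accounts WHERE account_id = 1 FOR UPDATE;   -- exclusive (X-style) row lock
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;

COMMIT;
-- A shared (S-style) lock would instead use SELECT ... FOR SHARE (PostgreSQL / MySQL 8).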

4.6 Database Recovery Techniques

Recovery techniques are used to restore the database to a consistent state after a failure (e.g., system crash or
power failure).

 Log-based Recovery:
 A transaction log records all changes made to the database. After a failure, the DBMS can use the log to
undo or redo transactions as necessary.
 Write-ahead Logging (WAL): Changes are written to the log before they are applied to the database.
 Checkpointing: Periodically saving the current state of the database to minimize the amount of work
required for recovery.
 Shadow Paging: Using a separate copy of the database to keep track of changes. If a failure occurs, the
shadow copy is used to restore the database.
4.7 Object and Object-Relational Databases

Object-Oriented Databases (OODBs): These databases store data as objects (similar to objects in object-
oriented programming languages). They support more complex data types like multimedia, and each object
contains both data and methods for manipulating that data.

Object-Relational Databases (ORDBs): These combine the features of both relational and object-oriented
databases. They allow you to define custom data types, methods, and relationships in a relational schema. They
support complex data types and inheritance.

4.8 Database Security and Authorization

It involves protecting the data from unauthorized access, ensuring confidentiality, integrity, and availability.

 Access Control: The process of restricting access to the database to authorized users. This includes defining
user roles and permissions.
 Discretionary Access Control (DAC): Users can control access to their own data.
 Mandatory Access Control (MAC): Access to data is determined by system policies and not by users.
 Encryption: Data is encrypted both at rest and in transit to prevent unauthorized access.
 Authentication and Authorization:
 Authentication: Ensuring that users are who they claim to be, typically via passwords or biometrics.
 Authorization: Granting or denying access to resources based on the user’s roles and permissions.
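
A minimal sketch of SQL authorization (user and table names are hypothetical):

GRANT SELECT, INSERT ON employees TO report_user;   -- allow reading and inserting rows
REVOKE INSERT ON employees FROM report_user;        -- later withdraw the insert privilege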
Unit 5. Enhanced Data Models:

Enhanced Data Models in modern databases go beyond traditional relational models, addressing the growing
complexity of data types, usage scenarios, and application needs. Below is an overview of several advanced data
models and concepts:

5.1 Temporal Database Concepts

Temporal databases manage data involving time dimensions. They track changes to data over time, which is
critical for applications that need to store historical data or capture the evolution of data.

Key Concepts:

 Valid Time: The time period during which a fact is true in the real world.
 Transaction Time: The time period during which a fact is stored in the database.
 Bitemporal Data: Data that has both valid time and transaction time, allowing you to track when data was
valid in the real world and when it was recorded in the system.

Use Cases:

 Financial systems (track transactions over time)


 Historical records (capture state changes over periods)
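
A hedged sketch of how valid time and transaction time could be recorded in an ordinary relational table (column
names and types are assumptions; dedicated temporal DBMSs provide built-in support for this):

CREATE TABLE employee_salary (
    emp_id     INT,
    salary     DECIMAL(10, 2),
    valid_from DATE,            -- valid time: when the salary became true in the real world
    valid_to   DATE,            -- valid time: when it stopped being true
    tx_start   TIMESTAMP,       -- transaction time: when the row was stored
    tx_end     TIMESTAMP NULL,  -- transaction time: NULL means the row is still current
    PRIMARY KEY (emp_id, valid_from, tx_start)
);

-- "What was the salary of employee 1 on 2020-06-01, according to the current database state?"
SELECT salary
FROM employee_salary
WHERE emp_id = 1
  AND '2020-06-01' BETWEEN valid_from AND valid_to
  AND tx_end IS NULL;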

5.2. Multimedia Databases

Multimedia databases are designed to store, manage, and retrieve various types of media content, including
images, audio, video, and other forms of multimedia data.

Key Concepts:

 Content-Based Retrieval: Allows users to search for multimedia content based on its content (e.g., visual
characteristics for images or audio features for sound).
 Compression Techniques: Multimedia data is typically large and requires compression techniques like JPEG
(for images), MPEG (for video), and MP3 (for audio) for storage and transmission efficiency.
 Indexing: Efficient indexing methods (e.g., for image or video metadata) are crucial for retrieving specific
multimedia content.

Use Cases:

 Digital libraries and archives


 Video on demand systems
 Medical imaging
5.3 Deductive Databases

A deductive database combines traditional relational databases with logical reasoning capabilities, enabling the
deduction of new facts based on stored data and rules.

Key Concepts:

 Rules and Inference: It allows the use of logic-based rules (such as Horn clauses) to infer new facts from
existing data.
 Logic Programming: Queries in deductive databases often involve logical programming languages, such as
Datalog or Prolog, to query data and infer relationships.
 Recursive Queries: Deductive databases can handle recursive queries (e.g., finding all ancestors of a
particular individual).

Use Cases:

 Expert systems
 Knowledge-based systems
 Complex data mining applications

5.4 XML and Internet Databases

XML (eXtensible Markup Language) databases are designed to handle hierarchical and semi-structured data,
which is often used in web-based applications.

Key Concepts:

 XML Schema: Describes the structure of XML documents, ensuring that they conform to a defined structure.
 XPath/XQuery: Query languages used to search and extract data from XML documents.
 NoSQL and Document Stores: Many NoSQL databases (e.g., MongoDB) are optimized to store and query
semi-structured data, including XML and JSON.

Use Cases:

 Web services and e-commerce platforms


 Data exchange between systems (e.g., RSS feeds, configuration files)
 Content management systems

5.5 Mobile Databases

Mobile databases are optimized for mobile devices, which have limited resources (e.g., processing power,
memory) and often experience intermittent connectivity.

Key Concepts:

 Data Synchronization: Mobile databases often need to synchronize data between the mobile device and a
central server, particularly in scenarios where devices are offline.
 Lightweight Data Models: Data models are optimized for minimal storage space, often using smaller,
embedded database systems like SQLite or Berkeley DB.
 Caching: Caching mechanisms help improve performance by storing frequently accessed data locally on
the mobile device.
Use Cases:

 Mobile applications (e.g., social media, retail)


 Offline data collection (e.g., field surveys)
 Mobile customer relationship management (CRM)

5.6 Geographic Information Systems (GIS)

Geographic Information Systems (GIS) are designed to store, analyze, and visualize spatial and geographical
data. This includes maps, satellite imagery, and location-based data.

Key Concepts:

 Spatial Data Models: Data can be represented using points, lines, and polygons (vector data) or as grids
(raster data).
 Spatial Queries: GIS databases support spatial queries like distance calculations, area analysis, and spatial
relationships (e.g., "find all cities within 100 miles of a given location").
 Geospatial Indexing: Indexing techniques like R-trees or Quad-trees are used to efficiently query spatial
data.

Use Cases:

 Urban planning and development


 Environmental monitoring
 Navigation and route planning (e.g., Google Maps)

5.7 Genome Data Management

Genome databases are specialized systems for managing the large and complex datasets produced in genomic
research, such as DNA sequences and genomic annotations.

Key Concepts:

 Bioinformatics: The field combining biology, computer science, and information technology to analyze and
store genomic data.
 Sequence Data: Genomic data often consists of long DNA or RNA sequences, which require efficient storage
and querying techniques.
 Alignment and Mapping: Tools for aligning DNA sequences or mapping them to reference genomes are
critical for analysis.

Use Cases:

 Genomic research (e.g., sequencing genomes, identifying mutations)


 Personalized medicine (e.g., drug development based on genetic profiles)
 Biotechnology applications (e.g., genetically modified organisms)

5.8 Distributed Databases and Client-Server Architectures

Distributed databases are systems where data is stored across multiple locations, and the system needs to
ensure consistency, reliability, and efficient data access across these locations. Client-server architectures
describe the model where client devices request services from central server systems.

Key Concepts:
 Distributed Databases: Data is distributed across multiple servers or geographical locations, and queries
need to be routed to the correct node. These databases can be homogeneous (same DBMS across all nodes)
or heterogeneous (different DBMSs across nodes).
 Replication and Partitioning: To improve performance and fault tolerance, data can be replicated (stored
on multiple nodes) or partitioned (divided across nodes).
 CAP Theorem: A distributed database system can provide only two of the three guarantees: Consistency,
Availability, and Partition tolerance. This trade-off needs to be considered during system design.

Client-Server Architecture:

 Client: A device or application that requests services or data from the server.
 Server: A machine that provides resources or services to clients, such as hosting a database, performing
computation, or serving web pages.

Use Cases:

 E-commerce platforms (e.g., Amazon, eBay)


 Cloud-based applications (e.g., Google Drive)
 Real-time applications (e.g., stock trading platforms)
Unit 6. Data Warehousing and Data Mining:

Data Warehousing and Data Mining are critical components of modern data analytics, enabling businesses and
organizations to derive insights from large volumes of data. Below is a detailed breakdown of key concepts in
these areas:

6.1 Data Modeling for Data Warehouses

Data modeling for data warehouses involves organizing and structuring data to facilitate efficient querying and
analysis. Data warehouses are designed to support decision-making processes by consolidating data from
various sources and making it available for analysis.

Key Concepts:

 Star Schema: A common data modeling technique where a central fact table (containing quantitative data)
is connected to multiple dimension tables (containing descriptive data).
 Snowflake Schema: A more complex version of the star schema where dimension tables are normalized
into multiple related tables.
 Fact Tables: These tables store the main business metrics or facts (e.g., sales revenue, quantities).
 Dimension Tables: These store descriptive information (e.g., time, product, customer).
 ETL Process: Extract, Transform, Load – the process used to collect data from various sources, clean it, and
load it into the data warehouse.
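
A hedged sketch of a simple star schema (table names, column names and types are assumptions): one fact
table referencing two dimension tables.

CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,
    month     INT,
    quarter   INT,
    year      INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE fact_sales (
    date_key    INT,
    product_key INT,
    quantity    INT,
    revenue     DECIMAL(12, 2),
    FOREIGN KEY (date_key)    REFERENCES dim_date(date_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);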

Use Cases:

 Business Intelligence (BI) reporting


 Sales and performance analysis

6.2 Concept Hierarchy

A concept hierarchy is a way of organizing data at different levels of granularity, typically used in OLAP (Online
Analytical Processing) systems. It allows users to analyze data at various levels of detail.

Key Concepts:

 Hierarchical Levels: Data can be grouped into hierarchical levels, such as "Year > Quarter > Month > Day"
for time data or "Country > State > City" for geographic data.
 Drill-Down and Roll-Up: In OLAP, users can drill down to more detailed data or roll up to higher-level
summaries by navigating the concept hierarchy.

Use Cases:

 Analyzing sales data at different time levels (e.g., monthly, quarterly, yearly)
 Aggregating data by geographic location (e.g., country, region, city)
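
A hedged sketch of drill-down and roll-up along a time hierarchy, using a hypothetical sales table with year,
quarter, month and amount columns:

-- Drill-down view: detailed totals at the month level
SELECT year, quarter, month, SUM(amount) AS total_sales
FROM sales
GROUP BY year, quarter, month;

-- Roll-up view: the month level is dropped to get coarser quarterly totals
SELECT year, quarter, SUM(amount) AS total_sales
FROM sales
GROUP BY year, quarter;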

6.3 OLAP and OLTP

OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are two distinct types of systems
used for different purposes.

 OLAP (Online Analytical Processing):


 Purpose: Designed for complex queries and data analysis.
 Characteristics: Supports multidimensional analysis, aggregations, and drill-down operations.
 Example: A system used to analyze sales data across different regions, products, and time periods.
 OLTP (Online Transaction Processing):
 Purpose: Designed for daily transactional operations.
 Characteristics: Handles high transaction volumes, ensuring data consistency and integrity (e.g., insert,
update, delete operations).
 Example: A banking system or an e-commerce transaction system.

Use Cases:

 OLAP: Business Intelligence, decision support systems


 OLTP: Operational systems like order processing, inventory management

6.4 Association Rules in Data Mining Technique

Association rule mining identifies relationships or patterns between items in datasets. For example, in retail,
association rules can uncover that customers who buy bread are likely to also buy butter.

6.4.1 Key Concepts

Association Rule:

 An association rule is an implication of the form A ⇒ B, where A and B are sets of items. It suggests that if
itemset A is purchased or occurs, itemset B is likely to be purchased or occur as well.
 Example: {Bread} ⇒ {Butter}: This rule means that if a customer buys bread, they are likely to buy butter as
well.

Antecedent and Consequent:

 Antecedent (LHS - Left-Hand Side): The item(s) found in the premise of the rule (e.g., Bread).
 Consequent (RHS - Right-Hand Side): The item(s) that are expected as a result (e.g., Butter).

6.4.2 Key Metrics Used in Association Rules

To evaluate the strength and usefulness of association rules, three key metrics are used:

1. Support:

Support is the proportion of transactions in the database that contain both the antecedent and consequent. It
represents the frequency of the occurrence of the itemset.

Support(A ⇒ B) = (Number of transactions containing both A and B) / (Total number of transactions)

 Example: If 20 out of 100 transactions contain both bread and butter, the support of the rule {Bread} ⇒ {Butter}
is 0.20 (20%).

2. Confidence:

Confidence is the likelihood that the consequent occurs given that the antecedent occurs. It measures the
reliability of the rule.

Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)

Example: If 20 out of 30 transactions containing bread also contain butter, the confidence of the rule {Bread} ⇒
{Butter} is 0.67 (67%).

3. Lift:
Lift measures how much more likely the consequent is to occur when the antecedent is present, compared to
its normal occurrence. It helps to identify rules that are statistically significant, beyond just being frequent.

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

If lift > 1, it indicates that the occurrence of A increases the likelihood of B. If lift = 1, A and B are independent. If lift < 1, the occurrence of A makes B less likely.

6.4.3 Example of Association Rules

Let's say a retailer has 100 transactions in a database, and we are interested in finding relationships between
products. Consider these:

 Support: 20 transactions contain both bread and butter.


 Confidence: Of the 30 transactions that contain bread, 20 also contain butter.
 Lift: The likelihood of buying butter is 2 times higher when bread is bought, compared to the overall likelihood
of buying butter.

Rule Example: {Bread} ⇒ {Butter}

 Support = 0.20 (20% of transactions contain both bread and butter).


 Confidence = 0.67 (67% of transactions containing bread also contain butter).
 Lift = 2 (the likelihood of buying butter is doubled when bread is bought).
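
A minimal plain-Python sketch that computes the three metrics defined above from a list of transactions; the rule_metrics function and the tiny basket list are invented for illustration only.

def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, and lift for the rule antecedent => consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    only_a = sum(1 for t in transactions if antecedent <= t)
    only_b = sum(1 for t in transactions if consequent <= t)
    support = both / n
    confidence = both / only_a          # = Support(A ∪ B) / Support(A)
    lift = confidence / (only_b / n)    # = Confidence(A ⇒ B) / Support(B)
    return support, confidence, lift

# Tiny made-up basket data: each transaction is a set of items.
baskets = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"}, {"milk"}]
print(rule_metrics(baskets, {"bread"}, {"butter"}))
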
6.4.4 Apriori Algorithm (A Common Algorithm for Mining Association Rules)

One of the most popular algorithms used for mining association rules is the Apriori algorithm, which operates in
the following steps:

Generate Candidate Itemsets:

 Start by finding frequent individual items (1-itemsets), and then iteratively combine these to form larger
itemsets (2-itemsets, 3-itemsets, etc.).

Prune Infrequent Itemsets:

 Only consider itemsets whose support is above a minimum threshold. This is done to reduce the search
space and focus on itemsets that are most likely to generate useful rules.

Generate Rules:

 For each frequent itemset, generate association rules by selecting different possible subsets as antecedents
and consequents, and then calculating the confidence and lift.

Example of the Apriori Algorithm Process

Assume we have a dataset of transactions with items like {Bread, Butter, Milk, Cheese, Eggs}. The algorithm will:

 First, calculate the support of individual items.


 Find frequent 1-itemsets, like {Bread}, {Butter}, etc.
 Then, generate candidate 2-itemsets, like {Bread, Butter}, {Bread, Milk}, etc.
 After pruning, it will identify the most frequent itemsets (those that have support above a given threshold).
 Finally, it will generate rules like {Bread} ⇒ {Butter}, and evaluate their confidence and lift.
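
The candidate-generation and pruning loop described above can be sketched compactly in plain Python. This is a simplified illustration of the Apriori idea under the assumption that transactions are sets of items; it is not a production implementation, and the basket data is invented.

def apriori_frequent_itemsets(transactions, min_support):
    """Return all frequent itemsets (as frozensets) whose support meets min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

    # Step 2+: join frequent (k-1)-itemsets into k-itemset candidates, then prune by support.
    k = 2
    while levels[-1]:
        candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
        levels.append({c for c in candidates if support(c) >= min_support})
        k += 1
    return [itemset for level in levels for itemset in level]

baskets = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}, {"butter"}]
print(apriori_frequent_itemsets(baskets, min_support=0.5))

Rules such as {Bread} ⇒ {Butter} would then be generated from these frequent itemsets and scored by confidence and lift as shown earlier.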

Applications of Association Rules

1. Market Basket Analysis: Retailers can use association rules to identify which products are frequently
purchased together. This helps with inventory management, promotional strategies, and cross-selling.
2. Recommendation Systems: Association rules can be used in e-commerce to recommend products that are
often bought together (e.g., "Customers who bought this also bought...").
3. Web Mining: Association rules are used to discover patterns in web page visits. For example, if users visit
one page, they might be interested in visiting another related page.
4. Healthcare: In healthcare, association rules can uncover relationships between diseases, symptoms, and
treatments, helping in clinical decision-making.
5. Fraud Detection: Identifying unusual patterns of transactions or behaviors that may indicate fraudulent
activity.
Limitations of Association Rules

1. Scalability: As the size of the dataset increases, the computational complexity also grows, especially when
dealing with large numbers of items or transactions.
2. Interpretability: The rules can sometimes be complex or difficult to interpret, especially if there are too many
rules or if the rules are weak.
3. Irrelevant Rules: It’s common to generate many association rules that are not useful, which can be
overwhelming. Therefore, proper filtering is required.
4. Context Sensitivity: Association rules don't consider the context of items or the sequence of transactions,
which can be crucial in some cases (e.g., in time-sensitive scenarios).

6.5 Classification in Data Mining Technique

Classification is the process of identifying the class or category that a new observation belongs to, based on a
training set of data containing observations whose class labels are known. The goal is to learn from the training
data and use that knowledge to predict the class of new, unseen data.

Steps in Classification

1. Data Preprocessing:

 Cleaning: Handle missing values, remove noise, and outliers.


 Feature Selection/Extraction: Identify relevant features for building the model.
 Normalization/Standardization: Scale data to a specific range or distribution.

2. Model Selection:

 Choose an appropriate classification algorithm based on the problem and the dataset.

3. Model Training:

 The classification algorithm learns from the training data by identifying patterns or relationships between
features and the target class.

4. Model Evaluation:

 Test the model on a separate set of data (test data) to assess its performance.
 Evaluation metrics like accuracy, precision, recall, F1-score, confusion matrix, etc., are used.

5. Model Deployment:

 Once the model is evaluated and fine-tuned, it is deployed for classifying real-world data.
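
A minimal end-to-end sketch of steps 1-5 above, assuming scikit-learn is available; the Iris dataset and the DecisionTreeClassifier are stand-ins for whatever data and algorithm the problem actually calls for.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# 1. Data preprocessing: split into train/test sets and standardize the features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 2-3. Model selection and training.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 4. Model evaluation on held-out test data (accuracy, precision, recall, F1 per class).
print(classification_report(y_test, model.predict(X_test)))

# 5. "Deployment": classify a new, unseen observation (scaled the same way as the training data).
print(model.predict(scaler.transform([[5.1, 3.5, 1.4, 0.2]])))
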
Popular Classification Algorithms

1. Decision Trees (DT):

 A tree-like structure where each node represents a decision based on a feature, and each branch represents
the outcome of that decision.
 Example: ID3, C4.5, CART (Classification and Regression Trees).

2. K-Nearest Neighbors (KNN):

 A non-parametric method that classifies an instance based on the majority class of its nearest neighbors in
the feature space.
 Simple and intuitive but can be computationally expensive for large datasets.

3. Naive Bayes:

 Based on Bayes’ theorem, this algorithm assumes that the features are independent given the class.
 It's particularly good for high-dimensional data and text classification problems.

4. Support Vector Machines (SVM):

 A supervised learning model that finds the optimal hyperplane that separates data points of different classes.
 Works well for high-dimensional spaces and complex datasets.

5. Random Forests:

 An ensemble method that builds multiple decision trees and combines their predictions.
 Reduces overfitting and generally provides high accuracy.

6. Artificial Neural Networks (ANN):

 Inspired by the human brain, neural networks consist of layers of interconnected nodes (neurons).
 Can capture complex relationships but may require large amounts of data and computational power.

7. Logistic Regression:

 A statistical model used to predict the probability of a binary outcome (0 or 1). It can be extended to multi-
class problems using techniques like "one-vs-all" or "softmax".

8. Gradient Boosting Machines (GBM):

 A boosting technique that builds an ensemble of trees in a sequential manner, where each new tree corrects
the errors of the previous ones.
 Examples: XGBoost, LightGBM, CatBoost.
Evaluation Metrics for Classification

Accuracy: The proportion of correctly classified instances.

Accuracy = (True Positives + True Negatives) / (Total Instances)
Precision: The proportion of positive predictions that are actually correct.

Precision = True Positives / (True Positives + False Positives)
Recall (Sensitivity): The proportion of actual positives that are correctly identified.

Recall = True Positives / (True Positives + False Negatives)

F1 Score: The harmonic mean of precision and recall, giving a balanced measure of classification performance.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Confusion Matrix: A matrix showing the counts of actual vs. predicted classifications, which helps in analyzing
the performance of the classifier.
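
A small plain-Python sketch that computes the four metrics above directly from confusion-matrix counts; the counts in the example call are made up.

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))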

Applications of Classification

 Spam Detection: Classifying emails as spam or not spam based on features like keywords and sender.
 Medical Diagnosis: Classifying medical conditions based on symptoms or test results (e.g., detecting
cancer based on patient data).
 Credit Scoring: Predicting whether a person will default on a loan based on financial history.
 Image Recognition: Classifying images into categories (e.g., identifying whether an image contains a dog or
a cat).
 Customer Segmentation: Classifying customers into different groups based on purchasing behavior.

Challenges in Classification

Imbalanced Data: When one class is underrepresented, which can lead to biased models.

Overfitting: When a model performs well on training data but poorly on unseen data due to excessive complexity.

Interpretability: Some models, especially neural networks, can be hard to interpret compared to simpler
models like decision trees.

6.6 Clustering in Data Mining Technique

Clustering is the process of dividing data into distinct groups or clusters, where data points within each group
share common characteristics. These clusters help to simplify complex data, find hidden patterns, and reveal
the underlying structure of the dataset.

Steps in Clustering

1. Data Preprocessing:

 Cleaning: Handle missing values, noise, and outliers.


 Feature Selection/Extraction: Identify relevant features for clustering.
 Normalization/Standardization: Scale data to a common range or distribution, which is important for
distance-based clustering algorithms.
2. Clustering Algorithm Selection:

 Choose an appropriate clustering algorithm based on the nature of the data, the desired output, and the
underlying structure.

3. Cluster Assignment:

 The algorithm assigns each data point to a cluster based on similarity metrics, often using distance measures
like Euclidean distance.

4. Cluster Evaluation:

 Evaluate the quality of the clusters using internal (within-cluster) and external (compared to known labels or
another dataset) measures.
 Metrics include intra-cluster cohesion, inter-cluster separation, and silhouette score.

5. Cluster Interpretation and Deployment:

 After clustering, interpret the results to identify meaningful patterns or insights from the groups.
 Apply the clusters to business or analytical processes, such as marketing segmentation or anomaly
detection.

Popular Clustering Algorithms

1. K-Means Clustering:

 A centroid-based clustering algorithm where the data is partitioned into k clusters, and the centroids (mean
of points) are updated iteratively to minimize within-cluster variance.
 Pros: Simple, fast, and easy to implement.
 Cons: Requires the number of clusters to be predefined and can struggle with non-spherical clusters or
outliers.

2. Hierarchical Clustering:

 Builds a tree-like structure (dendrogram) of nested clusters. It can be agglomerative (bottom-up) or divisive
(top-down).
 Agglomerative: Starts with individual data points as their own clusters and merges them iteratively.
 Divisive: Starts with one large cluster and splits it recursively.
 Pros: No need to specify the number of clusters in advance.
 Cons: Computationally expensive for large datasets and sensitive to noise.

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

 A density-based algorithm that groups points that are closely packed together, marking points in low-density
regions as noise (outliers).
 Pros: Can find arbitrarily shaped clusters and is robust to noise.
 Cons: Struggles with varying cluster densities and requires careful parameter tuning.

4. Gaussian Mixture Models (GMM):

 A probabilistic model that assumes data is generated from a mixture of several Gaussian distributions. Each
cluster is represented as a Gaussian distribution.
 Pros: Allows for soft clustering (each data point can belong to multiple clusters with different probabilities).
 Cons: Assumes that the data follows a Gaussian distribution, which may not always be true.

5. Self-Organizing Maps (SOM):


 An unsupervised neural network algorithm that maps high-dimensional data onto a lower-dimensional grid
while preserving the topological structure.
 Pros: Useful for visualizing high-dimensional data and capturing complex patterns.
 Cons: Requires careful tuning of the grid size and training parameters.

6. Mean Shift Clustering:

 A non-parametric clustering technique that shifts data points towards the mode (densest area) of the dataset
until convergence.
 Pros: Does not require specifying the number of clusters beforehand.
 Cons: Computationally expensive and can struggle with large datasets.

7. Spectral Clustering:

 Uses the eigenvalues of a similarity matrix to reduce dimensionality and perform clustering in fewer
dimensions.
 Pros: Can capture complex, non-linear relationships between points.
 Cons: Computationally expensive for large datasets and requires choosing an appropriate similarity matrix.

Evaluation Metrics for Clustering

Evaluating clustering results can be challenging since clusters are not labeled. However, several internal and
external metrics can help measure the quality of the clusters:

 Intra-cluster Distance: Measures how similar the points within a cluster are to each other (lower is better).
 Inter-cluster Distance: Measures how distinct the clusters are from each other (higher is better).
 Silhouette Score: Combines both intra-cluster and inter-cluster distances to provide a measure of how well-
separated and cohesive the clusters are. It ranges from -1 (incorrect clustering) to +1 (well-defined
clustering).
 Dunn Index: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster
distance.
 Rand Index: Measures the similarity between two data clusterings, considering both pairs of points that are
either clustered together or apart.
 External Validation: Compares the clustering result to a predefined ground truth (if available) using metrics
like Adjusted Rand Index (ARI) or mutual information.

Applications of Clustering

 Market Segmentation: Identifying distinct groups of customers based on purchasing behavior,


demographics, etc., to tailor marketing strategies.
 Image Segmentation: Grouping pixels of similar colors or intensities to identify objects or regions within an
image.
 Anomaly Detection: Identifying outliers or rare events, such as fraud detection in banking or network
security.
 Document Clustering: Grouping documents or web pages with similar topics for recommendation systems
or information retrieval.
 Biological Data Analysis: Grouping genes, proteins, or samples based on expression patterns in genomics
or proteomics research.

Challenges in Clustering

 Choosing the Right Algorithm: Different clustering algorithms have different strengths and weaknesses
depending on the nature of the data. Selecting the best one can be difficult.
 Determining the Number of Clusters: In many algorithms (e.g., K-Means), the number of clusters must be
predefined, which can be challenging if the right number is unknown.
 Handling Noise and Outliers: Some clustering methods are sensitive to noise and outliers, which can distort
the results.
 Scalability: Many clustering algorithms (especially hierarchical clustering) are computationally expensive
and may struggle with large datasets.
 Interpretability: Some clustering methods, such as DBSCAN or GMM, may produce complex clusters that
are difficult to interpret.

6.7 Data Mining Technique - Regression

It is a fundamental statistical and machine learning technique used for predicting a continuous dependent
variable based on one or more independent variables. It is widely used in data mining for predictive modeling
and data analysis. The goal of regression analysis is to establish a relationship between the dependent variable
and one or more independent variables to make predictions.

Key Concepts in Regression

1. Dependent Variable (Target): The variable we are trying to predict.


2. Independent Variable (Predictors/Features): The variables used to make predictions.
3. Modeling: The process of fitting a mathematical model to the data in order to predict the dependent variable

Types of Regression Techniques

1. Linear Regression

 Definition: A simple form of regression that models the relationship between the dependent variable and
one or more independent variables using a straight line.
 Equation: Y = β0 + β1X + ε
 Where:
o Y is the dependent variable
o X is the independent variable
o β0 is the intercept
o β1 is the coefficient
o ε is the error term
 Use Case: When the relationship between the dependent and independent variables is approximately linear.
 Pros: Simple, fast, easy to interpret.
 Cons: Assumes linearity, sensitive to outliers.
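
A minimal sketch of fitting the equation Y = β0 + β1X + ε by ordinary least squares, assuming numpy is available; the data points are invented and np.polyfit with degree 1 simply returns the fitted slope and intercept.

import numpy as np

# Invented data roughly following y = 2x + 1 plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Ordinary least squares fit of a degree-1 polynomial: returns (beta1, beta0).
beta1, beta0 = np.polyfit(x, y, 1)
print("intercept:", beta0, "slope:", beta1)

# Predict for a new value of the independent variable.
print("prediction at x = 6:", beta0 + beta1 * 6)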

2. Multiple Linear Regression

 Definition: An extension of linear regression that models the relationship between the dependent variable
and multiple independent variables.
 Equation: Y=β0+β1X1+β2X2+⋯+βnXn+ϵ
 Use Case: When there are multiple predictors influencing the dependent variable.
 Pros: Can handle multiple variables, interpretable.
 Cons: Assumes linearity between the dependent and independent variables, prone to multicollinearity.
3. Polynomial Regression

A type of regression that models the relationship between the dependent and independent variables as a higher-
degree polynomial. This is useful when the relationship is not linear but can be captured by a polynomial
equation.

Y = β0 + β1X + β2X² + ⋯ + βnXⁿ + ε

Where X², X³, …, Xⁿ are higher powers of the independent variable.

Key Features:

 Handles non-linear relationships.


 Requires careful selection of the degree of the polynomial to avoid overfitting.

4. Ridge Regression

Ridge regression is a variation of linear regression that adds a penalty on the size of the coefficients to prevent
overfitting, especially when there is multicollinearity (high correlation between predictors). The coefficients are
chosen to minimize the sum of squared errors plus a penalty term λ Σ βi².

Where:

 λ is the regularization parameter.
 βi are the coefficients of the independent variables.

Key Features:

 Helps prevent overfitting by adding a regularization term.


 Performs well when predictors are highly correlated.
 Tends to shrink the coefficients but does not necessarily eliminate them completely.

5. Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is another variant of linear regression that
also includes a penalty term but differs from Ridge in that its penalty, λ Σ |βi|, encourages sparsity in the coefficients.

Key Features:

 Lasso tends to drive some of the coefficients to zero, effectively performing feature selection.
 Useful when we suspect that only a subset of the features are relevant for predicting the target variable.
 Helps in selecting important features and improving model interpretability.

6. Elastic Net Regression

Elastic Net regression is a combination of both Ridge and Lasso regression. It is particularly useful when there
are many correlated predictors. The coefficients are chosen to minimize the sum of squared errors plus a
combined penalty λ1 Σ |βi| + λ2 Σ βi².

Where:

 λ1 is the Lasso (L1) penalty term.
 λ2 is the Ridge (L2) penalty term.

Key Features:

 Combines the benefits of Lasso and Ridge regression.


 Suitable for situations where there are multiple features that are correlated.
 Regularization terms help prevent overfitting.
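
A short sketch comparing the three regularized variants, assuming scikit-learn is available; alpha and l1_ratio play the role of the λ penalties described above, the parameter values are arbitrary, and the synthetic data is invented.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data: five features, but only the first two actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso and Elastic Net tend to drive irrelevant coefficients to exactly zero.
    print(type(model).__name__, np.round(model.coef_, 2))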

7. Stepwise Regression

Stepwise regression is an automated method for selecting a subset of predictors in a model. It iteratively adds or
removes predictors based on their statistical significance.

Key Features:

 Can be used with both forward selection (adding predictors) or backward elimination (removing predictors).
 Useful for reducing the complexity of the model while maintaining predictive power.
 Prone to overfitting and may be computationally expensive.

8. Support Vector Regression (SVR)

Support Vector Regression (SVR) is a regression version of the Support Vector Machine (SVM) technique. It tries
to find a function that fits the data while minimizing the error within a specified margin.

Key Features:

 Effective in high-dimensional spaces.
 Handles non-linear relationships via the use of kernels (e.g., radial basis function).
 Robust to outliers by using an ε-insensitive loss function.

9. Decision Tree Regression

Decision Tree Regression builds a model by splitting the data into subsets based on feature values, resulting in a
tree structure. Each leaf node represents a prediction.

Key Features:

 Non-linear regression method.


 Simple to interpret and visualize.
 Prone to overfitting, especially with deep trees, but can be controlled using pruning or by setting constraints.

10. Random Forest Regression

Random Forest Regression is an ensemble method that uses multiple decision trees to improve predictive
performance. It aggregates predictions from multiple trees to reduce variance and overfitting.

Key Features:

 Uses a collection of decision trees, which helps in reducing overfitting.


 Can handle large datasets with high dimensionality.
 Provides feature importance which can be useful for understanding the data.

11. Gradient Boosting Machines (GBM)

Gradient Boosting Machines are another ensemble method that builds decision trees sequentially, where each
tree tries to correct the errors of the previous one.

Key Features:
 Very effective for both regression and classification tasks.
 Combines multiple weak models (shallow decision trees) into a strong model.
 Can be computationally expensive but yields high accuracy.

6.8 Data Mining Technique - Support Vector Machine (SVM)

Support Vector Machine is a powerful algorithm in data mining and machine learning that works by finding a
decision boundary (hyperplane) that separates the data points of different classes. The decision boundary is
chosen to maximize the margin, i.e., the distance between the hyperplane and the nearest data points of any
class. These nearest points are called support vectors, and they play a critical role in defining the optimal
hyperplane.

SVM can be used for:

 Classification: Dividing the data into different categories (e.g., spam or not spam).
 Regression: Predicting a continuous value (e.g., predicting house prices based on features).

Key Concepts in SVM

 Hyperplane: A hyperplane is a decision boundary that separates data points into different classes. In two
dimensions, it is a line; in three dimensions, it is a plane; and in higher dimensions, it is a hyperplane.
 Support Vectors: The data points that are closest to the hyperplane are called support vectors. These points
are critical in defining the position of the hyperplane. Only the support vectors influence the decision
boundary, making SVM a memory-efficient algorithm.
 Margin: The margin is the distance between the hyperplane and the nearest support vectors. SVM aims to
maximize this margin, as a larger margin typically results in better generalization to unseen data.
 Kernel Trick: SVM uses a mathematical technique called the kernel trick to transform data into a higher-
dimensional space, allowing it to find a linear separation in cases where the data is not linearly separable in
its original space. This allows SVM to handle non-linear classification and regression tasks.

Types of SVM

There are several types of Support Vector Machines, which differ primarily in how they are used for classification
and regression tasks:

1. Linear SVM (for Linearly Separable Data)

Linear SVM works when the data is linearly separable. The algorithm finds a hyperplane that separates the
classes with the largest margin.

Key Features:

 Only works for linearly separable data (i.e., data that can be perfectly separated by a straight line or
hyperplane).
 The optimal hyperplane is found by maximizing the margin between the two classes.

2. Non-Linear SVM (using Kernels)

When data is not linearly separable, SVM can be extended using kernels to map the data to a higher-dimensional
space where a linear hyperplane can be used to separate the classes.

 Kernel Trick: The kernel function transforms the data into a higher-dimensional space without explicitly
computing the transformation. Common kernels include:
 Polynomial Kernel: Maps data into higher-dimensional polynomial spaces.
 Radial Basis Function (RBF) Kernel: Maps data into an infinite-dimensional space and is widely used
for non-linear classification tasks.
 Sigmoid Kernel: Uses the hyperbolic tangent function to transform the data.

3. Support Vector Regression (SVR)

SVM can also be used for regression tasks through SVR. The goal in SVR is to fit the best possible line (or
hyperplane) that captures the majority of the data while keeping the deviation (error) within a certain threshold.
This threshold is called the epsilon margin.

 SVR Objective: Minimize the error while allowing some error for points that fall within the epsilon margin.

Working of SVM

Step-by-Step Process of SVM for Classification:

1. Linear Separability: SVM first checks if the classes can be linearly separated.
2. Choosing a Hyperplane: It finds a hyperplane (decision boundary) that separates the two classes. The best
hyperplane is the one that maximizes the margin, which is the distance between the hyperplane and the
support vectors.
3. Optimization Problem: SVM formulates the task as an optimization problem, where the objective is to
maximize the margin. This is done through quadratic programming.
4. Classification: Once the optimal hyperplane is found, SVM uses it to classify new, unseen data by
determining which side of the hyperplane the new data point falls on.

Step-by-Step Process of SVM for Regression:

 Choosing a Hyperplane: In SVR, the goal is to find a hyperplane that best fits the data within a predefined
margin.
 Error Tolerance: SVR introduces a margin of tolerance within which errors are allowed (the epsilon tube).
Data points that fall within this margin are considered to have zero error.
 Optimization: Like in classification, SVR solves an optimization problem to minimize the error and maximize
the margin.
 Prediction: Once the optimal hyperplane is determined, SVR uses this hyperplane to make predictions.

Advantages of SVM

 High-dimensional Spaces: SVM works well in high-dimensional spaces, making it suitable for tasks where
the number of features is large.
 Effective for Non-linear Data: With the use of the kernel trick, SVM can handle non-linear relationships in
the data effectively.
 Memory Efficiency: SVM is memory efficient because it uses only a subset of training points (the support
vectors) to define the hyperplane.
 Robust to Overfitting: SVM is robust to overfitting, especially when the number of dimensions exceeds the
number of samples, as it focuses on finding a global optimal margin.

Disadvantages of SVM

 Computation Complexity: The training time of SVM can be high, especially for large datasets, as it involves
solving a quadratic optimization problem.
 Choice of Kernel: Selecting the right kernel and tuning hyperparameters (such as 𝐶 and 𝛾) can be
challenging and time-consuming.
 Not Suitable for Large Datasets: SVM may struggle with very large datasets, as the computation complexity
increases with the number of data points.
 Sensitive to Noisy Data: SVM can be sensitive to noise in the data, particularly if the data is not linearly
separable and the wrong kernel is chosen.
Applications of SVM

 Text Classification: SVM is commonly used for text classification tasks, such as spam detection or
sentiment analysis.
 Image Classification: It is used for classifying images based on different visual features.
 Bioinformatics: SVM is used to classify proteins, genes, and other biological data into categories.
 Handwriting Recognition: SVM is applied in optical character recognition (OCR) tasks.
 Face Detection: SVM can classify whether a given image contains a face or not.

Key Parameters in SVM

 C (Regularization Parameter): Controls the trade-off between achieving a low error on the training data and
minimizing the model complexity. A higher value of C aims to classify all training points correctly, which may
lead to overfitting, while a smaller value encourages a larger margin but may lead to underfitting.
 Gamma (Kernel Parameter): For non-linear SVM, gamma defines the influence of a single training example.
A high gamma value means the influence is closer to the training example, and a low value means the
influence is broader. Proper tuning of gamma is critical for good performance.
 Kernel Function: Determines the transformation of the data. Common kernels include linear, polynomial,
and radial basis function (RBF).
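
A brief sketch of how these parameters appear in practice, assuming scikit-learn is available; the toy "moons" dataset and the particular C and gamma values are arbitrary illustrations, not recommendations.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data, so an RBF kernel is a natural choice.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the margin/error trade-off; gamma controls the reach of each training example.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)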

6.9 Data Mining Technique - K-Nearest Neighbour

K-NN is a lazy learning algorithm, meaning that it does not learn an explicit model during the training phase.
Instead, it stores the training data and makes predictions based on the stored data at the time of prediction.

 Classification: K-NN classifies a data point by a majority vote of its K nearest neighbors. The class most
common among the K neighbors is assigned to the data point.
 Regression: In regression, K-NN predicts the output value by averaging the values of the K nearest neighbors.

The algorithm relies on the assumption that similar data points are likely to have the same label or similar output
values, which makes K-NN particularly effective in problems where this assumption holds true.

Working of K-Nearest Neighbors (K-NN)

Classification Task

1. Choose the value of K: The first step is to choose the number of nearest neighbors, K, that will be considered
for making a prediction.
2. Compute the distance: For each data point, compute the distance between the point to be classified and
every other point in the training dataset. Common distance metrics include:
 Euclidean Distance: The straight-line distance between two points in Euclidean space.

 Manhattan Distance: The sum of the absolute differences of the coordinates.


 Cosine Similarity: Measures the cosine of the angle between two vectors, useful in high-dimensional
data like text.
 Minkowski Distance: Generalization of both Euclidean and Manhattan distances.
3. Find the K nearest neighbors: Sort the distances in ascending order and select the K smallest distances,
i.e., the K nearest neighbors.
4. Vote for the class label: In classification, the majority class among the K nearest neighbors is assigned to
the data point.
5. Return the predicted label: The data point is assigned the class label with the highest frequency among the
K neighbors.

Regression Task

1. Choose the value of K: Select the number of neighbors, K, to use for the prediction.
2. Compute the distance: Calculate the distance between the target point and all other training points.
3. Find the K nearest neighbors: Select the K nearest points based on the computed distance.
4. Average the output values: In regression, rather than voting, the predicted output is the average of the output
values of the K nearest neighbors.
5. Return the predicted value: The predicted value is the mean (or median, depending on the variant) of the K
nearest neighbors.

Key Parameters in K-NN

 K (Number of Neighbors): The value of K determines how many neighbors are considered when making a
prediction. A small value of K can be noisy and sensitive to outliers, while a large K makes the algorithm more
resistant to noise but can blur the decision boundary, making it less sensitive to subtle patterns.
 Distance Metric: The choice of distance metric influences the accuracy and performance of K-NN.
Euclidean distance is commonly used, but other metrics like Manhattan or cosine similarity can be more
suitable for certain types of data (e.g., categorical or high-dimensional data).
 Weighting of Neighbors: Instead of giving each of the K neighbors equal weight, you can assign weights to
the neighbors, such that closer neighbors have more influence on the classification or regression outcome.

Advantages of K-NN

 Simple to Understand and Implement: K-NN is easy to understand and implement, making it a good choice
for beginners in machine learning.
 No Training Phase: Since K-NN is a lazy learner, it doesn’t require a time-consuming training phase, as it
directly stores the training data and uses it for prediction at query time.
 Non-Linear Decision Boundaries: K-NN can handle complex decision boundaries that are non-linear, unlike
some algorithms that assume linearity.
 Versatility: It can be used for both classification and regression tasks.

Disadvantages of K-NN

 Computational Complexity: K-NN requires computing the distance between the test point and all training
points. This makes the algorithm computationally expensive, especially with large datasets.
 High Memory Usage: Since K-NN stores all training data for future predictions, it can consume a lot of
memory, which is a concern with large datasets.
 Sensitivity to Irrelevant Features: K-NN is sensitive to the curse of dimensionality, meaning that as the
number of features (dimensions) increases, the performance of K-NN can degrade because all points appear
equally distant in high-dimensional spaces.
 Sensitivity to Noisy Data: K-NN can be sensitive to noisy data, especially with a small value of K, since
outliers or noisy points can influence the classification or regression results.

Choosing the Optimal K

Choosing the right value for K is crucial for the performance of the K-NN algorithm. The optimal value of K
depends on the data and the problem at hand. A small K leads to a model that is highly sensitive to noise, while
a large K leads to underfitting. To choose the optimal K:

 Cross-Validation: Use techniques like k-fold cross-validation to test the model’s performance with different
values of K and choose the one that results in the best performance.
 Odd vs. Even K: For classification tasks, it is generally recommended to use an odd value for K to avoid ties
between classes, especially if the number of classes is 2.
 Empirical Tuning: Experiment with different values of K and evaluate performance on a validation set to
identify the optimal value.

Applications of K-NN

 Image Recognition: K-NN can be used to classify images based on pixel values or other image features.
 Recommendation Systems: K-NN is used in collaborative filtering to recommend products or services by
finding users or items similar to the target user or item.
 Anomaly Detection: K-NN can be used to detect outliers or anomalies by identifying points that are far from
their nearest neighbors.
 Medical Diagnosis: K-NN can help in classifying diseases based on patient data and known medical
conditions.
 Finance: In finance, K-NN is used for tasks like credit scoring and fraud detection by analyzing the behavior
of customers or transactions.

Optimizations for K-NN

1. Dimensionality Reduction: Since K-NN can struggle with high-dimensional data, techniques like Principal
Component Analysis (PCA) or t-SNE can be used to reduce the dimensionality before applying K-NN.
2. Approximate Nearest Neighbors (ANN): To speed up K-NN, approximate nearest neighbor search
algorithms such as KD-Trees, Ball Trees, or Locality Sensitive Hashing (LSH) can be used to speed up the
process of finding the nearest neighbors, especially in high-dimensional spaces.
3. Weighted K-NN: Instead of giving equal weight to all K neighbors, give more weight to closer neighbors using
a weighting function based on distance (e.g., Gaussian kernel or inverse distance weighting).

6.10 Data Mining Technique - Hidden Markov Model

HMM is a statistical model used to describe systems that follow a Markov process with unobservable ("hidden")
states. In data mining, HMMs are widely used for modeling sequential or time-series data, where the goal is to
uncover hidden patterns or predict future states based on the observed data. HMMs are particularly effective in
scenarios where the system being modeled has some inherent sequential dependency (e.g., text, speech,
biological sequences, etc.).

6.10.1 Key Concepts of Hidden Markov Models:

1. States: The system is assumed to be in one of a finite number of states at any time. These states are "hidden"
because they cannot be directly observed.
2. Observations: While the states themselves are hidden, we can observe some data generated by the system.
Each state generates an observation based on some probability distribution. The observed data is used to
infer the hidden state.
3. Transition Probabilities: The probability of moving from one state to another. These are typically denoted by
P(qt∣qt−1), where qt is the state at time 𝑡, and 𝑞𝑡−1 is the state at the previous time step.
4. Emission Probabilities: The probability of observing a particular observation given a state.
5. Initial Probabilities: The probability distribution over the initial state at time t = 0. These are denoted by P(q0).

6.10.2 Mathematical Representation:

 A set of states Q={q1,q2,…,qN}


 A set of observations O={o1,o2,…,oM}
 Transition probabilities: A=[aij] where aij=P(qt=qj∣qt−1=qi)
 Emission probabilities: B=[bij] where bij=P(ot=oj∣qt=qi)
 Initial state probabilities: π=[πi], where πi=P(q1=qi)

6.10.3 Types of Problems in HMM:

1. Evaluation Problem (Forward Algorithm):

 Given an observed sequence, compute the probability of the sequence under the model. This helps in
evaluating how well a given HMM explains the observed data.
 Forward Algorithm: This is an efficient way to compute the probability of a sequence of observations given
the model parameters. It works by recursively calculating the probability of being in each state at each time
step (a minimal sketch appears after this list).

2. Decoding Problem (Viterbi Algorithm):

 Given a sequence of observations, determine the most likely sequence of hidden states. This is known as the
decoding problem.
 Viterbi Algorithm: A dynamic programming approach to find the most likely sequence of hidden states given
a sequence of observations.

3. Learning Problem (Baum-Welch Algorithm):

 Given a sequence of observations and an initial HMM, update the parameters (transition and emission
probabilities) to maximize the likelihood of the observations. This is done through expectation-maximization
(EM).
 Baum-Welch Algorithm: A special case of the EM algorithm used for HMMs to find the parameters that
maximize the likelihood of the observed data.

6.10.4 Working of Hidden Markov Models:

1. Assumptions:

 The system is Markovian, meaning the probability of being in a given state at time 𝑡 depends only on the state
at time 𝑡−1 (Markov property).
 The observations at each time step are dependent on the state, but the observations at different time steps
are independent given the state.

2. Process:

 At each time step 𝑡, the system is in a state 𝑞𝑡.


 The system generates an observation 𝑜𝑡 based on the current state 𝑞𝑡 according to the emission probabilities.
 The system then transitions to another state 𝑞𝑡+1 based on the transition probabilities.

3. Modeling Sequential Data:

 For example, in a speech recognition task, the hidden states could represent phonemes, and the
observations could be the sound features extracted from the speech signal. The HMM would then model the
sequence of phonemes (hidden states) that generated the observed sound features

6.10.5 Applications of Hidden Markov Models:

HMMs are used in a wide range of applications due to their ability to model sequential data. Some common use
cases include:

1. Speech Recognition: In speech recognition, HMMs are used to model the sequence of phonemes or words
in speech. Each state represents a phoneme, and the observations represent the acoustic features extracted
from the speech signal.
2. Natural Language Processing (NLP): HMMs can be used for tasks like part-of-speech tagging, named entity
recognition, and language modeling. The hidden states represent grammatical categories, and the
observations are the words in the sentence.
3. Bioinformatics: HMMs are widely used in bioinformatics for tasks such as gene prediction, protein structure
prediction, and DNA sequence alignment. In this case, the hidden states represent biological sequences or
structural elements, while the observations are the actual nucleotide or amino acid sequences.
4. Time Series Prediction: HMMs can model time-dependent data, such as stock prices or weather patterns,
where the hidden states represent different market or weather regimes, and the observations are the market
indicators or weather measurements.
5. Robotics and Control Systems: In robotics, HMMs can be used to model robot behaviors or environments.
For example, the robot’s possible states (e.g., "moving," "idle") are hidden, and the observations could be
sensor readings indicating the robot's position.
6. Video Analysis: HMMs can be applied in video analysis, where the states represent different actions (e.g.,
walking, running, sitting) and the observations represent features extracted from frames in the video.

6.10.6 Advantages of HMMs:

 Sequential Data Modeling: HMMs are excellent for modeling data with temporal dependencies, such as
time series or sequences.
 Flexibility: HMMs can be applied to a variety of domains, including speech, NLP, biology, and more.
 Handles Uncertainty: The hidden states and probabilistic nature of HMMs make them suitable for situations
where there is uncertainty or incomplete information.

6.10.7 Disadvantages of HMMs:

 Assumptions of Markov Property: HMMs assume that the current state depends only on the previous state.
This assumption may not hold in all cases, leading to limitations in modeling complex dependencies.
 State Space Explosion: When the number of states or observations increases, the model can become
computationally expensive.
 Local Optima: The Baum-Welch algorithm used for learning can converge to local optima, so careful
initialization and tuning may be required.

6.11 Data Mining Technique - Summarization

It is a technique used in data mining to extract essential information from large datasets and present it in a
simplified and concise form. It involves reducing the complexity of data while retaining its key characteristics
and patterns. The goal of summarization is to provide an overview of the data that can be easily understood and
analyzed by decision-makers, without losing important information.

6.11.1 Types of Summarization

1. Statistical Summarization: This type of summarization involves providing a set of summary statistics that
describe the central tendency, spread, and distribution of the data. Common statistics used include:
 Mean: The average value of a dataset.
 Median: The middle value of a dataset when sorted in order.
 Mode: The most frequent value in a dataset.
 Variance and Standard Deviation: Measures of data spread or dispersion.
 Skewness: A measure of the asymmetry of the data distribution.
 Kurtosis: A measure of the "tailedness" or sharpness of the data distribution.
2. Text Summarization: This is a specific form of summarization for unstructured data, particularly textual
data. Text summarization aims to extract the most important information from documents, articles, or texts
while keeping the key ideas intact. There are two primary types of text summarization:
 Extractive Summarization: Involves selecting and extracting key sentences or phrases directly from the
text to form a summary. It focuses on retaining important phrases without changing the original structure
of the content.
 Abstractive Summarization: Involves generating a summary in the form of new sentences, paraphrasing
or rewording the content, rather than directly extracting parts of the original text. This requires more
advanced natural language processing (NLP) techniques like neural networks.
3. Data Cube Summarization: This is used in the context of multidimensional databases (e.g., OLAP systems),
where large amounts of data are stored in a multidimensional array. Data summarization techniques
aggregate data along various dimensions (e.g., time, region, product) to generate insights. Examples of
summarization operations include:
 Roll-up: Aggregating data to a higher level of abstraction (e.g., summing sales for each quarter instead of
each month).
 Drill-down: Going into more detail (e.g., breaking down quarterly sales into monthly sales).
 Pivoting: Reorganizing data to get different perspectives (e.g., summarizing data by region instead of by
product).
4. Clustering-Based Summarization: Clustering techniques (like K-means, DBSCAN, or hierarchical
clustering) are used to group similar data points together. Summarization is achieved by representing each
cluster with a centroid or a representative data point. This technique is often used in cases where data points
are similar, and the aim is to create a compact summary by grouping similar objects.
5. Sampling-Based Summarization: Sampling methods aim to generate a representative subset of the data.
This subset is used to summarize the dataset, preserving its key characteristics. Common methods include:
 Random Sampling: Selecting a random subset of the data points.
 Stratified Sampling: Dividing the dataset into strata and then sampling from each stratum
proportionally.
 Reservoir Sampling: A method for sampling data when the total dataset size is unknown or when the
data arrives in a stream.
6. Dimensionality Reduction-Based Summarization: In high-dimensional datasets (e.g., datasets with many
features or variables), techniques such as Principal Component Analysis (PCA) or t-SNE are used to reduce
the number of dimensions while preserving the most important features. The summary is created by focusing
on the most significant dimensions or features, which help in understanding the main trends in the data.
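
A short sketch of statistical summarization (type 1 above) and simple aggregation, assuming pandas is available; the sales figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [120.0, 150.0, 90.0, 300.0, 110.0],
})

# Statistical summarization: central tendency, spread, and distribution shape.
print(sales["amount"].describe())          # count, mean, std, min, quartiles, max
print("skewness:", sales["amount"].skew())

# Aggregation: summarize many rows into one value per group.
print(sales.groupby("region")["amount"].agg(["sum", "mean", "max"]))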

6.11.2 Methods of Data Summarization

1. Descriptive Summarization:

 This approach describes the key characteristics of the dataset, such as using measures of central
tendency, variability, and distributions.
 Examples include generating histograms, frequency distributions, and box plots, which provide visual
representations of the data.
 This can also involve summary tables that present statistical summaries like means, variances, and
ranges.

2. Aggregation:

 Aggregating data points into groups to provide high-level summaries.


 This can be done using operations like sum, average, min, max, and count to summarize a large set of
data into smaller, manageable groups (e.g., summarizing sales by region or time period).

3. Feature Selection:

 Identifying the most important features or variables in the dataset and discarding irrelevant or redundant
features.
 This helps in reducing the complexity of the dataset while retaining the most useful information for
analysis.

4. Trend Analysis:

 Analyzing the data for trends over time, which is especially useful in time-series analysis. This might
involve summarizing patterns in sales data over months or years, detecting seasonality, or forecasting
future trends.
 Techniques like moving averages, exponential smoothing, and autoregressive models are often used for
trend analysis.

6.11.3 Advantages of Data Summarization:

 Simplification: Reduces the complexity of large datasets, making them easier to interpret and analyze.
 Efficient Decision-Making: Summarized data helps decision-makers to quickly grasp the essential insights
without being overwhelmed by the raw data.
 Enhanced Visualization: Summarized data can be presented using charts, graphs, and tables that highlight
the most important trends.
 Improved Storage: By reducing the amount of data stored (e.g., through aggregation or sampling),
summarization can lead to more efficient storage solutions.

6.11.4 Disadvantages of Data Summarization:

 Loss of Information: While summarization reduces complexity, it can also lead to the loss of some fine-
grained details or nuances in the data.
 Risk of Misinterpretation: Summarized data might oversimplify complex relationships, leading to potential
misinterpretation if not presented carefully.
 Bias: Summarization techniques, particularly those involving sampling or aggregation, may introduce biases
depending on how the summary is generated.

6.12 Data Mining Technique - Dependency Modeling

It refers to techniques in data mining that aim to model the relationships or dependencies between variables or
attributes in a dataset. In many real-world scenarios, variables do not exist in isolation but instead are related to
one another. By identifying and understanding these dependencies, we can make more accurate predictions,
uncover hidden patterns, and generate useful insights from the data. Dependency modeling is an important
aspect of predictive modeling, association analysis, and causal inference.

6.12.1 Types of Dependencies in Data Mining

1. Functional Dependency:

 A functional dependency exists when one variable determines another. For example, in a dataset of
employees, the employee ID can determine other attributes like name, department, and salary.
 In database normalization, functional dependencies are important to remove redundancy and ensure data
integrity.

2. Statistical Dependency:

 This refers to the relationship between variables based on statistical measures such as correlation or
covariance.
 Correlation quantifies the strength and direction of a linear relationship between two variables. For example,
height and weight may show a positive correlation, while the number of hours studied and exam scores might
also show a relationship.
 Covariance is another measure that indicates the directional relationship between two variables.

3. Conditional Dependency:

 Conditional dependence occurs when the relationship between two variables is influenced by a third
variable.
 For instance, the relationship between hours of study and exam scores might depend on the difficulty of the
exam. If you introduce the variable "exam difficulty," the dependency between study hours and exam scores
may change.

4. Causal Dependency:

 Causal dependency aims to model direct cause-and-effect relationships. It helps answer questions like
"Does X cause Y?" and is often explored using causal inference methods or probabilistic graphical models
(e.g., Bayesian networks).
 Causal modeling is used in fields like healthcare, economics, and social sciences to determine the effects
of interventions.
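
A small numpy sketch of statistical dependency (type 2 above): covariance and Pearson correlation between two invented variables, hours studied and exam score.

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6], dtype=float)
exam_score    = np.array([52, 55, 61, 64, 70, 75], dtype=float)

# Covariance indicates the direction of the joint variation.
print("covariance matrix:\n", np.cov(hours_studied, exam_score))

# Pearson correlation quantifies the strength of the linear relationship (-1 to +1).
print("correlation matrix:\n", np.corrcoef(hours_studied, exam_score))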

6.12.2 Methods for Dependency Modeling

1. Association Rule Mining:

 Association rule mining is a popular technique used to find dependencies between variables in large
datasets, typically in transactional data.
 It is most commonly used in market basket analysis to find relationships between products bought together.
 Apriori and FP-Growth are common algorithms for discovering association rules in large datasets.
 Example: If a customer buys bread, they are likely to buy butter (i.e., {bread} → {butter}).

2. Bayesian Networks:

 A Bayesian Network is a probabilistic graphical model that represents a set of variables and their conditional
dependencies through a directed acyclic graph (DAG). Each node represents a variable, and edges represent
dependencies.
 These models are particularly useful for modeling causal relationships and reasoning under uncertainty.
 Bayesian networks can be used to make predictions, estimate missing values, or evaluate the impact of
interventions.

3. Regression Analysis:

 Linear Regression models the relationship between a dependent variable and one or more independent
variables. It assumes a linear relationship between the variables.
 Logistic Regression is used when the dependent variable is categorical. It models the probability of a
categorical outcome based on predictor variables.
 Multiple Regression is an extension of linear regression, where more than one independent variable is used
to predict the dependent variable.
 These regression models help in understanding how changes in independent variables affect the dependent
variable.
4. Decision Trees:

 A decision tree is a non-linear model used for classification and regression tasks. It recursively splits the data
into subsets based on the feature that best splits the data, aiming to reduce impurity (for classification) or
variance (for regression).
 Decision trees implicitly model dependencies between features by selecting splits that best capture
relationships in the data.

5. Markov Models:

 Markov models are used to model sequential or time-dependent data. A Markov Chain models the
dependency of a variable on its previous state, while a Hidden Markov Model (HMM) is a more advanced
version where the system is assumed to be in a hidden state.
 These models are commonly used in speech recognition, text generation, and temporal event prediction.

6. Neural Networks:

 Neural networks, particularly feed-forward networks, model complex relationships between input features
and output predictions. They can capture non-linear dependencies between variables and have been used
extensively in fields like image recognition and natural language processing.
 In deep learning, networks with multiple layers (deep neural networks) can model very complex
dependencies.

7. Copulas:

 Copulas are statistical tools used to model dependencies between random variables. They allow us to
describe the relationship between variables without assuming a specific joint distribution.
 Copulas are especially useful in finance and insurance to model the dependencies between different
financial instruments or risk factors.

8. Conditional Probability:

 Conditional probability models the probability of an event occurring given that another event has already
occurred.
 This is important for understanding the likelihood of certain outcomes under specific conditions. For
example, given a person's age and smoking status, what is the probability that they will develop lung cancer?

6.12.3 Steps in Dependency Modeling

1. Data Preprocessing:

 Before modeling dependencies, data often needs to be cleaned and transformed. This may involve handling
missing values, encoding categorical variables, and normalizing or scaling features.

2. Feature Selection:

 In many cases, not all features are equally important. Feature selection techniques like correlation-based
filtering or mutual information can help identify which features are most relevant for modeling dependencies.

3. Model Building:

 Choose a method for modeling the dependencies, depending on the nature of the data and the problem at
hand. For example, regression models are suitable for continuous variables, while decision trees are often
used for classification tasks.
4. Model Evaluation:

 Evaluate the performance of the model using metrics like accuracy, precision, recall, F1-score, mean
squared error (MSE), or R-squared, depending on the type of model and task (e.g., classification or
regression).

5. Interpretation:

 Once the model is built, it is important to interpret the dependencies. For instance, regression coefficients
indicate the strength and direction of relationships between variables, while decision tree splits indicate
which features most influence the target variable.
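To make step 2 concrete, here is a minimal correlation-based filtering sketch (assuming pandas; the feature values are made up):

import pandas as pd

df = pd.DataFrame({
    "age":      [25, 32, 47, 51, 62],
    "income":   [30, 42, 65, 70, 80],
    "children": [0, 2, 1, 3, 2],
    "target":   [28, 40, 66, 72, 79],
})

# Absolute correlation of every feature with the target; keep the strongest ones.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
selected = corr[corr > 0.5].index.tolist()
print(corr)
print("Selected features:", selected)

Mutual-information-based selection works similarly but can also pick up non-linear dependencies.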

6.12.4 Applications of Dependency Modeling

1. Market Basket Analysis:

 Identifying product dependencies (e.g., customers who buy one product are likely to buy another). This is
often used in retail and e-commerce to recommend products or optimize product placement.

2. Healthcare and Medicine:

 In healthcare, dependency models are used to understand how different factors (e.g., age, lifestyle, genetics) contribute to diseases or outcomes. Causal dependency modeling can also help in identifying the effects of
medical treatments or interventions.

3. Financial Risk Modeling:

 In finance, understanding dependencies between financial assets, market conditions, and economic factors
is crucial for portfolio management and risk assessment. Copulas are often used to model dependencies
between different risk factors.

4. Time Series Forecasting:

 In time series data, dependencies between different time periods can be modeled to predict future values
(e.g., stock prices, weather forecasts, energy consumption).

5. Predictive Maintenance:

 In industrial systems, dependency models can help in predicting when equipment is likely to fail by
understanding how different components and operating conditions influence failure.

6.12.5 Advantages of Dependency Modeling:

 Understanding Relationships: It helps in uncovering important relationships between variables that can be
used for prediction, optimization, or decision-making.
 Improved Accuracy: By modeling dependencies, we can improve the accuracy of predictive models, as they
take into account how different variables influence each other.
 Causal Inference: Dependency modeling allows us to understand not just correlations but also potential
causal relationships, which can be important for interventions and policy decisions.

6.12.6 Disadvantages of Dependency Modeling:

 Complexity: Some dependency modeling techniques, such as Bayesian networks or neural networks, can
become computationally complex, especially with large datasets.
 Data Requirements: Dependency modeling often requires large amounts of data to accurately capture
relationships, particularly in the case of probabilistic or machine learning models.
 Risk of Overfitting: If the model is too complex, it may overfit the data, leading to poor generalization on
unseen data.

6.13 Data Mining Technique - Link Analysis

It is a technique used in data mining to explore relationships between entities by examining the connections or
links between them. It is often used to analyze networks, identify patterns, and understand the structure of
complex systems, such as social networks, web pages, or communication systems.

6.13.1 Key Concepts of Link Analysis:

Entities and Links:

 Entities: These are the objects or nodes in a network (e.g., people, web pages, or organizations).
 Links: These are the relationships or connections between entities (e.g., friendships, hyperlinks, or business
transactions).

Graph Representation:

 Link analysis often represents the data in the form of a graph, where nodes represent entities and edges
(links) represent the relationships between them.

Network Structure:

 The relationships between entities can create complex network structures that reveal valuable insights, such
as clusters of closely related entities or the central nodes in a network.

6.13.2 Applications of Link Analysis:

Social Network Analysis:

 In social networks, link analysis helps identify influencers or key individuals based on their connections and
interactions with others. Techniques like centrality measures (e.g., degree centrality, betweenness centrality)
are commonly used.

Web Mining:

 PageRank: One of the most famous applications of link analysis in web mining is Google's PageRank
algorithm, which ranks web pages based on the number and quality of links pointing to them.
 Link analysis can also identify the structure of websites and optimize web crawling.

Fraud Detection:

 Link analysis is used to detect fraudulent activity in financial transactions, identifying suspicious patterns
like money laundering or Ponzi schemes by analyzing the flow of money between accounts.

Recommendation Systems:

 In recommendation systems (e.g., in e-commerce or media platforms), link analysis can identify
relationships between products, users, and preferences based on past behaviors and interactions.

Epidemiology and Spread of Disease:

 Link analysis can be used to track the spread of diseases by studying the relationships between individuals
(e.g., through contact tracing in case of epidemics).
6.13.3 Techniques Used in Link Analysis:

Centrality Measures:

 Degree Centrality: Measures the number of direct connections a node has.


 Betweenness Centrality: Measures how often a node acts as a bridge along the shortest path between other
nodes.
 Closeness Centrality: Measures how close a node is to all other nodes in the network, based on the inverse of its average shortest-path distance.
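These centrality measures can be computed directly with NetworkX (the Python package mentioned later in this unit); the small graph below is made up for illustration:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.degree_centrality(G))       # based on the number of direct connections
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths
print(nx.closeness_centrality(G))    # how close a node is to all other nodes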

Clustering:

 Link analysis can identify clusters of nodes that are more densely connected to each other than to nodes
outside the cluster. This is often used in community detection.

PageRank Algorithm:

 It works by analyzing the incoming and outgoing links to a webpage, assuming that a page is more important
if it is linked to by other important pages.

Social Network Analysis (SNA):

 Involves studying the relationships and patterns between individuals, groups, or organizations to understand
how information, influence, or resources flow through the network.

Link Prediction:

 Link prediction algorithms forecast potential future links or relationships in a network. These are used, for
example, to predict friendship connections in social networks.

6.13.4 Advantages of Link Analysis:

 Identifying key players: It helps in identifying influential or central entities in a network.


 Understanding network behavior: It allows for better understanding of the network structure and dynamics.
 Pattern recognition: Link analysis helps in recognizing patterns that might be otherwise hidden in large
datasets.

6.13.5 Challenges in Link Analysis:

 Scalability: Analyzing large networks with millions of nodes and links can be computationally expensive.
 Noise: In some networks, irrelevant or noisy data might distort the analysis.
 Complexity: Complex networks may have dynamic links, requiring advanced algorithms to capture evolving
relationships.

6.13.6 Example Use Case:

Consider a social media platform like Facebook. Link analysis could be used to study the connections between
users. By analyzing who is connected to whom (e.g., friends or followers), the platform can suggest new friends
or communities and also identify key influencers based on centrality measures.

6.14 Data Mining Technique - Sequencing Analysis

It is a data mining technique that focuses on identifying patterns, relationships, or trends in sequences of data.
The goal is to uncover meaningful associations between events or items that occur in a particular order over time
or in a sequence. It is widely applied in fields like bioinformatics, marketing, fraud detection, and text mining.

Key Concepts of Sequencing Analysis


1. Sequential Patterns: These are patterns where the elements occur in a specific order over time or within a
dataset. For instance, in a retail environment, customers may exhibit a sequential pattern like buying item A
first, followed by item B.
2. Sequential Rule Mining: This involves discovering rules that describe sequences of events. For example, it
could uncover rules such as "If a customer buys a laptop, they are likely to buy a mouse within the next
month."
3. Sequence Database: This is a collection of sequences, each containing ordered items or events. These can
represent anything from time series data to customer transaction logs.
4. Frequent Sequential Patterns: This is a subset of patterns that appear frequently across sequences in the
database. The challenge is to identify these patterns efficiently, especially when the dataset is large.

Applications of Sequencing Analysis

1. Market Basket Analysis: One of the most common applications. It involves discovering product sequences
that tend to appear together in customer transactions. For example, customers who purchase a camera
might also purchase a memory card within a week.
2. Web Page Access Patterns: This technique can be applied to web logs to uncover patterns in how users
navigate between pages. For example, a user might visit the homepage first, then proceed to the product
page, and finally, check out.
3. Bioinformatics and Genetics: Sequencing analysis is used to study genetic sequences, such as DNA or
protein sequences, to identify recurring motifs or biological patterns.
4. Fraud Detection: In the financial industry, sequencing analysis can be used to detect unusual patterns in
transaction sequences, which might indicate fraudulent activities.
5. Recommendation Systems: This technique helps improve the recommendations given to users by
analyzing the sequence of actions or purchases they make, leading to more relevant suggestions.

Algorithms Used in Sequencing Analysis

1. Apriori Algorithm: Originally designed for frequent itemset mining in transaction databases, the Apriori
algorithm can be adapted to sequence mining by finding frequent subsequences that appear in the same
order across multiple sequences.
2. GSP (Generalized Sequential Pattern): This algorithm extends the Apriori algorithm to handle sequential
data. It identifies frequent subsequences by searching for itemsets that appear in the same order but may be
separated by other items.
3. SPADE (Sequential Pattern Discovery using Equivalence Classes): This is a more efficient algorithm for
sequential pattern mining. It uses the concept of equivalence classes to reduce the search space and make
it faster.
4. PrefixSpan (Prefix-projected Sequential Pattern Mining): Unlike other algorithms, PrefixSpan avoids
candidate generation by projecting the sequence database based on prefixes and recursively finding frequent
patterns.
5. Closed Sequential Pattern Mining: This technique finds "closed" sequential patterns, meaning that no
super-sequence can have the same frequency, making it efficient for compressing the sequence data.
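The sketch below is not an implementation of GSP, SPADE, or PrefixSpan; it is only a naive illustration (plain Python, made-up transactions) of what "frequent sequential patterns" means, by counting ordered item pairs across sequences:

from itertools import combinations
from collections import Counter

sequences = [
    ["camera", "memory_card", "tripod"],
    ["laptop", "mouse", "memory_card"],
    ["camera", "tripod", "memory_card"],
]

counts = Counter()
for seq in sequences:
    # combinations() preserves order, so (a, b) means a occurs before b in the sequence.
    counts.update(set(combinations(seq, 2)))   # set() counts each pair once per sequence

min_support = 2
frequent = {pair: c for pair, c in counts.items() if c >= min_support}
print(frequent)   # e.g., ('camera', 'memory_card') is supported by 2 of the 3 sequences

Real algorithms such as PrefixSpan avoid enumerating all candidate pairs and scale to long sequences and large databases.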

Steps in Sequencing Analysis

1. Preprocessing: The sequence data is cleaned and organized. Missing values, noise, and irrelevant
information are removed. The sequences are formatted properly to enable effective analysis.
2. Pattern Discovery: Algorithms are applied to find frequent sequential patterns. This step involves searching
through the sequence database to identify patterns that appear often.
3. Postprocessing: After finding frequent patterns, the results are analyzed to identify the most relevant
patterns. The sequences are validated and interpreted based on the specific business or research context.
4. Visualization: In some cases, it’s helpful to visualize the patterns or sequences using graphs or charts to
reveal trends and relationships more clearly.

Challenges in Sequencing Analysis

1. Complexity: Sequential data can be large and complex, especially when dealing with long sequences and
large databases. This makes the search for patterns computationally expensive.
2. Noise and Incompleteness: Real-world sequence data often contain noise or missing values, which can
affect the accuracy of the pattern discovery process.
3. Scalability: As the size of the dataset grows, the algorithms used for sequencing analysis can become less
efficient. This requires the development of scalable methods.
4. Dynamic Patterns: In some applications, the patterns may change over time, which requires dynamic
models that can adapt to evolving sequences.

6.15 Data Mining Technique – Social Network Analysis

SNA is a data mining technique used to analyze social structures through the study of relationships and
interactions between individuals or entities within a network. It focuses on identifying patterns, key influencers,
and the flow of information in networks of people or organizations.

The relationships in a social network can be represented as nodes (individuals or entities) and edges
(connections between them), forming a graph structure. SNA helps uncover hidden insights about communities,
collaborations, influence, and information flow.

Key Concepts in Social Network Analysis

1. Nodes: These are the individual elements of a network, representing entities like people, organizations, or
devices. Each node can have attributes that describe characteristics (e.g., age, location, interests).
2. Edges (Links): These are the connections between nodes that represent relationships, interactions, or
communications. Edges can be directed (one-way) or undirected (two-way), and they can also carry weights
that represent the strength or frequency of the relationship.
3. Graph: The entire network of nodes and edges forms a graph. This structure is used to represent and analyze
the relationships in the network.
4. Centrality: Centrality measures how important a node is within a network. Several types of centrality include:
 Degree Centrality: Measures the number of direct connections a node has.
 Betweenness Centrality: Measures the extent to which a node lies on the shortest path between other
nodes, indicating its role in connecting different parts of the network.
 Closeness Centrality: Measures how quickly a node can reach other nodes in the network, indicating its efficiency in spreading information.
 Eigenvector Centrality: Measures a node's influence based on the centrality of its neighbors, often used
to identify influential nodes in a network.
5. Community Detection: In social networks, communities refer to groups of nodes that are more densely
connected to each other than to the rest of the network. Community detection algorithms identify these
clusters, helping to uncover hidden groupings or sub-networks within the larger network.
6. Homophily: The tendency of individuals to associate with similar others (e.g., people with shared interests,
backgrounds, or behaviors), which is often a crucial factor in how networks evolve.
7. Network Density: This is a measure of the proportion of possible connections in a network that are actually
present. It helps to understand how tightly connected the network is.

Applications of Social Network Analysis


1. Social Media Analysis: SNA is widely used to analyze platforms like Facebook, Twitter, or Instagram, helping
to understand user behavior, identify influencers, measure engagement, and detect viral content. For
example, SNA can be used to find out how information spreads through a network of users or to identify the
most influential people in a social media community.
2. Recommendation Systems: In online retail or content platforms, social network analysis can help build
recommendation systems based on users' connections and behaviors. For instance, a recommendation
might be made for a product or service based on the preferences of a user’s friends or peers.
3. Fraud Detection: In financial networks or online platforms, SNA is used to detect fraudulent activities by
analyzing suspicious patterns of connections, such as transactions involving a small number of people or an
abnormal number of interactions between accounts.
4. Collaboration Networks: In academic research or professional networks, SNA helps to map out
collaborations and partnerships. For example, it can identify central researchers in a field or track how
collaboration between institutions develops over time.
5. Epidemiology: SNA can be applied to the spread of diseases or behaviors within populations. By modeling
the network of contacts and interactions, public health officials can predict and control the spread of an
illness.
6. Political Campaigns: In political analysis, SNA can be used to understand voter behavior, analyze the spread
of political ideologies, or track the influence of political figures within a network.
7. Supply Chain Networks: SNA helps businesses analyze their supply chains, identifying the most important
suppliers or customers and detecting potential risks from disruptions in the network.

Key Techniques and Algorithms in Social Network Analysis

1. Graph Theory: Social networks are essentially graphs, and many algorithms from graph theory are used in SNA, including the shortest-path and clustering methods listed next.
2. Shortest Path Algorithms: Identify the shortest path between nodes (e.g., Dijkstra's algorithm).
3. Graph Clustering: This includes algorithms like Louvain and Girvan-Newman for detecting communities
within networks.
4. Centrality Measures: The different types of centrality mentioned earlier (degree, betweenness, closeness,
eigenvector) are widely used to rank and identify important nodes in the network.
5. PageRank: An algorithm initially used by Google to rank web pages, PageRank evaluates the importance of
nodes in a network by considering the number and quality of connections. It’s often used to identify
influential nodes or "hubs" in networks.
6. Network Visualization: Tools like Gephi, Cytoscape, or NetworkX (a Python package) allow for the
visualization of networks, helping researchers and analysts visually interpret complex connections and
patterns within the data.
7. Community Detection Algorithms: These include:
 Modularity-based methods (e.g., Girvan-Newman) to find communities based on network structure.
 Spectral clustering: Uses the eigenvalues of a graph’s adjacency matrix to partition the network.
8. Link Prediction: This technique predicts future links between nodes in the network based on existing
patterns and interactions. It’s commonly used in social networks to predict potential new connections
between users.
9. Network Evolution Analysis: SNA also involves studying how networks evolve over time. This includes
dynamic analysis, where edges and nodes may appear or disappear, changing the structure of the network.
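A short, hedged sketch of two of these techniques with NetworkX (the friendship graph is made up): community detection via modularity and node influence via PageRank:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Cara"),   # one tightly knit group
    ("Dan", "Eve"), ("Dan", "Fay"), ("Eve", "Fay"),     # another group
    ("Cara", "Dan"),                                    # a bridge between the groups
])

print(list(greedy_modularity_communities(G)))   # detected communities
print(nx.pagerank(G))                            # influence score per node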

Challenges in Social Network Analysis


1. Scalability: Large networks, such as those on social media platforms, can involve millions of nodes and
edges, making SNA computationally intensive. Efficient algorithms and powerful computing resources are
necessary to analyze these networks.
2. Data Privacy and Ethics: Social network data often involves sensitive personal information. Ensuring privacy
and obtaining consent for using data is a significant ethical challenge, especially when analyzing social
media or online networks.
3. Data Quality: Social network data can be noisy or incomplete. Missing edges or unbalanced data may lead
to misleading results. Handling incomplete or noisy data is crucial for meaningful analysis.
4. Dynamic Nature: Social networks are dynamic and can change rapidly. Algorithms must account for the
evolving nature of relationships and interactions within the network.
5. Interpretation: While algorithms can uncover patterns and structures, interpreting these patterns
meaningfully can be complex, as social networks are influenced by many factors like culture, economics,
and personal preferences.

7. BIG DATA SYSTEMS

Big Data Characteristics, Types of Big Data, Big Data Architecture, Introduction to Map-Reduce and Hadoop;
Distributed File System, HDFS.

Big Data Systems:

Big data systems are critical in handling large volumes of structured and unstructured data. Understanding the
various components and technologies related to big data is essential for developing, managing, and analyzing
massive datasets.

7.1 Big Data Characteristics

Big Data refers to large volumes of structured, semi-structured, and unstructured data that are generated at high
velocity, from various sources, and often with varying levels of consistency and quality. To effectively manage
and analyze Big Data, it is essential to understand its key characteristics, which are often referred to as the 5Vs.

The 5Vs of Big Data:

1. Volume: The sheer amount of data generated.


2. Velocity: The speed at which data is generated and needs to be processed.
3. Variety: The diversity of data types and formats.
4. Veracity: The uncertainty or quality issues with data.
5. Value: The usefulness and insight derived from data.

1. Volume

 The Volume of data refers to the sheer amount of data generated over time. This data can range from terabytes
to petabytes to exabytes in size.
 Importance: As organizations, devices, and systems produce more data (e.g., social media posts, sensor
data, transaction logs), the volume of data grows exponentially. Traditional data management tools are often
insufficient to handle such vast amounts of information.
 Example:
 Social media platforms like Facebook or Twitter generate massive volumes of user-generated data in the
form of posts, images, and videos.
 Companies like Google, Amazon, and Netflix accumulate terabytes of data from user activity and
behavior.

2. Velocity
 Velocity refers to the speed at which data is generated, processed, and analyzed. In Big Data systems,
velocity encompasses the real-time or near-real-time processing requirements.
 Importance: The ability to process high-speed data streams in real-time or near real-time is critical for
decision-making in many industries. Data can come in real-time from sources such as social media feeds,
financial transactions, or sensor data.
 Example:
 Financial institutions need to process thousands of transactions per second to prevent fraud.
 Real-time data feeds from IoT devices, such as smart sensors in factories, allow predictive maintenance.

3. Variety

 Variety refers to the different types and formats of data that need to be processed. Big Data encompasses
structured, semi-structured, and unstructured data.
 Importance: Organizations must process a mix of different data types, including text, images, videos, emails,
social media posts, log files, and sensor data, and derive meaningful insights from this diverse data.
 Example:
 Structured data like financial records in databases.
 Semi-structured data like XML or JSON files.
 Unstructured data such as images, audio, video, social media comments, or customer reviews.

4. Veracity

 Veracity refers to the uncertainty or trustworthiness of the data. Big Data often contains noisy, inconsistent,
or incomplete data, which can complicate analysis.
 Importance: Veracity addresses the issue of data quality. Inaccurate or unreliable data can lead to incorrect
conclusions, making data cleansing and preprocessing essential to improving the quality of Big Data.
 Example:
 Social media data may contain irrelevant or misleading information.
 Sensor data might include errors or missing readings due to faulty equipment.
 Online reviews can be biased or manipulated, affecting the credibility of analysis.

5. Value

 Value refers to the usefulness or relevance of the data. It’s not enough to have massive amounts of data; it
needs to be processed and analyzed to extract valuable insights that can drive business decisions,
innovations, or improvements.
 Importance: For Big Data to be beneficial, organizations must focus on extracting meaningful patterns,
trends, and insights from the data. The real challenge is not just collecting data but ensuring that the data
provides value by answering business questions, optimizing operations, or improving customer experiences.
 Example:
 In retail, Big Data analytics can uncover patterns in customer behavior, allowing for personalized
marketing or recommendations.
 In healthcare, analyzing Big Data can lead to improvements in diagnosis, treatment plans, or patient care
through predictive models.

Extended Characteristics (Sometimes referred to as 7Vs):

In some cases, Big Data is also described using additional characteristics beyond the original 5Vs:

6. Variability

 Variability refers to the inconsistency in the data, such as fluctuating data flows or diverse data types.
 Importance: Big Data systems must be designed to handle variability in how data arrives, is processed, and changes over time.
 Example:
 Data from social media might be highly variable, with spikes in activity during certain events like news
breaks or product launches.
 Customer behavior can vary widely depending on seasonality, marketing campaigns, or external factors.

7. Visualization

 Visualization is the ability to represent large datasets in visual forms like graphs, charts, or interactive
dashboards.
 Importance: Visualization is critical for helping decision-makers quickly understand complex patterns and
insights from Big Data. Good data visualization allows organizations to spot trends, outliers, and correlations
more easily.
 Example:
 Interactive dashboards for monitoring real-time sales, website traffic, or social media sentiment.
 Visualizations that show correlations in healthcare data (e.g., patient demographics and disease
outbreaks).

7.2 Types of Big Data

Big Data can be categorized into various types based on its structure and format. Understanding these types
helps determine how data can be managed, stored, processed, and analyzed. There are three primary types of
Big Data:

1. Structured Data
2. Semi-structured Data
3. Unstructured Data

Each type comes with its own set of challenges and requires different tools for processing and analysis. Let's
look at these types in more detail.
1. Structured Data:

 Definition: Structured data is highly organized data that is stored in a fixed format, typically in rows and
columns (like in databases or spreadsheets). This type of data is easy to analyze and manage because it
adheres to a strict schema, such as in relational databases (RDBMS).
 Characteristics:
 Highly Organized: Structured data follows a strict schema or model, typically in tables, with well-defined
relationships between data points.
 Easily Searchable: Due to its tabular nature, it is easy to query using SQL or similar query languages.
 Small to Medium Volume: Structured data usually has smaller volumes compared to other types of Big
Data, but when scaled up, it can become part of Big Data.
 Examples:
 Customer details (names, addresses, phone numbers) stored in a relational database.
 Transaction records in banking or e-commerce websites.
 Inventory data for products in a warehouse management system.
 Technologies Used:
 Relational databases like MySQL, Oracle, and SQL Server.
 Data warehousing systems like Amazon Redshift and Google BigQuery.

2. Semi-structured Data:

 Definition: Semi-structured data does not have a fixed schema like structured data, but it contains some
level of organization or tags to separate elements. This type of data has a flexible structure that can be
interpreted and processed using specific data models.

 Characteristics:
 Flexible Schema: Semi-structured data allows some flexibility in the way the data is stored and
organized, but it still has identifiable markers like tags, labels, or metadata.
 Interoperability: Semi-structured data can be easily converted into structured data using tools and
technologies designed for the purpose.
 Variety: Semi-structured data can come from diverse sources, and it may change in structure over time.
 Examples:
 XML (eXtensible Markup Language) files, which use tags to structure data but don’t adhere to a strict
relational model.
 JSON (JavaScript Object Notation) data, often used in web APIs to transfer data between servers and web
clients.
 Email messages, where the structure is partially defined (e.g., subject, sender, timestamp, etc.) but the
body content is freeform.

 Technologies Used:
 NoSQL databases like MongoDB and Cassandra, which handle semi-structured data well.
 Data formats like JSON, XML, and YAML.
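A minimal sketch (Python standard library only; the record is made up) of turning a semi-structured JSON record into a structured row:

import json

raw = '{"order_id": 101, "customer": {"name": "Asha", "city": "Chennai"}, "items": ["pen", "notebook"]}'
record = json.loads(raw)        # the keys/tags provide partial structure

# Flatten the fields of interest into a fixed, structured row.
row = (record["order_id"], record["customer"]["name"], len(record["items"]))
print(row)                      # (101, 'Asha', 2)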

3. Unstructured Data

 Definition: Unstructured data refers to data that lacks any predefined format or organization, making it more
difficult to process and analyze using traditional data processing tools. This type of data does not fit neatly
into rows and columns.
 Characteristics:
 No Defined Schema: Unstructured data lacks the organization of structured data and does not follow a
standard data model.
 Diverse Formats: It comes in a variety of formats, including text, audio, video, images, social media
posts, and web pages.
 Requires Advanced Processing: Processing unstructured data requires specialized tools and
techniques such as natural language processing (NLP) and machine learning algorithms.
 Large Volumes: Unstructured data is often massive in volume and is growing rapidly as more content
(e.g., videos, social media posts, and images) is generated daily.
 Examples:
 Text data: Such as social media posts, emails, and articles.
 Multimedia data: Videos, images, and audio files, such as those found on platforms like YouTube or
Instagram.
 Log files: Data generated by web servers, application logs, and system logs.
 Sensor data: Data from IoT devices, such as temperature readings or GPS coordinates, that may not be
structured in a tabular form.
 Technologies Used:
 Hadoop ecosystem for storing and processing unstructured data (HDFS, MapReduce, etc.).
 Apache Spark for distributed processing of unstructured data.
 Natural Language Processing (NLP) tools for analyzing text data.
 Computer vision and image processing libraries (OpenCV) for analyzing image and video data.

4. Hybrid Data (Combination of Structured, Semi-structured, and Unstructured)

 Definition: Hybrid data refers to data that combines elements from structured, semi-structured, and
unstructured data types. This is common in modern Big Data systems where data may come in different
formats and structures but needs to be integrated into a single system for analysis.
 Characteristics:
 Combination of Different Data Types: Hybrid data combines structured, semi-structured, and
unstructured data in a way that supports diverse analytics.
 Complex to Process: Managing and analyzing hybrid data requires advanced integration, cleaning, and
transformation techniques.
 Examples:
 An e-commerce platform that uses structured customer data, semi-structured product descriptions,
and unstructured reviews or feedback from customers.
 Social media platforms where posts (unstructured), user data (structured), and interactions (semi-
structured) coexist.
 Technologies Used:
 Data Lakes that can handle structured, semi-structured, and unstructured data all in one place, such as
Amazon S3 or Azure Data Lake.
 ETL (Extract, Transform, Load) tools for data integration.

Summary

Types of Big Data:

1. Structured Data:
 Organized and easy to query.
 Example: Relational databases, spreadsheets.
2. Semi-structured Data:
 Some organization, but not fully structured.
 Example: XML, JSON files, emails.
3. Unstructured Data:
 No predefined structure, often large and complex.
 Example: Social media posts, images, videos, text.
4. Hybrid Data:
 Combination of structured, semi-structured, and unstructured data.
 Example: E-commerce platforms with a mix of different data formats.

7.3 Big Data Architecture

Big Data Systems refer to the infrastructure, tools, and frameworks used to manage, process, and analyze large
volumes of data, often referred to as Big Data. This data is typically characterized by its core 3 V's: Volume (large amount of data), Variety (different types of data), and Velocity (fast pace of generation), often extended to the 5Vs of Section 7.1. To handle such data, a
robust Big Data architecture is necessary.

7.3.1 Key Components of Big Data Architecture:

1. Data Sources:
 Data sources are where data originates. This could include social media platforms, IoT devices, sensors,
websites, transactional systems, and more.
 These sources provide data in different forms such as structured, semi-structured, or unstructured.
2. Data Ingestion:
 This layer is responsible for collecting data from various sources and bringing it into the system for further
processing.
 Data ingestion can happen in real-time (streaming) or in batch (periodic collection).
 Tools: Apache Kafka, Apache Flume, NiFi, AWS Kinesis, etc.
3. Data Storage:
 After ingestion, the data needs to be stored in a system that can handle large volumes, diverse formats,
and be scalable.
 Big Data storage solutions are typically distributed across many servers.
 Types of storage:
 Distributed File Systems: such as HDFS (Hadoop Distributed File System).
 NoSQL Databases: for unstructured or semi-structured data (e.g., Cassandra, MongoDB, HBase).
 Cloud Storage: such as Amazon S3, Google Cloud Storage, Azure Blob Storage.
4. Data Processing:
 Once the data is stored, it needs to be processed. This step often involves transforming and aggregating
data for analysis.
 Processing can be done in batch or real-time:
 Batch Processing: Processes large amounts of data at once. Example tools include Apache Hadoop
(MapReduce), Apache Spark.
 Stream Processing: Deals with data in motion, handling real-time data flow. Tools like Apache
Storm, Apache Flink, Apache Samza, and Kafka Streams are popular for stream processing.
5. Data Analytics:
 After processing, data is analyzed to derive insights, perform statistical analysis, predictive analytics, and
machine learning.
 Tools for analytics include:
 Apache Spark: for distributed data processing and machine learning.
 Apache Hive: for data querying.
 Presto: an interactive query engine for big data analytics.
 Google BigQuery: a cloud-based analytics platform.
6. Data Visualization and Business Intelligence (BI):
 This is the final step, where the results of data analysis are presented in a human-readable format.
 Dashboards, charts, graphs, and reports help businesses understand the insights derived from the data.
 Tools: Tableau, Power BI, Qlik, Looker.
7. Data Governance and Security:
 Ensures data is handled according to relevant laws, regulations, and organizational policies. This
includes data privacy, access control, auditing, and security.
 Tools: Apache Ranger, Apache Atlas for governance; Kerberos for authentication; Encryption for security.

7.3.2 Big Data Architecture Layers:

A typical Big Data architecture can be broken down into the following layers:

1. Data Collection Layer:


 This layer handles data collection and ingestion from various sources.
2. Data Storage Layer:
 Stores large datasets in a distributed manner. This layer can contain batch storage (HDFS, databases)
and real-time storage systems (like NoSQL or cloud storage).
3. Data Processing Layer:
 Data processing involves batch and real-time operations. Processing is done by tools like Hadoop, Spark,
and Flink.
4. Data Analysis Layer:
 This layer involves running various algorithms to analyze and interpret data. It includes tools for statistical
analysis, machine learning, and AI.
5. Data Presentation Layer:
 Visualizes processed data in the form of reports, dashboards, and interactive visuals.

7.3.3 Big Data Architecture Example:

A commonly used Big Data architecture might look like this:

1. Data Sources: IoT devices, sensors, web applications.


2. Ingestion Layer: Apache Kafka to stream data.
3. Storage Layer:
 HDFS for batch data storage.
 HBase for real-time data storage.
4. Processing Layer:
 Apache Spark (batch processing).
 Apache Flink (stream processing).
5. Analytics Layer:
 Machine learning using Spark MLlib, H2O.ai.
6. Visualization Layer:
 BI tools like Tableau or Power BI for visualizing the data.
7. Governance Layer:
 Data security and compliance through Apache Ranger, Kerberos.

7.3.4 Considerations for Big Data Architecture:

Scalability: The system must be able to scale as data grows.

Fault tolerance: Data and computations should be resilient to failures.

Flexibility: It must support diverse data types and various analytical tools.

Real-time vs Batch: Choosing between real-time and batch processing depends on the application needs.

7.4 Introduction to Map-Reduce

MapReduce is a programming model that processes and generates large datasets in a distributed manner,
particularly suited for applications with huge amounts of data. The model breaks the task into two main phases:
Map and Reduce.

Map phase: The data is divided into chunks, processed in parallel across multiple nodes, and each chunk
produces intermediate key-value pairs.

Reduce phase: These intermediate key-value pairs are aggregated based on their keys, resulting in the final
output.

Key Concepts of MapReduce:

1. Mapper: The function that processes input data and produces intermediate key-value pairs. It operates on a
subset of the data (split across nodes in the cluster).
2. Reducer: The function that processes the intermediate key-value pairs produced by the mappers, aggregates
them (e.g., sums, averages, concatenates), and generates the final output.
3. Shuffle and Sort: After the Map phase, the system groups and sorts the intermediate results by key, which
are then sent to the appropriate reducers.

MapReduce Workflow:

1. Input Splitting:
 The input data is divided into splits (smaller manageable chunks of data).
 Each split is processed by an individual mapper task.
2. Map Phase:
 Each mapper takes a split of data and processes it.
 The mapper produces a set of key-value pairs as intermediate output.
 For example, if the task is to count the frequency of words in a large text file, the map function would take
each word and output a key-value pair like (word, 1).
3. Shuffle and Sort:
 The output from the mappers is shuffled and sorted based on the keys.
 This step ensures that all values associated with the same key are grouped together.
4. Reduce Phase:
 The reducer takes each group of key-value pairs, where the key is the same, and processes them to
produce the final result.
 For example, in the case of word count, the reducer would sum up the values for each word and produce
a final key-value pair like (word, total count).
5. Output:
 After all the data has been processed, the results are written to the output location, usually in a
distributed file system (e.g., HDFS).

[Figure: MapReduce data flow: Input Reader → Map Function → Partition Function → Shuffling and Sorting → Reduce Function → Output Writer]

Example of a MapReduce Job (Word Count):

For a simple word count task, the data input might consist of several lines of text. Here’s a step-by-step
breakdown:

Input Data:

Hello World Hello

Map Phase: Each mapper takes a split of the data (e.g., a block of text) and emits key-value pairs where the key
is a word, and the value is 1.

("Hello", 1)
("World", 1)
("Hello", 1)

Shuffle and Sort: The output from the mappers is sorted and shuffled so that all instances of the same word are
grouped together.

("Hello", [1, 1])


("World", [1])

Reduce Phase: The reducer takes each group of key-value pairs and sums up the values for each key.

("Hello", 2)
("World", 1)

Final Output: The final result is written to the output

Hello: 2
World: 1
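The same word count can be imitated on a single machine with a short Python sketch; this only illustrates the Map, Shuffle/Sort, and Reduce phases, it is not Hadoop itself:

from collections import defaultdict

lines = ["Hello World Hello"]

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce phase: sum the values for each key.
result = {key: sum(values) for key, values in groups.items()}
print(result)   # {'Hello': 2, 'World': 1}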

Advantages of MapReduce:

1. Scalability: MapReduce is highly scalable and can handle petabytes of data by distributing tasks across
many machines in a cluster.
2. Fault Tolerance: MapReduce is designed to be fault-tolerant. If a node fails, the task assigned to it can be re-
executed on another node.
3. Parallelism: The Map and Reduce phases can run in parallel, providing significant performance
improvements when processing large datasets.
4. Simplicity: The programming model (Map and Reduce) is simple and allows developers to focus on the core
logic of the problem without worrying about low-level details like data distribution, load balancing, and fault
tolerance.

Challenges of MapReduce:

1. Data Skew: If the input data is not evenly distributed, some nodes may end up processing more data than
others, leading to performance bottlenecks.
2. Limited Data Processing: MapReduce is suitable for batch processing but not ideal for real-time processing.
For real-time streaming data, tools like Apache Spark are more suitable.
3. Inter-Task Dependencies: MapReduce doesn’t naturally support complex inter-task dependencies. If a task
requires stateful operations or multiple passes over data, the MapReduce model may not be the most
efficient.
4. Not Suitable for Iterative Algorithms: MapReduce is not well-suited for tasks that involve multiple iterations
over the same dataset, such as machine learning algorithms (though frameworks like Apache Spark address
this issue).

MapReduce Frameworks:

1. Hadoop MapReduce:
 Apache Hadoop is the most popular framework that implements the MapReduce programming model.
Hadoop allows you to write MapReduce jobs in Java and process large amounts of data on a cluster of
machines.
 HDFS (Hadoop Distributed File System) stores the data, and YARN (Yet Another Resource Negotiator)
manages resources in the cluster.
2. Apache Spark: Apache Spark extends the MapReduce model and provides a more flexible, in-memory
processing framework. It can perform both batch and real-time data processing and is much faster than
traditional MapReduce, particularly for iterative tasks.
3. Google Cloud Dataflow: A fully managed service that supports MapReduce-style processing on Google's
cloud infrastructure.
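For comparison, the same word count expressed on Spark's higher-level API looks roughly like the sketch below (assumes a local PySpark installation; "input.txt" is a placeholder path):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("input.txt")                    # placeholder input path
            .flatMap(lambda line: line.split())       # map: split lines into words
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))         # reduce: sum counts per word
print(counts.collect())
sc.stop()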

7.5 Introduction to Hadoop

Hadoop is one of the most widely used frameworks for storing and processing large volumes of data in a
distributed manner. It is an open-source software project developed by the Apache Software Foundation.
Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and
storage. It is particularly well-suited for big data analytics because of its ability to handle petabytes of data across
clusters of commodity hardware.

Core Components of Hadoop

1. Hadoop Distributed File System (HDFS):


 HDFS is the storage layer of Hadoop and is designed to store large files across many machines.
 It divides large files into smaller blocks (default block size: 128 MB or 256 MB) and distributes these
blocks across a cluster of nodes.
 HDFS ensures fault tolerance by replicating data blocks (typically 3 replicas per block) across different
nodes, so if a node fails, the data can still be accessed from other nodes.
 HDFS is optimized for high throughput and is ideal for applications with large-scale data access patterns
(i.e., reading and writing large files sequentially).
2. MapReduce:
 MapReduce is the computational framework used in Hadoop to process large datasets in a distributed
fashion.
 It works by dividing a task into two phases: the Map phase and the Reduce phase.
 Map phase: The input data is divided into smaller chunks, processed by different mappers, and
outputted as key-value pairs.
 Reduce phase: The results of the map phase are grouped by key, and reducers process them to
generate the final output.
 MapReduce is designed to work on massive datasets and run in parallel across many machines.
 While MapReduce is effective for many tasks, it can be slow for certain types of data processing, which
is why other processing frameworks like Apache Spark are often used in conjunction with Hadoop.
3. YARN (Yet Another Resource Negotiator):
 YARN is the resource management layer of Hadoop. It is responsible for managing resources in the
cluster and scheduling jobs.
 YARN allows for multi-tenancy by enabling different applications to share the same cluster. This includes
managing resources for applications like MapReduce, Spark, or other distributed frameworks.
 YARN consists of three main components:
 ResourceManager (RM): Manages resources across the cluster and allocates them to jobs.
 NodeManager (NM): Manages resources on individual nodes in the cluster and reports back to the
ResourceManager.
 ApplicationMaster (AM): Manages the execution of a specific application (like a MapReduce job or a
Spark job).
 YARN enables Hadoop to run various types of workloads (batch, interactive, real-time) within the same
cluster.
4. Hadoop Common:
 It includes the necessary libraries, utilities, and APIs that allow other Hadoop modules to function.
 It provides essential services such as file system access, configuration management, and necessary
tools for running Hadoop jobs.
 Hadoop Common is required by other Hadoop components like HDFS and YARN for smooth operation.

Hadoop Ecosystem

In addition to its core components, Hadoop has a rich ecosystem of tools that work together to process, manage,
and analyze big data.

Hive:

 Hive is a data warehousing tool built on top of Hadoop that allows SQL-like queries (HiveQL) to be run on
data stored in HDFS.
 It provides an abstraction layer to make Hadoop accessible to users familiar with relational databases and
SQL.
 Hive can optimize query execution and use MapReduce under the hood, although newer versions support
execution engines like Apache Tez and Apache Spark for faster performance.

Pig:

 Pig is a high-level platform that provides a scripting language called Pig Latin to process data in Hadoop.
 Pig allows for the processing of data without writing complex MapReduce code, using simpler data flows.
 It is especially useful for handling large datasets with complex data transformation needs.

HBase:
 HBase is a distributed, column-family-oriented NoSQL database built on top of HDFS.
 It is modeled after Google’s Bigtable and is designed for real-time random read/write access to large
datasets.
 HBase is ideal for use cases requiring low-latency data access and can scale to store billions of rows of data
across a large cluster.

ZooKeeper:

 ZooKeeper is a centralized service for maintaining configuration information, naming, and providing
distributed synchronization.
 It is used to manage the configuration of distributed systems and ensure coordination between different
nodes in a Hadoop cluster.

Oozie:

 Oozie is a workflow scheduler system to manage Hadoop jobs.


 It allows for the coordination of complex data pipelines, including MapReduce, Hive, Pig, and other Hadoop
jobs.
 Oozie supports job dependencies, retries, and scheduling of tasks based on triggers (e.g., time-based, event-
based).

Flume:

 Flume is a tool for collecting, aggregating, and transferring large amounts of streaming data to Hadoop.
 It is commonly used for log data collection and streaming data sources such as social media, IoT devices, or
web logs.

Sqoop:

 Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and relational databases.
 It allows for the import and export of data from databases like MySQL, Oracle, or PostgreSQL to Hadoop
HDFS, Hive, or HBase.

Apache Spark (with Hadoop):

 Spark is a fast, in-memory data processing engine designed for large-scale data processing. While it can run
standalone, it is often used with Hadoop to process data stored in HDFS.
 Spark offers significant performance improvements over MapReduce for many workloads, especially for
iterative tasks, thanks to its in-memory processing model.

Advantages of Hadoop

Scalability:

 Hadoop can scale horizontally by adding more machines to the cluster. It can handle petabytes of data by
distributing storage and computation across many nodes.

Cost-Effective:

 Hadoop can run on commodity hardware, reducing the cost of storage and computation compared to
traditional, centralized databases or data warehouses.

Fault Tolerance:

 Hadoop automatically replicates data across different nodes in the cluster. If one node fails, data can still be
accessed from other nodes with replicas.
Flexibility:

 Hadoop is suitable for processing structured, semi-structured, and unstructured data, making it ideal for a
wide range of applications such as data warehousing, log processing, and real-time analytics.

High Throughput:

 Hadoop is designed for high throughput, making it well-suited for batch processing of large datasets.

Challenges of Hadoop

Complexity:

 While Hadoop is powerful, it can be complex to manage, especially when dealing with large clusters or
integrating with other tools in the Hadoop ecosystem.

Latency:

 Hadoop is primarily designed for batch processing and may not be ideal for real-time data processing
(although tools like Apache Storm and Apache Spark Streaming address this limitation).

Security:

 While Hadoop has improved security features (e.g., Kerberos authentication), securing a Hadoop cluster can
still be challenging, especially in large deployments.

Limited Support for Iterative Algorithms:

 Hadoop MapReduce can be inefficient for iterative algorithms like machine learning, which require multiple
passes over the data. Tools like Apache Spark are often preferred for these tasks.

7.6 Distributed File System (DFS)

DFS is a key component in big data systems, providing an architecture that allows for the storage and access of
files across multiple machines or nodes in a network. It enables efficient, scalable, and reliable data storage in
distributed computing environments, where large volumes of data need to be stored, processed, and accessed
by various users or applications.

Key Concepts of DFS in Big Data Systems:

Data Distribution and Scalability:

 DFS allows data to be spread across multiple servers (nodes) in a cluster, enabling horizontal scaling. As the
data grows, additional machines can be added to the system without impacting performance.
 The data is typically divided into chunks, and each chunk is distributed across different nodes.

Fault Tolerance and Data Replication:


 In distributed systems, hardware failures are inevitable. DFS provides fault tolerance by replicating data
across different nodes in the system.
 If a node or disk fails, the system can still access copies of the data from other nodes, ensuring high
availability and durability.

Data Redundancy:

 DFS uses redundancy techniques like replication (multiple copies of data) and erasure coding (breaking data
into fragments and storing them across nodes). Replication is the most common, where each file is typically
stored in 2-3 copies across different nodes.
 This redundancy helps in ensuring data integrity and availability in the event of hardware or network failures.

Data Locality:

 One of the primary goals of DFS in big data systems is to optimize data locality. This means processing data
as close to where it is stored as possible, reducing the need to move data across the network, which can be
slow and expensive.
 Some DFS implementations like HDFS (Hadoop Distributed File System) store the computation logic close
to the data, which is a key factor in the performance of big data analytics.

Metadata Management:

 Metadata (data about the data) such as file names, sizes, locations, and permissions is stored separately
from the actual data. In DFS, a master node (also known as NameNode in HDFS) is responsible for storing
and managing this metadata.
 This separation allows DFS to efficiently manage and access the data while ensuring metadata integrity and
access control.

High Throughput and Low Latency:

 DFS is optimized for high throughput (i.e., reading and writing large volumes of data) rather than low-latency
access. This is ideal for big data processing workloads, such as batch processing and analytics, where speed
in accessing vast amounts of data is critical.

Examples of DFS Used in Big Data Systems:

Hadoop Distributed File System (HDFS):

 HDFS is the most widely used DFS in big data systems, especially in Hadoop-based frameworks. It is
designed to store large files (typically in the gigabyte to terabyte range) across a cluster of commodity
hardware.
 Key features of HDFS include data replication, block-based storage, and fault tolerance. HDFS is optimized
for read-heavy access patterns, where large datasets are processed in parallel.

Google File System (GFS):

 GFS is a proprietary DFS used by Google to manage its massive data storage needs. It was designed to handle
large-scale data across multiple machines and ensure data integrity, reliability, and availability.
 GFS inspired the creation of HDFS and shares many of its key principles like data replication, block-based
storage, and scalability.

Amazon S3 (Simple Storage Service):

 Amazon S3 is a cloud-based object storage system that functions like a distributed file system. While not a
traditional DFS, it is widely used in big data architectures as a storage solution for unstructured data.
 S3 provides high availability, durability, and scalability, making it suitable for big data applications. It's often
integrated with other cloud-based big data processing frameworks like Amazon EMR (Elastic MapReduce).

Ceph:

 Ceph is an open-source, distributed storage system that provides highly scalable object, block, and file
storage in a unified system. It is known for its flexibility, allowing it to be used in a variety of big data and cloud
storage use cases.
 Ceph is designed to be fault-tolerant and self-healing, with data replication and distribution techniques that
ensure high availability.

Key Advantages of DFS in Big Data:

1. Scalability: DFS can scale horizontally by adding more nodes as the data grows, without significant changes
to the architecture.
2. Fault Tolerance: Replication and redundancy mechanisms ensure that data remains accessible even in the
event of node failures.
3. High Throughput: DFS is designed to handle high volumes of data and large file sizes, which is crucial for big
data processing tasks.
4. Data Accessibility: DFS enables distributed access to data from different machines or nodes, facilitating
parallel processing and reducing bottlenecks.

Challenges of DFS in Big Data:

1. Consistency: Achieving consistency across multiple copies of data (replicas) in a distributed system can be
challenging, particularly in the presence of network partitions.
2. Complexity: Implementing and managing a DFS can be complex, especially in large-scale systems that
require fine-grained control over data distribution, replication, and recovery.
3. Latency: Although optimized for throughput, DFS systems may have higher latency in certain cases due to
the need to access data from multiple nodes.

7.7 HDFS (Hadoop Distributed File System)

HDFS is the primary distributed file system used by the Apache Hadoop ecosystem to store vast amounts of data
across a cluster of machines. It is designed to work with large-scale data processing frameworks, providing a
reliable, scalable, and fault-tolerant storage solution for big data applications.

Key Features of HDFS:

Distributed Storage:

 HDFS stores large files across multiple machines in a distributed fashion. The data is divided into blocks,
typically 128MB or 256MB in size, and each block is stored across several nodes (machines) in the cluster.

Data Replication:

 To ensure fault tolerance and high availability, HDFS replicates each data block multiple times across
different nodes. By default, HDFS replicates data three times (though this can be configured). If one node
fails, the data can still be accessed from the replica stored on another node.

Block-Level Storage:

 Data is stored in blocks, and each block is assigned to a specific machine in the cluster. The size of these
blocks (usually 128MB or 256MB) is much larger than traditional file systems to reduce the overhead of
managing many small files.
 This block-level storage is optimized for large sequential reads, typical in big data workloads.

High Throughput:

 HDFS is optimized for high throughput, which makes it suitable for data processing tasks like batch
processing (e.g., MapReduce) and analytics.
 It is designed for reading large volumes of data at a time, which is ideal for big data applications like data
warehousing, machine learning, and data mining.

Fault Tolerance:

 Data replication ensures that even if one or more nodes in the system fail, the data remains available through
replicas stored on other nodes. HDFS automatically handles data recovery by replicating missing blocks if a
node fails or becomes unreachable.

Data Locality:

 HDFS is designed to take advantage of the concept of data locality. When processing data, it tries to schedule
computation tasks near the location of the data to avoid the overhead of moving large volumes of data over
the network.
 This significantly improves the performance of processing tasks like those handled by MapReduce or other
big data frameworks.

Scalability:

 HDFS can scale horizontally by simply adding more nodes to the cluster. The distributed nature of HDFS
allows it to handle increasing amounts of data as the system grows without significant performance
degradation.
 HDFS clusters can contain thousands of nodes, making it suitable for handling petabytes of data.
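The following is a conceptual Python sketch only (not HDFS's real placement policy) of how a file might be split into 128 MB blocks with each block replicated on 3 DataNodes:

BLOCK_SIZE = 128 * 1024 * 1024      # 128 MB default block size
REPLICATION = 3                     # default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # made-up node names

def place_blocks(file_size_bytes):
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Round-robin choice of 3 distinct DataNodes for each block's replicas.
        placement[f"block_{b}"] = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
    return placement

print(place_blocks(300 * 1024 * 1024))   # a 300 MB file -> 3 blocks, each with 3 replicas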

Key Components of HDFS:

NameNode (Master):

 The NameNode is the central metadata server that manages the HDFS namespace. It is responsible for:
 Keeping track of the files and directories in the file system.
 Managing the metadata, including file names, permissions, and locations of blocks.
 Coordinating file creation, deletion, and replication.
 Directing clients to the appropriate DataNodes to read or write data.
 The NameNode does not store the actual data (which is stored in the DataNodes) but keeps a record of which
block is stored where.
DataNodes (Slave Nodes):

 DataNodes are the actual storage nodes in the HDFS architecture. They store the data blocks and handle the
read and write requests from clients. Each DataNode is responsible for:
 Storing blocks of data.
 Reporting the status of the blocks to the NameNode.
 Handling data retrieval and block creation, deletion, and replication.
 When a client wants to read or write data, the NameNode tells it where the relevant DataNode is located,
and the client communicates directly with the DataNode to fetch or store data.

Secondary NameNode:

 The Secondary NameNode periodically checkpoints the file system metadata by merging the NameNode's edit logs into the file system image (fsimage). This keeps the edit log from growing indefinitely, reduces the NameNode's restart time, and provides a recent copy of the metadata that can help in recovery.
 It does not serve client requests and does not replace the NameNode in case of failure (this is managed
through HDFS High Availability).

Client:

 The Client in HDFS is any application or user that wants to read or write data in the HDFS cluster. It interacts
with the NameNode to retrieve metadata and directly communicates with DataNodes for actual data storage
or retrieval.

HDFS Architecture Workflow:

File Write Operation:

 When a client wants to write a file to HDFS, it first contacts the NameNode to get the list of DataNodes where
the file blocks should be stored.
 The NameNode responds with a set of DataNodes. The client splits the file into blocks and streams each block to the first DataNode, which forwards it along a replication pipeline to the other DataNodes in the set.
 The DataNodes store the blocks, and the NameNode updates its metadata to reflect the location of the blocks.

File Read Operation:

 To read a file, the client first contacts the NameNode to get the list of DataNodes where the blocks of the file
are stored.
 The NameNode responds with the block locations, and the client communicates directly with the DataNodes
to read the data.
Data Replication:

 If a DataNode fails, the NameNode is responsible for detecting the failure and ensuring that data replication
is maintained. If a block has fewer than the desired number of replicas, the NameNode will instruct other
DataNodes to replicate the missing block.

Data Recovery:

 If a block or DataNode becomes unavailable, HDFS ensures the lost data is replicated from other copies,
keeping the system reliable and fault-tolerant.
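
From a user's point of view, these write, read, and replication operations are usually driven through the HDFS shell. A minimal sketch is shown below; the directory /data/logs and the file access.log are assumed names used only for illustration.

hdfs dfs -mkdir -p /data/logs                 # create a directory in HDFS
hdfs dfs -put access.log /data/logs/          # write: the client streams blocks to DataNodes
hdfs dfs -ls /data/logs                       # list files (metadata served by the NameNode)
hdfs dfs -cat /data/logs/access.log           # read: blocks are fetched directly from DataNodes
hdfs dfs -setrep -w 2 /data/logs/access.log   # change the replication factor of a file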

Advantages of HDFS in Big Data:

1. Fault Tolerance: Data replication ensures that even if nodes fail, the data remains available.
2. Scalability: HDFS scales horizontally by simply adding more nodes to the system, enabling it to handle
petabytes of data.
3. High Throughput: HDFS is optimized for high throughput rather than low latency, making it ideal for
processing large volumes of data in batch processing applications.
4. Cost-Effective: HDFS can run on commodity hardware, making it a cost-effective solution for storing and
processing large amounts of data.
5. Data Locality: HDFS reduces network traffic by moving computations closer to the data, thus speeding up
data processing.

Use Cases of HDFS in Big Data Systems:

Data Storage for Analytics:

 HDFS is commonly used as the storage layer for large-scale analytics applications. It works well with tools
like Apache Hive and Apache Impala for data warehousing and SQL queries on massive datasets.

Batch Processing with Apache Spark and Hadoop MapReduce:

 HDFS is often paired with frameworks like Apache Spark and Hadoop MapReduce for batch processing tasks
such as ETL (Extract, Transform, Load), data analytics, and machine learning on large datasets.

Data Lakes:

 HDFS is a common storage solution for data lakes, where organizations store structured and unstructured
data in raw form before processing and analyzing it.

Log Processing and Streaming Data:

 HDFS is also used in log processing applications, where large volumes of log data generated by various
systems are ingested and processed in real time or batch mode.
8. NOSQL

NoSQL (Not Only SQL) databases are a category of databases that provide an alternative to traditional relational
databases. NoSQL databases are designed to handle large volumes of unstructured, semi-structured, or
structured data and are scalable, flexible, and fault-tolerant. These databases are often used in scenarios
involving big data, real-time web apps, and IoT applications.

8.1 NOSQL

NoSQL databases are a popular choice in modern web applications, especially when dealing with large-scale,
distributed systems that need to handle varied and fast-moving data.

Key Characteristics of NoSQL:

Schema-less or Flexible Schema: NoSQL databases often allow you to store data without needing to define a
rigid schema beforehand, making them suitable for dynamic or evolving datasets.

Scalability: NoSQL databases are designed to scale out by distributing data across multiple servers (horizontal
scaling), which allows them to handle huge amounts of data and traffic.

Variety of Data Models: They support different data models such as key-value, document, column-family, and
graph.

High Availability: Many NoSQL systems are designed for high availability and can handle failures gracefully by
replicating data across multiple nodes or data centers.

Performance: They are optimized for read-heavy and write-heavy workloads, often prioritizing speed and low
latency over complex querying capabilities.
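
As a small illustration of the flexible schema described above, a document store such as MongoDB allows documents with different shapes to live in the same collection; the collection name profiles and its fields below are assumptions used only for this sketch.

db.profiles.insertOne({ name: "Alice", age: 30 })
db.profiles.insertOne({ name: "Bob", skills: ["java", "sql"], address: { city: "Pune" } })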

Types of NoSQL Databases:

Key-Value Stores:

 Data is stored as key-value pairs (similar to a dictionary or hash map).


 Example: Redis, DynamoDB, Riak.

Document Stores:

 Store data in documents (often JSON, BSON, or XML format), allowing for nested structures.
 Example: MongoDB, CouchDB, Couchbase.

Column-Family Stores:

 Store data in columns rather than rows, making them efficient for read-heavy analytical workloads.
 Example: Apache Cassandra, HBase, ScyllaDB.

Graph Databases:

 Store data as nodes and edges, optimized for querying relationships between entities.
 Example: Neo4j, Amazon Neptune, ArangoDB.

Advantages of NoSQL:

Scalability: Can scale horizontally across many servers, handling large-scale datasets with ease.

Flexibility: Suitable for unstructured, semi-structured, or evolving data models.


High Availability and Fault Tolerance: Designed to be resilient, with automatic replication and distribution of
data.

Performance: Optimized for high throughput and low latency operations.

Disadvantages of NoSQL:

Lack of ACID Transactions: Many NoSQL systems do not support full ACID (Atomicity, Consistency, Isolation,
Durability) transactions, which makes them less suitable for applications requiring strict data consistency.

Limited Querying: Complex queries involving joins, aggregations, or subqueries may be difficult or inefficient to
perform.

Maturity: Some NoSQL systems are relatively newer compared to traditional relational databases, and they may
not have the same level of maturity, tooling, or community support.

Use Cases for NoSQL:

Big Data and Analytics: Storing and analyzing massive datasets in real-time, such as logs or sensor data.

Real-Time Applications: Applications requiring low-latency access, such as recommendation engines, social
media platforms, and IoT systems.

Content Management Systems: Websites and applications with dynamic and evolving content, where data
formats may vary.

Distributed Systems: Applications that need to distribute data across multiple servers or data centers.

8.2 Query Optimization

Optimizing queries in NoSQL databases is important for improving performance, especially as the scale of the
data grows. NoSQL databases like MongoDB, Cassandra, Couchbase, and others often have different query
optimization techniques compared to relational databases due to their schema-less nature and distributed
architecture. Here are general techniques and tips for optimizing queries in NoSQL databases:
Design the Data Model Carefully

Choose the right data model: NoSQL databases often provide different types of models (e.g., key-value,
document, column-family, graph). Design your schema to align with your access patterns, meaning you should
structure the data based on how it will be queried, not just on its relationships.

Denormalization: In NoSQL, it’s common to store redundant copies of data to avoid costly joins. For example,
instead of linking user data to posts in separate collections, you might store user data directly inside post
documents (or vice versa).
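
For example, a denormalized post document might embed the author's details directly instead of referencing a separate users collection; the collection and field names below are assumptions used only for illustration.

db.posts.insertOne({
  title: "Intro to NoSQL",
  author: { userId: "u101", name: "Alice" },   // embedded copy of the user data, so no join is needed
  tags: ["nosql", "mongodb"],
  createdAt: new Date()
})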

Avoid large documents: For document stores like MongoDB, ensure documents are not too large. Large
documents may lead to inefficiencies when querying or updating data.

Indexing

Create appropriate indexes: Indexing is critical for fast reads. Ensure that fields used in queries (e.g., filters or
sorts) are indexed. However, unnecessary indexes can degrade performance during write operations.

 In MongoDB, for instance, fields in query filters or those involved in sort operations should be indexed.
 Composite indexes: For queries that use multiple fields, consider composite indexes that cover multiple
fields simultaneously. This avoids the overhead of creating multiple indexes.

Covering indexes: Create indexes that include all the fields a query filters on and returns, so the database can answer the query directly from the index without fetching the underlying documents.
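
A minimal sketch of a covered query in MongoDB, assuming an orders collection with customerId and status fields: the projection returns only indexed fields and excludes _id, so the query can be satisfied from the index alone.

db.orders.createIndex({ customerId: 1, status: 1 })
db.orders.find({ customerId: "12345" }, { customerId: 1, status: 1, _id: 0 })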

Use Query Projections

Limit the data returned: Instead of returning all fields of a document, use projections to specify only the required
fields. This reduces the amount of data transferred from the database and can significantly speed up queries.

Example (MongoDB):

db.collection.find({ name: "John" }, { name: 1, age: 1 })

This query will only return the name and age fields instead of the entire document.

Sharding (Horizontal Scaling)

If your NoSQL database supports sharding (e.g., MongoDB, Cassandra), design the shard key wisely. The choice
of the shard key can significantly affect query performance and data distribution.

Even data distribution: Make sure that the data is evenly distributed across the nodes to prevent hotspots where
some nodes have too much data.

Use range-based sharding: In some cases, range-based sharding may be more efficient, especially when
queries often retrieve data in a certain range (e.g., time series data).
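
In MongoDB, for instance, sharding is enabled per database and a shard key is declared per collection: a hashed key spreads writes evenly, while a ranged key supports range queries. A brief sketch follows; the database and collection names are assumptions.

sh.enableSharding("shop")
sh.shardCollection("shop.orders", { customerId: "hashed" })   // hashed key for even data distribution
sh.shardCollection("shop.events", { eventDate: 1 })           // ranged key for time-series style queries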

Avoid Joins, Use Aggregation

Avoid joins: NoSQL databases are generally not designed for complex joins like in relational databases. Instead,
use aggregation or embedding to keep the data denormalized.

Aggregation pipelines: For databases like MongoDB, use the aggregation framework to perform operations like
grouping, filtering, and sorting on the server side instead of pulling all data into the application and performing
those operations there.

 Example (MongoDB Aggregation):


db.orders.aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } }
])

Materialized views: Some NoSQL systems (e.g., Couchbase) allow you to create materialized views
(precomputed results) that can be queried efficiently.

Optimize Write Patterns

 Batch writes: When dealing with large datasets, batch writes to reduce overhead. In some systems,
performing many small writes can be more expensive than larger, batched ones.
 Write consistency: In distributed NoSQL systems, consider adjusting consistency settings to strike a
balance between read consistency and write performance.
 Avoid unnecessary updates: Only update the fields that are actually changing. Avoid full document
replacements or updates when only small parts of the document change.
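
A short MongoDB sketch of these write patterns, batching several inserts into one call and updating only the field that changed; the collection and field names are assumptions.

db.orders.insertMany([
  { orderId: 1, status: "new" },
  { orderId: 2, status: "new" }
])                                                                        // one round trip instead of many small writes
db.orders.updateOne({ orderId: 1 }, { $set: { status: "completed" } })    // touch only the changed field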

Use Query Caching

 Some NoSQL databases (e.g., Redis, Couchbase) support caching of frequent queries or results. This is
useful for read-heavy workloads where the same data is requested repeatedly.
 Query result caching: By caching the results of expensive queries, subsequent reads can be served much
faster.
 Data eviction strategies: Configure the eviction policies to ensure that the cache does not grow too large
and cause performance degradation.
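
A minimal cache-aside sketch in Redis, where the result of an expensive query is stored under an assumed key with a 60-second expiry so that repeated reads can skip the database entirely.

SET cache:top_customers "[{\"customerId\":\"12345\",\"total\":980}]" EX 60
GET cache:top_customers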

Monitor and Analyze Query Performance

 Query profiling: Most NoSQL databases provide tools to profile and analyze query performance (e.g.,
explain() in MongoDB). Use these tools to find slow queries, missing indexes, and other potential bottlenecks.
 Database monitoring: Monitor database metrics such as CPU, memory usage, disk I/O, and query execution
times. Use these insights to identify and address performance issues.
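
In MongoDB, for example, explain() reports whether a query used an index scan (IXSCAN) or a full collection scan (COLLSCAN), together with execution statistics such as the number of documents examined:

db.orders.find({ customerId: "12345" }).explain("executionStats")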

Leverage Secondary Indexes and Full-Text Search

 For databases like Cassandra or Couchbase, use secondary indexes where appropriate, but be cautious
about their impact on write performance.
 Full-text search: For NoSQL databases that support it (e.g., Elasticsearch or Couchbase with Full-Text
Search), use full-text search capabilities when you need advanced search functionality, as opposed to using
traditional indexes.

Limit Network Traffic

 Data locality: For distributed NoSQL systems, ensure that queries are directed to the nodes where the data
is located. This minimizes unnecessary network traffic between nodes.
 Use pagination: For queries that return large datasets, implement pagination to limit the number of results
fetched at once, reducing the strain on both the database and the network.

Tune Configuration and Resources

 Memory: Allocate sufficient memory to store indexes and frequently accessed data.
 Disk I/O: Ensure your storage is optimized for low latency and high throughput, especially for databases that
rely on disk for storing data (e.g., Cassandra, MongoDB).
 Cluster setup: In distributed setups, ensure that the cluster is properly sized and the nodes are configured
for the expected workload.

Example: Optimizing a MongoDB Query

A typical scenario in MongoDB might involve querying a large dataset based on certain filters. Here's an example
of how to optimize it:

Original query (inefficient):

db.orders.find({ customerId: "12345", orderDate: { $gt: new Date("2023-01-01") } })

This query may be slow if customerId and orderDate are not indexed.

Optimized query:

1. Create compound index on customerId and orderDate:

db.orders.createIndex({ customerId: 1, orderDate: 1 })

2. Use projection to limit the fields returned:

db.orders.find(
  { customerId: "12345", orderDate: { $gt: new Date("2023-01-01") } },
  { orderId: 1, totalAmount: 1 }
)

This ensures the query uses the index efficiently, returns only the necessary fields, and reduces I/O overhead.

8.3 Different NOSQL Products

NoSQL databases are designed to handle large volumes of unstructured or semi-structured data that may not fit
well into traditional relational database models. They are often used for applications requiring high availability,
scalability, and flexibility. Below are some popular types of NoSQL database products, categorized by their data
models:

1. Document-Based Databases

These databases store data as documents, often in JSON or BSON format, where each document is a set of key-
value pairs.

 MongoDB: One of the most widely used NoSQL databases. It stores data in flexible, JSON-like documents.
MongoDB is known for its high performance, scalability, and ease of use.
 CouchDB: A database that uses JSON for data storage and JavaScript as its query language. CouchDB
focuses on ease of use and offers features like multi-master replication.
 RavenDB: A fully transactional document database for .NET and .NET Core applications. It’s designed to
store and query large collections of JSON documents.

2. Key-Value Stores

These databases use a simple model where data is stored as key-value pairs, with keys being unique identifiers.

 Redis: An open-source, in-memory key-value store known for its high performance. Redis supports a variety
of data structures, including strings, lists, sets, and hashes.
 Riak: A highly available and fault-tolerant key-value store designed for scalability. It is commonly used for
distributed systems.
 Amazon DynamoDB: A managed NoSQL key-value store by Amazon Web Services (AWS), designed for high
availability and performance at scale.

3. Column-Family Stores
These databases store data in columns rather than rows, making them suitable for analytical queries on large
datasets.

 Apache Cassandra: A highly scalable, distributed NoSQL database optimized for handling large amounts of
data across many commodity servers. It's known for its decentralized architecture and fault tolerance.
 HBase: An open-source, distributed database that is modeled after Google’s Bigtable. It is part of the Hadoop
ecosystem and designed for large-scale, low-latency operations.
 ScyllaDB: A high-performance, drop-in replacement for Apache Cassandra, designed for low-latency and
high-throughput use cases.
4. Graph Databases

These databases are optimized to store and query graph structures, with nodes, edges, and properties. They are
used to represent relationships between entities.

 Neo4j: One of the most popular graph databases, widely used for applications like social networks,
recommendation engines, and fraud detection.
 ArangoDB: A multi-model database that supports graph, document, and key-value data models. It’s
designed to handle complex queries and relationships.
 Amazon Neptune: A fully managed graph database service by AWS that supports both property graph and
RDF graph models, making it suitable for a wide range of graph-based applications.

5. Multi-Model Databases

These databases support more than one data model, allowing users to choose the appropriate model for their
application.

 ArangoDB: As mentioned above, it supports graph, document, and key-value data models.
 OrientDB: A multi-model NoSQL database that supports document, graph, key-value, and object-oriented
data models. It’s designed for scalability and high availability.
 Couchbase: A NoSQL database that offers a flexible data model and supports document, key-value, and
full-text search queries.

6. Time-Series Databases

These databases are optimized for storing and querying time-series data, such as sensor readings, financial data,
and logs.

 InfluxDB: A time-series database designed to handle high write and query loads. It’s used for monitoring
applications, IoT, and real-time analytics.
 Prometheus: An open-source time-series database used primarily for monitoring and alerting in cloud-
native environments.
 TimescaleDB: Built on top of PostgreSQL, it is a time-series database designed to combine the reliability of
SQL with the scale of NoSQL for time-based data.

7. Object-Oriented Databases

These databases store data in the form of objects, similar to how data is represented in object-oriented
programming.

 db4o: A Java and .NET-compatible object-oriented database. It allows developers to store objects directly,
eliminating the need for complex object-relational mapping.
 ObjectDB: An object database for Java that supports JPA (Java Persistence API) and offers high performance
for managing persistent objects.

8. Search Engines (NoSQL by Nature)

These products can be used for full-text search and indexing, often leveraging NoSQL principles for scalability
and flexibility.

 Elasticsearch: A distributed search and analytics engine built on top of Apache Lucene. It’s often used for
log and event data analysis, as well as search applications.
 Apache Solr: Another open-source search platform based on Lucene, commonly used for full-text search
and analytics in large datasets.

9. Other Notable NoSQL Databases


 Couchbase: A NoSQL database with a flexible JSON document model and support for key-value store and
full-text search. It’s often used for caching and high-performance applications.
 FaunaDB: A distributed, serverless database designed to scale without the operational overhead of
traditional databases. It combines document, graph, and relational models.

8.4 Querying and Managing NOSQL

Querying and managing NoSQL databases requires understanding their specific characteristics, as NoSQL
databases come in various types (document, key-value, column-family, graph, etc.).

Here’s an overview of how querying and managing NoSQL databases works:

Types of NoSQL Databases

Document-based Databases (e.g., MongoDB, CouchDB)

 Store data as documents, usually in JSON or BSON format.


 Each document is a set of key-value pairs and can be complex with nested structures.
 Example Query (MongoDB):

db.users.find({ age: { $gt: 25 } })

 This query fetches users whose age is greater than 25.

Key-Value Stores (e.g., Redis, DynamoDB)

 Store data as key-value pairs.


 Highly performant and used for caching and session management.
 Example Query (Redis):

SET user:1 "John"


GET user:1

 This stores and retrieves a value associated with a key.

Column-family Stores (e.g., Cassandra, HBase)

 Data is stored in columns instead of rows, making it efficient for large-scale reads.
 Example Query (Cassandra):

SELECT * FROM users WHERE age > 30;

 This fetches users older than 30 from a table in a column-family store.

Graph Databases (e.g., Neo4j, ArangoDB)

 Store data as nodes and relationships between them.


 Useful for traversing and analyzing relationships.
 Example Query (Neo4j - Cypher query language):

MATCH (a:Person)-[:KNOWS]->(b:Person)
WHERE a.name = 'Alice'
RETURN b.name

 This query finds people that Alice knows.

Key Concepts in NoSQL Querying:


 Schema Flexibility: NoSQL databases often don’t enforce a fixed schema, which makes it easier to store
diverse types of data, but it also requires more care during querying.
 Data Modeling: In NoSQL, data modeling is crucial for query optimization. For example, in a document store
like MongoDB, you might decide whether to embed related data in a single document or to reference it in
another document.
 Indexing: Indexes are used to speed up query performance. In NoSQL, creating indexes on frequently queried
fields is essential for performance.
Example (MongoDB index creation):

db.users.createIndex({ age: 1 })

 Aggregation: Aggregation frameworks allow you to perform complex operations like filtering, grouping, and
summing data in a more advanced way.
Example (MongoDB aggregation):

db.orders.aggregate([
{ $match: { status: 'completed' } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } }
])
Managing NoSQL Databases:

Data Insertion:

 Insertion can be done using the database’s API or query language.


 Example (MongoDB):

db.users.insert({ name: "Alice", age: 30 })

Data Updates:

 Data updates vary across NoSQL databases. You can update a document in MongoDB, set a new value in
Redis, or update columns in Cassandra.
 Example (MongoDB):

db.users.update({ name: "Alice" }, { $set: { age: 31 } })

Data Deletion:

 Deletion of data can also depend on the database type.


 Example (MongoDB):

db.users.remove({ name: "Alice" })

Scaling:

 NoSQL databases generally scale horizontally. Data is distributed across multiple nodes to handle increased
load.
 Examples include sharding (MongoDB) or partitioning (Cassandra) where data is divided among multiple
machines.

Backup and Recovery:

 Most NoSQL databases offer tools for backup and data recovery, often focused on scalability and high
availability. In MongoDB, for example, mongodump and mongorestore are commonly used for backups.

Replication:

 NoSQL databases often support replication, ensuring high availability and fault tolerance. For example,
MongoDB uses replica sets, and Cassandra uses a peer-to-peer replication model.

Consistency Models:

 NoSQL databases often offer eventual consistency instead of the strong consistency guaranteed by relational databases. This helps them scale efficiently in distributed environments.
 Some NoSQL systems, like Cassandra, allow you to configure consistency levels (e.g., ONE, QUORUM, ALL)
to control the trade-off between consistency and availability.

Example: MongoDB Query and Management

 Querying a Document:

db.users.find({ name: "Alice" })

 Inserting a Document:

db.users.insert({ name: "Alice", age: 30 })


 Updating a Document:

db.users.update({ name: "Alice" }, { $set: { age: 31 } })

 Deleting a Document:

db.users.remove({ name: "Alice" })

 Aggregation:

db.orders.aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customerId", totalAmount: { $sum: "$amount" } } }
])

Best Practices for Managing NoSQL Databases:

1. Schema Design: Plan your data model carefully, as it impacts your queries' performance.
2. Indexing: Create appropriate indexes to speed up search operations.
3. Sharding and Partitioning: Use horizontal scaling strategies to handle large volumes of data.
4. Monitoring: Use monitoring tools to track performance and optimize resource usage.
5. Backups and Disaster Recovery: Always have a backup plan and recovery strategy in place.

8.5 Indexing Data Sets

Indexing and ordering data sets are critical concepts for improving the performance of queries in databases,
especially when dealing with large volumes of data. Both NoSQL and relational databases use indexing to speed
up retrieval operations, and ordering helps structure data to meet the specific requirements of an application.

Why Indexing is Important:

Faster Queries: Without indexes, a database may need to scan every record to find a match for a query, which
can be very slow. With indexes, queries that would otherwise require a full scan can be answered in constant
time (or logarithmic time, depending on the index type).

Optimized Performance: Indexing is especially useful when querying large datasets with non-sequential access
patterns.

Types of Indexes in NoSQL Databases:

Single-Field Indexes:

Indexes a single field in a document or table.

db.users.createIndex({ age: 1 }) // 1 for ascending, -1 for descending

Compound Indexes:

Indexes multiple fields together. This is useful when queries often involve more than one field.

db.users.createIndex({ name: 1, age: -1 })

Multikey Indexes:

Used for indexing arrays. This allows you to index each element of an array within a document.

db.users.createIndex({ hobbies: 1 })
Text Indexes:

Full-text search indexes. They allow efficient searching over text fields.

db.articles.createIndex({ content: "text" })
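
Once a text index exists, a $text query can search the indexed field; the search term below is purely illustrative.

db.articles.find({ $text: { $search: "database indexing" } })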

Geospatial Indexes:

Special indexes used for querying geographical data (latitude/longitude points).

db.places.createIndex({ location: "2dsphere" })
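
With the 2dsphere index in place, a $near query can, for example, find places within a given distance of a point; the coordinates and distance below are illustrative values.

db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [ 77.59, 12.97 ] },   // [longitude, latitude]
      $maxDistance: 1000                                             // metres
    }
  }
})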

Hashed Indexes:

Index fields by hashing their values; useful for exact-match lookups and for hashed sharding, but not for range queries. (Key-value stores such as Redis hash keys internally for the same reason.)

db.users.createIndex({ name: "hashed" })

Example of Indexing in MongoDB:

Creating an Index:

db.users.createIndex({ age: 1 }) // Ascending order index on the age field

Querying with an Index:

db.users.find({ age: { $gt: 30 } }) // Faster query with an index on the 'age' field

Drop an Index:

db.users.dropIndex({ age: 1 })

Considerations for Indexing:

Storage Overhead: Indexes consume additional disk space. Each index you create requires storage and adds
overhead during insert/update operations.

Write Performance: Indexes can slow down write operations because the index has to be updated whenever the
data changes.

Choice of Fields: You should only index fields that are frequently queried or used in sorting. Over-indexing can
negatively impact performance.

8.6 Ordering Data Sets

Ordering refers to arranging the data in a specific sequence (ascending or descending) based on one or more
fields, and it's essential for queries that require sorted results.

How Ordering Works:

Single Field Ordering:

Data can be ordered by a single field. For example, sorting by age in ascending order.

db.users.find().sort({ age: 1 }) // Ascending order by 'age'


db.users.find().sort({ age: -1 }) // Descending order by 'age'

Compound Ordering:
Ordering by multiple fields. MongoDB, for instance, allows sorting by more than one field, which is useful when
sorting by primary and secondary criteria.

db.users.find().sort({ age: 1, name: -1 }) // Ascending by 'age' and descending by 'name'

Ordering with Index:

When an index is created on the field(s) used for sorting, the query can be executed much faster, because the
data is already ordered in the index structure.

db.users.createIndex({ age: 1 }) // Index for faster sorting by 'age'


db.users.find().sort({ age: 1 }) // Faster query with the index in place

Limit and Skip:

After ordering the data, you can use limit to restrict the number of results returned and skip to paginate through
results.

db.users.find().sort({ age: 1 }).skip(10).limit(5) // Skip 10 documents and return the next 5

Example: MongoDB Sorting and Indexing

Creating an Index for Sorting:

db.users.createIndex({ age: 1 })

Querying and Sorting:

db.users.find().sort({ age: 1 }) // Sorted by age in ascending order

Using Compound Indexes for Sorting:

db.users.createIndex({ age: 1, name: 1 }) // Compound index on age and name


db.users.find().sort({ age: 1, name: 1 }) // Efficient sort with compound index

Ordering Considerations:

Performance: Sorting large data sets without proper indexing can lead to slower query performance. Indexes on
the sorted fields can significantly reduce the time required for sorting.

Consistency: In distributed NoSQL systems, ensuring consistency of sorted data (especially during updates)
can sometimes be a challenge.

Combining Indexing and Ordering:


Often, indexing and ordering are used together to improve the efficiency of queries that require both filtering and
sorting.

Example:

Querying and Sorting with Index: If you need to retrieve all users older than 30 and sort them by their names,
first, create an index on both age and name for optimal performance:

db.users.createIndex({ age: 1, name: 1 })


db.users.find({ age: { $gt: 30 } }).sort({ name: 1 })
This approach makes the query efficient: the compound index satisfies the filter on age, and the sort on name is served from the index structure rather than performed in memory.

8.7 NOSQL in Cloud

NoSQL databases in the cloud offer a flexible, scalable, and cost-effective solution for handling large amounts
of unstructured or semi-structured data. Cloud-based NoSQL solutions benefit from cloud infrastructure,
offering ease of deployment, scalability, high availability, and managed services that reduce the operational
burden for developers.

Key Benefits of NoSQL in the Cloud

Scalability: Cloud-based NoSQL databases can scale horizontally, allowing you to handle large volumes of data
and traffic by adding more resources (e.g., servers or nodes) as needed. This is particularly beneficial for
applications with unpredictable or rapidly growing data.

High Availability and Fault Tolerance: Cloud NoSQL databases typically offer built-in replication and failover
mechanisms, ensuring that data is highly available even in the event of hardware or network failures.

Managed Services: Many cloud providers offer fully managed NoSQL databases, which means the cloud
provider takes care of operational tasks like backups, patching, updates, and scaling. This allows developers to
focus on application development rather than managing infrastructure.

Cost-Effective: Cloud providers often offer pay-as-you-go pricing models, so you only pay for the resources you use, which can be more cost-effective than maintaining on-premise infrastructure.

Global Distribution: Cloud NoSQL databases can be distributed across multiple regions, improving
performance for users worldwide and providing low-latency access to data.

Integration with Cloud Ecosystem: NoSQL databases in the cloud are tightly integrated with other cloud
services (e.g., machine learning, analytics, serverless computing), enabling rich functionality and easy
integration with other components of your cloud infrastructure.

Popular Cloud NoSQL Databases

Here are some of the most commonly used NoSQL databases in the cloud:

1. Amazon DynamoDB (AWS)

 Type: Key-Value and Document Store


 Description: DynamoDB is a fully managed, serverless NoSQL database that provides fast and predictable
performance with seamless scalability. It is designed for applications that require low-latency data access
at any scale.
 Key Features:
 Fully managed, serverless, with automatic scaling
 Built-in security (encryption at rest, access control)
 Global tables for multi-region replication
 Supports both key-value and document data models
 Consistent or eventual consistency options for reads
 Use Cases: E-commerce, gaming, mobile apps, IoT, real-time analytics

2. Azure Cosmos DB (Microsoft Azure)

 Type: Multi-Model (supports key-value, document, column-family, and graph models)


 Description: Cosmos DB is a globally distributed, multi-model NoSQL database service designed for
mission-critical applications that require low-latency access to data at a global scale.
 Key Features:
 Multi-model support (supports document, key-value, graph, and column-family models)
 Global distribution with multi-region replication
 Tunable consistency levels (strong, bounded staleness, session, eventual)
 Fully managed with automatic scaling and performance optimization
 Integration with Azure services (e.g., Azure Functions, Logic Apps)
 Use Cases: IoT, gaming, personalized content, global applications

3. Google Cloud Firestore (Google Cloud)

 Type: Document Store


 Description: Firestore is a flexible, scalable NoSQL cloud database from Google Firebase. It is designed for
web and mobile applications, offering real-time synchronization and offline capabilities.
 Key Features:
 Real-time updates (perfect for live apps like messaging or collaborative apps)
 Multi-region replication for high availability
 Automatic scaling, serverless architecture
 Strong consistency and ACID transactions
 Native integration with Firebase for mobile apps
 Use Cases: Real-time applications, mobile apps, web apps, gaming, and social apps

4. MongoDB Atlas (MongoDB on the Cloud)

 Type: Document Store


 Description: MongoDB Atlas is the fully managed version of the popular MongoDB database, designed for
high availability and scalability in the cloud. It is available on AWS, Google Cloud, and Azure.
 Key Features:
 Fully managed with automated backups, scaling, and updates
 Global distribution with multi-region replication
 Advanced querying, indexing, and aggregation
 Integration with various cloud analytics and machine learning tools
 Flexible schema with JSON-like documents
 Use Cases: Content management, e-commerce, real-time analytics, mobile and web applications

5. Couchbase Cloud (Couchbase on the Cloud)

 Type: Document Store


 Description: Couchbase Cloud offers a fully managed NoSQL service that provides scalability, performance,
and flexibility for high-performance applications. It supports both key-value and document stores.
 Key Features:
 Multi-cloud, fully managed with global deployment options
 High performance with integrated caching and indexing
 Multi-dimensional scaling (separate scaling for data, query, and index)
 N1QL (SQL-like query language) for querying JSON data
 Built-in full-text search, eventing, and analytics
 Use Cases: Mobile applications, IoT, retail, personalization, and real-time data processing

6. Amazon DocumentDB (with MongoDB compatibility)

 Type: Document Store


 Description: Amazon DocumentDB is a managed NoSQL database service that is compatible with MongoDB
workloads. It allows you to run MongoDB applications without managing the database infrastructure.
 Key Features:
 Fully managed, scalable MongoDB-compatible database
 Automated backups and scaling
 Integration with other AWS services (e.g., Lambda, S3)
 Supports rich querying and indexing similar to MongoDB
 Use Cases: Applications using MongoDB, content management, and IoT

Key Features of Cloud-Based NoSQL Databases

1. Serverless and Managed: Cloud NoSQL databases often provide serverless models where you don’t need
to manage the infrastructure, and the database automatically scales based on usage.
2. Global Distribution: Most cloud NoSQL databases allow you to deploy databases across multiple regions,
improving latency and availability for global applications.
3. Automatic Scaling: These databases can automatically adjust resources based on the demand. Whether
you need to scale vertically or horizontally, the cloud provider manages it for you.
4. Replication and Fault Tolerance: Cloud databases often come with built-in data replication and fault
tolerance, ensuring high availability and data durability.
5. Security: Cloud providers implement robust security mechanisms such as encryption (both at rest and in
transit), access control, and identity management (IAM).
6. Integrated Analytics: Many cloud NoSQL databases integrate with cloud-based analytics platforms,
enabling real-time processing, machine learning, and business intelligence directly on the data.

Best Practices for Using NoSQL in the Cloud

1. Choose the Right Database Model: Cloud providers often offer multiple NoSQL database options
(document, key-value, column-family, graph, etc.). It's essential to select the model that best matches your
application’s data and query patterns.
2. Design for Scalability: When building applications that will leverage cloud NoSQL, design for scalability by
utilizing partitioning, sharding, and replication effectively to ensure your database can handle growing data
volumes.
3. Optimize for Cost: Use cloud pricing calculators to estimate the cost based on your expected usage.
Consider using auto-scaling and on-demand pricing models to optimize your costs.
4. Monitor Performance: Set up monitoring and logging to track performance, identify bottlenecks, and
optimize queries. Cloud services typically offer built-in monitoring tools to help with this.
5. Leverage Cloud Ecosystem Integration: Take advantage of other cloud services, such as analytics, machine
learning, and security tools, which can integrate seamlessly with cloud NoSQL databases to enhance
functionality.

-- All the best --
