
Unit 3: Types of Keys & Data Integrity

3.1. Keys: Super Key, Candidate Key, Primary Key, Alternate Key, Foreign Key

Different Types of SQL Keys

A key is a single field or a combination of multiple fields in a table. It is used to fetch or
retrieve records (data rows) from a table according to a condition or requirement.
Keys are also used to create relationships among different database tables or views.

Types of SQL Keys

We have the following types of keys in SQL, which are used to fetch records from tables
and to create relationships among tables or views.

1. Super Key

A Super Key is a set of one or more keys that can be used to identify a
record uniquely in a table. For example, Primary Key, Unique Key and Alternate Key
are subsets of super keys.

2. Candidate Key

A Candidate Key is a set of one or more fields/columns that can identify a
record uniquely in a table. There can be multiple candidate keys in one table, and
each candidate key can work as a primary key.
Example: In the Student table defined below, ID, RollNo and EnrollNo are candidate keys, since
any of these three fields can work as the primary key.

3. Primary Key

A Primary Key is a set of one or more fields/columns of a table that uniquely
identifies a record in a database table. It cannot accept NULL or duplicate values. Only
one candidate key can become the primary key.

4. Alternate key

An Alternate Key is a key that can work as a primary key; basically, it is a
candidate key that is currently not the primary key.
Example: In the Student table below, RollNo and EnrollNo become alternate keys when
we define ID as the primary key.
5. Composite/Compound Key

A Composite Key is a combination of more than one field/column of a table. It
can be a candidate key or a primary key.

6. Unique Key

A Unique Key is a set of one or more fields/columns of a table that uniquely
identify a record in a database table. It is like a primary key, but it can accept only
one NULL value and it cannot have duplicate values.
7. Foreign Key

A Foreign Key is a field in a database table that is the primary key of another table. It can
accept multiple NULL and duplicate values.

--Department Table
CREATE TABLE Department
(
DeptID int PRIMARY KEY, --primary key
Name varchar (50) NOT NULL,
Address varchar (200) NOT NULL
)
--Student Table
CREATE TABLE Student
(
ID int PRIMARY KEY, --primary key
RollNo varchar(10) NOT NULL,
Name varchar(50) NOT NULL,
EnrollNo varchar(50) UNIQUE, --unique key
Address varchar(200) NOT NULL,
DeptID int FOREIGN KEY REFERENCES Department(DeptID) --foreign key
)

Practically, in a database we have only three types of keys: Primary Key, Unique Key
and Foreign Key. The other types of keys are DBMS concepts which you should
know.

3.2. Constraints

Integrity Constraints

SQL Constraints

SQL constraints are used to specify rules for the data in a table.

Constraints are used to limit the type of data that can go into a table. This ensures
the accuracy and reliability of the data in the table. If there is any violation between
the constraint and the data action, the action is aborted.

Constraints can be column level or table level. Column level constraints apply to a
column, and table level constraints apply to the whole table.
The following constraints are commonly used in SQL:

 NOT NULL - Ensures that a column cannot have a NULL value
 UNIQUE - Ensures that all values in a column are different
 PRIMARY KEY - A combination of NOT NULL and UNIQUE; uniquely
identifies each row in a table
 FOREIGN KEY - Links a row/record in one table to a row in another table and prevents
actions that would destroy the link
 CHECK - Ensures that all values in a column satisfy a specific condition
 DEFAULT - Sets a default value for a column when no value is specified
 INDEX - Used to create and retrieve data from the database very quickly
A short example of the CHECK, DEFAULT and INDEX constraints follows this list.
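A minimal sketch of the CHECK, DEFAULT and INDEX constraints, reusing the Persons table from
the examples below (the City column and the index name idx_lastname are assumed for illustration):

CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
Age int CHECK (Age >= 18),           -- CHECK: age must be at least 18
City varchar(255) DEFAULT 'Pune'     -- DEFAULT: used when no city is supplied
);

CREATE INDEX idx_lastname ON Persons (LastName);   -- INDEX: speeds up lookups on LastName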

SQL NOT NULL Constraint

 By default, a column can hold NULL values.


 The NOT NULL constraint enforces a column to NOT accept NULL values.
 This enforces a field to always contain a value, which means that you cannot
insert a new record, or update a record without adding a value to this field.
 The following SQL ensures that the "ID", "LastName", and "FirstName"
columns will NOT accept NULL values:

Example

CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) NOT NULL,
Age int
);

If the table has already been created, you can add a NOT NULL constraint to a
column with the ALTER TABLE statement.
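The exact ALTER TABLE syntax varies by DBMS; a sketch assuming the Persons table above:

-- MySQL style:
ALTER TABLE Persons
MODIFY Age int NOT NULL;

-- SQL Server style:
ALTER TABLE Persons
ALTER COLUMN Age int NOT NULL;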

SQL UNIQUE Constraint

The UNIQUE constraint ensures that all values in a column are different.

Both the UNIQUE and PRIMARY KEY constraints provide a guarantee for uniqueness
for a column or set of columns.

A PRIMARY KEY constraint automatically has a UNIQUE constraint.


However, you can have many UNIQUE constraints per table, but only one PRIMARY
KEY constraint per table.

CREATE TABLE Persons (
ID int NOT NULL UNIQUE,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int
);
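To name a UNIQUE constraint, or to define it on a combination of columns, an ALTER TABLE
statement can be used (the constraint name UC_Person is assumed for illustration):

ALTER TABLE Persons
ADD CONSTRAINT UC_Person UNIQUE (ID, LastName);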

SQL PRIMARY KEY Constraint

The PRIMARY KEY constraint uniquely identifies each record in a database table.

Primary keys must contain UNIQUE values, and cannot contain NULL values.

A table can have only one primary key, which may consist of single or multiple
fields.

CREATE TABLE Persons (
ID int NOT NULL PRIMARY KEY,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int
);
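When the primary key consists of multiple fields, it is written as a table-level constraint; a sketch
with a named composite primary key (the name PK_Person is assumed for illustration):

CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
CONSTRAINT PK_Person PRIMARY KEY (ID, LastName)
);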

DROP a PRIMARY KEY Constraint

To drop a PRIMARY KEY constraint, use the following SQL:

ALTER TABLE Persons
DROP PRIMARY KEY;
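The statement above uses MySQL syntax; in SQL Server, Oracle and MS Access the primary key is
dropped by its constraint name (PK_Person is assumed for illustration):

ALTER TABLE Persons
DROP CONSTRAINT PK_Person;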
SQL FOREIGN KEY Constraint

A FOREIGN KEY is a key used to link two tables together.

A FOREIGN KEY is a field (or collection of fields) in one table that refers to the
PRIMARY KEY in another table.

The table containing the foreign key is called the child table, and the table
containing the candidate key is called the referenced or parent table.

Look at the following two tables:

"Persons" table:

"Orders" table:

Notice that the "PersonID" column in the "Orders" table points to the "PersonID"
column in the "Persons" table.

The "PersonID" column in the "Persons" table is the PRIMARY KEY in the "Persons"
table.

The "PersonID" column in the "Orders" table is a FOREIGN KEY in the "Orders"
table.
The FOREIGN KEY constraint is used to prevent actions that would destroy links
between tables.

The FOREIGN KEY constraint also prevents invalid data from being inserted into the
foreign key column, because it has to be one of the values contained in the table it
points to.

SQL FOREIGN KEY on CREATE TABLE

The following SQL creates a FOREIGN KEY on the "PersonID" column when the
"Orders" table is created:

CREATE TABLE Orders (
OrderID int NOT NULL,
OrderNumber int NOT NULL,
PersonID int,
PRIMARY KEY (OrderID),
FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
);
SQL FOREIGN KEY on ALTER TABLE

To create a FOREIGN KEY constraint on the "PersonID" column when the "Orders"
table is already created, use the following SQL:

ALTER TABLE Orders
ADD FOREIGN KEY (PersonID) REFERENCES Persons(PersonID);

There are four kinds of integrity constraints: domain integrity, entity integrity, referential integrity
and the foreign key integrity constraints.

Domain Integrity
Domain integrity means the definition of a valid set of values for an attribute. For an attribute you
define
- the data type,
- the length or size,
- whether a null value is allowed,
- whether the value must be unique or not.
For example, the model_id field in a Car table may be allowed to have a null value, which means that
the model of that car is not known (see the sketch below).
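A minimal sketch of such domain definitions, assuming a hypothetical Car table in which model_id
may be null:

CREATE TABLE Car
(
reg_no varchar(10) NOT NULL UNIQUE,   -- data type, length, uniqueness, no nulls
colour varchar(20) NOT NULL,          -- nulls not allowed
model_id int                          -- null allowed: the model may be unknown
);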

Foreign Key Integrity Constraint


There are two foreign key integrity constraints: cascade update related fields and
cascade delete related rows. These constraints affect the referential integrity
constraint, as illustrated in the sketch below.
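A sketch of how these cascade options are declared, using a simplified variant of the Department
and Student tables defined earlier (note that not every DBMS supports ON UPDATE CASCADE):

CREATE TABLE Student
(
ID int PRIMARY KEY,
Name varchar(50) NOT NULL,
DeptID int,
FOREIGN KEY (DeptID) REFERENCES Department(DeptID)
ON DELETE CASCADE   -- deleting a department deletes its students
ON UPDATE CASCADE   -- changing a DeptID updates matching student rows
);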

Prime and Non Prime Attributes in DBMS with Example –


 Prime Attributes – Attributes that belong to any candidate key are
called prime attributes
(the union of all candidate key attributes: {CK1 ∪ CK2 ∪ CK3 ∪ ……}).
If a prime attribute is determined by some other attribute set, then more than one
candidate key is possible. For example,
if A is a candidate key and X→A, then X is also a key (a candidate key if it is minimal).
 Non-Prime Attributes – Attributes that do not belong to any
candidate key are called non-prime attributes.

Given a relation R(ABCDE) having FDs {A → BC, CD → E, B → D, E → A}, identify
the prime attributes and non-prime attributes.
Solution:

(A)+ : {ABCDE} ⇒ A is a candidate key

(E)+ : {ABCDE} ⇒ E is a candidate key (E → A, and A determines the rest)

(BC)+ : {ABCDE} ⇒ BC is a candidate key (B → D, then CD → E, then E → A)

(CD)+ : {ABCDE} ⇒ CD is a candidate key (CD → E, then E → A, then A → BC)

⇒ Candidate Keys {A, E, BC, CD}

⇒ Prime Attributes {A, B, C, D, E}

⇒ Non Prime Attributes: none (every attribute of R belongs to some candidate key)


PL/SQL Recursive Functions
We have seen that a program or subprogram may call another subprogram. When a subprogram calls
itself, it is referred to as a recursive call and the process is known as recursion.
To illustrate the concept, let us calculate the factorial of a number. Factorial of a number n is defined as

n! = n*(n-1)!
= n*(n-1)*(n-2)!
...
= n*(n-1)*(n-2)*(n-3)... 1
The following program calculates the factorial of a given number by calling itself recursively −
DECLARE
num number;
factorial number;

FUNCTION fact(x number)


RETURN number
IS
f number;
BEGIN
IF x=0 THEN
f := 1;
ELSE
f := x * fact(x-1);
END IF;
RETURN f;
END;

BEGIN
num:= 6;
factorial := fact(num);
dbms_output.put_line(' Factorial '|| num || ' is ' || factorial);
END;
/
When the above code is executed at the SQL prompt, it produces the following result −
Factorial 6 is 720

PL/SQL procedure successfully completed.

PL/SQL - Cursors
Oracle creates a memory area, known as the context area, for processing an SQL statement, which
contains all the information needed for processing the statement; for example, the number of rows
processed, etc.
A cursor is a pointer to this context area. PL/SQL controls the context area through a cursor. A cursor
holds the rows (one or more) returned by a SQL statement. The set of rows the cursor holds is referred to
as the active set.
You can name a cursor so that it could be referred to in a program to fetch and process the rows returned
by the SQL statement, one at a time. There are two types of cursors −
 Implicit cursors
 Explicit cursors
Implicit Cursors
Implicit cursors are automatically created by Oracle whenever an SQL statement is executed, when there
is no explicit cursor for the statement. Programmers cannot control the implicit cursors and the
information in it.
Whenever a DML statement (INSERT, UPDATE and DELETE) is issued, an implicit cursor is
associated with this statement. For INSERT operations, the cursor holds the data that needs to be
inserted. For UPDATE and DELETE operations, the cursor identifies the rows that would be affected.
In PL/SQL, you can refer to the most recent implicit cursor as the SQL cursor, which always has
attributes such as %FOUND, %ISOPEN, %NOTFOUND, and %ROWCOUNT. The SQL cursor has
additional attributes, %BULK_ROWCOUNT and %BULK_EXCEPTIONS, designed for use with
the FORALL statement. The following table provides the description of the most used attributes −
%FOUND
Returns TRUE if an INSERT, UPDATE, or DELETE statement affected one or more rows or a SELECT
INTO statement returned one or more rows. Otherwise, it returns FALSE.
%NOTFOUND
The logical opposite of %FOUND. It returns TRUE if an INSERT, UPDATE, or DELETE statement
affected no rows, or a SELECT INTO statement returned no rows. Otherwise, it returns FALSE.
%ISOPEN
Always returns FALSE for implicit cursors, because Oracle closes the SQL cursor automatically after
executing its associated SQL statement.
%ROWCOUNT
Returns the number of rows affected by an INSERT, UPDATE, or DELETE statement, or returned by a
SELECT INTO statement.
Any SQL cursor attribute will be accessed as sql%attribute_name as shown below in the example.
Example
The following program will update the table and increase the salary of each customer by 500 and use
the SQL%ROWCOUNT attribute to determine the number of rows affected −
DECLARE
total_rows number(2);
BEGIN
UPDATE customers
SET salary = salary + 500;
IF sql%notfound THEN
dbms_output.put_line('no customers selected');
ELSIF sql%found THEN
total_rows := sql%rowcount;
dbms_output.put_line( total_rows || ' customers selected ');
END IF;
END;
When the above code is executed at the SQL prompt, it produces the following result −
6 customers selected

PL/SQL procedure successfully completed.

Explicit Cursors
Explicit cursors are programmer-defined cursors for gaining more control over the context area. An
explicit cursor should be defined in the declaration section of the PL/SQL Block. It is created on a
SELECT Statement which returns more than one row.
The syntax for creating an explicit cursor is −
CURSOR cursor_name IS select_statement;
Working with an explicit cursor includes the following steps −
 Declaring the cursor for initializing the memory
 Opening the cursor for allocating the memory
 Fetching the cursor for retrieving the data
 Closing the cursor to release the allocated memory
Declaring the Cursor
Declaring the cursor defines the cursor with a name and the associated SELECT statement. For example

CURSOR c_customers IS
SELECT id, name, address FROM customers;
Opening the Cursor
Opening the cursor allocates the memory for the cursor and makes it ready for fetching the rows returned
by the SQL statement into it. For example, we will open the above defined cursor as follows −
OPEN c_customers;
Fetching the Cursor
Fetching the cursor involves accessing one row at a time. For example, we will fetch rows from the
above-opened cursor as follows −
FETCH c_customers INTO c_id, c_name, c_addr;
Closing the Cursor
Closing the cursor means releasing the allocated memory. For example, we will close the above-opened
cursor as follows −
CLOSE c_customers;
Example
Following is a complete example to illustrate the concepts of explicit cursors −
DECLARE
c_id customers.id%type;
c_name customers.name%type;
c_addr customers.address%type;
CURSOR c_customers is
SELECT id, name, address FROM customers;
BEGIN
OPEN c_customers;
LOOP
FETCH c_customers into c_id, c_name, c_addr;
EXIT WHEN c_customers%notfound;
dbms_output.put_line(c_id || ' ' || c_name || ' ' || c_addr);
END LOOP;
CLOSE c_customers;
END;
When the above code is executed at the SQL prompt, it produces the following result −
1 Ramesh Ahmedabad
2 Khilan Delhi
3 kaushik Kota
4 Chaitali Mumbai
5 Hardik Bhopal
6 Komal MP

PL/SQL procedure successfully completed.


The following example (a fragment on PL/SQL records) declares a record type books, two record
variables and a procedure that prints a record (the field sizes in the record type are assumed, since
the original declaration was not included):

DECLARE
type books is record
(title varchar(50),
author varchar(50),
subject varchar(100),
book_id number);
book1 books;
book2 books;

PROCEDURE printbook (book books) IS
BEGIN
dbms_output.put_line ('Book title : ' || book.title);
dbms_output.put_line('Book author : ' || book.author);
dbms_output.put_line( 'Book subject : ' || book.subject);
dbms_output.put_line( 'Book book_id : ' || book.book_id);
END;

BEGIN
-- Book 1 specification
book1.title := 'C Programming';
book1.author := 'Nuha Ali ';
book1.subject := 'C Programming Tutorial';
book1.book_id := 6495407;

-- Book 2 specification
book2.title := 'Telecom Billing';
book2.author := 'Zara Ali';
book2.subject := 'Telecom Billing Tutorial';
book2.book_id := 6495700;

-- Use procedure to print book info


printbook(book1);
printbook(book2);
END;
/

PL/SQL - Triggers
Triggers are stored programs, which are automatically executed or fired when some events occur.
Triggers are, in fact, written to be executed in response to any of the following events −
 A database manipulation (DML) statement (DELETE, INSERT, or UPDATE)
 A database definition (DDL) statement (CREATE, ALTER, or DROP).
 A database operation (SERVERERROR, LOGON, LOGOFF, STARTUP, or SHUTDOWN).
Triggers can be defined on the table, view, schema, or database with which the event is associated.

Benefits of Triggers
Triggers can be written for the following purposes −
 Generating some derived column values automatically
 Enforcing referential integrity
 Event logging and storing information on table access
 Auditing
 Synchronous replication of tables
 Imposing security authorizations
 Preventing invalid transactions

Creating Triggers
The syntax for creating a trigger is −
CREATE [OR REPLACE ] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF }
{INSERT [OR] | UPDATE [OR] | DELETE}
[OF col_name]
ON table_name
[REFERENCING OLD AS o NEW AS n]
[FOR EACH ROW]
WHEN (condition)
DECLARE
Declaration-statements
BEGIN
Executable-statements
EXCEPTION
Exception-handling-statements
END;

Where,
 CREATE [OR REPLACE] TRIGGER trigger_name − Creates or replaces an existing trigger with
the trigger_name.
 {BEFORE | AFTER | INSTEAD OF} − This specifies when the trigger will be executed. The
INSTEAD OF clause is used for creating trigger on a view.
 {INSERT [OR] | UPDATE [OR] | DELETE} − This specifies the DML operation.
 [OF col_name] − This specifies the column name that will be updated.
 [ON table_name] − This specifies the name of the table associated with the trigger.
 [REFERENCING OLD AS o NEW AS n] − This allows you to refer to the new and old values for
various DML statements, such as INSERT, UPDATE, and DELETE.
 [FOR EACH ROW] − This specifies a row-level trigger, i.e., the trigger will be executed for each
row being affected. Otherwise the trigger will execute just once when the SQL statement is
executed, which is called a table level trigger.
 WHEN (condition) − This provides a condition for rows for which the trigger would fire. This
clause is valid only for row-level triggers.

Example
The following program creates a row-level trigger for the customers table that would fire for INSERT or
UPDATE or DELETE operations performed on the CUSTOMERS table. This trigger will display the
salary difference between the old values and new values –

CREATE OR REPLACE TRIGGER display_salary_changes


BEFORE DELETE OR INSERT OR UPDATE ON customers
FOR EACH ROW
WHEN (NEW.ID > 0)
DECLARE
sal_diff number;
BEGIN
sal_diff := :NEW.salary - :OLD.salary;
dbms_output.put_line('Old salary: ' || :OLD.salary);
dbms_output.put_line('New salary: ' || :NEW.salary);
dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/
When the above code is executed at the SQL prompt, it produces the following result −
Trigger created.

Triggering a Trigger
Let us perform some DML operations on the CUSTOMERS table. Here is one INSERT statement, which
will create a new record in the table −
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (7, 'Kriti', 22, 'HP', 7500.00 );
When a record is created in the CUSTOMERS table, the above trigger,
display_salary_changes, will be fired and it will display the following result −
Old salary:
New salary: 7500
Salary difference:
Because this is a new record, old salary is not available and the above result comes as null. Let us now
perform one more DML operation on the CUSTOMERS table. The UPDATE statement will update an
existing record in the table −
UPDATE customers
SET salary = salary + 500
WHERE id = 2;
When a record is updated in the CUSTOMERS table, the above trigger,
display_salary_changes, will be fired and it will display the following result −
Old salary: 1500
New salary: 2000
Salary difference: 500
1.3.1 Embedded SQL
The embedded SQL statements can be put in the application program written in C,
Java or any other host language. These statements are sometimes called static.
Why are they called static? The term 'static' is used to indicate that the embedded
SQL commands, which are written in the host program, do not change automatically
during the lifetime of the program. Thus, such queries are determined at the time of
database application design. For example, a query statement embedded in C to
determine the status of train booking for a train will not change. However, this query
may be executed for many different trains. Please note that it will only change the
input parameter to the query that is train-number, date of boarding, etc., and not the
query itself.

But how is such embedding done? Let us explain this with the help of an example.

Example: Write a C program segment that prints the details of a student whose
enrolment number is input.
Let us assume the relation
STUDENT (enrolno:char(9), name:Char(25), phone:integer(12), prog-code:char(3))
/* add proper include statements*/
/*declaration in C program */
EXEC SQL BEGIN DECLARE SECTION;
char enrolno[10], name[26], p-code[4];
int phone;
int SQLCODE;
char SQLSTATE[6];
EXEC SQL END DECLARE SECTION;
/* The connection needs to be established with SQL */
/* program segment for the required function */
printf ("enter the enrolment number of the student");
scanf ("%s", enrolno);
EXEC SQL
SELECT name, phone, prog-code INTO
:name, :phone, :p-code
FROM STUDENT
WHERE enrolno = :enrolno;
if (SQLCODE == 0)
printf ("%s, %s, %d, %s", enrolno, name, phone, p-code);
else
printf ("Wrong Enrolment Number");

Please note the following points in the program above:

 The program is written in the host language 'C' and contains embedded SQL
statements.
 Although only an SQL query (SELECT) has been embedded in this program, you can
embed any DML, DDL or view statement.
 The distinction between an SQL statement and host language statement is made
by using the key word EXEC SQL; thus, this key word helps in identifying the
Embedded SQL statements by the pre-compiler.
 Please note that the statements including EXEC SQL are terminated by a
semi-colon (;).
 As the data is to be exchanged between a host language and a database, there is
a need of shared variables that are shared between the two environments. Please
note that enrolno[10], name[26], p-code[4], etc. are shared variables declared in 'C'.
 Please note that the shared host variable enrolno is declared to have char[10],
whereas the SQL attribute enrolno has only char(9). Why? Because in 'C' the
conversion to a string includes a '\0' as the end of the string.
 The type mapping between 'C' and SQL types is defined in the following table:

'C' TYPE        SQL TYPE
long            INTEGER
short           SMALLINT
float           REAL
double          DOUBLE
char[i+1]       CHAR(i)

 Please also note that these shared variables are used in SQL statements of the
program. They are prefixed with the colon (:) to distinguish them from
database attribute and relation names. However, they are used without this
prefix in any C language statement.
 Please also note that these shared variables have almost the same names (except
p-code) as the corresponding attribute names of the database. The prefix colon (:) thus
distinguishes whether we are referring to the shared host variable or to an SQL
attribute. Using such similar names is a good programming convention, as it helps in
identifying the related attribute easily.
 Please note that the shared variables are declared between BEGIN DECLARE
SECTION and END DECLARE SECTION and their type is defined in the 'C'
language.

Two more shared variables have been declared in 'C'. These are:

 SQLCODE as int
 SQLSTATE as char of size 6
 These variables are used to communicate errors and exception conditions
between the database and the host language program. The value 0 in
SQLCODE means successful execution of an SQL command. A value of
SQLCODE = 100 means 'no more data'. A value of SQLCODE less than 0
indicates an error. Similarly, SQLSTATE is a 5-character code; the 6th character is for
'\0' in the host language 'C'. The value "00000" in SQLSTATE indicates no
error. You can refer to the SQL standard for more information.
 In order to execute the required SQL command, a connection with the database
server needs to be established by the program. For this, the following SQL
statement is used:
CONNECT <name of the server> AS <name of the connection>
AUTHORISATION <username, password>;
To disconnect, we can simply say
DISCONNECT <name of the connection>;

However, these statements need to be checked against the commercial database
management system which you are using.

Execution of SQL query in the given program: To create the SQL query, first, the
given value of enrolment number is transferred to SQL attribute value, the query then
is executed and the result, which is a single tuple in this case, is transferred to shared
host variables as indicated by the key word INTO after the SELECT statement.

The SQL query runs as a standard SQL query except the use of shared host variables.
Rest of the C program has very simple logic and will print the data of the students
whose enrolment number has been entered.

The following segment uses an embedded SQL cursor to update the grade of students of a given
programme (assuming that grade has also been declared as a shared host variable):

printf ("enter the programme code");
scanf ("%s", p-code);
EXEC SQL DECLARE GUPDATE CURSOR FOR
SELECT enrolno, name, phone, grade
FROM STUDENT
WHERE progcode = :p-code
FOR UPDATE OF grade;
EXEC SQL OPEN GUPDATE;
EXEC SQL FETCH FROM GUPDATE
INTO :enrolno, :name, :phone, :grade;
while (SQLCODE == 0) {
printf ("enter grade for enrolment number %s", enrolno);
scanf ("%c", &grade);
EXEC SQL
UPDATE STUDENT
SET grade = :grade
WHERE CURRENT OF GUPDATE;
EXEC SQL FETCH FROM GUPDATE
INTO :enrolno, :name, :phone, :grade;
}
EXEC SQL CLOSE GUPDATE;

 Please note that the declared section remains almost the same. The cursor is
declared to contain the output of the SQL statement. Please notice that in this
case, there will be many tuples of students database, which belong to a
particular programme.
 The purpose of the cursor is also indicated during the declaration of the cursor.
 The cursor is then opened and the first tuple is fetched into the shared host variables,
followed by an SQL query to update the required record. Please note the use of
CURRENT OF which states that these updates are for the current tuple referred
to by the cursor.
 WHILE Loop is checking the SQLCODE to ascertain whether more tuples are
pending in the cursor.
 Please note the SQLCODE will be set by the last fetch statement executed just
prior to while condition check.

How are these SQL statements compiled and error checked during embedded SQL?

 The SQL pre-compiler performs type checking of the various shared host
variables to find any mismatches or errors in each of the SQL statements. At run
time, the results are stored in the SQLCODE or SQLSTATE variables.

Is there any limitation on these statically embedded SQL statements?

They offer only limited functionality, as the query must be known at the time of
application development so that they can be pre-compiled in advance. However,
many queries are not known at the time of development of an application; thus we
require dynamically embedded SQL also.

1.3.3 Dynamic SQL


Dynamic SQL statements, unlike static embedded SQL statements, are built at run time and placed
in a string in a host variable. The created SQL statements are then sent to the DBMS
for processing. Dynamic SQL is generally slower than statically embedded SQL, as it
requires complete processing, including access plan generation, at run
time.

However, they are more powerful than embedded SQL as they allow run time
application logic. The basic advantage of using dynamic embedded SQL is that we
need not compile and test a new program for a new query.
Let us explain the use of dynamic SQL with the help of an example:
Example: Write a dynamic SQL interface that allows a student to get and modify
permissible details about him/her. The student may ask for subset of information also.
Assume that the student database has the following relations.
STUDENT (enrolno, name, dob)
RESULT (enrolno, coursecode, marks)

In the table above, a student has access rights for accessing information on his/her
enrolment number, but s/he cannot update the data. Assume that user names are
enrolment number.

Solution: A sample program segment may be (please note that the syntax may change
for different commercial DBMS).
/* declarations in SQL */
EXEC SQL BEGIN DECLARE SECTION;
char inputfields[50];
char tablename[10];
char sqlquerystring[200];
EXEC SQL END DECLARE SECTION;
printf ("Enter the fields you want to see \n");
scanf ("%s", inputfields);
printf ("Enter the name of table STUDENT or RESULT");
scanf ("%s", tablename);
sqlquerystring = "SELECT " + inputfields + " " +
"FROM " + tablename +
" WHERE enrolno = :USER";
/* Plus is used as a symbol for the concatenation operator; in some DBMS it may be || */
/* Assumption: the user name is available in the host language variable USER */

EXEC SQL PREPARE sqlcommand FROM :sqlquerystring;

EXEC SQL EXECUTE sqlcommand;

Please note the following points in the example above.

 The query can be entered completely as a string by the user or s/he can be
suitably prompted.
 The query can be fabricated using a concatenation of strings. This is language
dependent in the example and is not a portable feature of the present query.
 The modification of the query (restricting it to the user's own enrolment number) is done
keeping security in mind.
 The query is prepared and executed using suitable EXEC SQL commands, as sketched in
PL/SQL below.
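PL/SQL supports the same idea natively through EXECUTE IMMEDIATE; a minimal sketch,
reusing the STUDENT relation above (the bind value '123456789' and the variable names are
assumed for illustration):

DECLARE
v_query varchar2(200);
v_name varchar2(25);
BEGIN
-- the statement text is built at run time, with a bind placeholder for enrolno
v_query := 'SELECT name FROM STUDENT WHERE enrolno = :1';
EXECUTE IMMEDIATE v_query INTO v_name USING '123456789';
dbms_output.put_line('Name: ' || v_name);
END;
/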

1.3.4 SQLJ
Till now we have talked about embedding SQL in C, but how can we embed SQL
statements into a JAVA program? For this purpose we use SQLJ. In SQLJ, a
preprocessor called the SQLJ translator translates an SQLJ source file to a JAVA source file.
The JAVA file is then compiled and run on the database. Use of SQLJ improves the
productivity and manageability of JAVA code as:

 The code becomes somewhat compact.


 No run-time SQL syntax errors as SQL statements are checked at compile time.
 It allows sharing of JAVA variables with SQL statements. Such sharing is not
possible otherwise.

Please note that SQLJ cannot use dynamic SQL. It can only use simple embedded
SQL. SQLJ provides a standard form in which SQL statements can be embedded in
a JAVA program.
UNIT IV
TRANSACTION PROCESSING AND CONCURRENCY CONTROL

3.1 TRANSACTION CONCEPTS:


The term transaction refers to a collection of operations that form a single logical unit of
work.
Example: Transfer of money from one account to another is a transaction.
A transaction is initiated by a user program written in a high-level programming language
with embedded database accesses in JDBC or ODBC.
A transaction is delimited by statements of the form begin transaction and end
transaction.

3.2 ACID PROPERTIES:


• Atomicity:
Either all operations of the transaction are reflected properly in the database, or none
are.
• Consistency:
Execution of the transaction in isolation (that is, with no other transaction executing
concurrently) preserves the consistency of the database.
• Isolation:
Each transaction is unaware of other transactions executing concurrently in the
system.
• Durability:
After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.
These properties are often called the ACID properties.

Transactions access data using two operations:


 read(X), which transfers the data item X from the database to a variable, also called
X, in a buffer in main memory.

 write(X), which transfers the value in the variable X in the main-memory buffer of
the transaction that executed the write to the data item X in the database.
Example:
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can
be defined as:
Ti : read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
Let us now consider each of the ACID properties.

• Consistency: The consistency requirement here is that the sum of A and B be unchanged by the
execution of the transaction. This task may be facilitated by automatic testing of integrity
constraints.

Atomicity:The basic idea behind ensuring atomicity is this: The database system keeps track (on
disk) of the old values of any data on which a transaction performs a write. This information is
written to a file called the log. If the transaction does not complete its execution, the database
system restores the old values from the log to make it appear as though the transaction never
executed.
Suppose that, just before the execution of transaction Ti, the values of accounts A and B are
$1000 and $2000, respectively. Suppose that the failure happened after the write(A) operation but
before the write(B) operation. In this case, the values of accounts A and B reflected in the database
are $950 and $2000.
The system destroyed $50 as a result of this failure. In particular, we note that the sum A + B
is no longer preserved.
Ensuring atomicity is the responsibility of the database system; specifically, it is handled by
a component of the database called the recovery system.

Durability: The durability property guarantees that, once a transaction completes successfully, all
the updates that it carried out on the database persist, even if there is a system failure after the
transaction completes execution.
We can guarantee durability by ensuring that either:
1. The updates carried out by the transaction have been written to disk before the
transaction completes.
2. Information about the updates carried out by the transaction and written to disk is
sufficient to enable the database to reconstruct the updates when the database system
is restarted after the failure.
The recovery system of the database is responsible for ensuring durability.

Isolation: If several transactions are executed concurrently, their operations may interleave in some
undesirable way, resulting in an inconsistent state.
For example, the database is temporarily inconsistent while the transaction to transfer funds
from A to B is executing, with the deducted total written to A and the increased total yet to be
written to B. If a second concurrently running transaction reads A and B at this intermediate point
and computes A+B, it will observe an inconsistent value. Furthermore, if this second transaction
then performs updates on A and B based on the inconsistent values that it read, the database may be
left in an inconsistent
state even after both transactions have completed.
A way to avoid the problem of concurrently executing transactions is to execute transactions
serially—that is, one after the other.
Ensuring the isolation property is the responsibility of a component of the database system
called the concurrency-control system.

States of Transaction

The state diagram corresponding to a transaction is shown in the figure.
A transaction must be in one of the following states:
Active: the initial state, the transaction stays in this state while it is executing.
Partially committed: after the final statement has been executed.
Failed: when the normal execution can no longer proceed.
Aborted: after the transaction has been rolled back and the database has been restored to
its state prior to the start of the transaction.
Committed: after successful completion.

3.3 SCHEDULES:
Schedule is defined as a sequence of instructions that specify the chronological order in
which instructions of concurrent transactions are executed.
A schedule is serializable if it is equivalent to a serial schedule.
A schedule where the operations of each transaction are executed consecutively without any
interference from other transactions is called a serial schedule.
Types of serializability are
1. Conflict Serializability
2. View Serializability

3.4 SERIALIZABILITY:
Conflict Serializability:
Instructions Ii and Ij, of transactions Ti and Tj respectively, conflict if and only if there exists
some item Q accessed by both Ii and Ij, and at least one of these instructions wrote Q.
1.Ii = read( Q), Ij = read( Q). Ii and Ij don't conflict.
2.Ii = read( Q), Ij = write( Q). They conflict.
3.Ii = write( Q), Ij = read( Q). They conflict.
4.Ii = write( Q), Ij = write( Q). They conflict.
If Ii and Ij are consecutive in a schedule and they do not conflict, their results would remain
the same even if they had been interchanged in the schedule.
Consider the following schedule 3.

The write(A) of T1 conflicts with the read(A) of T2. However, the write(A) of T2 does not conflict
with the read(B) of T1, because the two instructions access different data items.

Because there is no conflict, we can swap the write(A) and read(B) instructions to generate a new
schedule 5. Regardless of the initial system state, schedules 3 and 5 generate the same result.

We can continue to swap non-conflicting instructions:


Swap the read (B) instruction of T1 with read (A) instruction of T2 .

Swap the write (B) instruction of T1 with write (A) instruction of T2.

Swap the write (B) instruction of T1 with the read (A) instruction of T2 .

The final result of these swaps is shown below, which is a serial schedule.

If a schedule S can be transformed into a schedule S1 by a series of swaps of
non-conflicting instructions, we say that S and S1 are conflict equivalent.
The concept of conflict equivalence leads to the concept of conflict serializability. We say
that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Thus
schedule 3 is conflict serializable, since it is conflict equivalent to the serial schedule 1.
Consider schedule 7. It consists of two transactions T3 and T4. This schedule is not conflict
serializable, since it is not equivalent to either the serial schedule <T3, T4> or the serial schedule
<T4, T3>.

View Serializability:
A schedule S is view serializable if it is view equivalent to a serial schedule.
Let S and S0 be two schedules with the same set of transactions. S and S0 are view
equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S,
then transaction Ti must, in schedule S0, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that
value was produced by transaction Tj (if any), then transaction Ti must in schedule
S0 also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write (Q)
operation in schedule S must perform the final write (Q) operation in schedule S0.
Every conflict serializable schedule is also view serializable.
The following schedule is view-serializable but not conflict serializable

In the above schedule, transactions T4 and T5 perform write(Q) operations without having
performed a read(Q) operation. Writes of this sort are called blind writes. A view serializable
schedule with blind writes is not conflict serializable.

Testing for Serializability:


Testing for Serializability is done by using a directed graph called precedence graph,
constructed from schedule.
This graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges.
The set of vertices consists of all the transactions participating in the schedule.
The set of edges consists of all edges Ti→ Tj for which one of three conditions holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
The precedence graphs for schedule 1 and schedule 2 are shown in the figure given below.

The precedence graph for schedule 1 contains a single edge T1→T2, since all the instructions of T1
are executed before the first instruction of T2 is executed.
The precedence graph for schedule 2 contains a single edge T2→T1,since all the instructions
of T2 are executed before the first instruction of T1 is executed.
Consider the following schedule 4.

The precedence graph for schedule 4 is shown below.

To test conflict serializability, construct a precedence graph for the given schedule. If the graph
contains a cycle, the schedule is not conflict serializable. If the graph contains no cycle, the schedule
is conflict serializable.
Schedule 1 and schedule 2 are conflict serializable, as the precedence graphs for both
schedules do not contain any cycle, while schedule 4 is not conflict serializable, as the
precedence graph for it contains a cycle.

A serializability order of the transactions can be obtained through topological sorting,


which determines a linear order consistent with the partial order of the precedence graph.
Example: Consider the schedule given in Fig. 5.14. Find out whether the schedule is
conflict serializable or not.

Solution:
The precedence graph for given schedule is

As the graph is acyclic, the schedule is conflict serializable.

3.5 CONCURRENCY CONTROL:


The system must control the interaction among the concurrent transactions. This control is
achieved through one of the concurrency control schemes. The concurrency control schemes are based
on the serializability property.

Different types of protocols/schemes are used to control the concurrent execution of transactions.

3.6 LOCKING PROTOCOLS:


Locking is a protocol used to control access to data: when one transaction is accessing the
database, a lock may deny access to other transactions to prevent incorrect results. Locking is one of
the most widely used mechanisms to ensure serializability.
To ensure serializability, it is required that data items be accessed in a mutually exclusive
manner; if one transaction is accessing a data item, no other transaction can modify that data item. A
transaction is allowed to access a data item only if it is currently holding a lock on that item.

Locks:
The two modes of locks are:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on
item Q, then Ti can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on
item Q, then Ti can both read and write Q.

Concurrency Control is the management procedure that is required for
controlling the concurrent execution of the operations that take place on a database.

Concurrent Execution in DBMS

o In a multi-user system, multiple users can access and use the same
database at one time, which is known as the concurrent execution of the
database. It means that the same database is executed simultaneously on a
multi-user system by different users.
o While working on the database transactions, there occurs the requirement
of using the database by multiple users for performing different
operations, and in that case, concurrent execution of the database is
performed.
o The thing is that the simultaneous execution that is performed should be
done in an interleaved manner, and no operation should affect the other
executing operations, thus maintaining the consistency of the database.
Thus, on making the concurrent execution of the transaction operations,
there occur several challenging problems that need to be solved.

Problems with Concurrent Execution

In a database transaction, the two main operations
are READ and WRITE. These operations need to be managed carefully during
the concurrent execution of transactions; if they are interleaved in an uncontrolled
manner, the data may become inconsistent. The following problems occur with the
concurrent execution of operations:

Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the
read/write operations on the same database items in an interleaved manner
(i.e., concurrent execution) that makes the values of the items incorrect hence
making the database inconsistent.

For example:

Consider two transactions TX and TY performed on the same account A, where
the balance of account A is $300.
o At time t1, transaction TX reads the value of account A, i.e., $300 (only
read).
o At time t2, transaction TX deducts $50 from account A that becomes $250
(only deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that
will be $300 only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400
(only added but not updated/write).
o At time t6, transaction TX writes the value of account A that will be
updated as $250 only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it
will write as done at time t4 that will be $400. It means the value written
by TX is lost, i.e., $250 is lost.

Hence the data becomes incorrect, and the database becomes inconsistent.

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the
database and then fails for some reason, and before the data is rolled back,
the updated database item is accessed by another transaction. This gives rise to a
read-write conflict between the two transactions.

For example:
Consider two transactions TX and TY performing
read/write operations on account A, where the available balance in account
A is $300:

o At time t1, transaction TX reads the value of account A, i.e., $300.


o At time t2, transaction TX adds $50 to account A that becomes $350.
o At time t3, transaction TX writes the updated value in account A, i.e.,
$350.
o Then at time t4, transaction TY reads account A that will be read as $350.
o Then at time t5, transaction TX rolls back due to a server problem, and the
value changes back to $300 (as initially).
o But transaction TY has already read the value $350 for account A, a value
that was never committed; this is a dirty read, and the situation is therefore known as the Dirty Read
Problem.

Unrepeatable Read Problem (W-R Conflict)

Also known as the Inconsistent Retrievals Problem, this occurs when, within a single
transaction, two different values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing read/write
operations on account A, having an available balance of $300:
o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100
to the available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account
A, and that will be read as $400.
o It means that within the same transaction TX, two different values
of account A are read: $300 initially and, after the update made by transaction
TY, $400. This is an unrepeatable read and is therefore known as the
Unrepeatable Read Problem.

Thus, in order to maintain consistency in the database and avoid such problems
in concurrent execution, management is needed, and that is
where the concept of Concurrency Control comes into play.

Concurrency Control Protocols


The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of the database
transactions. Therefore, these protocols are categorized as:

o Lock Based Concurrency Control Protocol


o Time Stamp Concurrency Control Protocol
o Validation Based Concurrency Control Protocol

Lock-Based Protocol

o In this type of protocol, any transaction cannot read or write data until it
acquires an appropriate lock on it. There are two types of lock:

1. Shared lock:

o It is also known as a Read-only lock. Under a shared lock, the data item can
only be read by the transaction.
o It can be shared between transactions because, while a transaction
holds a shared lock, it can't update the data item.

2. Exclusive lock:

o Under an exclusive lock, the data item can be both read and written by
the transaction.
o This lock is exclusive: multiple transactions cannot
modify the same data item simultaneously. A short SQL sketch of the two lock modes follows below.
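A minimal sketch of how the two lock modes surface in SQL, assuming the customers table used
earlier (LOCK TABLE ... IN SHARE MODE and SELECT ... FOR UPDATE are Oracle-style
statements; other DBMSs differ):

-- shared (read) lock on the whole table: other sessions may read but not modify it
LOCK TABLE customers IN SHARE MODE;

-- exclusive (write) lock on specific rows: other sessions cannot update or delete them
SELECT salary
FROM customers
WHERE id = 2
FOR UPDATE;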

There are four types of lock protocols available:

1. Simplistic lock protocol

It is the simplest way of locking data during a transaction. Simplistic lock-
based protocols allow all transactions to get a lock on the data before an
insert, delete or update on it, and to unlock the data item after completing the
transaction.

2. Pre-claiming Lock Protocol

o Pre-claiming lock protocols evaluate the transaction to list all the data
items on which locks are needed.
o Before initiating the execution of the transaction, it requests the DBMS for
locks on all those data items.
o If all the locks are granted, then this protocol allows the transaction to
begin. When the transaction is completed, it releases all the locks.
o If all the locks are not granted, then the transaction
rolls back and waits until all the locks are granted.

3. Two-phase locking (2PL)

o The two-phase locking protocol divides the execution phase of the


transaction into three parts.
o In the first part, when the execution of the transaction starts, it seeks
permission for the lock it requires.
o In the second part, the transaction acquires all the locks. The third phase
is started as soon as the transaction releases its first lock.
o In the third phase, the transaction cannot demand any new locks. It only
releases the acquired locks.

There are two phases of 2PL:

Growing phase: In the growing phase, new locks on data items may be
acquired by the transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction
may be released, but no new locks can be acquired.

In the example below, if lock conversion is allowed, then the following conversions
can happen:

1. Upgrading of lock (from S(a) to X (a)) is allowed in growing phase.


2. Downgrading of lock (from X(a) to S(a)) must be done in shrinking
phase.

Example:

The following shows how unlocking and locking work with 2PL.

Transaction T1:

o Growing phase: from step 1-3


o Shrinking phase: from step 5-7
o Lock point: at 3

Transaction T2:

o Growing phase: from step 2-6


o Shrinking phase: from step 8-9
o Lock point: at 6

4. Strict Two-phase locking (Strict-2PL)


o The first phase of Strict-2PL is similar to 2PL: after
acquiring all the locks, the transaction continues to execute normally.
o The only difference between 2PL and Strict-2PL is that Strict-2PL does
not release a lock immediately after using it.
o Strict-2PL waits until the whole transaction commits, and then it
releases all the locks at a time.
o The Strict-2PL protocol therefore does not have a gradual shrinking phase of lock release.

Unlike basic 2PL, it does not suffer from cascading aborts.

Timestamp Ordering Protocol


o The Timestamp Ordering Protocol is used to order the transactions based
on their Timestamps. The order of transaction is nothing but the
ascending order of the transaction creation.
o The priority of the older transaction is higher, which is why it executes first.
To determine the timestamp of a transaction, this protocol uses the system
time or a logical counter.
o The lock-based protocol is used to manage the order between conflicting
pairs among transactions at execution time, whereas timestamp-based
protocols start working as soon as a transaction is created.
o Let's assume there are two transactions, T1 and T2. Suppose
transaction T1 entered the system at time 007 and transaction T2 entered the system at
time 009; then T1 has the higher priority and executes first, since it entered the system earlier.
Deadlock Avoidance:
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance
mechanisms can be used to detect any deadlock situation in advance.
Methods like the "wait-for graph" are available, but they are suitable for systems where transactions are
lightweight and hold fewer instances of resources. In a bulky system, deadlock prevention
techniques may work better.

Wait-for Graph:
This is a simple method available to track if any deadlock situation may arise.
For each transaction entering in the system, a node is created.
When transaction Ti requests for a lock on item, say X, which is held by some other
transaction Tj, a directed edge is created from Ti to Tj. If Tj releases item X, the edge between them
is dropped and Ti locks the data item.
The system maintains this wait-for graph for every transaction waiting for some data items
held by others, and keeps checking whether there is any cycle in the graph.

Fig Wait-for Graph

Two approaches can be used. The first is not to allow any request for an item that is already
locked by some other transaction; this is not always feasible and may cause starvation,
where a transaction waits indefinitely for a data item and can never acquire it. The second option is
to roll back one of the transactions.
It is not feasible to always roll back the younger transaction, as it may be more important than the
older one.
With the help of some relative algorithm, a transaction is chosen to be aborted; this
transaction is called the victim and the process is known as victim selection.

3.9 TRANSACTION RECOVERY:


 Protocols that do not leave operational sites blocked when failures occur are referred to as
non-blocking protocols. In the following two
sections, we consider two common commit protocols suitable for distributed DBMSs: two-
phase commit (2PC) and three-phase commit (3PC), a non-blocking protocol.
 Assume that every global transaction has one site that acts as coordinator (or transaction
manager) for that transaction, which is generally the site at which the transaction was
initiated. Sites at which the global transaction has agents are called participants (or
resource managers).
 Assume that the coordinator knows the identity of all participants and that each participant
knows the identity of the coordinator but not necessarily of the other participants.

Two-Phase Commit (2PC)


 2PC operates in two phases: a voting phase and a decision phase.
 The basic idea is that the coordinator asks all participants whether they are prepared to
commit the transaction. If one participant votes to abort, or fails to respond within a timeout
period, then the coordinator instructs all participants to abort the transaction.
 If all vote to commit, then the coordinator instructs all participants to commit the
transaction. The global decision must be adopted by all participants.
 If a participant votes to abort, then it is free to abort the transaction immediately; in fact, any
site is free to abort a transaction at any time up until it votes to commit. This type of abort is
known as a unilateral abort.
 If a participant votes to commit, then it must wait for the coordinator to broadcast either the
global commit or global abort message.
 This protocol assumes that each site has its own local log, and can therefore rollback or
commit the transaction reliably. Two-phase commit involves processes waiting for messages
from other sites. To avoid processes being blocked unnecessarily, a system of timeouts is
used. The procedure for the coordinator at commit is as follows:

Phase 1
(1) Write a begin_commit record to the log file and force-write it to stable storage.
 Send a PREPARE message to all participants.
 Wait for participants to respond within a timeout period.
Phase 2
(2) If a participant returns an ABORT vote,
 Write an abort record to the log file and force write it to stable storage.
 Send a GLOBAL_ABORT message to all participants.
 Wait for participants to acknowledge within a timeout period.
(3) If a participant returns a READY_COMMIT vote,
 Write a commit record to the log file and force-write it to stable storage.
 Send a GLOBAL_COMMIT message to all participants.
 Wait for participants to acknowledge within a timeout period.
(4) Once all acknowledgements have been received,
 Write an end_transaction message to the log file.
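A compressed Python sketch of the coordinator procedure above is given below. The participant objects, their prepare/global_commit/global_abort methods and the log list are hypothetical stand-ins; a real implementation would force-write each log record to stable storage and apply the timeout rules of the termination protocol.

# Sketch of the coordinator side of two-phase commit, following the steps above.
# `log` and the participant objects are hypothetical stand-ins.

def two_phase_commit(participants, log):
    # Phase 1: voting
    log.append("begin_commit")                    # force-write begin_commit record
    votes = [p.prepare() for p in participants]   # send PREPARE, collect votes

    # Phase 2: decision
    if all(v == "READY_COMMIT" for v in votes):
        log.append("commit")                      # force-write commit record
        for p in participants:
            p.global_commit()                     # send GLOBAL_COMMIT
    else:
        log.append("abort")                       # force-write abort record
        for p in participants:
            p.global_abort()                      # send GLOBAL_ABORT

    log.append("end_transaction")                 # after all acknowledgements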

2PC Protocol for voting Commit:


Fig 2 PC Protocol for voting Commit

2PC Protocol for voting Abort:

Fig. 2PC Protocol for voting Abort


Termination protocols for 2PC
A termination protocol is invoked whenever a coordinator or participant fails to receive an
expected message and times out. The action to be taken depends on whether the coordinator or
participant has timed out and on when the timeout occurred.
(i) Coordinator
The coordinator can be in one of four states during the commit process:
 INITIAL
 WAITING
 DECIDED
 COMPLETED
as shown in the state transition diagram in Figure, but can time out only in the middle two states.
The actions to be taken are as follows:
Timeout in the WAITING state -The coordinator is waiting for all participants to acknowledge
whether they wish to commit or abort the transaction. In this case, the coordinator cannot commit
the transaction because it has not received all votes. However, it can decide to globally abort the
transaction.
Timeout in the DECIDED state -The coordinator is waiting for all participants to acknowledge
whether they have successfully aborted or committed the transaction. In this case, the coordinator
simply sends the global decision again to sites that have not acknowledged.

Fig.Termination protocols for 2PC


(ii) Participant
A participant can be in one of four states during the commit process:
 INITIAL
 PREPARED
 ABORTED
 COMMITTED
as shown in the state transition diagram in Figure. However, a participant may time out only in the
first two states as follows:
 Read Committed – This isolation level guarantees that any data read is committed at the
moment it is read. Thus it does not allow dirty reads. The transaction holds a read or write lock
on the current row, and thus prevents other transactions from reading, updating or deleting it.

 Repeatable Read – This is a more restrictive isolation level. The transaction holds read locks
on all rows it references and write locks on all rows it inserts, updates, or deletes. Since other
transactions cannot read, update or delete these rows, non-repeatable reads are avoided.
 Serializable – This is the highest isolation level. A serializable execution is guaranteed to be
serializable: an execution of operations in which concurrently executing transactions appear
to be executing serially.
3.12 SQL FACILITIES FOR CONCURRENCY AND RECOVERY:
Crash Recovery
DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depends on its complex architecture and its
underlying hardware and system software. If it fails or crashes amid transactions, it is expected that
the system would follow some sort of algorithm or techniques to recover lost data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various categories, as
follows −

a) Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it
can’t go any further. This is called transaction failure, where only a few transactions or processes
are affected.
Reasons for a transaction failure could be −
 Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.
 System errors − Where the database system itself terminates an active transaction because
the DBMS is not able to execute it, or it has to stop because of some system condition. For
example, in case of deadlock or resource unavailability, the system aborts an active
transaction.

System Crash
There are problems − external to the system − that may cause the system to stop abruptly
and cause the system to crash. For example, interruptions in power supply may cause the failure of
underlying hardware or software failure.
Examples may include operating system errors.

Disk Failure
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently. Disk failures include formation of bad sectors, unreachability
to the disk, disk head crash or any other failure, which destroys all or a part of disk storage.

Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories −
 Volatile storage − As the name suggests, a volatile storage cannot survive system crashes.
Volatile storage devices are placed very close to the CPU; normally they are embedded
onto the chipset itself. For example, main memory and cache memory are examples of
volatile storage. They are fast but can store only a small amount of information.
 Non-volatile storage − These memories are made to survive system crashes. They are huge
in data storage capacity, but slower in accessibility. Examples may include hard-disks,
magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.

Recovery and Atomicity


When a system crashes, it may have several transactions being executed and various files
opened for them to modify the data items. Transactions are made of various operations, which are
atomic in nature. But according to ACID properties of DBMS, atomicity of transactions as a whole
must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
 It should check the states of all the transactions, which were being executed.
 A transaction may be in the middle of some operation; the DBMS must ensure the atomicity
of the transaction in this case.
 It should check whether the transaction can be completed now or it needs to be rolled back.
 No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
 Maintaining the logs of each transaction, and writing them onto some stable storage before
actually modifying the database.
 Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.

Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a
transaction. It is important that the logs are written prior to the actual modification and stored on a
stable storage media, which is failsafe.
Log-based recovery works as follows −
 The log file is kept on stable storage media.
 When a transaction enters the system and starts execution, it writes a log record about it:
 <Tn, Start>
 When the transaction modifies an item X, it writes a log record as follows:
 <Tn, X, V1, V2>
 This record indicates that Tn has changed the value of X from V1 to V2.
 When the transaction finishes, it logs:
 <Tn, commit>
The database can be modified using two approaches −
 Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
 Immediate database modification − Each log follows an actual database modification.
That is, the database is modified immediately after every operation.
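The following minimal Python sketch (illustrative only; item names and values are assumed) shows immediate database modification with write-ahead logging: each <Tn, X, V1, V2> record is appended to the log before the data item itself is changed.

# Sketch of write-ahead logging with immediate database modification:
# every update appends a <Tn, X, old, new> record to the (stable) log before
# the in-memory "database" is changed. Names and values are illustrative only.

log = []                                # stands in for the log file on stable storage
database = {"X": 500, "Y": 200}

def write(txn, item, new_value):
    old_value = database[item]
    log.append((txn, item, old_value, new_value))   # <Tn, X, V1, V2> written first
    database[item] = new_value                      # actual modification follows

log.append(("T1", "Start"))
write("T1", "X", 450)
write("T1", "Y", 250)
log.append(("T1", "commit"))
print(log)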
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs are interleaved. At
the time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.

1. Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the
memory space available in the system. As time passes, the log file may grow too big to be handled
at all. Checkpoint is a mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.

2. Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the
following manner −

Fig 5.15 Checkpoint versus Failure


 The recovery system reads the logs backwards from the end to the last checkpoint.

 It maintains two lists, an undo-list and a redo-list.

 If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>,
it puts the transaction in the redo-list.
 If the recovery system sees a log with <Tn, Start> but no commit or abort log, it puts
the transaction in the undo-list.
All the transactions in the undo-list are then undone and their log records are discarded. All the
transactions in the redo-list are redone using their log records, and their logs are then saved.
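A small Python sketch of this recovery pass is shown below; the log records are invented for illustration. The scan runs backwards from the end of the log to the last checkpoint, placing committed transactions on the redo-list and uncommitted ones on the undo-list.

# Sketch of the recovery pass described above, over illustrative log records.

log = [
    ("checkpoint",),
    ("T1", "Start"), ("T1", "X", 500, 450), ("T1", "commit"),
    ("T2", "Start"), ("T2", "Y", 200, 250),           # T2 crashed before commit
]

redo_list, undo_list = set(), set()
committed = set()
for record in reversed(log):
    if record == ("checkpoint",):                     # stop at the last checkpoint
        break
    txn, kind = record[0], record[1]
    if kind == "commit":
        committed.add(txn)
    elif kind == "Start":
        (redo_list if txn in committed else undo_list).add(txn)

print("redo:", redo_list)   # {'T1'}  -> re-apply its updates
print("undo:", undo_list)   # {'T2'}  -> restore old values from its log records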
QUERY PROCESSING AND OPTIMIZATION IN DBMS.
Query Processing is the activity performed in extracting data from the database. In query
processing, it takes various steps for fetching the data from the database. The steps involved are:
1. Parsing and translation
2. Optimization
3. Evaluation

The query processing works in the following way:


Parsing and Translation
 The scanning, parsing, and validating module produces an internal representation of the
query. The query optimizer module devises an execution plan which is the execution
strategy to retrieve the result of the query from the database files.
 A query typically has many possible execution strategies differing in performance, and
the process of choosing a reasonably efficient one is known as query optimization.
 The code generator generates the code to execute the plan. The runtime database
processor runs the generated code to produce the query result.
 Relational algebra is well suited for the internal representation of a query.



The translation process in query processing is similar to the parsing of a query. When a
user executes a query, the parser in the system checks the syntax of the query, verifies the names of
the relations in the database, the tuples, and finally the required attribute values, in order to generate
the internal form of the query. The parser creates a tree of the query, known as a 'parse tree',
and then translates it into relational algebra. In doing so, it also replaces all uses of views
appearing in the query.
It is done in the following steps:
Step-1:
Parser: During the parse call, the database performs the following checks – syntax check, semantic
check and shared pool check – after converting the query into relational algebra.
The parser performs the following checks (refer to the detailed diagram):
1. Syntax check – checks SQL syntactic validity. Example:
SELECT * FORM employee
Here the misspelling of FROM is caught by this check.
2. Semantic check – determines whether the statement is meaningful or not. Example: a
query that refers to a table which does not exist is caught by this check.
3. Shared pool check – Every query is assigned a hash code during its execution. This check
determines whether that hash code already exists in the shared pool; if it does, the
database does not take the additional optimization and execution steps.

Hard Parse and Soft Parse –


If the query is fresh and its hash code does not exist in the shared pool, it has to pass
through additional steps known as hard parsing; otherwise, if the hash code exists, the query
skips these additional steps and passes directly to the execution engine (refer to the detailed
diagram). This is known as soft parsing.
A hard parse includes the following steps – optimization and row source generation.
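The shared pool check can be pictured with the following Python sketch (the shared_pool dictionary and the plan strings are hypothetical): the statement text is hashed, a hit means a soft parse that reuses the stored plan, and a miss means a hard parse that builds and caches a new plan.

# Illustrative sketch of the shared-pool check and soft vs hard parsing.
# The shared_pool dict and the plan strings are hypothetical stand-ins.

import hashlib

shared_pool = {}   # hash of SQL text -> previously built execution plan

def parse(sql_text):
    key = hashlib.sha256(sql_text.encode()).hexdigest()
    if key in shared_pool:
        return shared_pool[key]                     # soft parse: reuse existing plan
    plan = f"optimized plan for: {sql_text}"        # hard parse: optimize + row source generation
    shared_pool[key] = plan
    return plan

parse("SELECT EMP_NAME FROM EMPLOYEE WHERE SALARY > 10000")   # hard parse
parse("SELECT EMP_NAME FROM EMPLOYEE WHERE SALARY > 10000")   # soft parse (cache hit)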

Step-2:
Optimizer: During the optimization stage, the database must perform a hard parse at least once
for every unique DML statement and perform optimization during this parse. The database never
optimizes DDL unless it includes a DML component, such as a subquery, that requires optimization.



It is a process in which multiple query execution plans for satisfying a query are examined
and the most efficient query plan is selected for execution.
The database catalog stores the execution plans, and the optimizer then passes the lowest-cost plan for
execution.
Step-3:
Execution Engine: Finally runs the query and displays the required result.
Thus, we can understand the working of a query processing in the below-described
diagram:
Suppose a user executes a query. As we have learned that there are various methods of
extracting the data from the database. In SQL, a user wants to fetch the records of the employees
whose salary is greater than or equal to 10000. For doing this, the following query is undertaken:
SELECT EMP_NAME FROM EMPLOYEE WHERE SALARY>10000;
Thus, to make the system understand the user query, it needs to be translated in the form of
relational algebra. We can bring this query in the relational algebra form as:

o σsalary>10000 (πEmp_Name(Employee))
o πEmp_Name(σsalary>10000 (Employee))
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
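To make the idea concrete, the sketch below evaluates πEmp_Name(σSalary>10000(EMPLOYEE)) over a small in-memory relation in Python; the employee rows are invented for illustration.

# Minimal sketch of evaluating pi_Emp_Name( sigma_Salary>10000 (EMPLOYEE) ).
# The sample rows are illustrative, not from the text.

employee = [
    {"Emp_Name": "Asha",  "Salary": 12000},
    {"Emp_Name": "Ravi",  "Salary": 9000},
    {"Emp_Name": "Meena", "Salary": 15000},
]

def select(rows, predicate):                 # sigma: keep rows matching the condition
    return [r for r in rows if predicate(r)]

def project(rows, attributes):               # pi: keep only the listed attributes
    return [{a: r[a] for a in attributes} for r in rows]

result = project(select(employee, lambda r: r["Salary"] > 10000), ["Emp_Name"])
print(result)   # [{'Emp_Name': 'Asha'}, {'Emp_Name': 'Meena'}]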
Evaluation
For this, with addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and evaluating
each operation. Thus, after translating the user query, the system executes a query evaluation
plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation plan.
o A query evaluation plan defines a sequence of primitive operations used for evaluating a
query. The query evaluation plan is also referred to as the query execution plan.
o A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.



Optimization

The cost of query evaluation can vary for different types of queries. Although the system is
responsible for constructing the evaluation plan, the user does not need to write the query
efficiently.
o Usually, a database system generates an efficient query evaluation plan, which minimizes

its cost. This type of task performed by the database system and is known as Query
Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
Example:
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX (SALARY) FROM
EMPLOYEE WHERE DNO=5);
The inner block
(SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO=5)
 Translated in: ∏ MAX SALARY (σDNO=5(EMPLOYEE))
The Outer block
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
 Translated in: ∏ LNAME, FNAME (σSALARY>C (EMPLOYEE))
(C represents the result returned from the inner block.)
 The query optimizer would then choose an execution plan for each block.
 The inner block needs to be evaluated only once. (Uncorrelated nested query).
 It is much harder to optimize the more complex correlated nested queries.
External Sorting
It refers to sorting algorithms that are suitable for large files of records on disk that do not fit
entirely in main memory, such as most database files. Typical uses include:
ORDER BY.
Sort-merge algorithms for JOIN and other operations (UNION, INTERSECTION). Duplicate
elimination algorithms for the PROJECT operation (DISTINCT).
A typical external sorting algorithm uses a sort-merge strategy:
Sort phase: Create small sorted sub-files (sorted sub-files are called runs).
Merge phase: Then merge the sorted runs. An N-way merge uses N memory buffers to
buffer input runs, and 1 block to buffer the output. Select the first record (in the sort order) among
the input buffers, write it to the output buffer and delete it from its input buffer. If the output buffer is
full, write it to disk. If an input buffer is empty, read the next block from the corresponding run. E.g. 2-way
Sort-Merge.
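The following Python sketch (illustrative; the buffer size and data are assumed) mirrors the sort-merge strategy: the sort phase produces sorted runs no larger than the available buffer, and the merge phase performs an N-way merge of those runs.

# Sketch of external sorting with a sort-merge strategy, using Python lists
# as stand-ins for disk runs. Data and buffer size are illustrative.

import heapq

def external_sort(records, buffer_size):
    # Sort phase: cut the input into runs that fit in memory and sort each one.
    runs = [sorted(records[i:i + buffer_size])
            for i in range(0, len(records), buffer_size)]
    # Merge phase: N-way merge of the sorted runs (heapq.merge streams them).
    return list(heapq.merge(*runs))

data = [27, 3, 88, 14, 56, 7, 91, 42, 5]
print(external_sort(data, buffer_size=3))   # [3, 5, 7, 14, 27, 42, 56, 88, 91]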



NoSQL Technologies
1.1 INTRODUCTION TO NOSQL AND INTERFACING
WITH NOSQL DATA STORES
“NoSQL” stands for “non-SQL” or “not only SQL”. It stores databases in formats other
than the traditional relational tables of an RDBMS. It is useful for managing and accessing
various types of databases holding large volumes of data.

Basics Introduction to NoSQL:

 A non-relational database which stores data in a non-tabular manner.
 A NoSQL database can store data in traditional as well as non-traditional structural ways.
 Relational databases have long been the only choice, or the default choice, for data storage.
 After relational databases, the current excitement about NoSQL databases has come.
 The value of relational databases lies in two areas of memory: (a) fast, small, volatile main memory and (b) larger, slower, non-volatile backing store.
 As main memory is volatile, data must be kept in a backing store (file system or database) to persist.
 A database allows more flexibility than a file system in storing large amounts of data in a way that allows an application program to get information quickly and easily.
 A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than a traditional relational database.
 NoSQL systems are also referred to as “Not only SQL” to emphasize that they do in fact allow SQL-like query languages to be used.

1.2 CHARACTERISTICS OF NOSQL


1. High Scalability:
NoSQL databases have higher scalability for large databases.

2. Independent of Schema:
NoSQL databases work efficiently independently of schema, i.e. with large volumes of
heterogeneous data which require no schema for structuring.

3. Complex with free working:
NoSQL is easier to handle than SQL databases for storing data in a semi-structured or
unstructured form that requires no tabular format or arrangement.

4. Flexible to accommodate:
NoSQL databases hold heterogeneous data that does not require any structured format;
they are very flexible in terms of their reliability and use.

1.3 NOSQL STORAGE TYPES


A database is an easily accessible collection of organised data or
information kept in a computer system. A database is often overseen by a
Database Management System (DBMS).
Nontabular data is stored in a non-relational database called NoSQL.
NoSQL is an acronym for Not Only SQL. Document, key-value, column-oriented,
and graph databases are the primary types.

It is divided into four different types:


1. Document Database.
2. Key-Value Database.
3. Column-oriented Database.
4. Graph Database.

1. Document Database:
A document database stores data in the form of documents. The data is grouped into
named files, which is useful when building application software.
The most important benefit of a document database is that it allows the user to store
data in a particular format, i.e. the document format.
It is a hierarchical and semi-structured form of NoSQL database that allows efficient
storage of data. It works very well, for example, for storing user profiles. MongoDB is a
very good example of a NoSQL document database.

2. Key-Value Database:
A key-value database is a type of NoSQL database that stores data in a schema-less manner.
It stores the data in key-value format: one data point is assigned as a key while another
data point is assigned as its value.
For example, the term ‘age’ can be assigned as the key data point while ‘45’ is its value.
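The idea can be pictured with a plain Python dictionary standing in for a key-value store (purely illustrative):

# A key-value pair as described above, with a dict acting as the store.
store = {}
store["age"] = 45            # put: 'age' is the key, 45 is its value
print(store.get("age"))      # get  -> 45
del store["age"]             # delete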

3. Column-oriented Database:
It stores the data in the form of columns, segregating the data into homogeneous categories.
Users can access the data very easily without retrieving unnecessary information.
Column-oriented databases work efficiently for data analytics in many social media
networking sites.
This type of database can accommodate large volumes of data, and column-oriented
databases are used for filtering data or information.
Apache HBase is an example of a column-oriented database.

4. Graph Database:
In a graph database we can store the data in the form of graph structures and related
elements like nodes and edges.
Data points are arranged so that nodes are easily related through edges, and thus a
connection or network can easily be established.
Graph-based databases focus on the relationships between the elements. The data is
stored as nodes in the database, and the connections between the nodes are called links
or relationships.

Key features of graph databases:

 In a graph-based database, it is easy to identify the relationships between the data by using the links.
 A query’s output is delivered as real-time results.
 The speed depends upon the number of relationships among the database elements.
 Updating data is also easy, as adding a new node or edge to a graph database is a straightforward task that does not require significant schema changes.
 Graph databases are useful for software development.
 A good example of a NoSQL graph database is Amazon Neptune, which enables highly effective and organized functioning of software. Amazon Neptune is a reliable, fast graph database service used to build applications over highly connected data.

1.4 ADVANTAGES OF NOSQL


 No constraint on the structure of the data to be stored.
 Integration with cloud computing.
 It can store large volumes of data.
 Flexible data model.
 High performance.
 Open source.

1.5 DRAWBACKS OF NOSQL


 Less developed as compared to traditional SQL.
 Improvements are required for cross-platform support.
 In NoSQL, data inconsistency may occur.
 Large document size.
 No GUI is available.
 It is mainly designed for storage and offers relatively little functionality beyond that.
 Backup is a drawback of NoSQL databases; some NoSQL databases, like MongoDB, have no approach for backing up data in a consistent manner.

1.6 NOSQL PRODUCTS: INTERFACING AND INTERACTING WITH NOSQL

 MongoDB is an open-source document-oriented database designed to store large
volumes of data while allowing users to work with the data efficiently. Storage and
retrieval of data in MongoDB is not in the form of tables. It supports languages like
C, C++, C#, .NET, Java, Node.js, Perl, PHP, Python, Scala, etc. A user can easily
create an application using any of these languages.

 Examples: Many companies, such as Facebook, eBay and Google, use MongoDB to
store large volumes of their respective data.

1.7 SQL AND NOSQL

SQL                                               | NoSQL
It is called an RDBMS or Relational Database.     | It is called a Non-Relational or Distributed Database.
Table-based databases.                            | It can be document based, key-value pairs, or graph databases.
Vertical scalability.                             | Horizontal scalability.
Fixed or predefined schema.                       | Flexible schema.
It is not suitable for hierarchical data storage. | It is suitable for hierarchical data storage.
3) Differentiate between the NoSQL and SQL.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

15.3 TYPES OF NoSQL DATABASES

In this section, we will discuss the many classifications of NoSQL databases.


There are typically four types of NoSQL databases:
1) Column-based: Instead of accumulating data in rows, this method
organizes it all together into columns, which makes it easier to query
large datasets.
2) Graph-based: These are systems that are utilized for the storage of
information regarding networks, such as social relationships.
3) Key-value pair based: This is the simplest sort of database, in which each
item of your database is saved in the form of an attribute name (also
known as a "key") coupled with the value.
4) Document-based: Made up of sets of key-value pairs that are kept in
documents.

15.3.1 Column Based


A column store, in contrast to a relational database, is arranged as a set of
columns, rather than rows. This allows you to read only the columns you need
for analysis, saving memory space that would otherwise be taken up by
irrelevant information. Because columns are frequently of the same kind, they
are able to take advantage of more efficient compression, which makes data
reading even quicker. The value of a specific column can be quickly aggregated
using columnar databases.
Although columnar databases are excellent for analytics, because of the way they write data it is
challenging for them to remain consistent, since a write to all the columns requires several write
events on disk. This problem does not arise with relational databases, because row data is written
contiguously to disk.

How Does a Column Database Work?


A columnar database is a type of database management system (DBMS) that stores data in
columns rather than rows. It reduces the time needed to return a given query and significantly
improves disk I/O performance. Both data analytics and data warehousing benefit from it. The
primary goal of a columnar database is to read and write data efficiently. Column-store databases
include Cassandra, Cosmos DB, Bigtable, and HBase, to name a few.
Columnar Database Vs Row Database:
When processing big data analytics and data warehousing, there are a number
of different techniques that can be used, including columnar databases and row
databases. But they each take a different method.

For instance:
• Row Database: “Customer 1: Name, Address, Location". (The fields for each
new record are stored in a long row).
• Columnar Database: “Customer 1: Name, Address, Location”. (Each field
has its own set of columns). Refer Table 2 for relational database example.

Table 2: Relational database: an example


ID Number First Name Last Name Amount

A01234 Sima Kaur 4000

B03249 Tapan Rao 5000

C02345 Srikant Peter 1000

In a Columnar DBMS, the data will be stored in the following format:


A01234, B03249, C02345; Sima, Tapan, Srikant; Kaur, Rao, Peter; 4000,
5000, 1000.

In a Row-oriented DBMS, the data will be stored in the following format:


A01234, Sima, Kaur, 4000; B03249, Tapan, Rao, 5000; C02345, Srikant,
Peter, 1000.
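The two layouts can be contrasted with the short Python sketch below, using the rows of Table 2; the structures are only stand-ins for what a real engine writes to disk.

# Row-oriented vs column-oriented layout for the rows of Table 2 (illustrative).

rows = [
    ("A01234", "Sima",    "Kaur",  4000),
    ("B03249", "Tapan",   "Rao",   5000),
    ("C02345", "Srikant", "Peter", 1000),
]

# Row-oriented layout: one complete record after another.
row_store = list(rows)

# Column-oriented layout: one column's values after another.
column_store = {
    "ID Number":  [r[0] for r in rows],
    "First Name": [r[1] for r in rows],
    "Last Name":  [r[2] for r in rows],
    "Amount":     [r[3] for r in rows],
}

# An aggregate like SUM(Amount) only has to read one column in the column store.
print(sum(column_store["Amount"]))   # 10000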

Columnar databases: advantages

The use of columnar databases has various advantages:

 Column stores are highly effective in compression, making them storage


efficient. This implies that you can conserve disk space while storing
enormous amounts of data in a single column.
 Aggregation queries are fairly quick with column-store databases
because the majority of the data is kept in a column, which is beneficial
for projects that need to execute a lot of queries quickly.
 Load times are also quite good; a table with a billion rows can be loaded
in a matter of seconds. This suggests that you can load and query
practically instantly.
 A great deal of versatility, because columns do not have to resemble one
another. Adding a new or different column does not affect the database;
however, inserting an entirely new record requires every column to be
updated.
 Overall, column-store databases are excellent for analytics and reporting
due to their quick query response times and capacity to store massive
volumes of data without incurring significant costs.

Column databases: Disadvantages

While there are many benefits to adopting column-oriented databases, there are
also a few drawbacks to keep in mind.

 It takes a lot of time and effort to create an efficient indexing schema.


 Incremental data loading is undesirable and is to be avoided, if at all
possible, even though this might not be a problem for some users.
 Security: this concern applies to all forms of NoSQL databases, not just
column stores. Web applications frequently have security flaws, and the
absence of built-in security features in NoSQL databases does not help. If
security is your top priority, you should either consider using relational
databases or, if possible, use a clearly specified schema.
 Due to the way data is stored, Online Transaction Processing (OLTP)
applications are incompatible with columnar databases.

Are column databases always NoSQL?

Before we conclude, we should note that column-store databases are not always
NoSQL-only. It is frequently argued that column-store belongs firmly in the
NoSQL camp because it differs so much from relational database approaches.
The debate between NoSQL and SQL is generally quite nuanced, therefore this
is not usually the case. They are essentially the same as SQL techniques when it
comes to column-store databases. For instance, keyspaces function as schema,
so schema management is still necessary. A NoSQL data store's keyspace
contains all column families. The concept is comparable to relational database
management systems' schema. There is typically only one keyspace per
program. Another illustration is the fact that the metadata occasionally
resembles a conventional relational DBMS perfectly. Ironically, column-store
databases frequently adhere to ACID and SQL standards. However, NoSQL
databases are often either document-store or key-store, neither of which are
column-store. Therefore, it is difficult to claim that column-store is a pure
NoSQL system.

15.3.2 Graph Based


Although SQL is an excellent RDBMS and has been used for many years to manage massive
amounts of data, the initial hardware hurdles it faced in handling vast quantities of data are no
longer there. As a result, NoSQL has rapidly emerged as a dominant form of contemporary
database management, and many of the largest websites we rely on today are powered by NoSQL,
such as Twitter's use of FlockDB and Amazon's DynamoDB.
A database that stores data using graph structures is known as a graph database.
It represents and stores data using nodes, edges, and attributes rather than tables
or documents. Relationships between the nodes are represented by the edges.
This makes data retrieval simpler and, in many circumstances, only requires one
action. Additionally, it works fantastically as a database for fast, threaded data
structures like those used on Twitter

How does a Graph Database Work?


Graphs, which are not relational databases, rely heavily on the idea of multi-
relational data "pathways" for their functionality. However, the structure of
graph databases is typically simple. They are largely made up of two elements:
 The Node: This represents the actual data itself. It may be the number of
people who watched a video on YouTube; it could be the number of
people who read a tweet; or it could even be fundamental information
like people's names, addresses, and other such details.
 The Edge: This clarifies the actual connection between two nodes. It is
interesting to note that edges can also carry their own data, such as the
type of relationship between the two nodes. Edges may also have a
direction, describing the direction in which the data or relationship flows.
Graph databases are mostly utilized for studying relationships. For instance,
businesses might extract client information from social media using a graph
database. For example, some organization might use a graph database to extract
data about relationships between Person, Restaurant, and City, as shown in
Figure 2.

Figure 2. Different Nodes and Edges in Graph Database.


(Adapted from https://www.kdnuggets.com/)

When do we need Graph Database?


1) It resolves issues with many-to-many relationships. For example, many-
to-many relationships include friends of friends.
2) When connections among data pieces are more significant. For example,
there is a profile with some unique information, but the main selling
point is the relationship between these different profiles, which is how
you get connected inside a network.
3) Low latency with big amounts of data. The relational database's data sets
will grow significantly as you add more relationships, and when you
query it, its complexity will increase and it will take longer than usual.
However, graph databases are specifically created for this purpose, and
one can easily query relationships.

Now, let’s look at a more specific illustration to explain a group of people's


complicated relationships. For example, five friends share a social network.
These friends are Binny, Bhawna, Chaitaya, Manish, and Mohit. Their personal
data may be kept in a graph database that resembles this, as shown in Figure 3
and Table 3:

Figure 3. Example-Five friends sharing Social network.

Table 3: Relational database: an example


Id Firstname Lastname Email Mobile
1001 Biney Dayal [email protected] 8645212321
1002 Bhawna Rao [email protected] 9645212323
1003 Chaitaya Robert [email protected] 7645212356
1004 Manish Kumar [email protected] 9955212320
1005 Mohit Jain [email protected] 9945212329
This means we will need yet another table to keep track of user relationships.
Our friendship table (refer Table 4) will resemble the following:

Table 4: Friendship Table


user_id friend_id
1001 1002
1001 1003
1001 1004
1001 1005
1002 1001
1002 1003
1002 1004
1002 1005
1003 1001
1003 1002
1003 1004
1003 1005
1004 1001
1004 1002
1004 1003
1004 1005
1005 1001
1005 1002
1005 1003
1005 1004
We won't go too deeply into the theory of the database's primary key and foreign
key. Instead, presume that the friendship table uses both friends' ids. Let's say
that every member of our social network gets access to a feature that lets them
view the personal information of the other users who are friends with them.
This means that if Chaitaya were to ask for information, it would be regarding
Biney, Bhawna, Manish and Mohit. We shall address this issue in a conventional
(relational database) manner. First, we need to locate Chaitaya's user id in the
database's Users table (refer Table 5).

Table 5: Chaitaya’s Record


Id Firstname Lastname Email Mobile
1003 Chaitaya Robert [email protected] 7645212356
We would now search the friendship table (refer Table 6) for all tuples with the
user_id of 1003. The resulting relationship would look like this:

Table 6: Friendship Table for user_id 1003


user_id friend_id
1003 1001
1003 1002
1003 1004
1003 1005
Let us now examine the time required for this relational database strategy. This
will be close to log(N) time, where N is the number of tuples in the friendship
table, since the database keeps the entries in sequential order based on their ids.
So, in general, the time complexity for M queries is M*log(N). Had we used a
graph database strategy, the overall time complexity would have been O(N), for
the simple reason that once Chaitaya has been located in the database, all the rest
of her friends may be found in a single hop, as shown in Figure 4.

Figure 4. Accessing other data with a single click.
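The difference can be sketched in Python as follows (an abridged, illustrative version of the tables above): the relational-style lookup scans the friendship table, whereas the graph-style lookup simply follows the edges attached to Chaitaya's node.

# Relational-style scan of the friendship table vs graph-style adjacency lookup.
# The data below is an abridged, illustrative copy of Tables 3 and 4.

users = {1001: "Biney", 1002: "Bhawna", 1003: "Chaitaya", 1004: "Manish", 1005: "Mohit"}
friendship = [(1001, 1002), (1001, 1003), (1001, 1004), (1001, 1005),
              (1003, 1001), (1003, 1002), (1003, 1004), (1003, 1005)]   # abridged

# Relational-style: filter the friendship table for user_id 1003, then look up names.
relational_friends = [users[f] for (u, f) in friendship if u == 1003]

# Graph-style: an adjacency list keeps each node's edges with the node itself.
adjacency = {1003: [1001, 1002, 1004, 1005]}
graph_friends = [users[f] for f in adjacency[1003]]

print(relational_friends)   # ['Biney', 'Bhawna', 'Manish', 'Mohit']
print(graph_friends)        # ['Biney', 'Bhawna', 'Manish', 'Mohit']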


Graph Database Examples
Although graph databases are not as widely used as other NoSQL databases,
there are a handful that have become de facto standards when discussing
NoSQL:
Neo4j is an open-source graph database developed in Java. It is considered to be one of the
best graph databases. In addition, it comes with its own language known as Cypher, which is
comparable to the declarative SQL language but is designed to work with graphs. In addition
to Java, it supports a number of other popular programming languages, including Python,
.NET, JavaScript, and a few others. Neo4j excels in applications such as the administration
of data centers and the identification of fraudulent activity.
RedisGraph is a graph module that is integrated into Redis, which is a key-value NoSQL
database. RedisGraph keeps its data in RAM, just as Redis itself is built on in-memory data
structures. As a result, a graph database with excellent speed and quick searching
and indexing is created. RedisGraph also makes use of Cypher, which is ideal if
you're a programmer or data scientist looking for greater database flexibility.
Applications that require blazing-fast performance are the main uses.
OrientDB It is interesting to note that OrientDB supports graph, document store,
key-value store, and object-based data formats. Having stated that, the graph
model, which uses direct links between databases, is used to hold all of the
relationships. Although it does not use Cypher, OrientDB is open-source and, like Neo4j,
developed in Java. OrientDB
is designed to be used in situations when many data models are necessary, and
as a result, it is optimized for data consistency as well as minimizing data
complexity.

15.3.3 Key-value pair Based


Key-value stores are perhaps the most widely used of the four major NoSQL
database formats because of their simplicity and quick performance. Let us
examine key-value stores' operation and application in more detail. With some
of the most well-known platforms and services depending on them to deliver
material to users with lightning speed, NoSQL has grown in significance in our
daily lives. Of course, NoSQL includes a range of database types, but key-value
store is unquestionably the most used.
Because of its extreme simplicity, this kind of data model is built to execute
incredibly quickly when compared to relational databases. Furthermore, because
key-value stores adhere to the scalable NoSQL design philosophy, they are
flexible and simple to set up.
How Does a Key-Value Work?
In reality, key-value storage is quite simple. A value is saved with a key that
specifies its location, and a value can be pretty much any piece of data or
information. In reality, this design idea may be found in almost every
programming language as an array or map object, refer Figure 5. The fact that it
is persistently kept in a database management system makes a difference in this
case.

Figure 5. Example Key-Value database.


Popularity of key-value stores is due to the fact that information is stored as a
single large piece of data instead of as discrete data. As a result, indexing the
database is not really necessary to improve its performance. Instead, because of
the way it is set up, it operates more quickly on its own. Similar to that, it mostly
uses the get, put, and delete commands rather than having a language of its own.
Of course, this has the drawback that the data you receive in response to a request
is not screened. Under certain conditions, this lack of data management may be
problematic, but generally speaking, the trade-off is worthwhile. Because key-
value stores are both quick and reliable, the vast majority of programmers find
ways to get around any filtering or control problems that may arise.
Benefits of Key-Value
Key-value data models, one of the more well-liked types of NoSQL data models,
provide many advantages when it comes to creating a database:
Scalability: Key-value stores, like NoSQL in general, are infinitely scalable in
a horizontal fashion, which is one of its main advantages over relational
databases. This can be a huge advantage for sophisticated and larger databases
compared to relational databases, where expansion is vertical and finite, as
shown in Figure 6.

Figure 6. Horizontal and Vertical Scalability.


More specifically, partitioning and replication are used to manage this. Additionally, key-value
stores relax ACID guarantees in exchange for benefits such as low-overhead server calls.
No/Simpler Querying: With key-value stores, querying is really not possible
except in very particular circumstances when it comes to querying keys, and
even then, it is not always practicable. Because there is just one request to read
and one request to write, key-value makes it easier to manage situations like
sessions, user profiles, shopping carts, and so on (due to the blob-like nature of
how the data is stored). Similar to this, concurrency problems are simpler to
manage because only one key needs to be resolved.
Mobility: Because key-value stores lack a query language, it is simple to move
them from one system to another without modifying the architecture or the code.
Thus, switching operating systems is less disruptive than switching relational
databases.
When to Use Key-Value
Key-value stores excel in this area because traditional relational databases are
not actually designed to manage a large number of read/write operations. Key-
value can readily scale to thousands of users per second due to its scalability.
Additionally, it can easily withstand lost storage or data because of the built-in
redundancy.
As a result, key-value excels in the following instances:
 Profiles and user preferences
 Large-scale user session management
 Product suggestions (such as in eCommerce platforms)

 Delivery of personalized ads to users based on their data profiles
 Cache data for infrequently updated data
There are numerous other circumstances where key-value works nicely. For
instance, because of its scalability, it frequently finds usage in big data research.
Similar to how it works for web applications, key-value is effective for
organizing player sessions in MMOG (massively multiplayer online game) and
other online games.

Key-Value Database Examples


Some key-value database models save information to a solid-state drive (SSD), while others
use random-access memory (RAM). We depend on key-value stores on a daily basis, since
some of the most popular and frequently used databases are key-value stores.
Amazon DynamoDB is most likely the most widely used key-value database. In fact, research
into Amazon's Dynamo was an impetus for the rise in popularity of NoSQL.
Aerospike is a free and open-source database that was designed specifically for
use with in-memory data storage.
Berkeley DB: Another free and open-source database, Berkeley DB is a high-
performance framework for storing databases, despite the fact that it has a very
simple interface.
Couchbase: Text searches and querying in a SQL-like format are both possible
with Couchbase, which is an interesting feature.
Memcached not only saves cached data in RAM, which helps websites load
more quickly, but it is also free and open source.
Riak was designed specifically for use in the app development process, and it
plays well with other databases and app platforms.
Redis: A database that serves as both a memory cache and a message broker.

15.3.4 Document Based


A non-relational database that stores data as structured documents is known as a
document database (also known as a NoSQL document store). Instead of using
standard rows and columns, it stores data in more recent formats such as JSON.
An XML or JSON file, or a PDF, are all examples of documents. NoSQL
is everywhere nowadays; just look at Twitter and its use of FlockDB or Amazon
and their use of DynamoDB. Figure 7 shows the difference between the
Relational and Document Store model.
Figure 7: Relational Vs Document Store Model.
In spite of the fact that there are a great deal of data models, each of which
contains hundreds of databases, the one we are going to investigate today is
called Document-store. One of the most common database models now in use,
document-store functions in a manner that is somewhat similar to that of the key-
value model in the sense that documents are saved together with particular keys
that access the information. Figure 8 (a) shows the document that holds
information about a book. This file is a JSON representation of a book's
metadata, which includes the book's BookID, Title, Author, and Year and Figure
8 (b) shows the same metadata for Key value database.

(a) A Document:

{
  "BookID": "978-1449396091",
  "Title": "DBMS",
  "Author": "Raghu Ramakrishnan",
  "Year": "2022"
}

(b) Key Value:

BookID  978-1449396091
Title   DBMS
Author  Raghu Ramakrishnan
Year    2022
Figure 8: Example of Document and Key-value database
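The same book metadata can be sketched in Python to contrast the two layouts; the dictionaries below are only stand-ins for real document and key-value stores.

# Illustrative contrast between a document store and a key-value store
# holding the book metadata from Figure 8.

import json

# (a) Document store: one key (the document id) maps to the whole JSON document.
document_store = {
    "978-1449396091": json.dumps({
        "Title": "DBMS",
        "Author": "Raghu Ramakrishnan",
        "Year": "2022",
    })
}

# (b) Key-value store: each attribute is stored as its own key-value pair.
key_value_store = {
    "BookID": "978-1449396091",
    "Title": "DBMS",
    "Author": "Raghu Ramakrishnan",
    "Year": "2022",
}

# Fetching the whole book is a single lookup in the document store:
print(json.loads(document_store["978-1449396091"])["Author"])   # Raghu Ramakrishnan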
When to use a document database?
 When your application requires data that is not structured in a table
format.
 When your application requires a large number of modest continuous
reads and writes and all you require is quick in-memory access.
 When your application requires CRUD (Create, Read, Update, Delete)
functionality.
 These are often adaptable and perform well when your application has to
run across a broad range of access patterns and data kinds.

How does a Document Database Work?
It appears that document databases work under the assumption that any kind of
information can be stored in a document. This suggests that you shouldn't have
to worry about the database being unable to interpret any combination of data
types. Naturally, in practice, most document databases continue to use some sort
of schema with a predetermined structure and file format.
Document stores do not have the same foibles and limitations as SQL databases,
which are both tabular and relational. This implies that using the information at
hand is significantly simpler and running queries may also be much simpler.
Ironically, you can execute the same types of operations in a document storage
that you can in a SQL database, including removing, adding, and querying.
Each document requires a key of some kind, as was previously mentioned, and
this key is given to it through a unique ID. This unique ID is used to access the
document directly, instead of the data being retrieved column by column.
Document databases often have a lower level of security than SQL databases.
As a result, you really need to think about database security, and utilizing Static
Application Security Testing (SAST) is one approach to do so. SAST examines
the source code directly to hunt for flaws. Another option is to use DAST, a
dynamic version that can aid in preventing NoSQL injections.

Document database advantages


One major benefit of document stores is that all of the data for an item is stored in a single
location, rather than being spread out over many interconnected databases. As a result, if you
do not need relational operations, they can perform better than a SQL database.
 Schema-less: Because there are no constraints on the format and
structure of data storage, they are particularly effective at keeping huge
quantities of existing data.
 Faster document creation and maintenance: The creation of a
document is a fairly straightforward process and, apart from that, the
upkeep requirements are virtually nonexistent.
 Open formats: It offers a relatively easy construction process that makes
use of XML, JSON, and other formats.
 Built-in versioning: Because it contains built-in versioning, it means
that when the documents expand in size, there is a possibility that they
will also expand in complexity. Versioning makes conflicts less likely.
More precisely, document stores are excellent for the following applications
because schema can be changed without any downtime or because you could not
know future user needs:
 eCommerce giants (Like Amazon)
 Blogging platforms (such as Blogger, Tumblr)
 CMS (Content management systems) (Like WordPress, windows
registry)
 Analytical platforms (such as Tableau, Oracle server)
Cassandra

Besides Cassandra, we have the following NoSQL databases that are quite popular:

● Apache HBase: HBase is an open source, non-relational, distributed database


modeled after Google’s BigTable and is written in Java. It is developed as a part of
Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities
for Hadoop.

● MongoDB: MongoDB is a cross-platform document-oriented database system that


avoids using the traditional table-based relational database structure in favor of JSON-
like documents with dynamic schemas making the integration of data in certain types
of applications easier and faster.

What is Apache Cassandra?


Apache Cassandra is an open source, distributed and decentralized/distributed storage system
(database), for managing very large amounts of structured data spread out across the world.
It provides highly available service with no single point of failure.

Listed below are some of the notable points of Apache Cassandra:

 It is scalable, fault-tolerant, and consistent.

 It is a key-value as well as a column-oriented database.

 Its distribution design is based on Amazon’s Dynamo and its data model on Google’s
Bigtable.

 Created at Facebook, it differs sharply from relational database management


systems.

 Cassandra implements a Dynamo-style replication model with no single point of


failure, but adds a more powerful “column family” data model.

 Cassandra is being used by some of the biggest companies such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below
are some of the features of Cassandra:

● Elastic scalability: Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.

● Always on architecture: Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot afford a failure.

● Fast linear-scale performance: Cassandra is linearly scalable, i.e., it increases
your throughput as you increase the number of nodes in the cluster. Therefore it
maintains a quick response time.

● Flexible data storage: Cassandra accommodates all possible data formats
including structured, semi-structured, and unstructured. It can dynamically
accommodate changes to your data structures according to your need.

● Easy data distribution: Cassandra provides the flexibility to distribute data where
you need by replicating data across multiple datacenters.

● Transaction support: Cassandra supports properties like Atomicity, Consistency,
Isolation, and Durability (ACID).

● Fast writes: Cassandra was designed to run on cheap commodity hardware. It
performs blazingly fast writes and can store hundreds of terabytes of data, without
sacrificing the read efficiency.

History of Cassandra
 Cassandra was developed at Facebook for inbox search.
 It was open-sourced by Facebook in July 2008.
 Cassandra was accepted into Apache Incubator in March 2009.
 It was made an Apache top-level project since February 2010.

2. ARCHITECTURE

The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster.

 All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.

 Each node in a cluster can accept read and write requests, regardless of where the
data is actually located in the cluster.

 When a node goes down, read/write requests can be served from other nodes in the
network.

Data Replication in Cassandra


In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
If it is detected that some of the nodes responded with an out-of-date value, Cassandra will
return the most recent value to the client. After returning the most recent value, Cassandra
performs a read repair in the background to update the stale values.

The following figure shows a schematic view of how Cassandra uses data replication among
the nodes in a cluster to ensure no single point of failure.


Note: Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.

Components of Cassandra
The key components of Cassandra are as follows:

 Node: It is the place where data is stored.

 Data center: It is a collection of related nodes.

 Cluster: A cluster is a component that contains one or more data centers.

 Commit log: The commit log is a crash-recovery mechanism in Cassandra. Every


write operation is written to the commit log.

 Mem-table: A mem-table is a memory-resident data structure. After commit log, the


data will be written to the mem-table. Sometimes, for a single-column family, there
will be multiple mem-tables.

 SSTable: It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.


 Bloom filter: These are quick, probabilistic structures for testing whether an element
is a member of a set. It is a special kind of cache. Bloom filters are consulted on read
queries to avoid unnecessary SSTable lookups.

Cassandra Query Language


Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt
to work with CQL or separate application language drivers.

Clients approach any of the nodes for their read-write operations. That node (the coordinator)
acts as a proxy between the client and the nodes holding the data.
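As an illustrative sketch, the fragment below uses the DataStax Python driver (cassandra-driver) as one such application language driver, assuming a Cassandra node is reachable on localhost; the keyspace and table names are invented for the example.

# Connecting to Cassandra from Python (assumes: pip install cassandra-driver,
# and a node running on localhost). Keyspace/table names are illustrative.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # contact point; any node can act as coordinator
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS emp (emp_id int PRIMARY KEY, emp_name text)")
session.execute("INSERT INTO emp (emp_id, emp_name) VALUES (%s, %s)", (1, "Asha"))

for row in session.execute("SELECT emp_id, emp_name FROM emp"):
    print(row.emp_id, row.emp_name)

cluster.shutdown()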

Write Operations
Every write activity of nodes is captured by the commit logs written in the nodes. Later the
data will be captured and stored in the mem-table. Whenever the mem-table is full, data will
be written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding
unnecessary data.

Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that holds the required data.

3. DATA MODEL

The data model of Cassandra is significantly different from what we normally see in an RDBMS.
This chapter provides an overview of how Cassandra stores its data.

Cluster
Cassandra database is distributed over several machines that operate together. The
outermost container is known as the Cluster. For failure handling, every node contains a
replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a
cluster, in a ring format, and assigns data to them.

Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are:

● Replication factor: It is the number of machines in the cluster that will receive copies
of the same data.

● Replica placement strategy: It is nothing but the strategy to place replicas in


the ring. We have strategies such as simple strategy (rack-aware strategy), old
network topology strategy (rack-aware strategy), and network topology
strategy (datacenter-shared strategy).

● Column families: Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each keyspace
has at least one and often many column families.

The syntax of creating a Keyspace is as follows:

CREATE KEYSPACE Keyspace name


WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};


The following illustration shows a schematic view of a Keyspace.

Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an
ordered collection of columns. The following table lists the points that differentiate a column
family from a table of relational databases.

Relational Table: A schema in a relational model is fixed. Once we define certain columns for a
table, while inserting data, in every row all the columns must be filled at least with a null value.
Cassandra Column Family: In Cassandra, although the column families are defined, the columns
are not. You can freely add any column to any column family at any time.

Relational Table: Relational tables define only columns and the user fills in the table with values.
Cassandra Column Family: In Cassandra, a table contains columns, or can be defined as a super
column family.

A Cassandra column family has the following attributes:

 keys_cached – It represents the number of locations to keep cached per SSTable.
 rows_cached – It represents the number of rows whose entire contents will be cached in memory.
 preload_row_cache – It specifies whether you want to pre-populate the row cache.

Note: Unlike relational tables, a column family’s schema is not fixed; Cassandra does not force
individual rows to have all the columns.

The following figure shows an example of a Cassandra column family.

Column
A column is the basic data structure of Cassandra. It holds three values: a key (the column
name), a value, and a timestamp.
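
As a minimal CQL sketch (table and column names are illustrative), a column family can be
created and a row inserted as follows; each non-key value is stored internally with its column
name, value, and write timestamp:

CREATE TABLE student (
   rollno int PRIMARY KEY,
   name text,
   branch text
);

INSERT INTO student (rollno, name, branch) VALUES (1, 'Monika', 'Computer');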

ClickHouse
1. Introduction to ClickHouse
1. ClickHouse is an open-source column-oriented database management system.
2. Originally developed by Yandex for their web analytics platform and open-sourced in 2016.
3. Specially optimized for OLAP (Online Analytical Processing) workloads.
4. Columnar storage → reads only needed data → faster queries.
5. Designed for analytical queries (summaries, reports, trends).
6. Handles petabyte-scale data efficiently.

2. Features of ClickHouse
Feature | Description
Columnar Storage | Stores data by columns instead of rows.
Fast Query Performance | Processes billions of rows per second.
Data Compression | Reduces storage requirements.
Fault Tolerant | High availability using replication.
SQL Compatible | Supports a dialect of SQL for queries.
Scalable Architecture | Works on single nodes or across distributed clusters.
Real-Time Inserts | Supports near real-time data ingestion.

3. ClickHouse Architecture Overview

1. Client Layer: Applications or users send SQL queries.
2. Server Layer: ClickHouse servers process the queries.
3. Storage Layer: Data is stored on disk in a compressed, columnar format.
4. MergeTree Engine Family: the core storage engines that manage partitioning, indexing,
   and replication:
   ● MergeTree: the default engine; supports indexing and partitioning.
   ● ReplicatedMergeTree: adds replication for fault tolerance.
   ● ReplacingMergeTree, SummingMergeTree: specialized variants for specific needs.

4. Data Model in ClickHouse

1. Table-based model (similar to an RDBMS).
2. Each table consists of multiple columns.
3. A sparse primary index minimizes data scanning (see the table sketch after this list).
4. No foreign keys (optimized for analytics, not relational consistency).
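
As a minimal sketch, a MergeTree table matching the sample query in section 6 could be
defined as follows (the column types, partition expression, and sorting key are illustrative
assumptions, not part of the notes above):

CREATE TABLE user_activity
(
    event_time DateTime,
    event_type String,
    user_id    UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);

The ORDER BY clause defines the sorting key from which the sparse primary index is built, and
PARTITION BY splits the data into monthly parts so queries can skip irrelevant data.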
5. Real-world Use Cases
1. Web analytics and reporting.
2. Real-time dashboards.
3. Financial market analysis.
4. Monitoring systems (like server logs, events).
5. IoT data aggregation.

6. Sample Query Example


SELECT
event_type,
COUNT(*) AS event_count
FROM
user_activity
WHERE
event_time >= now() - INTERVAL 7 DAY
GROUP BY
event_type
ORDER BY
event_count DESC;

7. Advantages and Limitations


Advantages:

1. Extremely fast query processing.


2. Efficient disk usage with compression.
3. Easy to scale out (add more servers).

Limitations:

1. Not designed for transactional systems (no full ACID transaction support like a traditional DBMS).
2. No strong relational integrity (no foreign keys; joins across distributed clusters are not straightforward).
3. Requires tuning for best performance at scale.
Disadvantages of NoSQL Databases:
• Lack of standardization - The design and query languages of NoSQL databases vary
widely between different NoSQL products – much more widely than they do among
traditional SQL databases.
• Backup of Database - Though some NoSQL databases like MongoDB provide tools for
backup, these tools are not mature enough to ensure a proper and complete data
backup solution.
• Security - NoSQL databases are generally subject to a fairly long list of security
issues. This is a limiting factor for NoSQL deployment.
• Consistency - NoSQL puts scalability and performance first; data consistency is given
less consideration, which makes it less reliable than a relational database. For
example, in some NoSQL databases, if you enter the same set of data again it is
accepted without any error, whereas relational databases ensure that no duplicate
rows enter the database.
BASICS OF MONGODB:
• MongoDB is an open source, document-oriented database designed with both
scalability and developer agility in mind.
• Instead of storing your data in tables and rows as you would with a relational database,
in MongoDB you store JSON-like documents with dynamic schemas(schema-free,
schema less).
{
  "_id" : ObjectId("5114e0bd42…"),
  "FirstName" : "John",
  "LastName" : "Doe",
  "Age" : 39,
  "Interests" : [ "Reading", "Mountain Biking" ],
  "Favorites" : {
    "color" : "Blue",
    "sport" : "Soccer"
  }
}

Introduction:
• Developed by 10gen
• Founded in 2007
• A document-oriented, NoSQL database
• Hash-based, schema-less database
• No Data Definition Language
• It stores hashes with arbitrary keys and values
• Keys are a basic data type but are in reality stored as strings
• A document identifier (_id) is created for each document; this field name is reserved by
the system
• The application tracks the schema and mapping
• Uses the BSON format - based on JSON, where B stands for Binary
• Written in C++
• Supports APIs (drivers) in many programming languages:
JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
MongoDB is a cross-platform, document-oriented database that provides high performance,
high availability, and easy scalability. MongoDB works on the concept of collections and documents.

Database
Database is a physical container for collections. Each database gets its own set of files on the
file system. A single MongoDB server typically has multiple databases.

Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A
collection exists within a single database. Collections do not enforce a schema. Documents
within a collection can have different fields. Typically, all documents in a collection are of
similar or related purpose.

Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema
means that documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of data.
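
For instance, the two shell inserts below (collection and field names are illustrative) are both
valid in the same collection even though the documents have different fields:

db.people.insert({ name: "John", age: 39 })
db.people.insert({ name: "Asha", city: "Pune", interests: [ "Reading" ] })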

Why MongoDB?
• Simple queries
• Functionality provided applicable to most web applications
• Easy and fast integration of data
• No ERD diagram
• Not well suited for heavy and complex transactional systems
• Document Oriented Storage: Data is stored in the form of JSON style documents.
• Index on any attribute
• Replication and high availability
• Auto-sharding
• Rich queries
• Fast in-place updates
• Professional support by MongoDB

Where to Use MongoDB?


• Big Data
• Content Management and Delivery
• Mobile and Social Infrastructure
• User Data Management
• Data Hub

Advantages of MongoDB over RDBMS


• Schema less: MongoDB is a document database in which one collection holds
different documents. Number of fields, content and size of the document can differ
from one document to another.
• Structure of a single object is clear.
• No complex joins.
• Deep query-ability. MongoDB supports dynamic queries on documents using a
document-based query language that's nearly as powerful as SQL.
• Tuning.
• Ease of scale-out: MongoDB is easy to scale.
• Conversion/mapping of application objects to database objects not needed.
• Uses internal memory for storing the (windowed) working set, enabling faster
access of data.

MongoDB Hierarchical Objects: Database → Collection → Document → Fields (key-value pairs)

CRUD Operations:
1. Create
   a. db.collection.insert( <document> )
      • Omit the _id field to have MongoDB generate a unique key
      • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
      • Example: db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
   b. db.collection.save( <document> )
      • Updates an existing record or creates a new one
   c. db.collection.update( <query>, <update>, { upsert: true } )
2. Read
   a. db.collection.find( <query>, <projection> )
   b. db.collection.findOne( <query>, <projection> )
      • Provides functionality similar to the SELECT command
      • <query> is the where condition; <projection> lists the fields in the result set
      • Example: var partsCursor = db.parts.find( { type: "hammer" } ).limit(5)
      • Returns a cursor to handle the result set
      • The query can be modified to impose limits, skips, and sort orders
      • Can return only the 'top' number of records from the result set
3. Update
   a. db.collection.update( <query>, <update>, { upsert: true } )
      • Updates one or more records in a collection satisfying the query
   b. db.collection.save( <document> )
      • Updates an existing record or creates a new record
4. Delete
   a. db.collection.remove( <query>, <justOne> )
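
A minimal mongo-shell session tying these operations together (the parts collection and its
fields are illustrative):

// create: insert a document (MongoDB generates _id automatically)
db.parts.insert({ type: "screwdriver", quantity: 15 })

// read: find matching documents
db.parts.find({ type: "screwdriver" })

// update: set a new quantity, inserting the document if it does not exist
db.parts.update({ type: "screwdriver" }, { $set: { quantity: 20 } }, { upsert: true })

// delete: remove at most one matching document
db.parts.remove({ type: "screwdriver" }, true)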

Replication:

• Replication provides redundancy and increases data availability.
• With multiple copies of data on different database servers, replication provides a level
of fault tolerance against the loss of a single database server.

Sharding components (illustrative shell commands follow this list):
• Shard: a Mongo instance that holds a subset of the original data.
• Mongos: a query router that directs operations to the appropriate shards.
• Config Server: a Mongo instance that stores metadata and configuration details of the
cluster.
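
As a hedged sketch (the database name, collection, and shard key are illustrative assumptions),
these mongo-shell commands show how a replica set is initiated and how sharding is enabled:

// on a mongod started with --replSet: initiate and inspect the replica set
rs.initiate()
rs.status()

// on a mongos query router: enable sharding for a database and shard a collection
sh.enableSharding("mydb")
sh.shardCollection("mydb.parts", { type: 1 })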
Java Mongo Connectivity

By Prof. B.A.Khivsara

Note: The material used to prepare this presentation has been taken from the internet and is
intended only for student reference, not for commercial use.
Software Required and Steps
• Eclipse
• JDK 1.6
• MongoDB
• MongoDB-Java-Driver
In Eclipse, perform the following steps:
1. File → New → Java Project → give the project name → OK
2. In the Project Explorer window, right-click the project name → New → Class → give the
class name → OK
3. In the Project Explorer window, right-click the project name → Build Path → Configure
Build Path → Libraries → Add External JARs → select the MongoDB-Java-Driver JAR
4. Start the Mongo server before running the program
Import packages, Create connection,
database and collection
• Import packages
import com.mongodb.*;

• Create connection
MongoClient mongo = new MongoClient( "localhost" , 27017 );

• Create Database
DB db = mongo.getDB("database name");

• Create Collection
DBCollection coll = db.getCollection("Collection Name");
Insert Document

BasicDBObject d1 = new BasicDBObject("rno", "1").append("name", "Monika").append("age", "17");

BasicDBObject d2 = new BasicDBObject("rno", "2").append("name", "Roshan").append("age", "18");

coll.insert(d1);

coll.insert(d2);
Display document

DBCursor cursor = coll.find();


while (cursor.hasNext())
{
System.out.println(cursor.next());
}
Update Document
BasicDBObject query = new BasicDBObject();
query.put("name", "Monika");
BasicDBObject newDocument = new BasicDBObject();
newDocument.put("name", "Ragini");
BasicDBObject updateObj = new BasicDBObject();
updateObj.put("$set", newDocument);
coll.update(query, updateObj);
Remove document

BasicDBObject searchQuery = new BasicDBObject();

searchQuery.put("name", "Monika");

coll.remove(searchQuery);
Program
import com.mongodb.*;

public class conmongo {
    public static void main(String[] args) {
        try {
            // connect to the local MongoDB server
            MongoClient mongoClient = new MongoClient("localhost", 27017);
            DB db = mongoClient.getDB("mydb");
            DBCollection coll = db.createCollection("Stud", null);

            // insert two documents
            BasicDBObject doc1 = new BasicDBObject("rno", "1").append("name", "Mona");
            BasicDBObject doc2 = new BasicDBObject("rno", "2").append("name", "swati");
            coll.insert(doc1);
            coll.insert(doc2);

            // display all documents in the collection
            DBCursor cursor = coll.find();
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }

            // update: change the name "Mona" to "Ragini"
            BasicDBObject query = new BasicDBObject();
            query.put("name", "Mona");
            BasicDBObject N1 = new BasicDBObject();
            N1.put("name", "Ragini");
            BasicDBObject S1 = new BasicDBObject();
            S1.put("$set", N1);
            coll.update(query, S1);

            // remove the document whose name is now "Ragini"
            BasicDBObject R1 = new BasicDBObject();
            R1.put("name", "Ragini");
            coll.remove(R1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
MongoDB Testing
1. Introduction
1. MongoDB is a popular NoSQL database that stores data in JSON-like documents.
2. Testing MongoDB is crucial to ensure:

A. Database operations work correctly (insert, update, delete, query).


B. Data integrity is maintained.
C. Application logic interacting with MongoDB is correct.

2. Types of Testing in MongoDB


Testing Type Purpose

Unit Testing Test individual database functions/methods.

Integration Testing Test MongoDB together with application logic.

Performance Testing Test MongoDB under load (heavy read/write).

Security Testing Ensure MongoDB is safe against vulnerabilities.

Backup & Recovery Testing Validate data backup and restoration processes.

3. Tools for Testing MongoDB


Tool Usage

Jest + Mongoose Unit testing in Node.js apps.

Mocha/Chai JavaScript testing frameworks with MongoDB.

MongoDB in-memory server Create a temporary MongoDB instance for tests.

Postman Manual API testing that interacts with MongoDB.

NoSQLUnit Java testing framework for MongoDB database.

Locust / JMeter Load testing MongoDB servers.


4. Key Points to Test in MongoDB
1. Data Insertion: Are documents inserted correctly?
2. Data Querying: Can queries retrieve the right data?
3. Data Updating: Are updates applied correctly?
4. Data Deletion: Is deletion working as expected?
5. Index Testing: Are indexes created and used properly?
6. Transactions Testing: (If using multi-document transactions.)

5. Performance Testing for MongoDB

1. Test with large datasets.
2. Measure query execution time and read/write throughput.
3. Use bulk inserts, bulk reads, and aggregation pipeline testing.

Tools:
• JMeter + MongoDB Plugin
• Locust (Python)

6. Best Practices for MongoDB Testing


1. Use test databases (never run tests on production DB).
2. Prefer in-memory MongoDB for fast and isolated unit/integration tests (see the sketch after this list).
3. Use mocking for pure unit tests.
4. Clean up database after each test (drop collections, remove inserted docs).
5. Test failure cases: simulate network failure, unauthorized access, etc.
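
As a hedged sketch of practices 1, 2, and 4 (the package choices, database name, collection,
and fields are assumptions for illustration, not requirements from these notes), a Jest test
against an in-memory MongoDB could look like this:

// test file using jest, mongodb-memory-server, and the official mongodb Node.js driver
const { MongoMemoryServer } = require('mongodb-memory-server');
const { MongoClient } = require('mongodb');

let mongod, client, db;

beforeAll(async () => {
  mongod = await MongoMemoryServer.create();        // temporary, isolated test server
  client = new MongoClient(mongod.getUri());
  await client.connect();
  db = client.db('testdb');
});

afterEach(async () => {
  await db.collection('students').deleteMany({});   // clean up after each test
});

afterAll(async () => {
  await client.close();
  await mongod.stop();
});

test('inserts and reads back a document', async () => {
  await db.collection('students').insertOne({ rno: 1, name: 'Mona' });
  const doc = await db.collection('students').findOne({ rno: 1 });
  expect(doc.name).toBe('Mona');
});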
Metabase
1. Metabase is an open-source Business Intelligence (BI) tool.
2. Helps users query databases, visualize data, and build dashboards easily.
3. No need for complex programming — uses a simple GUI (graphical user interface).
4. Supports many databases like PostgreSQL, MySQL, MongoDB, and more.

Important Points and Limitations


Aspect | Details
Schema Detection | MongoDB is schema-less; Metabase tries to guess fields.
Simple Queries | Best suited for basic queries (filters, groups, counts).
Aggregations | Complex MongoDB aggregation pipelines are limited in the Metabase UI.
Performance | Big collections may cause slow queries.
SQL Support | MongoDB uses aggregation syntax, not real SQL; Metabase uses its own visual query builder.

Why Use MongoDB with Metabase?


1. To analyze and visualize data stored inside MongoDB.
2. To create dashboards, charts, and reports without writing complex MongoDB
queries manually.
3. To allow non-technical teams (e.g., business analysts) to understand MongoDB
data easily.

Connect MongoDB to Metabase


1. Open Metabase Admin Panel → Databases → Add a Database.
2. Select MongoDB.
3. Enter details:

a) Host (IP address or domain)


b) Port (default is 27017)
c) Database name
d) Username and Password (if authentication is enabled)
e) Connection options (e.g., SSL if needed)

4. Save and test the connection.

Once connected, MongoDB collections will be available in Metabase like tables.

Working with MongoDB in Metabase After Connecting

1. Collections (MongoDB's groups of documents) appear like tables.
2. Documents (individual MongoDB records) behave like rows.
3. You can then:

a) Filter, group, and summarize data.
b) Create visualizations (bar chart, pie chart, time series, etc.)
c) Build dashboards to monitor real-time or historical trends.

Sample Use Case


Suppose your MongoDB collection orders looks like:

{
"order_id": "12345",
"product": "Phone",
"price": 25000,
"order_date": "2025-04-01T10:00:00Z"
}

In Metabase:

1. Filter order_date to the last 30 days.
2. Group by product.
3. Sum price → to find total sales per product.
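
For comparison, a hand-written MongoDB aggregation expressing roughly the same question
(the date literal is illustrative, and it assumes order_date is stored as a Date) would be:

db.orders.aggregate([
  { $match: { order_date: { $gte: ISODate("2025-03-02T00:00:00Z") } } },  // last 30 days
  { $group: { _id: "$product", total_sales: { $sum: "$price" } } },
  { $sort: { total_sales: -1 } }
])

Metabase builds a similar pipeline behind its visual editor, which is why very complex pipelines
are usually better written by hand once the UI's options run out.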
