Unit 3: Types of Keys & Data Integrity
3.1. Keys: Super Key, Candidate Key, Primary Key, Alternate Key, Foreign Key
SQL provides the following types of keys, which are used to fetch records from tables and to establish relationships among tables or views.
1. Super Key
A super key is a set of one or more attributes that can identify a record uniquely in a table. Primary keys, unique keys and alternate keys are subsets of super keys.
2. Candidate Key
A candidate key is a minimal super key, i.e. a super key with no redundant attribute; a table may have more than one candidate key.
3. Primary Key
The primary key is the candidate key chosen to identify records uniquely in the table; it cannot contain NULL values.
4. Alternate Key
An alternate key is any candidate key that is not chosen as the primary key.
5. Unique Key
A unique key also identifies a record uniquely, but unlike the primary key it may allow a NULL value.
6. Foreign Key
A foreign key is a field in a database table that is the primary key in another table. It can accept multiple NULL and duplicate values.
--Department Table
CREATE TABLE Department
(
DeptID int PRIMARY KEY, --primary key
Name varchar (50) NOT NULL,
Address varchar (200) NOT NULL
)
--Student Table
CREATE TABLE Student
(
ID int PRIMARY KEY, --primary key
RollNo varchar(10) NOT NULL,
Name varchar(50) NOT NULL,
EnrollNo varchar(50) UNIQUE, --unique key
Address varchar(200) NOT NULL,
DeptID int FOREIGN KEY REFERENCES Department(DeptID) --foreign key
)
In practice, a database enforces only three of these keys directly: the primary key, the unique key and the foreign key. The other key types are conceptual DBMS terms that you should nevertheless know.
3.2. Constraints
Integrity Constraints
SQL Constraints
SQL constraints are used to specify rules for the data in a table.
Constraints are used to limit the type of data that can go into a table. This ensures
the accuracy and reliability of the data in the table. If there is any violation between
the constraint and the data action, the action is aborted.
Constraints can be column level or table level. Column level constraints apply to a
column, and table level constraints apply to the whole table.
The following constraints are commonly used in SQL: NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK and DEFAULT.
Example
If the table has already been created, you can add a NOT NULL constraint to a
column with the ALTER TABLE statement.
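A minimal sketch of such a statement is shown below; the table and column names (Persons, Age) are illustrative, and the exact syntax varies by DBMS (SQL Server uses ALTER COLUMN, MySQL uses MODIFY):

ALTER TABLE Persons
ALTER COLUMN Age int NOT NULL;   -- SQL Server style

ALTER TABLE Persons
MODIFY Age int NOT NULL;         -- MySQL style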
The UNIQUE constraint ensures that all values in a column are different.
Both the UNIQUE and PRIMARY KEY constraints provide a guarantee for uniqueness
for a column or set of columns.
The PRIMARY KEY constraint uniquely identifies each record in a database table.
Primary keys must contain UNIQUE values, and cannot contain NULL values.
A table can have only one primary key, which may consist of single or multiple
fields.
A FOREIGN KEY is a field (or collection of fields) in one table that refers to the
PRIMARY KEY in another table.
The table containing the foreign key is called the child table, and the table
containing the candidate key is called the referenced or parent table.
"Persons" table:
"Orders" table:
Notice that the "PersonID" column in the "Orders" table points to the "PersonID"
column in the "Persons" table.
The "PersonID" column in the "Persons" table is the PRIMARY KEY in the "Persons"
table.
The "PersonID" column in the "Orders" table is a FOREIGN KEY in the "Orders"
table.
The FOREIGN KEY constraint is used to prevent actions that would destroy links
between tables.
The FOREIGN KEY constraint also prevents invalid data from being inserted into the
foreign key column, because it has to be one of the values contained in the table it
points to.
The following SQL creates a FOREIGN KEY on the "PersonID" column when the
"Orders" table is created:
To create a FOREIGN KEY constraint on the "PersonID" column when the "Orders"
table is already created, use the following SQL:
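A sketch of the statement (constraint naming and exact syntax vary slightly by DBMS):

ALTER TABLE Orders
ADD FOREIGN KEY (PersonID) REFERENCES Persons(PersonID);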
The main integrity constraints are domain integrity, entity integrity, referential integrity and foreign key integrity.
Domain Integrity
Domain integrity means the definition of a valid set of values for an attribute. You
define
- data type,
- length or size
- is null value allowed
- is the value unique or not
for an attribute.
Rule 4. The model_id field in the Car table can have a null value, which means that the car type of that car is not known.
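As a minimal sketch of how such domain rules translate into a table definition (all column names other than model_id are assumptions for illustration):

CREATE TABLE Car
(
reg_no char(10) PRIMARY KEY,   -- data type and size define the domain
model_id int NULL,             -- NULL allowed: the car type may not be known (Rule 4)
colour varchar(20) NOT NULL    -- NULL not allowed
)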
DECLARE
num number := 6;
factorial number;
-- fact is assumed to be the recursive factorial function defined earlier in this example
BEGIN
factorial := fact(num);
dbms_output.put_line(' Factorial '|| num || ' is ' || factorial);
END;
/
When the above code is executed at the SQL prompt, it produces the following result −
Factorial 6 is 720
PL/SQL - Cursors
Oracle creates a memory area, known as the context area, for processing an SQL statement, which
contains all the information needed for processing the statement; for example, the number of rows
processed, etc.
A cursor is a pointer to this context area. PL/SQL controls the context area through a cursor. A cursor
holds the rows (one or more) returned by a SQL statement. The set of rows the cursor holds is referred to
as the active set.
You can name a cursor so that it could be referred to in a program to fetch and process the rows returned
by the SQL statement, one at a time. There are two types of cursors −
Implicit cursors
Explicit cursors
Implicit Cursors
Implicit cursors are automatically created by Oracle whenever an SQL statement is executed and there is no explicit cursor for the statement. Programmers cannot control implicit cursors or the information in them.
Whenever a DML statement (INSERT, UPDATE and DELETE) is issued, an implicit cursor is
associated with this statement. For INSERT operations, the cursor holds the data that needs to be
inserted. For UPDATE and DELETE operations, the cursor identifies the rows that would be affected.
In PL/SQL, you can refer to the most recent implicit cursor as the SQL cursor, which always has
attributes such as %FOUND, %ISOPEN, %NOTFOUND, and %ROWCOUNT. The SQL cursor has
additional attributes, %BULK_ROWCOUNT and %BULK_EXCEPTIONS, designed for use with
the FORALL statement. The following table provides the description of the most used attributes −
%FOUND
Returns TRUE if an INSERT, UPDATE, or DELETE statement affected one or more rows or a SELECT
INTO statement returned one or more rows. Otherwise, it returns FALSE.
%NOTFOUND
The logical opposite of %FOUND. It returns TRUE if an INSERT, UPDATE, or DELETE statement
affected no rows, or a SELECT INTO statement returned no rows. Otherwise, it returns FALSE.
%ISOPEN
Always returns FALSE for implicit cursors, because Oracle closes the SQL cursor automatically after
executing its associated SQL statement.
%ROWCOUNT
Returns the number of rows affected by an INSERT, UPDATE, or DELETE statement, or returned by a
SELECT INTO statement.
Any SQL cursor attribute will be accessed as sql%attribute_name as shown below in the example.
Example
The following program will update the table, increasing the salary of each customer by 500, and use the SQL%ROWCOUNT attribute to determine the number of rows affected −
DECLARE
total_rows number(2);
BEGIN
UPDATE customers
SET salary = salary + 500;
IF sql%notfound THEN
dbms_output.put_line('no customers selected');
ELSIF sql%found THEN
total_rows := sql%rowcount;
dbms_output.put_line( total_rows || ' customers selected ');
END IF;
END;
When the above code is executed at the SQL prompt, it produces the following result −
6 customers selected
Explicit Cursors
Explicit cursors are programmer-defined cursors for gaining more control over the context area. An
explicit cursor should be defined in the declaration section of the PL/SQL Block. It is created on a
SELECT Statement which returns more than one row.
The syntax for creating an explicit cursor is −
CURSOR cursor_name IS select_statement;
Working with an explicit cursor includes the following steps −
Declaring the cursor for initializing the memory
Opening the cursor for allocating the memory
Fetching the cursor for retrieving the data
Closing the cursor to release the allocated memory
Declaring the Cursor
Declaring the cursor defines the cursor with a name and the associated SELECT statement. For example
−
CURSOR c_customers IS
SELECT id, name, address FROM customers;
Opening the Cursor
Opening the cursor allocates the memory for the cursor and makes it ready for fetching the rows returned
by the SQL statement into it. For example, we will open the above defined cursor as follows −
OPEN c_customers;
Fetching the Cursor
Fetching the cursor involves accessing one row at a time. For example, we will fetch rows from the
above-opened cursor as follows −
FETCH c_customers INTO c_id, c_name, c_addr;
Closing the Cursor
Closing the cursor means releasing the allocated memory. For example, we will close the above-opened
cursor as follows −
CLOSE c_customers;
Example
Following is a complete example to illustrate the concepts of explicit cursors −
DECLARE
c_id customers.id%type;
c_name customers.name%type;
c_addr customers.address%type;
CURSOR c_customers is
SELECT id, name, address FROM customers;
BEGIN
OPEN c_customers;
LOOP
FETCH c_customers into c_id, c_name, c_addr;
EXIT WHEN c_customers%notfound;
dbms_output.put_line(c_id || ' ' || c_name || ' ' || c_addr);
END LOOP;
CLOSE c_customers;
END;
When the above code is executed at the SQL prompt, it produces the following result −
1 Ramesh Ahmedabad
2 Khilan Delhi
3 kaushik Kota
4 Chaitali Mumbai
5 Hardik Bhopal
6 Komal MP
DECLARE
-- the record type and variables below are assumed from the surrounding (truncated) example
TYPE books IS RECORD
(title varchar2(50),
author varchar2(50),
subject varchar2(100),
book_id number);
book1 books;
book2 books;
BEGIN
-- Book 1 specification
book1.title := 'C Programming';
book1.author := 'Nuha Ali ';
book1.subject := 'C Programming Tutorial';
book1.book_id := 6495407;
-- Book 2 specification
book2.title := 'Telecom Billing';
book2.author := 'Zara Ali';
book2.subject := 'Telecom Billing Tutorial';
book2.book_id := 6495700;
END;
/
PL/SQL - Triggers
Triggers are stored programs, which are automatically executed or fired when some events occur.
Triggers are, in fact, written to be executed in response to any of the following events −
A database manipulation (DML) statement (DELETE, INSERT, or UPDATE)
A database definition (DDL) statement (CREATE, ALTER, or DROP).
A database operation (SERVERERROR, LOGON, LOGOFF, STARTUP, or SHUTDOWN).
Triggers can be defined on the table, view, schema, or database with which the event is associated.
Benefits of Triggers
Triggers can be written for the following purposes −
Generating some derived column values automatically
Enforcing referential integrity
Event logging and storing information on table access
Auditing
Synchronous replication of tables
Imposing security authorizations
Preventing invalid transactions
Creating Triggers
The syntax for creating a trigger is −
CREATE [OR REPLACE ] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF }
{INSERT [OR] | UPDATE [OR] | DELETE}
[OF col_name]
ON table_name
[REFERENCING OLD AS o NEW AS n]
[FOR EACH ROW]
WHEN (condition)
DECLARE
Declaration-statements
BEGIN
Executable-statements
EXCEPTION
Exception-handling-statements
END;
Where,
CREATE [OR REPLACE] TRIGGER trigger_name − Creates or replaces an existing trigger with
the trigger_name.
{BEFORE | AFTER | INSTEAD OF} − This specifies when the trigger will be executed. The
INSTEAD OF clause is used for creating trigger on a view.
{INSERT [OR] | UPDATE [OR] | DELETE} − This specifies the DML operation.
[OF col_name] − This specifies the column name that will be updated.
[ON table_name] − This specifies the name of the table associated with the trigger.
[REFERENCING OLD AS o NEW AS n] − This allows you to refer to the new and old values for
various DML statements, such as INSERT, UPDATE, and DELETE.
[FOR EACH ROW] − This specifies a row-level trigger, i.e., the trigger will be executed for each
row being affected. Otherwise the trigger will execute just once when the SQL statement is
executed, which is called a table level trigger.
WHEN (condition) − This provides a condition for rows for which the trigger would fire. This
clause is valid only for row-level triggers.
Example
The following program creates a row-level trigger for the customers table that would fire for INSERT or
UPDATE or DELETE operations performed on the CUSTOMERS table. This trigger will display the
salary difference between the old values and new values –
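The trigger body itself is not reproduced above; a version consistent with the output shown below (standard Oracle PL/SQL; the sal_diff variable name is an assumption) would be:

CREATE OR REPLACE TRIGGER display_salary_changes
BEFORE DELETE OR INSERT OR UPDATE ON customers
FOR EACH ROW
WHEN (NEW.ID > 0)
DECLARE
sal_diff number;
BEGIN
sal_diff := :NEW.salary - :OLD.salary;
dbms_output.put_line('Old salary: ' || :OLD.salary);
dbms_output.put_line('New salary: ' || :NEW.salary);
dbms_output.put_line('Salary difference: ' || sal_diff);
END;
/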
Triggering a Trigger
Let us perform some DML operations on the CUSTOMERS table. Here is one INSERT statement, which
will create a new record in the table −
INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY)
VALUES (7, 'Kriti', 22, 'HP', 7500.00 );
When a record is created in the CUSTOMERS table, the above trigger, display_salary_changes, will be fired and it will display the following result −
Old salary:
New salary: 7500
Salary difference:
Because this is a new record, old salary is not available and the above result comes as null. Let us now
perform one more DML operation on the CUSTOMERS table. The UPDATE statement will update an
existing record in the table −
UPDATE customers
SET salary = salary + 500
WHERE id = 2;
When a record is updated in the CUSTOMERS table, the above trigger, display_salary_changes, will be fired and it will display the following result −
Old salary: 1500
New salary: 2000
Salary difference: 500
1.3.1 Embedded SQL
The embedded SQL statements can be put in an application program written in C, Java or any other host language. These statements are sometimes called static. Why are they called static? The term 'static' is used to indicate that the embedded SQL commands, which are written in the host program, do not change automatically during the lifetime of the program. Thus, such queries are determined at the time of database application design. For example, a query statement embedded in C to determine the status of train booking for a train will not change. However, this query may be executed for many different trains. Please note that only the input parameters to the query (train number, date of boarding, etc.) change, and not the query itself.
But how is such embedding done? Let us explain this with the help of an example.
Example: Write a C program segment that prints the details of a student whose
enrolment number is input.
Let us assume the relation
STUDENT (enrolno: char(9), name: char(25), phone: integer(12), prog_code: char(3))
/* add proper include statements */
/* declarations in C program */
EXEC SQL BEGIN DECLARE SECTION;
char enrolno[10], name[26], p_code[4];
int phone;
int SQLCODE;
char SQLSTATE[6];
EXEC SQL END DECLARE SECTION;
/* the connection needs to be established with SQL */
/* program segment for the required function */
printf("enter the enrolment number of the student: ");
scanf("%s", enrolno);
EXEC SQL
SELECT name, phone, prog_code INTO
:name, :phone, :p_code
FROM STUDENT
WHERE enrolno = :enrolno;
if (SQLCODE == 0)
printf("%s, %s, %d, %s", enrolno, name, phone, p_code);
else
printf("Wrong Enrolment Number");
The program is written in the host language 'C' and contains embedded SQL statements.
Although the program embeds only an SQL query (SELECT), you can embed any DML, DDL or view statement.
The distinction between an SQL statement and a host language statement is made by using the keyword EXEC SQL; this keyword helps the pre-compiler identify the embedded SQL statements.
Please note that the statements beginning with EXEC SQL are terminated by a semicolon (;).
As data is to be exchanged between the host language and the database, there is a need for shared variables that are visible in both environments. Please note that enrolno[10], name[26], p_code[4], etc. are shared variables declared in 'C'.
Please note that the shared host variable enrolno is declared as char[10] whereas the SQL attribute enrolno is only char(9). Why? Because in 'C' the conversion to a string adds a '\0' at the end of the string.
The type mapping between 'C' and SQL types is defined in the following table:
Please also note that these shared variables are used in SQL statements of the
program. They are prefixed with the colon (:) to distinguish them from
database attribute and relation names. However, they are used without this
prefix in any C language statement.
Please also note that these shared variables have almost the same names (except p_code) as the corresponding attribute names of the database. The colon (:) prefix distinguishes a reference to a shared host variable from a reference to an SQL attribute. Using such similar names is a good programming convention, as it helps in identifying the related attribute easily.
Please note that the shared variables are declared between BEGIN DECLARE SECTION and END DECLARE SECTION and that their types are defined in the 'C' language.
Two more shared variables have been declared in 'C'. These are:
SQLCODE as int
SQLSTATE as char of size 6
These variables are used to communicate errors and exception conditions between the database and the host language program. The value 0 in SQLCODE means successful execution of an SQL command. A value of SQLCODE = 100 means 'no more data'. A value of SQLCODE less than 0 indicates an error. Similarly, SQLSTATE is a 5-character code; the 6th character is for '\0' in the host language 'C'. The value "00000" in SQLSTATE indicates no error. You can refer to the SQL standard for more information.
In order to execute the required SQL commands, a connection with the database server needs to be established by the program. For this, the following SQL statement is used:
CONNECT <name of the server> AS <name of the connection>
AUTHORISATION <username, password>;
To disconnect, we can simply say:
DISCONNECT <name of the connection>;
Execution of SQL query in the given program: To create the SQL query, first, the
given value of enrolment number is transferred to SQL attribute value, the query then
is executed and the result, which is a single tuple in this case, is transferred to shared
host variables as indicated by the key word INTO after the SELECT statement.
The SQL query runs as a standard SQL query except the use of shared host variables.
The rest of the C program has very simple logic and will print the data of the student whose enrolment number has been entered.
printf("enter the programme code: ");
scanf("%s", p_code);
EXEC SQL DECLARE GUPDATE CURSOR FOR
SELECT enrolno, name, phone, grade
FROM STUDENT
WHERE prog_code = :p_code
FOR UPDATE OF grade;
EXEC SQL OPEN GUPDATE;
EXEC SQL FETCH FROM GUPDATE
INTO :enrolno, :name, :phone, :grade;
while (SQLCODE == 0) {
printf("enter grade for enrolment number %s: ", enrolno);
scanf("%c", &grade);   /* grade is assumed to be a shared host variable declared earlier */
EXEC SQL
UPDATE STUDENT
SET grade = :grade
WHERE CURRENT OF GUPDATE;
EXEC SQL FETCH FROM GUPDATE
INTO :enrolno, :name, :phone, :grade;
}
EXEC SQL CLOSE GUPDATE;
Please note that the declare section remains almost the same. The cursor is declared to hold the output of the SQL statement. Please notice that in this case there will be many tuples of the student database that belong to a particular programme.
The purpose of the cursor (FOR UPDATE OF grade) is also indicated during its declaration.
The cursor is then opened and the first tuple is fetched into the shared host variables, followed by the SQL query to update the required record. Please note the use of CURRENT OF, which states that these updates apply to the current tuple referred to by the cursor.
The WHILE loop checks SQLCODE to ascertain whether more tuples are pending in the cursor.
Please note that SQLCODE will be set by the last FETCH statement executed just prior to the while-condition check.
How are these SQL statements compiled and checked for errors in embedded SQL?
The SQL pre-compiler performs type checking of the various shared host variables to find any mismatches or errors in each of the SQL statements. It then stores the results in the SQLCODE or SQLSTATE variables.
Static embedded SQL statements offer only limited functionality, as the query must be known at the time of application development so that it can be pre-compiled in advance. However, many queries are not known at the time an application is developed; thus we also require dynamically embedded SQL.
Dynamic SQL statements, in contrast, are more powerful than static embedded SQL, as they allow the query to be constructed using run-time application logic. The basic advantage of using dynamic embedded SQL is that we need not compile and test a new program for every new query.
Let us explain the use of dynamic SQL with the help of an example:
Example: Write a dynamic SQL interface that allows a student to get and modify
permissible details about him/her. The student may ask for subset of information also.
Assume that the student database has the following relations.
STUDENT (enrolno, name, dob)
RESULT (enrolno, coursecode, marks)
For the tables above, a student has access rights to the information associated with his/her own enrolment number, but s/he cannot update the data. Assume that user names are enrolment numbers.
Solution: A sample program segment may be (please note that the syntax may change
for different commercial DBMS).
/* declarations for embedded SQL */
EXEC SQL BEGIN DECLARE SECTION;
char inputfields[50];
char tablename[10];
char sqlquerystring[200];
EXEC SQL END DECLARE SECTION;
printf("Enter the fields you want to see \n");
scanf("SELECT%s", inputfields);
printf("Enter the name of table STUDENT or RESULT: ");
scanf("FROM%s", tablename);
sqlquerystring = "SELECT " + inputfields + " " +
"FROM " + tablename +
" WHERE enrolno = :USER";
/* Plus is used as a symbol for the concatenation operator; in some DBMS it may be || */
/* Assumption: the user name is available in the host language variable USER */
The query can be entered completely as a string by the user, or s/he can be suitably prompted.
The query can be fabricated by concatenating strings. This is language dependent in the example and is not a portable feature of the present query.
The modification of the query (the WHERE clause on enrolno) is done keeping security in mind.
The query is then prepared and executed using suitable EXEC SQL commands, as in the sketch below.
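A minimal sketch of the prepare-and-execute step (the statement and cursor names are assumptions; exact dynamic SQL syntax varies by DBMS):

EXEC SQL PREPARE dynquery FROM :sqlquerystring;
EXEC SQL DECLARE dyncursor CURSOR FOR dynquery;
EXEC SQL OPEN dyncursor;
/* FETCH rows from dyncursor into host variables as needed */
EXEC SQL CLOSE dyncursor;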
1.3.4 SQLJ
Till now we have talked about embedding SQL in C, but how can we embed SQL statements in a Java program? For this purpose we use SQLJ. In SQLJ, a preprocessor called the SQLJ translator translates an SQLJ source file into a Java source file. The Java file is then compiled and run against the database. Use of SQLJ improves the productivity and manageability of Java code.
Please note that SQLJ cannot use dynamic SQL; it can only use simple embedded SQL. SQLJ provides a standard form in which SQL statements can be embedded in Java programs.
UNIT IV
TRANSACTION PROCESSING AND CONCURRENCY CONTROL
A transaction accesses data using two operations: read(X), which transfers the data item X from the database to a variable, also called X, in a main-memory buffer of the transaction that executed the read; and write(X), which transfers the value in the variable X in the main-memory buffer of the transaction that executed the write to the data item X in the database.
Example:
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can
be defined as:
Ti : read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
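In SQL terms, the same transfer could be sketched roughly as follows (the account table and its columns are assumptions used only for illustration):

UPDATE account SET balance = balance - 50 WHERE account_no = 'A';  -- read(A), A := A - 50, write(A)
UPDATE account SET balance = balance + 50 WHERE account_no = 'B';  -- read(B), B := B + 50, write(B)
COMMIT;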
Let us now consider each of the ACID properties.
• Consistency: The consistency requirement here is that the sum of A and B be unchanged by the
execution of the transaction. This task may be facilitated by automatic testing of integrity
constraints.
• Atomicity: The basic idea behind ensuring atomicity is this: the database system keeps track (on
disk) of the old values of any data on which a transaction performs a write. This information is
written to a file called the log. If the transaction does not complete its execution, the database
system restores the old values from the log to make it appear as though the transaction never
executed.
Suppose that, just before the execution of transaction Ti, the values of accounts A and B are
$1000 and $2000, respectively. Suppose that the failure happened after the write(A) operation but
before the write(B) operation. In this case, the values of accounts A and B reflected in the database
are $950 and $2000.
The system destroyed $50 as a result of this failure. In particular, we note that the sum A + B
is no longer preserved.
Ensuring atomicity is the responsibility of the database system; specifically, it is handled by
a component of the database called the recovery system.
• Durability: The durability property guarantees that, once a transaction completes successfully, all
the updates that it carried out on the database persist, even if there is a system failure after the
transaction completes execution.
We can guarantee durability by ensuring that either:
1. The updates carried out by the transaction have been written to disk before the
transaction completes.
2. Information about the updates carried out by the transaction and written to disk is
sufficient to enable the database to reconstruct the updates when the database system
is restarted after the failure.
The recovery system of the database is responsible for ensuring durability.
• Isolation: If several transactions are executed concurrently, their operations may interleave in some
undesirable way, resulting in an inconsistent state.
For example, the database is temporarily inconsistent while the transaction to transfer funds
from A to B is executing, with the deducted total written to A and the increased total yet to be
written to B. If a second concurrently running transaction reads A and B at this intermediate point
and computes A+B, it will observe an inconsistent value. Furthermore, if this second transaction
then performs updates on A and B based on the inconsistent values that it read, the database may be
left in an inconsistent
state even after both transactions have completed.
A way to avoid the problem of concurrently executing transactions is to execute transactions
serially—that is, one after the other.
Ensuring the isolation property is the responsibility of a component of the database system
called the concurrency-control system.
States of Transaction
The state diagram corresponding to a transaction is shown in the figure below.
A transaction must be in one of the following states:
Active: the initial state, the transaction stays in this state while it is executing.
Partially committed: after the final statement has been executed.
Failed: when the normal execution can no longer proceed.
Aborted: after the transaction has been rolled back and the database has been restored to
its state prior to the start of the transaction.
Committed: after successful completion.
3.3 SCHEDULES:
Schedule is defined as a sequence of instructions that specify the chronological order in
which instructions of concurrent transactions are executed.
A schedule is serializable if it is equivalent to a serial schedule.
A schedule where the operations of each transaction are executed consecutively, without any interference from other transactions, is called a serial schedule.
Types of serializability are
1. Conflict Serializability
2. View Serializability
3.4 SERIALIZABILITY:
Conflict Serializability:
Instructions Ii and Ij, of transactions Ti and Tj respectively, conflict if and only if there exists
some item Q accessed by both Ii and Ij, and at least one of these instructions wrote Q.
1. Ii = read(Q), Ij = read(Q): Ii and Ij do not conflict.
2. Ii = read(Q), Ij = write(Q): they conflict.
3. Ii = write(Q), Ij = read(Q): they conflict.
4. Ii = write(Q), Ij = write(Q): they conflict.
If Ii and Ij are consecutive in a schedule and they do not conflict, their results would remain
the same even if they had been interchanged in the schedule.
Consider the following schedule 3.
The write (A) of T1 conflicts with read (A) of T2. However, write (A) of T2 does not conflict
with read (B) of T1, because, the two instructions access different data items.
Because of no conflict, we can swap write (A) and read (B) instructions to generate a new
schedule 5.
Regardless of the initial system state, schedules 3 and 5 generate the same result.
Swap the write (B) instruction of T1 with write (A) instruction of T2.
Swap the write (B) instruction of T1 with the read (A) instruction of T2 .
View Serializability:
A schedule S is view serializable if it is view equivalent to a serial schedule.
Let S and S′ be two schedules with the same set of transactions. S and S′ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S′, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S and that value was produced by transaction Tj (if any), then transaction Ti must, in schedule S′, also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S′.
Every conflict serializable schedule is also view serializable.
The following schedule is view-serializable but not conflict serializable
In the above schedule, transactions T4 and T5 perform write(Q) operations without having performed a read(Q) operation. Writes of this sort are called blind writes. A view serializable schedule with blind writes is not conflict serializable.
The precedence graph for schedule 1 contains a single edge T1→T2, since all the instructions of T1 are executed before the first instruction of T2 is executed.
The precedence graph for schedule 2 contains a single edge T2→T1,since all the instructions
of T2 are executed before the first instruction of T1 is executed.
Consider the following schedule 4.
To test conflict serializability construct a precedence graph for given schedule. If the graph
contains cycle, the schedule is not conflict serializable. If the graph contains no cycle, the schedule
is conflict serializable.
Schedule 1 and schedule 2 are conflict serializable as the precedence graph for both
schedules does not contain any cycle. While the schedule 4 is not conflict serializable as the
precedence graph for it contains cycle.
Solution:
The precedence graph for the given schedule is shown below.
Locks:
The two modes of locks are:
1. Shared: if a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then Ti can read, but cannot write, Q.
2. Exclusive: if a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item Q, then Ti can both read and write Q.
Concurrency Control is the management procedure that is required for
controlling concurrent execution of the operations that take place on a database.
o In a multi-user system, multiple users can access and use the same database at one time, which is known as concurrent execution of the database. It means that the same database is accessed simultaneously by different users on a multi-user system.
o While working with database transactions, the database often needs to be used by multiple users to perform different operations, in which case concurrent execution of the database takes place.
o The simultaneous execution should be done in an interleaved manner, and no operation should affect the other executing operations, so that the consistency of the database is maintained. However, concurrent execution of transaction operations gives rise to several challenging problems that need to be solved.
Problems occur when two different database transactions perform read/write operations on the same database items in an interleaved manner (i.e., concurrent execution), which can make the values of the items incorrect and hence leave the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are
performed on the same account A where the balance of account A is $300.
o At time t1, transaction TX reads the value of account A, i.e., $300 (only
read).
o At time t2, transaction TX deducts $50 from account A that becomes $250
(only deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that
will be $300 only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400
(only added but not updated/write).
o At time t6, transaction TX writes the value of account A that will be
updated as $250 only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it
will write as done at time t4 that will be $400. It means the value written
by TX is lost, i.e., $250 is lost.
The dirty read problem occurs when one transaction updates an item of the database, the transaction then fails, and before the data gets rolled back the updated database item is accessed by another transaction. This is the Read-Write Conflict between the two transactions.
For example:
Consider two transactions TX and TY in the below diagram performing
read/write operations on account A where the available balance in account
A is $300:
Thus, in order to maintain consistency in the database and avoid such problems that arise during concurrent execution, management is needed, and that is where the concept of Concurrency Control comes into play.
Lock-Based Protocol
o In this type of protocol, a transaction cannot read or write data until it acquires an appropriate lock on it. There are two types of lock:
1. Shared lock:
o It is also known as a read-only lock. Under a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions, because while a transaction holds a shared lock it cannot update the data item.
2. Exclusive lock:
o Under an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: multiple transactions cannot modify the same data item simultaneously.
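As a rough illustration in SQL (Oracle-style syntax; the account table is an assumption), a shared lock corresponds to a read-only lock on data, while an exclusive lock is taken before modifying it:

LOCK TABLE account IN SHARE MODE;   -- shared lock: concurrent readers allowed, writers blocked

SELECT balance FROM account
WHERE account_no = 'A'
FOR UPDATE;                         -- exclusive row lock: other transactions cannot update these rows until commit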
Simplistic lock protocol: this is the simplest way of locking data during a transaction. Simplistic lock-based protocols require every transaction to obtain a lock on the data before performing an insert, delete or update operation on it, and to unlock the data item after the transaction completes.
o Pre-claiming lock protocols evaluate the transaction to list all the data items on which it needs locks.
o Before initiating execution of the transaction, it requests the DBMS for locks on all those data items.
o If all the locks are granted, this protocol allows the transaction to begin. When the transaction is completed, it releases all the locks.
o If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.
Two-phase locking (2PL) divides a transaction's execution into two phases:
Growing phase: in the growing phase, new locks on data items may be acquired by the transaction, but none can be released.
Shrinking phase: in the shrinking phase, existing locks held by the transaction may be released, but no new locks can be acquired.
In the below example, if lock conversion is allowed then the following phase
can happen:
Example:
The following way shows how unlocking and locking work with 2-PL.
Transaction T1:
Transaction T2:
Wait-for Graph:
This is a simple method available to track if any deadlock situation may arise.
For each transaction entering in the system, a node is created.
When transaction Ti requests for a lock on item, say X, which is held by some other
transaction Tj, a directed edge is created from Ti to Tj. If Tj releases item X, the edge between them
is dropped and Ti locks the data item.
The system maintains this wait-for graph for every transaction waiting for some data items
held by others. System keeps checking if there's any cycle in the graph.
Two approaches can be used, first not to allow any request for an item, which is already
locked by some other transaction.
This is not always feasible and may cause starvation, where a transaction indefinitely waits
for data item and can never acquire it. Second option is to roll back one of the transactions.
It is not always reasonable to roll back the younger transaction, as it may be more important than the older one.
With the help of some relative algorithm, a transaction is chosen to be aborted; this transaction is called the victim, and the process is known as victim selection.
Phase 1
(1) Write a begin_commit record to the log file and force-write it to stable storage.
Send a PREPARE message to all participants.
Wait for participants to respond within a timeout period.
Phase 2
(2) If a participant returns an ABORT vote,
Write an abort record to the log file and force write it to stable storage.
Send a GLOBAL_ABORT message to all participants.
Wait for participants to acknowledge within a timeout period.
(3) If a participant returns a READY_COMMIT vote,
Write a commit record to the log file and force-write it to stable storage.
Send a GLOBAL_COMMIT message to all participants.
Wait for participants to acknowledge within a timeout period.
(4) Once all acknowledgements have been received,
Write an end_transaction message to the log file.
Repeatable Read – This is a highly restrictive isolation level. The transaction holds read locks on all rows it references and write locks on all rows it inserts, updates, or deletes. Since other transactions cannot read, update or delete these rows, it avoids non-repeatable reads.
Serializable – This is the highest isolation level. A serializable execution is defined to be an execution of operations in which concurrently executing transactions appear to be executing serially.
3.12 SQL FACILITIES FOR CONCURRENCY AND RECOVERY:
Crash Recovery
DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depends on its complex architecture and its
underlying hardware and system software. If it fails or crashes amid transactions, it is expected that
the system would follow some sort of algorithm or techniques to recover lost data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various categories, as
follows −
a) Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it
can’t go any further. This is called transaction failure where only a few transactions or processes
are hurt.
Reasons for a transaction failure could be −
Logical errors − Where a transaction cannot complete because it has some code error or
any internal error condition.
System errors − Where the database system itself terminates an active transaction because
the DBMS is not able to execute it, or it has to stop because of some system condition. For
example, in case of deadlock or resource unavailability, the system aborts an active
transaction.
System Crash
There are problems − external to the system − that may cause the system to stop abruptly
and cause the system to crash. For example, interruptions in power supply may cause the failure of
underlying hardware or software failure.
Examples may include operating system errors.
Disk Failure
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently. Disk failures include formation of bad sectors, unreachability
to the disk, disk head crash or any other failure, which destroys all or a part of disk storage.
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into
two categories −
Volatile storage − As the name suggests, a volatile storage cannot survive system crashes.
Volatile storage devices are placed very close to the CPU; normally they are embedded
onto the chipset itself. For example, main memory and cache memory are examples of
volatile storage. They are fast but can store only a small amount of information.
Non-volatile storage − These memories are made to survive system crashes. They are huge
in data storage capacity, but slower in accessibility. Examples may include hard-disks,
magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a
transaction. It is important that the logs are written prior to the actual modification and stored on a
stable storage media, which is failsafe.
Log-based recovery works as follows −
The log file is kept on a stable storage media.
When a transaction enters the system and starts execution, it writes a log about it.
<Tn, Start>
When the transaction modifies an item X, it write logs as follows −
<Tn, X, V1, V2>
It reads as: Tn has changed the value of X from V1 to V2.
When the transaction finishes, it logs −
<Tn, commit>
The database can be modified using two approaches −
Deferred database modification − All logs are written on to the stable storage and the
database is updated when a transaction commits.
Immediate database modification − Each log follows an actual database modification.
That is, the database is modified immediately after every operation.
Recovery with Concurrent Transactions
When more than one transaction are being executed in parallel, the logs are interleaved. At
the time of recovery, it would become hard for the recovery system to backtrack all logs, and then
start recovering. To ease this situation, most modern DBMS use the concept of 'checkpoints'.
1.Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the
memory space available in the system. As time passes, the log file may grow too big to be handled
at all. Checkpoint is a mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
2.Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the
following manner −
If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log record, it puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list and their previous logs are removed and then redone before saving
their logs.
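A small illustrative log (the transactions and values below are made up for illustration) shows how the two lists are built:

<T1, Start>
<T1, A, 1000, 950>
<T1, Commit>
<T2, Start>
<T2, B, 2000, 2100>
-- system crash occurs here --

On restart, T1 has both a Start and a Commit record, so it goes to the redo-list; T2 has a Start record but no Commit or Abort, so it goes to the undo-list and its change to B is undone.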
QUERY PROCESSING AND OPTIMIZATION IN DBMS
Query Processing is the activity performed in extracting data from the database. In query
processing, it takes various steps for fetching the data from the database. The steps involved are:
1. Parsing and translation
2. Optimization
3. Evaluation
Step-2:
Optimizer: During the optimization stage, the database must perform a hard parse at least once for every unique DML statement and performs optimization during this parse. The database never optimizes DDL unless it includes a DML component, such as a subquery, that requires optimization.
o σsalary>10000 (πEmp_Name(Employee))
o πEmp_Name(σsalary>10000 (Employee))
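Both relational algebra expressions above are equivalent translations of an SQL query of roughly this form (the Employee table, Emp_Name column and salary threshold are taken from the expressions themselves):

SELECT Emp_Name
FROM Employee
WHERE salary > 10000;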
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
Evaluation
For this, with addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and evaluating
each operation. Thus, after translating the user query, the system executes a query evaluation
plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation plan.
o A query evaluation plan defines a sequence of primitive operations used for evaluating a
query. The query evaluation plan is also referred to as the query execution plan.
o A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.
The cost of query evaluation can vary for different types of queries. Although the system is responsible for constructing the evaluation plan, the user does not need to write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan, which minimizes its cost. This type of task is performed by the database system and is known as Query Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
Example:
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX (SALARY) FROM
EMPLOYEE WHERE DNO=5);
The inner block
(SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO=5)
Translated in: ∏ MAX SALARY (σDNO=5(EMPLOYEE))
The Outer block
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
Translated in: ∏ LNAME, FNAME (σSALARY>C (EMPLOYEE))
(C represents the result returned from the inner block.)
The query optimizer would then choose an execution plan for each block.
The inner block needs to be evaluated only once. (Uncorrelated nested query).
It is much harder to optimize the more complex correlated nested queries.
External Sorting
It refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files. External sorting is needed, for example, for:
ORDER BY;
sort-merge algorithms for JOIN and other operations (UNION, INTERSECTION); and
duplicate elimination algorithms for the PROJECT operation (DISTINCT).
Typical external sorting algorithm uses a sort-merge strategy:
Sort phase: sort small sub-files of the main file (sorted sub-files are called runs).
Merge phase: Then merges the sorted runs. N-way merge uses N memory buffers to
buffer input runs, and 1 block to buffer output. Select the 1st record (in the sort order) among
input buffers, write it to the output buffer and delete it from the input buffer. If output buffer
full, write it to disk. If input buffer empty, read next block from the corresponding run. E.g. 2-way
Sort-Merge
Relational databases have long been the only choice, or at least the default choice, for data storage.
The value of relational databases lies in managing two areas of memory: (a) fast, small, volatile main memory and (b) larger, slower, non-volatile backing store.
The database allows more flexibility than a file system in storing large
amounts of data in a way that allows an application program to get
information quickly and easily.
2. Independent of Schema:
NoSQL works efficiently with schema-independent data, i.e. large volumes of heterogeneous data that require no schema for structuring.
3. Complex with free working:
NoSQL is easier to handle than SQL databases for storing data in a semi-structured or unstructured form that requires no tabular format or arrangement.
4. Flexible to accommodate:
NoSQL holds heterogeneous data that does not require any particular structure or format, so it is very flexible in terms of reliability and use.
1. Document Database:
A document database stores data in the form of documents. The data is grouped into specified files, which is useful for building application software.
The most important benefit of a document database is that it allows the user to store data in a particular format, i.e. the document format.
It is a hierarchical, semi-structured NoSQL format that allows efficient storage of data; for example, it works very well for storing user profiles. MongoDB is a good example of a document-oriented NoSQL database.
3. Column-oriented Database:
It stores the data in the form of columns, segregating the data into homogeneous categories.
Users can access the data very easily without retrieving unnecessary information.
Column-oriented databases work efficiently for data analytics in many social media networking sites.
This type of database can accommodate large volumes of data, and column-oriented databases are used for filtering data or information.
Apache HBase is an example of a column-oriented database.
4. Graph Database:
In a graph database we store the data as a graph and its related elements, such as nodes and edges.
Data points are placed so that nodes are easily related through edges, and thus a connection or network can easily be established.
Graph-based databases focus on the relationships between elements. They store the data in the form of nodes in the database, and the connections between the nodes are called links or relationships.
It can store large volumes of data.
High performance.
Open source.
SQL vs NoSQL:
- SQL is called an RDBMS or Relational Database; NoSQL is called a Non-Relational or Distributed Database.
- SQL databases are table-based; NoSQL databases can be document-based, key-value pairs, or graph databases.
- SQL offers vertical scalability; NoSQL offers horizontal scalability.
- SQL uses a fixed or predefined schema; NoSQL uses a flexible schema.
- SQL is not suitable for hierarchical data storage; NoSQL is suitable for hierarchical data storage.
3) Differentiate between the NoSQL and SQL.
For instance:
• Row Database: “Customer 1: Name, Address, Location". (The fields for each
new record are stored in a long row).
• Columnar Database: "Customer 1: Name, Address, Location". (Each field has its own set of columns). Refer to Table 2 for a relational database example.
Column databases: Disadvantages
While there are many benefits to adopting column-oriented databases, there are
also a few drawbacks to keep in mind.
Before we conclude, we should note that column-store databases are not always
NoSQL-only. It is frequently argued that column-store belongs firmly in the
NoSQL camp because it differs so much from relational database approaches.
The debate between NoSQL and SQL is generally quite nuanced, therefore this
is not usually the case. They are essentially the same as SQL techniques when it
comes to column-store databases. For instance, keyspaces function as schema,
so schema management is still necessary. A NoSQL data store's keyspace
contains all column families. The concept is comparable to relational database
management systems' schema. There is typically only one keyspace per
program. Another illustration is the fact that the metadata occasionally
resembles a conventional relational DBMS perfectly. Ironically, column-store
databases frequently adhere to ACID and SQL standards. However, NoSQL
databases are often either document-store or key-store, neither of which are
column-store. Therefore, it is difficult to claim that column-store is a pure
NoSQL system.
Figure 3. Example-Five friends sharing Social network.
Delivery of personalized ads to users based on their data profiles
Cache data for infrequently updated data
There are numerous other circumstances where key-value works nicely. For
instance, because of its scalability, it frequently finds usage in big data research.
Similar to how it works for web applications, key-value is effective for
organizing player sessions in MMOG (massively multiplayer online game) and
other online games.
(a) Document representation:
{
"BookID": "978-1449396091",
"Title": "DBMS",
"Author": "Raghu Ramakrishnan",
"Year": "2022"
}
(b) Key-value representation:
BookID : 978-1449396091
Title : DBMS
Author : Raghu Ramakrishnan
Year : 2022
Figure 8: Example of Document and Key-value database
When to use a document database?
When your application requires data that is not structured in a table
format.
When your application requires a large number of modest continuous
reads and writes and all you require is quick in-memory access.
When your application requires CRUD (Create, Read, Update, Delete)
functionality.
These are often adaptable and perform well when your application has to
run across a broad range of access patterns and data kinds.
How does a Document Database Work?
It appears that document databases work under the assumption that any kind of
information can be stored in a document. This suggests that you shouldn't have
to worry about the database being unable to interpret any combination of data
types. Naturally, in practice, most document databases continue to use some sort
of schema with a predetermined structure and file format.
Document stores do not have the same foibles and limitations as SQL databases,
which are both tubular and relational. This implies that using the information at
hand is significantly simpler and running queries may also be much simpler.
Ironically, you can execute the same types of operations in a document storage
that you can in a SQL database, including removing, adding, and querying.
Each document requires a key of some kind, as was previously mentioned, and this key is given to it through a unique ID. When that unique ID is used, the document is processed directly instead of the data being obtained column by column.
Document databases often have a lower level of security than SQL databases.
As a result, you really need to think about database security, and utilizing Static Application Security Testing (SAST) is one approach to do so. SAST examines the source code directly to hunt for flaws. Another option is to use DAST, a
dynamic version that can aid in preventing NoSQL injections.
Besides Cassandra, other NoSQL databases, such as Apache HBase and MongoDB, are also quite popular.
Cassandra's distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Cassandra is being used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix.
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below
are some of the features of Cassandra:
● Easy data distribution: Cassandra provides the flexibility to distribute data where
you need by replicating data across multiple datacenters.
History of Cassandra
Cassandra was developed at Facebook for inbox search.
It was open-sourced by Facebook in July 2008.
Cassandra was accepted into Apache Incubator in March 2009.
It was made an Apache top-level project since February 2010.
2. ARCHITECTURE
The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster.
All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
Each node in a cluster can accept read and write requests, regardless of where the
data is actually located in the cluster.
When a node goes down, read/write requests can be served from other nodes in the
network.
The following figure shows a schematic view of how Cassandra uses data replication among
the nodes in a cluster to ensure no single point of failure.
Note: Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.
Components of Cassandra
The key components of Cassandra are as follows:
SSTable: It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
Bloom filter: These are nothing but quick, nondeterministic, algorithms for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters
are accessed after every query.
Clients approach any of the nodes for their read-write operations. That node (coordinator)
plays a proxy between the client and the nodes holding the data.
Write Operations
Every write activity of nodes is captured by the commit logs written in the nodes. Later the
data will be captured and stored in the mem-table. Whenever the mem-table is full, data will
be written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding
unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that holds the required data.
3. DATA MODEL
The data model of Cassandra is significantly different from what we normally see in an RDBMS.
This chapter provides an overview of how Cassandra stores its data.
Cluster
Cassandra database is distributed over several machines that operate together. The
outermost container is known as the Cluster. For failure handling, every node contains a
replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a
cluster, in a ring format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are:
● Replication factor: It is the number of machines in the cluster that will receive copies
of the same data.
● Column families: Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each keyspace
has at least one and often many column families.
Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an
ordered collection of columns. The following points differentiate a column family from a table
of relational databases:
● Relational table: defines only columns, and the user fills in the table with values.
● Cassandra column family: a table contains columns, or can be defined as a super column family.
Note: Unlike relational tables, a column family's schema is not fixed, and Cassandra does not
force individual rows to have all the columns.
Column
A column is the basic data structure of Cassandra with three values, namely the key (column
name), the value, and a timestamp. The structure of a column can be written as the triple
(column name, value, timestamp).
ClickHouse
1. Introduction to ClickHouse
1. ClickHouse is an open-source column-oriented database management system.
2. Originally developed at Yandex for its web-analytics platform and open-sourced in 2016.
3. Specially optimized for OLAP (Online Analytical Processing) workloads.
4. Columnar storage → reads only needed data → faster queries.
5. Designed for analytical queries (summaries, reports, trends).
6. Handles petabyte-scale data efficiently.
2. Features of ClickHouse
• Columnar Storage: Stores data by columns instead of rows.
• Fast Query Performance: Processes billions of rows per second.
• Data Compression: Reduces storage requirements.
• Fault Tolerant: High availability using replication.
• SQL Compatible: Supports a dialect of SQL for queries.
• Scalable Architecture: Works on single nodes or across distributed clusters.
• Real-Time Inserts: Supports near real-time data ingestion.
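As a minimal sketch of how these features look in practice (the table and column names are illustrative, not from the source; MergeTree is ClickHouse's standard columnar table engine):

-- Columnar MergeTree table for page-view events
CREATE TABLE page_views
(
    event_date Date,
    url        String,
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY (event_date, url);

-- Analytical query: reads only the columns it needs
SELECT url, count() AS views
FROM page_views
WHERE event_date >= '2025-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;

Because the query touches only the event_date and url columns, ClickHouse reads just those column files from disk, which is what makes aggregations like this fast on very large tables.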
Limitations:
1. Not designed for transactional systems (no ACID support like traditional DBMS).
2. No strong relational integrity (no foreign keys, no joins across clusters easily).
3. Requires tuning for best performance at scale.
Limitations of NoSQL Databases:
• Standardization - The design and query languages of NoSQL databases vary widely between
different NoSQL products – much more widely than they do among traditional SQL databases.
• Backup of Database - Though some NoSQL databases such as MongoDB provide backup tools,
these tools are not yet mature enough to guarantee a complete backup and recovery solution.
• Security - NoSQL databases are generally subject to a fairly long list of security issues,
which is a limiting factor for NoSQL deployment.
• Consistency - NoSQL databases put scalability and performance first and give less weight to
consistency of the data, which makes them weaker in this respect than relational databases.
For example, if you insert the same set of data again, many NoSQL databases will accept it
without raising any error, whereas a relational database can enforce constraints that prevent
duplicate rows from entering the database.
BASICS OF MONGODB:
• MongoDB is an open source, document-oriented database designed with both
scalability and developer agility in mind.
• Instead of storing your data in tables and rows as you would with a relational database, in
MongoDB you store JSON-like documents with dynamic schemas (schema-free, schemaless).
{
  "_id" : ObjectId("5114e0bd42…"),
  "FirstName" : "John",
  "LastName" : "Doe",
  "Age" : 39,
  "Interests" : [ "Reading", "Mountain Biking" ],
  "Favorites" : {
    "color" : "Blue",
    "sport" : "Soccer"
  }
}
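Such a document can be queried on any field, including nested ones. A minimal mongo-shell sketch, assuming the document lives in a collection named people (a name chosen here for illustration):

// Match on a field inside the embedded Favorites document
db.people.find( { "Favorites.color": "Blue" } )

// Match documents whose Interests array contains "Reading"
db.people.find( { Interests: "Reading" } )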
Introduction:
• Developed by 10gen
• Founded in 2007
• A document-oriented, NoSQL database
• Hash-based, schema-less database
• No Data Definition Language
• It stores hashes with any keys and values
• Keys are a basic data type but are in reality stored as strings
• A document identifier (_id) is created for each document; this field name is reserved by the
system
• Application tracks the schema and mapping
• Uses BSON format - Based on JSON – B stands for Binary
• Written in C++
• Supports APIs (drivers) in many computer languages
• JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
MongoDB is a cross-platform, document-oriented database that provides high performance, high
availability, and easy scalability. MongoDB works on the concepts of collections and documents.
Database
Database is a physical container for collections. Each database gets its own set of files on the
file system. A single MongoDB server typically has multiple databases.
Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A
collection exists within a single database. Collections do not enforce a schema. Documents
within a collection can have different fields. Typically, all documents in a collection are of
similar or related purpose.
Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema
means that documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of data.
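For example, both of the following inserts are valid in the same collection even though the documents have different fields (a sketch; the collection name students is assumed here for illustration):

// Two documents with different structures stored side by side
db.students.insert( { name: "Asha", rollno: 1, dept: "Computer" } )
db.students.insert( { name: "Ravi", age: 22, hobbies: [ "cricket", "music" ] } )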
Why MongoDB?
• Simple queries
• Functionality provided is applicable to most web applications
• Easy and fast integration of data
• No ERD diagram
• Not well suited for heavy and complex transactional systems
• Document Oriented Storage: Data is stored in the form of JSON style documents.
• Index on any attribute (see the sketch after this list)
• Replication and high availability
• Auto-sharding
• Rich queries
• Fast in-place updates
• Professional support by MongoDB
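A short sketch of indexing on an arbitrary attribute; the parts collection and quantity field are illustrative names reused from the CRUD examples below, and createIndex is the standard shell helper:

// Ascending index on the quantity field
db.parts.createIndex( { quantity: 1 } )

// Queries that filter on quantity can now use the index
db.parts.find( { quantity: { $gt: 10 } } )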
CRUD Operations (a worked mongo-shell example appears after this list):
1. Create
a. db.collection.insert( <document> )
• Omit the _id field to have MongoDB generate a unique key
• Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
• Example: db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
b. db.collection.save( <document> )
c. db.collection.update( <query>, <update>, { upsert: true } )
2. Read
a. db.collection.find( <query>, <projection> )
b. db.collection.findOne( <query>, <projection> )
• Provides functionality similar to the SELECT command
• <query>: the selection condition (like a WHERE clause); <projection>: the fields returned in the result set
• Example: var PartsCursor = db.parts.find( { type: "hammer" } ).limit(5)
• Has cursors to handle a result set
• Can modify the query to impose limits, skips, and sort orders.
• Can specify to return the ‘top’ number of records from the result set
3. Update
a. db.collection_name.update( <query>, <update>, { upsert: true } )
• Will update one or more records in a collection satisfying the query
b. db.collection_name.save( <document> )
• Updates an existing record or creates a new record
4. Delete
a. db.collection.remove( <query>, <justOne> )
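Putting the four operations together, a minimal mongo-shell walkthrough might look like the following (the parts collection and its fields are carried over from the examples above):

// Create: insert documents (MongoDB generates _id when it is omitted)
db.parts.insert( { type: "screwdriver", quantity: 15 } )
db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )

// Read: find with a query and a projection, then limit the cursor
var partsCursor = db.parts.find( { type: "hammer" }, { type: 1, quantity: 1 } ).limit(5)
partsCursor.forEach( function(doc) { print(tojson(doc)) } )

// Update: increase quantity for matching documents, inserting if none match
db.parts.update( { type: "hammer" }, { $inc: { quantity: 5 } }, { upsert: true } )

// Delete: remove only the first matching document
db.parts.remove( { type: "screwdriver" }, true )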
Replication:
MongoDB provides replication through replica sets: a primary node receives the writes and one or
more secondary nodes keep copies of the data, so that a secondary can take over if the primary fails.
Software Required and Steps
• Eclipse
• JDK 1.6
• MongoDB
• MongoDB-Java-Driver
In Eclipse perform following steps:
1. File - New – Java Project –Give Project Name – ok
2. In project Explorer window- right click on project name-
new- class- give Class name- ok
3. In project Explorer window- right click on project name-
Build path- Configure build path- Libraries- Add External
Jar - MongoDB-Java-Driver
4. Start Mongo server before running the program
Import packages, create connection, database and collection
• Import packages
import com.mongodb.*;
• Create connection
MongoClient mongo = new MongoClient( "localhost" , 27017 );
• Create Database
DB db = mongo.getDB("database name");
• Create Collection
DBCollection coll = db.getCollection("Collection Name");
Insert Document
DBObject d1 = new BasicDBObject("name", "Monika").append("age", 21); // example documents (assumed fields)
DBObject d2 = new BasicDBObject("name", "Rahul").append("age", 23);
coll.insert(d1);
coll.insert(d2);
Display Documents
DBCursor cursor = coll.find();
while (cursor.hasNext()) { System.out.println(cursor.next()); }
Remove Document
DBObject searchQuery = new BasicDBObject("name", "Monika");
coll.remove(searchQuery);
Program
import com.mongodb.*;
public class conmongo {
    public static void main(String[] args) {
        try {
            // Connect to the local MongoDB server (database and collection names are examples)
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("studentdb");
            DBCollection coll = db.getCollection("students");
            // Insert one document and print everything in the collection
            coll.insert(new BasicDBObject("name", "Monika").append("age", 21));
            DBCursor cursor = coll.find();
            while (cursor.hasNext()) {
                System.out.println(cursor.next());
            }
            mongo.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
MongoDB Testing
1. Introduction
1. MongoDB is a popular NoSQL database that stores data in JSON-like documents.
2. Testing MongoDB is crucial to ensure the database behaves correctly. Types of testing include:
• Backup & Recovery Testing: Validate data backup and restoration processes.
Tools:
Metabase with MongoDB:
• Schema Detection: MongoDB is schema-less; Metabase tries to guess fields.
• Simple Queries: Best suited for basic queries (filters, groups, counts).
• SQL Support: MongoDB uses aggregation syntax, not real SQL; Metabase uses its own visual query builder.
For example, an order document stored in MongoDB might look like this:
{
"order_id": "12345",
"product": "Phone",
"price": 25000,
"order_date": "2025-04-01T10:00:00Z"
}
In Metabase: