ADBMS
A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data are stored and maintained.

Developers hide the complexity from users through several levels of abstraction, to simplify users' interactions with the system:

Physical level. The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.

Logical level. The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data.

View level. The highest level of abstraction describes only part of the entire database, simplifying users' interaction with the system.
Database systems have several schemas, partitioned according to the levels of abstraction. The physical schema describes the database design at the physical level, while the logical schema describes the database design at the logical level. A database may also have several schemas at the view level, sometimes called subschemas, that describe different views of the database.

Data Models

Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. A data model provides a way to describe the design of a database at the physical, logical, and view levels. The data models can be classified into four different categories:
1. Relational Model. The relational model uses a collection of tables to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name. Tables are also known as relations. The relational model is an example of a record-based model.

2. Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and relationships among these objects. An entity is a "thing" or "object" in the real world that is distinguishable from other objects. The entity-relationship model is widely used in database design.
3. Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has become the dominant software-development methodology. This led to the development of an object-oriented data model that can be seen as extending the E-R model with notions of encapsulation, methods (functions), and object identity. The object-relational data model combines features of the object-oriented data model and the relational data model.

4. Semistructured Data Model. The semistructured data model permits the specification of data where individual data items of the same type may have different sets of attributes. This is in contrast to the data models mentioned earlier, where every data item of a particular type must have the same set of attributes. The Extensible Markup Language (XML) is widely used to represent semistructured data.
Database Architecture
A primary goal of a database system is to retrieve information from and store new information into the database. People who work with a database can be categorized as database users or database administrators.

There are four different types of database-system users, differentiated by the way they expect to interact with the system. Different types of user interfaces have been designed for the different types of users.

1. Naïve users are unsophisticated users who interact with the system by invoking one of the application programs that have been written previously; for example, a clerk in the university who needs to add a new record invokes an application written for that purpose. The typical user interface for naïve users is a forms interface, where the user can fill in appropriate fields of the form.
2. Application programmers are computer professionals who write application programs. Application programmers can choose from many tools to develop user interfaces.

3. Sophisticated users interact with the system without writing programs. Instead, they form their requests either using a database query language or by using tools such as data analysis software.

4. Specialized users are sophisticated users who write specialized database applications that do not fit into the traditional data-processing framework. Among these applications are computer-aided design systems, knowledge-base and expert systems, systems that store data with complex data types (for example, graphics data and audio data), and environment-modeling systems.
Database Administrator
One of the main reasons for using DBMSs is to have central control of both the data and the programs that access those data. A person who has such central control over the system is called a database administrator (DBA). The functions of a DBA include:

Schema definition. The DBA creates the original database schema by executing a set of data definition statements in the DDL.

Storage structure and access-method definition.

Schema and physical-organization modification. The DBA carries out changes to the schema and physical organization to reflect the changing needs of the organization, or to alter the physical organization to improve performance.
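To make the DDL step concrete, here is a minimal sketch using Python's built-in sqlite3 module; the engine choice is illustrative (any SQL DBMS would do), and the table follows the department(dept_name, building, budget) schema used later in these notes.

import sqlite3

# Schema definition: the DBA (or a setup script) executes DDL statements.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE department (
        dept_name TEXT PRIMARY KEY,
        building  TEXT,
        budget    NUMERIC
    )
""")
conn.commit()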
Granting of authorization for data access. By granting different types of authorization, the database administrator can regulate which parts of the database various users can access. The authorization information is kept in a special system structure that the database system consults whenever someone attempts to access the data in the system.

Routine maintenance. Examples of the database administrator's routine maintenance activities are:
Periodically backing up the database, either onto tapes or onto remote servers, to prevent loss of data in case of disasters such as flooding.
Ensuring that enough free disk space is available for normal operations, and upgrading disk space as required.
Monitoring jobs running on the database and ensuring that performance is not degraded by very expensive tasks submitted by some users.
The Entity-Relationship Model

The entity-relationship (E-R) data model was developed to facilitate database design. The E-R data model employs three basic concepts: entity sets, relationship sets, and attributes.

An entity is a "thing" or "object" in the real world that is distinguishable from all other objects. For example, each person in a university is an entity. An entity has a set of properties, and the values for some set of properties may uniquely identify an entity. For instance, a person may have a person_id property whose value uniquely identifies that person.
An entity set is a set of entities of the same type that share the same properties, or attributes. The set of all people who are instructors at a given university, for example, can be defined as the entity set instructor. Similarly, the entity set student might represent the set of all students in the university.

Entity sets do not need to be disjoint. For example, it is possible to define the entity set of all people in a university (person). A person entity may be an instructor entity, a student entity, both, or neither.
An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set. Possible attributes of the instructor entity set are ID, name, dept_name, and salary. Possible attributes of the course entity set are course_id, title, dept_name, and credits. Each entity has a value for each of its attributes.

A relationship is an association among several entities. For example, we can define a relationship advisor that associates instructor James with student Shankar. This relationship specifies that James is an advisor to student Shankar.
A relationship set is a set of relationships of the same type. Formally, it is a mathematical relation on n ≥ 2 (possibly non-distinct) entity sets. If E1, E2, ..., En are entity sets, then a relationship set R is a subset of

{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship.

The association between entity sets is referred to as participation; that is, the entity sets E1, E2, ..., En participate in relationship set R. The function that an entity plays in a relationship is called that entity's role.
Attributes
A relationship may also have attributes, called descriptive attributes. Consider a relationship set advisor with entity sets instructor and student. We could associate the attribute date with that relationship to specify the date when an instructor became the advisor of a student.

For each attribute, there is a set of permitted values, called the domain, or value set, of that attribute. The domain of attribute course_id might be the set of all text strings of a certain length.

An attribute, as used in the E-R model, can be characterized by the following attribute types:
Simple and composite attributes.
Single-valued and multivalued attributes.
Derived attributes.
Simple and composite attributes. Simple attributes have not been divided into subparts. Composite attributes can be divided into subparts (that is, other attributes). For example, an attribute name could be structured as a composite attribute consisting of first_name, middle_initial, and last_name.

Single-valued and multivalued attributes. A single-valued attribute has a single value for a particular entity. For instance, the student_ID attribute for a specific student entity refers to only one student_ID. A multivalued attribute has a set of values for a specific entity; for example, a phone_number attribute.

Derived attributes. The value of this type of attribute can be derived from the values of other related attributes or entities; for example, age.

An E-R enterprise schema may define certain constraints to which the contents of a database must conform.
Mapping Cardinalities
For a binary relationship set between entity sets A and B, the mapping cardinality expresses the number of entities to which another entity can be associated. Two common cases:

One-to-many. An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.

Many-to-one. An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.
Participation Constraints

The participation of an entity set E in a relationship set R is total if every entity in E participates in at least one relationship in R; if only some entities in E participate, the participation is partial.

Keys

The values of the attributes of an entity must be such that they can uniquely identify the entity.

E-R Diagrams

An E-R diagram can express the overall logical structure of a database graphically.
Basic Structure
An E-R diagram consists of the following major components:
Rectangles divided into two parts represent entity sets. The first part contains the name of the entity set. The second part contains the names of all the attributes of the entity set.
Diamonds represent relationship sets.
Undivided rectangles represent the attributes of a relationship set. Attributes that are part of the primary key are underlined.
Lines link entity sets to relationship sets.
Dashed lines link attributes of a relationship set to the relationship set.
Double lines indicate total participation of an entity in a relationship set.
Double diamonds represent identifying relationship sets linked to weak entity sets.
Weak Entity Sets
An entity set that does not have sufficient attributes to form a primary key is termed a weak entity set. An entity set that has a primary key is termed a strong entity set.

For a weak entity set to be meaningful, it must be associated with another entity set, called the identifying or owner entity set. Every weak entity must be associated with an identifying entity; that is, the weak entity set is said to be existence dependent on the identifying entity set. The identifying entity set is said to own the weak entity set that it identifies. The relationship associating the weak entity set with the identifying entity set is called the identifying relationship.
The identifying relationship is many-to-one from the weak entity set to the identifying entity set, and the participation of the weak entity set in the relationship is total. The identifying relationship set should not have any descriptive attributes, since any such attributes can instead be associated with the weak entity set.

The discriminator of a weak entity set is a set of attributes that distinguishes among all those entities in the weak entity set that depend on one particular strong entity. The primary key of a weak entity set is formed by the primary key of the identifying entity set, plus the weak entity set's discriminator.

In E-R diagrams, a weak entity set is depicted via a rectangle, like a strong entity set, but there are two main differences:
The discriminator of a weak entity set is underlined with a dashed, rather than a solid, line.
The relationship set connecting the weak entity set to the identifying strong entity set is depicted by a double diamond.
Problem

A company database needs to store information about employees (identified by ssn, with salary and phone as attributes), departments (identified by dno, with dname and budget as attributes), and children of employees (with name and age as attributes). Employees work in departments; each department is managed by an employee; a child must be identified uniquely by name when the parent (who is an employee; assume that only one parent works for the company) is known. We are not interested in information about a child once the parent leaves the company.

Draw an E-R diagram that captures this information.
Structure of Relational Databases

A relational database consists of a collection of tables, each of which is assigned a unique name. For example, consider the instructor table, which stores information about instructors. The table has four column headers: ID, name, dept_name, and salary. Each row of this table records information about an instructor.

In the relational model the term relation is used to refer to a table, while the term tuple is used to refer to a row. Similarly, the term attribute refers to a column of a table. We use the term relation instance to refer to a specific instance of a relation, i.e., one containing a specific set of rows.

For each attribute of a relation, there is a set of permitted values, called the domain of that attribute. The domain of the name attribute is the set of all possible instructor names.
Database Schema
We require that, for all relations r, the domains of all attributes of r be atomic. A domain is atomic if elements of the domain are considered to be indivisible units.

We distinguish the database schema, which is the logical design of the database, from the database instance, which is a snapshot of the data in the database at a given instant in time. In general, a relation schema consists of a list of attributes and their corresponding domains; for example:

department(dept_name, building, budget)
A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely a tuple in the relation. A superkey for which no proper subset is also a superkey is minimal; such minimal superkeys are called candidate keys. The term primary key denotes a candidate key that is chosen by the database designer as the principal means of identifying tuples within a relation.

A relation, say r1, may include among its attributes the primary key of another relation, say r2. This attribute is called a foreign key from r1, referencing r2.

A query language is a language in which a user requests information from the database. These languages are usually on a level higher than that of a standard programming language. Query languages can be categorized as either procedural or nonprocedural. In a procedural language, the user instructs the system to perform a sequence of operations on the database to compute the desired result. In a nonprocedural language, the user describes the desired information without giving a specific procedure for obtaining that information.
The Relational Algebra

The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result.

Fundamental Operations

The fundamental operations in the relational algebra are select, project, union, set difference, Cartesian product, and rename. In addition to the fundamental operations, there are several other operations, namely set intersection, natural join, and assignment.

The select, project, and rename operations are called unary operations, because they operate on one relation. The other three operations operate on pairs of relations and are therefore called binary operations.
The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter sigma (σ) to denote selection. The predicate appears as a subscript to σ, and the argument relation is given in parentheses after the σ. Thus, to select those tuples of the instructor relation where the instructor is in the "Physics" department, we write:

σ deptname="Physics" (instructor)

We can find all instructors with salary greater than $90,000 by writing:

σ salary>90000 (instructor)

We allow comparisons using =, ≠, <, ≤, >, and ≥ in the selection predicate. Furthermore, we can combine several predicates into a larger predicate by using the connectives and (∧), or (∨), and not (¬). Thus, to find the instructors in Physics with a salary greater than $90,000, we write:

σ deptname="Physics" ∧ salary>90000 (instructor)
The Project Operation

The project operation, denoted by the uppercase Greek letter pi (Π), returns its argument relation with certain attributes left out. To find the set of all courses taught in the Fall 2009 semester, we write:

Π courseid (σ semester="Fall" ∧ year=2009 (section))

To find the set of all courses taught in the Spring 2010 semester, we write:

Π courseid (σ semester="Spring" ∧ year=2010 (section))

The union operation, denoted by ∪, combines the tuples of two relations. The courses taught in the Fall 2009 semester, the Spring 2010 semester, or both:

Π courseid (σ semester="Fall" ∧ year=2009 (section)) ∪ Π courseid (σ semester="Spring" ∧ year=2010 (section))

The set-difference operation, denoted by −, allows us to find tuples that are in one relation but are not in another. The expression r − s produces a relation containing those tuples in r but not in s. The courses offered in the Fall 2009 semester but not in the Spring 2010 semester:

Π courseid (σ semester="Fall" ∧ year=2009 (section)) − Π courseid (σ semester="Spring" ∧ year=2010 (section))
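The set-based semantics of these operators is easy to see in code. The sketch below models the section relation as a Python set of (course_id, semester, year) tuples (the sample rows are invented for illustration) and implements select and project as plain functions; union and set difference then come free with Python's set operators.

# sigma: keep the tuples that satisfy a predicate
def select(relation, predicate):
    return {t for t in relation if predicate(t)}

# pi: keep only the named positions; duplicates vanish because results are sets
def project(relation, *indexes):
    return {tuple(t[i] for i in indexes) for t in relation}

section = {
    ("CS-101", "Fall", 2009),
    ("CS-347", "Fall", 2009),
    ("CS-101", "Spring", 2010),
}

fall = project(select(section, lambda t: t[1] == "Fall" and t[2] == 2009), 0)
spring = project(select(section, lambda t: t[1] == "Spring" and t[2] == 2010), 0)

print(fall | spring)   # union: courses taught in Fall 2009, Spring 2010, or both
print(fall - spring)   # difference: taught in Fall 2009 but not in Spring 2010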
The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two relations. We write the Cartesian product of relations r1 and r2 as r1 × r2.

The rename operator, denoted by the lowercase Greek letter rho (ρ), gives a name to the result of a relational-algebra expression: ρ x (E) returns the result of expression E under the name x.
Additional Relational-Algebra Operations

The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one operation. It is denoted by the join symbol ⋈. For example, the query "Find the names of all instructors together with the course_id of all courses they taught" can be written, assuming a teaches relation linking instructors to the courses they taught, as Π name, courseid (instructor ⋈ teaches).

It is convenient at times to write a relational-algebra expression by assigning parts of it to temporary relation variables. The assignment operation, denoted by ←, works like assignment in a programming language.

Notice that data may be lost when applying a join to two relations: tuples without a match in the other relation do not appear in the result. In some cases this lost data might hold useful information. An outer join retains the information that would have been lost from the tables, replacing missing data with nulls. There are three forms of the outer join, depending on which data is to be kept:

LEFT OUTER JOIN - keep data from the left-hand table
RIGHT OUTER JOIN - keep data from the right-hand table
FULL OUTER JOIN - keep data from both tables

Exercise

1. Write a relational algebra expression that returns the food items required to cook the recipe "Pasta and Meat-balls". For each such food item, return the item paired with the number of ounces required by the recipe.
Module II
Data redundancy: In an unnormalized table design, some information may be stored repeatedly.
A functional dependency (FD) is a relationship between two attributes, typically between the PK and other non-key attributes within a table. For any relation R, attribute Y is functionally dependent on attribute X (usually the PK) if, for every valid instance of X, that value of X uniquely determines the value of Y. This relationship is written:

X → Y

The left side of an FD is called the determinant, and the right side is the dependent. For example:

SIN → Name, Address, Birthdate
SIN determines Name, Address, and Birthdate.

SIN, Course → DateCompleted
SIN and Course together determine the date completed (DateCompleted). This must also work for a composite PK.
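The definition suggests a direct test: X → Y holds in a relation instance exactly when no two tuples agree on X but disagree on Y. A small sketch (the table contents are invented for illustration):

def fd_holds(rows, lhs, rhs):
    # Group tuples by their X-values and require each group to agree on Y.
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False   # same determinant, different dependent
    return True

employees = [
    {"SIN": 1, "Name": "Ann", "Course": "DB", "DateCompleted": "2024-01"},
    {"SIN": 1, "Name": "Ann", "Course": "OS", "DateCompleted": "2024-05"},
]
print(fd_holds(employees, ["SIN"], ["Name"]))                     # True
print(fd_holds(employees, ["SIN", "Course"], ["DateCompleted"]))  # True

Note this only checks one instance; a functional dependency is a property of the schema, asserted for all valid instances.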
A → B is a non-trivial functional dependency if B is not a subset of A. When A ∩ B is empty, A → B is called completely non-trivial; for example, ID → Name.

Armstrong's axioms are a set of inference rules used to infer all the functional dependencies on a relational database. They were developed by William W. Armstrong.

Axiom of reflexivity: if Y is a subset of X, then X determines Y (X → Y).
Axiom of augmentation: if X → Y holds, then XZ → YZ holds for any set of attributes Z.
Axiom of transitivity: if X → Y and Y → Z hold, then X → Z holds.
Attribute Closure:

The attribute closure of an attribute set can be defined as the set of attributes which can be functionally determined from it. If the attribute closure of an attribute set contains all attributes of the relation, the attribute set is a superkey of the relation.

To find the attribute closure of an attribute set (see the sketch after the questions below):
Add the elements of the attribute set to the result set.
Recursively add to the result set the elements which can be functionally determined from the elements of the result set.

Question 1: Given relational schema R(P, Q, R, S, T) and the set of functional dependencies FD = { P→QR, RS→T, Q→S, T→P }, determine the closure (T)+.
FD = { P→QR, RS→T, Q→S, T→P }
(T)+ = { T, P, Q, R, S }

Question 2: Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set of functional dependencies { {E,F} → {G}, {F} → {I,J}, {E,H} → {K,L}, {K} → {M}, {L} → {N} } on R. What is the key for R?
A. {E, F}
B. {E, F, H}
C. {E}
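The closure procedure described above translates almost line for line into code. This sketch represents each FD as a (lhs, rhs) pair of attribute strings and answers both questions:

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is determined, add the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds1 = [("P", "QR"), ("RS", "T"), ("Q", "S"), ("T", "P")]
print(sorted(closure("T", fds1)))    # ['P','Q','R','S','T']: (T)+ covers R, so T is a key

fds2 = [("EF", "G"), ("F", "IJ"), ("EH", "KL"), ("K", "M"), ("L", "N")]
print(sorted(closure("EFH", fds2)))  # all ten attributes: {E,F,H} is the key (option B)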
Transactions

A transaction is a logical unit of work that must be either entirely completed or aborted. Any transaction that reads from and/or writes to a database may consist of:
A simple SELECT statement to generate a list of table contents;
A series of related UPDATE statements to change the values of attributes in various tables;
A series of INSERT statements to add rows to one or more tables;
A combination of SELECT, UPDATE, and INSERT statements.

A successful transaction changes the database from one consistent state to another, a consistent state being one in which all data integrity constraints are satisfied. Most real-world database transactions are formed by two or more database requests, a database request being the equivalent of a single SQL statement in an application program or transaction.
Evaluating Transaction Results

Not all transactions update the database. SQL code represents a transaction because the database was accessed. Improper or incomplete transactions can have a devastating effect on database integrity.

Transaction Properties (ACID properties)

Atomicity
Requires that all operations (SQL requests) of a transaction be completed; if not, the transaction is aborted. A transaction is treated as a single, indivisible, logical unit of work. This "all-or-none" property is referred to as atomicity.
Some DBMSs provide means by which a user can define enforceable constraints based on business rules. Other integrity rules are enforced automatically by the DBMS when table structures are properly defined, thereby letting the DBMS validate some transactions.

Consistency
The consistency property ensures that the database must remain in a consistent state before the start of the transaction and after the transaction is over. Consistency states that only valid data will be written to the database. If for some reason a transaction is executed that violates the database consistency rules, the entire transaction will be rolled back.

Isolation
Isolation requires that the data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
Durability
Durability ensures that once transaction changes are committed, they cannot be undone or lost, even in the event of a system failure.

A transaction ends in one of four ways:
1. A COMMIT statement is reached: all changes are permanently recorded within the database.
2. A ROLLBACK is reached: all changes are aborted and the database is restored to a previous consistent state.
3. The end of the program is successfully reached: equivalent to a COMMIT.
4. The program abnormally terminates and a rollback occurs.

The transaction log keeps track of all transactions that update the database. It contains:
A record for the beginning of the transaction;
For each transaction component (SQL statement): the type of operation being performed (update, delete, insert), the names of the objects affected by the transaction (the name of the table), the "before" and "after" values for updated fields, and pointers to the previous and next transaction log entries for the same transaction;
The ending (COMMIT) of the transaction.

The log increases processing overhead, but the ability to restore a corrupted database is worth the price.
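The all-or-none behaviour and the commit/rollback endings can be demonstrated with any SQL engine; here is a minimal sketch with Python's sqlite3 (the accounts table and the transfer are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance NUMERIC)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 200)])
conn.commit()

try:
    # Both UPDATEs form one logical unit of work: a funds transfer.
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()      # COMMIT: all changes permanently recorded
except sqlite3.Error:
    conn.rollback()    # ROLLBACK: database restored to the previous consistent state

print(conn.execute("SELECT * FROM accounts").fetchall())   # [(1, 400), (2, 300)]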
A transaction is seen by the DBMS as a series, or list, of actions. Actions include reads and writes of database objects. Assume that an object O is always read into a program variable that is also named O. We denote transaction T reading an object O as RT(O), and similarly writing as WT(O).

Each transaction must specify as its final action either commit or abort: CommitT and AbortT. A schedule is a list of actions from a set of transactions; a schedule represents an actual or potential execution sequence.
TYPES OF SCHEDULE

1. Serial Schedule
2. Non-serial Schedule
3. Serializable Schedule

There may be a mix of transactions running on a system, some short and some long. If transactions are run serially, a short transaction may have to wait for a preceding long transaction to complete, which can lead to unpredictable delays in running a transaction.
1. Serial Schedule

The serial schedule is a type of schedule where one transaction is executed completely before starting another transaction. In the serial schedule, when the first transaction completes its cycle, the next transaction is executed.

For example, suppose there are two transactions T1 and T2 which have some operations. If there is no interleaving of operations, there are the following two possible outcomes:
Execute all the operations of T1, followed by all the operations of T2.
Execute all the operations of T2, followed by all the operations of T1.
In figure (a), Schedule A shows the serial schedule where T1 is followed by T2. In figure (b), Schedule B shows the serial schedule where T2 is followed by T1.

2. Non-serial Schedule / Concurrent Execution

If interleaving of operations is allowed, there will be non-serial schedules. A non-serial schedule is one of many possible orders in which the system can execute the individual operations of the transactions. In figures (c) and (d), Schedule C and Schedule D are non-serial schedules; they have interleaving of operations.
Problems with Concurrent Execution

In a database transaction, the two main operations are READ and WRITE, so these two operations need to be managed during the concurrent execution of the transactions. The following problems occur with concurrent execution of operations:
1. Lost Update Problem (W-W Conflict)
2. Dirty Read Problem (W-R Conflict)
3. Unrepeatable Read Problem (R-W Conflict) / Inconsistent Retrievals Problem
When multiple transactions run concurrently, they may give rise to inconsistency of the database. Serializability is a concept that helps to identify which non-serial schedules are correct and will maintain the consistency of the database. If a given schedule of n transactions is found to be equivalent to some serial schedule of the same n transactions, then it is called a serializable schedule.

The only difference between serial schedules and serializable schedules is that in serial schedules only one transaction is allowed to execute at a time, i.e., no concurrency is allowed, whereas in serializable schedules multiple transactions can execute simultaneously, i.e., concurrency is allowed.
Types of Serializability

Conflict Serializability

Precedence Graph

A precedence graph, or serialization graph, is commonly used to test the conflict serializability of a schedule. The graph contains one node for each transaction Ti. An edge of the form Tj → Tk is added whenever one of Tj's operations conflicts with, and executes before, one of Tk's operations. The schedule is conflict serializable if and only if the precedence graph contains no cycle.
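The test is mechanical, so a sketch may help. Here a schedule is a list of (transaction, action, item) triples (the sample schedule is invented); two operations conflict when they touch the same item, come from different transactions, and at least one is a write:

def conflict_serializable(schedule):
    edges = {}
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.setdefault(ti, set()).add(tj)   # Ti's op precedes Tj's

    def cyclic(node, stack):
        if node in stack:
            return True
        return any(cyclic(n, stack | {node}) for n in edges.get(node, ()))

    return not any(cyclic(t, frozenset()) for t in edges)

s = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
print(conflict_serializable(s))   # False: T1 -> T2 and T2 -> T1 form a cycle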
View Serializability

If a given schedule is found to be view equivalent to some serial schedule, then it is called a view serializable schedule. Consider two schedules S1 and S2, each consisting of two transactions T1 and T2. Schedules S1 and S2 are called view equivalent if the following three conditions hold:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S1, then Ti must also read the initial value of Q in schedule S2. "Initial reads must be the same for all data items."

2. If transaction Ti reads a data item that has been updated by transaction Tj in schedule S1, then in schedule S2 transaction Ti must read the same data item updated by transaction Tj. "The write-read sequence must be the same."
3. For each data item Q, the transaction that performs the final write(Q) operation in schedule S1 must perform the final write(Q) operation in schedule S2. "Final writers must be the same for all data items."

Checking whether a schedule is view serializable:

Method-01: Check whether the given schedule is conflict serializable or not. If the given schedule is conflict serializable, then it is surely view serializable. If the given schedule is not conflict serializable, then it may or may not be view serializable; go and check using other methods.
Method-02: Check if there exists any blind write operation (writing without reading a value is known as a blind write). If there does not exist any blind write, then the schedule is surely not view serializable (given that it is not conflict serializable); stop and report your answer. If there exists any blind write, then the schedule may or may not be view serializable; go and check using other methods.

Method-03: Try finding a view equivalent serial schedule directly, by checking the three view-equivalence conditions.
Example (worked on the accompanying slides):
Step 1: Check the final update on each data item.
Step 2: Check the initial reads.
The first serial schedule S1 satisfies all three conditions, so no other serial schedule needs to be checked. In both schedules S and S1 there is no read except the initial read, so the write-read condition need not be checked separately. Hence, the view equivalent serial schedule of S is S1.
Irrecoverable Schedules

If, in a schedule, a transaction performs a dirty read operation from an uncommitted transaction and commits before the transaction from which it has read the value, then such a schedule is known as an irrecoverable schedule.

Example: In the example schedule on the slides,
T2 performs a dirty read operation.
T2 commits before T1.
T1 fails later and rolls back.
The value that T2 read now stands to be incorrect.
T2 cannot recover, since it has already committed.
So the schedule is an irrecoverable schedule.

Recoverable Schedules

If, in a schedule, a transaction performs a dirty read operation from an uncommitted transaction and its commit operation is delayed until the uncommitted transaction either commits or rolls back, then such a schedule is known as a recoverable schedule. Since the commit operation of the transaction that performs the dirty read is delayed, the incorrect read can be undone.
CASCADING SCHEDULE

Even if a schedule is recoverable, to recover correctly from the failure of a transaction Ti we may have to roll back several transactions. Such situations occur if transactions have read data written by Ti. In the example on the slides, transaction T8 has been aborted: T8 must be rolled back, and any transaction that has read a value written by T8 must be rolled back as well. This phenomenon is called cascading rollback.
Storage and RAID

The simplest (but most expensive) approach to introducing redundancy is to duplicate every disk; this technique is called mirroring. A logical disk then consists of two physical disks, and every write is carried out on both disks. If one of the disks fails, the data can be read from the other. Data will be lost only if the second disk fails before the first failed disk is repaired.

With disk mirroring, the rate at which read requests can be handled is doubled, since read requests can be sent to either disk. The transfer rate of each read is the same as in a single-disk system, but the number of reads per unit time has doubled.

With multiple disks, we can improve the transfer rate as well (or instead) by striping data across multiple disks. In its simplest form, data striping consists of splitting the bits of each byte across multiple disks; such striping is called bit-level striping.

Different types of data striping:
Block-level striping stripes blocks, rather than bits, across multiple disks. When reading a large file, block-level striping fetches n blocks at a time in parallel from the n disks, giving a high data-transfer rate for large reads. When a single block is read, the data-transfer rate is the same as on one disk, but the remaining n − 1 disks are free to perform other actions.

In summary, there are two main goals of parallelism in a disk system:
1. Load-balance multiple small accesses (block accesses), so that the throughput of such accesses increases.
2. Parallelize large accesses so that the response time of large accesses is reduced.
RAID Levels
Mirroring provides high reliability, but it is expensive. Striping provides high data-transfer rates, but does not improve reliability. Various alternative schemes aim to provide redundancy at lower cost by combining disk striping with "parity" bits. These schemes have different cost-performance trade-offs, and are classified into RAID levels. (For all levels, the figure depicts four disks' worth of data, and the extra disks depicted are used to store redundant information for failure recovery.)

RAID level 0 refers to disk arrays with striping at the level of blocks, but without any redundancy (such as mirroring or parity bits).

RAID level 1 refers to disk mirroring with block striping.

RAID level 2, known as memory-style error-correcting-code (ECC) organization, employs parity bits. Memory systems have long used parity bits for error detection and correction.
Each byte in a memory system may have a parity bit associated with it that records whether the number of bits in the byte that are set to 1 is even (parity = 0) or odd (parity = 1). If one of the bits in the byte gets damaged (either a 1 becomes a 0, or a 0 becomes a 1), the parity of the byte changes and thus will not match the stored parity. Similarly, if the stored parity bit gets damaged, it will not match the computed parity. Thus, all 1-bit errors will be detected by the memory system. Error-correcting schemes store 2 or more extra bits, and can reconstruct the data if a single bit gets damaged.

The idea of error-correcting codes can be used directly in disk arrays by striping bytes across disks. For example, the first bit of each byte could be stored in disk 0, the second bit in disk 1, and so on until the eighth bit is stored in disk 7, with the error-correction bits stored in further disks. The disks labeled P store the error-correction bits. If one of the disks fails, the remaining bits of the byte and the associated error-correction bits can be read from other disks and used to reconstruct the damaged data.
RAID level 3, bit-interleaved parity organization, improves on level 2 by exploiting the fact that disk controllers, unlike memory systems, can detect whether a sector has been read correctly, so a single parity bit can be used for error correction as well as for detection. If one of the sectors gets damaged, the system knows exactly which sector it is, and, for each bit in the sector, the system can figure out whether it is a 1 or a 0 by computing the parity of the corresponding bits from sectors on the other disks. If the parity of the remaining bits is equal to the stored parity, the missing bit is 0; otherwise, it is 1.

RAID level 3 is as good as level 2, but is less expensive in the number of extra disks (it has only a one-disk overhead), so level 2 is not used in practice. RAID level 3 has two benefits over level 1. First, it needs only one parity disk for several regular disks, whereas level 1 needs one mirror disk for every disk, and thus level 3 reduces the storage overhead. Second, since data are striped, reads and writes of a single block are spread out over multiple disks, which increases the transfer rate.
RAID level 4, block-interleaved parity
organization, uses block-level striping, like RAID
0, and in addition keeps a parity block on a
separate disk for corresponding blocks from N
other disks.
If one of the disks fails, the parity block can be used with the corresponding blocks from the other disks to restore the blocks of the failed disk.
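Parity recovery in RAID levels 3 and 4 is just bitwise XOR: the parity block is the XOR of the corresponding data blocks, so any one lost block equals the XOR of the parity with the surviving blocks. A sketch over toy block contents:

from functools import reduce

def xor_blocks(blocks):
    # Column-wise XOR across equal-sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = xor_blocks(data)                    # stored on the parity disk

# Disk 1 fails: rebuild its block from the parity and the surviving disks.
rebuilt = xor_blocks([parity, data[0], data[2]])
assert rebuilt == data[1]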
File Organization

A database is mapped into a number of different files that are maintained by the underlying operating system. These files reside permanently on disks. A file is organized logically as a sequence of records, and these records are mapped onto disk blocks. Each file is also logically partitioned into fixed-length storage units called blocks, which are the units of both storage allocation and data transfer. Most databases use block sizes of 4 to 8 kilobytes by default.

A block may contain several records; the exact set of records that a block contains is determined by the form of physical data organization being used. In a relational database, tuples of distinct relations are generally of different sizes. One approach to mapping the database to files is to use several files, and to store records of only one fixed length in any given file. An alternative is to structure our files so that we can accommodate multiple lengths for records.
Fixed-Length Records
As an example, let us consider a file of instructor records for our university database. Each record of this file is defined (in pseudocode) as:

type instructor = record
    ID varchar(5);
    name varchar(20);
    dept_name varchar(20);
    salary numeric(8,2);
end

Assume that each character occupies 1 byte and that numeric(8,2) occupies 8 bytes. Suppose that instead of allocating a variable amount of bytes for the attributes ID, name, and dept_name, we allocate the maximum number of bytes that each attribute can hold. Then the instructor record is 53 bytes long. A simple approach is to use the first 53 bytes for the first record, the next 53 bytes for the second record, and so on.
However, there are two problems with this simple approach:
1. Unless the block size happens to be a multiple of 53 (which is unlikely), some records will cross block boundaries. That is, part of the record will be stored in one block and part in another. It would thus require two block accesses to read or write such a record.
2. It is difficult to delete a record from this structure. The space occupied by the record to be deleted must be filled with some other record of the file, or we must have a way of marking deleted records so that they can be ignored.

To avoid the first problem, we allocate only as many records to a block as would fit entirely in the block. For the second problem, when a record is deleted we could move the record that came after it into the space formerly occupied by the deleted record, and so on, until every record following the deleted record has been moved ahead. Such an approach requires moving a large number of records. It might be easier simply to move the final record of the file into the space occupied by the deleted record.
It is undesirable to move records to occupy the
space freed by a deleted record, since doing so
requires additional block accesses. Since
insertions tend to be more frequent than
deletions, it is acceptable to leave open the space
occupied by the deleted record, and to wait for a
subsequent insertion before reusing the space.
A simple marker on a deleted record is not
sufficient, since it is hard to find this available
space when an insertion is being done. Thus, we
need to introduce an additional structure.
At the beginning of the file, we allocate a certain number of bytes as a file header. The header will contain a variety of information about the file. For now, all we need to store there is the address of the first record whose contents are deleted. We use this first record to store the address of the second available record, and so on. Intuitively, we can think of these stored addresses as pointers, since they point to the location of a record. The deleted records thus form a linked list, which is often referred to as a free list.

On insertion of a new record, we use the record pointed to by the header, and change the header pointer to point to the next available record. If no space is available, we add the new record to the end of the file. Insertion and deletion for files of fixed-length records are simple to implement, because the space made available by a deleted record is exactly the space needed to insert a record.
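A sketch of the free-list idea, with an in-memory list standing in for the file: free_head plays the role of the header pointer, and each deleted slot stores the index of the next free slot, with -1 marking the end of the list. The record values are invented for illustration.

class FixedLengthFile:
    def __init__(self):
        self.slots = []        # each slot holds a record or a free-list link
        self.free_head = -1    # header: index of the first deleted slot

    def insert(self, record):
        if self.free_head == -1:
            self.slots.append(record)          # no free slot: append at the end
        else:
            slot = self.free_head              # reuse the first free slot
            self.free_head = self.slots[slot]
            self.slots[slot] = record

    def delete(self, slot):
        self.slots[slot] = self.free_head      # chain the slot into the free list
        self.free_head = slot

f = FixedLengthFile()
for name in ("Srinivasan", "Wu", "Mozart"):
    f.insert(name)
f.delete(1)             # slot 1 joins the free list
f.insert("Einstein")    # reuses slot 1 instead of growing the file
print(f.slots)          # ['Srinivasan', 'Einstein', 'Mozart']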
Variable-Length Records
Different techniques for implementing variable-length records exist. Two different problems must be solved by any such technique:
• How to represent a single record in such a way that individual attributes can be extracted easily.
• How to store variable-length records within a block, such that records in a block can be extracted easily.

The representation of a record with variable-length attributes typically has two parts: an initial part with fixed-length attributes, followed by data for variable-length attributes. Fixed-length attributes, such as numeric values, dates, or fixed-length character strings, are allocated as many bytes as required to store their value. Variable-length attributes, such as varchar types, are represented in the initial part of the record by a pair (offset, length), where offset denotes where the data for that attribute begins within the record, and length is the length in bytes of the variable-sized attribute.
The values for these attributes are stored consecutively, after the initial fixed-length part of the record. Thus, the initial part of the record stores a fixed size of information about each attribute, whether it is fixed-length or variable-length.

The record also contains a null bitmap, which indicates which attributes of the record have a null value. In the instructor record, for example, if the salary were null, the fourth bit of the bitmap would be set to 1, and the salary value stored in bytes 12 through 19 would be ignored.

The slotted-page structure is commonly used for organizing variable-length records within a block.
The slotted-page header contains the number of record entries, the end of free space in the block, and an array whose entries contain the location and size of each record. The actual records are allocated contiguously in the block, starting from the end of the block. The free space in the block is contiguous, between the final entry in the header array and the first record. If a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and location is added to the header. If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set to −1, for example).

Several of the possible ways of organizing records in files are:

Heap file organization. Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. Typically, there is a single file for each relation.

Sequential file organization. Records are stored in sequential order, according to the value of a "search key" of each record.
Hashing file organization. A hash function is computed on some attribute of each record. The result of the hash function specifies in which block of the file the record should be placed. Generally, a separate file is used to store the records of each relation.

A sequential file is designed for efficient processing of records in sorted order based on some search key. A search key is any attribute or set of attributes; it need not be the primary key, or even a superkey. To permit fast retrieval of records in search-key order, the records are chained together by pointers.
The sequential file organization allows records to be read in sorted order; that can be useful for display purposes, as well as for certain query-processing algorithms. For insertion, we apply the following rules:
1. Locate the record in the file that comes before the record to be inserted in search-key order.
2. If there is a free record (that is, space left after a deletion) within the same block as this record, insert the new record there. Otherwise, insert the new record in an overflow block. In either case, adjust the pointers so as to chain together the records in search-key order.

A multitable clustering file organization is a file organization that stores related records of two or more relations in each block. Such a file organization allows us to read records that would satisfy a join condition by using one block read; thus we are able to process such a query more efficiently.
Indices

An attribute or set of attributes used to look up records in a file is called a search key. We often want to have more than one index for a file. The factors used to evaluate an index structure include:

Insertion time: The time it takes to insert a new data item. This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure.

Deletion time: The time it takes to delete a data item. This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure.

Space overhead: The additional space occupied by an index structure. Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance.
1. Ordered Indices
To gain fast random access to records in a file, we can use an index structure. Each index structure is associated with a particular search key. Just like the index of a book or a library catalog, an ordered index stores the values of the search keys in sorted order, and associates with each search key the records that contain it. The records in the indexed file may themselves be stored in some sorted order, just as books in a library are stored according to some attribute.

A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a clustering index is an index whose search key also defines the sequential order of the file. Clustering indices are also called primary indices. Indices whose search key specifies an order different from the sequential order of the file are called nonclustering indices, or secondary indices.
Indices with two or more levels are called multilevel indices.

The B+-tree is a balanced search tree (not a binary tree: each node may have many children) that follows a multilevel index format. In the B+-tree, the leaf nodes hold the actual data pointers, and all leaf nodes remain at the same depth. The leaf nodes are linked in a list, so a B+-tree can support sequential access as well as random access.
Structure of a B+-Tree
The main disadvantage of the index-sequential file organization is that performance degrades as the file grows. The B+-tree index structure is the most widely used of several index structures that maintain their efficiency despite insertion and deletion of data.

A B+-tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length; every leaf node is at equal distance from the root node. Each nonleaf node in the tree has between ⌈n/2⌉ and n children, where n is fixed for a particular tree (the tree is said to be of order n). The tree contains internal (nonleaf) nodes and leaf nodes.
A B+-tree index is a multilevel index, but it has a structure that differs from that of the multilevel index-sequential file. Each node contains up to n − 1 search-key values K1, K2, ..., Kn−1, and n pointers P1, P2, ..., Pn. The search-key values within a node are kept in sorted order; thus, if i < j, then Ki < Kj.

We consider first the structure of the leaf nodes. For i = 1, 2, ..., n−1, pointer Pi points to a file record with search-key value Ki. Pointer Pn has a special purpose: since there is a linear order on the leaves based on the search-key values that they contain, we use Pn to chain together the leaf nodes in search-key order. This ordering allows for efficient sequential processing of the file.
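The leaf chain is what makes range queries cheap. Below is a sketch of the leaf level only (internal nodes omitted; keys and record pointers are invented): binary-search the first relevant entry, then follow the Pn links.

from bisect import bisect_left

class Leaf:
    def __init__(self, keys, pointers):
        self.keys, self.pointers = keys, pointers   # sorted keys, record pointers
        self.next = None                            # the Pn chain pointer

def range_scan(leaf, lo, hi):
    results = []
    while leaf is not None and leaf.keys and leaf.keys[0] <= hi:
        i = bisect_left(leaf.keys, lo)
        while i < len(leaf.keys) and leaf.keys[i] <= hi:
            results.append(leaf.pointers[i])
            i += 1
        leaf = leaf.next        # continue in the next leaf, in search-key order
    return results

a, b = Leaf([10, 20], ["r10", "r20"]), Leaf([30, 40], ["r30", "r40"])
a.next = b
print(range_scan(a, 15, 35))   # ['r20', 'r30']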
Hashing can be used for two different purposes. In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search-key value of the record. In a hash index organization, we organize the search keys, with their associated pointers, into a hash file structure.

Deletion is straightforward: if the search-key value of the record to be deleted is Ki, we compute h(Ki), then search the corresponding bucket for that record, and delete the record from the bucket.
Hash Functions
Hash functions require careful design. A bad hash function may result in lookup taking time proportional to the number of search keys in the file. A well-designed function gives an average-case lookup time that is a (small) constant, independent of the number of search keys in the file.

If a bucket does not have enough space, a bucket overflow is said to occur. Bucket overflow can occur for several reasons:

Insufficient buckets. The number of buckets, which we denote nB, must be chosen such that nB > nr/fr, where nr denotes the total number of records that will be stored and fr denotes the number of records that will fit in a bucket. This designation, of course, assumes that the total number of records is known when the hash function is chosen.
Skew. Some buckets are assigned more records than others, so a bucket may overflow even when other buckets still have space. This situation is called bucket skew. Skew can occur for two reasons:
1. Multiple records may have the same search key.
2. The chosen hash function may result in a nonuniform distribution of search keys.

So that the probability of bucket overflow is reduced, the number of buckets is chosen to be (nr/fr) × (1 + d), where d is a fudge factor, typically around 0.2. Some space is wasted: about 20 percent of the space in the buckets will be empty. But the benefit is that the probability of overflow is reduced.
Despite allocation of a few more buckets than required, bucket overflow can still occur. We handle bucket overflow by using overflow buckets. If a record must be inserted into a bucket b, and b is already full, the system provides an overflow bucket for b, and inserts the record into the overflow bucket. If the overflow bucket is also full, the system provides another overflow bucket, and so on. All the overflow buckets of a given bucket are chained together in a linked list; overflow handling using such a linked list is called overflow chaining.
We must change the lookup algorithm slightly to handle overflow chaining. As before, the system uses the hash function on the search key to identify a bucket b, and must examine all the records in bucket b to see whether they match the search key. In addition, if bucket b has overflow buckets, the system must examine the records in all the overflow buckets as well. This form of hash structure is called closed hashing.

Under an alternative approach, called open hashing, the set of buckets is fixed, and there are no overflow chains. Instead, if a bucket is full, the system inserts records in some other bucket in the initial set of buckets B. One policy is to use the next bucket (in cyclic order) that has space; this policy is called linear probing.
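A sketch of closed hashing with overflow chaining (the bucket capacity and keys are toy values): a fixed array of buckets, each with a small capacity and a chain of overflow buckets behind it; lookup walks the whole chain.

BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None            # next overflow bucket in the chain

class HashFile:
    def __init__(self, n_buckets):
        self.buckets = [Bucket() for _ in range(n_buckets)]

    def insert(self, key, record):
        b = self.buckets[hash(key) % len(self.buckets)]
        while len(b.records) >= BUCKET_CAPACITY:   # walk to a bucket with room
            if b.overflow is None:
                b.overflow = Bucket()
            b = b.overflow
        b.records.append((key, record))

    def lookup(self, key):
        b = self.buckets[hash(key) % len(self.buckets)]
        while b is not None:            # examine the bucket, then its overflows
            for k, r in b.records:
                if k == key:
                    yield r
            b = b.overflow

hf = HashFile(4)
for i in range(10):
    hf.insert(i % 3, f"record{i}")      # skewed keys force overflow chaining
print(list(hf.lookup(0)))               # every record filed under key 0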
Dynamic Hashing
The need to fix the set B of bucket addresses presents a serious problem with the static hashing technique. Most databases grow larger over time. If we are to use static hashing for such a database, we have three classes of options:
1. Choose a hash function based on the current file size.
2. Choose a hash function based on the anticipated size of the file at some point in the future.
3. Periodically reorganize the hash structure in response to file growth.

Several dynamic hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinkage of the database. Extendable hashing (a form of dynamic hashing) copes with changes in database size by splitting and combining buckets as the database grows and shrinks. As a result, space efficiency is retained. With extendable hashing, we choose a hash function h with the desirable properties of uniformity and randomness. However, this hash function generates values over a relatively large range, namely b-bit binary integers; a typical value for b is 32.
Distributed Databases
In a homogeneous distributed database system, all sites have identical database-management system software, are aware of one another, and agree to cooperate in processing users' requests. In contrast, in a heterogeneous distributed database, different sites may use different schemas and different database-management system software. The sites may not be aware of one another, and they may provide only limited facilities for cooperation in transaction processing.

Consider a relation r that is to be stored in the database. There are two approaches to storing this relation in the distributed database:
• Replication. The system maintains several identical replicas (copies) of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.
• Fragmentation. The system partitions the relation into several fragments, and stores each fragment at a different site.
Data Replication
If relation r is replicated, a copy of relation r is stored in two or more sites. In the most extreme case, we have full replication, in which a copy is stored in every site in the system. There are a number of advantages and disadvantages to replication.

Availability: If one of the sites containing relation r fails, then relation r can be found in another site. Thus, the system can continue to process queries involving r, despite the failure of one site.

Fragmentation and replication can be combined: a relation can be partitioned into several fragments and there may be several replicas of each fragment.
We can simplify the management of replicas of relation r by choosing one of them as the primary copy of r. For example, in a banking system, an account can be associated with the site in which the account has been opened.

Data Fragmentation

If relation r is fragmented, r is divided into a number of fragments r1, r2, ..., rn. These fragments contain sufficient information to allow reconstruction of the original relation r. There are two different schemes for fragmenting a relation: horizontal fragmentation and vertical fragmentation. Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments. Vertical fragmentation splits the relation by decomposing the scheme R of relation r.
In horizontal fragmentation, a relation r is partitioned into a number of subsets r1, r2, ..., rn. Each tuple of relation r must belong to at least one of the fragments, so that the original relation can be reconstructed if needed. For example, the account relation can be divided into several different fragments, each of which consists of tuples of accounts belonging to a particular branch.

A horizontal fragment can be defined as a selection on the global relation r. That is, we use a predicate Pi to construct fragment ri:

ri = σ Pi (r)

We reconstruct the relation r by taking the union of all fragments; that is:

r = r1 ∪ r2 ∪ ... ∪ rn

Vertical fragmentation is the same as decomposition. Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1, R2, ..., Rn of the schema R so that:

R = R1 ∪ R2 ∪ ... ∪ Rn
Each fragment ri of r is defined by

ri = Π Ri (r)

One way of ensuring that the relation r can be reconstructed is to include the primary-key attributes of R in each Ri. More generally, any superkey can be used. It is often convenient to add a special attribute, called a tuple-id, to the schema R. We can then reconstruct relation r from the fragments by taking the natural join:

r = r1 ⋈ r2 ⋈ ... ⋈ rn
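Both fragmentation schemes and their reconstruction rules can be checked mechanically. This sketch fragments an invented account relation horizontally by branch (union rebuilds r) and vertically on the primary key account_no (natural join rebuilds r):

account = [
    {"account_no": "A-101", "branch": "Hillside",   "balance": 500},
    {"account_no": "A-215", "branch": "Valleyview", "balance": 700},
]

# Horizontal: one selection predicate per branch; r = r1 U r2
r1 = [t for t in account if t["branch"] == "Hillside"]
r2 = [t for t in account if t["branch"] == "Valleyview"]
assert sorted(r1 + r2, key=lambda t: t["account_no"]) == account

# Vertical: every fragment keeps the key account_no; r = v1 join v2
v1 = [{"account_no": t["account_no"], "branch": t["branch"]} for t in account]
v2 = [{"account_no": t["account_no"], "balance": t["balance"]} for t in account]
rejoined = [{**a, **b} for a in v1 for b in v2
            if a["account_no"] == b["account_no"]]
assert sorted(rejoined, key=lambda t: t["account_no"]) == account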
Transparency
The user of a distributed database system should not be required to know where the data are physically located nor how the data can be accessed at the specific local site. This characteristic, called data transparency, can take several forms:

Fragmentation transparency. Users are not required to know how a relation has been fragmented.

Replication transparency. Users view each data object as logically unique. The distributed system may replicate an object to increase either system performance or data availability. Users do not have to be concerned with what data objects have been replicated, or where replicas have been placed.

Location transparency. Users are not required to know the physical location of the data. The distributed database system should be able to find any data as long as the data identifier is supplied by the user transaction.
Data items, such as relations, fragments, and replicas, must have unique names. This property is easy to ensure in a centralized database. In a distributed database, however, we must take care to ensure that two sites do not use the same name for distinct data items. One solution to this problem is to require all names to be registered in a central name server. The name server helps to ensure that the same name does not get used for different data items.

The database system can also create a set of alternative names, or aliases, for data items. A user may thus refer to data items by simple names that are translated by the system to complete names.
Access to the various data items in a distributed system is usually accomplished through transactions, which must preserve the ACID properties. There are two types of transaction that we need to consider. Local transactions are those that access and update data in only one local database; global transactions are those that access and update data in several local databases.

Each site has its own local transaction manager, whose function is to ensure the ACID properties of those transactions that execute at that site. The various transaction managers cooperate to execute global transactions. Each site contains two subsystems:
The transaction manager manages the execution of those transactions (or subtransactions) that access data stored in a local site.
The transaction coordinator coordinates the execution of the various transactions (both local and global) initiated at that site.
Each transaction manager is responsible for: the coordinator is responsible for:
• Maintaining a log for recovery purposes. • Starting the execution of the transaction.
• Participating in an appropriate concurrency-control scheme to • Breaking the transaction into a number of sub transactions and
coordinate the distributing these subtransactions to the appropriate sites for
concurrent execution of the transactions executing at that site. execution.
• Coordinating the termination of the transaction, which may
result in the transaction being committed at all sites or aborted
at all sites.
A distributed system may suffer from the same types of failure that a centralized system does, as well as:
• Failure of a site.
• Loss of messages.
• Failure of a communication link.
• Network partition.

Complex application domains require correspondingly complex data types, such as nested record structures, multivalued attributes, and inheritance, which are supported by traditional programming languages. The object-relational data model extends the relational data model by providing a richer type system, including complex data types and object orientation.
Object-relational database systems, that is, database systems based on the object-relational model, provide a convenient migration path for users of relational databases who wish to use object-oriented features. Two approaches are used:

1. Build an object-oriented database system, that is, a database system that natively supports an object-oriented type system, and allows direct access to data from an object-oriented programming language using the native type system of the language.

2. Automatically convert data from the native type system of the programming language to a relational representation, and vice versa. Data conversion is specified using an object-relational mapping.
Traditional database applications have conceptually simple data types. The basic data items are records that are fairly small and whose fields are atomic. Consider, for example, addresses. While an entire address could be viewed as an atomic data item of type string, this view would hide details such as the street address, city, state, and postal code, which could be of interest to queries. On the other hand, if an address were represented by breaking it into the components (street address, city, state, and postal code), writing queries would be more complicated, since they would have to mention each field. A better alternative is to allow structured data types, such as a type address with subparts street address, city, state, and postal code.

With complex type systems we can represent E-R model concepts, such as composite attributes, multivalued attributes, generalization, and specialization, directly, without a complex translation to the relational model.
Structured Types and Inheritance in SQL
The following structured types can be used to represent composite attributes such as a name and an address; a sketch is given below. Such types are called user-defined types in SQL. The final and not final specifications are related to subtyping.

We can then use these types to create composite attributes in a relation, by simply declaring an attribute to be of one of these types. For example, we could create a table person, as also shown in the sketch below.
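The slide code itself is not preserved in these notes; a minimal sketch following standard SQL syntax (the attribute names and sizes are illustrative):

create type Name as (
    firstname varchar(20),
    lastname varchar(20))
    final;

create type Address as (
    street varchar(20),
    city varchar(20),
    zipcode varchar(9))
    not final;

create table person (
    name Name,
    address Address,
    dateOfBirth date);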
The components of a composite attribute can be accessed using
a “dot” notation; for instance name.firstname returns the
firstname component of the name attribute. An access to
attribute name would return a value of the structured type
Name.
We can also create a table whose rows are of a user-defined
type. For example, we could define a type PersonType and
create the table person as follows:
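Again, the slide code is not preserved; a sketch under the same assumptions:

create type PersonType as (
    name Name,
    address Address,
    dateOfBirth date)
    not final;

create table person of PersonType;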
An alternative way of defining composite attributes in SQL is to use unnamed row types. A query sketched below also illustrates how to access component attributes of a composite attribute: it finds the last name and city of each person.

We could further declare a constructor for the type Name; we can then use new Name('John', 'Smith') to create a value of the type Name. More generally, we can construct a row value by listing its attributes within parentheses.
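Sketches of these constructs, with illustrative names, following textbook-style SQL syntax:

-- Composite attributes using unnamed row types:
create table person_r (
    name row(firstname varchar(20),
             lastname varchar(20)),
    address row(street varchar(20),
                city varchar(20),
                zipcode varchar(9)),
    dateOfBirth date);

-- Finding the last name and city of each person:
select name.lastname, address.city
from person;

-- A user-defined constructor for the type Name:
create function Name(firstname varchar(20), lastname varchar(20))
    returns Name
begin
    set self.firstname = firstname;
    set self.lastname = lastname;
end

-- A row value constructed by listing its attributes in parentheses:
-- ('20 Main St', 'New York', '11001')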
By default every structured type has a constructor with no arguments, which sets the attributes to their default values. Any other constructors have to be created explicitly. There can be more than one constructor for the same structured type; although they have the same name, they must be distinguishable by the number and types of their arguments. The following statement illustrates how we can create a new tuple in the Person relation.
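The statement itself is not preserved in these notes; a sketch, assuming the person table and the Name constructor from the earlier examples (the Address constructor and the literal values are illustrative):

insert into person
values (new Name('John', 'Smith'),
        new Address('20 Main St', 'New York', '11001'),
        date '1960-08-22');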
Type Inheritance
Suppose that we have the following type definition for people, together with subtypes for students and teachers (a sketch is given below). The SQL standard requires an extra field at the end of the type definition, whose value is either final or not final. The keyword final says that subtypes may not be created from the given type, while not final says that subtypes may be created.

We can then create a table people of this type,

create table people of Person;

and define tables students and teachers as subtables of people.
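The type and subtable definitions are not preserved in these notes; a sketch following standard SQL syntax (the attribute names are illustrative):

create type Person as (
    name varchar(20),
    address varchar(20))
    not final;

create type Student under Person as (
    degree varchar(20),
    department varchar(20))
    not final;

create type Teacher under Person as (
    salary integer,
    department varchar(20))
    not final;

create table people of Person;
create table students of Student under people;
create table teachers of Teacher under people;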
Further, when we declare students and teachers as subtables of people, every tuple present in students or teachers becomes implicitly present in people. Thus, if a query uses the table people, it will find not only the tuples directly inserted into that table, but also the tuples inserted into its subtables, namely students and teachers. However, only those attributes that are present in people can be accessed by that query.

SQL permits us to find tuples that are in people but not in its subtables by using "only people" in place of people in a query. The only keyword can also be used in delete and update statements. Without the only keyword, a delete statement on a supertable, such as people, also deletes tuples that were originally inserted in subtables (such as students); for example,

delete from people where P;

would delete all tuples from the table people, as well as from its subtables students and teachers, that satisfy P. If the only keyword is added to the above statement, tuples that were inserted in subtables are not affected, even if they satisfy the where-clause conditions. Multiple inheritance is also possible with tables.

Array and Multiset Types in SQL

SQL supports two collection types: arrays and multisets. A multiset is an unordered collection, where an element may occur multiple times.
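For example, a sketch of a type using both collection kinds (the type and attribute names are illustrative):

create type Publication as (
    title varchar(30),
    author_array varchar(20) array[10],  -- ordered array of up to 10 author names
    pub_date date,
    keyword_set varchar(20) multiset)    -- unordered collection of keywords
    not final;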
Object-oriented languages provide the ability to refer to objects. An attribute of a type can be a reference to an object of a specified type. For example, in SQL we can define a type Department with a field name and a field head that is a reference to the type Person, and a table departments of type Department (a sketch is given below).

We can also omit the declaration scope people from the type declaration and instead make an addition to the create table statement.

The referenced table must have an attribute that stores the identifier of the tuple. We declare this attribute, called the self-referential attribute, by adding a ref is clause to the create table statement.
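A sketch of these declarations, with illustrative names, following standard SQL syntax:

create type Department as (
    name varchar(20),
    head ref(Person) scope people);

create table departments of Department;

-- Alternatively, omit the scope from the type declaration and add it
-- to the create table statement instead:
--   create type Department as (
--       name varchar(20),
--       head ref(Person));
--   create table departments of Department
--       (head with options scope people);

-- The self-referential attribute of the referenced table:
create table people of Person
    ref is person_id system generated;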
CAP theorem

The CAP theorem comprises three components (hence its name) as they relate to distributed data stores:

Consistency. All reads receive the most recent write or an error.

Availability. All reads contain data, but it might not be the most recent.

Partition tolerance. The system continues to operate despite network failures (i.e., dropped partitions, slow network connections, or unavailable network connections between nodes).

In normal operations, your data store provides all three functions. But the CAP theorem maintains that when a distributed database experiences a network failure, you can provide either consistency or availability.

In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions. Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary: the P of CAP.
If a partition means a break in communication, then partition tolerance means that the system should still be able to work even if there is a partition in the system: if a node fails to communicate, one of the node's replicas should be able to retrieve the data required by the user.

That leaves a decision between the other two, C and A. When a network failure happens, one can choose to guarantee consistency or availability:

High consistency comes at the cost of lower availability.
High availability comes at the cost of lower consistency.

The CAP theorem states that a distributed database system has to make a tradeoff between Consistency and Availability when a Partition occurs.
Non-relational database
Non-relational databases (often called NoSQL databases) differ from traditional relational databases in that they store their data in a non-tabular form. Instead, non-relational databases might be based on data structures like documents. A document can be highly detailed while containing a range of different types of information in different formats.

There are several advantages to using non-relational databases, including:

Massive dataset organization. In the age of Big Data, non-relational databases can not only store massive quantities of information, but they can also query these datasets with ease. Scale and speed are crucial advantages of non-relational databases.
MongoDB
MongoDB is a NoSQL document database system that scales well horizontally and implements data storage through a key-value system.

MongoDB Sharding

MongoDB achieves scaling through a technique known as "sharding": the process of writing data across different servers to distribute the read and write load and the data-storage requirements. Sharding is the process of storing data records across multiple machines, and it is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves this problem with horizontal scaling: you add more machines to support data growth and the demands of read and write operations.

MongoDB sharding works by creating a cluster of MongoDB instances consisting of at least three servers. That means sharded clusters consist of three main components:

The shard. A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be deployed as replica sets to increase availability and provide redundancy. The combination of multiple shards creates a complete data set. For example, a 2 TB data set can be broken down into four shards, each containing 500 GB of data from the original data set.

Mongos. Mongos acts as the query router, providing a stable interface between the application and the sharded cluster. This MongoDB instance is responsible for routing client requests to the correct shard.

Config servers. Configuration servers store the metadata and the configuration settings for the whole cluster.

MongoDB Replication

Replica Sets are a great way to replicate MongoDB data across multiple servers and have the database automatically fail over in case of server failure.
HBase

HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data.

Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all. HBase applications are written in Java, much like a typical Apache MapReduce application.
Cassandra
Apache Cassandra is an open-source, distributed, and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.

It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Created at Facebook, it differs sharply from relational database management systems.

Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.

Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra
Elastic scalability − Cassandra is highly scalable; it allows adding more hardware to accommodate more customers and more data as per requirement.

Always-on architecture − Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.

Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.

Flexible data storage − Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.

Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).

Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.
Components of Cassandra
Node − It is the place where data is stored.

Data center − It is a collection of related nodes.

Cluster − A cluster is a component that contains one or more data centers.

Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.

SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

Bloom filter − These are quick, nondeterministic algorithms for testing whether an element is a member of a set; a Bloom filter is a special kind of cache. Bloom filters are accessed after every query.
Cassandra Query Language
Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a prompt to work with CQL, or separate application-language drivers; a brief CQL sketch follows the read/write discussion below.

Write Operations

Every write activity of the nodes is captured by the commit logs written in the nodes. Later the data is captured and stored in the mem-table. Whenever the mem-table is full, data is written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the appropriate SSTable that holds the required data.
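As a brief illustration of CQL (a sketch; the keyspace, table, and column names are hypothetical, not from these notes):

-- Create a keyspace (the container of tables) and a table in it:
CREATE KEYSPACE university
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE university;

CREATE TABLE students (
    id uuid PRIMARY KEY,
    name text,
    dept text);

-- Writes go to the commit log and mem-table as described above:
INSERT INTO students (id, name, dept)
    VALUES (uuid(), 'John Smith', 'CS');

-- Reads consult the mem-table and Bloom filters as described above:
SELECT name, dept FROM students;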
Module 5
XML

XML stands for eXtensible Markup Language.
XML was designed to store and transport data.
XML is a software- and hardware-independent tool for storing and transporting data.
The Difference Between XML and HTML
XML and HTML were designed with different goals:
XML was designed to carry data - with focus on what data is
HTML was designed to display data - with focus on how data
looks
XML tags are not predefined like HTML tags are
XML Does Not Use Predefined Tags
The XML language has no predefined tags.

XML is Extensible
Most XML applications will work as expected even if new data is added (or removed).

XML Simplifies Things
It simplifies data sharing.
It simplifies data transport.
It simplifies platform changes.
It simplifies data availability.

XML DTD

DTD stands for Document Type Definition.
A DTD defines the structure and the legal elements and attributes of an XML document.
An XML document with correct syntax is called "Well Formed".
An XML document validated against a DTD is both "Well Formed" and "Valid".

The purpose of a DTD is to define the structure and the legal elements and attributes of an XML document. A DOCTYPE declaration in the XML document contains the reference to the DTD file; an example is sketched below.
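The example itself is not preserved in these notes; a typical XML document with a DOCTYPE declaration (the file name Note.dtd and the note elements are illustrative, matching the schema walkthrough below) might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

The corresponding Note.dtd would declare each element, for example:

<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>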
XML Schema

An XML Schema is an XML-based alternative to a DTD:
XML Schemas are written in XML.
XML Schemas are extensible to additions.
XML Schemas support data types.
XML Schemas support namespaces.

<xs:element name="note"> defines the element called "note"
<xs:complexType> the "note" element is a complex type
<xs:sequence> the complex type is a sequence of elements
<xs:element name="to" type="xs:string"> the element "to" is of type string (text)
<xs:element name="from" type="xs:string"> the element "from" is of type string
<xs:element name="heading" type="xs:string"> the element "heading" is of type string
<xs:element name="body" type="xs:string"> the element "body" is of type string
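Assembled, the schema fragment that these lines describe looks like this:

<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>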
XML Applications
Web Services
When the information is to be used directly by a human, organizations provide Web-based forms, where users can input values and get back desired information in HTML form. However, there are many applications where such information needs to be accessed by software programs, rather than by end users. Providing the results of a query in XML form is a clear requirement. In addition, it makes sense to specify the input values to the query also in XML format.