Chapter 13
DATABASE CONCEPTS
What is data?
Data is a collection of facts, figures, and statistics that can be processed to produce meaningful information.
(OR)
Data is raw, unorganized facts that need to be processed. Data can seem simple, random, and useless until it is organized.
What is information?
When data is processed, organized, structured or presented in a given context so as to make it useful, it is called
information.
(OR)
Information is data that has been processed in such a way as to be meaningful to the person who receives it.
(OR)
Information is the processed data on which decisions and actions are based.
Example:
1. The history of temperature readings all over the world for the past 100 years is data. If this data is
organized and analyzed to find that global temperature is rising, then that is information.
2. Each student's test score is one piece of data. The average score of a class or of the entire school is
information that can be derived from the given data.
What is a database?
A database is a collection of information that is organized so that it can easily be accessed, managed, and
updated.
Examples:
1. College database
2. Library database
Applications of databases:
1. Banking
2. Colleges
3. Telecommunications
4. Finance
5. Sales
6. Manufacturing
7. Military
8. Medical
Data processing:
Data processing is the conversion of raw data to meaningful information through a process. Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation. The data processing cycle consists of the following stages:
1. Collection of data
2. Preparation of data
3. Input of data
4. Processing of data
5. Output
6. Storage
1) Collection is the first stage of the cycle, and is very crucial, since the quality of the data collected will impact heavily on the output. The collection process needs to ensure that the data gathered are both defined and accurate, so that subsequent decisions based on the findings are valid. This stage provides both the baseline from which to measure and a target on what to improve.
Some types of data collection include census (data collection about everything in a group or statistical population), sample survey (a collection method that includes only part of the total population), and administrative by-product (data collected as a by-product of an organization's day-to-day operations).
2) Preparation is the manipulation of data into a form suitable for further analysis and processing. Raw data cannot be processed directly and must be checked for accuracy. Preparation is about constructing a dataset from one or more data sources for further exploration and processing. Analyzing data that has not been carefully screened for problems can produce highly misleading results; the quality of the analysis depends heavily on the quality of the prepared data.
3) Input is the task where verified data is coded or converted into machine-readable form so that it can be processed by a computer. Data entry is done through the use of a keyboard, digitizer, scanner, or data entry from an existing source. This time-consuming process requires speed and accuracy. Most data need to follow a formal and strict syntax, since a great deal of processing power is required to break down the complex data at this stage. Due to the costs, many businesses resort to outsourcing this stage.
4) Processing is when the data is subjected to various means and methods of manipulation. This is the point where a computer program is executed; the program contains the code and its current activity. The processing may be made up of multiple threads of execution that simultaneously execute instructions, depending on the operating system.
5) Output is the stage where processed information is transmitted to the user. Output is presented to users in various report formats, such as a printed report, audio, video, or on a monitor. Output needs to be interpreted so that it can provide meaningful information that will guide future decisions of the company.
6) Storage is the last stage in the data processing cycle, where data, instructions, and information are held for future use. The importance of this stage is that it allows quick access and retrieval of the processed information, allowing it to be passed on to the next stage directly when needed. Every computer uses storage to hold system and application software.
Database terms:
File:
A file is a large collection of related data.
Tables:
A table is a collection of data elements organized in terms of rows and columns.
Records:
A single entry in a table is called a record.
Tuple:
A record is also called a tuple.
Fields/attribute:
Each column is identified by a distinct header called attribute or field.
Domain:
The domain is the set of allowed values for an attribute in a column.
An Entity:
An entity can be any object, place, person or class.
Example (EMPLOYEE table):

ID   NAME     AGE   SALARY
1    PRAMOD   26    45000
2    NAVEEN   30    36000
3    RAJU     24    55000

Here each column header (ID, NAME, AGE, SALARY) is an attribute/field, each row is a tuple/record, and the set of values that may appear in a column (for example, the SALARY values) is that attribute's domain.
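As an illustration, such a table might be declared and populated in SQL. This is a minimal sketch: the column names come from the example above, while the data types are assumptions.

-- Sketch of the example table; column types are assumed
CREATE TABLE EMPLOYEE (
    ID     INT PRIMARY KEY,  -- each row (tuple/record) is identified by ID
    NAME   VARCHAR(50),      -- NAME, AGE, SALARY are attributes/fields
    AGE    INT,
    SALARY INT               -- the set of all SALARY values is that column's domain
);

INSERT INTO EMPLOYEE VALUES (1, 'PRAMOD', 26, 45000);
INSERT INTO EMPLOYEE VALUES (2, 'NAVEEN', 30, 36000);
INSERT INTO EMPLOYEE VALUES (3, 'RAJU',   24, 55000);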
Database Management System (DBMS):
A Database Management System, or DBMS for short, refers to the technology of storing and retrieving users' data with utmost efficiency, along with safety and security features. A DBMS allows its users to create their own databases relevant to the nature of the work they want to do.
The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient.
Advantages of DBMS:
1. Minimized redundancy
2. Enforcing data integrity
3. Data sharing
4. Ease of application development
5. Data security
6. Multiple user interfaces
7. Backup and recovery
Minimized redundancy:
Data redundancy in a database means that some data fields are repeated in the database. This repetition may occur either if a field is repeated in two or more tables or if the field is repeated within the same table.
Data can appear multiple times in a database for a variety of reasons. For example, a shop may have the same
customer’s name appearing several times if that customer has bought several different products at different dates.
Enforcing data integrity:
Data integrity refers to the overall completeness, accuracy, and consistency of data. The integrity of the stored data can be lost in different ways; the DBMS helps enforce integrity constraints to prevent this.
Data sharing:
Data sharing is a primary feature of a database management system (DBMS). The DBMS helps create an environment in which end users have better access to more and better-managed data.
Data security:
Data security is the protection of the database from unauthorized users. As the number of users accessing the data increases, the risk to data security increases, but the DBMS provides a framework for better enforcement of data privacy and security policies. Only authorized persons are allowed to access the database.
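For example, in SQL-based systems such access control is typically expressed with GRANT and REVOKE statements. In this hedged sketch, the user name clerk and the EMPLOYEE table are illustrative assumptions:

-- Allow the (hypothetical) user 'clerk' to read, but not change, EMPLOYEE
GRANT SELECT ON EMPLOYEE TO clerk;

-- Later, withdraw that privilege
REVOKE SELECT ON EMPLOYEE FROM clerk;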
Multiple user interfaces:
In order to meet the needs of various users having different levels of technical knowledge, a DBMS provides different types of interfaces, such as a query language, an application program interface, and a graphical user interface.
Backup and recovery:
Most DBMSs provide 'backup and recovery' subsystems that automatically create backups of data and restore data if required.
Data abstraction:
Data abstraction is a process of representing the essential features without including implementation details. Since many database-system users are not computer trained, developers hide the complexity from users through several levels of abstraction to simplify users' interactions with the system. There are three levels of abstraction:
1. Internal level
2. Conceptual level
3. External level
Internal level:
The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.
At this level various aspects are considered to achieve optimal runtime performance and storage space utilization. These aspects include storage space allocation techniques for data and indexes, access paths, data compression and encryption techniques, and record placement.
Conceptual level:
The next-higher level of abstraction describes what data are stored in the database and what relationships exist among those data. The logical level thus describes the entire database in terms of a small number of relatively simple structures.
External level:
The highest level of abstraction describes only part of the entire database. A large database stores a great variety of information, and many users of the database system do not need all of it; instead, they need to access only a part of the database. The view level of abstraction exists to simplify their interaction with the system.
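In SQL, this view level can be realized with CREATE VIEW. The sketch below assumes the EMPLOYEE table from the earlier example and exposes only part of its data to users:

-- Users of this view see names and ages but never salaries
CREATE VIEW EMPLOYEE_PUBLIC AS
SELECT ID, NAME, AGE
FROM EMPLOYEE;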
DBMS users:
Application programmers:
Application programmers are responsible for writing application programs that use the database. These programs could be written in general-purpose programming languages such as Visual Basic, Developer, C, FORTRAN, COBOL, etc., to manipulate the database. These application programs operate on the data to perform various operations such as retrieving information, creating new information, and deleting or changing existing information.
Database administrator:
The DBA is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources as needed.
Database designers:
Database designers identify the data to be stored in the database and choose appropriate structures to represent and store the data. Most of these functions are done before the database is implemented and populated with data. It is the responsibility of the database designers to communicate with all prospective users to understand their requirements and come up with a design that meets them. Database designers interact with all potential users and develop views of the database that meet the data and processing requirements of these groups. The final database must support the requirements of all user groups.
End users:
End users are the people who interact with the database through applications or utilities. The various categories of end users are casual end users, naive (parametric) end users, sophisticated end users, and standalone users.
Data independence:
The ability to modify a scheme definition in one level without affecting a scheme definition in a higher level is
called data independence.
Logical data independence means the ability to change the conceptual schema without having to change the external schemas.
Physical data independence means the ability to change the internal schema without having to change the conceptual schema.
Logical data independence is more difficult to achieve than physical data independence.
Serial file organization:
Serial file organization is the simplest file organization method. The data are collected in the file in the order in which they arrive, which means the file is unordered. Serial files are primarily used as transaction files, in which the transactions are recorded in the order that they occur.
Sequential file organization:
In a sequential file organization, the records are arranged in a particular order, which may be ascending or descending, and are accessed in the predetermined order of their keys.
In sequential file organization the records are stored on media such as magnetic tape, punched cards, or magnetic disks. To access records, the computer must read the file in sequence from the beginning. The first record is read and processed first, then the second record in the sequence, and so on.
Records in a sequential file can be stored in two ways.
Unsorted file: Records are placed one after another as they arrive (no sorting of any kind).
Sorted file: Records are placed in ascending or descending values of the primary key.
Network model:
The first specification of the network data model was presented by the Conference on Data Systems Languages (CODASYL) in 1969.
The network model is very similar to the hierarchical model. In fact, the hierarchical model is a subset of the network model. However, instead of using a single-parent tree hierarchy, the network model uses set theory to provide a tree-like hierarchy in which child tables are allowed to have more than one parent. This allows the network model to support many-to-many relationships.
Advantages:
1. Conceptual simplicity
2. Data independence
Disadvantages:
1. System complexity
Relational model:
The relational model was developed by E.F. Codd in 1970; he is also called the father of the RDBMS. In the relational model, unlike the hierarchical and network models, there are no physical links. All data is maintained in the form of tables consisting of rows and columns. Each row represents an entity and each column represents an attribute. The relationship between two tables is implemented through a common attribute in the tables, not by a physical link.
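A minimal SQL sketch of this idea, with assumed table and column names: the DeptID column is the common attribute that implements the relationship between the two tables, with no physical link involved.

CREATE TABLE DEPARTMENT (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(50)
);

CREATE TABLE EMP (
    EmpID  INT PRIMARY KEY,
    Name   VARCHAR(50),
    DeptID INT REFERENCES DEPARTMENT(DeptID)  -- common attribute, not a physical link
);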
Advantages:
1. The main advantage of this model is its ability to represent data in a simplified format.
2. The process of manipulating records is simplified with the use of certain key attributes used to retrieve data.
Codd's Rules:
Based on the relational model, relational databases were created. Codd proposed 13 rules (numbered 0 to 12), popularly known as Codd's 12 rules, to test DBMS concepts. Codd's rules define what qualities a DBMS requires in order to qualify as a relational database management system (RDBMS).
Rule zero:
This rule states that for a system to qualify as an RDBMS, it must be able to manage the database entirely through its relational capabilities.
Rule 1 (Information rule):
All information (including metadata) is to be represented as stored data in cells of tables. The rows and columns have to be strictly unordered.
Rule 2 (Guaranteed access rule):
Each unique piece of data (atomic value) should be accessible by: table name + primary key (row) + attribute (column).
Rule 3 (Systematic treatment of NULL values):
NULL has several meanings; it can mean missing data, not applicable, or no value. It should be handled consistently. A primary key must not be null. An expression on NULL must give null.
Rule 4 (Active online catalog):
The database dictionary (catalog) must hold a description of the database. The catalog is to be governed by the same rules as the rest of the database, and the same query language is to be used on the catalog as on the application database.
Rule 5 (Comprehensive data sublanguage rule):
One well-defined language must be there to provide all manner of access to the data, for example SQL. If a file supporting a table can be accessed by any means other than the SQL interface, then it is a violation of this rule.
Rule 6 (View updating rule):
All views that are theoretically updatable should be updatable by the system.
Rule 7 (High-level insert, update, and delete):
There must be insert, delete, and update operations at each level of relations. Set operations like union, intersection, and minus should also be supported.
Rule 8 (Physical data independence):
The physical storage of data should not matter to the system. If, say, a file supporting a table is renamed or moved from one disk to another, it should not affect the application.
Rule 9 (Logical data independence):
If there is a change in the logical structure (table structures) of the database, the user's view of the data should not change. Say a table is split into two tables; a new view should give the result as the join of the two tables. This rule is the most difficult to satisfy.
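As a hedged illustration with assumed names: suppose a table STAFF(StaffID, Name, Salary) is split into two tables; a view named like the old table can return their join, so existing queries keep working unchanged.

-- The original STAFF table was split into STAFF_PERSONAL and STAFF_PAY;
-- this view reconstructs the old user view as a join
CREATE VIEW STAFF AS
SELECT p.StaffID, p.Name, s.Salary
FROM STAFF_PERSONAL p
JOIN STAFF_PAY s ON s.StaffID = p.StaffID;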
Rule 10 (Integrity independence):
The database should be able to enforce its own integrity rather than relying on other programs. Key and check constraints, triggers, etc., should be stored in the data dictionary. This also makes the RDBMS independent of the front end.
Rule 11 (Distribution independence):
A database should work properly regardless of its distribution across a network. This lays the foundation of distributed databases.
Rule 12 (Non-subversion rule):
If low-level access is allowed to a system, it should not be able to subvert or bypass integrity rules to change the data. This can be achieved by some sort of locking or encryption.
Normalization of Database:
Database Normalization is a technique of organizing the data in the database. Normalization is a systematic
approach of decomposing tables to eliminate data redundancy and undesirable characteristics like Insertion,
Update and Deletion Anomalies.
The commonly used normal forms are:
1. First Normal Form (1NF)
2. Second Normal Form (2NF)
3. Third Normal Form (3NF)
4. Boyce-Codd Normal Form (BCNF)
First Normal Form (1NF):
As per First Normal Form, no row of data may contain a repeating group of information; that is, each column must hold a single atomic value, and multiple columns must not be used to store the same kind of repeating data. Each table should be organized into rows, and each row should have a primary key that distinguishes it as unique.
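A hedged sketch of the idea, with assumed table and column names: the comment shows a design that violates 1NF by packing a repeating group into one column, and the table below it stores one atomic value per row instead.

-- Violates 1NF: STUDENT(Student_ID, Name, Phones) with Phones = '98401, 98402'
-- In 1NF, each phone number gets its own row with atomic values:
CREATE TABLE STUDENT_PHONE (
    Student_ID INT,
    Phone      VARCHAR(15),
    PRIMARY KEY (Student_ID, Phone)  -- each row is uniquely identified
);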
Second Normal Form (2NF):
Before a table can be in 2NF, it must meet all the requirements of 1NF. As per the Second Normal Form, there must not be any partial dependency of any column on the primary key. This means that, for a table with a concatenated (composite) primary key, each column that is not part of the primary key must depend on the entire concatenated key for its existence. If any column depends only on one part of the concatenated key, then the table fails Second Normal Form.
Consider, for example, a table with the composite primary key [Customer ID, Store ID] and the non-key attribute [Purchase Location]. Here, [Purchase Location] depends only on [Store ID], which is only part of the primary key. Therefore, this table does not satisfy Second Normal Form.
To bring this table to Second Normal Form, we break it into two tables:
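A sketch of that decomposition in SQL (data types and exact column names are assumptions): [Purchase Location] now lives in a table keyed by Store ID alone, so every non-key column depends on the whole key of its table.

CREATE TABLE STORE (
    StoreID          INT PRIMARY KEY,
    PurchaseLocation VARCHAR(100)      -- depends on the whole key (StoreID)
);

CREATE TABLE PURCHASE (
    CustomerID INT,
    StoreID    INT REFERENCES STORE(StoreID),
    PRIMARY KEY (CustomerID, StoreID)  -- the original composite key remains
);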
Third Normal Form (3NF):
Third Normal Form requires that every non-prime attribute of a table be directly dependent on the primary key; any transitive functional dependency should be removed from the table. The table must also be in Second Normal Form.
Consider a student table in which Student_id is the primary key, but Street, City, and State depend upon Zip. The dependency between Zip and the other fields is called a transitive dependency. If there are transitive dependencies, we remove those fields and store them in a separate table. Hence, to apply 3NF, we move Street, City, and State to a new table, with Zip as its primary key.
The result is two tables: a Student table (with Zip as a foreign key) and an Address table (with Zip as its primary key).
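A sketch of these two tables in SQL (column types are assumed, and the Name column is illustrative):

CREATE TABLE ADDRESS (
    Zip    VARCHAR(10) PRIMARY KEY,   -- Street, City, State depend on Zip
    Street VARCHAR(100),
    City   VARCHAR(50),
    State  VARCHAR(50)
);

CREATE TABLE STUDENT (
    Student_ID INT PRIMARY KEY,
    Name       VARCHAR(50),
    Zip        VARCHAR(10) REFERENCES ADDRESS(Zip)  -- transitive dependency removed
);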
Boyce-Codd Normal Form (BCNF):
A table is in Boyce-Codd Normal Form (BCNF) if and only if it is in 3NF and every determinant is a candidate key.
E-R diagram:
An E-R diagram is a visual representation of data that describes how data are related to each other. The entity-relationship data model is based on a perception of a real world that consists of basic objects called entities and relationships among these objects.
1. Entity:
An entity is an object in the real world that is distinguishable from other objects. An Entity can be any object,
place, person or class. In E-R Diagram, an entity is represented using rectangles.
Weak Entity:
A weak entity is an entity that depends on another entity. A weak entity does not have a key attribute of its own. A double rectangle represents a weak entity.
2. Attribute:
An Attribute describes a property or characteristic of an entity. The attributes are useful in describing the
properties of each entity in the entity set. An attribute is represented using ellipse.
Key Attribute:
A key attribute represents the main characteristic of an entity. It is used to represent the primary key. An ellipse with an underlined label represents a key attribute.
Composite Attribute:
An attribute can also have its own attributes. Such attributes are known as composite attributes.
3) Relationship:
A relationship describes how entities are associated with one another. In an E-R diagram, a relationship is represented by a diamond. There are three types of relationships:
1. Binary relationship
2. Recursive relationship
3. Ternary relationship
Binary relationship:
A binary relationship is a relationship between two different entities.
Recursive relationship:
A recursive relationship is a relationship between an entity and itself.
Ternary relationship:
A ternary relationship is a relationship among three entities.
SYMBOL                          MEANING
Rectangle                       Entity
Double rectangle                Weak entity
Diamond                         Relation
Ellipse                         Attribute
Ellipse with underlined label   Key attribute
Ellipse connected to ellipses   Composite attribute
Straight line                   Links
Cardinality:
Cardinality is a very important concept in database design. Cardinalities are used when you are creating an E-R diagram and show the relationships between entities/tables. Cardinality specifies how many instances of one entity relate to one instance of another entity.
1. One to one
2. One to many
3. Many to one
4. Many to many
One to one:
An entity from one entity set is associated with at most one entity in another entity set and vice versa.
One to many:
An entity from one entity set is associated with one or more instances of another entity.
Many to one:
Many instances of an entity from one entity set can be associated with a single entity from another entity set.
Many to many:
Many instances of an entity from one entity set are associated with many instances from another entity set.
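In a relational schema, a many-to-many relationship is commonly implemented with a third (junction) table, as in this sketch with assumed names:

-- Each student can take many courses; each course has many students
CREATE TABLE ENROLLMENT (
    Student_ID INT,   -- would reference a STUDENT table in a full schema
    Course_ID  INT,   -- would reference a COURSE table in a full schema
    PRIMARY KEY (Student_ID, Course_ID)
);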
Keys:
A key is a column or attribute (or combination of columns) of a database table. Keys ensure that each record within a table can be uniquely identified by one field or a combination of fields within the table.
1. Super key
2. Candidate key
3. Primary key
4. Foreign key
5. Composite key
6. Alternate key
Super key:
A Super key is any combination of fields within a table that uniquely identifies each record within that table.
Candidate key:
A candidate key is a subset of a super key. A candidate key is a single field or the least combination of fields that uniquely identifies each record in the table. The least combination of fields distinguishes a candidate key from a super key. Every table must have at least one candidate key, but it can have several.
Once the candidate keys have been identified, one of them can be selected as the primary key.
Primary key:
A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain
a primary key. The primary key is usually the key selected to identify a row when the database is implemented.
As with any candidate key, the primary key must contain unique values, must never be null, and must uniquely identify each record in the table.
Foreign key:
A foreign key is generally a primary key from one table that appears as a field in another where the first table has a
relationship to the second.
Composite key:
A composite key consists of more than one field to uniquely identify a record.
Alternate key:
A table may have one or more choices for the primary key. Collectively these are known as candidate keys. One is
selected as the primary key. Those not selected are known as secondary keys or alternative keys.
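The sketch below (all names are assumptions) shows how these keys commonly map onto SQL constraints: the primary key is the chosen candidate key, an alternate key becomes a UNIQUE constraint, a composite key lists several columns, and a foreign key references another table.

CREATE TABLE DEPT (
    Dept_ID INT PRIMARY KEY
);

CREATE TABLE WORKER (
    Worker_ID INT PRIMARY KEY,              -- primary key (chosen candidate key)
    Email     VARCHAR(100) UNIQUE,          -- alternate key (unchosen candidate key)
    Dept_ID   INT REFERENCES DEPT(Dept_ID)  -- foreign key
);

-- A composite key uses more than one field:
CREATE TABLE ASSIGNMENT (
    Worker_ID  INT REFERENCES WORKER(Worker_ID),
    Project_ID INT,
    PRIMARY KEY (Worker_ID, Project_ID)     -- composite primary key
);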
Relational algebra:
Relational algebra is a procedural query language. It consists of a set of operations that are used to manipulate the relational model. These operations take one or more relations as input and produce a new relation as their result.
1. Select
2. Project
3. Cartesian product
4. Rename
5. Union
6. Set difference
7. Set intersection
8. Join
9. Division
10. Assignment
Select operation: (σ)
The select operation enables the user to specify a basic retrieval request; the result of the retrieval is a new relation, which may have been formed from one or more relations.
The select operation is used to select a subset of the tuples from a relation that satisfy a selection condition. It is written as:
σ<selection condition>(R)
where σ denotes the select operation and R is a relation (table).
Example:
σsalary>2500(EMPLOYEE)
Project operation: (π)
The project operation selects certain attributes (columns) from a relation (table) and discards the other columns; duplicate rows are eliminated. It is written as:
π<attribute list>(R)
where π denotes the project operation and R is a relation (table).
Example:
πEmpname, salary(EMPLOYEE)
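Assuming the EMPLOYEE relation used in these examples (with Empname and salary columns), the two operations correspond roughly to the WHERE clause and the column list of an SQL SELECT statement:

-- σ salary>2500 (EMPLOYEE): choose the rows that satisfy the condition
SELECT * FROM EMPLOYEE WHERE salary > 2500;

-- π Empname, salary (EMPLOYEE): keep only the named columns;
-- DISTINCT mirrors the project operation's elimination of duplicate rows
SELECT DISTINCT Empname, salary FROM EMPLOYEE;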
Cartesian product operation: (×)
This is a binary operation, also known as cross product or cross join. It is used to combine tuples from two relations in a combinatorial fashion. In general, the result of R(A1, A2, ..., An) × S(B1, B2, ..., Bm) is a relation Q with n + m attributes.
Rename operation: (ρ)
The rename operation can rename either the relation name or the attribute names, or both; it is a unary operator.
Union operation: (∪)
The result of this operation, denoted by R ∪ S, is the relation that includes all tuples that are in R, in S, or in both R and S. Duplicate tuples are eliminated.
Set difference: (−)
The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S.
Set intersection: (∩)
The result of this operation, denoted by R ∩ S, is a relation that includes all tuples that are in both R and S.
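In SQL these set operations appear as UNION, EXCEPT (called MINUS in some systems), and INTERSECT. A sketch with two hypothetical union-compatible relations R and S:

SELECT Name FROM R UNION     SELECT Name FROM S;  -- R ∪ S, duplicates removed
SELECT Name FROM R EXCEPT    SELECT Name FROM S;  -- R − S
SELECT Name FROM R INTERSECT SELECT Name FROM S;  -- R ∩ S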
Join: (⋈)
Natural join:
The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one
operation.
The natural join operation forms a Cartesian product of its two arguments, performs a selection forcing equality
on those attributes that appear in both relation schemes and finally removes duplicate columns.
Outer joins:
An outer join does not require each record in the two joined tables to have a matching record. The joined
table retains each record—even if no other matching record exists.
Left outer join: The result contains all the rows from the first source and the corresponding values from the second source (or empty values for non-matching keys).
Right outer join: The result contains all the rows from the second source and the corresponding values from the first source (or empty values for non-matching keys).
Full outer join: The result contains all the rows from both sources (with empty values for non-matching keys).
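A sketch of the three outer joins in SQL, using two hypothetical tables EMP and DEPT that share a DeptID column:

-- Left outer join: every EMP row, with DEPT values or NULLs
SELECT * FROM EMP LEFT OUTER JOIN DEPT ON EMP.DeptID = DEPT.DeptID;

-- Right outer join: every DEPT row, with EMP values or NULLs
SELECT * FROM EMP RIGHT OUTER JOIN DEPT ON EMP.DeptID = DEPT.DeptID;

-- Full outer join: every row from both sides
SELECT * FROM EMP FULL OUTER JOIN DEPT ON EMP.DeptID = DEPT.DeptID;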
Data warehouse:
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making process.
The key features of a data warehouse, namely subject-oriented, integrated, nonvolatile, and time-variant, are discussed below:
Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
Non Volatile - Nonvolatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database; therefore, frequent changes in the operational database are not reflected in the data warehouse.
Advantages:
The data warehouse provides many advantages to end-users.
Data mining:
Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data.
Data mining is a logical process of searching through large amounts of data to find important information. It proceeds in three stages:
Stage 1: Exploration:
In this stage one explores and prepares the data. The goal of the exploration stage is to find the important variables and determine their nature.
Stage 2: Pattern identification:
Searching for patterns and choosing the pattern which allows the best prediction is the primary action in this stage.
Stage 3: Deployment:
This stage cannot be reached until a consistent, highly predictive pattern is found in stage 2. The pattern found in stage 2 can then be applied to new data to see whether the desired outcome is achieved.