Data Base Concepts
CHAPTER 13
DATABASE CONCEPTS
OBJECTIVES
13.1 Introduction
Large and complex data are collected, entered, stored and accessed according to
the users' needs, usually through queries. Specialized software, often highly
secure and complex, is developed and used for this purpose. This chapter
illustrates the basic database concepts.
Today databases are used everywhere and in all walks of life. The database
is becoming the backbone of all kinds of software, from standalone,
client-server and online applications to mainframes and supercomputers, for
example in business, government agencies and service organizations. Wherever
data and information are required, database software is used and the results
are presented in the form of reports or graphical representations, which are
easier to understand, so that future activities can be carried out based on
these reports.
One area where a database is used is schools, for example in fee payment
software. The basic details required for a new student admission, such as name,
student id, admission no, father's name, class, section, amount and date of
entry into the school, are entered and saved. Activities such as admission
fees/annual fees, monthly fees and late payments are then recorded and the
respective reports are generated.
DATABASE
A fact is something that has really occurred or is actually the case. Things
that are happening or have happened, in real or virtual form, are considered
to be facts. Facts are perceived by the sense organs.
Converting facts to data is the first task for any person working with
databases. Human intervention is mandatory for converting facts into data. The
data may be in the form of letters, numbers, symbols, images, sound, video etc.
A fact can originate within the organization, outside the organization or in
any part of the universe. For example, if the marks obtained by a student in
an exam are 80, then 80 is the fact; the 80 entered on the marks card in
numeric symbols is the data.
The different forms of data and their representation are illustrated. One such
form is sound represented using musical notes (software generated); these notes
are stored in the form of bytes in the sound file (software digitized). In the
hardware, data is stored in the form of bits.
For example, the total marks, percentage and result on the student's marks
card are processed data, known as information.
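The step from data to information can be sketched in a few lines of Python. The subject names, the marks, the maximum of 100 per subject and the pass mark of 35 are all hypothetical values chosen for illustration; a real marks card may follow different rules.

```python
# Raw data: marks as entered on the marks card (hypothetical values).
marks = {"English": 80, "Maths": 72, "Science": 64}

MAX_MARKS_PER_SUBJECT = 100  # assumption, not taken from the chapter
PASS_MARK = 35               # assumption, not taken from the chapter


def process_marks(marks, max_marks=MAX_MARKS_PER_SUBJECT, pass_mark=PASS_MARK):
    """Turn raw marks (data) into total, percentage and result (information)."""
    total = sum(marks.values())
    percentage = 100.0 * total / (max_marks * len(marks))
    # A student passes only if every subject reaches the pass mark.
    result = "PASS" if all(m >= pass_mark for m in marks.values()) else "FAIL"
    return {"total": total, "percentage": percentage, "result": result}


info = process_marks(marks)
print(info)
```

The dictionary `marks` plays the role of data, while the values returned by `process_marks` are the processed data, that is, the information.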
Initially, the computer files within the file system were similar to the manual
files. The description of computer files requires a specialized vocabulary.
Every discipline develops its own terminology to enable its practitioners to
communicate clearly.
Differences between Manual and Computerized data processing
1. Data input – This is any kind of data (letters, numbers, symbols, shapes,
images or whatever raw material) put into the computer system that needs
processing. Input data is put into the computer using a keyboard, mouse or
other devices such as the scanner, microphone and digital camera. In general,
data must be converted to a computer-understandable form (English to machine
code by the input devices).
3. Storage - Data and information not currently being used must be stored so it
can be accessed later. There are two types of storage; primary and secondary
storage. Primary storage is the computer circuitry that temporarily holds data
4. Output – The result (information) obtained after processing the data must
be presented to the user in user-understandable form. The result may be in the
form of reports (hardcopy/softcopy). Some of the output can be animated with
sound and video/picture.
Fig. 13.2 Data Processing cycle
File : A file is the basic unit of storage in a computer system. A file is a
large collection of related data.
The primary goal of a DBMS is to provide a way to store and retrieve database
information that is both convenient and efficient.
Serial File Organization : With serial file organization, records are arranged
one after another, in no particular order other than the chronological order in
which records are added to the file. Serial organization is commonly found in
transaction data, where records are created in a file in the order in which
transactions take place. Serial file organization provides advantages like fast
Two-tier Client / Server architecture is used for User Interface programs and
Application Programs that run on the client side. An interface called ODBC
(Open Database Connectivity) provides an API that allows client-side programs
to call the DBMS. Most DBMS vendors provide ODBC drivers. A client program may
connect to several DBMSs. In this architecture some variation of the client is
also possible; for example, in some DBMSs more functionality is transferred to
the client, including data dictionary, optimization etc. Such clients are
called Data server.
structures and conceptual tools that are used to describe the structure (data
types, relationships, and constraints) of a database.
A data model not only describes the structure of the data, it also defines a set
of operations that can be performed on the data. A data model generally
consists of data model theory, which is a formal description of how data may
be structured and used, and data model instance, which is a practical data
model designed for a particular application. The process of applying a data
model theory to create a data model instance is known as data modeling.
A Database model defines the logical design of data. The model describes the
relationships between different parts of the data.
In the history of database design, three models have been in use.
* Hierarchical Model
* Network Model
* Relational Model
13.11.1 Hierarchical Model
The hierarchical data model is the oldest type of data model, developed by IBM
in 1968. This data model organizes the data in a tree-like structure, in which
each child node (also known as dependents) can have only one parent node.
The database based on the hierarchical data model comprises a set of records
connected to one another through links. The link is an association between two
or more records. The top of the tree structure consists of a single node that does
not have any parent and is called the root node.
The root may have any number of dependents; each of these dependents may
have any number of lower level dependents. Each child node can have only one
parent node and a parent node can have any number of (many) child nodes. It,
therefore, represents only one-to-one and one-to-many relationships. The
collection of same type of records is known as a record type.
For simplicity, only a few fields of each record type are shown. One complete
record of each record type represents a node.
In this model each entity has only one parent but can have several children. At
the top of the hierarchy there is only one entity, which is called the Root.
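The one-parent rule of the hierarchical model can be illustrated with a small Python sketch. The class `Node`, the helper `path_from_root` and the school records are hypothetical names invented for this example; they do not describe how any particular hierarchical DBMS stores its records.

```python
class Node:
    """One record in a hierarchy: at most one parent, any number of children."""

    def __init__(self, name):
        self.name = name
        self.parent = None   # the root node keeps parent = None
        self.children = []

    def add_child(self, child):
        # Enforce the hierarchical rule: a child may have only one parent.
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a parent")
        child.parent = self
        self.children.append(child)
        return child


def path_from_root(node):
    """Follow the single parent link upward; the access path is unique."""
    path = []
    while node is not None:
        path.append(node.name)
        node = node.parent
    return list(reversed(path))


# Hypothetical hierarchy: School -> Class 10 -> Section A -> Student 101
school = Node("School")
class_10 = school.add_child(Node("Class 10"))
section_a = class_10.add_child(Node("Section A"))
student = section_a.add_child(Node("Student 101"))

print(path_from_root(student))
```

Because every node has exactly one parent, there is only one path from the root to any record, which is why access in this model is predictable.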
Hierarchical Model Example
Advantage : In the hierarchical data model, data access is quite predictable
in the structure and, therefore, both retrieval and updates can be highly
optimized by the DBMS.
Disadvantage : The main drawback of this model is that the links are ‘hard
coded’ into the data structure, that is, the link is permanently established
and cannot be modified. The hard coding makes the hierarchical model rigid. In
addition, the physical links make it difficult to expand or modify the
database, and the changes require substantial redesigning efforts.
hierarchical data model, the data is organized in the form of trees and in network
data model, the data is organized in the form of graphs.
In the network model, entities are organized in a graph, in which some
entities can be accessed through several paths.
Advantage : In the network data model, a parent node can have many child nodes
and a child can also have many parent nodes. Thus, the network model permits
the modeling of many-to-many relationships in data.
Disadvantage : In the network data model, it can be quite complicated to
maintain all the links, and a single broken link can lead to problems in the
database. In addition, since there are no restrictions on the number of
relationships, the database design can become complex.
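A minimal Python sketch of the network model follows. The class `NetworkDB`, its methods and the college records are hypothetical and serve only to show that a child may have several parents and can therefore be reached through several paths. The sketch assumes that the links contain no cycles.

```python
class NetworkDB:
    """Records linked as a graph: a child may have many parents."""

    def __init__(self):
        self.children = {}  # parent -> set of children
        self.parents = {}   # child  -> set of parents

    def link(self, parent, child):
        self.children.setdefault(parent, set()).add(child)
        self.parents.setdefault(child, set()).add(parent)

    def paths(self, start, target):
        """Return every path from start to target (assumes no cycles)."""
        if start == target:
            return [[start]]
        found = []
        for nxt in sorted(self.children.get(start, ())):
            for tail in self.paths(nxt, target):
                found.append([start] + tail)
        return found


# Hypothetical records: a student is linked to both a teacher and a course.
db = NetworkDB()
db.link("College", "Teacher")
db.link("College", "Course")
db.link("Teacher", "Student")
db.link("Course", "Student")

print(db.parents["Student"])
print(db.paths("College", "Student"))
```

Every call to `link` has to keep two structures consistent, which hints at why maintaining links becomes complicated as the number of relationships grows.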
the relational model has become more programmer friendly and much more
dominant and popular in both industrial and academic scenarios. Oracle, Sybase,
DB2, Ingres, Informix and MS-SQL Server are a few of the popular relational
DBMSs.
You can clearly see here that the student name Daryl is used twice in the table
and the subject maths is also repeated. This violates the First Normal Form. To
reduce the above table to First Normal Form, break the table into two different
tables.
A table to be normalized to Second Normal Form should meet all the requirements
of First Normal Form, and there must not be any partial dependency of any
column on the primary key. It means that for a table that has a concatenated
primary key, each column in the table that is not part of the primary key must
depend upon the entire concatenated key for its existence. If any column
depends only on one part of the concatenated key, then the table fails Second
Normal Form. For example, consider a table which is not in Second Normal Form.
Fig 13.19 Student table with 1NF rule rewritten
Library Table :
Library_id  S_Name  Issue_id  Issue_name  Book_detail
101         RAMU    10        Rakesh      C++
102         RAMU    11        Rakesh      Java
103         Zama    12        Gopal       MATHS
104         SATISH  13        Gopal       MATHS
To reduce Library table to Second Normal form break the table into
following three different tables.
Library_id  S_Name
101         RAMU
102         RAMU
103         Zama
104         SATISH
Issue_Detail Table :
Issue_id  Issue_name
10        Rakesh
11        Rakesh
12        Gopal
13        Gopal
Book_Detail Table :
Fig 13.20 2NF Library Table with primary key and Issue key
Now all three tables comply with Second Normal Form.
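Whether a column depends on another column can be tested against sample rows. The Python sketch below uses the rows of the Library table; the helper names `holds` and `project` are hypothetical. A dependency that holds in a handful of rows is only evidence: real functional dependencies come from the rules of the application, not from the data that happens to be stored.

```python
# Rows of the Library table from the example.
library = [
    {"Library_id": 101, "S_Name": "RAMU", "Issue_id": 10,
     "Issue_name": "Rakesh", "Book_detail": "C++"},
    {"Library_id": 102, "S_Name": "RAMU", "Issue_id": 11,
     "Issue_name": "Rakesh", "Book_detail": "Java"},
    {"Library_id": 103, "S_Name": "Zama", "Issue_id": 12,
     "Issue_name": "Gopal", "Book_detail": "MATHS"},
    {"Library_id": 104, "S_Name": "SATISH", "Issue_id": 13,
     "Issue_name": "Gopal", "Book_detail": "MATHS"},
]


def holds(rows, lhs, rhs):
    """True if lhs -> rhs holds in rows: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def project(rows, attrs):
    """Keep only the given columns and drop duplicate rows."""
    out = []
    for row in rows:
        item = {a: row[a] for a in attrs}
        if item not in out:
            out.append(item)
    return out


print(holds(library, ["Issue_id"], ["Issue_name"]))   # Issue_name depends on Issue_id
print(holds(library, ["S_Name"], ["Book_detail"]))    # RAMU has two different books
issue_detail = project(library, ["Issue_id", "Issue_name"])
print(issue_detail)
```

Projecting the columns `Issue_id` and `Issue_name` reproduces the rows of the Issue_Detail table shown above.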
Third Normal Form requires that every non-prime attribute of a table depend
directly on the primary key. Any transitive functional dependency should be
removed from the table. The table must also be in Second Normal Form.
For example, consider a table with following fields.
Student_Detail Table :
In this table Student_id is the primary key, but street, city and state depend
upon pin. The dependency between pin and the other fields is called a
transitive dependency. Hence, to apply 3NF, we need to move street, city and
state to a new table, with pin as the primary key.
Address Table :
3NF does not deal satisfactorily with the case of a relation with overlapping
candidate keys.
A relation is in BCNF if, and only if, every determinant is a candidate key.
R(a,b,c,d)
a,c -> b,d
a,d -> b
Here, the first determinant suggests that the primary key of R could be changed
from a,b to a,c. If this change were made, all of the non-key attributes present
in R could still be determined, and therefore this change is legal. However,
the second determinant indicates that a,d determines b, but a,d could not be
the key of R, as a,d does not determine all of the non-key attributes of R (it
does not determine c). We would say that the first determinant is a candidate
key, but the second determinant is not a candidate key, and thus this relation
is not in BCNF (but is in 3rd normal form).
A relation R(X) is in Boyce–Codd Normal Form if for every non-trivial
functional dependency Y->Z defined on it, Y contains a key K of R(X). That is,
Y is a superkey for R(X).
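The BCNF test can be sketched in Python by computing attribute closures. The sketch assumes that the two dependencies of the example are a,c -> b,d and a,d -> b, which is how the discussion above describes the two determinants. The function names `closure` and `bcnf_violations` are hypothetical.

```python
def closure(attrs, fds):
    """Return every attribute determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know all of lhs, we also know all of rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List the non-trivial dependencies whose determinant is not a superkey."""
    bad = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        if not trivial and closure(lhs, fds) != set(relation):
            bad.append((lhs, rhs))
    return bad


R = "abcd"
FDS = [("ac", "bd"), ("ad", "b")]   # assumed reading of the example

print(sorted(closure("ac", FDS)))   # a,c determines every attribute of R
print(sorted(closure("ad", FDS)))   # a,d never reaches c
print(bcnf_violations(R, FDS))
```

The determinant a,d fails the test because its closure does not contain c, which matches the argument given for why R is not in BCNF.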
1) Entity
An Entity can be any object, place, person or class. In E-R Diagram, an entity is
represented using rectangles. Consider an example of an Organization.
Employee, Manager, Department, Product and many more can
2) Attribute
Composite Attribute : An attribute can also have its own attributes. Such an
attribute is known as a composite attribute.
3) Relationship
A Relationship describes relations between entities. Relationship is
represented using diamonds.
Relationship Relationship example
Binary Relationship
one-to-one example
2. One to Many : It reflects the business rule that one entity is associated
with many instances of another entity. For example, a Student enrolls for only
one Course, but a Course can have many Students.
One to Many
The arrows in the diagram show that one student can enroll for only one
course.
3. Many to Many : The above diagram represents that many students can
enroll for more than one course.
Many to Many
13.13.3 Cardinality
Generalization Specialization
Fig 13.37 Aggregation
Aggregation : Aggregation is a process in which the relation between two
entities is treated as a single entity. Here the relation between Center and
Course acts as an entity in relation with Visitor.
13.14 Keys
The word “key” is used in the context of relational database design. Keys are
used to establish and identify relations between tables. A key is a set of one
or more columns whose combined values are unique among all occurrences in a
given table.
3. Alternate key/Secondary key (sk) : The alternate keys of any table are simply
those candidate keys which are not currently selected as the primary key.
In other words, the alternate keys are all the candidate keys minus the
primary key.
5. Super Key : A super key is any set of columns for which no two rows share
the same combination of values, that is, an attribute or set of attributes
that uniquely identifies a tuple within a relation/table. Every super key
contains a candidate key; a candidate key is a minimal super key.
6. Foreign key(fk) :A foreign key is a field in a relational table that matches
the primary key column of another table. The foreign key can be used to
cross-reference tables.
Table Employees
Employee_id  Name     Age  city    Salary  Car_loan_id
1            Rajappa  42   Tumkur  42000   585
Table BMWcars
Car_loan_id  Model  Loanamount  EMI    No_of_EMI  Balance
585          Basic  1800000     80000  225        1000000
1. Composite Key : A key that consists of two or more attributes that together
uniquely identify an entity occurrence is called a Composite key. No single
attribute that makes up the Composite key is a key on its own.
Example: Consider a Relation or Table R. Let A,B,C,D,E be the attributes of
this relation.
R(A,B,C,D,E)
A -> BCDE This means the attribute 'A' uniquely determines the other attributes
B,C,D,E.
BC -> ADE This means the attributes 'BC' jointly determine all the other
attributes A,D,E in the relation.
Primary Key :A
Candidate Keys :A, BC
Super Keys : A,BC,ABC,AD
Note : ABC,AD are not Candidate Keys since both are not minimal super keys.
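The super keys and candidate keys of R(A,B,C,D,E) can be derived from the two dependencies with a short Python sketch. The function names `closure`, `super_keys` and `candidate_keys` are hypothetical. The sketch tries every subset of attributes, which is fine for five attributes but would be too slow for wide tables.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute determined by attrs under the dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def super_keys(relation, fds):
    """Every attribute set whose closure covers the whole relation."""
    attrs = sorted(relation)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if closure(combo, fds) == set(attrs):
                keys.append("".join(combo))
    return keys


def candidate_keys(relation, fds):
    """Minimal super keys: no proper subset is itself a super key."""
    supers = [set(k) for k in super_keys(relation, fds)]
    return ["".join(sorted(k)) for k in supers
            if not any(other < k for other in supers)]


R = "ABCDE"
FDS = [("A", "BCDE"), ("BC", "ADE")]

print(candidate_keys(R, FDS))
print(super_keys(R, FDS))
```

The list printed by `super_keys` is longer than the four super keys named in the example, because any set of attributes that contains A, or contains both B and C, is a super key.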
Relational Algebra is :
* the formal description of how a relational database operates
* an interface to the data stored in the database itself
* the mathematics which underpin SQL operations
Operators in relational algebra are not necessarily the same as SQL operators,
even if they have the same name. For example, the SELECT statement exists in
SQL, and also exists in relational algebra. These two uses of SELECT are not the
same. The DBMS must take whatever SQL statements the user types in and
translate them into relational algebra operations before applying them to the
database.
Terminology
Relation – a set of tuples.
Tuple – a collection of attributes which describe some real world entity.
Attribute – a real world role played by a named domain.
Domain – a set of atomic values.
Set – a mathematical definition for a collection of objects which
contains no duplicates.
Operators – Write
INSERT – provides a list of attribute values for a new tuple in a relation.
This operator is the same as SQL.
DELETE – provides a condition on the attributes of a relation to
determine which tuple(s) to remove from the relation. This operator is the
same as SQL.
MODIFY – changes the values of one or more attributes in one or more
tuples of a relation, as identified by a condition operating on the attributes of
the relation. This is equivalent to SQL UPDATE.
Operators – Retrieval
There are two groups of operations:
Mathematical set theory based relations:
UNION, INTERSECTION, DIFFERENCE, and CARTESIAN PRODUCT.
SELECT EMPNO
FROM EMPLOYEE
WHERE DEPTNO=1;
Fig.13.41 Select and project
Figure : Mapping select and project
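The mapping between the SQL statement and the two relational operators can be sketched in Python. A relation is modelled here as a list of dictionaries; the rows of `EMPLOYEE` and the helper names `select` and `project` are hypothetical, and only the column names EMPNO and DEPTNO come from the SQL statement.

```python
# Hypothetical EMPLOYEE relation; SURNAME is an invented extra column.
EMPLOYEE = [
    {"EMPNO": 1, "SURNAME": "Urs", "DEPTNO": 1},
    {"EMPNO": 2, "SURNAME": "Rao", "DEPTNO": 2},
    {"EMPNO": 3, "SURNAME": "Das", "DEPTNO": 1},
]


def select(relation, predicate):
    """Relational SELECT: keep the tuples (rows) that satisfy the condition."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Relational PROJECT: keep the named columns and drop duplicate tuples."""
    seen, out = set(), []
    for row in relation:
        item = tuple(row[a] for a in attributes)
        if item not in seen:
            seen.add(item)
            out.append(dict(zip(attributes, item)))
    return out


# WHERE DEPTNO=1 maps to select; the column list EMPNO maps to project.
result = project(select(EMPLOYEE, lambda r: r["DEPTNO"] == 1), ["EMPNO"])
print(result)
```

The relational SELECT corresponds to the WHERE clause of the SQL statement, while the column list after the SQL keyword SELECT corresponds to PROJECT. This is one reason the two uses of the word SELECT are not the same.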
Set Operations – semantics
Consider two relations R and S.
UNION of R and S
the union of two relations is a relation that includes all the tuples that are
either in R or in S or in both R and S. Duplicate tuples are eliminated.
INTERSECTION of R and S
the intersection of R and S is a relation that includes all tuples that are
both in R and S.
DIFFERENCE of R and S
the difference of R and S is the relation that contains all the tuples that are
in R but that are not in S.
For set operations to function correctly the relations R and S must be union
compatible. Two relations are union compatible if
the domain of each attribute in column order is the same in both R and
S.
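The three set operations map directly onto Python sets of tuples. In the sketch below the relations `R` and `S` and their rows are hypothetical, and `union_compatible` is a simplified stand-in for a real domain check: it only compares the number of columns and the Python type in each column position.

```python
def union_compatible(r, s):
    """Simplified check: same column count and same type per column position."""
    signatures = {tuple(type(v) for v in row) for row in r | s}
    return len(signatures) <= 1


def _check(r, s):
    if not union_compatible(r, s):
        raise ValueError("relations are not union compatible")


def union(r, s):
    _check(r, s)
    return r | s          # a set keeps one copy of duplicate tuples


def intersection(r, s):
    _check(r, s)
    return r & s


def difference(r, s):
    _check(r, s)
    return r - s          # tuples in r that are not in s


# Hypothetical relations with columns (id, name).
R = {(1, "Anil"), (2, "Bina"), (3, "Chitra")}
S = {(2, "Bina"), (4, "Dev")}

print(union(R, S))
print(intersection(R, S))
print(difference(R, S))
```

Note that `difference(R, S)` and `difference(S, R)` give different results, unlike union and intersection.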
UNION Example
Fig.13.42 Union
INTERSECTION Example
Fig.13.43 Intersection
DIFFERENCE Example
Fig.13.44 Difference
CARTESIAN PRODUCT
The Cartesian Product is also an operator which works on two sets. It is
sometimes called the CROSS PRODUCT or CROSS JOIN.
It combines the tuples of one relation with all the tuples of the other relation.
CARTESIAN PRODUCT example
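As a minimal sketch, the Cartesian product can be written in Python with `itertools.product` from the standard library. The two relations and their rows are hypothetical.

```python
from itertools import product


def cartesian_product(r, s):
    """Pair every tuple of r with every tuple of s."""
    return [left + right for left, right in product(r, s)]


# Hypothetical relations: employees (empno, surname) and departments (dname).
employees = [(1, "Urs"), (2, "Rao")]
departments = [("Sales",), ("HR",), ("IT",)]

pairs = cartesian_product(employees, departments)
for row in pairs:
    print(row)
```

With 2 tuples in the first relation and 3 in the second, the result has 2 × 3 = 6 tuples, which shows how quickly the product grows.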
Natural Join
Invariably the JOIN involves an equality test, and thus is often described as an
equi-join. Such joins result in two attributes in the resulting relation having
exactly the same value. A ‘natural join’ will remove the duplicate attribute(s).
In most systems a natural join will require that the attributes have the
same name to identify the attribute(s) to be used in the join. This may
require a renaming mechanism.
If you do use natural joins make sure that the relations do not have two
attributes with the same name by accident.
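A natural join can be sketched in Python as follows. Relations are modelled as lists of dictionaries; the relation names, the rows and the helper `natural_join` are hypothetical. The join attributes are found by name, exactly as described above, which is why an accidental shared name would silently change the result.

```python
def natural_join(r, s):
    """Join on all attributes that have the same name in both relations."""
    if not r or not s:
        return []
    common = [a for a in r[0] if a in s[0]]
    out = []
    for left in r:
        for right in s:
            # Equality test on every shared attribute (the equi-join part).
            if all(left[a] == right[a] for a in common):
                # Merging the dictionaries keeps one copy of each shared attribute.
                out.append({**left, **right})
    return out


# Hypothetical relations sharing the attribute name "depno".
employee = [
    {"empno": 1, "surname": "Urs", "depno": 1},
    {"empno": 2, "surname": "Rao", "depno": 2},
    {"empno": 3, "surname": "Das", "depno": 3},
]
department = [
    {"depno": 1, "dname": "Sales"},
    {"depno": 2, "dname": "HR"},
]

joined = natural_join(employee, department)
for row in joined:
    print(row)
```

Employee 3 does not appear in the result because department 3 has no matching tuple. If the two relations shared no attribute name at all, the condition would always be true and the sketch would return the Cartesian product.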
OUTER JOINs
Notice that much of the data is lost when applying a join to two relations. In
some cases this lost data might hold useful information. An outer join retains
the information that would have been lost from the tables, replacing missing
data with nulls.
There are three forms of the outer join, depending on which data is to be kept.
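The three forms are commonly called the left, right and full outer join. The Python sketch below shows only the left form, which keeps every tuple of the first relation. The helper `left_outer_join`, the relation names and the rows are hypothetical, and Python's `None` stands in for a database null.

```python
def left_outer_join(r, s, on):
    """Keep every tuple of r; fill the columns of s with None when unmatched."""
    s_attrs = [a for a in s[0] if a != on] if s else []
    out = []
    for left in r:
        matches = [right for right in s if right[on] == left[on]]
        if matches:
            out.extend({**left, **right} for right in matches)
        else:
            # No partner in s: retain the tuple and pad with nulls.
            out.append({**left, **{a: None for a in s_attrs}})
    return out


# Hypothetical relations; department 3 has no entry in "department".
employee = [
    {"empno": 1, "surname": "Urs", "depno": 1},
    {"empno": 2, "surname": "Rao", "depno": 2},
    {"empno": 3, "surname": "Das", "depno": 3},
]
department = [
    {"depno": 1, "dname": "Sales"},
    {"depno": 2, "dname": "HR"},
]

rows = left_outer_join(employee, department, on="depno")
for row in rows:
    print(row)
```

A right outer join is the same operation with the two relations swapped, and a full outer join keeps the unmatched tuples of both relations.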
JOIN example 2
From the example, one can see that for complicated cases a large amount of
the answer is formed from operator names, such as PROJECT and JOIN. It is
therefore commonplace to use symbolic notation to represent the operators.
Usage
The symbolic operators are used as with the verbal ones. So, to find all
employees in department 1:
SELECT depno = 1 (employee)
becomes σ depno = 1 (employee)
Conditions can be combined together using ^ (AND) and v (OR). For example,
all employees in department 1 called ‘URS’:
Rename Operator
π emp2.surname, emp2.forenames (
σ employee.empno = 3 ^ employee.depno = emp2.depno (
employee × (ρ emp2 employee) ) )
Derivable Operators
Equivalences
A × B ⇔ B × A
A ∩ B ⇔ B ∩ A
A ∪ B ⇔ B ∪ A
π a1 (A) ⇔ π a1 (π a1,etc (A))
When any query is submitted to the DBMS, its query optimizer tries to find the
most efficient equivalent expression before evaluating it.
were able to bring in data from a range of different data sources, such as,
mainframe computer, minicomputer, as well as personal computer and office
automation software such as spreadsheets and integrate this information in a
single place. This capability, coupled with user-friendly reporting tools, and
freedom from operational impacts has led to a growth of this type of computer
system.
Data warehouses have evolved through several fundamental stages, such as:
Offline operational databases – Data warehouses in this initial stage are
developed by simply copying the database of an operational system to an
off-line server, where the processing load of reporting does not impact the
operational system’s performance.
Data Sources: Data sources refer to any electronic repository of information
that contains data of interest for management use or analytics, from mainframe
databases (IBM DB2, ISAM, Adabas, etc.), client-server databases (e.g. Oracle
database, Informix, Microsoft SQL Server etc.) and PC databases (e.g. Microsoft
Access) to ESS and other electronic stores of data. Data needs to be passed
from these systems to the data warehouse either on a transaction-by-transaction
basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly)
for offline data warehouses.
Data transformation: The data transformation layer receives data from the data
sources, cleans and standardizes it, and loads it into the data repository.
This is often called “staging” data, as data often passes through a temporary
database whilst it is being transformed. This activity of transforming data can
be performed either by manually created code or by a specific type of software
called an Extract, Transform and Load (ETL) tool.
Metadata: Metadata or “Data about data” is used to inform operators and users
of the data warehouse about its status and the information held within the
data warehouse.
Data mining analysis tends to work from the data up, and the best techniques
are those developed with an orientation towards large volumes of data, making
use of as much of the collected data as possible to arrive at reliable
conclusions and decisions.
The analysis process starts with a set of data, uses a methodology to develop an
optimal representation to the structure of the data during which time knowledge
is acquired. Once knowledge has been acquired this can be extended to larger
sets of data working on the assumption that the larger data set has a structure
similar to the sample data. Again this is analogous to a mining operation where
large amounts of low grade materials are sifted through in order to find something
of value.
Some data mining software packages are SPSS, SAS, Think Analytics and G-Sat.
The phases start with the raw data and finish with the extracted knowledge
which was acquired as a result of the following stages:
Selection – Selecting or segmenting the data according to some criteria, e.g.
all those people who own a car; in this way subsets of the data can be
determined.
Preprocessing – This is the data cleaning stage, where certain information
which is deemed unnecessary and may slow down queries is removed, e.g. the
gender of the patient. The data is also reconfigured to ensure a consistent
format, as there is a possibility of inconsistent formats because the data is
drawn from several sources, e.g. gender may be recorded as F or M and also as
1 or 0.
Transformation – The data is not merely transferred, but transformed. E.g.:
demographic overlays commonly used in market research. The data is made
useable and navigable.
Data mining – This stage is concerned with the extraction of patterns from the
data. A pattern can be defined as follows: given a set of facts (data) F, a
language L and some measure of certainty C, a pattern is a statement S in L
that describes relationships among a subset Fs of F with a certainty c, such
that S is simpler in some sense than the enumeration of all the facts in Fs.
Interpretation and Evaluation – The patterns identified by the system are
interpreted into knowledge which can be used to support human decision-making
e.g. prediction and classification tasks, summarizing the content of a database
or explaining observed phenomena.
Summary
> The basic concepts of database that can be used by various users to store
and retrieve the data in standardized format.
> DBMS features, parts, problems and solutions.
> Three database structures.
> Entity relations.
> Various relationships.
> Keys
> Data warehouse, Data mining
Review questions
One mark questions
1. What is data?
2. What is information?
3. What is database?
4. What is a field?
5. What is a record?
6. What is an entity?
7. What is an instance?
8. What is an attribute?
9. What is domain?
10. What is a relation?
11. What is a table?
12. What is normalization?
13. What is a key?
14. Give the symbol notation for project.
15. What is data mining?