Block-04 Introduction To Advanced Database Models

For Unit 13: Please read the following unit of MCS-43 Block 3 Unit 1 (Object Oriented Database).
UNIT 1 OBJECT ORIENTED DATABASE
Structure
1.0 Introduction
1.1 Objectives
1.2 Why Object Oriented Database?
1.2.1 Limitation of Relational Databases
1.2.2 The Need for Object Oriented Databases
1.3 Object Relational Database Systems
1.3.1 Complex Data Types
1.3.2 Types and Inheritances in SQL
1.3.3 Additional Data Types of OOP in SQL
1.3.4 Object Identity and Reference Type Using SQL
1.4 Object Oriented Database Systems
1.4.1 Object Model
1.4.2 Object Definition Language
1.4.3 Object Query Language
1.5 Implementation of Object Oriented Concepts in Database Systems
1.5.1 The Basic Implementation Issues for Object-Relational Database Systems
1.5.2 Implementation Issues of OODBMS
1.6 OODBMS Vs Object Relational Database
1.7 Summary
1.8 Solutions/Answers
1.0 INTRODUCTION
1.1 OBJECTIVES
familiarise yourself with object definition and query languages, and
define object relational and object-oriented databases.
1.2 WHY OBJECT ORIENTED DATABASE?

An object oriented database is used for complex databases. Such database applications require complex interrelationships among object hierarchies to be represented in database systems. These interrelationships are difficult to implement in relational systems. Let us discuss the need for object oriented systems in advanced applications in more detail. However, first, let us discuss the weaknesses of relational database systems.
Relational database technology was not able to handle complex application systems such as Computer Aided Design (CAD), Computer Aided Manufacturing (CAM), Computer Integrated Manufacturing (CIM), Computer Aided Software Engineering (CASE), etc. The limitation of relational databases is that they have been designed to represent entities and relationships in the form of two-dimensional tables. Any complex attribute, such as a multi-valued or composite attribute, may result in the decomposition of a table into several tables; similarly, complex interrelationships result in a number of tables being created. Thus, the main asset of relational databases, viz. their simplicity, is also one of their weaknesses in the case of complex applications.
The objects may be complex, or they may consist of lower-level objects (for example, a window object may consist of many simpler objects like menu bars, scroll bars, etc.). However, to represent the data of these complex objects through relational database models you would require many tables – at least one for each inherited class and a table for the base class. In order to ensure that these tables operate correctly we would need to set up referential integrity constraints as well. On the other hand, object
oriented models would represent such a system very naturally through an inheritance hierarchy. Thus, it is a very natural choice for such complex objects.
Consider a situation where you want to design a class (let us say a Date class). The advantage of object oriented database management for such situations is that it allows representation of not only the structure but also the operations on newer user defined database types, such as finding the difference of two dates. Thus, object oriented database technologies are ideal for implementing systems that support complex inherited objects and user defined data types (that require operations in addition to the standard operations, including operations that support polymorphism).
Another major reason for the need for object oriented database systems is the seamless integration of this database technology with object-oriented applications. Software design is now mostly based on object oriented technologies. Thus, an object oriented database may provide a seamless interface for combining the two technologies.
Object oriented databases are also required to manage complex, highly interrelated information. They provide a solution in the most natural and easy way, one that is closer to our understanding of the system. Michael Brodie related the object oriented system to human conceptualisation of a problem domain, which enhances communication among the system designers, domain experts and the system end users.
The concept of object oriented database was introduced in the late 1970s, however, it
became significant only in the early 1980s. The initial commercial product offerings
appeared in the late 1980s. Today, many object oriented database products are
available like Objectivity/DB (developed by Objectivity, Inc.), ONTOS DB
(developed by ONTOS, Inc.), VERSANT (developed by Versant Object Technology
Corp.), ObjectStore (developed by Object Design, Inc.), GemStone (developed by
Servio Corp.) and ObjectStore PSE Pro (developed by Object Design, Inc.). Object oriented databases are presently being used for various applications in areas such as e-commerce and engineering product data management, and for special purpose databases in areas such as securities and medicine.
Figure 1 traces the evolution of object oriented databases. Figure 2 highlights the strengths of object oriented programming and relational database technologies. An object oriented database system needs to capture the features from both these worlds. Some of the major concerns of object oriented database technologies include access optimisation, integrity enforcement, archival, backup and recovery operations, etc.
The major standard bodies in this area are Object Management Group (OMG), Object
Database Management Group (ODMG) and X3H7.
Figure 2: Strengths of the two technologies – relational databases contribute features such as security, integrity, transactions, concurrency and recovery, while object oriented programming contributes inheritance, encapsulation, object identity, polymorphism and persistence.
Now, the question is, how does one implement an object oriented database system? As shown in Figure 2, an object oriented database system needs to include the features of object oriented programming and relational database systems. Thus, the two most natural ways of implementing them are either to extend the concept of object oriented programming to include database features, giving an OODBMS, or to extend relational database technology to include object oriented features – Object Relational Database Systems. Let us discuss these two, viz. the object relational and object oriented databases, in more detail in the subsequent sections.
1.3 OBJECT RELATIONAL DATABASE SYSTEMS

Object Relational Database Systems are relational database systems that have been enhanced to include the features of the object oriented paradigm. This section provides details of how these newer features have been implemented in SQL. Some of the basic object oriented concepts discussed in this section, in the context of their inclusion into the SQL standards, are complex types, inheritance, and object identity and reference types.
1.3.1 Complex Data Types

In the previous section, we used the term complex data types without defining it. Let us explain it with the help of a simple example. Consider a composite attribute Address. The address of a person in an RDBMS can be represented as a set of simple attributes (House-no, Locality, City, etc.):
When using an RDBMS, such information either needs to be represented as a set of separate attributes as shown above, or as just one string separated by commas or semicolons. The second approach is very inflexible, as it would require complex string operations for extracting information. It also hides the details of the address, thus, it is not suitable.
If we represent the attributes of the address as separate attributes then the problem would be with respect to writing queries. For example, if we need to find the address of a person, we need to specify all the attributes that we have created for the address, viz. House-no, Locality, etc. The question is: is there any better way of representing such information using a single field? If there is such a mode of representation, then that representation should permit the distinguishing of each element of the address. The following may be one such possible attempt:
Thus, Address is now a new type that can be used while showing a database system
scheme as:
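The original listings for the Address type and the scheme that uses it are not reproduced here; a minimal sketch in the SQL:1999 style used later in this unit (all attribute names are assumptions) could be:

CREATE TYPE Address AS (
     house_no VARCHAR(10),
     locality VARCHAR(30),
     city VARCHAR(30),
     state VARCHAR(30),
     pincode CHAR(6)
) NOT FINAL;

CREATE TABLE student (
     name VARCHAR(30),
     address Address,
     programme VARCHAR(10)
);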
Similarly, complex data types may be extended by including a date of birth field (dob) in the scheme discussed above. This complex data type should then comprise associated fields such as day, month and year. This data type should also permit operations such as finding the difference between two dates, or extracting the day or the year of birth. But how do we represent such operations? This we shall see in the next section.
But, what are the advantages of such definitions?
Consider the following queries:
Find the name and address of the students who are enrolled in MCA programme.
SELECT name, address
FROM student
WHERE programme = ‘MCA’ ;
Please note that the attribute ‘address’ although composite, is put only once in the
query. But can we also refer to individual components of this attribute?
Find the name and address of all the MCA students of Mumbai.
Thus, such definitions allow us to handle a composite attribute as a single attribute with a user defined type. We can also refer to any component of this attribute without any problem, so the data definition of attribute components is still intact.
Complex data types also allow us to model a table with multi-valued attributes, which would otherwise require a new table in a relational database design. For example, a library database system would require the representation of the following information for a book:
Book table:
ISBN number
Book title
Authors
Published by
Subject areas of the book.
Clearly, in the table above, authors and subject areas are multi-valued attributes. We can represent them using separate tables – (ISBN number, author) and (ISBN number, subject area). (Please note that our database is not considering the author position in the list of authors.)
Although this design solves the immediate problem, it is complex. The problem may be represented most naturally if we use an object oriented database system. This is explained in the next section.
1.3.2 Types and Inheritances in SQL

In the previous sub-section we discussed the data type Address. It is a good example of a structured type. In this section, let us give more examples of such types, using SQL. Consider, for example, a date of birth attribute of the kind discussed earlier.

SQL uses the Persistent Stored Module (PSM)/PSM-96 standards for defining functions and procedures. According to these standards, functions (methods) need to be declared both within the definition of a type and in a CREATE METHOD statement. Thus, types such as those given above can be represented as:
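Only the NOT FINAL closing of the original listing survives; a minimal sketch of such a type, assuming a date-of-birth type Dob with a method that returns the year of birth (the names and the exact method syntax are assumptions in the SQL:1999 style), is:

CREATE TYPE Dob AS (
     dd INTEGER,
     mm INTEGER,
     yy INTEGER
)
NOT FINAL
METHOD year_of_birth() RETURNS INTEGER;

CREATE METHOD year_of_birth() RETURNS INTEGER
FOR Dob
BEGIN
     -- the method body simply returns the year component of the date
     RETURN SELF.yy;
END;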
The ‘FINAL’ and ‘NOT FINAL’ keywords have the same meaning as in Java: a final type cannot be inherited further. There also exists the possibility of using constructors, but a detailed discussion on that is beyond the scope of this unit.
Type Inheritance
In the present standard of SQL, you can define inheritance. Let us explain this with
the help of an example.
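The original example is not reproduced here; a minimal sketch consistent with the types referred to in the following paragraphs (University-person, Staff and Student, written with underscores so that the identifiers are valid SQL; the attribute names are assumptions) might be:

CREATE TYPE University_person AS (
     name VARCHAR(30),
     address Address
) NOT FINAL;

CREATE TYPE Staff UNDER University_person AS (
     staff_id VARCHAR(10),
     designation VARCHAR(20)
) NOT FINAL;

CREATE TYPE Student UNDER University_person AS (
     enrolment_no VARCHAR(9),
     programme VARCHAR(10)
) NOT FINAL;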
Notice that both the inherited types shown above inherit the name and address attributes from the type University-person. Methods can also be inherited in a similar way; however, they can be overridden if the need arises.
Table Inheritance
Consider the University-person, Staff and Student types as defined in the previous sub-section. We can create a table for the type University-person, and table inheritance then allows us to create sub-tables of it for Staff and Student (a sketch is given after the following requirements). The key requirements for table inheritance are:
The type associated with a sub-table must be a sub-type of the type of the parent table. This is a major requirement for table inheritance.

All the attributes of the parent table (University-members in our case) should be present in the inherited tables.

The three tables may be handled separately; however, any record present in the inherited tables is also implicitly present in the base table. For example, any record inserted in the student-list table will be implicitly present in the university-members table.
You can restrict your query to only the rows stored directly in the parent table by using the keyword ONLY. For example:
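The original CREATE TABLE statements and the ONLY example are not reproduced here; a minimal sketch using the types defined above and the table names mentioned in the text (university-members and student-list, written with underscores) could be:

CREATE TABLE university_members OF University_person;

CREATE TABLE staff_list OF Staff UNDER university_members;
CREATE TABLE student_list OF Student UNDER university_members;

-- rows stored directly in the parent table only, excluding the sub-tables
SELECT name FROM ONLY (university_members);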
1.3.3 Additional Data Types of OOP in SQL

The object oriented/relational database must support data types that allow multi-valued attributes to be represented easily. Two such data types that exist in SQL are arrays and multisets. Let us explain these with the help of the example of the book database introduced in section 1.3. This database can be represented using SQL as:
Please note, the use of the type ARRAY. Arrays not only allow authors to be
represented but, also allow the sequencing of the name of the authors. Multiset allows
a number of keywords without any ordering imposed on them.
But how can we enter data and query such data types? The following SQL commands
would help in defining such a situation. But first, we need to create a table:
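The original listings are not reproduced here; a minimal sketch of the book type, the table and a sample insertion (titles, author names and keywords are made-up values, and the MULTISET constructor follows SQL:2003) could be:

CREATE TYPE Book AS (
     ISBNNO VARCHAR(13),
     TITLE VARCHAR(60),
     AUTHORS VARCHAR(30) ARRAY[3],
     PUBLISHER VARCHAR(30),
     KEYWORDS VARCHAR(20) MULTISET
) NOT FINAL;

CREATE TABLE library OF Book;

INSERT INTO library
VALUES ('83-7758-476-6', 'A Sample Book Title',
        ARRAY['First Author', 'Second Author', 'Third Author'],
        'A Publisher',
        MULTISET['databases', 'SQL']);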
You can create many such queries; however, a detailed discussion on them can be found in the SQL 3 standards and is beyond the scope of this unit.
1.3.4 Object Identity and Reference Type Using SQL

Let us explain this concept with the help of an example; consider a book procurement system which provides an accession number to a book:
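The original CREATE TABLE statement is not reproduced here; a minimal sketch (column sizes are assumptions) that matches the description below could be:

-- assumes the library table was created with a self-referencing column, e.g.
-- CREATE TABLE library OF Book REF IS book_id SYSTEM GENERATED;
CREATE TABLE book_purchase_table (
     ACCESSION_NO VARCHAR(10),
     ISBNNO REF(Book) SCOPE library
);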
The command above creates a table that stores the accession number of a book and also holds a reference to the corresponding record in the library table.
However, now a fresh problem arises: how do we insert a book's reference into the table? One simple way would be to look up the required ISBN number, obtain the system generated object identifier and insert it into the required reference attribute. The following example demonstrates this form of insertion:
UPDATE book-table
SET ISBNNO = (SELECT book_id
FROM library
WHERE ISBNNO = ‘83-7758-476-6’)
WHERE ACCESSION-NO = ‘912345678’
Please note that, in the query given above, the sub-query obtains the object identifier of the library record whose ISBNNO is ‘83-7758-476-6’. The outer UPDATE then sets this reference in the record of the book-purchase-table whose accession number is 912345678.
This is a long procedure. Instead, since we have the ISBNNO as the key of the library table in the example above, we can create a user generated object reference by simply using the following set of SQL statements:
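The original statements are not reproduced here; a sketch in the SQL:1999 style for references derived from a key (the REF FROM and REF IS ... DERIVED clauses are taken from the standard's reference-type options and should be read as an assumption about the original listing) could be:

CREATE TYPE Book AS (
     ISBNNO VARCHAR(13),
     TITLE VARCHAR(60)
)
NOT FINAL
REF FROM (ISBNNO);

CREATE TABLE library OF Book
     REF IS book_id DERIVED;

CREATE TABLE book_purchase_table (
     ACCESSION_NO VARCHAR(10),
     ISBNNO REF(Book) SCOPE library
);

-- the reference can now be supplied directly as the key value
INSERT INTO book_purchase_table
VALUES ('912345678', '83-7758-476-6');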
3) Represent an address using SQL that has a method for locating pin-code information.
……………………………………………………………………………..
.…………………………………………………………………………….
…………………………………………………………………………….
……………………………………………………………………………..
.…………………………………………………………………………….
…………………………………………………………………………….
1.4 OBJECT ORIENTED DATABASE SYSTEMS

Object oriented database systems apply object oriented concepts to the database system model to create an object oriented database model. This section describes the concepts of the object model, followed by a discussion on the object definition and object manipulation languages that are derived from SQL.
1.4.1 Object Model

The ODMG has designed the object model for object oriented database management systems. The Object Definition Language (ODL) and Object Manipulation Language (OML) are based on this object model. Let us briefly define the concepts and terminology related to the object model.
Objects and Literals: These are the basic building blocks of the object model. An object has the following four characteristics:
A unique identifier
A name
A lifetime defining whether it is persistent or not, and
A structure that may be created using a type constructor. The structure in
OODBMS can be classified as atomic or collection objects (like Set, List,
Array, etc.).
A literal does not have an identifier but has a value that may be constant. The structure of a literal does not change. Literals can be atomic, corresponding to basic data types like int, short, long, float, etc.; structured (for example, the current date or time); or collection literals defining the values of some collection object.
Atomic Objects: An atomic object is an object that is not of a collection type. Atomic objects are user defined objects that are specified using the class keyword. The properties of an atomic object are defined by its attributes and relationships. An example is the Book object given in the next sub-section. Please note here that a class is instantiable.
Inheritance: Interfaces specify the abstract operations that can be inherited by classes. This is called behavioural inheritance and is represented using the “:” symbol. Sub-classes can inherit the state and behaviour of super-class(es) using the keyword EXTENDS.
Extents: The extent of a class contains all the persistent objects of that class. A class having an extent can have a key.
In the following section we shall discuss the use of the ODL and OML to implement
object models.
1.4.2 Object Definition Language

Object Definition Language (ODL) is a standard language, on the same lines as the DDL of SQL, that is used to represent the structure of an object-oriented database. It uses a unique object identity (OID) for each object, such as a library item, student, account, fees, inventory, etc. In this language objects are treated as records. Any class in the design process has three kinds of properties: attributes, relationships and methods. A class in ODL is described using the following syntax:
class <name>
{
<list of properties>
};
Here, class is a keyword, and each property may be an attribute, a method or a relationship. The attributes defined in ODL specify the features of an object. An attribute could be of a simple, enumerated, structure or complex type.
class Book
{
attribute string ISBNNO;
attribute string TITLE;
attribute enum CATEGORY
{text,reference,journal} BOOKTYPE;
attribute struct AUTHORS
{string fauthor, string sauthor, string
tauthor}
AUTHORLIST;
};
Please note that, in this case, we have defined the authors as a structure, and a new field, BOOKTYPE, as an enumeration.
These books need to be issued to the students. For that we need to specify a
relationship. The relationship defined in ODL specifies the method of connecting one
object to another. We specify the relationship by using the keyword “relationship”.
Thus, to connect a student object with a book object, we need to specify the
relationship in the student class as:
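The declaration itself is not reproduced at this point, but it appears verbatim in the complete class definitions later in this sub-section; it has the form:

relationship set <Book> receives
     inverse Book::receivedby;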
Here, for each object of the class Student there is a set of references to Book objects, and this set of references is called receives.
But if we want to access the students on the basis of a book, then the “inverse relationship” could be specified in the Book class as:
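Again, this declaration appears in the complete Book class below; it has the form:

relationship set <Student> receivedby
     inverse Student::receives;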
We specify the connection between the relationships receives and receivedby by using the keyword “inverse” in each declaration. If the inverse relationship is in a different class, it is referred to by the name of that class followed by a double colon (::) and the name of the relationship.
class Book
{
attribute string ISBNNO;
attribute string TITLE;
attribute integer PRICE;
attribute string PUBLISHER;
attribute enum CATEGORY
{text,reference}BOOKTYPE;
attribute struct AUTHORS
{string fauthor, string sauthor, string
tauthor} AUTHORLIST;
relationship set <Student> receivedby
inverse Student::receives;
relationship set <Supplier> suppliedby
inverse Supplier::supplies;
};
class Student
{
attribute string ENROLMENT_NO;
attribute string NAME;
attribute integer MARKS;
attribute string COURSE;
relationship set <Book> receives
inverse Book::receivedby;
};
class Supplier
{
attribute string SUPPLIER_ID;
attribute string SUPPLIER_NAME;
attribute string SUPPLIER_ADDRESS;
attribute string SUPPLIER_CITY;
relationship set <Book> supplies
inverse Book::suppliedby;
};
Methods can be specified in the classes along with their input/output types. These declarations are called “signatures”. The method parameters can be declared as in, out or inout: an in parameter is passed by value, whereas out and inout parameters are passed by reference. Exceptions can also be associated with these methods.
class Student
{
attribute string ENROLMENT_NO;
attribute string NAME;
attribute string st_address;
relationship set <Book> receives
inverse Book::receivedby;
// assumed signature for the method described below
void findcity(in string city, out string sname)
raises(notfoundcity);
};
In the method findcity, the name of a city is passed as a parameter in order to find the names of the students who belong to that specific city. In case a blank is passed as the city name, the exception notfoundcity is raised.

The attribute types in ODL can be atomic types or class names. Collection types are built using type constructors such as set, bag, list, array, dictionary and structure. We have shown the use of some of these in the examples above. You may wish to refer to the further readings section for more details.
Parallel to the distinction between a relation schema and a relation instance, ODL distinguishes between a class and its extent (the set of its existing objects). The extent is declared with the keyword “extent”.

It is not necessary in ODL to define keys for a class. But if one or more attributes are to be declared as a key for a class, it may be done with the keyword “key”.
The major considerations while converting ODL designs into relational designs are as follows:

a) It is not essential to declare keys for a class in ODL, but in a relational design key attributes have to be created so that the relation has a key.

c) Methods can be part of a design in ODL, but they cannot be directly converted into a relational schema (although SQL supports stored routines), as they are not a property of a relational schema.

d) Relationships are defined in inverse pairs in ODL but, in the case of a relational design, only one pair is defined.
For example, for the book class schema the relation is:
Book(ISBNNO,TITLE,CATEGORY,fauthor,sauthor,tauthor)
Thus, the ODL has been created with the features required to create an object oriented
database in OODBMS. You can refer to the further readings for more details on it.
1.4.3 Object Query Language

Object Query Language (OQL) is a standard query language which combines the high-level, declarative programming of SQL with the object-oriented features of OOP. Let us explain it with the help of examples.
Find the list of authors for the book titled “The suitable boy”
SELECT b.AUTHORS
FROM Book b
WHERE b.TITLE=”The suitable boy”
A more complex query, to display the title of the book which has been issued to the student whose name is Anand, could be:

SELECT b.TITLE
FROM Book b, Student s
WHERE s.NAME = "Anand"
AND b IN s.receives

Equivalently, using a path expression through the relationship:

SELECT b.TITLE
FROM Book b
WHERE b.receivedby.NAME = "Anand"
In the previous case, the query creates a bag of strings, but when the keyword
DISTINCT is used, the query returns a set.
SELECT b.TITLE
FROM Book b
WHERE b.receivedby.NAME =”Anand”
ORDER BY b.CATEGORY
In the case of complex output, the keyword “struct” is used. If we want to display pairs of titles from the same publisher, then a suitable query is:
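The query itself is not reproduced in the text; a plausible form (the field names title1 and title2 are assumptions) is:

SELECT DISTINCT struct(title1: b1.TITLE, title2: b2.TITLE)
FROM Book b1, Book b2
WHERE b1.PUBLISHER = b2.PUBLISHER
AND b1.ISBNNO < b2.ISBNNO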
Aggregate operators like SUM, AVG, COUNT, MAX and MIN can be used in OQL. If we want to calculate the maximum marks obtained by any student, then the OQL command is:
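The command is not reproduced in the text; assuming Student names the extent of the Student class, it would be of the form:

MAX(SELECT s.MARKS FROM Student s)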
Group by is used with the set of structures, that are called “immediate collection”.
Union, intersection and difference operators are applied to set or bag types using the keywords UNION, INTERSECT and EXCEPT. If we want to display the details of suppliers from PATNA and SURAT, then the OQL is:
(SELECT DISTINCT su
FROM Supplier su
WHERE su.SUPPLIER_CITY=”PATNA”)
UNION
(SELECT DISTINCT su
FROM Supplier su
WHERE su.SUPPLIER_CITY=”SURAT”)
The result of an OQL expression can be assigned to host language variables. If costlyBooks is a set<Book> variable meant to store the list of books whose price is above Rs. 200, then the assignment can be written as:
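The assignment is not shown in the text; a sketch of how it might look (assuming the extent Book and the condition just described) is:

costlyBooks = SELECT DISTINCT b
              FROM Book b
              WHERE b.PRICE > 200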
1) Create a class staff using ODL that also references the Book class given in
section 1.5.
……………………………………………………………………………..
.…………………………………………………………………………….
…………………………………………………………………………….
2) What modifications would be needed in the Book class because of the table
created by the above query?
……………………………………………………………………………..
.…………………………………………………………………………….
…………………………………………………………………………….
……………………………………………………………………………..
.…………………………………………………………………………….
…………………………………………………………………………….
1.5 IMPLEMENTATION OF OBJECT ORIENTED CONCEPTS IN DATABASE SYSTEMS

As noted earlier, object oriented concepts can be brought into database systems in two ways: by extending a relational DBMS with object oriented features (the object relational approach), or by creating a new DBMS that is exclusively devoted to the object oriented database. Let us discuss more about them.
One of the ways of implementing inherited tables may be to store the inherited primary key attributes along with the locally defined attributes in the sub-table. In such a case, to construct the complete details for the table, you need to take a join between the inherited table and the base class table.
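As a rough illustration (the table and column names, including the shared key member_id, are assumptions and not from the original text), reconstructing a student's full record under this scheme would look like:

-- base table: university_members(member_id, name, address)
-- sub-table : student_list(member_id, enrolment_no, programme)
SELECT u.member_id, u.name, u.address, s.enrolment_no, s.programme
FROM university_members u
JOIN student_list s ON s.member_id = u.member_id;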
The second possibility would be to allow the data to be stored in all the inherited as well as the base tables. However, such a case will result in data replication. Also, you may find data insertion difficult in this arrangement.
As far as arrays are concerned, since they have a fixed size, their implementation is straightforward. However, multisets would require following the principle of normalisation, that is, creating a separate table which can be joined with the base table as and when required.
Please note: an embedded query language requires many steps for the transfer of data from the database to local variables and vice-versa. The question is, can we extend an object oriented language such as C++ or Java to handle persistent data? Such persistent object-orientation would need to address some of the following issues:
Object Identity: All the objects created during the execution of an object oriented
program would be given a system generated object identifier, however, these
identifiers become useless once the program terminates. With the persistent objects it
is necessary that such objects have meaningful object identifiers. Persistent object
identifiers may be implemented using the concept of persistent pointers that remain
valid even after the end of a program.
Storage and access: The data of each persistent object needs to be stored. One simple
approach for this may be to store class member definitions and the implementation of
methods as the database schema. The data of each object, however, needs to be stored
individually along with the schema. A database of such objects may require the
collection of the persistent pointers for all the objects of one database together.
Another, more logical way may be to store the objects as collection types such as sets.
Some object oriented database technologies also define a special collection as class
extent that keeps track of the objects of a defined schema.
1.7 SUMMARY
Object oriented technologies are among the most popular technologies of the present era. Object orientation has also found its way into database technologies. Object oriented database systems allow the representation of user defined types, including operations on these types. They also allow representation of inheritance using both type inheritance and table inheritance. The idea here is to be able to represent a whole range of newer types if needed. Such features help in enhancing the performance of database applications that would otherwise need many tables. SQL supports these features for object relational database systems.

The object definition and object query languages have been designed for object oriented DBMSs on the same lines as SQL. These languages try to simplify various object related representations using an OODBMS.
Object relational and object oriented databases do not compete with each other but have different kinds of application areas. For example, relational and object relational DBMSs are most suited for simple transaction management systems, while OODBMSs may find applications in e-commerce, CAD and other similarly complex applications.
1.8 SOLUTIONS/ANSWERS
1)
class Staff
{
attribute string STAFF_ID;
attribute string STAFF_NAME;
attribute string DESIGNATION;
relationship set <Book> issues
inverse Book::issuedto;
};
2) The Book class needs to represent the relationship with the Staff class. This would be added to it by using the following declaration:

relationship set <Staff> issuedto
inverse Staff::issues;
3) SELECT DISTINCT b.TITLE
FROM BOOK b
WHERE b.issuedto.NAME = “Shashi”
For Unit 14: Please read the following units of MCS-43 Block 3 Unit 3 & Unit 4.
UNIT 3 INTRODUCTION TO DATA WAREHOUSING
Structure
3.0 Introduction
3.1 Objectives
3.2 What is Data Warehousing?
3.3 The Data Warehouse: Components and Processes
3.3.1 Basic Components of a Data Warehouse
3.3.2 Data Extraction, Transformation and Loading (ETL)
3.4 Multidimensional Data Modeling for Data Warehouse
3.5 Business Intelligence and Data Warehousing
3.5.1 Decision Support System (DSS)
3.5.2 Online Analytical Processing (OLAP)
3.6 Building of Data Warehouse
3.7 Data Marts
3.8 Data Warehouse and Views
3.9 The Future: Open Issues for Data Warehouse
3.10 Summary
3.11 Solutions/Answers
3.0 INTRODUCTION
Information Technology (IT) has a major influence on organisational performance and competitive standing. With the ever increasing processing power and the availability of sophisticated analytical tools and techniques, it has built a strong foundation for the data warehouse as a product. But why should an organisation consider investing in a data warehouse? One of the prime reasons for deploying a data warehouse is that the data warehouse is a kingpin of business intelligence.

Data warehouses provide storage, functionality and responsiveness to queries that are far superior to the capabilities of today’s transaction-oriented databases. In many applications, users only need read-access to data; however, they need to access larger volumes of data very rapidly – much more than can be conveniently handled by traditional database systems. Often, such data is extracted from multiple operational databases. Since most of these analyses are performed periodically, software developers and software vendors try to design systems to support these functions. Thus, there is a definite need for providing decision makers at middle management level and above with information at the appropriate level of detail to support decision-making. Data warehousing, online analytical processing (OLAP) and data mining technologies provide this functionality.

This unit covers the basic features of data warehousing and OLAP. Data mining is discussed in more detail in Unit 4 of this block.
3.1 OBJECTIVES
One of the major advantages a data warehouse offers is that it allows the large collection of historical data of many operational databases, which may be heterogeneous in nature, to be analysed through a single data warehouse interface; thus, it can be said to be a ONE STOP portal of historical information of an organisation. It can also be used in determining many trends through the use of data mining techniques.

Remember, a data warehouse does not create value on its own in an organisation; the value is generated by the users of the data of the data warehouse.
For example, an electricity billing company, by analysing the data of a data warehouse, can predict frauds and can reduce the cost of such determinations. In fact, this technology has such great potential that any company possessing proper analysis tools can benefit from it. Thus, a data warehouse supports Business Intelligence, that is, the technology that includes business models with objectives such as reducing operating costs and increasing profitability by improving productivity, sales, services and decision-making. Some of the basic questions that may be asked of software that supports business intelligence include:
A data warehouse has many characteristics. Let us define them in this section and explain some of these features in more detail in later sections.
easy to use interfaces, strong data manipulation, support for applying and reporting of various analyses, and user-friendly output.
Figure 2 shows the basic architecture of a data warehouse. The analytical reports are not a part of the data warehouse itself but belong to the major business application areas, which include OLAP and DSS.
The warehouse database obtains most of its data from different forms of legacy systems: files and databases. Data may also be sourced from external sources as well as other organisational systems, for example, an office system. This data needs to be integrated into the warehouse. But how do we integrate the data of these large
numbers of operational systems to the data warehouse system? We need the help of
ETL tools to do so. These tools capture the data that is required to be put in the data
warehouse database. We shall discuss the ETL process in more detail in section 3.3.2.
Data of Data Warehouse
A data warehouse has an integrated, “subject-oriented”, “time-variant” and “non-
volatile” collection of data. The basic characteristics of the data of a data warehouse
can be described in the following way:
i) Integration: Integration means bringing together data of multiple, dissimilar
operational sources on the basis of an enterprise data model. The enterprise data
model can be a basic template that identifies and defines the organisation’s key data
items uniquely. It also identifies the logical relationships between them ensuring
organisation wide consistency in terms of:
Data naming and definition: Standardising for example, on the naming of
“student enrolment number” across systems.
Encoding structures: Standardising on gender to be represented by “M” for male
and “F” for female or that the first two digit of enrolment number would represent
the year of admission.
Measurement of variables: A Standard is adopted for data relating to some
measurements, for example, all the units will be expressed in metric system or all
monetary details will be given in Indian Rupees.
ii) Subject Orientation: The second characteristic of the data warehouse’s data is that its design and structure are oriented towards the important objects of the organisation. These objects, such as STUDENT, PROGRAMME and REGIONAL CENTRES, are in contrast to its operational systems, which may be designed around applications and functions such as ADMISSION, EXAMINATION and RESULT DECLARATION (in the case of a University). Refer to Figure 3.
Figure 3: Operations system data orientation vs. Data warehouse data orientation
iii) Time-Variance: The third defining characteristic of the database of data
warehouse is that it is time-variant, or historical in nature. The entire data in the data
warehouse is/was accurate at some point of time. This is in contrast with operational data, which changes over a shorter time period. The data warehouse’s data is date-stamped and historical in nature. Figure 4 depicts this characteristic of the data of a data warehouse.
Figure 4: Time variance characteristics of a data of data warehouse and operational data
iv) Non-volatility (static nature) of data: Data warehouse data is loaded on to the
data warehouse database and is subsequently scanned and used, but is not updated in
the same classical sense as operational system’s data which is updated through the
transaction processing cycles.
A data warehouse may support many OLAP and DSS tools. Such decision support
applications would typically access the data warehouse database through a standard
query language protocol; an example of such a language may be SQL. These
applications may be of three categories: simple query and reporting, decision support
systems and executive information systems. We will define them in more details in the
later sections.
The meta data directory component defines the repository of the information stored in
the data warehouse. The meta data can be used by the general users as well as data
administrators. It contains the following information:
Meta data has several roles to play and uses in the data warehouse system. For an end
user, meta data directories also provide some additional information, such as what a
particular data item would mean in business terms. It also identifies the information on
reports, spreadsheets and queries related to the data of concern. All database
management systems (DBMSs) have their own data dictionaries that serve a similar
purpose. Information from the data dictionaries of the operational system forms a
valuable source of information for the data warehouse’s meta data directory.
3.3.2 Data Extraction, Transformation and Loading (ETL)
The first step in data warehousing is to perform the extraction, transformation and loading of data into the data warehouse. This is called ETL, that is, Extraction,
Transformation, and Loading. ETL refers to the methods involved in accessing and
manipulating data available in various sources and loading it into a target data
warehouse. Initially, ETL was performed using SQL programs; however, tools are now available for ETL processes. Manual ETL was complex, as it required the creation of complex code for extracting data from many sources. ETL
tools are very powerful and offer many advantages over the manual ETL. ETL is a
step-by-step process. As a first step, it maps the data structure of a source system to
the structure in the target data warehousing system. In the second step, it cleans up the
data using the process of data transformation and finally, it loads the data into the
target system.
ETL is a three-stage process. During the Extraction phase the desired data is identified and extracted from many different sources. These sources may be databases or non-database sources. Sometimes, when it is difficult to identify exactly the desirable data, more data than necessary is extracted. This is followed by the
identification of the relevant data from the extracted data. The process of extraction
sometimes, may involve some basic transformation. For example, if the data is being
extracted from two Sales databases where the sales in one of the databases is in
Dollars and in the other in Rupees, then, simple transformation would be required in
the data. The size of the extracted data may vary from several hundreds of kilobytes
to hundreds of gigabytes, depending on the data sources and business systems. Even
the time frame for the extracted data may vary, that is, in some data warehouses, data
extraction may take a few days or hours to a real time data update. For example, a
situation where the volume of extracted data even in real time may be very high is a
web server.
The extraction process involves data cleansing and data profiling. Data cleansing can be defined as the process of removal of inconsistencies among the data. For example, a state name may be written in many ways, and it can be misspelt too: the state Uttar Pradesh may be written as U.P., UP, Uttar Pradesh, Utter Pradesh, etc. The cleansing process may try to correct the spellings as well as resolve such inconsistencies. But how does the cleansing process do that? One simple way
may be, to create a Database of the States with some possible fuzzy matching
algorithms that may map various variants into one state name. Thus, cleansing the
data to a great extent. Data profiling involves creating the necessary data from the
point of view of data warehouse application. Another concern here is to eliminate
duplicate data. For example, an address list collected from different sources may be
merged as well as purged to create an address profile with no duplicate data.
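A very simplified version of the state-mapping idea above (ignoring the fuzzy matching itself; the staging table customer_staging and all column names are assumptions, not part of the original text) could be expressed in SQL as:

-- lookup table of known variants mapped to the canonical state name
CREATE TABLE state_variants (
     variant VARCHAR(30) PRIMARY KEY,
     canonical VARCHAR(30)
);

-- rewrite each matching variant found in the staging data
UPDATE customer_staging
SET state = (SELECT v.canonical FROM state_variants v
             WHERE v.variant = customer_staging.state)
WHERE state IN (SELECT variant FROM state_variants);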
One of the most time-consuming tasks, data transformation and loading, follows the extraction stage. This process includes the following:
Use of data filters,
Data validation against the existing data,
Checking of data duplication, and
Information aggregation.
Transformations are useful for transforming the source data according to the
requirements of the data warehouse. The process of transformation should ensure the
quality of the data that needs to be loaded into the target data warehouse. Some of the
common transformations are:
Filter Transformation: Filter transformations are used to filter out the rows in a mapping that do not meet specific conditions. For example, a filter may retain only the employees of the Sales department who made sales above Rs. 50,000.
Joiner Transformation: This transformation is used to join data from two or more different tables that may be stored at different locations and could belong to different sources of data, which may be relational or of any other kind, like XML data.
Aggregator Transformation: Such transformations perform aggregate calculations
on the extracted data. Some such calculations may be to find the sum or average.
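Conceptually, these transformations can be thought of as ordinary SQL over a staging area; a minimal sketch (the staging tables sales_staging and employee_master, the target tables and all column names are assumptions) is:

-- Filter: keep only the rows satisfying the condition
INSERT INTO dw_big_sales
SELECT emp_id, amount
FROM sales_staging
WHERE dept = 'Sales' AND amount > 50000;

-- Joiner: combine data from two different source tables
INSERT INTO dw_sales_with_region
SELECT s.emp_id, s.amount, e.region
FROM sales_staging s JOIN employee_master e ON s.emp_id = e.emp_id;

-- Aggregator: compute sums and averages over the extracted data
INSERT INTO dw_sales_summary
SELECT emp_id, SUM(amount), AVG(amount)
FROM sales_staging
GROUP BY emp_id;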
When should we perform the ETL process for data warehouse? ETL process should
normally be performed during the night or at such times when the load on the
operational systems is low. Please note that, the integrity of the extracted data can be
ensured by synchronising the different operational applications feeding the data
warehouse and the data of the data warehouse.
3.4 MULTIDIMENSIONAL DATA MODELING FOR DATA WAREHOUSE
A data warehouse is a huge collection of data. Such data may involve grouping of data on multiple attributes. For example, the enrolment data of the students of a University may be represented using a student enrolment schema such as:

Enrolment (year, programme, region, number)

Here, some typical data values may be (these values are also shown in Figure 5; in an actual situation almost all the values would be filled up):
In the year 2002, BCA enrolment at Region (Regional Centre Code) RC-07
(Delhi) was 350.
In year 2003 BCA enrolment at Region RC-07 was 500.
In year 2002 MCA enrolment at all the regions was 8000.
Please note that, to define the student number here, we need to refer to three attributes: the year, the programme and the region. Each of these is called a dimension attribute. Thus, the data of the student enrolment table can be modelled using dimension attributes (year, programme, region) and a measure attribute (number). Such data is referred to as multidimensional data. Thus, a data warehouse may use multidimensional matrices, referred to as the data cube model. The multidimensional data of a corporate data warehouse, for example, would have the fiscal period, product and branch dimensions. If the number of dimensions of the matrix is greater than three, then it is called a hypercube. Query performance in multidimensional matrices that lend themselves to dimensional formatting can be much better than that of the relational data model.
Figure 5: A sample of multidimensional data (a data cube of enrolment numbers by year, programme and region).
The cross-tabulation of this multidimensional data, with the Region dimension fixed as ALL, is:

Year     BCA     MCA    Others      All
2002    9000    8000     45000    62000
2003    9500    7800     43000    60300
2004    6000    9000     42000    57000
2005    4000    9000     40000    53000
All    28500   33800    170000   232300

Such a table is also referred to as a pivot table. Please note that cross-tabulation is done on any two dimensions, keeping the other dimensions fixed as ALL. For example, the table above has two dimensions, Year and Programme; the third dimension, Region, has the fixed value ALL.

Please note that the cross-tabulation shown above is different from a relation. The relational representation for the data of the table above may be:
Table: Relational form for the Cross table as above
Year Programme Region Number
2002 BCA All 9000
2002 MCA All 8000
2002 Others All 45000
2002 All All 62000
2003 BCA All 9500
2003 MCA All 7800
2003 Others All 43000
2003 All All 60300
2004 BCA All 6000
2004 MCA All 9000
2004 Others All 42000
2004 All All 57000
2005 BCA All 4000
2005 MCA All 9000
2005 Others All 40000
2005 All All 53000
All BCA All 28500
All MCA All 33800
All Others All 170000
All All All 232300
A cross tabulation can be performed on any two dimensions. The operation of changing the dimensions used in a cross tabulation is termed pivoting. In case a cross tabulation is done for a value other than ALL of the fixed third dimension, it is called slicing. For example, a slice can be created for Region code RC-07, instead of ALL the regions, in the cross tabulation above. The operation is called dicing if the values of multiple dimensions are fixed.
Multidimensional data allows data to be displayed at various levels of granularity. An operation that converts data of a fine granularity to a coarse granularity using aggregation is termed a rollup operation. For example, creating the cross tabulation for all regions is a rollup operation. On the other hand, an operation that moves from a coarse granularity to a fine granularity is known as a drill down operation. For example, moving from the cross tabulation over all regions back to the full multidimensional data is a drill down operation. Please note that for the drill down operation we need the original data, or data at some finer granularity.
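In SQL terms, a rollup over the enrolment data (assuming it is stored as the relation Enrolment(year, programme, region, number) introduced above) can be written using the ROLLUP grouping of SQL:1999:

SELECT year, programme, SUM(number) AS total
FROM Enrolment
GROUP BY ROLLUP (year, programme);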
Now, the question is, how can multidimensional data be represented in a data
warehouse? or, more formally, what is the schema for multidimensional data?
Two common multidimensional schemas are the star schema and the snowflake
schema. Let us, describe these two schemas in more detail. A multidimensional
storage model contains two types of tables: the dimension tables and the fact table.
The dimension tables have tuples of dimension attributes, whereas the fact tables have
one tuple each for a recorded fact. In order to relate a fact to a dimension, we may
have to use pointers. Let us demonstrate this with the help of an example. Consider
the University data warehouse where one of the data tables is the Student enrolment
table. The three dimensions in such a case would be:
Year
Programme, and
Region
Figure 6: A Star Schema (a central Enrolment fact table pointing to dimension tables for Year, Programme (duration, start date) and Region (RCcode, RCname, RCaddress, RCphone)).
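A rough rendering of this star schema in SQL (the table and column names are assumptions based on the figure) might be:

CREATE TABLE programme_dim (
     prog_code VARCHAR(10) PRIMARY KEY,
     prog_name VARCHAR(30),
     duration INTEGER,
     start_date DATE
);

CREATE TABLE region_dim (
     rc_code VARCHAR(10) PRIMARY KEY,
     rc_name VARCHAR(30),
     rc_address VARCHAR(60),
     rc_phone VARCHAR(15)
);

CREATE TABLE year_dim (
     year INTEGER PRIMARY KEY
);

-- fact table: one row per recorded enrolment figure
CREATE TABLE enrolment_fact (
     year INTEGER REFERENCES year_dim,
     prog_code VARCHAR(10) REFERENCES programme_dim,
     rc_code VARCHAR(10) REFERENCES region_dim,
     number INTEGER
);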
Please note that in Figure 6 the fact table points to the different dimension tables, thus ensuring the reliability of the data. Please notice that each dimension table is a table for a single dimension only, and that is why this schema is known as a star schema. However, a dimension table of a star schema may not be normalised. Thus, a new schema named the snowflake schema was created. A snowflake schema has normalised but hierarchical dimension tables. For example, consider the star schema shown in Figure 6: if, in the Region dimension table, the field RCphone is multi-valued, then the Region dimension table is not normalised.
Figure 7: A snowflake schema (the same Enrolment fact and dimension tables as in Figure 6, but with the Region dimension normalised into a Region table (RCcode, RCname, RCaddress) and a separate Phone table keyed on RCcode).
Data warehouse storage can also utilise indexing to support high performance access. In a star schema, dimensional data can be indexed to the tuples of the fact table by using a join index. Data warehouse storage also facilitates access to summary data, owing to the non-volatile nature of the data.
3.5 BUSINESS INTELLIGENCE AND DATA WAREHOUSING

A data warehouse is an integrated collection of data and can help the process of making better business decisions. Several tools and methods are available that exploit the data of a data warehouse to create information and knowledge that support business decisions. Two such techniques are decision support systems and online analytical processing. Let us discuss these two in more detail in this section.
3.5.1 Decision Support Systems (DSS)
A DSS is a decision support system and NOT a decision-making system. DSS is a specific class of computerised information systems that support the decision-making activities of an organisation. A properly designed DSS is an interactive software-based system that helps decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions.
The DSS assists users in selecting an appropriate analysis or performing different types of studies on the datasets. For example, the answers to a series of questionnaires may be stored in spreadsheets, and this information can then be passed on to decision makers. More specifically, the feedback data collected on a programme like CIC may be given to subject matter experts for making decisions on the quality, improvement and revision of that programme. The DSS approach provides a self-assessment weighing tool to facilitate the determination of the value of different types of quality and quantity attributes. Decision support systems are sometimes also referred to as Executive Information Systems (EIS).
The first factor is a strange but true ‘pull’ factor, that is, executives are becoming more computer-literate and willing to become direct users of computer systems. For example, one survey suggested that more than twenty per cent of senior executives have computers on their desks but barely 5 per cent use the system; although there are wide variations in such estimates, there is a definite pull towards this simple, easy to use technology.
The other factor may be the increased use of computers at the executive level. For
example, it has been suggested that middle managers who have been directly
using computers in their daily work are being promoted to the executive level.
This new breed of executives does not exhibit the fear of computer technology that has characterised executive management up to now, and is quite willing to be a direct user of computer technology.
In an OLAP system, a data analyst would like to see different cross tabulations by interactively selecting the required attributes. Thus, the queries in an OLAP system are expected to be executed extremely quickly. The basic data model that may be supported by OLAP is the star schema, and an OLAP tool may be compatible with a data warehouse.
Let us try to give an example of how OLAP is more suited to a data warehouse than to a relational database. OLAP creates aggregations of information; for example, the sales figures of a salesperson can be grouped (aggregated) by product and period. This data can also be grouped for sales projections of the salesperson over regions (North, South), states or cities, thus producing an enormous amount of aggregated data. If we used a relational database, we would be generating such data many times over. However, this data has many dimensions, so it is an ideal candidate for representation through a data warehouse. The OLAP tool can thus be used directly on the data of the data warehouse to answer many analytical queries in a short time span. The term OLAP is sometimes confused with OLTP. OLTP is online
short time span. The term OLAP is sometimes confused with OLTP. OLTP is online
transaction processing. OLTP systems focus on highly concurrent transactions and
better commit protocols that support high rate of update transactions. On the other
hand, OLAP focuses on good query-evaluation and query-optimisation algorithms.
OLAP Implementation
The classical form of OLAP implementation uses multidimensional arrays in the
memory to store multidimensional data. Such implementation of OLAP is also
referred to as Multidimensional OLAP (MOLAP). MOLAP is faster as it stores data in
an already processed aggregated data form using dimension and fact tables. The other
important type of OLAP implementation is Relational OLAP (ROLAP), which stores
data directly in the relational databases. ROLAP creates multidimensional views upon
request rather than in advance as in MOLAP. ROLAP may be used on complex data
with a wide number of fields.
3.6 BUILDING OF DATA WAREHOUSE

The first basic issue in building a data warehouse is to identify its intended USE. This should include information on the expected outcomes of the design. A good data warehouse must support meaningful query facilities on the attributes of the dimension and fact tables. In addition to the design of the database schema, a data warehouse design has to address the following three issues:
Data Acquisition: A data warehouse must acquire data so that it can fulfil the
required objectives. Some of the key issues for data acquisition are:
Data storage: The data acquired by the data warehouse has to be stored as per the storage schema. This data should be easily accessible and should fulfil the query needs of the users efficiently. Thus, designers need to ensure that there are appropriate indexes or access paths that allow suitable data access. Data storage must be updated as more data is acquired by the data warehouse, but it should still provide access to data during this time. Data storage also needs to address the issues of refreshing a part of the data of the data warehouse and purging data from the data warehouse.
Environment of the data warehouse: Data warehouse designers should also keep in
mind the data warehouse environment considerations. The designers must find the
expected use of the data and check whether it is consistent with the schema design. Another
key issue here is the design of the meta data directory component of the data
warehouse. The design should remain maintainable under environmental changes.
Data warehouse technologies use a very diverse vocabulary. Although the
vocabulary of data warehousing may vary across organisations, the data
warehousing industry is in agreement that the data warehouse lifecycle
model fundamentally consists of five major phases – design, prototype, deploy,
operation and enhancement.
1) Design: This phase includes communication with the end users, finding the available catalogues, defining
key performance and quality indicators, mapping of decision-making processes
as per the information needs at various end-user levels, logical and physical
schema design, etc.
2) Prototype: A data warehouse is a high-cost project; thus, it may be a good idea
to deploy it partially for a select group of decision-makers and database
practitioners in the end-user communities. This will help in developing a system
that will be easy to accept and will largely meet the users’ requirements.
3) Deploy: Once the prototype is approved, the data warehouse can be put to
actual use. A deployed data warehouse comes with the following processes:
documentation, training and maintenance.
The programs created during the previous phase are executed to populate the data
warehouse’s database.
The Development and Implementation Team: A core team for such an
implementation may include:
A Project Leader responsible for managing the overall project, who
helps in obtaining resources and participates in the design sessions.
Analysts who document the end-user requirements and create the enterprise data models
for the data warehouse.
A Database Administrator responsible for the physical database creation, and
Programmers responsible for programming the data extraction and transformation
programs and the end-user access applications.
Training: Training will be required not only for the end users, once the data warehouse
is in place, but also for the various team members during the development stages of the
data warehouse.
Check Your Progress 2
1) What is a dimension, how is it different from a fact table?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
2) How is snowflake schema different from other schemes?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
3) What are the key concerns when building a data warehouse?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4) What are the major issues related to data warehouse implementation?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
5) Define the terms: DSS and ESS.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
6) What are OLAP, MOLAP and ROLAP?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
Data marts can be considered as the database or collection of databases that are
designed to help managers in making strategic decisions about the business and the
organisation. Data marts are usually smaller than data warehouses as they focus on
some subject or a department of an organisation (a data warehouse combines
databases across an entire enterprise). Some data marts, called dependent data
marts, are subsets of larger data warehouses.
A data mart is like a data warehouse and contains operational data that helps in
making strategic decisions in an organisation. The only difference between the two is
that data marts are created for a certain limited, predefined application. Even in a data
mart, the data is huge and comes from several operational systems; therefore, data marts also need
a multidimensional data model. In fact, the star schema is also one of the popular schema
choices for a data mart.
(i) For making a separate schema for OLAP or any other similar system.
In fact, to standardise data analysis and usage patterns, data warehouses are generally
organised as task-specific small units, the data marts. The data organisation of a data
mart is a very simple star schema. For example, the university data warehouse that we
discussed in section 3.4 could actually be a data mart for the problem “The prediction of
student enrolments for the next year.” A simple data mart may extract its contents
directly from operational databases. However, in complex multilevel data warehouse
architectures, the data mart content may be loaded with the help of the warehouse
database and meta data directories.
A data warehouse extracts, transforms and then stores the data into its schema;
views, however, are only logical and may not be materialised.
You can apply mechanisms for data access in an enhanced way in a data
warehouse; that is not the case for a view.
The administration of a data warehouse is a complex and challenging task, and a number of
issues concerning data warehouses remain open.
However, data warehouses are still an expensive solution and are typically found in
large firms. The development of a central warehouse is capital intensive with high
risks. Thus, at present data marts may be a better choice.
3.10 SUMMARY
This unit provided an introduction to the concepts of data warehousing systems. The
data warehouse is a technology that collects operational data from several operational
systems, refines it and stores it in its own multidimensional model such as the star schema
or the snowflake schema. The data of a data warehouse can be indexed and can be used
for analysis through various DSS and EIS. The architecture of a data warehouse
contains an interface that interacts with the operational systems, transformation
processing, the database, middleware and a DSS interface at the other end. However, a data
warehouse architecture is incomplete if it does not have a meta data directory, which is
extremely useful at each and every step of the data warehouse. The life cycle of a
data warehouse has several stages for designing, prototyping, deploying and
maintenance. The data warehouse’s life cycle, however, can be clubbed with the
SDLC. A data mart is a smaller version of a data warehouse designed for a specific
purpose. A data warehouse is quite different from views. A data warehouse is complex
and offers many challenges and open issues but, in the future, data warehouses will
be an extremely important technology deployed for DSS. Please go through the
further readings for more details on data warehouses.
3.11 SOLUTIONS/ANSWERS
2) ETL is Extraction, Transformation and Loading. ETL refers to the methods
involved in accessing and manipulating data available in various sources and
loading it into a target data warehouse. The following are some of the
transformations that may be used during ETL:
Filter Transformation
Joiner Transformation
Aggregator transformation
Sorting transformation.
1) The basic constructs used to design a data warehouse and a data mart are the
same. However, a Data Warehouse is designed for the enterprise level, while
Data Marts may be designed for a business division/department level. A data
mart contains the required subject specific data for local analysis only.
A data warehouse extracts, transforms and then stores the data into its
schema; that is not true for materialised views.
For Unit 14 : Please read the following units of
MCS-43 Block 3 Unit 3 & Unit 4
UNIT 4 INTRODUCTION TO DATA MINING
Structure Page Nos.
4.0 Introduction 80
4.1 Objectives 80
4.2 Data Mining Technology 81
4.2.1 Data, Information, Knowledge
4.2.2 Sample Data Mining Problems
4.2.3 Database Processing vs. Data Mining Processing
4.2.4 Data Mining vs KDD
4.3 Approaches to Data Mining Problems 84
4.4 Classification 85
4.4.1 Classification Approach
4.4.2 Classification Using Distance (K-Nearest Neighbours)
4.4.3 Decision or Classification Tree
4.4.4 Bayesian Classification
4.5 Clustering 93
4.5.1 Partitioning Clustering
4.5.2 Nearest Neighbours Clustering
4.5.3 Hierarchical Clustering
4.6 Association Rule Mining 96
4.7 Applications of Data Mining Problem 99
4.8 Commercial Tools of Data Mining 100
4.9 Summary 102
4.10 Solutions/Answers 102
4.11 Further Readings 103
4.0 INTRODUCTION
Data mining is emerging as a rapidly growing interdisciplinary field that takes its
approach from different areas like databases, statistics, artificial intelligence and data
structures in order to extract hidden knowledge from large volumes of data. The data
mining concept is nowadays not only used by the research community; a lot of
companies are also using it for predictions so that they can compete and stay ahead of
their competitors.
With rapid computerisation in the past two decades, almost all organisations have
collected huge amounts of data in their databases. These organisations need to
understand their data and also want to discover useful knowledge as patterns, from
their existing data.
This unit aims at giving you some of the fundamental techniques used in data mining.
It provides a brief overview of data mining as well as the application of
data mining techniques to the real world. We will only consider structured data as
input in this unit. We will emphasise three techniques of data mining:
(a) Classification,
(b) Clustering, and
(c) Association rules.
4.1 OBJECTIVES
After going through this unit, you should be able to:
Data mining is related to data warehousing in the respect that a data warehouse is well
equipped for providing data as input for the data mining process. The advantages of
using the data of a data warehouse for data mining are many; some of them are listed
below:
Data quality and consistency are essential for data mining, to ensure the accuracy
of the predictive models. In data warehouses, before loading the data, it is first
extracted, cleaned and transformed. We will get good results only if we have
good-quality data.
A data warehouse consists of data from multiple sources. The data in data
warehouses is integrated and subject-oriented, and the data mining process is
performed on this data.
In data mining, it may be the case that the required data is aggregated or
summarised data. Such data is already there in the data warehouse.
A data warehouse provides the capability of analysing data by using OLAP
operations. Thus, the results of a data mining study can be analysed for hitherto
uncovered patterns.
He does not know statistics, and he does not want to hire statisticians.
Some of the above questions may be answered by data mining.
b) Mr. Avinash Arun is an astronomer and the sky survey has 3 terabytes (10^12 bytes) of
data about 2 billion objects. Some of the questions that may come to the mind of
Mr. Arun are as follows:
He knows the data and statistics, but that is not enough. Once again, some of
the above questions may be answered by data mining.
Please note: The use of data mining in both the cases given above lies in finding
certain patterns and information. Clearly, the type of data in the two databases
described above will be quite different.
The output of a database-processing query is precise and is a subset of the data,
while in the case of data mining the output is fuzzy and is not a subset of the data.
Some of the examples of database queries are as follows:
Find all credit card applicants with the last name Ram.
Identify customers who have made purchases of more than Rs.10,000/- in the
last month.
Find all customers who have purchased shirt(s).
Some data mining queries may be:
Find all credit card applicants with poor or good credit risks.
Identify the profile of customers with similar buying habits.
Find all items that are frequently purchased with shirt (s).
Preprocessing: It includes cleansing the data which has already been extracted
in the above step.
Extracting the data set: It includes extracting the required data which will later be
used for analysis.
Data cleansing process: It involves basic operations such as the removal of
noise, collecting the necessary information to model or account for noise, and deciding on
strategies for handling missing data fields.
[Figure: The KDD Process – create/select the target database (with inputs such as data warehousing and data organised by function), select sample data, normalise and transform values, create derived attributes, and find important attributes and value ranges.]
The classification task maps data into predefined groups or classes. The class of a
tuple is indicated by the value of a user-specified goal attribute. Tuples consist of a
set of predicating attributes and a goal attribute. The task is to discover some kind of
relationship between the predicating attributes and the goal attribute, so that the
discovered information/knowledge can be used to predict the class of new tuple(s).
The task of clustering is to group the tuples with similar attribute values into the same
class. Given a database of tuples and an integer value k, clustering is to define a
mapping such that tuples are mapped to different clusters.
The task of association rule mining is to search for interesting relationships among
items in a given data set. Its original application is on “market basket data”. The rule
has the form X ⇒ Y, where X and Y are sets of items and they do not intersect. Each
rule has two measurements, support and confidence. Given the user-specified
minimum support and minimum confidence, the task is to find rules with support and
confidence above the minimum support and minimum confidence.
The distance measure finds the distance or dissimilarity between objects. The measures
that are used in this unit are as follows:

Euclidean distance: dis(ti, tj) = sqrt( Σ_{h=1..k} (tih − tjh)^2 )

Manhattan distance: dis(ti, tj) = Σ_{h=1..k} | tih − tjh |

where ti and tj are tuples and h indexes the different attributes, which can take values from 1
to k.
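These two measures can be coded directly. The following is a small illustrative sketch in Python that assumes each tuple is given as a list of numeric attribute values of equal length.

from math import sqrt

def euclidean(ti, tj):
    # square root of the sum of squared attribute-wise differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(ti, tj)))

def manhattan(ti, tj):
    # sum of the absolute attribute-wise differences
    return sum(abs(a - b) for a, b in zip(ti, tj))

print(euclidean([1.6, 60], [1.7, 65]))   # about 5.001
print(manhattan([1.6, 60], [1.7, 65]))   # 5.1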
4.4 CLASSIFICATION
The classification task maps data into predefined groups or classes.
Given a database/dataset D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the
classification problem is to define a mapping f : D → C where each ti is assigned to one
class; that is, it divides the database/dataset D into the classes specified in the set C.
The basic approaches to classification are:
To create specific models by evaluating training data, which is basically old
data that has already been classified using the domain knowledge of experts.
Then applying the model developed to the new data.
Some of the most common techniques used for classification include the use of
Decision Trees, Neural Networks, etc. Most of these techniques are based on finding
distances or use statistical methods.
One of the algorithms that is used is K-Nearest Neighbours. Some of the basic points to
be noted about this algorithm are:
The training set includes classes along with other attributes. (Please refer to the
training data given in the table below.)
The value of K defines the number of near items (items that have the least
distance to the attributes of concern) that should be used from the given set of
training data (just to remind you again, training data is already classified data).
This is explained in point (2) of the following example.
A new item is placed in the class in which the largest number of close items are
placed. (Please refer to point (3) in the following example.)
to you.
2) Let us take only the height attribute for distance calculation and suppose K = 5;
then the following are the five nearest tuples to the data that is to be classified (using
Manhattan distance as a measure on the height attribute).
3) On examination of the tuples above, we classify the tuple <Ram, M, 1.6> into the Short
class, since most of the tuples above belong to the Short class.
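The classification just described can be sketched in a few lines of Python. Please note that this is only an illustrative sketch: the helper names and the small training set below (heights and classes) are assumptions, not the data of the table referred to above.

from collections import Counter

def knn_classify(training, new_height, k=5):
    # training: list of (name, gender, height, class_label) tuples.
    # Classify by the majority class among the k tuples whose height is
    # nearest to new_height (Manhattan distance on the height attribute only).
    nearest = sorted(training, key=lambda t: abs(t[2] - new_height))[:k]
    votes = Counter(t[3] for t in nearest)
    return votes.most_common(1)[0][0]

train = [("Kiran", "F", 1.55, "Short"), ("Mohan", "M", 1.95, "Tall"),
         ("Leena", "F", 1.60, "Short"), ("Vikas", "M", 1.65, "Short"),
         ("Asha",  "F", 1.72, "Medium"), ("Rohit", "M", 1.58, "Short"),
         ("Geeta", "F", 1.90, "Tall")]

print(knn_classify(train, 1.6))   # prints 'Short'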
Each arc is labeled with the predicate which can be applied to the attribute at the
parent node.
Decision Tree Induction is the process of learning about the classification using the
inductive approach. During this process, we create a decision tree from the training
data. This decision tree can, then be used, for making classifications. To define this
we need to define the following.
Let us assume that we are given probabilities p1, p2, …, ps whose sum is 1. Let us also
define the term Entropy, which is the measure of the amount of randomness,
surprise or uncertainty. Thus, our basic goal in the classification process is that the
entropy for a classification should be zero; if there is no surprise, the entropy is
zero. Entropy is defined as:

H(p1, p2, …, ps) = Σ_{i=1..s} ( pi * log(1/pi) )        ……. (1)
Algorithm: ID3 algorithm for creating decision tree from the given training
data.
Input: The training data and the attribute-list.
Output: A decision tree.
Process:
Step 1: Create a node N;
Step 2: If the sample data are all of the same class C (that is, the probability is 1),
then return N as a leaf node labeled with class C;
Step 3: If the attribute-list is empty,
then return N as a leaf node labeled with the most common class in the sample data;
Step 4: Select the split-attribute, which is the attribute in the attribute-list with the
highest information gain;
Step 5: Label node N with the split-attribute;
Step 6: for each known value Ai of the split-attribute // partition the samples
Create a branch from node N for the condition: split-attribute = Ai;
// Now consider a partition and recursively create the decision tree:
Let xi be the set of data from the training data that satisfies the condition:
split-attribute = Ai
if the set xi is empty then
attach a leaf labeled with the most common class in the prior
set of training data;
else
attach the node returned after a recursive call to the program
with training data xi and
new attribute-list = present attribute-list – split-attribute;
End of Algorithm.
Please note: The algorithm given above chooses the split attribute with the highest
information gain, which is calculated as follows:

Gain(D, S) = H(D) − Σ_{i=1..s} ( P(Di) * H(Di) )        ……….. (2)

where S = {D1, D2, D3, …, Ds} is the set of new states into which D is partitioned, and H(D) is the
entropy (amount of randomness) in state D.
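Formulas (1) and (2) can be computed directly. The following Python sketch is only illustrative; it assumes the class labels of the training tuples are supplied as plain lists.

from math import log2
from collections import Counter

def entropy(labels):
    # H(p1, ..., ps) = sum of pi * log(1/pi) over the class probabilities
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def gain(labels, partitions):
    # Gain(D, S) = H(D) - sum over the partitions Di of P(Di) * H(Di)
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)

classes = ["Boss", "Assistant", "Assistant", "Boss", "Assistant", "Assistant"]
# A hypothetical attribute that splits the tuples into two pure partitions:
split = [["Boss", "Boss"], ["Assistant", "Assistant", "Assistant", "Assistant"]]
print(entropy(classes))       # about 0.918
print(gain(classes, split))   # 0.918 -- a perfect split removes all disorder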
Now let us calculate the gain for the departments using the formula at (2).
Since age has the maximum gain, this attribute is selected as the first splitting
attribute. In the age range 31-40, the class is not defined, while for the other ranges it is defined.
So, we have to again calculate the splitting attribute for this age range (31-40). Now,
the tuples that belong to this range are as follows:
The gain is maximum for the salary attribute, so we take salary as the next splitting
attribute. In the middle salary range, the class is not defined, while for the other ranges it is
defined. So, we have to again calculate the splitting attribute for this middle range.
Since only department is left, department will be the next splitting attribute. Now,
the tuples that belong to this salary range are as follows:
Department Position
Personnel Boss
Administration Boss
Administration Assistant
Again in the Personnel department, all persons are Boss, while, in the Administration
there is a tie between the classes. So, the person can be either Boss or Assistant in the
Administration department.
[Figure residue: a decision tree with root node “Age?” (branches 21-30, 31-40 and 41-50); the 31-40 branch leads to the node “Salary?” (low, medium and high ranges); the medium-range branch leads to the node “Department?” (Personnel → Boss, Administration → Assistant/Boss); the remaining branches end in the leaf classes Assistant and Boss.]
Figure 4: The decision tree using the ID3 algorithm for the sample data of Figure 3.
Now, we will take a new dataset and classify each of its tuples by
applying the decision tree that we have built above.
This is a statistical classification, which predicts the probability that a given sample is
a member of a particular class. It is based on the Bayes theorem. Bayesian
classification shows better accuracy and speed when applied to large databases. We
will discuss here the simplest form of Bayesian classification.
The basic underlying assumption (also called class conditional independence) for this
simplest form of classification, known as the naive Bayesian classification, is:
“The effect of an attribute value on a given class is independent of the values of the other
attributes.”
Let us discuss naive Bayesian classification in more detail. But before that, let us
define the basic theorem on which this classification is based.
Bayes Theorem:
Please note: We can calculate P(X), P(X | H) and P(H) from the data sample X and the
training data. It is only P(H | X), which basically defines the probability that X
belongs to a class C, that cannot be calculated directly. Bayes theorem does precisely this
function. Bayes theorem states:

P(H | X) = P(X | H) * P(H) / P(X)
Now after defining the Bayes theorem, let us explain the Bayesian classification with
the help of an example.
i) Consider the sample having an n-dimensional feature vector. For our example,
it is a 3-dimensional (Department, Age, Salary) vector with training data as
given in the Figure 3.
ii) Assume that there are m classes C1 to Cm and an unknown sample X. The
problem is to determine which class X belongs to. As per Bayesian
classification, the sample is assigned to the class Ci if the following holds:

P(Ci | X) > P(Cj | X)   for all 1 ≤ j ≤ m, j ≠ i

In other words, the class for the data sample X will be the class which has the
maximum probability for the unknown sample. Please note: P(Ci | X) will
be found using:

P(Ci | X) = P(X | Ci) * P(Ci) / P(X)        ….(3)
So P(C1) P(C2)
iv) The P(X | Ci) calculation may be computationally expensive if there are a large
number of attributes. To simplify the evaluation, in the naive Bayesian
classification, we use the condition of class conditional independence, that is, the
values of the attributes are assumed to be independent of each other. In such a situation:

P(X | Ci) = Π_{k=1..n} P(xk | Ci)        ….(4)

P(xk | Ci) = (number of training samples of class Ci having the value xk for the attribute Ak)
             / (number of training samples belonging to Ci)
Since the first probability of the above two is higher, the sample data may be
classified into the BOSS position. Kindly check that you obtain the same result
from the decision tree of Figure 4.
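The naive Bayesian computation of equations (3) and (4) can be sketched as follows. Please note that the tiny training table, the attribute names and the sample X below are assumptions made only for illustration; they are not the data of Figure 3.

# Hypothetical training tuples: (department, age_range, salary_range, position)
train = [("Personnel", "31-40", "Medium", "Boss"),
         ("Admin",     "21-30", "Low",    "Assistant"),
         ("Admin",     "31-40", "Medium", "Assistant"),
         ("Personnel", "41-50", "High",   "Boss"),
         ("Admin",     "21-30", "Low",    "Assistant")]

def naive_bayes_score(x, cls):
    # Computes P(Ci) * product over the attributes of P(xk | Ci); P(X) is a
    # common factor for all classes and can therefore be ignored when comparing.
    in_class = [t for t in train if t[-1] == cls]
    p = len(in_class) / len(train)                      # the prior P(Ci)
    for k, value in enumerate(x):                       # class-conditional terms
        p *= sum(1 for t in in_class if t[k] == value) / len(in_class)
    return p

x = ("Personnel", "31-40", "Medium")                    # the unknown sample X
scores = {c: naive_bayes_score(x, c) for c in ("Boss", "Assistant")}
print(max(scores, key=scores.get), scores)              # the highest score wins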
4.5 CLUSTERING
Clustering is grouping things with similar attribute values into the same group. Given a
database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to
define a mapping where each tuple ti is assigned to one cluster Kj, 1 <= j <= k.
A cluster, Kj, contains precisely those tuples mapped to it. Unlike the classification
problem, clusters are not known in advance. The user has to enter the value of k, the
number of clusters.
In other words, a cluster can be defined as a collection of data objects that are similar
in nature, as per a certain defining property, but are dissimilar to the
objects in other clusters.
Some of the issues related to clustering are as follows:
Outlier handling: How will outliers be handled? (Outliers are the objects that
do not comply with the general behaviour or model of the data.) Are they to
be considered or left aside while calculating the clusters?
Dynamic data: How will you handle dynamic data?
Interpreting results: How will the results be interpreted?
Evaluating results: How will the results be evaluated?
Number of clusters: How many clusters will you consider for the given data?
Data to be used: Are you dealing with quality data or noisy data?
If the data is noisy, how is it to be handled?
Scalability: Does the algorithm used scale for small as well as
large data sets/databases?
There are many different kinds of algorithms for clustering. However, we will discuss
only three basic algorithms. You can refer to more details on clustering from the
further readings.
Squared Error
K-Means
Now, in this unit, we will briefly discuss these algorithms.
Squared Error Algorithms
The most frequently used criterion function in partitioning clustering techniques is the
squared error criterion. The method of obtaining clustering by applying this approach
is as follows:
(1) Select an initial partition of the patterns with a fixed number of clusters and
cluster centers.
(2) Assign each pattern to its closest cluster center and compute the new cluster
centers as the centroids of the clusters. Repeat this step until convergence is
achieved, i.e., until the cluster membership is stable.
(3) Merge and split clusters based on some heuristic information, optionally
repeating step 2.
Some of the parameters that are used for clusters are as follows:

Centroid: Cm = ( Σ_{i=1..N} tmi ) / N

Radius: Rm = sqrt( Σ_{i=1..N} (tmi − Cm)^2 / N )

Diameter: Dm = sqrt( Σ_{i=1..N} Σ_{j=1..N} (tmi − tmj)^2 / ( N * (N − 1) ) )
A detailed discussion on this algorithm is beyond the scope of this unit. You can refer
to more details on clustering from the further readings.
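The three cluster parameters above can be computed directly. The short Python sketch below treats each tuple as a single numeric value; for multi-attribute tuples the same sums would be applied using a vector distance instead of a simple difference.

from math import sqrt

def centroid(cluster):
    return sum(cluster) / len(cluster)

def radius(cluster):
    c = centroid(cluster)
    return sqrt(sum((t - c) ** 2 for t in cluster) / len(cluster))

def diameter(cluster):
    n = len(cluster)
    return sqrt(sum((ti - tj) ** 2 for ti in cluster for tj in cluster)
                / (n * (n - 1)))

points = [2.0, 4.0, 6.0]
print(centroid(points), radius(points), diameter(points))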
K-Means clustering
In the K-Means clustering, initially a set of clusters is randomly chosen. Then
iteratively, items are moved among sets of clusters until the desired set is reached. A
high degree of similarity among elements in a cluster is obtained by using this
algorithm. For this algorithm, a set of clusters Ki = {ti1, ti2, …, tim} is given, and the cluster
mean is:
mi = (1/m)(ti1 + … + tim)        …(5)
where the tij represent the tuples in cluster Ki, m is the number of tuples in the cluster, and mi is the cluster mean.
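A minimal K-means sketch for one-dimensional data is given below; the data values and k = 2 are assumptions made purely for illustration.

def k_means(data, k, iterations=10):
    means = data[:k]                        # naive choice of initial cluster means
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:                      # assign each item to the nearest mean
            idx = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        # recompute each mean as in equation (5); keep the old mean if a cluster is empty
        means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
    return means, clusters

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
print(k_means(data, 2))                     # two means, near 1.0 and 5.1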
In this approach, items are iteratively merged into the existing clusters that are closest.
It is an incremental method. A threshold, t, is used to determine whether items are added to
existing clusters or a new cluster is created. This process continues until all patterns
are labeled or no additional labeling occurs.
In this method, the clusters are created in levels and depending upon the threshold
value at each level the clusters are again created.
A divisive method begins with all tuples in a single cluster and performs splitting until
a stopping criterion is met. This is the top down approach.
Most hierarchical clustering algorithms are variants of the single-link, average-link
and complete-link algorithms. Of these, the single-link and complete-link
algorithms are the most popular. These two algorithms differ in the way they
characterise the similarity between a pair of clusters.
In the single-link method, the distance between two clusters is the minimum of the
distances between all pairs of patterns drawn from the two clusters (one pattern from
the first cluster, the other from the second).
In the complete-link algorithm, the distance between two clusters is the maximum of
all pair-wise distances between patterns in the two clusters.
In either case, two clusters are merged to form a larger cluster based on minimum
distance criteria.
You can refer to more detail on the hierarchical clustering algorithms from the further
readings.
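The two inter-cluster distance measures just described can be expressed in a few lines of Python; the one-dimensional sample clusters below are assumed values used only for illustration.

def single_link(c1, c2):
    # minimum of all pairwise distances between the two clusters
    return min(abs(a - b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # maximum of all pairwise distances between the two clusters
    return max(abs(a - b) for a in c1 for b in c2)

A, B = [1.0, 2.0, 2.5], [4.0, 5.5]
print(single_link(A, B))     # 1.5 (between 2.5 and 4.0)
print(complete_link(A, B))   # 4.5 (between 1.0 and 5.5)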
2) What is clustering?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
Formal Definition:
Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions,
where each transaction T is a set of items such that T ⊆ I. TID indicates a unique
transaction identifier.
Given a user specified minimum support and minimum confidence, the problem of
mining association rules is to find all the association rules whose support and
confidence are larger than the minimum support and minimum confidence. Thus, this
approach can be broken into two sub-problems as follows:
(1) Finding the frequent itemsets which have support above the predetermined
minimum support.
(2) Deriving all rules, based on each frequent itemset, which have confidence more
than the minimum confidence.
There are many ways to find the large itemsets, but we will only discuss the Apriori
algorithm.
The apriori algorithm applies the concept that if an itemset has minimum support, then
all its subsets also have minimum support. An itemset having minimum support is
called frequent itemset or large itemset. So any subset of a frequent itemset must also
be frequent.
The Apriori algorithm generates the candidate itemsets to be counted in a pass by using
only the itemsets found to be large in the previous pass – without considering the
transactions in the database.
It starts by finding all frequent 1-itemsets (itemsets with 1 item); then consider 2-
itemsets from these 1-itemsets, and so forth. During each iteration only candidates
found to be frequent in the previous iteration are used to generate a new candidate set
during the next iteration. The algorithm terminates when there are no frequent
k-itemsets.
Apriori algorithm function takes as argument Lk-1 and returns a superset of the set of
all frequent k-itemsets. It consists of a join step and a prune step. The Apriori
algorithm is given below :
APRIORI
1. k=1
2. Find frequent set Lk from Ck of all candidate itemsets
3. Form Ck+1 from Lk; k = k + 1
4. Repeat 2-3 until Ck is empty
Step 2: Scan the data set D and count each itemset in Ck; if the count is at least
the minimum support, the itemset is frequent.
Step 3:
For k=1, C1 = all frequent 1-itemsets. (all individual items).
For k>1, generate Ck from Lk-1 as follows:
The join step
Ck = k-2 way join of Lk-1 with itself
If both {a1, …,ak-2, ak-1} & {a1, …, ak-2, ak} are in Lk-1, then
add {a1, …,ak-2, ak-1, ak} to Ck
(We keep items sorted).
The prune step
Remove {a1, …,ak-2, ak-1, ak} if it contains a non-frequent
(k-1) subset.
{In the prune step, delete all itemsets c ∈ Ck such that some
(k-1)-subset of c is not in Lk-1.}
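A compact sketch of this level-wise search in Python is given below. It is only illustrative: the transactions are assumed values, the explicit prune step is folded into the support counting, and only frequent-itemset discovery (not rule generation) is shown.

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]             # candidate 1-itemsets
    while current:
        # count each candidate and keep those meeting the minimum support
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # join step: build (k+1)-itemsets from the frequent k-itemsets
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent                                       # itemset -> support

T = [{"Coat", "Shirt", "Tie"}, {"Coat", "Shirt"}, {"Coat", "Trouser"}, {"Shirt", "Tie"}]
print(apriori(T, 0.5))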
Example: Finding frequent itemsets:
Consider the following transactions with minimum support s=30% for finding the
frequent itemsets by applying Apriori algorithm:
Join operation yields 3 item sets as: {{Coat, Shirt, Tie}, {Coat, Shirt, Trouser},
{Coat, Tie, Trouser}}
However, the Prune operation removes two of these itemsets from the set due to the
following reasons:
The following algorithm creates the association rules from the set L so created by the
Apriori algorithm.
Input:
D //Database of transactions
I //Items
L //Large itemsets
s // Support
c // Confidence
Output:
R //Association rules satisfying minimum s and c
AR algorithm:
R = Ø
For each l ∈ L do // for each large itemset l in the set L
For each x ⊆ l such that x <> Ø and x <> l do
if support(l) / support(x) ≥ c then
R = R ∪ {x ⇒ (l − x)};
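The AR algorithm above translates almost line for line into Python. The support values and the confidence threshold below are assumptions used only for illustration; in practice they would come from the Apriori pass.

from itertools import combinations

def generate_rules(large_itemsets, support, min_conf):
    # large_itemsets: iterable of frozensets; support: dict mapping itemset -> support
    rules = []
    for l in large_itemsets:
        for r in range(1, len(l)):                        # every non-empty proper subset x of l
            for x in map(frozenset, combinations(l, r)):
                conf = support[l] / support[x]            # confidence of x => (l - x)
                if conf >= min_conf:
                    rules.append((set(x), set(l - x), conf))
    return rules

support = {frozenset({"Shirt"}): 0.75, frozenset({"Tie"}): 0.5,
           frozenset({"Shirt", "Tie"}): 0.5}
print(generate_rules([frozenset({"Shirt", "Tie"})], support, 0.6))
# [({'Shirt'}, {'Tie'}, 0.666...), ({'Tie'}, {'Shirt'}, 1.0)]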
Apriori Advantages/Disadvantages:
The following are the advantages and disadvantages of the Apriori algorithm:
Advantages:
It uses the large itemset property.
It is easy to implement.
Disadvantages:
It assumes the transaction database is memory resident.
Marketing and sales data analysis: A company can use customer transactions
in their database to segment the customers into various types. Such companies
may launch products for specific customer bases.
Investment analysis: Customers can look at the areas where they can get good
returns by applying the data mining.
Loan approval: Companies can generate rules depending upon the dataset
they have. On that basis, they may decide to whom a loan should be approved.
Fraud detection: By finding the correlation between frauds, new frauds can be
detected by applying data mining.
Network management: By analysing patterns generated by data mining for
networks and their faults, faults can be minimised and future needs can
be predicted.
Risk Analysis: Given a set of customers and an assessment of their risk-
worthiness, descriptions for various classes can be developed. Use these
descriptions to classify a new customer into one of the risk categories.
Brand Loyalty: Given a customer and the product he/she uses, predict whether
the customer will change their products.
Housing loan prepayment prediction: Rule discovery techniques can be used
to accurately predict the aggregate number of loan prepayments in a given
quarter as a function of prevailing interest rates, borrower characteristics and
account data.
16) XpertRule Miner (Attar Software) provides association rule discovery from
any ODBC data source.
17) DMSK: Data-Miner Software Kit: A collection of tools for efficient
mining of big data (classification, regression, summarisation, deviation
detection and multi-task tools).
18) OSHAM: Task (Clustering) – an interactive-graphic system for discovering
concept hierarchies from unsupervised data.
Free Tools:
1) EC4.5, a more efficient version of C4.5, which uses the best among three
strategies at each node construction.
2) IND, provides CART and C4.5 style decision trees and more. Publicly available
from NASA but with export restrictions.
3) ODBCMINE, shareware data-mining tool that analyses ODBC databases using
the C4.5, and outputs simple IF..ELSE decision rules in ASCII.
4) OC1, a decision tree system for continuous feature values; it builds decision trees with
linear combinations of attributes at each internal node; these trees then partition
the space of examples with both oblique and axis-parallel hyperplanes.
5) PC4.5, a parallel version of C4.5 built with Persistent Linda system.
6) SE-Learn, Set Enumeration (SE) trees generalise decision trees. Rather than
splitting by a single attribute, one recursively branches on all (or most) relevant
attributes. (LISP)
7) CBA, mines association rules and builds accurate classifiers using a subset of
association rules.
8) KINOsuite-PR extracts rules from trained neural networks.
9) RIPPER, a system that learns sets of rules from data
3) Apply the Apriori algorithm for generating large itemset on the following
dataset:
Transaction ID    Items purchased
T100              a1, a3, a4
T200              a2, a3, a5
T300              a1, a2, a3, a5
T400              a2, a5
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4.9 SUMMARY
1) Data mining is the process of automatic extraction of interesting (non trivial,
implicit, previously unknown and potentially useful) information or pattern
from the data in large databases.
2) Data mining is one of the steps in the process of Knowledge Discovery in
databases.
3) In data mining, tasks are classified as: Classification, Clustering and Association
rules.
5) The clustering task groups things with similar properties/behaviour into the same
groups.
2) Data mining is only one of the many steps involved in knowledge discovery in
databases. The various steps in KDD are data extraction, data cleaning and
preprocessing, data transformation and reduction, data mining and knowledge
interpretation and representation.
3) The query language of OLTP is well defined and uses SQL, while for
data mining the query is poorly defined and there is no precise query language.
The data used in OLTP is operational data while in data mining it is historical
data. The output of the query of OLTP is precise and is the subset of the data
while in the case of data mining the output is fuzzy and it is not a subset of the
data.
1) The classification task maps data into predefined groups or classes. The class of
a tuple is indicated by the value of a user-specified goal attribute. Tuples
consist of a set of predicating attributes and a goal attribute. The task is to
discover some kind of relationship between the predicating attributes and the
goal attribute, so that the discovered knowledge can be used to predict the class
of new tuple(s).
2) The task of clustering is to group the tuples with similar attribute values into the
same class. Given a database of tuples and an integer value k, clustering defines a
mapping such that tuples are mapped to different clusters.
3) In classification, the classes are predetermined, but, in the case of clustering the
groups are not predetermined. The number of clusters has to be given by the
user.
Check Your Progress 3
1) The task of association rule mining is to search for interesting relationships
among items in a given data set. Its original application is on “market basket
data”. The rule has the form X ⇒ Y, where X and Y are sets of items and they do
not intersect.
Assuming a minimum support of 50% for calculating the large itemsets: as
we have 4 transactions, at least 2 transactions should contain an itemset for it to be large.
Thus L = {L1, L2, L3}
For Unit 16: Please read the following unit of
MCS-43 Block 4 Unit 1
UNIT 1 EMERGING DATABASE MODELS, TECHNOLOGIES AND APPLICATIONS-I
Structure Page Nos.
1.0 Introduction 5
1.1 Objectives 5
1.2 Multimedia Database 6
1.2.1 Factors Influencing the Growth of Multimedia Data
1.2.2 Applications of Multimedia Database
1.2.3 Contents of MMDB
1.2.4 Designing MMDBs
1.2.5 State of the Art of MMDBMS
1.3 Spatial Database and Geographic Information Systems 10
1.4 Genome Databases 12
1.4.1 Genomics
1.4.2 Gene Expression
1.4.3 Proteomics
1.5 Knowledge Databases 17
1.5.1 Deductive Databases
1.5.2 Semantic Databases
1.6 Information Visualisation 18
1.7 Summary 19
1.8 Solutions/Answers 20
1.0 INTRODUCTION
Database technology has advanced from the relational model to the distributed DBMS
and Object Oriented databases. The technology has also advanced to support data
formats using XML. In addition, data warehousing and data mining technology has
become very popular in the industry from the viewpoint of decision making and
planning.
1.1 OBJECTIVES
After going through this unit, you should be able to:
(iii) Applications
With the rapid growth of computing and communication technologies, many
applications have come to the forefront. Thus, many such applications in the future will
support everyday life with multimedia data. This trend is expected to keep increasing in the
days to come.
media commerce
medical media databases
bioinformatics
ease of use of home media
news and entertainment
surveillance
Multimedia Databases (MMDBs) must cope with the large volume of multimedia
data, being used in various software applications. Some such applications may include
digital multimedia libraries, art and entertainment, journalism and so on. Some of
these qualities of multimedia data like size, formats etc. have direct and indirect
influence on the design and development of a multimedia database.
Media Format Data: This data defines the format of the media data after the
acquisition, processing, and encoding phases. For example, such data may consist of
information about sampling rate, resolution, frame rate, encoding scheme etc. of
various media data.
Media Keyword Data: This contains the keyword related to the description of media
data. For example, for a video, this might include the date, time, and place of
recording, the person who recorded, the scene description, etc. This is also known as
content description data.
Media Feature Data: This contains the features derived from the media data. A
feature characterises the contents of the media. For example, this could contain
information on the distribution of colours, the kinds of textures and the different
shapes present in an image. This is also referred to as content dependent data.
The last three types are known as meta data, as they describe several different aspects
of the media data. The media keyword data and media feature data are used as indices
for searching purposes. The media format data is used to present the retrieved
information.
complexity of representation and subjective interpretation, especially from the
viewpoint of the meta data.
6) One of the main requirements for such a database would be to handle different
kinds of indices. Multimedia data is inexact and subjective in nature; thus,
the keyword-based indices and exact range searches used in traditional
databases are ineffective in such databases. For example, the retrieval of records
of students based on enrolment number is precisely defined, but the retrieval of
records of students having certain facial features from a database of facial
images requires content-based queries and similarity-based retrievals. Thus,
the multimedia database may require indices that are content-dependent
keyword indices.
7) The Multimedia database requires developing measures of data similarity that
are closer to perceptual similarity. Such measures of similarity for different
media types need to be quantified and should correspond to perceptual
similarity. This will also help the search process.
8) Multimedia data is created all over the world, so it could have distributed database
features that cover the entire world as the geographic area. Thus, the media data
may reside in many different distributed storage locations.
9) Multimedia data may have to be delivered over available networks in real-time.
Please note, in this context, the audio and video data is temporal in nature. For
example, the video frames need to be presented at the rate of about 30
frames/sec for smooth motion.
Multimedia data is now being used in many database applications. Thus, multimedia
databases are required for efficient management and effective use of enormous
amounts of data.
These software systems are used to provide support for a wide variety of different media types,
specifically different media file formats such as image formats, video, etc. These files
need to be managed, segmented, linked and searched.
The later commercial systems handle multimedia content by providing complex object
types for various kinds of media. In such databases, object orientation provides the
facilities to define new data types and operations appropriate for the media, such as
video, image and audio. Therefore, broadly, MMDBMSs are extensible Object-
Relational DBMSs (ORDBMSs). The most advanced solutions presently include
Oracle 10g, IBM DB2 and IBM Informix. These solutions propose almost similar
approaches for extending the search facility for video based on similarity-based techniques.
Some of the newer projects address the needs of applications for richer semantic
content. Most of them are based on the new MPEG-standards MPEG-7 and MPEG-
21.
MPEG-7
MPEG-7 is the ISO/IEC 15938 standard for multimedia descriptions that was issued
in 2002. It is an XML-based multimedia meta-data standard, and it describes various
elements of the multimedia processing cycle, from capture, analysis/filtering, to
delivery and interaction.
the applications utilising multimedia data are very diverse in nature. There is a
need for the standardisation of such database technologies,
technology is ever changing, thus creating further hurdles in the way of
multimedia databases,
there is still a need to refine the algorithms used to represent multimedia information
semantically. This also creates problems with respect to information
interpretation and comparison.
Check Your Progress 1
1) What are the reasons for the growth of multimedia data?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
2) List four application areas of multimedia databases.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
3) What are the contents of multimedia database?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4) List the challenges in designing multimedia databases.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
planning, evaluation of land use, facility and landscape management, traffic
monitoring systems, etc. These applications need to store data as per the requirements of
the application. For example, irrigation facility management would require a study
of the various irrigation sources, the land use patterns, the fertility of the land,
soil characteristics, rain patterns, etc., with the various kinds of data stored in
layers containing different attributes. Such data will also require that any
changes in the patterns be recorded. Such data may be useful for
decision makers to ascertain and plan for the sources, types and means of
irrigation.
Requirements of a GIS
The data in GIS needs to be represented in graphical form. Such data would require
any of the following formats:
A GIS must also support the analysis of data. Some of the sample data analysis
operations that may be needed for typical applications are:
Once the data is captured in a GIS, it may be processed through some special operations.
Some such operations are:
The GIS also requires the process of visualisation in order to display the data in a
proper visual form.
Thus, GIS is not a database that can be implemented using either the relational or
object oriented database alone. Much more needs to be done to support them. A
detailed discussion on these topics is beyond the scope of this unit.
Biological data is by nature enormous. Bioinformatics is one such key area that has
emerged in recent years and which addresses the issues of information management
of genetic data related to DNA sequences. A detailed discussion on this topic is beyond
the scope of this unit. However, let us identify some of the basic characteristics of
biological data.
The Human Genome Initiative is an international research initiative for the creation of
detailed genetic and physical maps for each of the twenty-four different human
chromosomes and the finding of the complete deoxyribonucleic acid (DNA) sequence
of the human genome. The term Genome is used to define the complete genetic
information about a living entity. A genetic map shows the linear arrangement of
genes or genetic marker sites on a chromosome. There are two types of genetic maps:
genetic linkage maps and physical maps. Genetic linkage maps are created on the
basis of the frequency with which genetic markers are co-inherited. Physical maps are
used to determine actual distances between genes on a chromosome.
One of the major uses of such databases is in computational Genomics, which refers
to the applications of computational molecular biology in genome research. On the
basis of the principles of the molecular biology, computational genomics has been
classified into three successive levels for the management and analysis of genetic data
in scientific databases. These are:
Genomics.
Gene expression.
Proteomics.
1.4.1 Genomics
Genomics is a scientific discipline that focuses on the systematic investigation of the
complete set of chromosomes and genes of an organism. Genomics consists of two
component areas:
Genome Databases
Genome databases are used for the storage and analysis of genetic and physical maps.
Chromosome genetic linkage maps represent distances between markers based on
meiotic re-combination frequencies. Chromosome physical maps represent distances
between markers based on numbers of nucleotides.
Genome databases should define four data types:
Sequence
Physical
Genetic
Bibliographic
Sequence-tagged sites
Coding regions
Non-coding regions
Control regions
Telomeres
Centromeres
Repeats
Metaphase chromosome bands.
Locus name
Location
Recombination distance
Polymorphisms
Breakpoints
Rearrangements
Disease association
Bibliographic references should cite primary scientific and medical literature.
Gene expression databases have not established defined standards for the collection,
storage, retrieval and querying of gene expression data derived from libraries of gene
expression experiments.
Data visualisation is used to display the partial results of cluster analysis generated
from large gene expression databases.
1.4.3 Proteomics
Proteomics is the use of quantitative protein-level measurements of gene expression in
order to characterise biological processes and describe the mechanisms of gene
translation. The objective of proteomics is the quantitative measurement of protein
expression particularly under the influence of drugs or disease perturbations. Gene
expression monitors gene transcription whereas proteomics monitors gene translation.
Proteomics provides a more direct response to functional genomics than the indirect
approach provided by gene expression.
Proteome Databases
Proteome databases also provide integrated data management and analysis systems for
the translational expression data generated by large-scale proteomics experiments.
Proteome databases integrate expression levels and properties of thousands of proteins
with the thousands of genes identified on genetic maps and offer a global approach to
the study of gene expression.
Proteome databases address five research problems that cannot be resolved by DNA
analysis:
The creation of comprehensive databases of genes and gene products will lay the
foundation for the further construction of comprehensive databases of higher-level
mechanisms, e.g., regulation of gene expression, metabolic pathways and signalling
cascades.
A detailed discussion on these databases is beyond the scope of this Unit. You may
wish to refer to the further readings for more information.
1.5 KNOWLEDGE DATABASES
Knowledge databases are the database for knowledge management. But what is
knowledge management? Knowledge management is the way to gather, manage and
use the knowledge of an organisation. The basic objectives of knowledge management
are to achieve improved performance, competitive advantage and higher levels of
innovation in various tasks of an organisation.
Please note that during the representation of a fact, the data is represented
using the attribute value only and not the attribute name. The attribute name is
determined on the basis of the position of the data. For instance, in the
example above, Rakesh is the Mgrname.
The rules in Datalog do not contain the data. They are evaluated on the
basis of the stored data in order to deduce more information.
Another similar term used in the context of visualisation is knowledge visualisation,
the main objective of which is to improve the transfer of knowledge using visual formats
that include images, mind maps, animations, etc.
Please note the distinction here. Information visualisation mainly focuses on the tools
that are supported by the computer in order to explore and present large amounts of
data in formats that may be easily understood.
You can refer to more details on this topic in the fifth semester course.
1.7 SUMMARY
This unit provides an introduction to some of the later developments in the area of
database management systems. Multimedia databases are used to store and deal with
multimedia information in a cohesive fashion. Multimedia databases are very large in
size and also require support of algorithms for searches based on various media
components. Spatial databases primarily deal with multi-dimensional data. A GIS is a
spatial database that can be used for many cartographic applications such as irrigation
system planning, vehicle monitoring systems, etc. This database system may represent
information in a multi-dimensional way.
The genome database is another very large database system that is used for the purposes of
genomics, gene expression and proteomics. Knowledge databases store information
either as a set of facts and rules or as semantic models. These databases can be utilised
in order to deduce more information from the stored rules using an inference engine.
Information visualisation is an important area that may be linked to databases from
the point of visual presentation of information for better user interactions.
Check Your Progress 2
1) GIS is a spatial database application where the spatial and non-spatial data is
represented along with the map. Some of the applications of GIS are:
Cartographic applications
3-D Digital modeling applications like land elevation records
Geographic object applications like traffic control system.
2) A GIS has the following requirements:
Data representation through vector and raster
Support for analysis of data
Representation of information in an integrated fashion
Capture of information
Visualisation of information
Operations on information
3) The data may need to be organised for the following three levels:
Genomics: Where four different types of data are represented. The
physical data may be represented using eight different fields.
Gene expression: Where data is represented in fourteen different fields
Proteomics: Where data is used for five research problems.
Check Your Progress 3
1) A good knowledge database will have good information, good classification and
structure and an excellent search engine.