DWDM Notes 1-5 Units

This document provides an overview of a course on data warehousing and data mining. It discusses (1) the definition of a data warehouse as a subject-oriented, integrated, non-volatile and time-variant collection of data from single or multiple sources, (2) the key differences between operational and warehouse data in terms of orientation, relationships, and time variance, and (3) the typical structure of a data warehouse with different levels of data summarization and detail.

COURSE CODE: 1151CS114
COURSE TITLE: DATA WAREHOUSING AND DATA MINING
L T P C: 3 0 0 3

CO Nos.: CO1
Course Outcome: Explain and identify the subject areas for which a data warehouse is to be built.
Level of learning domain (based on revised Bloom's taxonomy): K2

1151CS114-Data Warehousing and Data mining

UNIT-I

A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies the reporting and analysis processes of the organization.

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Data warehouse environment


In this course we concentrate mainly on the data warehouse level. The data warehouse level is the main source of data for all departmental data marts. It forms the heart of the architected environment and is the foundation of all DSS (decision support system) processing.

The data warehouse is at the center of the architecture for information systems of the 1990s. It supports informational processing by providing a solid platform of integrated, historical data from which to do analysis. The data warehouse provides a facility for integration in a world of non-integrated application systems. It is built in an evolutionary, step-at-a-time fashion, and it organizes and stores the data needed for informational, analytical processing over a long historical time perspective.

It is a subject-oriented, integrated, non-volatile and time-variant collection of data. Data warehouses contain granular corporate data; the concept of granularity is explained in detail later in these notes.

Subject orientation

The main feature of the data warehouse is that the data is oriented around major subject areas of business. Figure 2 shows
the contrast between the two types of orientations.

The operational world is designed around applications and functions such as loans, savings, bankcard, and trust for a
financial institution. The data warehouse world is organized around major subjects such as customer, vendor, product,
and activity. The alignment around subject areas affects the design and implementation of the data found in the data
warehouse.

Another important way in which the application oriented operational data differs from data warehouse data is in the
relationships of data. Operational data maintains an ongoing relationship between two or more tables based on a business
rule that is in effect. Data warehouse data spans a spectrum of time and the relationships found in the data warehouse are
vast.

Integration
The most important aspect of the data warehouse environment is data integration. The very essence of the data warehouse environment is that the data contained within the boundaries of the warehouse is integrated. The integration shows up in different ways: consistency of naming conventions, consistency in measurement variables, consistency in the physical attributes of the data, and so forth. Figure 3 shows the concept of integration in a data warehouse.
Time Variant:
All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse is very different from data found in the operational environment. In the operational environment data is accurate as of the moment of access; in other words, when you access a unit of data you expect it to reflect accurate values as of the moment of access. Because data in the data warehouse is accurate as of some moment in time (i.e., not "right now"), data found in the warehouse is said to be "time variant". Figure 4 shows the time variance of data warehouse data.

Time variance shows up in different ways. The simplest is that the data warehouse holds data over a time horizon of 10 to 15 years, whereas in the operational environment the time span is much shorter.

The second way that time variance shows up in the data warehouse is in the key structure. Every key structure in the data
warehouse contains - implicitly or explicitly - an element of time, such as day, week, month, etc. The element of time is
almost always at the bottom of the concatenated key found in the data warehouse.

The third way that time variance appears is that data warehouse data, once correctly recorded, cannot be updated. Data
warehouse data is, for all practical purposes, a long series of snapshots. Of course if the snapshot of data has been taken
incorrectly, then snapshots can be changed. But assuming that snapshots are made properly, they are not altered once
made.

Non Volatile:
Figure 5 explains the concept of non-volatility. It shows that updates (inserts, deletes, and changes) are done regularly to the operational environment on a record-by-record basis, but the basic manipulation of data that occurs in the data warehouse is much simpler. There are only two kinds of operations that occur in the data warehouse: the initial loading of data and the access of data. There is no update of data (in the general sense of update) in the data warehouse as a normal part of processing.
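These two characteristics are easy to see in a small, concrete schema. The sketch below is a hypothetical illustration (it is not taken from these notes): the snapshot date is an explicit element of the key, and the only operations used are the initial load and read access. The table and column names are assumptions made for the example.

# A minimal sketch of the time-variant, non-volatile idea: every row carries
# an explicit snapshot date in its key, and the only operations performed are
# the initial load (INSERT) and reads (SELECT).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_balance_snapshot (
        account_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,   -- element of time in the key
        balance       REAL    NOT NULL,
        PRIMARY KEY (account_id, snapshot_date)
    )
""")

# Initial load: one snapshot per account per period; existing snapshots are never updated.
conn.executemany(
    "INSERT INTO account_balance_snapshot VALUES (?, ?, ?)",
    [(101, "2023-01-31", 2500.00),
     (101, "2023-02-28", 2700.50),
     (102, "2023-01-31", 980.25)],
)

# Access: read balances as of a chosen moment in time.
for row in conn.execute(
    "SELECT account_id, balance FROM account_balance_snapshot WHERE snapshot_date = ?",
    ("2023-01-31",),
):
    print(row)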

Structure of a Data warehouse
Data warehouses have a distinct structure. There are different levels of summarization and detail. The structure of a data
warehouse is shown by Figure 6.

Older detail data is data that is stored on some form of mass storage. It is infrequently accessed and is stored at a level of detail consistent with current detailed data.

Lightly summarized data is data that is distilled from the low level of detail found at the current detailed level. This level of the data warehouse is almost always stored on disk storage.

Highly summarized data is compact and easily accessible. Sometimes the highly summarized data is found in the data warehouse environment, and in other cases it is found outside the immediate walls of the technology that houses the data warehouse.

The final component of the data warehouse is that of meta data. In many ways meta data sits in a different dimension than
other data warehouse data, because meta data contains no data directly taken from the operational environment. Meta
data plays a special and very important role in the data warehouse. Meta data is used as:

• a directory to help the DSS analyst locate the contents of the data warehouse,

• a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse
environment.

Meta data plays a much more important role in the data warehouse environment than it ever did in the classical
operational environment.

Flow of data

There is a normal and predictable flow of data within the data warehouse. Figure 7 shows that flow.

Data enters the data warehouse from the operational environment. Upon entering the data warehouse, data goes into the
current detail level of detail, as shown. It resides there and is used there until one of three events occurs:

• it is purged,

• it is summarized, and/or

• it is archived

The aging process inside a data warehouse moves current detail data to old detail data, based on the age of the data. As the data is summarized, it passes from current detail to lightly summarized data and then to highly summarized data.

Based on the above facts we realize that the data warehouse is not built all at once. Instead it is designed and populated one step at a time; it evolves, rather than being built in a revolutionary fashion. Building a data warehouse all at once would be very expensive and the results would not be very accurate, so it is always recommended that the environment be built using a step-by-step approach.

Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the data warehouse environment. Only two types of data operations are performed in data warehousing:

1. Data loading
2. Data access

Here are some major differences between an operational application and a data warehouse:

Operational Application: Complex programs must be coded to make sure that data upgrade processes maintain high integrity of the final product.
Data Warehouse: This kind of issue does not arise because data update is not performed.

Operational Application: Data is placed in a normalized form to ensure minimal redundancy.
Data Warehouse: Data is not stored in normalized form.

Operational Application: The technology needed to support transactions, data recovery, rollback, and deadlock resolution is quite complex.
Data Warehouse: It offers relative simplicity in technology.

Evolution of Decision Support Systems

Decision support systems are a class of computer-based information systems including knowledge based systems that
support decision making activities.

The Data Warehouse and Data Models


Before attempting to apply conventional database design techniques, the designer must understand the applicability and limitations of those techniques. The process model applies only to the operational environment because it is requirements driven, whereas the data model concept is applicable to both the operational and the data warehouse environment. We cannot use the process model for data warehousing because many development tools and requirements of the operational world do not apply to the data warehouse.

Corporate Data Model


The corporate data model focuses on and represents only primitive data. The corporate data model is the first step in constructing the data model for the data warehouse. A fair number of changes are made to the corporate data model when the data moves to the data warehouse environment. First, data that is used only in the operational environment is removed completely. Then the key structure of the data is enhanced by adding an element of time. Next, derived data is added to the corporate model where appropriate, and we determine whether the derived data changes continuously or only historically; whatever the scenario, the model is changed accordingly. Finally, data relationships in the operational environment are turned into artifacts in the data warehouse.

During this analysis we group the data that seldom changes and the data that changes regularly, and then perform a stability analysis to create groups of data with similar characteristics. The stability analysis is done as shown in the figure.

Data Warehouse Data Model


We have three levels of data modeling: high-level data modeling (the ER model), middle-level modeling (the DIS, or data item set) and low-level modeling (physical modeling).
In high-level modeling the entities and relationships are shown. The entities shown at the ERD level are at the highest level of abstraction. Which entities belong and which do not is determined by what is termed the "scope of integration". Separate high-level data models are created for different communities within the corporation, and collectively they make up the corporate ERD.

What is Data Modelling?

Data modeling (data modelling) is the process of creating a data model for the data to be stored in a database. The data model is a conceptual representation of data objects, the associations between different data objects, and the rules. Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data. Data models ensure consistency in naming conventions, default values, semantics and security while ensuring quality of the data.

A data model emphasizes what data is needed and how it should be organized rather than what operations need to be performed on the data. A data model is like an architect's building plan: it helps to build a conceptual model and set the relationships between data items.

The two types of data model techniques are:

1. Entity Relationship (E-R) Model

2. UML (Unified Modelling Language)

Why use Data Model?

The primary goals of using a data model are:

 Ensures that all data objects required by the database are accurately represented. Omission of data
will lead to creation of faulty reports and produce incorrect results.
 A data model helps design the database at the conceptual, physical and logical levels.
 Data Model structure helps to define the relational tables, primary and foreign keys and stored
procedures.
 It provides a clear picture of the base data and can be used by database developers to create a
physical database.
 It is also helpful to identify missing and redundant data.
 Though the initial creation of a data model is labor- and time-consuming, in the long run it makes IT infrastructure upgrades and maintenance cheaper and faster.

Types of Data Models

There are mainly three different types of data models:

1. Conceptual: This data model defines WHAT the system contains. It is typically created by business stakeholders and data architects. The purpose is to organize, scope and define business concepts and rules.
2. Logical: Defines HOW the system should be implemented regardless of the DBMS. It is typically created by data architects and business analysts. The purpose is to develop a technical map of rules and data structures.
3. Physical: This data model describes HOW the system will be implemented using a specific DBMS system. It is typically created by DBAs and developers. The purpose is the actual implementation of the database.

Conceptual Model

The main aim of this model is to establish the entities, their attributes, and their relationships. In this Data
modeling level, there is hardly any detail available of the actual Database structure.

The three basic tenets of a data model are:

Entity: A real-world thing

Attribute: Characteristics or properties of an entity

Relationship: Dependency or association between two entities

For example:

 Customer and Product are two entities. Customer number and name are attributes of the Customer entity.
 Product name and price are attributes of the Product entity.
 Sale is the relationship between the customer and the product.
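As a rough sketch of these three tenets (the class and attribute names below are assumptions added for illustration, beyond the Customer, Product and Sale names used in the example above), the entities, attributes and relationship could be expressed as follows:

# A minimal sketch of the Customer/Product/Sale example as Python dataclasses:
# two entities with attributes, and a Sale "relationship" that associates them.
from dataclasses import dataclass

@dataclass
class Customer:                 # entity
    customer_number: int        # attribute
    name: str                   # attribute

@dataclass
class Product:                  # entity
    product_name: str           # attribute
    price: float                # attribute

@dataclass
class Sale:                     # relationship between Customer and Product
    customer: Customer
    product: Product
    quantity: int

alice = Customer(customer_number=1, name="Alice")
soap = Product(product_name="Soap", price=2.50)
sale = Sale(customer=alice, product=soap, quantity=3)
print(sale)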

Characteristics of a conceptual data model

 Offers Organisation-wide coverage of the business concepts.


 This type of data model is designed and developed for a business audience.
 The conceptual model is developed independently of hardware specifications like data storage
capacity, location or software specifications like DBMS vendor and technology. The focus is to
represent data as a user will see it in the "real world."

Conceptual data models, also known as domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.

Logical Data Model

The logical data model adds further information to the conceptual model elements. It defines the structure of the data elements and sets the relationships between them.

The advantage of the Logical data model is to provide a foundation to form the base for the Physical model.
However, the modeling structure remains generic.

At this data modeling level, no primary or secondary key is defined. At this level, you need to verify and adjust the connector details that were set earlier for relationships.

Characteristics of a Logical data model

 Describes data needs for a single project but could integrate with other logical data models based on
the scope of the project.
 Designed and developed independently from the DBMS.

 Data attributes will have datatypes with exact precisions and lengths.
 Normalization is typically applied to the model up to 3NF.

Physical Data Model

A physical data model describes the database-specific implementation of the data model. It offers an abstraction of the database and helps generate the schema, because of the richness of metadata a physical data model offers.

This type of data model also helps to visualize the database structure. It helps to model database column keys, constraints, indexes, triggers, and other RDBMS features.

Characteristics of a physical data model:

 The physical data model describes data needs for a single project or application, though it may be integrated with other physical data models based on project scope.
 The data model contains relationships between tables that address cardinality and nullability of the relationships.
 It is developed for a specific version of a DBMS, location, data storage or technology to be used in the project.
 Columns should have exact datatypes, lengths assigned and default values.
 Primary and foreign keys, views, indexes, access profiles, authorizations, etc. are defined.
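A minimal sketch of such a physical model, continuing the hypothetical Customer/Product/Sale example with SQLite, is shown below; the exact datatypes, defaults, keys and index are assumptions chosen for illustration.

# A sketch of a physical model: exact datatypes, primary and foreign keys,
# defaults, and an index, created with SQLite. All object names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_number INTEGER PRIMARY KEY,
        name            VARCHAR(100) NOT NULL
    );

    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100) NOT NULL,
        price        DECIMAL(10, 2) NOT NULL DEFAULT 0.00
    );

    CREATE TABLE sale (
        sale_id         INTEGER PRIMARY KEY,
        customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
        product_id      INTEGER NOT NULL REFERENCES product(product_id),
        sale_date       DATE    NOT NULL,
        quantity        INTEGER NOT NULL DEFAULT 1
    );

    CREATE INDEX idx_sale_date ON sale(sale_date);
""")
print("physical schema created")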

Advantages and Disadvantages of Data Model:

Advantages of Data model:

 The main goal of designing a data model is to make certain that the data objects offered by the functional team are represented accurately.
 The data model should be detailed enough to be used for building the physical database.
 The information in the data model can be used for defining the relationship between tables, primary and foreign keys, and stored procedures.
 A data model helps the business communicate within and across organizations.
 A data model helps to document data mappings in the ETL process.
 It helps to identify the correct sources of data to populate the model.

Disadvantages of Data model:

 To develop a data model, one should know the physical characteristics of the stored data.
 A navigational structure of this kind makes application development and management more complex, and it requires detailed knowledge of the underlying data.
 Even a small change made in the structure requires modification in the entire application.
 There is no set data manipulation language in the DBMS.

Conclusion

 Data modeling is the process of developing data model for the data to be stored in a Database.
 Data Models ensure consistency in naming conventions, default values, semantics, security while
ensuring quality of the data.
 Data Model structure helps to define the relational tables, primary and foreign keys and stored
procedures.
 There are three types of data models: conceptual, logical, and physical.
 The main aim of conceptual model is to establish the entities, their attributes, and their relationships.
 Logical data model defines the structure of the data elements and set the relationships between them.
 A Physical Data Model describes the database specific implementation of the data model.
 The main goal of a designing data model is to make certain that data objects offered by the functional
team are represented accurately.
 The biggest drawback is that even a small change made in the structure requires modification in the entire application.

Granularity in the Data Warehouse

Granularity refers to the level of detail or summarization of the units of data in the data warehouse. The more detail there
is, the lower the level of granularity. The less detail there is, the higher the level of granularity. For example, a simple
transaction would be at a low level of granularity. A summary of all transactions for the month would be at a high level
of granularity. Granularity of data has always been a major design issue. In early operational systems, granularity was
taken for granted. When detailed data is being updated, it is almost a given that data be stored at the lowest level of
granularity.
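As a rough, hypothetical illustration of the two ends of this spectrum (the sample data and field names are assumptions, not part of the notes), the sketch below contrasts transaction-level detail with a monthly summary per account:

# A minimal sketch contrasting low granularity (one row per transaction)
# with high granularity (one summary row per account per month).
from collections import defaultdict

transactions = [  # low granularity: every individual transaction is kept
    {"account": 101, "date": "2023-01-05", "amount": 25.00},
    {"account": 101, "date": "2023-01-17", "amount": 40.00},
    {"account": 101, "date": "2023-02-02", "amount": 10.00},
    {"account": 102, "date": "2023-01-09", "amount": 99.99},
]

# High granularity: summarize to (account, month) -> total amount and count.
monthly_summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for t in transactions:
    key = (t["account"], t["date"][:7])        # e.g. (101, "2023-01")
    monthly_summary[key]["total"] += t["amount"]
    monthly_summary[key]["count"] += 1

print(len(transactions), "detail rows vs", len(monthly_summary), "summary rows")
for key, agg in sorted(monthly_summary.items()):
    print(key, agg)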

Major design issues of the data warehouse: granularity, partitioning, and proper design.

Determining the level of granularity is the most important design issue in the
data warehouse environment.

The single most important aspect and issue of the design of the data warehouse is the issue of granularity. It refers to the
detail or summarization of the units of data in the data warehouse. The more detail there is, the lower the granularity
level. The less detail there is, the higher the granularity level.

Granularity is a major design issue in the data warehouse as it profoundly affects the volume of data. The figure below
shows the issue of granularity in a data warehouse.

Granularity is most important to the data warehouse architect because it affects all the environments that depend on the data warehouse for data. The main issue with granularity is getting it at the right level: the level of granularity needs to be neither too high nor too low.

Raw Estimates
The starting point to determine the appropriate level of granularity is to do a rough estimate of the number of rows that
would be there in the data warehouse. If there are very few rows in the data warehouse then any level of granularity
would be fine. After these projections are made the index data space projections are calculated. In this index data
projection we identify the length of the key or element of data and determine whether the key would exist for each and
every entry in the primary table.
The raw estimate of the number of rows of data that will reside in the data warehouse tells the architect a great deal.
–If there are only 10,000 rows, almost any level of granularity will do.
–If there are 10 million rows, a low level of granularity is possible.
–If there are 10 billion rows, not only is a higher level of granularity needed, but a major portion of the data will probably
go into overflow storage.
Data in the data warehouse grows at a rate never seen before. The combination of historical data and detailed data produces a phenomenal growth rate; it was only after data warehouses appeared that terms such as terabyte and petabyte came into common use. As data keeps growing, some part of it becomes inactive; such data is sometimes called dormant data, and it is usually better to keep dormant data on external storage media.
Data stored externally is much less expensive to hold than data that resides on disk storage. However, because the data is external, it can be more difficult to retrieve, which causes performance issues, and these issues in turn affect the choice of granularity. It is usually the rough estimates that tell whether overflow storage should be considered or not.
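The raw estimate described above amounts to simple arithmetic. The sketch below is a hypothetical calculation; the row counts, row length and key length are assumed values, not figures from the notes.

# A minimal sketch of a raw estimate: project the number of rows, then
# estimate table and index space. All numbers are hypothetical.
rows_per_day = 250_000          # assumed business volume
years_of_history = 5
row_length_bytes = 120          # assumed average row length
key_length_bytes = 16           # assumed key (index entry) length

total_rows = rows_per_day * 365 * years_of_history
table_bytes = total_rows * row_length_bytes
index_bytes = total_rows * key_length_bytes   # assumes one index entry per row

gb = 1024 ** 3
print(f"estimated rows:        {total_rows:,}")
print(f"estimated table space: {table_bytes / gb:,.1f} GB")
print(f"estimated index space: {index_bytes / gb:,.1f} GB")

if total_rows > 10_000_000_000:
    print("-> a higher level of granularity is needed; plan for overflow storage")
elif total_rows > 10_000_000:
    print("-> a low level of granularity is still possible")
else:
    print("-> almost any level of granularity will do")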

Levels of Granularity
After this simple analysis is done, the next step is to determine the level of granularity for the data that resides on disk storage. Determining the level of granularity requires a degree of common sense and intuition. A very low level of granularity does not make sense, because many resources are needed to analyze and process the data; if the level of granularity is very high, some analysis has to be done on data that resides in external (overflow) storage. This is a tricky issue, so the practical way to handle it is to put data in front of the users and let them decide what type of data it should be. The figure below shows the iterative loop that needs to be followed.

The process that needs to be followed is:
 Build a small subset quickly based on the feedback
 Prototyping
 Looking what other people have done
 Working with experienced user
 Looking at what the organization has now
 Having sessions with the simulated output.

Dual levels of Granularity:

Sometimes there is a great need for efficiency in storing and accessing data, as well as for the ability to analyze the data in great detail. When an organization has huge volumes of data it makes sense to consider two or more levels of granularity in the detailed portion of the data warehouse. The figure below shows two levels of granularity in a data warehouse, using a phone company whose design fits the needs of most of its users. There is a huge amount of data at the operational level. Data up to 30 days old is stored in the operational environment; after that, the data shifts to the lightly and highly summarized zones.
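A rough sketch of this dual-granularity idea is given below; it is a hypothetical example (the call data, cutoff handling and names are assumptions) in which recent calls stay at full detail while older calls are rolled up per customer per day.

# A minimal sketch of dual levels of granularity for call data: calls from the
# last 30 days stay at full detail, older calls are rolled up to one summary
# row per (customer, day). The data is hypothetical.
from datetime import date, timedelta
from collections import defaultdict

today = date(2023, 3, 31)
cutoff = today - timedelta(days=30)

calls = [  # (customer_id, call_date, minutes)
    (1, date(2023, 3, 25), 12),
    (1, date(2023, 3, 26), 4),
    (1, date(2023, 1, 10), 7),
    (1, date(2023, 1, 10), 3),
    (2, date(2023, 2, 14), 20),
]

detail = [c for c in calls if c[1] >= cutoff]          # true archival detail

summary = defaultdict(lambda: {"calls": 0, "minutes": 0})
for cust, d, mins in calls:
    if d < cutoff:                                     # lightly summarized zone
        summary[(cust, d)]["calls"] += 1
        summary[(cust, d)]["minutes"] += mins

print("detail rows kept:", len(detail))
print("summary rows:", dict(summary))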

This use of granularity not only helps the data warehouse, it supports more than just the data marts: it also supports the processes of exploration and data mining. Exploration and data mining take masses of detailed historical data and examine them to uncover previously unknown patterns of business activity.

It is usually said that if both granularity and partitioning are done properly, then almost all other aspects of the data warehouse implementation come easily. Proper partitioning of the data allows the data to grow and to be managed.

Partitioning of data:

The main purpose of partitioning is to break up the data into small, manageable physical units. The main advantage is that the developer has greater flexibility in managing the physical units of data.

The main tasks that are carried out on partitioned data are as follows:

 Restructuring
 Indexing
 Sequential scanning
 Reorganization
 Recovery
 Monitoring

In short, the main aim of this activity is flexible access to data. Partitioning can be done in many different ways. One of the major issues facing the data warehouse developer is whether partitioning should be done at the system level or the application level. Partitioning at the system level is a function of the DBMS and, to some extent, the operating system.
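As a small, hypothetical sketch of application-level partitioning (the file and table names are assumptions), the code below writes each year's sales rows to its own physical unit:

# A minimal sketch of partitioning by year: each year's sales rows are written
# to their own physical unit (here, one SQLite file per year).
import sqlite3

sales = [  # (sale_date, amount)
    ("2022-11-03", 120.0),
    ("2023-01-15", 75.5),
    ("2023-06-20", 310.0),
]

for sale_date, amount in sales:
    year = sale_date[:4]
    conn = sqlite3.connect(f"sales_{year}.db")   # one small, manageable unit per year
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, amount REAL)")
    conn.execute("INSERT INTO sales VALUES (?, ?)", (sale_date, amount))
    conn.commit()
    conn.close()

# Each yearly partition can now be indexed, reorganized, archived or recovered
# independently of the others.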
Building a Data warehouse

There are two factors that drive you to build and use a data warehouse:
 Business factors:
Business users want to make decisions quickly and correctly using all available data.
 Technological factors:
 To address the incompatibility of operational data stores
 IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so building a data warehouse is feasible.

There are several things to be considered while building a successful data warehouse.

Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following two approaches:
 Top - Down Approach (Suggested by Bill Inmon)
 Bottom - Up Approach (Suggested by Ralph Kimball)

Top - Down Approach

In the top down approach suggested by Bill Inmon, we build a centralized storage area to house corporate wide business
data. This repository (storage area) is called Enterprise Data Warehouse (EDW). The data in the EDW is stored in a
normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data.
The data in the EDW is stored at the most detailed level. The reasons to build the EDW at the most detailed level are to provide:
1. Flexibility to be used by multiple departments.
2. Flexibility to provide for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.

Implement the top-down approach when:

1. The business has complete clarity on the data warehouse requirements for all or multiple subject areas.
2. The business is ready to invest considerable time and money.

The advantage of using the top-down approach is that we build a centralized repository to provide one version of truth for business data. This is very important for the data to be reliable and consistent across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented and the data marts to be built on top of it before they can access their reports.

Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to build a data warehouse. Here we
build the data marts separately at different points of time as and when the specific subject area requirements are clear.
The data marts are integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension or conformed fact is one that can be shared across data marts.
 A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. A conformed dimension means exactly the same thing with every fact table it is joined to.
 A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.
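A minimal sketch of a conformed dimension (hypothetical table and column names) is shown below: two data marts, a sales fact and a shipments fact, share the same date dimension and can therefore be analyzed together.

# A sketch of a conformed date dimension shared by two data marts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (                  -- conformed dimension
        date_key      INTEGER PRIMARY KEY,   -- e.g. 20230131
        calendar_date TEXT NOT NULL,
        month_name    TEXT NOT NULL
    );

    CREATE TABLE sales_fact (                -- data mart 1
        date_key INTEGER REFERENCES date_dim(date_key),
        amount   REAL
    );

    CREATE TABLE shipments_fact (            -- data mart 2
        date_key INTEGER REFERENCES date_dim(date_key),
        units    INTEGER
    );
""")
conn.execute("INSERT INTO date_dim VALUES (20230131, '2023-01-31', 'January')")
conn.execute("INSERT INTO sales_fact VALUES (20230131, 500.0)")
conn.execute("INSERT INTO shipments_fact VALUES (20230131, 42)")

# Because the dimension is conformed, the two marts can be analyzed together.
row = conn.execute("""
    SELECT d.month_name, SUM(s.amount), SUM(sh.units)
    FROM date_dim d
    JOIN sales_fact s      ON s.date_key = d.date_key
    JOIN shipments_fact sh ON sh.date_key = d.date_key
    GROUP BY d.month_name
""").fetchone()
print(row)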

The bottom-up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements are clear. We do not have to wait to know the overall requirements of the warehouse. We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear.

The advantage of using the bottom-up approach is that it does not require a high initial cost and has a faster implementation time; hence the business can start using the marts much earlier than with the top-down approach.

The disadvantage of using the bottom-up approach is that it stores data in denormalized form, so there can be high space usage for detailed data.

Design considerations
To be successful, a data warehouse designer must adopt a holistic approach, that is, consider all data warehouse components as parts of a single complex system and take into account all possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements share these common characteristics:
 They are based on a dimensional model
 They contain historical and current data
 They include both detailed and summarized data
 They consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build for the following reasons:
 Heterogeneity of data sources
 Use of historical data
 Growing nature of the database

The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to data warehouse design:

Data content
The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. The data warehouse data must be detailed data; it must be formatted, cleaned up and transformed to fit the warehouse data model.

Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or
subject areas. In other words, it must provide decision support oriented pointers to warehouse data and thus provides a
logical link between warehouse data and decision support applications.

Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow, so it becomes necessary to decide how the data should be divided across multiple servers and which users should get access to which types of data. The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).

Tools
A number of tools are available that are specifically designed to help in the implementation of the data warehouse. All
selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able
to use a common Meta data repository.

Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models

Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
 The hardware platform that would house the data warehouse
 The DBMS that supports the warehouse data
 The communication infrastructure that connects data marts, operational systems and end users
 The hardware and software to support the metadata repository
 The systems management framework that enables administration of the entire environment

Implementation considerations
The following logical steps needed to implement a data warehouse:
 Collect and analyze business requirements
 Create a data model and a physical design
 Define data sources
 Choose the database technology and platform
 Extract the data from the operational databases, transform it, clean it up and load it into the warehouse
 Choose database access and reporting tools
 Choose database connectivity software
 Choose data analysis and presentation software
 Update the data warehouse

Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose them is based on the type of data that can be selected using the tool and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:
 Simple tabular form data
 Ranking data
 Multivariable data
 Time series data
 Graphing, charting and pivoting data
 Complex textual search data
 Statistical analysis data
 Data for testing of hypothesis, trends and patterns
 Predefined repeatable queries
 Ad hoc user specified queries
 Reporting and analysis data
 Complex queries with multiple joins, multi level sub queries and sophisticated search criteria

Data extraction, clean up, transformation and migration


Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture. When implementing a data warehouse, the following selection criteria, which affect the ability to transform, consolidate, integrate and repair the data, should be considered:
 Timeliness of data delivery to the warehouse
 The tool must have the ability to identify the particular data that can be read by the conversion tool
 The tool must support flat files and indexed files, since much corporate data is still stored in these formats
 The tool must have the capability to merge data from multiple data stores
 The tool should have a specification interface to indicate the data to be extracted
 The tool should have the ability to read data from the data dictionary
 The code generated by the tool should be completely maintainable
 The tool should permit the user to extract the required data
 The tool must have the facility to perform data type and character set translation
 The tool must have the capability to create summarization, aggregation and derivation of records
 The data warehouse database system must be able to load data directly from these tools

Data placement strategies


 As a data warehouse grows, there are at least two options for data placement. One is to put some of the data in
the data warehouse into another storage media.
 The second option is to distribute the data in the data warehouse across multiple servers.

User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are
three classes of users:
Casual users: most comfortable retrieving information from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc reports. These users can engage in drill-down operations. These users may have experience of using reporting and query tools.
Expert users: tend to create their own complex queries and perform standard analysis on the information they retrieve. These users have knowledge about the use of query and report tools.

Benefits of data warehousing


Data warehouse usage includes,
 Locating the right info
 Presentation of info
 Testing of hypothesis
 Discovery of info
 Sharing the analysis
The benefits can be classified into two:
 Tangible benefits (quantified / measurable): these include,
 Improvement in product inventory
 Decrement in production cost
 Improvement in selection of target markets
 Enhancement in asset and liability management
 Intangible benefits (not easily quantified): these include,
 Improvement in productivity by keeping all data in single location and eliminating rekeying of data
 Reduced redundant processing
 Enhanced customer relation

Data warehouse Architecture

Components of Datawarehouse

1. Data sourcing, cleanup, transformation, and migration tools


2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system

Architecture

A data warehouse is an environment, not a product. It is based on relational database management system technology and functions as the central repository for informational data.

 The central repository of information is surrounded by a number of key components designed to make the environment functional, manageable and accessible.
 The data source for the data warehouse is the operational applications. The data entered into the data warehouse is transformed into an integrated structure and format.
 The transformation process involves conversion, summarization, filtering and condensation.
 The data warehouse must be capable of holding and managing large volumes of data, as well as different data structures, over time.

Seven Major components :-

 Data warehouse database


This is the central part of the data warehousing environment. This is the item number 2 in the above arch. diagram.
This is implemented based on RDBMS technology.

 Sourcing, Acquisition, Clean up, and Transformation Tools

This is item number 1 in the above arch diagram. They perform conversions, summarization, key changes, structural
changes and condensation. The data transformation is required so that the information can be used by decision support
tools.

The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, and SQL DDL code
etc., to move the data into data warehouse from multiple operational systems.

The functionalities of these tools are listed below:

 To remove unwanted data from operational database


 Converting to common data names and attributes
 Calculating summaries and derived data
 Establishing defaults for missing data
 Accommodating source data definition changes

Issues to be considered while data sourcing, cleanup, extract and transformation:

Database heterogeneity: this refers to differences between DBMSs, such as different data models, different access languages, different data navigation methods, operations, concurrency, integrity and recovery processes, etc.

Data heterogeneity: this refers to the different ways data is defined and used in different models. Vendors of tools in this area include Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.

Meta data

It is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two:
Technical Meta data:
It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks. It includes,
 Information about data stores
 Transformation descriptions, that is, mapping methods from the operational database to the warehouse database
 Warehouse object and data structure definitions for target data
 The rules used to perform clean-up and data enhancement
 Data mapping operations
 Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.

Business Meta data:


It contains information that helps users understand the information stored in the data warehouse. It includes:
 Subject areas and information object types, including queries, reports, images, video and audio clips, etc.
 Internet home pages
 Information related to the information delivery system
 Data warehouse operational information such as ownerships, audit trails, etc.

Meta data helps the users to understand content and find the data. Meta data are stored in a separate data stores which is
known as informational directory or Meta data repository which helps to integrate, maintain and view the contents of the
data warehouse.

The following lists the characteristics of info directory/ Meta data:


 It is the gateway to the data warehouse environment
 It supports easy distribution and replication of content for high performance and availability
 It should be searchable by business-oriented key words
 It should act as a launch platform for end users to access data and analysis tools
 It should support the sharing of information
 It should support scheduling options for requests
 It should support and provide interfaces to other applications
 It should support end-user monitoring of the status of the data warehouse environment

 Access tools
Its purpose is to provide info to business users for decision making. There are five categories:
 Data query and reporting tools
 Application development tools
 Executive info system tools (EIS)
 OLAP tools
 Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools:
 Production reporting tools, used to generate regular operational reports
 Desktop report writers, which are inexpensive desktop tools designed for end users
Managed query tools are used to generate SQL queries. They use a meta-layer of software between users and the database, which offers point-and-click creation of SQL statements. These tools are a preferred choice of users for segment identification, demographic analysis, territory management, preparation of customer mailing lists, etc.

Application development tools: a graphical data access environment which integrates OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze the data in multidimensional and complex views. To enable multidimensional properties they use MDDB and MRDB, where MDDB refers to a multidimensional database and MRDB refers to a multirelational database.
Data mining tools: used to discover knowledge from the data warehouse data; they can also be used for data visualization and data correction purposes.

Data marts
Data marts are departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following situations:
 Extremely urgent user requirements
 The absence of a budget for a full-scale data warehouse strategy
 The decentralization of business needs
 The attraction of easy-to-use tools and a mind-sized project

Data mart presents two problems:


1. Scalability: A small data mart can grow quickly in multiple dimensions, so while designing it the organization has to pay attention to system scalability, consistency and manageability issues.
2. Data integration

Data warehouse administration and management


The management of data warehouse includes,
 Security and priority management
 Monitoring updates from multiple sources.
 Data quality checks
 Managing and updating meta data
 Auditing and reporting data warehouse usage and status
 Purging data
 Replicating, sub setting and distributing data
 Backup and recovery
 Data warehouse storage management which includes capacity planning, hierarchical storage management and
purging of aged data etc.,

Information delivery system

• It is used to enable the process of subscribing for data warehouse information.
• Delivery to one or more destinations according to a specified scheduling algorithm

Metadata

Metadata is one of the most important aspects of data warehousing. It is data about data stored in the warehouse and its
users.

Metadata contains:

i. The location and description of warehouse system and data components (warehouse objects).
ii. Names, definitions, structure and content of the data warehouse and end-user views.
iii. Identification of reliable data sources (systems of record).
iv. Integration and transformation rules used to generate the data warehouse; these include the mapping method from operational databases into the warehouse, and the algorithms used to convert, enhance, or transform data.
v. Integration and transformation rules used to deliver data to end-user analytical tools.
vi. Subscription information for the information delivery to the analysis subscribers.
vii. Data warehouse operational information, which includes a history of warehouse updates, refreshments, snapshots, versions, ownership authorizations and extract audit trail.
viii. Metrics used to analyze warehouse usage and performance and end-user usage patterns.
ix. Security: authorizations, access control lists, etc.
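As a rough sketch of what one such metadata entry might look like (the structure and field names below are assumptions made for illustration, not a standard format), a source-to-target mapping record could be represented as follows:

# A minimal sketch of one technical metadata entry: a source-to-target mapping
# with its transformation rule, of the kind a metadata repository would store
# and make searchable. All names and values are hypothetical.
mapping_entry = {
    "target_table": "sales_fact",
    "target_column": "amount_usd",
    "source_system": "ORDERS_OLTP",          # system of record
    "source_table": "order_line",
    "source_column": "line_amount",
    "transformation_rule": "line_amount * fx_rate_to_usd(order_date)",
    "refresh_schedule": "daily 02:00",
    "owner": "finance_dw_team",
}

metadata_repository = [mapping_entry]

# A DSS analyst could use the repository as a directory, e.g. to find out
# where a warehouse column comes from:
for entry in metadata_repository:
    if entry["target_column"] == "amount_usd":
        print(entry["source_system"], entry["source_table"], entry["source_column"])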

Metadata Interchange Initiative
In a data warehouse environment, different tools must be able to freely and easily access, and in some cases manipulate and update, metadata that was created by other tools and stored in a variety of different storage facilities. The way to achieve this goal is to establish at least a minimum common set of interchange standards and guidelines that different vendors' tools can fulfill. This effort, driven by the data warehousing vendors, is known as the Metadata Interchange Initiative.
The metadata interchange standard defines two different meta models:
 The application meta model — the tables, etc., used to "hold" the metadata for a
particular application.
 The metadata meta model — the set of objects that the metadata interchange standard can be used to describe.
These represent the information that is common to one or more classes of tools, such as data extraction tools, replication
tools, user query tools and database servers.

Metadata interchange standard framework – (architecture):

This defines three approaches.


• Procedural approach: tools use an API (Application Program Interface) to create, update, access, and otherwise interact with metadata; the standard is realized in terms of a standard metadata implementation that the API works against.
• ASCII batch approach: this approach depends on an ASCII file format that contains the description of the metadata components and the standardized access requirements that make up the interchange standard metadata model.
• Hybrid approach: a data-driven model in which a table-driven API supports only fully qualified references for each metadata element; a tool interacts with the API through the standard access framework and directly accesses just the specific metadata objects it needs.
The components of metadata interchange standard framework are:
• The standard metadata model, which refers to the ASCII file format used to represent the metadata that is being
exchanged.
• The standard access framework, which describes the minimum number of API functions a vendor must support
• Tool profile, which is provided by each tool vendor. The tool profile is a file that describes what aspects of the
interchange standard metamodel a particular tool supports.
• The user configuration, which is a file describing the legal interchange paths for metadata in the user's environment.
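Purely as an illustration of the tool profile idea (the file structure and field names below are assumptions, not the actual standard format), a vendor-supplied profile might look like this:

# An illustrative sketch of a tool profile: a file, supplied by a tool vendor,
# describing which parts of the interchange standard metamodel the tool supports.
import json

tool_profile = {
    "tool_name": "ExampleExtractTool",
    "vendor": "ExampleVendor",
    "metamodel_version": "1.0",
    "supported_objects": ["source_table", "target_table", "column_mapping"],
    "unsupported_objects": ["report_definition"],
    "access_framework_functions": ["get_object", "put_object", "list_objects"],
}

with open("example_tool_profile.json", "w") as f:
    json.dump(tool_profile, f, indent=2)

print(open("example_tool_profile.json").read())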

Metadata Repository (storage)
• The data warehouse architecture framework includes the metadata interchange framework as one of its components.
• It defines a number of components, all of which interact with each other via the architecturally defined layer of metadata.
Metadata repository management software can be used to map the source data to the target database, generate code for data transformations, integrate and transform the data, and control moving data to the warehouse.
Metadata defines the contents and location of data (the data model) in the warehouse, the relationships between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools.
A data warehouse design ensures a mechanism for maintaining the metadata repository, and all access paths to the data warehouse must have metadata as an entry point. There are a variety of access paths into the data warehouse, and many tool classes can be involved in the process.

Meta Data repository provides the following benefits.


It provides a complete set of tools for metadata management.
It reduces and eliminates information redundancy and inconsistency.
It simplifies management and improves organization, control and accounting of information assets.
It increases identification, understanding, coordination and utilization of enterprise-wide information assets.
It provides effective data administration tools to manage corporate information assets with a full-function data dictionary.
It increases flexibility, control, and reliability of the application development process and steps up internal application development.
It controls investment in legacy systems with the ability to inventory and utilize existing applications.
It provides a universal relational model for heterogeneous RDBMSs to interact and share information.
It implements CASE development standards and eliminates redundancy with the ability to share and reuse metadata.
Metadata Management
A major problem in data warehousing is the inability to communicate to the end user what information resides in the data warehouse and how it can be accessed.

• Metadata can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformation.
• Metadata needs to be collected as the warehouse is designed and built.
• Even though there are a number of tools available to help users understand and use the warehouse, these tools need to be carefully evaluated before any purchasing decision is made.
Implementation Examples
Platinum technologies, R&O, Prism solutions and Logic works.
Metadata Trends
The data warehouse arena must increasingly include external data within the data warehouse, because the warehouse must reduce costs and increase competitiveness and business agility. The process of integrating external and internal data into the warehouse faces a number of challenges:
 Inconsistent data formats
 Missing or invalid data
 Different levels of aggregation
 Semantic inconsistency
 Unknown or questionable data quality and timeliness
Data warehouses integrate various data types, such as alphanumeric data, text, voice, image, full-motion video, and web pages in HTML format.

COURSE CODE: 1151CS114
COURSE TITLE: DATA WAREHOUSING AND DATA MINING
L T P C: 3 0 0 3

UNIT-II

CO Nos.: CO2
Course Outcome: Analyze OLAP tools
Level of learning domain (based on revised Bloom's taxonomy): K2

Correlation of COs with Programme Outcomes (PO1–PO12) and Programme Specific Outcomes (PSO1–PSO3):

CO2: M, L, M

UNIT II BUSINESS ANALYSIS L-9

Mapping the Data Warehouse to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction, Cleanup, and Transformation Tools – Reporting and Query Tools and Applications – Online Analytical Processing (OLAP) – Need – Multidimensional Data Model – OLAP Guidelines – Multidimensional versus Multirelational OLAP – Categorization of OLAP Tools.

Mapping the data warehouse architecture to Multiprocessor Architecture

Relational data base technology for data warehouse
The functions of the data warehouse are based on relational database technology, implemented in a parallel manner. There are two advantages of having parallel relational database technology for the data warehouse:
 Linear speed-up: the ability to increase the number of processors so as to reduce response time proportionally.
 Linear scale-up: the ability to provide the same performance on the same requests as the database size increases.
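A small, hypothetical calculation (the timings below are assumed, not measured) illustrates the two measures: speed-up compares elapsed time for a fixed workload as processors are added, while scale-up compares elapsed time as the database and, in the usual formulation, the hardware grow together.

# A minimal sketch of the speed-up and scale-up measures.

# Speed-up: same workload, more processors.
time_small_system = 200.0    # seconds with 4 processors
time_big_system = 55.0       # seconds with 16 processors (4x the hardware)
speedup = time_small_system / time_big_system
print(f"speed-up = {speedup:.2f} (ideal linear speed-up here would be 4.00)")

# Scale-up: workload grows with the hardware; elapsed time should stay flat.
time_small_workload_small_system = 200.0   # 1x data on 1x hardware
time_big_workload_big_system = 210.0       # 4x data on 4x hardware
scaleup = time_small_workload_small_system / time_big_workload_big_system
print(f"scale-up = {scaleup:.2f} (ideal linear scale-up would be 1.00)")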

Types of parallelism. There are two types of parallelism:

Interquery parallelism: different server threads or processes handle multiple requests at the same time.
Intraquery parallelism: this form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join and sort. These lower-level operations are then executed concurrently, in parallel.

Intra query parallelism can be done in either of two ways:

Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

Vertical parallelism: This occurs among different tasks. All query components such as scan, join, sort etc are executed
in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task.
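A minimal sketch of horizontal parallelism is shown below (the data, partitioning and worker count are assumptions): the same scan operation runs concurrently on different partitions of the data, and the partial results are then merged.

# A sketch of horizontal (partitioned) parallelism: the same scan task runs
# concurrently on different partitions; partial sums are merged at the end.
from multiprocessing import Pool

partitions = [                      # the "database" split across 3 partitions
    [("east", 120.0), ("east", 80.0)],
    [("west", 55.0), ("west", 200.0)],
    [("north", 10.0)],
]

def scan_partition(rows):
    """Scan one partition and return its partial sum (the low-level operation)."""
    return sum(amount for _, amount in rows)

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        partial_sums = pool.map(scan_partition, partitions)   # parallel scans
    print("partial sums:", partial_sums)
    print("total:", sum(partial_sums))                        # final merge step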

Data partitioning:

Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.

Random partitioning includes random data striping across multiple disks on a single server. Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.

Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:
Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.
Key range partitioning: rows are placed and located in the partitions according to the value of the partitioning key; for example, all rows with key values from A to K are in partition 1, L to T in partition 2, and so on.
Schema partitioning: an entire table is placed on one disk, another table is placed on a different disk, etc. This is useful for small reference tables.

User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression.
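As a small, hypothetical sketch of hash and key range partitioning (the key values, ranges and partition counts are assumptions), the two schemes can be expressed as simple functions:

# A minimal sketch of two intelligent partitioning schemes.
import hashlib

NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    """Hash partitioning: partition number derived from a hash of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def key_range_partition(key: str) -> int:
    """Key range partitioning: A-K -> 1, L-T -> 2, everything else -> 3."""
    first = key[:1].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

for customer in ["Anderson", "Lopez", "Zhang"]:
    print(customer, "hash ->", hash_partition(customer),
          "range ->", key_range_partition(customer))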

Data base architectures of parallel processing


There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything Architecture
2. Shared disk architecture
3. Shared nothing architecture

Shared Memory Architecture

Tightly coupled shared memory systems, illustrated in following figure have the following characteristics:
 Multiple PUs share memory.
 Each PU has full access to all shared memory through a common bus.
 Communication between nodes occurs via shared memory.
 Performance is limited by the bandwidth of the memory bus.

Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle
Parallel Server in a tightly coupled system, where memory is shared among the multiple PUs, and is accessible by all the
PUs through a memory bus.

Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.

Performance is potentially limited in a tightly coupled system by a number of factors. These include various system
components such as the memory bandwidth, PU to PU communication bandwidth, the memory available on the system,
the I/O bandwidth, and the bandwidth of the common bus.

Parallel processing advantages of shared memory systems are these:


 Memory access is cheaper than inter-node communication. This means that internal synchronization is faster than
using the Lock Manager.
 Shared memory systems are easier to administer than a cluster.

A disadvantage of shared memory systems for parallel processing is as follows:


 Scalability is limited by bus bandwidth and latency, and by available memory.

33
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following figure, have the following
characteristics:
 Each node consists of one or more PUs and associated memory.
 Memory is not shared between nodes.
 Communication occurs over a common high-speed bus.
 Each node has access to the same disks and other resources.
 A node can be an SMP if the hardware supports it.
 Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.

The cluster illustrated in figure is composed of multiple tightly coupled nodes. The Distributed Lock Manager (DLM ) is
required. Examples of loosely coupled systems are VAXclusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache consistency must be
maintained across the nodes and a lock manager is needed to maintain the consistency. Additionally, instance locks using
the DLM on the Oracle level must be maintained to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance
impact is dependent on the hardware and software components, such as the bandwidth of the high-speed bus through
which the nodes communicate, and DLM performance.

Parallel processing advantages of shared disk systems are as follows:


 Shared disk systems permit high availability. All data is accessible even if one node dies.
 These systems have the concept of one database, which is an advantage over shared nothing systems.
 Shared disk systems provide for incremental growth.

Parallel processing disadvantages of shared disk systems are these:


 Inter-node synchronization is required, involving DLM overhead and greater dependency on high-speed interconnect.
 If the workload is not partitioned well, there may be high synchronization overhead.
 There is operating system overhead of running shared disk software.

Shared Nothing Architecture

34
Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given
disk. If a table or database is located on that disk, access depends entirely on the PU which owns it. Shared nothing
systems can be represented as follows:

Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless, adding more PUs and
disks can improve scaleup. Oracle Parallel Server can access the disks on a shared nothing system as long as the
operating system provides transparent disk access, but this access is expensive in terms of latency.

Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
 Shared nothing systems provide for incremental growth.
 System growth is practically unlimited.
 MPPs are good for read-only databases and decision support applications.
 Failure is local: if one node fails, the others stay up.
Disadvantages
 More coordination is required.
 More overhead is required for a process working on a disk belonging to another node.
 If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile
to consider data-dependent routing to alleviate contention.

Data base architectures of parallel processing


 Scope and techniques of parallel DBMS operations
 Optimizer implementation
 Application transparency
35
 Parallel environment which allows the DBMS server to take full advantage of the existing facilities on a very low level

DBMS management tools help to configure, tune, admin and monitor a parallel RDBMS as effectively as if it were a
serial RDBMS
 Price / Performance: The parallel RDBMS can demonstrate a non linear speed up and scale up at reasonable costs.

Parallel DBMS vendors

Oracle: Parallel Query Option (PQO)


Architecture: shared disk arch
Data partition: Key range, hash, round robin
Parallel operations: hash joins, scan and sort
Informix: eXtended Parallel Server (XPS)
Architecture: Shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELELTE
IBM: DB2 Parallel Edition (DB2 PE)
Architecture: Shared nothing models
Data partition: hash
Parallel operations: INSERT, UPDATE, DELELTE, load, recovery, index creation, backup, table reorganization

SYBASE: SYBASE MPP


Architecture: Shared nothing models
Data partition: hash, key range, Schema
Parallel operations: Horizontal and vertical parallelism

DBMS schemas for decision support.

The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a collection of related data
items, consisting of measures and context data. It typically represents business items or business transactions. A
dimension is a collection of data that describe one business dimension. Dimensions determine the contextual background
for the facts; they are the parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact,
representing the performance or behavior of the business relative to the dimensions.

Considering Relational context, there are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema

Star schema
The multidimensional view of data that is expressed using relational data base semantics is provided by the data base
schema design called star schema. The basic of stat schema is that information can be classified into two groups:
 Facts
 Dimension
Star schema has one large central table (fact table) and a set of smaller tables (dimensions) arranged in a radial pattern
around the central table.
Facts are core data element being analyzed while dimensions are attributes about the facts.

36
The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch,
and location.

Each dimension in a star schema is represented with only one-dimension table.


 This dimension table contains the set of attributes.
 There is a fact table at the center. It contains the keys to each of four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.

The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram
resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star
are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional
tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly used
nowadays and is recommended by Oracle.

Fact Tables

37
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of
foreign keys from the primary keys of related dimension tables. A fact table typically has two types of columns: foreign
keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact's data on detail or
aggregated level.

Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed
by a Time dimension (profit by month, quarter, year), Region dimension (profit by country, state, city), Product
dimension (profit for product1, product2).

Typical fact tables store data about sales while dimension tables data about geographic region (markets, cities), clients,
products, times, channels.
Measures are numeric data based on columns in a fact table. They are the primary data which end users are interested in.
E.g. a sales fact table may contain a profit measure which represents profit on each sale.
The main characteristics of star schema:
 Simple structure -> easy to understand schema
 Great query effectives -> small number of tables to join
 Relatively long time of loading data into dimension tables -> de-normalization, redundancy data caused that size of the
table could be large.
 The most commonly used in the data warehouse implementations -> widely supported by a large number of business
intelligence tools

Potential Performance Problems with star schemas.

The star schema suffers the following performance problems.


1.Indexing
Multipart key presents some problems in the star schema model.
(day->week-> month-> quarter-> year )
• It requires multiple metadata definition ( one for each component) to design a single table.
• Since the fact table must carry all key components as part of its primary key, addition or deletion of levels in the
hierarchy will require physical modification of the affected table, which is time-consuming processed that limits
flexibility.
• Carrying all the segments of the compound dimensional key in the fact table increases the size of the index, thus
impacting both performance and scalability.

38
2. Level Indicator.
The dimension table design includes a level of hierarchy indicator for every record.
Every query that is retrieving detail records from a table that stores details and aggregates must use this indicator as an
additional constraint to obtain a correct result.
The user is not and aware of the level indicator, or its values are in correct, the otherwise valid query may result in a
totally invalid answer.
Alternative to using the level indicator is the snowflake schema. Aggregate fact tables are created separately from detail
tables. Snowflake schema contains separate fact tables for each level of aggregation.

Other problems with the star schema design - Pairwise Join Problem

5 tables require joining first two tables, the result of this join with third table and so on. The intermediate result of every
join operation is used to join with the next table. Selecting the best order of pairwise joins rarely can be solve in a
reasonable amount of time.

Five-table query has 5!=120 combinations

2 .Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one relationships
among sets of attributes of a dimension can separate new dimension tables, forming a hierarchy. The decomposed
snowflake structure visualizes the hierarchical structure of dimensions very well.

3.Fact constellation schema: For each star schema it is possible to construct fact constellation schema(for example by
splitting the original star schema into more star schemes each of them describes facts on another level of dimension
hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main
shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of
aggregation must be considered and selected. Moreover, dimension tables are still large.

39
4.2 STAR join and STAR Index.
A STAR join is high-speed, single pass, parallelizable muti-tables join method. It performs many joins by single
operation with the technology called Indexing. For query processing the indexes are used in columns and rows of the
selected tables.
Red Brick's RDBMS indexes, called STAR indexes, used for STAR join performance. The STAR indexes are created on
one or more foreign key columns of a fact table. STAR index contains information that relates the dimensions of a fact
table to the rows that contains those dimensions. STAR indexes are very space-efficient. The presence of a STAR index
allows Red Brick's RDBMS to quickly identify which target rows of the fact table are of interest for a particular set of
dimension. Also, because STAR indexes are created over foreign keys, no assumptions are made about the type of
queries which can use the STAR indexes.

4.3 Bit Mapped Indexing


 SYBASE IQ is an example of a product that uses a bit mapped index structure of the data
stored in the SYBASE DBMS.
 Sybase released SYBASE IQ database targeted an "ideal" data mart solution for handle
multi user adhoc(unstructured) queries.

Over view:
 SYBASE IQ is a separate SQL database.
 Once loaded, SYBASE IQ converts all data into a series of bit maps, which are then highly compressed and stored on
disk.
 SYBASE positions SYBASE IQ as a read only database for data marts, with a practical size limitations currently
placed at 100 Gbytes.

Data cardinality: Bitmap indexes are used to optimize queries against low- cardinality data
— that is, data in which the total number of possible values is relatively low.
(Cardinal meaning – important)

40
Fig: - Bitmap index
For example, address data cardinality pin code is 50 (50 possible values), and gender data cardinality is only 2 (male and
female)..
If the bit for a given index is "on", the value exists in the record. Here, a 10,000 — row employee table that contains the
"gender" column is bitmap-indexed for this value.
Bitmap indexes can become bulky and even unsuitable for high cardinality data where the range of possible values is
high. For example, values like "income" or "revenue" may have an almost infinite number of values.
SYBASE IQ uses a patented technique called Bit-wise technology to build bitmap indexes for high-cardinality data.
Index types: The first release of SYBASE IQ provides five index techniques.

 Fast projection index


 A low-or high cardinality index.
 Low fast index, involves functions like SUM, AVERAGE and COUNTS.
 Low Disk index, involves disk space usage.
 High group and high non-group index.
SYBASE IQ advantages/Performance:
 Bitwise technology
 Compression
 Optimized memory-based processing
 Column wise processing
 Low operating cost
 Large block I/O
Operating-system-level parallelism
Prejoin and ad hoc join capabilities
Disadvantages of SYBASE IQ indexing:
 No updates
 Lack of core RDBMS features
 Less advantageous for planned queries
 High memory usage

Column Local Storage


.
Thinking Machine Corporation has developed CM-SQL RDBMS product, this approach is
based on storing data column-wise, as opposed to traditional row wise storage.

41
A traditional RDBMS approach to storing data in memory and on the disk is to store it one row at a time, and each row
can be viewed and accessed a single record. This approach works well for OLTP environments in which a typical
transaction access a record at a time.
However, for a set processing adhoc query environment in data warehousing the goal is to retrieve multiple values of
several columns. For example, if a problem is to calculate average, maximum and minimum salary, the column wise
storage of the salary field requires a DBMS to read only one record.

42
Data Extraction, Cleanup, and Transformation Tools

Tool Requirements .
The tools that provide data contents and formats from operational and external data stores into the data warehouse
includes following tasks.
• Data transformation - from one format to another on possible differences between the source and target platforms.
• Data transformation and calculation - based on the application of business rules.
• Data consolidation and integration,- which include combining several source records into a single record to be loaded
into the warehouse.
• Metadata synchronization and management- which includes storing and/or updating meta data definitions about source
data files, transformation actions, loading formats, and events, etc.
The following are the Criteria’s that affects the Tools ability to transform, consolidate, integrate and repair the data.
1. The ability to identify data - in the data source environments that can be read by the conversion tool is important.
2. Support for flat files, indexed files is critical. eg. VSAM , IMS and CA-IDMS
3. The capability to merge data from multiple data stores is required in many installations.
4. The specification interface to indicate the data to be extracted and the conversion criteria is important.
5. The ability to read information from data dictionaries or import information from warehouse products is desired.
6. The code generated by the tool should be completely maintainable from within the development environment.
7. Selective data extraction of both data elements and records enables users to extract only the required data.
8. A field-level data examination for the transformation of data into information is needed.
9. The ability to perform data-type and character-set translation is a requirement when moving data between incompatible
systems.
10. The capability to create summarization, aggregation and derivation records and field is very important.
11. Vendor stability and support for the product items must be carefully evaluated.

Vendor Approaches
Integrated solutions can fall into one of the categories described below.

43
• Code generators create modified 3GL/4GL programs based on source, target data definitions, data transformation,
improvement rules defined by the developer. This approach reduces the need for an organization to write its own data
capture,
transformation, and load programs.
•Database data replication tools utilize database triggers or a recovery log to capture changes to a single data source on
one system and apply the changes to a copy of the source data located on a different systems.
• Rule-driven-dynamic transformation engines ( data mart builders). Capture data from a source system at user defined
intervals, transforms the data, and then send and load the results into a target environment, typically a data mart.

Access to legacy Data


With Enterprise/Access, legacy systems on virtually any platform can be connected to a new data warehouse via
client/server interfaces without the significant time, cost, or risk involved in reengineering application code.
Enterprise/Access provides a three-tiered architecture that defines applications partitioned with new-term integration and
long-term migration objectives.
• The data layer provides - data access and transaction services for management of corporate data assets. This layer is
independent of any current business process or user interface application. It manages the data and implements the
business rules for data integrity
. • The process layer - provides services to manage automation and support for current business processes. It allows
modification of the supporting application logic independent of the necessary data or user interface.
The user layer - manages user interaction with process and/or data layer services. It allows the user interface to change
independently of the basic business processes.

Vendor Solution
• Prism solutions
• SAS Institute
• Validity Corporation
• Information Builders
Prism solutions: While Enterprise/Access focuses on providing access to legacy data; Prism warehouse manager
provides a solution for data warehousing by mapping source data to a target dbms to be used as a warehouse.
Prism warehouse manager can extract data from multiple source, environments, including DB2, IDMS, IMS, VSAM,
RMS, and sequential files under UNIX or MVS. It has strategic relationship with pyramid and Informix.

SAS institute:
SAS starts with the basis of critical data still resides in the data center and offer its traditional SAS system tools to serve
at data warehousing functions. Its data repository function can act to build the informational database.
SAS Data Access Engines serve as extraction tools to combine common variables, transform data representation forms
for consistency, consolidate redundant data, and use business rules to produce computed values in the warehouse.
SAS engines can work with hierarchical and relational database and sequential files.

Validity Corporation:
Validity Corporation's Integrity data reengineering tool is used to investigate, standardize, transform and integrate data
from multiple operational systems and external sources.
Integrity is a specialized, multipurpose data tool that organizations apply on projects such as:
• Data audits
• Data warehouse and decision support systems
• Customer information files and house holding applications
• Client/Server business applications such as SAP R/S, Oracle and Hogan
• System consolidations.

Information builders:
A product that can be used as a component for data extraction, transformation and legacy access tool suite for building
data warehouse is EDA/SQL from information builders.
 EDA/SQL implements a client/server model that is optimized for higher performance
 EDA/SQL supports copy management, data quality management, data replication capabilities, and standards
support for both ODBC and the X/Open CLI.

44
Transformation Engines
1.Informatica:
This is a multicompany metadata integration idea. Informatica joined services with Andyne, Brio, Business objects,
Cognos, Information Advantage, Info space, IQ software and Microstrategy to deliver a "back-end" architecture and
publish AFI specifications supporting its technical and business metadata.

2. Power Mart:
Informatica's flagship product — PowerMart suite — consists of the following components.
• Power Mart Designer
• Power Mart server
• The Informatica Server Manager
• The Informatica Repository
• Informatica Power Capture

3. Constellar:
The constellar Hub consists of a set of components supporting the distributed transformation management capabilities.
The product is designed to handle the movement and transformation of data for both data migration and data distribution,
in an operational system, and for capturing operational data for loading into a data warehouse.

The transformation hub performs the tasks of data cleanup and transformation.
The Hub Supports:
 Record reformatting and restructuring.
 Field level data transformation, validation and table look up.
 File and multi-file set-level data transformation and validation.
 The creation of intermediate results for further downstream transformation by the hub.

45
Reporting and query tools for data analysis:-

The principal purpose of data warehousing is to provide information to business users for strategic decision making.
These users interact with the data warehouse using front-end tools, or by getting the required information through the
information delivery system.

Tool Categories
There are five categories of decision support tools
1. Reporting
2. Managed query
3. Executive information systems (EIS)
4. On-line analytical processing (OLAP)
5. Data mining (DM)
Reporting tools:
Reporting tools can be divided into production reporting tools and desktop report writers.
Production reporting tools: Companies generate Production reporting tools for regular
operational reports or support high-volume batch jobs. E.g calculating and printing pay
checks.
Production reporting tools include third-generation languages such as COBOL, specialized
fourth-generation languages, such as Information Builders, Inc.'s Focus, and high-end
client/server tools, such as MITI'S SQL.
Report writers: Are inexpensive desktop tools designed for end users. Products such as Seagate software's crystal
reports allows users to design and run reports without having to rely on the IS department.
In general, report writers have graphical interfaces and built-in charting functions, They can pull groups of data from a
variety of data sources and integrate them in a single report.
Leading report writers include Crystal Reports, Actuate and Platinum Technology, Inc's Info Reports. Vendors are trying
to increase the scalability of report writers by supporting thGuiree-tiered architectures in which report processing is done
on a Windows NT or UNIX server.

46
Report writers also are beginning to offer object-oriented interfaces for designing and manipulating reports and modules
for performing ad hoc queries and OLAP analysis.

Users and related activities


User Activity Tools
Clerk Simple retrieval 4GL
EIS
Executive Exception reports
4GL
Manager Simple retrieval
Spreadsheets; OLAP, data
Business analysts Complex analysis
mining

Managed query tools:


Managed query tools protect end users from the complexities of SQL and database structures by inserting a metalayer
between users and the database. Metalayer is the software that provides subject-oriented views of a database and supports
point-and-click creation of SQL. Some vendors, such as Business objects, Inc., call this layer a "universe". Managed
query tools have been extremely popular because they make it possible for knowledge workers to access corporate data
without IS intervention.
Most managed query tools have embraced three-tiered architectures to improve scalability. Managed query tool vendors
are racing to embed support for OLAP and Data mining features.

Other tools are IQ software's IQ objects, Andyne Computing Ltd,'s GQL, IBM's Decision Server, Speedware Corp's
Esperant (formerly sold by software AG), and Oracle Corp's Discoverer/2000.

Executive Information System tools:


Executive Information System (EIS) tools earlier than report writers and managed query tools they were first install on
mainframes.
EIS tools allow developers to build customized, graphical decision support applications or "briefing books".
• EIS applications highlight exceptions to normal business activity or rules by using color — coded graphics.

EIS tools include pilot software, Inc.'s Light ship, Platinum Technology's Forest and Trees,
Comshare, Inc.'s Commander Decision, Oracle's Express Analyzer and SAS Institute, Inc.'s
SAS/EIS.
EIS vendors are moving in two directions.
 Many are adding managed query functions to compete head-on with other –decision support tools.
 Others are building packaged applications that address horizontal functions, such as sales budgeting, and marketing, or
vertical industries such as financial services.
Ex: Platinum Technologies offers Risk Advisor.

OLAP tools:
It provides a sensitive way to view corporate data.
These tools aggregate data along common business subjects or dimensions and then let users navigate through the
hierarchies and dimensions with the click of a mouse button.
Some tools such as Arbor software Corp.'s Essbase , Oracle's Express, pre aggregate data in special multi dimensional
database.
Other tools work directly against relational data and aggregate data on the fly, such as Micro-strategy, Inc.'s DSS Agent
or Information /Advantage, Inc.'s Decision suite.

Some tools process OLAP data on the desktop instead of server.


Desktop OLAP tools include Cognos Power play, Brio Technology, In is Brio query, Planning Sciences, Inc.'s Gentium,
and Andyne's Pablo.

47
Data mining tools:
Provide close to corporate data that aren't easily differentiate with managed query or OLAP
tools.
Data mining tools use a variety of statistical and artificial intelligence (AI) algorithm to analyze the correlation of
variables in the data and search out interesting patterns and relationship to investigate.
Data mining tools, such as IBM's Intelligent Miner, are expensive and require statisticians to implement and manage.
These include Data Mind CorP's Data Mind, Pilot's Discovery server, and tools from Business objects and SAS Institute.
This tools offer simple user interfaces that plug in directly to existing OLAP tools or databases and can be run directly
against data warehouses.
For example, all end-user tools use metadata definitions to obtain access to data stored in the warehouse, and some of
these tools (eg., OLAP tools) may employ additional or intermediary data stores. (eg., data marts, multi dimensional data
base).

Applications
Organizations use a familiar application development approach to build a query and reporting environment for the data
warehouse. There are several reasons for doing this:
 A legacy DSS or EIS system is still being used, and the reporting facilities appear adequate.
 An organization has made a large investment in a particular application development environment (eg., Visual C++,
Power Builder).
 A new tool may require an additional investment in developers skill set, software, and the infrastructure, all or part of
which was not budgeted for in the planning stages of the project.
 The business users do not want to get involved in this phase of the project, and will continue to relay on the IT
organization to deliver periodic reports in a familiar format .
 A particular reporting requirement may be too complicated of an available reporting tool to handle.
All these reasons are perfectly valid and in many cases result in a timely and cost-effective delivery of a reporting system
for a data warehouse.

Need for applications:-


The tools and applications fit into the managed query and EIS categories. As these are easy-to use, point-and-click tools
that either accept SQL or generate SQL statements to query relational
data stored in the warehouse.
Some of these tools and applications can format the retrieved data in easy-to-read reports, while others concentrate on the
on-screen presentation.
The users of business applications such as
 segment identification,
 demographic analysis,
 territory management, and
 Customer mailing lists.
The complexity of the question grows these tools may rapidly become inefficient. Consider the various access types to
the data stored in a data warehouse.
 Simple tabular form reporting.
 Ad hoc user-specified queries.
 Predefined repeatable queries.
 Complex queries with multi table joins, multi-level sub queries, and Sophisticated search criteria.
 Ranking.
 Multivariable analysis.
 Time series analysis.
 Data visualization, graphing, charting and pivoting.
 Complex textual search
 Statistical analysis
 AI techniques for testing of hypothesis, trends discovery, definition and validation
of data clusters and segments.
 Information mapping (i.e., mapping of spatial data in geographic informationSystems).
 Interactive drill-down reporting and analysis.
The first four types of access are covered by the combined category of tools called query
and reporting tools
48
1. Creation and viewing of standard reports:
This is the main reporting activity: the routine delivery of reports based on pre determined measures.
2. Definition and creation of ad-hoc reports:
These can be quite complex, and the trend is to off-load this time-consuming activity to the users.
Reporting tools that allow managers and business users to quickly create their own reports and get quick answers to
business questions are becoming increasingly popular.
3. Data exploration: With the newest wave of business intelligence tools, users can easily "surf' through data without a
preset path to quickly uncover business trends or problems. While reporting type 1 may appear relatively simple, types 2
and 3, combined with certain business requirements often exceed existing tools capabilities and may require building
sophisticated applications to retrieve and analyze warehouse data. This approach may be very useful for those data
warehouse users who are not yet comfortable with ad hoc queries.

Online Analytical Processing

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable
multidimensional viewing, analysis and querying of large amounts of data.
E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or
enable them to analyze their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are designed to ask “complex queries of large
multidimensional collections of data.” Due to that OLAP is accompanied with data warehousing.
OLAP is an application architecture, not basically a data warehouse or a database management system (DBMS). Whether
it utilizes a data warehouse or not OLAP is becoming an architecture that an increasing number of enterprises are
implementing to support analytical applications.
The majority of OLAP applications are deployed in a "stovepipe" fashion, using specialized MDDBMS technology, a
narrow set of data; a preassembled application- user interface.

Needs for OLAP

Business problems such as market analysis and financial forecasting requires query-centric database schemas that are
array-oriented and multi dimensional in nature.
These business problems are characterized by the need to retrieve large number of records from very large data sets
(hundreds of gigabytes and even terabytes). The multidimensional nature of the problems it is designed to address is the
key driver for OLAP.
The result set may look like a multidimensional spreadsheet (hence the term multidimensional). All the necessary data
can be represented in a relational database accessed via SQL.
The two dimensional relational model of data and the Structured Query Language (SQL) have limitations for such
complex real-world problems.

SQL Limitations and need for OLAP :


One of the limitations of SQL is, it cannot represent complex problems. A query will be translated in to several SQL
statements. These SQL statements will involve multiple joins, intermediate tables, sorting, aggregations and a huge
temporary memory to store these tables. These procedures required a lot of computation which will require a long time in
computing.
The second limitation of SQL is its inability to use mathematical models in these SQL statements. If an analyst, create
these complex statements using SQL statements, there will be a large number of computation and huge memory needed.
Therefore the use of OLAP is preferable to solve this kind of problem.

Multidimensional Data Model with neat diagram.

The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-
line, it must provide answers quickly; analysts create iterative queries during interactive sessions, not in batch jobs that
49
run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is
designed to solve complex queries in real time.
Multidimensional data model is to view it as a cube. The cable at the left contains detailed sales data by product, market
and time. The cube on the right associates sales number (unit sold) with dimensions-product type, market and time with
the unit variables organized as cell in an array.

Fig: 2.2 - Relational tables and multidimensional cubes


This cube can be expended to include another array-price-which can be associates with all or only some dimensions. As
number of dimensions increases number of cubes cell increase exponentially.
Dimensions are hierarchical in nature i.e. time dimension may contain hierarchies for years, quarters, months, weak and
day. GEOGRAPHY may contain country, state, city etc. In this cube we can observe, that each side of the cube
represents one of the elements of the question. The x-axis represents the time, the y-axis represents the products and the z
axisrepresents different centers. The cells of in the cube represents the number of product sold or can represent the price
of the items

This Figure also gives a different understanding to the drilling down operations. The relations defined must not be
directly related, they related directly.
The size of the dimension increase, the size of the cube will also increase exponentially.
The time response of the cube depends on the size of the cube.
OLAP Operations (Operations in Multidimensional Data Model:)
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

1.Roll-up

Roll-up performs aggregation on a data cube in any of the following ways:

By climbing up a concept hierarchy for a dimension


 By dimension reduction
The following diagram illustrates how roll-up works.

50
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < state < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
 The data is grouped into cities rather than countries.
 When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension.
The following diagram illustrates how drill-down works:

51
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level of month.
 When drill-down is performed, one or more dimensions from the data cube are added.
 It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the
following diagram that shows how slice works.

Here Slice is performed for the dimension "time" using the criterion time = "Q1".
 It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

52
The dice operation on the cube based on the following selection criteria involves three dimensions.
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item =" Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that shows the pivot operation. In this the item and location axes in
2-D slice are rotated.

OLAP Guidelines and rules for implementation process.

Dr. E.F. Codd the ―father of the relational model, created a list of rules to deal with the
OLAP systems.

53
These rules are:
1).Multidimensional conceptual view: The OLAP should provide a suitable multidimensional business model that suits
the business problems and requirements.
2).Transparency: -(OLAP must transparency to the input data for the users).
The OLAP systems technology, the basic database and computing architecture {client/server, mainframe gateways, etc.)
and the heterogeneity of input data sources should be transparent to users to save their productivity and ability with front-
end environments and tools (eg., MS Windows, MS Excel).
3).Accessibility:-(OLAP tool should only access the data required only to the analysis Needed).
The OLAP system should access only the data actually required to perform the analysis. The system should be able to
access data from all heterogeneous enterprise data sources required/for the analysis.
4).Consistent reporting performance: Size of the database should not affect in performance).
As the number of dimensions and the size of the database increase, users should not identify any significant decrease in
performance.
5).Client/server architecture:(c/s architecture to ensure better performance and flexibility ).

The OLAP system has to conform to client/server architectural principles for maximum price and performance,
flexibility, adaptivity and interoperability

6).Generic dimensionality: Data entered should be equivalent to the structure and operation requirements.
7).Dynamic sparse matrix handling: The OLAP too should be able to manage the sparse matrix and so maintain the level
of performance.
8).Multi-user support: The OLAP should allow several users working concurrently to work together on a specific model.
9).Unrestricted cross-dimensional operations: The OLAP systems must be able to recognize dimensional hierarchies and
automatically perform associated roll-up calculations within and across dimensions.
10).Intuitive data manipulation. Consolidation path reorientation pivoting drill down and Rollup
and other manipulation should be accomplished via direct point-and-click; drag-and-drop operations on the cells of the
cube.
11).Flexible reporting: The ability to arrange rows, columns, and cells in a fashion that facilitates analysis by spontaneous
visual presentation of analytical report must exist.
12).Unlimited dimensions and aggregation levels: This depends on the kind of business, where multiple dimensions and
defining hierarchies can be made.

In addition to these guidelines an OLAP system should also support:


13).Comprehensive database management tools: This gives the database management to control distributed businesses.
14).The ability to drill down to detail source record level: Which requires that the OLAP tool should allow smooth
transitions in the multidimensional database.
15).Incremental database refresh: The OLAP tool should provide partial refresh.
16).Structured Query Language (SQL interface): the OLAP system should be able to integrate

MultiDimensional OLAP and MultiRelational OLAP

Multidimensional structure: - “A variation of the relational model that uses multidimensional structures for organize data
and express the relationships between data”.
Multidimensional: MOLAP
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores this data in optimized multidimensional array storage, rather than in a relational database. Therefore it
requires the pre-computation and storage of information in the cube the operation known as processing.
MOLAP analytical operations:-
Consolidation: involves the aggregation of data such as roll-ups or complex expressions involving interrelated data. For
example, branch offices can be rolled up to cities and rolled up to countries.
Drill-Down: is the reverse of consolidation and involves displaying the detailed data that comprises the consolidated
data.
Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing and dicing is often
performed along a time axis in order to analyze trends and find patterns.
Multi relational OLAP: ROLAP
54
ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables
and new tables are created to hold the aggregated information. It depends on a specialized schema design.

This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality.
Comparison:

Comparison:
 MOLAP implementations are smooth to database explosion, such as usage of large storage space ,high number of
dimensions, pre-calculated results and sparse multidimensional data.
 MOLAP generally delivers better performance by indexing and storage optimizations.
 MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes
compression techniques.
 ROLAP is generally more scalable. However large volume pre-processing is difficult to implement efficiently so it is
frequently skipped.
 ROLAP query performance can therefore suffer extremely.
 ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can
use.
A chart comparing capabilities of these two classes of OLAP tools.

55
The area of the circle implies data size.
Fig: - OLAP style comparison

Categories of OLAP Tools.


1. MOLAP Multidimesional OLAP
2. ROLAP Relational OLAP
3. Managed query environment (MQE)
MOLAP
The products used a data structure [multidimensional database management systems (MDDBMS)] to organize, navigate,
and analyze data, typically in an accumulated form and required a tight coupling with the application layer and
presentation layer.

Architectures enables excellent performance when the data is utilized as designed and predictable application response
times for applications addressing a narrow breadth of data for a specific DSS requirement.
Applications requiring iterative and comprehensive time series analysis of trends are well suited for MOLAP technology
(eg., financial analysis and budgeting). Examples include Arbor software's Ess base, Oracle's Express Server.

The implementation of applications with MOLAP products.

First, there are limitations in the ability of data structures to support multiple subject areas of data (a common trait of
many strategic DSS applications) and the detail data required by many, analysis applications. This has begun to be

56
addressed in some products, utilizing basic "reach through" mechanisms that enable the MOLAP tools to access detail
data maintained in an RDBMS.

Fig: - MOLAP architecture

MOLAP products require a different set of skills and tools for the database administrator to build and maintain the
database, thus increasing the cost and complexity of support. These hybrid solutions have as their primary characteristic
the integration of specialized multidimensional data storage with RDBMS technology, providing users with a facility that
tightly "couples" the multidimensional data structures (MDDSs) with data maintained in an RDBMS.
This approach can be very useful for organizations with performance — sensitive multidimensional analysis
requirements and that have built, or are in the process of building, a data warehouse architecture that contains multiple
subject areas.
Eg: (Product and sales region) to be stored and maintained in a persistent structure. These structure can be automatically
refreshed at predetermined intervals established by an administrator.

2.ROLAP
The fastest growing style of OLAP technology, with new vendors (eg., Sagnent technology) entering the market at an
accelerating step. Products directly through a dictionary layer of metadata, bypassing any requirement for creating a static
multidimensional data structures

Fig: - ROLAP architecture

57
This enables multiple multidimensional views of the two-dimensional relational tables to be created without the need to
structure the data around the desired view.
Some of the products in this segment have developed strong SQL-generation engines to support the complexity of
multidimensional analysis.
Flexibility is an attractive feature of ROLAP products, there are products in this segment that recommend, or require, the
use of highly de-normalized database designs (e.g., Star schema).

Shift in technology importance is coming in two forms.


First is the movement toward pure middleware technology that provides facility to simply development of
multidimensional applications. Second, there continues further hiding of the lines that define ROLAP and hybrid-OLAP
products. Example include Information
Advantage (Axsys), Micro strategy (DSS Agent/DSS Sever) Platinum/Pr Odea Software (Bercon), Informix/Standard
Technology Group (Meta cube), and Sybase (High Gate Project).

3. Managed Query Environment (MQE)


This style of OLAP, which is beginning to see increased activity, provided users with the ability to perform limited
analysis capability, either directly against RDBMS products, or by force an intermediate MOLAP server

Fig: - Hybrid/MQE architecture

Some products (e.g, Andyne's Pablo) that have a custom in ad hoc query have developed features to provide "datacube"
and "slice" and "dice" analysis capabilities. This is achieved by first developing a query to select data from the DBMS
which then delivers the requested data to the desktop, where it is placed into a data cube. This data cube can be stored and
maintained locally, to reduce the overhead required to create the structure each time the query is executed.

Once the data is in the data cube; users can perform multidimensional analysis (i.e., Slice, dice, and pivot operations)
against it. The simplicity of the installation and administration of such products makes them particularly attractive to
organizations looking to provide seasoned users with more sophisticated analysis capabilities, without the significant cost
and maintenance of more complex products.

This mechanism allows for the flexibility of each user to build a custom data cube, the lack of data consistency among
users, and the relatively small amount of data that can be efficiently maintained are significant challenges facing tool
administrators. Examples include Cognos Software's Power play, Andyne Software's Pablo, Business Objects, Mercury
Project, Dimensional Insight's cross target and Speedware's Media.

58
COURSE CODE COURSE TITLE L T P C
1151CS114 DATA WAREHOUSING AND DATA MINING 3 0 0 3

UNIT-III

CO Level of learning domain (Based


Course Outcomes
Nos. on revised Bloom’s taxonomy)
Explain the concept of Data mining system and apply the various
 CO3 K2
pre-processing techniques on large dataset.
.

Data Mining

Data mining is the non-trivial(non-unimportant) process of identifying valid, novel(original), potentially useful and
ultimately understandable patterns(model) in data.

Data mining techniques supports automatic searching of data and tries to source out patterns and trends in the data and
also gather rules from these patterns which will help the user to support review and examine decisions in some related
business- or scientific area.

Data refers to extracting or mining knowledge from large databases. Data mining and
knowledge discovery in the databases is a new inter disciplinary field, merging ideas from statistics, machine, learning,
databases and parallel computing.

Fig:1 - Data mining as a confluence of multiple disciplines

Fig:2 - Data mining — searching for knowledge (interesting patterns) in your data

59
KDD: Knowledge Discovery in Database (KDD) was formalized in search of seeking
knowledge from the data.

Fayyed et al distinguish between KDD and data mining by giving the following definitions:

Knowledge Discovery in Databases KDD is the process of identifying a valid, potentially useful and ultimately
understandable structure in data. This process involves selecting or sampling data from a data warehouse, cleaning or pre-
processing it, transforming or reducing it, applying a data mining component to produce a structure and then evaluating
the derived structure.

Data mining is a step in the KDD process concerned with the algorithmic means by which patterns or structures are
enumerated from the data under acceptable, computational efficiency limitations.

Some of the definitions of data mining are:


1. Data mining is the non-trivial extraction of implicit, exactly unknown and potentially useful information from the data.
2. Data mining is the search for the relationships and global patterns that exist in large
databases but are hidden among vast amounts of data.
3. Data mining refers to using a variety of techniques to identify piece of information or
decision-making knowledge in the database and extracting these in such a way they can be put to use in areas such as
decision support, prediction, forecasting and estimation.
4. Data mining system self-learns from the previous history of investigated system, formulating and testing hypothesis
about rules which system obey.
5. Data mining is the process of discovering meaningful new correlation pattern and trends by shifting through large
amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical
techniques.
Fig: 3 - Data mining as a step in the process of knowledge discover

60
Steps in KDD process:
Data cleaning: It is the process of removing noise and inconsistent data.
Data integrating: It is the process of combining data from multiple sources.
Data selection: It is the process of retrieving relevant data from the databases.
Data transformation: In this process, data are transformed or consolidated into forms
suitable for mining by performing summary of aggregation operations.
Data mining: It is an essential process where intelligent methods are applied in support to extract data patterns.
Pattern evaluation: The patterns obtained in the data mining stage are converted into
knowledge based on some interestingness measures.
Knowledge presentation: Visualization and knowledge representation techniques are used to present the mined
knowledge to the user.

61
Architecture of Data Mining System

Fig:4 - Architecture of a typical data mining system


Data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases,
data warehouses or other information repositories.

Major components.

Database, data warehouse or other information repository: This is a single or a


collection of multiple databases, data warehouse, flat file spread sheets or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse serve obtains the relevant data, based on the user's
data mining request.

Knowledge base: This is the domain knowledge that used to guide the search or evaluate the interestingness of resulting
patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different
levels of abstraction knowledge such as user beliefs; threshold and metadata can be used to access a patterns
interestingness.

Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for
task such as characterization, association classification,
cluster analysis, evolution and outlier analysis.

62
Pattern evaluation module: This component uses interestingness measures and interacts with the data mining modules
so as to focus the search towards increasing patterns. It may use interestingness entrances to / filter out discovered
patterns. Alternately, the pattern evaluation module may also be integrate with mining module.

Graphical user interface: This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a task or data mining query for performing exploratory data mining based on
intermediate data mining results.

This module allows the user to browse database and datawarehouse schemes or data structure, evaluate mined patterns
and visualize the pattern in different forms such as maps, charts etc.

Data Mining — on What Kind of Data

Data mining should be applicable to any kind of information repository. This includes
Flat files
Relational databases,
Data warehouses,
Transactional databases,
Advanced database systems,
World-Wide Web.

Advanced database systems include


Object-oriented and
Object relational databases, and
Special c application-oriented databases such as
Spatial databases,
Time-series databases,
Text databases,
Multimedia databases.

Flat files: Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to
be applied. The data in these files can be transactions, time series data, scientific measurements, etc.

Relational databases: A relational database is a collection of tables. Each table consists of a set of attributes (columns or
fields) and a set of tuples (records or rows). Each tuple is identified by a unique key and is described by a set of attribute
values. Entity relationships (ER) data model is often constructed for relational databases. Relational data can be accessed
by database queries written in a relational query language.
e.g Product and market table

Data warehouse:
A data warehouse is a repository of information collected from multiple sources, stored
under a unified scheme residing on a single site.

63
A data warehouse is formed by a multidimensional database structure, where each
dimension corresponds to an attribute or a set of attributes in the schema.

Fig: - Data Cube

Data warehouse is formed by data cubes. Each dimension is an attribute and each cell represents the aggregate measure.
A data warehouse collects information about subjects that cover an entire organization whereas data mart focuses on
selected subjects. The multidimensional data views makes (OLAP) Online Analytical Processing easier.

Transactional databases: A transactional database consists of a file where each record represents a transaction. A
transaction includes transaction identity number, list of items, date of transactions etc.

Advanced databases:

Object oriented databases: Object oriented databases are based on object-oriented programming concept. Each entity is
considered as an object which encapsulates data and code into a single unit objects are grouped into a class.

Object-relational database: Object relational database are constructed based on an object relational data mode which
extends the basic relational data model by handling complex data types, class hierarchies and object inheritance.

Spatial databases: A spatial database stores a large amount of space-related data, such as maps, preprocessed remote
sensing or medical imaging data and VLSI chip layout data. Spatial data may be represented in raster format, consisting
of n-dimensional bit maps or pixel maps.
Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically stores relational data that include time-related attributes.
A sequence database stores sequences of ordered events, with or without a concrete notion of time, e.g., customer shopping
sequences and Web click streams.

A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly,
daily, weekly), e.g., stock exchange data, inventory control, and observations of temperature and wind.
Text databases and multimedia databases: Text databases contain word descriptions of objects, such as long sentences
or paragraphs, warning messages, summary reports, etc. A text database consists of a large collection of documents from
various sources. Data stored in most text databases are semi-structured data.
A multimedia database stores and manages a large collection of multimedia objects such as audio data, image, video,
sequence and hypertext data.
Heterogeneous databases and legacy databases:
A heterogeneous database consists of a set of interconnected, autonomous component databases. The components
communicate in order to exchange information and answer queries.

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational
or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file
systems.

The heterogeneous databases in a legacy database may be connected by intra or intercomputer networks.

The World Wide Web: The World Wide Web and its associated distributed information services, such as Yahoo!,
Google, America Online, and AltaVista, provide worldwide, online
information services. Capturing user access patterns in such distributed information environments is called Web usage
mining or Weblog mining.

Data mining functionalities.

Data Mining tasks: what kinds of patterns can be mined?

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data
mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.

In some cases, users may have no idea which kinds of patterns are required, so they search for several different kinds of patterns in
parallel.

Data mining systems


- should be able to discover patterns at various granularities
- should help users search for interesting patterns.

Data mining functionalities:


 Concept/class description - Characterization, Discrimination,
 Association and correlation analysis,
 Classification, prediction,
 Clustering,
 Outlier analysis
 Evolution analysis

Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include
computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class
or a concept are called class/concept descriptions.
These descriptions can be derived via.
1. Data characterization
2. Data discrimination
3. Both data characterization and discrimination

Data characterization: It is a summarization of the general characteristics of a class (the target class) of data. The data
related to the user-specified class are collected by a database query. Several methods, like the OLAP roll-up operation and
attribute-oriented induction, are used for effective data summarization and characterization. The output of data
characterization can be presented in various forms, like pie charts, bar charts, curves, multidimensional cubes, and
multidimensional tables. The resulting descriptions can be presented as generalized relations or in rule form, called
characteristic rules.

Data discrimination is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The output of data discrimination can be presented in the same manner as
data characterization. Discrimination descriptions expressed in rule form are referred to as discriminant rules. E.g the user
may like to compare the general features of software products whose sales increased by 10% in the last year with those
whose sales decreased by at least 30% during the same period

Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, are patterns that occur frequently in data.

Kinds of frequent patterns,


 Itemsets - refers to a set of items that frequently appear together in a transactional data set – e.g milk and bread.
 Subsequences e.g- customers like to purchase first a PC, followed by a digital camera, and then a memory card ,
 Substructures e.g graphs, trees, or lattices

Association analysis.
single-dimensional association rule.

A marketing manager of the AllElectronics shop wants to find which items are frequently purchased together within the same
transactions. An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

where X is a variable representing a customer.

A confidence of 50% means that if a customer buys a computer, there is a 50% chance that he
will also buy software.
A support of 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together.

This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a
single predicate are referred to as single-dimensional association rules. Also, the above rule can be written simply as
computer => software [1%, 50%].

Multidimensional association rule.
Consider the “AllElectronics” relational database relating to purchases.
A data mining system may find association rules like

age(X, “20..29”) AND income(X, “20K..29K”) => buys(X, “CD player”) [support = 2%, confidence = 60%]

The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
of age with an income of 20,000 to 29,000 and have purchased a CD player at AllElectronics. There is a 60% probability
that a customer in this age and income group will purchase a CD player. Note that this is an association between more
than one attribute, or predicate (i.e., age, income, and buys).

Classification and prediction


Classification -> the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of using the model to predict the class of objects
whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
The derived model may be represented by
(i) classification (IF-THEN) rules, (ii) decision trees, (iii) neural networks

A decision tree is a flow-chart-like tree structure, node -> a test on an attribute value, branch-> outcome of the test, tree
leaves -> classes or class distributions. A neural network is typically a collection of neuron-like processing units with
weighted connections between the units.

Prediction
Prediction models estimate continuous-valued functions. Prediction is used to predict missing or unavailable numerical
data values and covers both numeric prediction and class label prediction. Regression analysis is a statistical
methodology that is most often used for numeric prediction. Prediction also includes the identification of distribution trends based on the
available data.
Clustering Analysis
Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle
of maximizing the intra-class similarity and minimizing the inter-class similarity. Clustering is a method of grouping data
into different groups, so that the data in each group share similar trends and patterns. The objectives of clustering are
* To uncover natural groupings
* To initiate hypotheses about the data
* To find consistent and valid organization of the data

5 .Outlier Analysis
A database may contain data objects that do not comply with the general model or behaviour of the data.
These data objects are called outliers. Most data mining methods discard outliers as noise or exceptions. In applications like
credit card fraud detection, cell phone cloning fraud, and detection of suspicious activities, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining. Outliers
may be detected using statistical tests, distance measures, or deviation-based methods.

6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behaviour changes over time.
Normally, evolution analysis is used to predict future trends for an effective decision-making process. It may include
characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-
related data, time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis. E.g.,
given the stock market (time-series) data of the last several years available from the New York Stock Exchange, one may like to invest
in shares of high-tech industrial companies.

Interestingness of patterns. [CO3-H1]


 A data mining system has the potential to generate thousands or even millions of patterns, or rules.
 A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of
certainty, (3) potentially useful, and (4) novel.
 A pattern is also interesting if it validates a hypothesis that the user wanted to confirm. An interesting pattern
represents knowledge.

Objective measures of pattern interestingness.

These are based on the structure of discovered patterns and the statistics underlying them. An objective measure for
association rules of the form X => Y is rule support, representing the percentage of transactions from a transaction
database that the given rule satisfies. This is taken to be the probability P(X U Y), where X U Y indicates that a transaction
contains both X and Y, that is, the union of itemsets X and Y.

Another objective measure for association rules is confidence, which measures the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X
also contains Y. More formally, support and confidence are defined as
support(X => Y) = P(X U Y)
confidence(X => Y) = P(Y | X)

For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below this
threshold likely reflect noise, exceptions, or minority cases and are probably of less value.

Subjective interestingness measures: These are based on user beliefs in the data. These measures find patterns interesting if they
are unexpected (contradicting a user’s belief) or offer strategic information on which the user can act. Patterns that are
expected can be interesting if they confirm a hypothesis that the user wished to validate, or resemble a user’s hunch.

Can a data mining system generate all of the interesting patterns? This refers to the completeness of a data mining algorithm. It is
often unrealistic and inefficient for data mining systems to generate all of the possible patterns.

Can a data mining system generate only interesting patterns? This is an optimization problem in data mining. It is highly
desirable for data mining systems to generate only interesting patterns.
This would be efficient for users and data mining systems, because neither would have to search through the patterns generated in order to
identify the truly interesting ones.

Classification of Data Mining Systems

Data mining is an interdisciplinary field, merging a set of disciplines, including database systems, statistics, machine
learning, visualization, and information science.

Depending on the data mining approach used, techniques from other disciplines may be
applied, such as
o neural networks,
o fuzzy and/or rough set theory,
o knowledge representation,
o inductive logic programming,
o high-performance computing.

Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from
o spatial data analysis,
o information retrieval,
o pattern recognition,
o image analysis,
o signal processing,
o computer graphics,
o Web technology,
o economics,
o business,
o bioinformatics,
o psychology
Data mining systems can be categorized according to various criteria, as follows:

Classification according to the kinds of databases mined: Database systems can be
classified according to different criteria, and each classification may require its own data mining technique. For example, if classifying
according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If
classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, or multimedia
data mining system, or a World Wide Web mining system.
Classification according to the kinds of knowledge mined:
o It is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis.

A complete data mining system usually provides multiple and/or integrated data mining
functionalities.
o Moreover, data mining systems can be distinguished based on the granularity or levels of abstraction of the knowledge
mined, including generalized knowledge, primitive-level knowledge, or knowledge at multiple levels.
o An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.
o Data mining systems can also be categorized as those that mine data regularities (commonly occurring patterns) versus
those that mine data irregularities (such as exceptions, or outliers).
 In general, concept description, association and correlation analysis, classification, prediction, and clustering mine data
regularities, rejecting outliers as noise. These methods may also help detect outliers.
Classification according to the kinds of techniques utilized: These techniques can be described according to the degree
of user interaction involved e.g. Autonomous systems, interactive exploratory systems, query-driven systems or the
methods of data analysis employed.
e.g., database-oriented or data warehouse–oriented techniques, machine learning, statistics, visualization, pattern
recognition, neural networks, and so on.

Classification according to the applications adapted.


For e.g., data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail,
and so on. Different applications often require the integration of application-specific methods. Therefore, a generic,
all-purpose data mining system may not fit domain-specific mining tasks.

Data Mining Task Primitives.


 A data mining task can be specified in the form of a data mining query, which is input to the data mining system.
 A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively
communicate with the data mining system during discovery in order to direct the mining process, or examine the findings
from different angles or depths.

 The data mining primitives specify the following, as illustrated in Figure.

The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse dimensions of interest

The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.

The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful
for guiding the knowledge discovery process and for evaluating the patterns found.
Concept hierarchies (shown in Fig 2) are a popular form of background knowledge, which allow data to be mined at
multiple levels of abstraction.

The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or,
after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness
measures.

The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns
are to be displayed ,which may include rules, tables, charts, graphs, decision trees, and cubes.

A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with
data mining systems. This facilitates a data mining system’s communication with other information systems and its
integration with the overall information processing environment.

Integration of a Data Mining System with a Database or Data Warehouse System

When a DM system works in an environment that requires it to communicate with other information system components,
such as DB and DW systems, possible integration schemes include
 No coupling,
 Loose coupling,
 Semi tight coupling,
 Tight coupling
No coupling: means that a DM system will not utilize any function of a DB or DW system. It may fetch data from a file
system, process data using some data mining algorithms, and then store the mining results in another file.

Drawbacks.

First, a DB system provides flexibility and efficiency at storing, organizing, accessing, and processing data. Without
using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data. In
DB/DW systems, data are well organized, indexed, cleaned, integrated, or consolidated, so that finding the task-
relevant, high-quality data becomes an easy task.
Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems. Without any
coupling of such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a
system into an information processing environment. Thus, no coupling represents a poor design.

Loose coupling: means that a DM system will use some facilities of a DB or DW system, fetching data from a data
repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a
designated place in a database or data warehouse.( In computing and systems design a loosely coupled system is
one in which each of its components has, or makes use of, little or no knowledge of the definitions of other
separate components. Subareas include the coupling of classes, interfaces, data, and services. Loose
coupling is the opposite of tight coupling. )

Advantages: Loose coupling is better than no coupling because it can fetch any portion of data stored in DBs or DWs
by using query processing, indexing, and other system facilities.

Drawbacks : However, many loosely coupled mining systems are main memory-based. Because mining does not explore
data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to
achieve high scalability and good performance with large data sets.

Semitight coupling: means that besides linking a DM system to a DB/DW system, efficient implementations of a few
essential data mining primitives can be provided in the DB/DW system.

These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of
some essential statistical measures, such as sum, count, max, min, standard deviation, and so on.

Moreover, some frequently used intermediate mining results can be precomputed and stored in the DB/DW system.

Tight coupling: means that a DM system is smoothly integrated into the DB/DW system. This approach is highly
desirable because it facilitates efficient implementations of data mining functions, high system performance, and an
integrated information processing environment.

Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and
query processing methods of a DB or DW system. As technology advances, DM, DB, and DW systems will increasingly be integrated
together as one information system with multiple functionalities. This will provide a uniform information processing
environment.

Major Issues in Data Mining

Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine
knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.

 Mining different kinds of knowledge in databases: Data mining should cover a wide spectrum of data analysis and knowledge
discovery tasks, including data characterization, discrimination, association, classification, prediction, clustering, and outlier
analysis.
 Interactive mining of knowledge at multiple levels of abstraction: The data mining process
should be interactive. Interactive mining allows users to focus the search for patterns, providing and refining data mining
requests based on returned results.
 Incorporation of background knowledge: Background knowledge may be used to guide the discovery process and
allow discovered patterns to be expressed in concise terms and at different levels of abstraction.
 Data mining query languages and ad hoc mining: Relational query languages (such as SQL) allow users to use ad hoc
queries for data retrieval.
 Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level
languages, visual representations, or other expressive forms and directly usable by humans.
 Handling noisy or incomplete data: When mining data regularities, noisy or incomplete data objects may confuse the process, causing the
knowledge model constructed to overfit the data.
 Pattern evaluation--the interestingness problem: A data mining system can uncover thousands of patterns. Many of the
patterns discovered may be uninteresting to the given user, representing common knowledge or lacking novelty.

Performance issues:

Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in
databases, data mining algorithms must be efficient and scalable. Parallel, distributed, and incremental mining
algorithms: The huge size of many databases, the wide distribution of data, and the computational complexity of some
data mining methods are factors motivating the development of algorithms that divide data into partitions that can be
processed in parallel.

Issues relating to the diversity of database types:


 Handling of relational and complex types of data: Specific data mining systems should be constructed for mining
specific kinds of data.
 Mining information from heterogeneous databases and global information systems: Local and
wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and
heterogeneous databases.

Data Preprocessing :-
 Data in the real world is dirty.
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
o noisy: containing errors or outliers
 e.g., Salary=“-10”
o inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Data Dirty reason


 Incomplete data may come from
o “Not applicable” data value when collected
o Different considerations between the time when the data was collected and when
it is analyzed.
o Human/hardware/software problems
 Noisy data (incorrect values) may come from
o Faulty data collection instruments
o Human or computer error at data entry
o Errors in data transmission
 Inconsistent data may come from
o Different data sources
o Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Data Preprocessing Important
 No quality data, no quality mining results!
o Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
o Data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprises the majority of the work of
building a data warehouse

Major Tasks in Data Preprocessing


 Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
o Integration of multiple databases, data cubes, or files
 Data transformation
o Normalization and aggregation
 Data reduction
o Obtains reduced representation in volume but produces the same or similar analytical results
 Data discretization
o Part of data reduction but with particular importance, especially for numerical data

II. Data Cleaning


 Importance
o “Data cleaning is one of the three biggest problems in data warehousing”—Ralph
Kimball
o “Data cleaning is the number one problem in data warehousing”—DCI survey
 Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct the inconsistent data
o Resolve redundancy caused by data integration

(i) .Missing Data


 Data is not always available
o E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
 Missing data may be due to
o equipment malfunction
o inconsistent with other recorded data and thus deleted
o data not entered due to misunderstanding
o certain data may not be considered important at the time of entry
o not register history or changes of the data
 Missing data may need to be inferred.

Handle Missing Data


 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); this is not effective when the
percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
o a global constant : e.g., “unknown”, a new class?!
o the attribute mean
o the attribute mean for all samples belonging to the same class: smarter
o the most probable value: inference-based such as Bayesian formula or decision tree
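As an illustration of the automatic fill-in options above, here is a small sketch using the pandas library; the column names and values are invented purely for illustration:

import pandas as pd

df = pd.DataFrame({"occupation": ["engineer", None, "teacher", None],
                   "income": [45000.0, None, 52000.0, None],
                   "class": ["budget", "budget", "big spender", "big spender"]})

# Fill a categorical attribute with a global constant
df["occupation"] = df["occupation"].fillna("unknown")

# Fill a numeric attribute with the attribute mean
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of all samples belonging to the same class
df["income_class_filled"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)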

(ii). Noisy Data


 Noise: random error or variance in a measured variable
 Incorrect attribute values may due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitation
o inconsistency in naming convention
 Other data problems which requires data cleaning
o duplicate records
o incomplete data
o inconsistent data

Handling Noisy Data


 Binning
o first sort data and partition into (equal-frequency) bins
o then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.

 Regression
o smooth by fitting the data into regression functions
 Clustering
o detect and remove outliers
 Combined computer and human inspection
o detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning


 Equal-width (distance) partitioning
o Divides the range into N intervals of equal size: uniform grid
o if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B –A)/N.
o The most straightforward, but outliers may dominate presentation
o Skewed data is not handled well
 Equal-depth (frequency) partitioning
o Divides the range into N intervals, each containing approximately same number
of samples
o Good data scaling
o Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:


- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
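The same smoothing can be reproduced with a short Python sketch of equal-frequency partitioning, smoothing by bin means, and smoothing by bin boundaries (variable names are illustrative only):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]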

Cluster Analysis

III. Data Integration and Transformation
 Data integration:
o Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id and B.cust-no refer to the same attribute
o Integrate metadata from different sources
 Entity identification problem:
o Identify real world entities from multiple data sources,
o e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are different
o Possible reasons: different representations, different scales,
e.g., metric vs. British units

Handling Redundancy in Data Integration


 Redundant data occur often when integration of multiple databases
o Object identification: The same attribute or object may have different names in different databases
o Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant attributes may be able to be detected by correlation analysis
 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

Correlation Analysis (Numerical Data)

 Correlation coefficient (also called Pearson’s product moment coefficient)

   r(A,B) = Σ (a_i - mean(A)) (b_i - mean(B)) / (n σA σB) = ( Σ (a_i b_i) - n mean(A) mean(B) ) / (n σA σB)

 where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σA
and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB
cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the
stronger the correlation.
 rA,B = 0: independent; rA,B < 0: negatively correlated.
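For a quick numerical check of such a correlation, here is a small sketch assuming numpy is available; the two attribute arrays A and B are invented for illustration:

import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # attribute A
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])   # attribute B
n = len(A)

# Pearson correlation coefficient computed from the definition above
r = ((A * B).sum() - n * A.mean() * B.mean()) / (n * A.std() * B.std())
print(r)                        # close to +1 => strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # same value via numpy's built-in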
Correlation Analysis (Categorical Data)
 Χ2 (chi-square) test

   Χ2 = Σ (observed count - expected count)^2 / expected count, summed over all cells of the contingency table

 The larger the Χ2 value, the more likely the variables are related

 The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected
count
 Correlation does not imply causality
o No. of hospitals and no. of car-thefts in a city are correlated
o Both are causally linked to a third variable: population
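The Χ2 statistic above can be computed with a short sketch, assuming the scipy library is installed; the contingency counts below are fictitious and for illustration only:

from scipy.stats import chi2_contingency

# Rows: attribute A yes/no; columns: attribute B yes/no (made-up observed counts)
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # a large chi2 / tiny p-value suggests the attributes are related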

Data Transformation

 Smoothing: remove noise from data


 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
o min-max normalization
o z-score normalization
o normalization by decimal scaling
 Attribute/feature construction
o New attributes constructed from the given ones

Data Transformation: Normalization

 Min-max normalization: to [new_minA, new_maxA]

   v’ = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

o Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,000 is mapped to ((73,000 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 ≈ 0.709

 Z-score normalization (μ: mean, σ: standard deviation):

   v’ = (v - μ) / σ

o Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 - 54,000) / 16,000 ≈ 1.19

 Normalization by decimal scaling

   v’ = v / 10^j

where j is the smallest integer such that Max(|v’|) < 1
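The three normalizations can be coded directly; a plain-Python sketch using the same income figures as the examples above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalization to [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score normalization
    return (v - mean) / std

def decimal_scaling(v, j):
    # Normalization by decimal scaling; j chosen so that max(|v'|) < 1
    return v / (10 ** j)

print(min_max(73000, 12000, 98000))   # ~0.709
print(z_score(73000, 54000, 16000))   # ~1.19
print(decimal_scaling(986, 3))        # 0.986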

Data reduction
 Data reduction necessity
o A database/data warehouse may store terabytes of data
o Complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction
o Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or
almost the same) analytical results

Data reduction strategies


1. Data cube aggregation:

2. Dimensionality reduction — e.g., remove unimportant attributes


3. Data Compression
4. Numerosity reduction — e.g., fit data into models
5. Discretization and concept hierarchy generation

1. Data cube aggregation:


Attribute Subset Selection
 Feature selection (i.e., attribute subset selection):
o Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
o reduces the number of attributes appearing in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential no., of choices):
o Step-wise forward selection
o Step-wise backward elimination
o Combining forward selection and backward elimination
o Decision-tree induction
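Step-wise forward selection, the first of these heuristics, can be sketched with scikit-learn's SequentialFeatureSelector, assuming scikit-learn is installed; the iris data set and the decision-tree estimator are used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy step-wise forward selection of 2 of the 4 attributes
selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes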

2. Dimensionality Reduction: Wavelet Transformation
 Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
 Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
 Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
 Method:
o Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
o Each transform has 2 functions: smoothing, difference
o Applies to pairs of data, resulting in two set of data of length L/2
o Applies two functions recursively, until reaches the desired length

Principal Component Analysis (PCA)


 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to
represent data
 Steps
o Normalize input data: Each attribute falls within the same range
o Compute k orthonormal (unit) vectors, i.e., principal components
o Each input data (vector) is a linear combination of the k principal component vectors
o The principal components are sorted in order of decreasing “significance” or strength
o Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e.,
those with low variance. (i.e., using the strongest principal components, it is possible to reconstruct a good
approximation of the original data
 Works for numeric data only
 Used when the number of dimensions is large
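A minimal PCA sketch, assuming scikit-learn is installed; the data matrix is random and only for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 tuples, 5 numeric attributes

pca = PCA(n_components=2)              # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # "significance" (variance share) of each kept component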

3. Data Compression
 String compression
o There are extensive theories and well-tuned algorithms
o Typically lossless
o But only limited manipulation is possible without expansion
 Audio/video compression
o Typically lossy compression, with progressive refinement
o Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
 Time sequence data is not like audio
o Typically short and varies slowly with time

4. Numerosity Reduction
 Reduce data volume by choosing alternative, smaller forms of data representation
 Parametric methods
o Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
o Example: Log-linear models—obtain a value at a point in m-D space as the product of
appropriate marginal subspaces
 Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling
Parametric methods

Regression and Log-Linear Models


 Linear regression: Data are modeled to fit a straight line
o Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature
vector
 Log-linear model: approximates discrete multidimensional probability distributions

Non-parametric methods

 Histograms,

 Divide data into buckets and store average (sum) for each bucket
 Partitioning rules:
o Equal-width: equal bucket range
o Equal-frequency (or equal-depth)
o V-optimal: the histogram with the least variance (histogram variance is a weighted sum of the original values that
each bucket represents)
o MaxDiff: set bucket boundaries between each pair of adjacent values for the pairs having the β–1 largest differences
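Equal-width and equal-frequency bucket boundaries can both be sketched with numpy; the price list reuses the earlier binning example:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width histogram: 3 buckets of equal range
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)

# Equal-frequency (equal-depth) boundaries: 3 buckets with roughly 4 values each
print(np.quantile(prices, [0.0, 1/3, 2/3, 1.0]))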

Clustering
 Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and
diameter) only
 Can be very effective if data is clustered but not if data is “dirty”
 Can have hierarchical clustering and be stored in multi-dimensional index tree structures
 There are many choices of clustering definitions and clustering algorithms.

Sampling

 Sampling: obtaining a small sample s to represent the whole data set N


 Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
 Choose a representative subset of the data
o Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
o Stratified sampling:
 Approximate the percentage of each class (or subpopulation of interest) in the
overall database
 Used in conjunction with skewed data
 Note: Sampling may not reduce database I/Os (page at a time)
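Simple random sampling and stratified sampling can be sketched with pandas; the column names and group sizes below are invented for illustration:

import pandas as pd

df = pd.DataFrame({"age_group": ["young"] * 70 + ["senior"] * 30,
                   "amount": range(100)})

# Simple random sample without replacement (SRSWOR), 10% of the tuples
srs = df.sample(frac=0.10, random_state=1)

# Stratified sample: keep roughly 10% of each age_group (subpopulation of interest)
stratified = df.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=1))

print(srs["age_group"].value_counts())
print(stratified["age_group"].value_counts())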

Discretization and concept hierarchy generation

 Three types of attributes:


o Nominal — values from an unordered set, e.g., colour, profession
o Ordinal — values from an ordered set, e.g., military or academic rank
o Continuous — real-valued numbers, e.g., salary, temperature

Discretization:
o Divide the range of a continuous attribute into intervals
o Some classification algorithms only accept categorical attributes.
o Reduce data size by discretization
o Prepare for further analysis
o Reduce the number of values for a given continuous attribute by dividing the range of the attribute into
intervals
o Interval labels can then be used to replace actual data values
o Supervised vs. unsupervised
o Split (top-down) vs. merge (bottom-up)
o Discretization can be performed recursively on an attribute
Concept hierarchy formation
o Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data


 Typical methods: All the methods can be applied recursively
o Binning (see above)
 Top-down split, unsupervised,
o Histogram analysis (see above)
 Top-down split, unsupervised
o Clustering analysis (see above)
 Either top-down split or bottom-up merge, unsupervised
o Entropy-based discretization: supervised, top-down split
o Interval merging by χ2 analysis: unsupervised, bottom-up merge
o Segmentation by natural partitioning: top-down split, unsupervised

Segmentation by natural partitioning: top-down split, unsupervised


A simply 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural”
intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3
equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Concept Hierarchy Generation for Categorical Data

 Specification of a partial/total ordering of attributes explicitly at the schema level by users or


experts
o street < city < state < country
 Specification of a hierarchy for a set of values by explicit data grouping
o {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
o E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the analysis of the number of
distinct values
o E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation


 Some hierarchies can be automatically generated based on the analysis of the number of
distinct values per attribute in the data set
o The attribute with the most distinct values is placed at the lowest level of the hierarchy
o Exceptions, e.g., weekday, month, quarter, year.

UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION


Mining Association Rules in Large Databases – Mining Various Kinds of Association Rules – Correlation Analysis
–Constraint Based Association Mining – Classification and Prediction - Basic Concepts - Decision Tree Induction -
Bayesian Classification – Classification by Back propagation – Support Vector Machines – Associative
Classification – Lazy Learners – Other Classification Methods – Prediction.

UNIT-IV
Association rules are if-then statements that help to show the probability of relationships between data items
within large data sets in various types of databases. Association rule mining has a number of applications
and is widely used to help discover sales correlations in transactional data or in medical data sets.

 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in
transaction databases, relational databases, and other information repositories
 Applications – basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
 Association Rule: Basic Concepts • Given a database of transactions, each transaction is a list of items
(purchased by a customer in a visit) • Find all rules that correlate the presence of one set
of items with that of another set of items • Find frequent patterns • An example of frequent itemset
mining is market basket analysis
Association rule performance measures
• Confidence • Support • Minimum support threshold • Minimum confidence threshold

Minimum support and confidence

1.Frequent patterns are patterns (such as item sets, subsequences, or substructures) that appear in a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent
itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a
shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.

2.Market Basket Analysis

Frequent itemset mining is used to find associations and correlations among items in large
transactional or relational data sets. With large amounts of data continuously collected and stored, many industries are
interested in mining such patterns from their databases. This can help in many business decision-making processes, such
as catalogue design, cross marketing, and customer shopping behaviour analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that customers place in their
“shopping baskets” (Figure 5.1).

The discovery of such associations can help retailers develop marketing strategies by
showing which items are frequently purchased together by customers. For example, if customers are buying milk, how many of
them also buy bread on the same trip to the supermarket? Such information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.

3.Frequent Itemsets, Closed Itemsets, and Association Rules


 A set of items is referred to as an itemset.
 An itemset that contains k items is a k-itemset.
 The set {computer, antivirus software} is a 2-itemset.
 The occurrence frequency of an itemset is the number of transactions that contain the itemset.

This is also known, simply, as the frequency, support count, or count of the itemset. A rule A => B holds in the transaction
set D with support s, where s is the percentage of transactions in D that contain A U B (i.e., both A and B), and with
confidence c, where c is the percentage of transactions in D containing A that also contain B:

support(A => B) = P(A U B)
confidence(A => B) = P(B | A)

Note that the itemset support defined here is sometimes referred to as relative support, whereas the occurrence
frequency is called the absolute support.

From the above equations we have

confidence(A => B) = P(B | A) = support_count(A U B) / support_count(A)

Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called
Strong Association Rules.

In general, association rule mining can be viewed as a two-step process:

1.Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined
minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support
and minimum confidence.

Closed Itemsets : An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same
support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.
Maximal frequent itemset: An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there
exists no super-itemset Y such that X is a proper subset of Y and Y is frequent in S.

4.Frequent Pattern Mining

Frequent pattern mining can be classified in various ways, based on the following criteria:

1. Based on the completeness of patterns to be mined: The following can be mined based on the Completeness of
patterns.
 Frequent itemsets, Closed frequent itemsets, Maximal frequent itemsets,
 Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),
 Approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent
itemsets),
 Near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets),
 Top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k),

2. Based on the levels of abstraction involved in the rule set:

e.g “computer” is a higher-level abstraction of “laptop computer”


buys (X, “computer”) =>buys (X, “HP printer”)
buys (X, “laptop computer”) => buys (X, “HP printer”)

3. Based on the number of data dimensions involved in the rule:

buys (X, “computer”) =>buys (X, “HP printer”)


buys (X, “laptop computer”) => buys (X, “HP printer”)
buys (X, “computer”) => buys (X, “antivirus software”)
The above Rules are single-dimensional association rules they refer only one dimension, buys.
The following rule is an example of a multidimensional rule:
age(X, “30..39”) AND income(X, “42K..48K”) => buys(X, “high resolution TV”)

4. Based on the types of values handled in the rule:


If a rule involves associations between the presence or absence of items, it is a Boolean association rule, e.g.
computer => antivirus_software [support = 2%; confidence = 60%]
buys(X, “computer”) => buys(X, “HP printer”)
buys(X, “laptop computer”) => buys(X, “HP printer”)
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule, e.g.
age(X, “30..39”) AND income(X, “42K..48K”) => buys(X, “high resolution TV”)
5. Based on the kinds of rules to be mined:


e.g Association rules and Correlation rules.

6. Based on the kinds of patterns to be mined:

 Frequent itemset mining: mining of frequent itemsets (sets of items) from transactional or relational data sets.
 Sequential pattern mining: searches for frequent subsequences in a sequence data set
 Structured pattern mining: searches for frequent substructures in a structured dataset.

Mining Methods

1. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation


2. Generating Association Rules from Frequent Itemsets
3. Improving the Efficiency of Apriori
4. Mining Frequent Itemsets without Candidate Generation
5. Mining Frequent Itemsets Using Vertical Data Format
6. Mining Closed Frequent Itemsets

1. The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent
itemset properties.
Apriori uses an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted L1.

Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of
the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important
property called the Apriori property, presented below, is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Consider the table D with nine transactions |D| = 9.

1.The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
The Apriori Algorithm: Basics
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean
association rules.
Key Concepts :
• Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith-
Itemset).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with
itself.
The Apriori Algorithm Steps
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
Use the frequent itemsets to generate association rules.
Join Step: Ck is generated by joining Lk-1with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k itemset

Pseudo-code:
   Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
   L1 = {frequent 1-itemsets};
   for (k = 1; Lk is not empty; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
         increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
   end
   return the union of all Lk;

Consider a database, D , consisting of 9 transactions.


Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % )
Let minimum confidence required is 70%.
We have to first find out the frequent itemset using Apriori algorithm.
Then, Association rules will be generated using min. support & min. confidence.

Step 1: Generating 1-itemset Frequent Pattern


• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating 2-itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to generate a candidate set of 2-
itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as
shown in the middle table).
The set of frequent 2-itemsets, L2 , is then determined, consisting of those candidate 2-itemsets in C2 having
minimum support.
Note: We haven’t used Apriori Property yet.
Step 3: Generating 3-itemset Frequent Pattern


The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori Property.
In order to find C3, we compute L2 Join L2.
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now, Join step is complete and Prune step will be used to reduce the size of C3.
Prune step helps to avoid heavy computation due to large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that four
latter candidates cannot possibly be frequent. How ?
For example , lets take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2,
I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3}
in C3.
Lets take another example of {I2, I3, I5} which shows how the pruning is performed.
The 2-item subsets are {I2, I3}, {I2, I5} & {I3,I5}.
BUT, {I3, I5} is not a member of L2 and hence it is not frequent violating Apriori Property. Thus We will have to
remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking for all members of result of Join operation for Pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidates 3-itemsets in C3
having minimum support.
Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3,
I5}}, this itemset is pruned since its subset {{I2, I3, I5}} is not frequent.
Thus, C4 = φ , and algorithm terminates, having found all of the frequent items. This completes our Apriori
Algorithm.
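The whole level-wise search just traced can be condensed into a short Python sketch; the transaction list mirrors the nine-transaction example, and the helper names (support_count, min_sup) are illustrative rather than part of the original notes:

from itertools import combinations

transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
                {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
                {"I1","I2","I3"}]
min_sup = 2

def support_count(itemset):
    # number of transactions containing the whole itemset
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
L = [{frozenset([i]) for i in items if support_count(frozenset([i])) >= min_sup}]

k = 1
while L[-1]:
    # Join step: build (k+1)-item candidates from Lk
    candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
    # Prune step: every k-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in L[-1] for s in combinations(c, k))}
    L.append({c for c in candidates if support_count(c) >= min_sup})
    k += 1

frequent = [itemset for level in L for itemset in level]
print(frequent)   # includes {I1, I2, I3} and {I1, I2, I5} for this data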

Step5.Generating Association Rules from Frequent Itemsets


Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules
from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done
using the following equation for confidence:

confidence(A => B) = P(B | A) = support_count(A U B) / support_count(A)

The conditional probability is expressed in terms of itemset support count, where:

support_count(A U B) is the number of transactions containing the itemsets A U B, and support_count(A) is the number of
transactions containing the itemset A.

Based on this equation, association rules can be generated as follows:

For each frequent itemset l, generate all nonempty subsets of l.


For every nonempty subset s of l, output the rule “s => (l - s)” if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back to e.g
L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3},
{I1,I2,I5}}.
 Lets take l = {I1,I2,I5}.
 Its all nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
 Let minimum confidence threshold is , say 70%.
 The resulting association rules are shown below, each listed with its confidence:
   I1 AND I2 => I5, confidence = 2/4 = 50%
   I1 AND I5 => I2, confidence = 2/2 = 100%
   I2 AND I5 => I1, confidence = 2/2 = 100%
   I1 => I2 AND I5, confidence = 2/6 = 33%
   I2 => I1 AND I5, confidence = 2/7 = 29%
   I5 => I1 AND I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because
these are the only ones generated that are strong.
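The same rule-generation loop can be sketched in Python; the transactions, support_count helper, and min_conf value below are illustrative and mirror the worked example:

from itertools import combinations

transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
                {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
                {"I1","I2","I3"}]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

l = frozenset({"I1", "I2", "I5"})   # a frequent itemset found by Apriori
min_conf = 0.7

# For every nonempty proper subset s of l, output s => (l - s) if it is confident enough
for r in range(1, len(l)):
    for s in map(frozenset, combinations(l, r)):
        conf = support_count(l) / support_count(s)
        if conf >= min_conf:
            print(set(s), "=>", set(l - s), "confidence = %.0f%%" % (conf * 100))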

3.Improving the Efficiency of Apriori

Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot
be frequent.
Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
• Apriori Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Apriori Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans

4.Mining Frequent Itemsets without Candidate Generation
Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
A divide-and-conquer methodology: decompose mining tasks into smaller ones
Avoid candidate generation: sub-database test only!

FP-Growth Method: An Example

Consider the same previous example of a database, D , consisting of 9 transactions.
Suppose min. support count required is 2 (i.e. min_sup = 2/9 = 22 % )
The first scan of database is same as Apriori, which derives the set of 1-itemsets &
their support counts.
The set of frequent items is sorted in the order of descending support count.
The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}
FP-Growth Method: Construction of FP-Tree
First, create the root of the tree, labeled with “null”.
Scan the database D a second time. (First time we scanned it to create 1-itemset and then L).
The items in each transaction are processed in L order (i.e. sorted order).
A branch is created for each transaction with items having their support count separated by colon.
Whenever the same node is encountered in another transaction, we just increment the support count of the
common node or Prefix.
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree
via a chain of node-links.
Now, The problem of mining frequent patterns in database is transformed to that of mining the FP-Tree.

FP-Growth Method: Construction of FP-Tree


Mining the FP-Tree by Creating Conditional (sub) pattern bases

Steps:
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in the FP-Tree co-occurring with suffix
pattern.
3. Then, Construct its conditional FP-Tree & perform mining on such a tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the frequent patterns generated from a
conditional FP-Tree.

5. The union of all frequent patterns (generated by step 4) gives the required frequent itemset.

Table : Mining the FP-Tree by creating conditional (sub) pattern bases

Now, Following the above mentioned steps:


Lets start from I5. The I5 is involved in 2 branches namely {I2 I1 I5: 1} and {I2 I1
I3 I5: 1}.
Therefore considering I5 as suffix, its 2 corresponding prefix paths would be {I2
I1: 1} and {I2 I1 I3: 1}, which forms its conditional pattern base.
Out of these, Only I1 & I2 is selected in the conditional FP-Tree because I3 is not
satisfying the minimum support count.
o For I1 , support count in conditional pattern base = 1 + 1 = 2
o For I2 , support count in conditional pattern base = 1 + 1 = 2
o For I3, support count in conditional pattern base = 1
o Thus support count for I3 is less than required min_sup which is 2 here.

Now , We have conditional FP-Tree with us.


All frequent pattern corresponding to suffix I5 are generated by considering all possible combinations of I5 and
conditional FP-Tree.
The same procedure is applied to suffixes I4, I3 and I1.
Note: I2 is not taken into consideration for suffix because it doesn’t have any prefix at all.
Advantages of FP growth
Performance study shows
FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
Reasoning
No candidate generation, no candidate test
Use compact data structure
Eliminate repeated database scan
Basic operation is counting and FP-tree building
Disadvantages of FP-Growth
– FP-Tree may not fit in memory!!
– FP-Tree is expensive to build

Pseudo Code
Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment
growth.
Input:
• D, a transaction database;
• min sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:

(a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support
count descending order as L, the list of frequent items.
(b) Create the root of an FP-tree, and label it as “null.” For each transaction Trans in D
do the following.
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, link its parent to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.

2. The FP-tree is mined by calling FP_growth(FP_tree, null), which is implemented as follows.

procedure FP_growth(Tree, a)
(1) if Tree contains a single path P then
(2) for each combination (denoted as b) of the nodes in path P
(3) generate pattern b ∪ a with support_count = minimum support count of the nodes in b;
(4) else for each ai in the header of Tree {
(5) generate pattern b = ai ∪ a with support_count = ai.support_count;
(6) construct b's conditional pattern base and then b's conditional FP-tree Treeb;
(7) if Treeb ≠ ∅ then
(8) call FP_growth(Treeb, b); }
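As a rough illustration of the insert_tree step above, here is a minimal Python sketch of an FP-tree node with a header table; the class and function names are ours, and the pre-sorted transactions are assumed to match the nine-transaction example.

class FPNode:
    """One node of an FP-tree."""
    def __init__(self, item, parent=None):
        self.item = item          # item name (None for the root labelled "null")
        self.count = 0            # support count accumulated along this prefix path
        self.parent = parent
        self.children = {}        # item name -> child FPNode
        self.node_link = None     # next node carrying the same item (header-table chain)

def insert_tree(items, node, header):
    """Insert one transaction, already sorted in L order, into the tree rooted at node."""
    if not items:
        return
    p, rest = items[0], items[1:]
    if p in node.children:        # shared prefix: only the count is incremented
        child = node.children[p]
    else:                         # otherwise create a new node and chain it into the header table
        child = FPNode(p, parent=node)
        node.children[p] = child
        child.node_link = header.get(p)
        header[p] = child
    child.count += 1
    insert_tree(rest, child, header)

# second scan of D: every transaction is inserted in L order (I2, I1, I3, I4, I5)
root, header = FPNode(None), {}
for trans in [['I2', 'I1', 'I5'], ['I2', 'I4'], ['I2', 'I3'], ['I2', 'I1', 'I4'], ['I1', 'I3'],
              ['I2', 'I3'], ['I1', 'I3'], ['I2', 'I1', 'I3', 'I5'], ['I2', 'I1', 'I3']]:
    insert_tree(trans, root, header)
print(root.children['I2'].count)   # the I2 child of the root accumulates a count of 7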

5 .Mining Frequent Itemsets Using Vertical Data Format


Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions
in TID-itemset format (that is,{ TID : itemset}), where TID is a transaction-id and itemset is the set of items bought in
transaction TID. This data format is known as horizontal data format. Alternatively, data can also be presented in item-
TID-set format (that is,{item : TIDset}), where item is an item name, and TID set is the set of transaction identifiers
containing the item. This format is known as vertical data format.
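A small illustrative sketch (hypothetical Python, not from the text) of the vertical data format: each item is mapped to its TID set, and the support count of a k-itemset is obtained by intersecting TID sets.

from collections import defaultdict

def to_vertical(horizontal):
    """Convert {TID: itemset} (horizontal format) into {item: TID set} (vertical format)."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return vertical

horizontal = {'T100': {'I1', 'I2', 'I5'}, 'T200': {'I2', 'I4'}, 'T300': {'I2', 'I3'},
              'T400': {'I1', 'I2', 'I4'}}
vertical = to_vertical(horizontal)

# the support count of {I1, I2} is the size of the intersection of the two TID sets
print(vertical['I1'] & vertical['I2'], len(vertical['I1'] & vertical['I2']))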

Various Kinds of Mining Association Rules

• Multilevel association rules involve concepts at different levels of abstraction.


• Multidimensional association rules involve more than one dimension or
predicate(e.g., rules relating what a customer buys as well as the customer’s age.)
• Quantitative association rules involve numeric attributes that have an implicit
ordering among values (e.g., age).

1.Mining Multilevel Association Rules - involve concepts at different levels of


Abstraction

Let’s examine the following example.


The transactional data in the Table shows sales in an AllElectronics store, listing the items purchased in each transaction. The concept hierarchy for the items is shown in the next Figure.
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher level, more general concepts.
Data can be generalized by replacing low-level concepts within the data by their
higher-level concepts, or ancestors, from a concept hierarchy.

Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. A
top-down strategy is used, where counts are collected for the calculation of frequent itemsets at each concept level,
starting at the concept level 1 and working downward in the hierarchy towards the specific concept levels, until no more
frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as
Apriori or its variations.

Using uniform minimum support for all levels (referred to as uniform support):

The same minimum support threshold is used when mining at each level of abstraction. For example, in the following Figure, a minimum support threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not. The method is simple in that users are required to specify only one minimum support threshold. An Apriori-like optimization technique can be used, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support.

Using reduced minimum support at lower levels (referred to as reduced support):

Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the
corresponding threshold is. For example, in Figure, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.

Using item or group-based minimum support (referred to as group-based support):

When mining multilevel rules, users often know which groups are more important than others, so it is useful to set up user-specific, item-based, or group-based minimum support thresholds.
For example, a user could set the minimum support thresholds based on product price or on the items of interest, such as low support thresholds for laptop computers and flash drives, in order to pay particular attention to association patterns containing items in these categories.

2.Mining Multidimensional Association Rules from Relational Databases and


DataWarehouses (involves more than one dimension or predicate)

E.g Mining association rules containing single predicates,


Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, we can refer to the rule above as a single-dimensional or intra-dimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules
containing multiple predicates, such as

Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association
rules. Rule above contains three predicates (age, occupation, and buys), each of which occurs only once in the rule.

Hence, it has no repeated predicates. Multidimensional association rules with no repeated predicates are called inter
dimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of
some predicates. These rules are called hybrid dimensional association rules.
An example of such a rule is the following, where the predicate buys is repeated:

Note that database attributes can be categorical or quantitative.


Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation,
brand, color). Categorical attributes are also called nominal attributes, because their values are “names of things.”
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).
Techniques for mining multidimensional association rules can be categorized into two basic approaches regarding the
treatment of quantitative attributes.

(i).Mining Multidimensional Association Rules Using Static Discretization of Quantitative


Attributes
(ii).Mining Quantitative Association Rules

(i).Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes


Quantitative attributes are discretized before mining using predefined concept hierarchies or data discretization techniques, where numeric values are replaced by interval labels.
The transformed multidimensional data may be used to construct a data cube.
Data cubes are well suited for the mining of multidimensional association rules. They store aggregates (such as counts),
in multidimensional space, which is essential for computing the support and confidence of multidimensional association
rules.
Following Figure shows the lattice of cuboids defining a data cube for the dimensions age, income, and buys. The cells of
an n-dimensional cuboid can be used to store the support counts of the corresponding n-predicate sets.
The base cuboid aggregates the task-relevant data by age, income, and buys; the 2-D cuboid, (age, income), aggregates
by age and income, and so on; the 0-D (apex) cuboid contains the total number of transactions in the task-relevant data.

(ii). Mining Quantitative Association Rules


Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules.
Consider quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side. That is,

where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a categorical attribute from the task-
relevant data. Such rules have been referred to as two-dimensional quantitative association rules, because they contain
two quantitative dimensions.
An example of such a 2-D quantitative association rule is

Association Rule Clustering System


This approach maps pairs of quantitative attributes onto a 2-D grid for tuples satisfying a given categorical attribute
condition. The grid is then searched for clusters of points from which the association rules are generated.
The following steps are involved in ARCS:
Binning: Quantitative attributes can have a very wide range of values defining their domain. The 2-D grid would be huge if, say, age and income were plotted as axes with each possible value of age assigned a unique position on one axis and each possible value of income assigned a unique position on the other. To keep grids down to a manageable size, we instead partition the ranges of quantitative attributes into intervals. This partitioning process is referred to as binning; the intervals are considered "bins." Three common binning strategies are as follows:
Equal-width binning, where the interval size of each bin is the same
Equal-frequency binning, where each bin has approximately the same number of
tuples assigned to it.
Clustering-based binning, where clustering is performed on the quantitative attribute
to group neighboring points (judged based on various distance measures) into the
same bin
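The first two binning strategies above can be sketched in a few lines of Python (the bin counts and age values are made up for illustration); clustering-based binning would additionally run a clustering algorithm such as k-means on the attribute values.

def equal_width_bins(values, k):
    """Partition the value range into k intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def equal_frequency_bins(values, k):
    """Partition the sorted values so each bin holds roughly the same number of tuples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

ages = [23, 25, 31, 34, 38, 41, 47, 52, 58, 63]          # hypothetical age values
print(equal_width_bins(ages, 4))                          # same interval width per bin
print(equal_frequency_bins(ages, 5))                      # same number of tuples per bin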

Finding frequent predicate sets: Once the 2-D array containing the count distribution for each category is set up, it can be scanned to find the frequent predicate sets (those satisfying minimum support). Strong association rules can then be generated from these predicate sets, using a rule generation algorithm.
Clustering the association rules: The strong association rules obtained in the Previous step are then mapped to a 2-D
grid. Following figure shows a 2-D grid for 2-D quantitative association rules predicting the condition buys (X, “HDTV”)
on the rule right-hand side, given the quantitative attributes age and income.
The four Xs correspond to the rules

The four rules can be combined or “clustered” together to form the following simpler rule, which subsumes and replaces
the above four rules:

Correlation Analysis (correlation - relationship)
Strong Rules Are Not Necessarily Interesting: An Example
Consider an analysis of transactions at an AllElectronics shop with respect to the purchase of computer games and videos. Let game refer to the transactions containing computer games, and video refer to those containing videos.
Of the 10,000 transactions analyzed, 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos.
If a minimum support of 30% and a minimum confidence of 60% were given, then the following association rule is discovered:
buys(X, "computer games") => buys(X, "videos") [support = 40%, confidence = 66%]
Above Rule is a strong association rule since its support value of 4,000/10,000 =40% and confidence value of
4,000/6,000 =66% satisfy the minimum support and minimum confidence thresholds, respectively. However, Rule is
misleading because the probability of purchasing videos is 75%, which is even larger than 66%.The above example also
illustrates that the confidence of a rule A=>B can be misleading in that it is only an estimate of the conditional
probability of itemset B given itemset A. It does not measure the real strength of the correlation and implication between
A and B.

From Association Analysis to Correlation Analysis.


As the example shows, the support and confidence measures are insufficient for filtering out uninteresting association rules. To address this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form
A => B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A) P(B);

otherwise, itemsets A and B are dependent and correlated as events.


This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / ( P(A) P(B) )
If the resulting value of above Equation is less than 1, then the occurrence of A is negatively
correlated with the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies
the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.

The above equation is equivalent to P(B|A) / P(B), or conf(A => B) / sup(B), which is also referred to as the lift of the association (or correlation) rule A => B.
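Using the transaction counts from the computer-games/videos example above, support, confidence, and lift can be computed directly, as in the short sketch below.

# transaction counts taken from the example above
n_total, n_game, n_video, n_both = 10000, 6000, 7500, 4000

p_game = n_game / n_total                  # P(game)  = 0.60
p_video = n_video / n_total                # P(video) = 0.75
p_both = n_both / n_total                  # P(game and video) = 0.40

support = p_both                           # 40%
confidence = n_both / n_game               # 4000 / 6000 = 0.66
lift = p_both / (p_game * p_video)         # 0.40 / 0.45 = 0.89

print(round(support, 2), round(confidence, 2), round(lift, 2))
# lift < 1, so the purchase of games is negatively correlated with the purchase of videos,
# even though the rule satisfies the support and confidence thresholds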

Constraint Based Association Mining

1. Metarule-Guided Mining of Association Rules


2. Constraint Pushing: Mining Guided by Rule Constraints

A data mining process may uncover thousands of rules, many of which are uninteresting to the users. A good practice is to have the users specify constraints to limit the search space. This strategy is known as constraint-based mining.
The constraints can include the following:
Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the concept
hierarchies, to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support,
confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be
expressed as metarules.

Metarule-guided mining.
E.g., consider a market analyst for AllElectronics who has data describing customers (such as customer age, address, and credit rating) as well as the list of customer transactions. The analyst is interested in finding associations between customer traits and the items that customers buy. Instead of finding all of the association rules, the analyst wants to know only which pairs of customer traits promote the sale of office software.
An example of such a metarule is

P1 ( X, Y ) ^ P2( X, W ) => buys(X, “office software”), (1)

where P1 and P2 are variables that are instantiated to attributes from the given database during the mining process, X is a
variable representing a customer, and Y and W take on values of the attributes assigned to P1 and P2, respectively.

The data mining system can then search for rules that match the given metarule. For instance, Rule (2) matches or
complies with Metarule (1).
age(X, “30……39”) ^ income(X, “41K….60K”) =>buys(X, “office software”) (2)

Constraint Pushing: Mining Guided by Rule Constraints


Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initialization of variables, and aggregate functions.

Classification and Prediction

Basic Concepts
What Is Classification? What Is Prediction?
Databases are rich with hidden information that can be used for intelligent decision making.
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
For example, build a classification model to categorize bank loan applications as either safe or risky, build a prediction
model to predict the expenses of customers on computer devices given their income and occupation.
A bank loans officer needs analysis of her data in order to learn which loan applicants are
“safe”and which are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy
a new computer.
A medical researcher wants to analyze breast cancer data in order to predict which one of
three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where
a model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for the loan application data;
“yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
Suppose the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction, hence the two terms are often used synonymously.

Classification and numeric prediction are the two major types of prediction problems; in this discussion, the term prediction refers to numeric prediction. How does classification work? Data classification is a two-step process, as shown for the loan application data of Figure 1.

Issues: Data Preparation
Data cleaning
o Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
o Remove the irrelevant or redundant attributes
Data transformation
o Generalize and/or normalize data

Issues: Evaluating Classification Methods


Accuracy
o classifier accuracy: predicting class label
o predictor accuracy: guessing value of predicted attributes
Speed
o time to construct the model (training time)
o time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability:
o understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules.

Classify Decision Tree Induction (generation)

Decision tree induction is the learning of decision trees from class-labelled training tuples. A decision tree is a flowchart-
like tree structure, where each internal node (non leaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root node.Internal nodes are denoted by
rectangles, and leaf nodes are denoted by ovals.

Decision trees are used for classification- Given a tuple, X, for which the associated class label is unknown, the attribute
values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple. Decision trees can easily be converted to classification rules.
“Why are decision tree classifiers so popular?”
The construction of decision tree classifiers does not require any domain knowledge
Decision trees can handle high dimensional data.
The learning and classification steps of decision tree induction are simple and fast.
Decision tree classifiers have good accuracy.

Decision tree induction algorithms have been used for classification in many application areas, such as medicine,
manufacturing and production, financial analysis, astronomy, and molecular biology.

Decision Tree Induction


Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data
partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not
restricted to binary trees
(9) attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
endfor
(15) return N;

Algorithm
Basic algorithm (a greedy algorithm)
o Tree is constructed in a top-down recursive divide-and-conquer manner
o At start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they are discretized in advance)
o Examples are partitioned recursively based on selected attributes
o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
o Conditions for stopping partitioning
o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
o There are no samples left

2. Attribute Selection Measures
o An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labelled training tuples into individual classes.
o If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition
would be pure.
o Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are
to be split.

The three popular attribute selection measures


1).Information gain, 2). Gain ratio, 3). Gini index
1). Information Gain
ID3 uses information gain as its attribute selection measure.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) - Info_A(D)
where Info(D) = -Σ_i p_i log2(p_i) is the expected information needed to classify a tuple in D, and Info_A(D) = Σ_j (|Dj| / |D|) × Info(Dj) is the information still required after partitioning D on attribute A.
Above Table presents a training set, D, of class-labeled tuples randomly selected


from the AllElectronics customer database. In this example, each attribute is discrete valued. Continuous-valued
attributes have been generalized. The class label attribute, buys computer, has two distinct values (namely, {yes, no});
therefore, there are two distinct classes (that is, m = 2).
Let class "p" correspond to yes and class "n" correspond to no. There are nine tuples of class yes and five tuples of class
no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, compute the information
gain of each attribute.
o Class P: buys_computer = “yes”
o Class N: buys_computer = “no”
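A small Python sketch of the information gain computation for this training set (9 tuples of class yes, 5 of class no). The per-partition class counts used for age follow the commonly quoted AllElectronics figures and should be treated as assumptions, since the table itself is not reproduced here.

from math import log2

def info(class_counts):
    """Expected information (entropy), in bits, of a class distribution."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)

info_D = info([9, 5])                            # Info(D) for 9 yes / 5 no, about 0.940 bits

# assumed class counts [yes, no] in the partitions induced by age: youth, middle_aged, senior
age_partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in age_partitions)   # Info_age(D), about 0.694 bits

gain_age = info_D - info_age                     # Gain(age), about 0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
# the attribute with the highest gain is chosen as the splitting attribute at node N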

2). Gain ratio

C4.5 (a successor of ID3) uses gain ratio to overcome this bias (it applies a kind of normalization to information gain). To find the gain ratio,
GainRatio(A) = Gain(A) / SplitInfo_A(D), where
SplitInfo_A(D) = -Σ_{j=1}^{v} (|Dj| / |D|) × log2(|Dj| / |D|)
Computation of gain ratio for the attribute income. E.g

So Gain_ratio(income) = 0.029/0.926 = 0.031


The attribute with the maximum gain ratio is selected as the splitting attribute

3. Gini index
The Gini index is used in CART. Using the notation described above, the Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 - Σ_{i=1}^{m} p_i²
where p_i is the probability that a tuple in D belongs to class Ci.
E.g

D has 9 tuples in buys_computer = “yes” and 5 in “no”


Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}.
but Gini{medium,high} is 0.30 and thus the best since it is the lowest.
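A corresponding sketch for the Gini index. The class counts assumed for the income split (7 yes / 3 no in D1 = {low, medium} and 2 yes / 2 no in D2 = {high}) are illustrative assumptions, since the worked figures are not reproduced in these notes.

def gini(class_counts):
    """Gini impurity of a class distribution."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

gini_D = gini([9, 5])                            # about 0.459 for the whole partition D

# assumed class counts for the binary split on income: D1 = {low, medium}, D2 = {high}
d1, d2 = [7, 3], [2, 2]
n = 14
gini_income = (sum(d1) / n) * gini(d1) + (sum(d2) / n) * gini(d2)

print(round(gini_D, 3), round(gini_income, 3))
# among all candidate splits, the one with the lowest weighted Gini value is selected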

Tree Pruning
Overfitting: An induced tree may overfit the training data
o Too many branches, some may reflect differences due to noise or outliers
o Poor accuracy for unseen samples
Two approaches to avoid overfitting
o Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
o Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”

Scalable Decision Tree Induction Methods


SLIQ
o Builds an index for each attribute and only class list and the current attribute list reside in memory
SPRINT
o Constructs an attribute list data structure
PUBLIC
o Integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest
o Builds an AVC-list (attribute, value, class label)
BOAT Bootstrapped Optimistic Algorithm for Tree Construction
o Uses bootstrapping to create several small samples

Bayesian Classification with examples.


Bayes’ Theorem

Naïve Bayesian Classification


Bayesian Belief Networks
Training Bayesian Belief Networks
Need for Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

1. Bayes’ Theorem : Basics


Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
o E.g., X will buy computer, irrespective of age, income, …
P(X): probability that sample data is observed
P(X|H) (posteriori probability), the probability of observing the sample X, given that the hypothesis holds
o E.g., Given that X will buy computer, the prob. that X is 31..40, medium income

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows from Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as


posteriori = likelihood x prior/evidence
Predicts that X belongs to class Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes

Practical difficulty: requires initial knowledge of many probabilities, which involves significant computational cost

Naïve Bayesian Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <=30, Income = medium, Student = yes , Credit_rating = Fair)
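A minimal sketch of the naïve Bayesian computation for this sample X. The conditional probabilities below are the values usually quoted for the 14-tuple AllElectronics training table (9 yes, 5 no); treat them as assumed inputs, since the table is not reproduced here.

# prior probabilities from the class distribution (9 yes, 5 no out of 14 tuples)
p_yes, p_no = 9 / 14, 5 / 14

# assumed class-conditional probabilities for X = (age<=30, income=medium, student=yes, credit=fair)
cond_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # P(X | yes) under class-conditional independence
cond_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # P(X | no)

score_yes = cond_yes * p_yes    # about 0.028
score_no = cond_no * p_no       # about 0.007

print('buys_computer =', 'yes' if score_yes > score_no else 'no')   # predicts yes for X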

o Advantages
Easy to implement
Good results obtained in most of the cases
o Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
E.g., hospitals: patients: Profile: age, family history, etc.

Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.

Dependencies among these cannot be modeled by Naïve Bayesian Classifier


o To deal with these dependencies Bayesian Belief Networks are used.

Bayesian Belief Networks


o Bayesian belief network allows a subset of the variables conditionally independent
o A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and Y is the parent of P
No dependency between Z and P
Has no loops or cycles

The probability of a particular combination of values (x1, ..., xn) of X is derived from the conditional probability tables (CPTs) as
P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi))
where Parents(Yi) denotes the set of parents of node Yi in the network.

Training Bayesian Networks


o Several scenarios:
Given both the network structure and all variables observable: learn only
the CPTs
Network structure known, some hidden variables: gradient descent

(greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the
model space to reconstruct network topology
Unknown structure, all hidden variables: No good algorithms known for
this purpose

Rule Based Classification


1. Using IF-THEN Rules for Classification
2. Rule Extraction from a Decision Tree
3. Rule Induction Using a Sequential Covering Algorithm

1) Using IF-THEN Rules for Classification


Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an
expression of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys computer = yes.
The “IF”-part (or left-hand side)of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-
hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age
= youth, and student = yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case, we
are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) => (buys_computer = yes).
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set D, let n_covers be the number of tuples covered by R; n_correct be the number of tuples correctly classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
E.g., consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples. Therefore, coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%. (See table.)

If more than one rule is triggered, need conflict resolution


o Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
o Class-based ordering: decreasing order of prevalence or misclassification cost per class
o Rule-based ordering: (decision list): rules are organized into one long priority list, according to some measure of rule
quality or by experts
2.Rule Extraction from a Decision Tree
To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion
along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction,
forming the rule consequent (“THEN” part).
E.g Extracting classification rules from a decision tree. The above decision tree can be converted to classification IF-
THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure are

R1: IF age = youth AND student = no THEN buys computer = no


R2: IF age = youth AND student = yes THEN buys computer = yes
R3: IF age = middle aged THEN buys computer = yes
R4: IF age = senior AND credit rating = excellent THEN buys computer = yes

3.Rule Induction Using a Sequential Covering Algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm. Here the rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many of the tuples of that class (and none of the tuples of other classes).
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input:
D, a data set class-labeled tuples;
Att vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule_set = { }; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn_ One_ Rule(D, Att_ vals, c);
(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule_ set = Rule_ set +Rule; // add new rule to rule set
(8) endfor
(9) return Rule_ Set;

Classification by Back propagation

A Multilayer Feed-Forward Neural Network


Defining a Network Topology
Backpropagation
Inside the Black Box: Backpropagation and Interpretability

Backpropagation: A neural network learning algorithm


Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: A set of connected input/output units where each connection has a weight associated with it

During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class
label of the input tuples
Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier


Weakness
o Long training time
o Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
o Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on a wide array of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the extraction of rules from
trained neural networks
A Neuron (= a perceptron)

For Example

The n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping.
A Multilayer Feed-Forward Neural Network

Working process of Multilayer Feed-Forward Neural Network
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the
network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a
previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough
training samples, they can closely approximate any function

Defining a Network Topology


First decide the network topology: number of units in the input layer, number of hidden layers (if > 1), number of
units in each hidden layer, and number of units in the output layer
Normalizing the input values for each attribute measured in the training tuples to [0.0, 1.0]
One input unit per domain value, each initialized to 0
Output, if for classification and more than two classes, one output unit per class is used
If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation
Iteratively process a set of training tuples & compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's
prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the
first hidden layer, hence “backpropagation”
Steps
o Initialize weights (to small random #s) and biases in the network
o Propagate the inputs forward (by applying activation function)
o Backpropagate the error (by updating weights and biases)
o Terminating condition (when error is very small, etc.)
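The steps above can be condensed into a small numpy sketch of one-hidden-layer backpropagation with sigmoid units; the network size, learning rate, and toy data are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))                                    # 8 training tuples, 3 input attributes
y = (X.sum(axis=1) > 1.5).astype(float).reshape(-1, 1)    # toy binary target values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# initialize weights and biases with small random numbers
W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(2000):
    # propagate the inputs forward through the hidden and output layers
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # backpropagate the error (squared-error derivative times sigmoid derivative)
    err_out = (y - out) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # update weights and biases in the backwards direction
    W2 += lr * h.T @ err_out;  b2 += lr * err_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ err_hid;  b1 += lr * err_hid.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))   # predictions move toward the 0/1 targets as the error shrinks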
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case.
 Rule extraction from networks: network pruning
o Simplify the network structure by removing weighted links that have the least effect on the trained network
o Then perform link, unit, or activation value clustering
o The set of input and activation values are studied to derive rules describing the relationship between the input
and hidden unit layers
 Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from
this analysis can be represented in rules

Support Vector Machines


 A new classification method for both linear and nonlinear data.
 It uses a nonlinear mapping to transform the original training data into a higher dimension
 With the new dimension, it searches for the linear optimal separating hyperplane (i.e., “decision boundary”)
 With a suitable nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
 SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the
support vectors)
 Used both for classification and prediction
 Applications:
a. handwritten digit recognition, object recognition, speaker identification, benchmarking time-series
prediction tests
1. The Case When the Data Are Linearly Separable

 Let the data D be (X1, y1), (X2, y2), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi.
 There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one.
 SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
 A separating hyperplane can be written as
W . X+b = 0;
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
 For 2-D it can be written as
w0 + w1x1 + w2x2 = 0
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
 This becomes a constrained (convex) quadratic optimization problem: Quadratic objective function and linear
constraints  Quadratic Programming (QP)  Lagrangian multipliers
 That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls
on or below H2 belongs to class -1. Combining the two inequalities of above two Equations
we get
yi (w0 + w1x1 + w2x2) ≥ 1, for all i.
 Any training tuples that fall on hyperplanes H1 or H2 (i.e., the "sides" defining the margin) satisfy the above equation and are called support vectors.
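A short hedged sketch of finding the maximum marginal hyperplane with scikit-learn's SVC (assuming scikit-learn is available; the toy 2-D tuples are made up):

from sklearn.svm import SVC
import numpy as np

# toy 2-D training tuples with class labels +1 / -1
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],     # class +1
              [4.0, 4.5], [4.5, 4.0], [5.0, 5.0]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)    # a large C approximates the hard-margin (separable) case
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]        # separating hyperplane W . X + b = 0
print(w, b)
print(clf.support_vectors_)                   # the "essential" training tuples on H1/H2
print(clf.predict([[2.0, 2.0], [4.8, 4.6]]))  # classify new tuples by which side they fall on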
2. The Case When the Data Are Linearly inseparable

Two steps for extending linear SVMs to nonlinear SVMs .


Step 1. Transform the original input data into a higher dimensional space using a nonlinear mapping.
Step 2. Search for a linear separating hyperplane in the new space. This ends up as a quadratic optimization problem that can be solved using the linear SVM formulation. The maximal marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space.

Associative Classification Classification by Association Rule Analysis

Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data
set. Such analysis is useful in many decision-making processes, such as product placement, catalog design, and cross-
marketing. Association rules are mined in a two-step process frequent itemset mining, and rule generation.
The first step searches for patterns of attribute-value pairs that occur repeatedly in a data set, where each attribute-value
pair is considered an item. The resulting attribute value pairs form frequent itemsets.
The second step analyses the frequent itemsets in order to generate association rules.
Advantages
o It explores highly confident associations among multiple attributes and may overcome some constraints by decision-
tree induction, which considers only one attribute at a time
o It is more accurate than some traditional classification methods, such as C4.5
Classification: Based on evaluating a set of rules in the form of
p1 ^ p2 … ^ pi => Aclass = C (confidence, support)

where “^” represents a logical “AND.”

Typical Associative Classification Methods

1. CBA (Classification By Association)


 CBA uses an iterative approach to frequent itemset mining, where multiple passes are made over the data and the derived frequent itemsets are used to generate and test longer itemsets. In general, the number of passes made is equal to the length of the longest rule found. The complete set of rules satisfying the minimum confidence and minimum support thresholds is found and then used to build the classifier.
 CBA uses a method to construct the classifier, where the rules are organized according to decreasing preference based
on their confidence and support. In this way, the set of rules making up the classifier form a decision list.

2. CMAR (Classification based on Multiple Association Rules)


It uses several rule pruning strategies with the help of a tree structure for efficient storage and retrieval of rules.
 CMAR adopts a variant of the FP-growth algorithm to find the complete set of rules satisfying the minimum
confidence and minimum support thresholds. FP-growth uses a tree structure, called an FP-tree, to register all of the
frequent itemset information contained in the given data set, D. This requires only two scans of D. The frequent itemsets
are then mined from the FP-tree.
 CMAR uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent
itemset. In this way, it is able to combine rule generation together with frequent itemset mining in a single step.
 CMAR employs another tree structure to store and retrieve rules efficiently and to prune rules based on confidence,
correlation, and database coverage. Rule pruning strategies are triggered whenever a rule is inserted into the tree.
 CMAR also prunes rules for which the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance.

3. CPAR (Classification based on Predictive Association Rules)

 CPAR uses an algorithm for classification known as FOIL (First Order Inductive Learner). FOIL builds rules to
differentiate positive tuples ( having class buys computer = yes) from negative tuples (such as buys computer = no).
 For multiclass problems, FOIL is applied to each class. That is, for a class, C, all tuples of class C are considered
positive tuples, while the rest are considered negative tuples. Rules are generated to differentiate C tuples from all others.
Each time a rule is generated, the positive samples it satisfies (or covers) are removed until all the positive tuples in the
data set are covered.
 CPAR relaxes this step by allowing the covered tuples to remain under consideration, but reducing their weight. The
process is repeated for each class. The resulting rules are merged to form the classifier rule set.

Lazy Learners (or Learning from Your Neighbours)

Eager learners
 Decision tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support
vector machines, and classification based on association rule mining—are all examples of eager learners.
 Eager learners - when given a set of training tuples, will construct a classification model before receiving new tuples to
classify.
Lazy Learners
o In a lazy approach, for a given training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until a test tuple is given. Only after seeing the test tuple does it perform classification, based on the tuple's similarity to the stored training tuples.
o Lazy learners do less work when a training tuple is presented and more work when making a classification or prediction. Because lazy learners store the training tuples, or "instances," they are also referred to as instance-based learners, even though all learning is essentially based on instances.

Examples of lazy learners:
 k-nearest neighbour classifiers
 case-based reasoning classifiers

1. k-nearest neighbour classifiers (K-NN classifier)

 Nearest-neighbour classifiers are based on a comparison, between given test tuple with training tuples that are similar
to it.
 The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. All of the training tuples are stored in an n-dimensional pattern space.
 When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that
are closest to the unknown tuple. These k training tuples are the k “nearest neighbours” of the unknown tuple.
 “Closeness” is defined in terms of a distance metric, such as Euclidean distance.
 The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( Σ_{i=1}^{n} (x1i − x2i)² )
 For k-nearest-neighbour classification, the unknown tuple is assigned the most common class among its k nearest neighbours. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
 Nearest-neighbour classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbours of the unknown tuple.
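A minimal k-nearest-neighbour sketch in plain Python, using Euclidean distance and a majority vote; the training tuples and the choice k = 3 are hypothetical.

from collections import Counter
from math import dist   # Euclidean distance, available in Python 3.8+

def knn_classify(train, query, k=3):
    """train is a list of (attribute_tuple, class_label) pairs."""
    # sort the training tuples by distance to the query tuple and keep the k nearest
    neighbours = sorted(train, key=lambda tc: dist(tc[0], query))[:k]
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.2), 'yes'), ((0.9, 0.8), 'yes'), ((3.0, 3.1), 'no'),
         ((3.2, 2.9), 'no'), ((1.1, 1.0), 'yes')]
print(knn_classify(train, (1.0, 1.1), k=3))   # -> 'yes'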

Case-Based Reasoning (CBR)
 Case-based reasoning classifiers use a database of problem solutions to solve new problems. CBR stores the tuples or
“cases” for problem solving as complex symbolic descriptions. e.g Medical education - where patient case histories and
treatments are used to help diagnose and treat new patients.
 When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is
found, then the associated solution to that case is returned. If no identical case is found, then the case-based reasoner will
search for training cases having components that are similar to those of the new case.
 Ideally, these training cases may be considered as neighbours of the new case. If cases are represented as graphs, this
involves searching for subgraphs that are similar to subgraphs within the new case. The case-based reasoner tries to
combine the solutions of the neighbouring training cases in order to propose a solution for the new case.

Challenges in case-based reasoning


 Finding a good similarity metric and suitable methods for combining solutions.
 The selection of salient features for indexing training cases and the development of efficient indexing techniques.
 A trade-off between accuracy and efficiency arises as the number of stored cases becomes very large. As this number increases, the case-based reasoner becomes more intelligent. After a certain point, however, the efficiency of the system will suffer as the time required to search for and process relevant cases increases.

Other Classification Methods.


1. Genetic Algorithms
2. Rough Set Approach
3. Fuzzy Set Approaches
1. Genetic Algorithms
 Genetic Algorithm: based on a comparison to biological evolution
 An initial population is created consisting of randomly generated rules
o Each rule is represented by a string of bits
o e.g., “IF A1 AND NOT A2 THEN C2” can be encoded as “100”
o Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001.”

o If an attribute has k > 2 values, k bits can be used to encode the attribute’s values
 Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their
offspring
 The fitness of a rule is represented by its classification accuracy on a set of training examples
 Offspring are generated by crossover and mutation
 The process continues until a population P evolves when each rule in P satisfies a pre-specified threshold
 Slow but easily parallelizable
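To make the bit-string encoding concrete, the following toy sketch shows single-point crossover and mutation on rules encoded as above (e.g., "100" for IF A1 AND NOT A2 THEN C2); the function names and the mutation rate are illustrative only.

import random

random.seed(1)

def crossover(rule_a, rule_b):
    """Single-point crossover: swap tail substrings of two parent rules."""
    point = random.randint(1, len(rule_a) - 1)
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, rate=0.1):
    """Flip each bit with a small probability."""
    return ''.join(b if random.random() > rate else str(1 - int(b)) for b in rule)

parent1, parent2 = '100', '001'        # bit-string encodings of two IF-THEN rules
child1, child2 = crossover(parent1, parent2)
print(child1, child2, mutate(child1))
# in a full GA, offspring replace low-fitness rules, where fitness is the rule's
# classification accuracy on a set of training examples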

2. Rough Set Approach


o Rough sets are used to approximately or "roughly" define equivalence classes
o A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
o Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity

3.Fuzzy Set Approaches


o Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using fuzzy
membership graph)
o Attribute values are converted to fuzzy values
o e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
o For a given new sample, more than one fuzzy value may apply
o Each applicable rule contributes a vote for membership in the categories
o Typically, the truth values for each predicted category are summed, and these sums are combined

Predictions (Numeric prediction / Regression)

1. Linear Regression
2. Nonlinear Regression
3. Other Regression-Based Methods
Numeric prediction is the task of predicting continuous values for given input, e.g., predicting the salary of an employee with 10 years of work experience, or the sales of a new product.
An approach for numeric prediction is regression, a statistical methodology. Regression analysis can be used to model the
relationship between one or more independent or predictor variables and a dependent or response variable (which is
continuous-valued).
The predictor variables are the attributes of the tuple. In general, the values of the predictor variables are known. The
response variable is unknown so predict it.

Software packages such as SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com), as well as Numerical Recipes in C, provide routines for solving regression problems.
1. Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor
variable, x. It is the simplest form of regression, and models y as a linear function of x.
That is,
y = b + w x
where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line, respectively.

The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write
y = w0 + w1 x
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. The regression coefficients can be estimated using this method with the following equations:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²
w0 = ȳ − w1 x̄
where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.

Example. Straight-line regression using the method of least squares. The Table shows a set of paired data where x is the number of years of work experience of an employee and y is the corresponding salary of the employee.

The 2-D data can be graphed on a scatter plot, as in Figure. The plot suggests a linear relationship between the two
variables, x and y.
We model the relationship between salary and the number of years of work experience with the equation
y = w0 + w1 x
Given the above data, we compute x̄ = 9.1 and ȳ = 55.4. Substituting these values into the above equations, we can solve for the regression coefficients w1 and w0, giving the equation of the least squares line.
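A small Python sketch of the least-squares estimate. The (years, salary in $1000s) pairs below are consistent with the stated means x̄ = 9.1 and ȳ = 55.4 and match the values commonly quoted for this example, but treat them as assumptions since the table is not reproduced here.

# hypothetical (years of experience, salary in $1000s) pairs
data = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36),
        (6, 43), (11, 59), (21, 90), (1, 20), (16, 83)]

x_mean = sum(x for x, _ in data) / len(data)    # 9.1
y_mean = sum(y for _, y in data) / len(data)    # 55.4

# method of least squares: w1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
num = sum((x - x_mean) * (y - y_mean) for x, y in data)
den = sum((x - x_mean) ** 2 for x, y in data)
w1 = num / den
w0 = y_mean - w1 * x_mean

print(round(w0, 1), round(w1, 1))   # predicted salary = w0 + w1 * years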

Multiple linear regression is an extension of straight-line regression involving more than one predictor variable, e.g., y = w0 + w1 x1 + w2 x2.
2. Nonlinear Regression
 Some nonlinear models can be modeled by a polynomial function
 A polynomial regression model can be transformed into a linear regression model.
 For example, y = w0 + w1 x + w2 x² + w3 x³ can be converted to linear form by defining new variables x1 = x, x2 = x², x3 = x³, giving y = w0 + w1 x1 + w2 x2 + w3 x3.
 Other functions, such as power function, can also be transformed to linear model.
 Some models are intractable nonlinear (e.g., sum of exponential terms)
 possible to obtain least square estimates through extensive calculation on more complex formulae
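As a brief numpy sketch of the transformation described above, a cubic model becomes linear once new variables x1 = x, x2 = x², x3 = x³ are introduced; the data are generated from a known cubic for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 + 0.5 * x - 0.3 * x**2 + 0.1 * x**3      # values generated from a known cubic

# design matrix [1, x, x^2, x^3]: the cubic model is now linear in the coefficients w0..w3
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)       # ordinary least squares

print(np.round(w, 2))                           # recovers approximately [2.0, 0.5, -0.3, 0.1]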

3. Other Regression-Based Methods


(i).Generalized linear model:
o Foundation on which linear regression can be applied to modeling categorical response variables.
o Variance of y is a function of the mean value of y, not a constant.
o Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
o Poisson regression: models the data that exhibit a Poisson distribution.
(ii).Log-linear models: (for categorical data)
o Approximate discrete multidimensional prob. distributions
o Also useful for data compression and smoothing
(iii). Regression trees and model trees
o Trees to predict continuous values rather than class labels
 Regression tree: proposed in CART system
o CART: Classification And Regression Trees
o Each leaf stores a continuous-valued prediction
o It is the average value of the predicted attribute for the training tuples that
reach the leaf
 Model tree: proposed by Quinlan (1992)
o Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
o A more general case than regression tree

 Regression and model trees tend to be more accurate than linear regression when
the data are not represented well by a simple linear model

UNIT V

1. Define Clustering?

Clustering is a process of grouping the physical or conceptual data object into clusters.

2. What do you mean by Cluster Analysis?


A cluster analysis is the process of analyzing the various clusters to organize the different objects
into meaningful and descriptive objects.

3. What are the fields in which clustering techniques are used?

• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and to characterize the customer groups on the basis of purchasing patterns.
• Clustering is used in the identification of groups of automobile insurance policy customers.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
• Clustering is used to classify documents on the web for information discovery.

4.What are the requirements of cluster analysis?


The basic requirements of cluster analysis are:
• Dealing with different types of attributes
• Dealing with noisy data
• Constraints on clustering
• Dealing with arbitrary shapes
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameters
• Scalability

5.What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are interval scaled, binary, nominal, ordinal and
ratio scaled data.

6.What are interval scaled variables?


Interval-scaled variables are continuous measurements on a roughly linear scale, for example height and weight, weather temperature, or coordinates of a location. Dissimilarity between such measurements is typically calculated using the Euclidean distance or the Minkowski distance.

7. Define Binary variables? And what are the two types of binary variables?
Binary variables have only two states, 0 and 1; when the state is 0 the variable is absent and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. A symmetric binary variable is one whose two states are equally valuable and carry the same weight. An asymmetric binary variable is one whose states are not equally important (for example, the positive and negative outcomes of a medical test).

8. Define nominal, ordinal and ratio scaled variables?


A nominal variable is a generalization of the binary variable: it has more than two states. For example, a nominal variable colour may consist of four states: red, green, yellow, or black. The total number of states is N, and the states are denoted by letters, symbols, or integers. An ordinal variable also has more than two states, but these states are ordered in a meaningful sequence. A ratio-scaled variable makes positive measurements on a nonlinear scale, such as an exponential scale, following the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

9. What do you mean by partitioning method?


In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two main types of partitioning methods are k-means and k-medoids.

10. Define CLARA and CLARANS?


Clustering in LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work well if none of the selected representative data sets yields the best k-medoids. To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is carried out after selecting the representative data sets.

11. What is Hierarchical method?


Hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This
method works on bottom-up or top-down approaches.

12. Differentiate Agglomerative and Divisive Hierarchical Clustering?


Agglomerative hierarchical clustering works on the bottom-up approach. In this method, each object initially forms its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.
Divisive Hierarchical clustering method works on the top-down approach. In this method all the objects are
arranged within a big singular cluster and the large cluster is continuously divided into smaller clusters until each
cluster has a single object.

13. What is CURE?
Clustering Using Representatives is called as CURE. The clustering algorithms generally work on spherical and
similar size clusters. CURE overcomes the problem of spherical and similar size cluster and is more robust with
respect to outliers.

14. Define Chameleon method?

Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity between the two clusters is high relative to the interconnectivity of the objects within each cluster.

15. Define Density based method?

Density based method deals with arbitrary shaped clusters. In density-based method, clusters are formed on the
basis of the region where the density of the objects is high.

16. What is a DBSCAN?


Density Based Spatial Clustering of Application Noise is called as DBSCAN. DBSCAN is a density based
clustering method that converts the high-density objects regions into clusters with arbitrary shapes and sizes.
DBSCAN defines the cluster as a maximal set of density connected points.

17. What do you mean by Grid Based Method?

In this method objects are represented by the multi resolution grid data structure. All the objects are quantized into a
finite number of cells and the collection of cells build the grid structure of objects. The clustering operations are
performed on that grid structure. This method is widely used because its processing time is very fast and that is
independent of number of objects.

18. What is a STING?

Statistical Information Grid is called as STING; it is a grid based multi resolution clustering method. In STING
method, all the objects are contained into rectangular cells, these cells are kept into various levels of resolutions and
these levels are arranged in a hierarchical structure.

19. Define Wave Cluster?


It is a grid based multi resolution clustering method. In this method all the objects are represented by a
multidimensional grid structure and a wavelet transformation is applied

for finding the dense regions. Each grid cell contains the information of the group of objects that map into that cell. A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.

20. What is Model based method?


Model-based methods are used to optimize the fit between a given data set and a mathematical model. These methods assume that the data are generated by a mixture of underlying probability distributions. The two basic approaches in this method are the statistical approach and the neural network approach.

21. What is the use of Regression?


Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in essence, regression takes a set of data and fits the data to a formula.

22. What are the reasons for not using the linear regression model to estimate the output data?

There are many reasons for this. One is that the data simply do not fit a linear model. It is also possible that the data do in fact represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

23. What are the two approaches used by regression to perform classification?
Regression can be used to perform classification using the following two approaches: (1) Division: the data are divided into regions based on class, and (2) Prediction: formulas are generated to predict the output class value.

24. What do u mean by logistic regression?


Instead of fitting the data to a straight line, logistic regression uses a logistic curve. The formula for the univariate logistic curve is

    P = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership.

25. What is Time Series Analysis?


A time series is a set of attribute values over a period of time. Time Series Analysis may be viewed as finding
patterns in the data and predicting future values.

COURSE CODE: 1151CS114    COURSE TITLE: DATA WAREHOUSING AND DATA MINING    L T P C: 3 0 0 3

UNIT-V

CO Nos.: CO5
Course Outcome: Apply clustering techniques in various data mining applications
Level of learning domain (based on revised Bloom's taxonomy): K3

Correlation of COs with Programme Outcomes:
CO5 is correlated with the Programme Outcomes (PO1 to PO12) at levels H, L, H, L, M, M, M, M for the mapped POs, and with the Programme Specific Outcomes PSO1 to PSO3 at levels H, H, M.

UNIT V - CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING (L-8)

Cluster Analysis - Types of Data – Categorization of Major Clustering Methods - K- means – Partitioning Methods –
Hierarchical Methods - Outlier Analysis – Data Mining Applications – Social Impacts of Data Mining – Mining WWW -
Mining Text Database – Mining Spatial Databases - Case Studies (Simulation Tool).

1. Cluster Analysis concepts

Cluster. A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.

Clustering. The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.

Cluster analysis has wide applications, such as market or customer segmentation, pattern recognition, biological studies, spatial data analysis, and Web document classification.

Cluster analysis can be used as:

 a stand-alone data mining tool to gain insight into the data distribution, or
 a pre-processing step for other data mining algorithms.

General Applications of Clustering:

 Pattern Recognition
 Spatial Data Analysis
o Create thematic maps in Geographical information system by clustering feature
spaces
o Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 World Wide Web
o Document classification
o Cluster Weblog data to discover groups of similar access patterns

Clustering Applications: Marketing, Land use, Insurance, City-planning, Earthquake studies.

Requirements of Clustering in Data Mining


 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

Types of Data in cluster analysis

 Interval – scaled Variables


 Binary variables
 Categorical, Ordinal and Ratio scaled Variables
 Variables of mixed type
 Vector Objects

Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents,
countries, and so on. The two data structures are used.

Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also
called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a
relational table, or n-by-p matrix (n objects × p variables):

Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all
pairs of n objects. It is often represented by an n-by-n table:

Measure the Quality of Clustering

 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal
ratio, and vector variables.
 Weights should be associated with different variables based on applications and data semantics.

 It is hard to define “similar enough” or “good enough”
o the answer is typically highly subjective.

1. Interval – scaled Variables - Euclidean distance, Manhattan distance

Interval-scaled variables are continuous measurements of a roughly linear scale. Examples -weight and height,
latitude and longitude coordinates and weather temperature.

After standardization, or without standardization in certain applications, the dissimilarity or similarity between the
objects described by interval-scaled variables is typically computed based on the distance between each pair of
objects.

1). The most popular distance measure is Euclidean distance, which is defined as

    d(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )

where i = (xi1, xi2, ..., xin) and j = (xj1, xj2, ..., xjn) are two n-dimensional data objects.

2). Another metric is Manhattan (city block) distance, defined as

    d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xin - xjn|

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function:
 d(i, j) >= 0: distance is a non-negative number.
 d(i, i) = 0: the distance of an object to itself is 0.
 d(i, j) = d(j, i): distance is symmetric.
 d(i, j) <= d(i, k) + d(k, j): the triangle inequality.
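A small Python sketch of the two distance measures defined above; the two objects are assumed values for illustration.

from math import sqrt

def euclidean(i, j):
    # d(i, j) = sqrt(sum over all dimensions of (x_i - x_j)^2)
    return sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    # d(i, j) = sum over all dimensions of |x_i - x_j|
    return sum(abs(a - b) for a, b in zip(i, j))

x1 = (1, 2)     # assumed 2-D objects, e.g. standardized (weight, height)
x2 = (3, 5)
print(euclidean(x1, x2))   # sqrt(4 + 9), approx. 3.61
print(manhattan(x1, x2))   # 2 + 3 = 5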
2. Binary variables

 Distance measure for symmetric binary variables:

    d(i, j) = (r + s) / (q + r + s + t)

 Distance measure for asymmetric binary variables (the number of negative matches, t, is considered unimportant and is ignored):

    d(i, j) = (r + s) / (q + r + s)

 Jaccard coefficient (similarity measure for asymmetric binary variables):

    sim(i, j) = q / (q + r + s)

where, for objects i and j, q is the number of variables that equal 1 for both objects, r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both.
 Dissimilarity between Binary Variables:

 gender is a symmetric attribute


 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
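A rough Python sketch of these measures: the contingency counts q, r, s, t are computed for two binary vectors and plugged into the formulas above. The two vectors are assumed values (1 = Y/P, 0 = N) standing in for two rows of the table, which is not reproduced here.

def counts(i, j):
    # q: both 1;  r: i = 1, j = 0;  s: i = 0, j = 1;  t: both 0
    q = sum(1 for a, b in zip(i, j) if (a, b) == (1, 1))
    r = sum(1 for a, b in zip(i, j) if (a, b) == (1, 0))
    s = sum(1 for a, b in zip(i, j) if (a, b) == (0, 1))
    t = sum(1 for a, b in zip(i, j) if (a, b) == (0, 0))
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):
    q, r, s, t = counts(i, j)
    return (r + s) / (q + r + s)        # negative matches t are ignored

def jaccard(i, j):
    q, r, s, t = counts(i, j)
    return q / (q + r + s)

a = [1, 0, 1, 0, 0, 0]                  # assumed asymmetric attributes for object i
b = [1, 0, 1, 0, 1, 0]                  # assumed asymmetric attributes for object j
print(round(d_asymmetric(a, b), 2))     # (0 + 1) / (2 + 0 + 1) = 0.33
print(round(jaccard(a, b), 2))          # 2 / 3, approx. 0.67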

3. Categorical, Ordinal and Ratio scaled Variables

A categorical variable is a generalization of the binary variable in that it can take on more

than two states. For example, map colour is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue.

The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

    d(i, j) = (p - m) / p
where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is
the total number of variables.

Dissimilarity between categorical variables

Consider object identifier, test-1 column only to find the categorical variables. By using above equation we get

Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 The values of an ordinal variable can be mapped to ranks. For example, suppose that an ordinal variable f has
Mf states. These ordered states define the ranking 1, ..., Mf.

 Ordinal variables are handled as follows:
o Step 1: replace the value xif of variable f for object i by its rank rif in {1, ..., Mf}.
o Step 2: normalize the rank onto [0.0, 1.0] by computing zif = (rif - 1) / (Mf - 1).
o Step 3: compute the dissimilarity using any of the distance measures for interval-scaled variables, using zif to represent the f value for the i-th object.
 Dissimilarity between ordinal variables.

o From above table consider only the object-identifier and the continuous ordinal variable, test-2, are available.
There are 3 states for test-2, namely fair, good, and excellent, that is Mf =3.
o For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
o Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
o For step 3, we can use, say, the Euclidean distance, which results in the following dissimilarity matrix:
    d(2,1) = 1.0, d(3,1) = 0.5, d(3,2) = 0.5, d(4,1) = 0.0, d(4,2) = 1.0, d(4,3) = 0.5
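The three steps can be sketched in Python as below, using the values implied by the example (ranks 3, 1, 2, 3 for the ordering fair < good < excellent):

states = ["fair", "good", "excellent"]                 # ordered states, Mf = 3
rank = {s: r for r, s in enumerate(states, start=1)}   # fair->1, good->2, excellent->3
Mf = len(states)

values = ["excellent", "fair", "good", "excellent"]    # test-2 values of objects 1 to 4

# step 1: replace each value by its rank; step 2: normalize ranks onto [0, 1]
z = [(rank[v] - 1) / (Mf - 1) for v in values]         # [1.0, 0.0, 0.5, 1.0]

# step 3: interval-scaled distance (Euclidean in one dimension) on the z values
for i in range(len(z)):
    for j in range(i):
        print(f"d({i + 1},{j + 1}) = {abs(z[i] - z[j]):.1f}")
# prints d(2,1)=1.0, d(3,1)=0.5, d(3,2)=0.5, d(4,1)=0.0, d(4,2)=1.0, d(4,3)=0.5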
Ratio scaled Variables

 A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an


exponential scale, approximately following the formula

    Ae^(Bt)  or  Ae^(-Bt)
where A and B are positive constants, and t typically represents time. E.g.,the growth of a bacteria population , the
decay of a radioactive element.

 A common method to handle ratio-scaled variables when computing the dissimilarity between objects is to apply a logarithmic transformation to the ratio-scaled variable and then treat the transformed values as interval-scaled.

 Dissimilarity between ratio-scaled variables.

o This time, from the above table, only the object identifier and the ratio-scaled variable, test-3, are considered.
o Logarithmic transformation of the test-3 values results in the values 2.65, 1.34, 2.21, and 3.08 for the objects 1 to 4, respectively.
o Using the Euclidean distance on the transformed values, we obtain the following dissimilarity matrix:
    d(2,1) = 1.31, d(3,1) = 0.44, d(3,2) = 0.87, d(4,1) = 0.43, d(4,2) = 1.74, d(4,3) = 0.87
4. Variables of Mixed Types

 A database may contain all the six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
 One may use a weighted formula to combine their effects:

    d(i, j) = Σf δij(f) · dij(f) / Σf δij(f)

where the indicator δij(f) = 0 if the measurement of variable f is missing for object i or j (or if xif = xjf = 0 and f is asymmetric binary), and δij(f) = 1 otherwise. The contribution dij(f) of variable f depends on its type:

o f is interval-based: dij(f) = |xif - xjf| / (maxh xhf - minh xhf), where h runs over all nonmissing objects for variable f
o f is binary or categorical: dij(f) = 0 if xif = xjf; otherwise dij(f) = 1
o f is ordinal: compute the ranks rif, set zif = (rif - 1) / (Mf - 1), and treat zif as interval-scaled
o f is ratio-scaled: either apply a logarithmic transformation and treat the transformed data as interval-scaled, or treat the values as continuous ordinal data

5. Vector objects:

 Vector objects: keywords in documents, gene features in micro-arrays, etc.


 Broad applications: information retrieval, biologic taxonomy, etc.
 We need to define a similarity function, s(x, y), to compare two vectors x and y.

Cosine measure:

    s(x, y) = (x^t · y) / (||x|| ||y||)

where x^t is the transpose of vector x, and ||x|| and ||y|| are the Euclidean norms (lengths) of vectors x and y.

 A variant: Tanimoto coefficient

    s(x, y) = (x^t · y) / (x^t · x + y^t · y - x^t · y)
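A small Python sketch of the cosine measure and the Tanimoto coefficient for two assumed keyword-count vectors:

from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

d1 = [5, 0, 3, 0, 2]      # assumed term counts for document 1
d2 = [3, 0, 2, 0, 1]      # assumed term counts for document 2
print(round(cosine(d1, d2), 3))     # close to 1.0, i.e. very similar documents
print(round(tanimoto(d1, d2), 3))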
Categorization of Major Clustering Methods

Clustering is a dynamic field of research in data mining. Many clustering algorithms have been developed. These
can be categorized into (i).Partitioning methods, (ii).hierarchical methods,(iii). density-based methods, (iv).grid-
based methods, (v).model-based methods, (vi).methods for high-dimensional data, and (vii), constraint based
methods.

A partitioning method first creates an initial set of k partitions, where parameter k is the number of partitions to
construct. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects
from one group to another. Typical partitioning methods include k-means, k-medoids, CLARANS, and their
improvements.

A hierarchical method creates a hierarchical decomposition of the given set of data objects. The method can be
classified as being either agglomerative (bottom-up) or divisive (top-down), based on how the hierarchical
decomposition is formed. To compensate for the rigidity of merge or split, the quality of hierarchical agglomeration
can be improved by analyzing object linkages at each hierarchical partitioning (such as in ROCK and Chameleon),
or by first performing microclustering (that is, grouping objects into “microclusters”) and then operating on the
microclusters with other clustering techniques, such as iterative relocation (as in BIRCH).

A density-based method clusters objects based on the notion of density. It either grows clusters according to the
density of neighborhood objects (such as in DBSCAN) or according to some density function (such as in
DENCLUE). OPTICS is a density-based method that generates an augmented cluster ordering of the data for automatic and interactive cluster analysis.

A grid-based method first quantizes the object space into a finite number of cells that form a grid structure, and
then performs clustering on the grid structure. STING is a typical example of a grid-based method based on
statistical information stored in grid cells. WaveCluster and CLIQUE are two clustering algorithms that are both grid
based and density-based.

A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to that model.
Examples of model-based clustering include the EM algorithm (which uses a mixture density model), conceptual
clustering (such as COBWEB), and neural network approaches (such as self-organizing feature maps).

Clustering high-dimensional data is of vital importance, because in many advanced applications, data objects such
as text documents and microarray data are high- dimensional in nature. There are three typical methods to handle
high dimensional data sets: dimension-growth subspace clustering, represented by CLIQUE, dimension-
reduction projected clustering, represented by PROCLUS, and frequent pattern–based clustering, represented by
pCluster.

A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Typical examples include clustering with the existence of obstacle objects, clustering under user-specified constraints, and semi-supervised clustering based on "weak" supervision (such as pairs of objects labeled as belonging to the same or different clusters).

One person’s noise could be another person’s signal. Outlier detection and analysis are very useful for fraud
detection, customized marketing, medical analysis, and many other tasks. Computer-based outlier analysis methods
typically follow either a statistical distribution-based approach, a distance-based approach, a density-based local
outlier detection approach, or a deviation-based approach.

Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects
into k partitions (k ≤ n), where each partition represents a cluster. The commonly used partitioning methods are (i).
k-means, (ii). k-medoids.

Centroid-Based Technique: The k-Means Method

o k-means: each cluster's center is represented by the mean value of the objects in the cluster, i.e., each cluster is represented by its centroid.

o Algorithm

Input:
k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean
value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;

o Strength: Relatively efficient: O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n.

o Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
o Weakness
 Applicable only when mean is defined, then what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes

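A minimal Python sketch of steps (1) to (5) above using squared Euclidean distance; the data set and k are assumed for illustration, and empty clusters are handled naively.

import random

def kmeans(points, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                 # (1) arbitrary initial centers
    for _ in range(max_iter):                          # (2) repeat
        clusters = [[] for _ in range(k)]
        for p in points:                               # (3) (re)assign each object to the nearest mean
            idx = min(range(k), key=lambda c: sum((a - b) ** 2
                                                  for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        new_centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]   # (4) update the cluster means
        if new_centers == centers:                     # (5) until no change
            break
        centers = new_centers
    return centers, clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]    # assumed 2-D objects
centers, clusters = kmeans(data, k=2)
print(centers)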
Representative Object-Based Technique: The k-Medoids Method

o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

o K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be
used, which is the most centrally located object in a cluster.

The K-Medoids Clustering Methods

Find representative objects, called medoids, in clusters


1. PAM (Partitioning Around Medoids, 1987)
o starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting clustering
o PAM works effectively for small data sets, but does not scale well for large data sets
2. CLARA (Clustering LARge Applications)
3. CLARANS (Ng & Han, 1994): Randomized sampling

PAM (Partitioning Around Medoids)

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central objects.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, Orandom;
(5) compute the total cost, S, of swapping representative object, Oj, with Orandom;
(6) if S < 0 then swap Oj with Orandom to form the new set of k representative objects;
(7) until no change;
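A rough Python sketch (numpy assumed available) of the swapping idea in steps (4) to (6): every candidate swap of a representative object with a non-representative object is evaluated by the total cost of the resulting clustering and kept only if it lowers the cost. This is a simplified illustration of the idea, not the full PAM bookkeeping.

import numpy as np

def total_cost(X, medoids):
    # sum of distances of every object to its nearest representative object
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step (1)
    best = total_cost(X, medoids)
    improved = True
    while improved:                                             # steps (2)-(7)
        improved = False
        for pos in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[pos] = o                              # try swapping Oj with Orandom
                cost = total_cost(X, candidate)                 # step (5): total cost S
                if cost < best:                                 # step (6): keep improving swaps
                    medoids, best, improved = candidate, cost, True
    return medoids

X = np.array([[1.0, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]])  # last point is an outlier
print(pam(X, k=2))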
CLARA (Clustering LARge Applications) - Sampling based method

 PAM works efficiently for small data sets but does not scale well for large data sets.
 Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the
output
 Strength: deals with larger data sets than PAM
 Weakness:
o Efficiency depends on the sample size
o A good clustering based on samples will not necessarily represent a good clustering of the whole data
set if the sample is biased

3. CLARANS (“Randomized” CLARA)

 CLARANS (A Clustering Algorithm based on Randomized Search)


 CLARANS draws sample of neighbors dynamically.
 The clustering process can be presented as searching a graph where every node is a potential solution, that is, a
set of k medoids
 If the local optimum is found, CLARANS starts with new randomly selected node in search for a new local
optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may further improve its performance
Hierarchical Clustering Methods

A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering
methods can be further classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting)
fashion.

There are two types of hierarchical clustering methods:

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and
then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering
by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object
forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is
obtained.

Hierarchical methods decompose the data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected
component forms a cluster.

 Major weakness of agglomerative clustering methods


o do not scale well: time complexity of at least O(n²), where n is the number of total objects
o can never undo what was done previously

 Integration of hierarchical with distance-based clustering


o BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub- clusters
o ROCK (1999): clustering categorical data by neighbor and link analysis
o CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996):
 Birch: Balanced Iterative Reducing and Clustering using Hierarchies
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the data record.

Cluster Feature (CF)

 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical
Clustering.

 Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
o Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
o Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF- tree

 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
o A nonleaf node in a tree has descendants or “children”
o The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters


o Branching factor: specify the maximum number of children.
o threshold: max diameter of sub-clusters stored at the leaf nodes
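In the standard BIRCH formulation, a clustering feature is the triple CF = (N, LS, SS), where N is the number of points in the sub-cluster, LS is their linear sum and SS is their square sum; the centroid and radius of the sub-cluster can be derived from the CF, and CFs are additive when sub-clusters are merged. A brief sketch of that idea (numpy assumed, data made up):

import numpy as np

class CF:
    """Clustering feature CF = (N, LS, SS) of a sub-cluster."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_points(cls, points):
        pts = np.asarray(points, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum())

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # root-mean-square distance of the member points from the centroid
        c = self.centroid()
        return float(np.sqrt(self.SS / self.N - c @ c))

    def __add__(self, other):
        # additivity: merging two sub-clusters simply adds their CFs
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

cf1 = CF.from_points([[1, 1], [2, 1], [1, 2]])     # assumed leaf sub-cluster
cf2 = CF.from_points([[2, 2], [3, 2]])             # assumed leaf sub-cluster
merged = cf1 + cf2
print(merged.N, merged.centroid(), round(merged.radius(), 3))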

ROCK: A Hierarchical Clustering Method

 ROCK: RObust Clustering using linKs


 Major ideas
o Use links to measure similarity/proximity
o Not distance-based
o Computational complexity: O(n² + n·mm·ma + n²·log n), where mm is the maximum number of neighbors and ma is the average number of neighbors

 Algorithm: sampling-based clustering


o Draw random sample
o Cluster with links
o Label data in disk
 Experiments
o Congressional voting, mushroom data

 Similarity Measure in ROCK


o Example: Two groups (clusters) of transactions
o C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}

o C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
o Jaccard co-efficient may lead to a wrong clustering result:
o Within C1: similarity ranges from 0.2 (e.g., {a, b, c} and {b, d, e}) to 0.5 (e.g., {a, b, c} and {a, b, d})
o Across C1 & C2: similarity could be as high as 0.5 (e.g., {a, b, c} and {a, b, f})
o Jaccard co-efficient-based similarity function:

    sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

o Ex. Let T1 = {a, b, c}, T2 = {c, d, e}; then sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
 Link Measure in ROCK

o Links: no. of common neighbors


o C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d},
 {b, c, e}, {b, d, e}, {c, d, e}

o C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}

o Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

o link(T1, T2) = 4, since they have 4 common neighbors


 {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
o link(T1, T3) = 3, since they have 3 common neighbors
 {a, b, d}, {a, b, e}, {a, b, g}
o Thus link is a better measure than Jaccard coefficient
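A small Python check of the neighbor and link counts in this example. The Jaccard-based similarity above is used with an assumed threshold θ = 0.5 (so two transactions are neighbors when they share at least two of their three items), and the two endpoint transactions themselves are not counted as common neighbors.

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
ALL = C1 + C2

def sim(t1, t2):
    # Jaccard-based similarity: |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

def link(t1, t2, theta=0.5):
    # number of common neighbors of t1 and t2, where a neighbor is any other
    # transaction whose similarity to the given transaction is at least theta
    return sum(1 for t in ALL
               if t not in (t1, t2) and sim(t, t1) >= theta and sim(t, t2) >= theta)

T1, T2, T3 = {'a','b','c'}, {'c','d','e'}, {'a','b','f'}
print(sim(T1, T2))       # 0.2  (one common item out of five)
print(link(T1, T2))      # 4 -> {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}
print(link(T1, T3))      # 3 -> {a,b,d}, {a,b,e}, {a,b,g}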

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

 Measures the similarity based on a dynamic model


o Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the clusters and closeness of items within the clusters
o CURE ignores information about the interconnectivity of the objects, while ROCK ignores information about the closeness of two clusters
 A two-phase algorithm
o Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
o Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining
these sub-clusters.

Density-Based Clustering Methods

Density-based clustering methods developed to discover clusters with arbitrary shape.

 Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition
 Typical methods: (1) DBSCAN, (2) OPTICS, (3) DENCLUE

1). DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density based clustering
algorithm.
 The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary
shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
 Density-reachability and density connectivity.
 Consider the Figure for a given ε, represented by the radius of the circles, and MinPts = 3.
 The labeled points m, p, o, and r are core objects because each is in an ε-neighbourhood containing at least three points.
 q is directly density-reachable from m. m is directly density-reachable from p and vice versa.
 q is (indirectly) density-reachable from p because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a
core object. Similarly, r and s are density-reachable from o, and o is density-reachable from r.
 o, r, and s are all density-connected.

 DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created.
 DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may
involve the merge of a few density-reachable clusters. The process terminates when no new point can be added
to any cluster.

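For experimentation, an off-the-shelf DBSCAN implementation such as the one in scikit-learn can be used; a minimal usage sketch with assumed data and assumed ε / MinPts values:

import numpy as np
from sklearn.cluster import DBSCAN

# assumed 2-D data: two dense regions plus one isolated point
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [8, 8], [8.2, 8.1], [7.9, 8.3],
              [25, 25]])

# eps plays the role of ε and min_samples the role of MinPts
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
print(labels)     # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise (outliers)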
2) OPTICS : Ordering Points to Identify the Clustering Structure

OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. The cluster ordering can be used to extract basic clustering information, such as cluster centers or arbitrary-shaped clusters, as well as provide the intrinsic clustering structure.
Fig : OPTICS terminology.

Core-distance and reachability-distance.

 Figure illustrates the concepts of core distance and reachability distance.


 Suppose that ε = 6 mm and MinPts = 5.
 The core-distance of p is the distance, ε′, between p and the fourth closest data object.
 The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ε′ = 3 mm), because this is greater than the Euclidean distance from p to q1.

 The reachability distance of q2 with respect to p is the Euclidean distance from p to


q2 because this is greater than the core-distance of p.

Fig :Cluster ordering in OPTICS

For example, in above Figure is the reachability plot for a simple two-dimensional data set, which presents a general
overview of how the data are structured and clustered. The data objects are plotted in cluster order (horizontal axis)
together with their respective reachability-distance (vertical axis). The three Gaussian “bumps” in the plot reflect
three clusters in the data set.
3). DENCLUE (DENsity-based CLUstEring) Clustering Based on
Density Distribution Functions

DENCLUE is a clustering method based on a set of density distribution functions. The method is built on the
following ideas:

(1) the influence of each data point can be formally modeled using a mathematical function called an influence
function, which describes the impact of a data point within its neighborhood;

(2) the overall density of the data space can be modeled analytically as the sum of the influence function applied to
all data points.

(3) clusters can then be determined mathematically by identifying density attractors, where density attractors are
local maxima of the overall density function.

Fig. Possible density functions for a 2-D data set.

Advantages

 Solid mathematical foundation


 Good for data sets with large amounts of noise
 It allows a compact mathematical description of arbitrarily shaped clusters in high dimensional data sets.
 Significantly faster than existing algorithms such as DBSCAN.

Grid-Based Methods “STING”

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a
finite number of cells that form a grid structure on which all of the operations for clustering are performed.

STING: STatistical INformation Grid

 STING is a grid-based multiresolution clustering technique in which the spatial area is divided into
rectangular cells. These cells form a hierarchical structure. Each cell at a high level is partitioned to form a
number of cells at the next lower level.
 Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level
cells.
 These parameters includes
o Attribute independent parameter, count;
o Attribute dependent parameters, mean, stdev (standard deviation), min , max.
o Attribute type of distribution such as normal, uniform, exponential, or none.
 When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-
level cells are calculated directly from the data.
 The value of distribution may either be assigned by the user if the distribution type is known beforehand or
obtained by hypothesis tests such as the χ² test.
 The type of distribution of a higher-level cell can be computed based on the majority of distribution types
of its corresponding lower-level cells in conjunction with a threshold filtering process.
 If the distributions of the lower level cells disagree with each other and fail the threshold test, the
distribution type of the high-level cell is set to none.

WaveCluster: Clustering Using Wavelet Transformation


 WaveCluster is a multiresolution clustering algorithm that summarizes the data by imposing
a multidimensional grid structure onto the data space.
 It then uses a wavelet transformation to transform the original feature space, finding dense regions in the
transformed space.
 A wavelet transform is a signal processing technique that decomposes a signal into different frequency
subbands.
 The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times.
 In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable.
 Clusters can then be identified by searching for dense regions in the new domain.

Advantages:

 It provides unsupervised clustering.


 The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.
 Wavelet-based clustering is very fast and can be made parallel.

Model-Based Clustering Methods

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model.

Such methods are often based on the assumption that the data are generated by a mixture of underlying probability
distributions.

 Typical methods
o Statistical approach
 EM (Expectation maximization), AutoClass
o Machine learning approach
 COBWEB, CLASSIT
o Neural network approach
 SOM (Self-Organizing Feature Map)

(i). Statistical approach: EM (Expectation Maximization)

 EM — A popular iterative refinement algorithm


 An extension to k-means
o Assign each object to a cluster according to a weight (prob. distribution)
o New means are computed based on weighted measures
 General idea
o Starts with an initial estimate of the parameter vector
o Iteratively rescores the patterns against the mixture density produced by the parameter vector
o The rescored patterns are used to update the parameter estimates
o Patterns are assigned to the same cluster if their scores place them in the same mixture component
 Algorithm converges fast but may not be in global optima

The EM (Expectation Maximization) Algorithm

 Initially, randomly assign k cluster centers


 Iteratively refine the clusters based on two steps
o Expectation step: assign each data point Xi to cluster Ck with the probability

      P(Xi ∈ Ck) = P(Ck | Xi) = P(Ck) P(Xi | Ck) / P(Xi)

o Maximization step: use the probability estimates from the expectation step to re-estimate the model parameters, for example the cluster means

      mk = Σi Xi · P(Xi ∈ Ck) / Σi P(Xi ∈ Ck)

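The EM iteration for a mixture of Gaussians is available, for example, in scikit-learn's GaussianMixture; a brief usage sketch with assumed one-dimensional data:

import numpy as np
from sklearn.mixture import GaussianMixture

# assumed 1-D data drawn from two overlapping groups
X = np.array([1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM runs internally
print(gm.means_.ravel())        # estimated cluster means
print(gm.predict_proba(X[:1]))  # expectation-step output: P(Xi ∈ Ck) for each component
print(gm.predict(X))            # hard assignment to the most probable component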
(ii). Machine learning approach (COBWEB)

 Conceptual clustering
o A form of clustering in machine learning
o Produces a classification scheme for a set of unlabeled objects
o Finds characteristic description for each concept (class)

 COBWEB (Fisher’87)
o A popular and simple method of incremental conceptual learning
o Creates a hierarchical clustering in the form of a classification tree
o Each node refers to a concept and contains a probabilistic description of that concept

 Fig. A classification Tree for a set of animal data.

 Working method:

o For a given new object, COBWEB decides where to incorporate it into the classification tree. To do this, COBWEB descends the tree along a suitable path, updating counts along the way, in search of the "best host" or node at which to classify the object.

o If the object does not really belong to any of the concepts represented in the tree, it may be better to create a new node for it. The object is then placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value.

 Limitations of COBWEB
o The assumption that the attributes are independent of each other is often too strong because correlation may
exist
o Not suitable for clustering large database data – skewed tree and expensive probability distributions

 . CLASSIT
o an extension of COBWEB for incremental clustering of continuous data
o suffers similar problems as COBWEB

(iii). Neural network approach - SOM (Self-Organizing Feature Map)

 Neural network approaches


o Represent each cluster as an exemplar, acting as a “prototype” of the cluster
o New objects are distributed to the cluster whose exemplar is the most similar according to some
distance measure
 Typical methods
o SOM (Self-Organizing Feature Map)
o Competitive learning
 Involves a hierarchical architecture of several units (neurons)
 Neurons compete in a “winner-takes-all” fashion for the object
currently being presented

SOM (Self-Organizing Feature Map)


 SOMs, also called topologically ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
 It maps all the points in a high-dimensional source space into a 2 to 3-d target space, such that the distance and
proximity relationship (i.e., topology) are preserved as much as possible.
 Similar to k-means: cluster centers tend to lie in a low-dimensional manifold in the feature space
 Clustering is performed by having several units competing for the current object
o The unit whose weight vector is closest to the current object wins
o The winner and its neighbors learn by having their weights adjusted
 SOMs are believed to resemble processing that can occur in the brain.
 Useful for visualizing high-dimensional data in 2- or 3-D space.

Clustering High-Dimensional Data

 Clustering high-dimensional data


o Many applications: text documents, DNA micro-array data
o Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
o Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly correlated/redundant
o Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
o Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering

(i).CLIQUE: A Dimension-Growth Subspace Clustering Method

 CLIQUE (CLustering InQUEst)


 Automatically identifying subspaces of a high dimensional data space that allow better clustering than original
space
 CLIQUE can be considered as both density-based and grid-based
o It partitions each dimension into the same number of equal length interval
o It partitions an m-dimensional data space into non-overlapping rectangular units
o A unit is dense if the fraction of total data points contained in the unit exceeds the input model
parameter
o A cluster is a maximal set of connected dense units within a subspace CLIQUE: The Major
Steps

 Partition the data space and find the number of points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
o Determine dense units in all subspaces of interests
o Determine connected dense units in all subspaces of interests.
 Generate minimal description for the clusters
o Determine maximal regions that cover a cluster of connected dense units for each cluster
o Determination of minimal cover for each cluster.

Fig .Dense units found with respect to age for the dimensions salary and vacation are
intersected in order to provide a candidate search space for dense units of higher dimensionality.

 Strength
o automatically finds subspaces of the highest dimensionality such that high density
clusters exist in those subspaces
o insensitive to the order of records in input and does not presume some canonical
data distribution
o scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
 Weakness
o The accuracy of the clustering result may be degraded at the expense of simplicity
of the method

(ii). PROCLUS: A Dimension-Reduction Subspace Clustering Method

 PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering


method.
 It starts by finding an initial calculation of the clusters in the high-dimensional attribute
space.

 Each dimension is then assigned a weight for each cluster, and the updated weights are
used in the next iteration to regenerate the clusters.
 This leads to the search of solid regions in all subspaces of some desired dimensionality
and avoids the generation of a large number of overlapped clusters in projected dimensions
of
lower dimensionality.

 The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster
refinement.

(iii). Frequent Pattern–Based Clustering Methods


o Frequent pattern mining can be applied to clustering, resulting in frequent pattern–
based cluster analysis.
o Frequent pattern mining - searches for patterns (such as sets of items or objects) that
occur frequently in large data sets.

o Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects.

Two forms of frequent pattern–based cluster analysis:


o Frequent term–based text clustering.
o Clustering by pattern similarity in microarray data analysis.

(a).Frequent term–based text clustering.

 Text documents are clustered based on the frequent terms they contain. A term can be
made up of a single word or several words. Terms are then extracted.
 A stemming algorithm is then applied to reduce each term to its basic stem. In this
way, each document can be represented as a set of terms. Each set is typically large.
Collectively, a large set of documents will contain a very large set of different terms.
 Advantage: It automatically generates a description for the generated clusters in terms
of their frequent term sets.

(b). Clustering by pattern similarity in DNA microarray data analysis (pClustering)

o Figure.1 shows a fragment of microarray data containing only three genes (taken as
“objects” ) and ten attributes (columns a to j ).
o However, if two subsets of attributes, {b, c, h, j, e} and { f , d, a, g, i}, are selected and
plotted as in Figure. 2 (a) and (b) respectively,
o Figure. 2(a) forms a shift pattern, where the three curves are similar to each other with
respect to a shift operation along the y-axis.
o Figure.2(b) forms a scaling pattern, where the three curves are similar to each other with
respect to a scaling operation along the y-axis.

Fig: Raw data from a fragment of microarray data containing only 3 objects and 10 attributes

Fig. Objects in Figure 1 form (a) a shift pattern in subspace {b, c, h, j, e}, and (b) a scaling pattern in subspace {f, d, a, g, i}.

Constraint-Based Cluster Analysis

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.


Depending on the nature of the constraints, constraint-based clustering may adopt different
approaches.

Different constraints in cluster analysis


i. Constraints on individual objects. (E.g Cluster on houses worth over $300K)
ii. Constraints on the selection of clustering parameters.
iii. Constraints on distance or similarity functions (e.g.,Weighted functions, obstacles
(e.g., rivers, lakes)
iv. User-specified constraints on the properties of individual clusters. (no.,of clusters,
MinPts)
v. Semi-supervised clustering based on “partial” supervision.(e.g., Contain at least 500
valued customers and 5000 ordinary ones)

I. Constraints on distance or similarity functions: Clustering with obstacle objects

(an obstacle is a physical object, such as a river or a highway, that blocks the direct path between two points)

Clustering with obstacle objects using a partitioning approach requires that the distance
between each object and its corresponding cluster center be re-evaluated at each iteration

whenever the cluster center is changed.

e.g A city may have rivers, bridges, highways, lakes, and mountains. We do not want to swim
across a river to reach an ATM.

Approach for the problem of clustering with obstacles.

Fig(a) :First, a point, p, is visible from another point, q, in Region R, if the straight line joining
p and q does not intersect any obstacles.

The shortest path between two points, p and q, will be a subpath of VG’ as shown in Figure (a).
We see that it begins with an edge from p to either v1, v2, or v3, goes through some path in VG,
and then ends with an edge from either v4 or v5 to q.

Fig.(b).To reduce the cost of distance computation between any two pairs of objects,
microclusters techniques can be used. This can be done by first triangulating the region R into
triangles, and then grouping nearby points in the same triangle into microclusters, as shown in
Figure (b).

After that, precomputation can be performed to build two kinds of join indices based on the
shortest paths:
o VV index: indices for any pair of obstacle vertices
o MV index: indices for any pair of micro-cluster and obstacle indices

II. User-specified constraints on the properties of individual clusters

 e.g., A parcel delivery company with n customers would like to determine locations for k
service stations so as to minimize the traveling distance between customers and service
stations.
 The company’s customers are considered as either high-value customers (requiring
frequent, regular services) or ordinary customers (requiring occasional services).
 The manager has specified two constraints: each station should serve (1) at least 100 high-
value customers and (2) at least 5,000 ordinary customers.

 Proposed approach to solve above


o Find an initial “solution” by partitioning the data set into k groups and satisfying
user-constraints
o Iteratively refine the solution by micro-clustering relocation (e.g., moving δ micro-clusters
from cluster Ci to Cj) and “deadlock” handling (break the microclusters when
necessary)
o Efficiency is improved by micro-clustering

III. Semi-supervised clustering


A clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes:


(1).constraint-based semi-supervised clustering
(2).distance-based semi-supervised clustering.

Constraint-based semi-supervised clustering relies on user-provided labels or constraints to


guide the algorithm toward a more suitable data partitioning. This includes modifying the
objective function based on constraints, or initializing and constraining the clustering process
based on the labeled objects.

Distance-based semi-supervised clustering employs a distance measure that is trained to satisfy


the constraints in the supervised data. A method CLTree (CLustering based on decision TREEs),
integrates unsupervised clustering with the idea of supervised classification.

Outlier Analysis

 Data objects that are grossly different from or inconsistent with the remaining set of data are called outliers. Outliers can be caused by measurement or execution errors, e.g., the display of a person's age as 999.

 Outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
 Applications:
o Fraud Detection (Credit card, telecommunications, criminal activity in e-
Commerce)
o Customized Marketing (high/low income buying habits)
o Medical Treatments (unusual responses to various drugs)
o Analysis of performance statistics (professional athletes)
o Weather Prediction
o Financial Applications (loan approval, stock tracking)

1) Statistical Distribution-Based Outlier Detection


The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test.

A statistical discordancy test examines two hypotheses:
 a working hypothesis
 an alternative hypothesis.

 Working hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

    H : oi ∈ F, where i = 1, 2, ..., n
A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to
the distribution F.
 Alternative hypothesis.
An alternative hypothesis, H, which states that oi comes from another distribution model, G,
is adopted
 There are different kinds of alternative distributions.
o Inherent alternative distribution
o Mixture alternative distribution
o Slippage alternative distribution

There are two basic types of procedures for detecting outliers:


o Block procedures
o Consecutive procedures

Drawbacks:
o most tests are for single attributes
o in many cases, the data distribution may not be known.

2) Distance-Based Outlier Detection

An object, O, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin,
that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance
greater than dmin from O.

Algorithms for mining distance-based outliers are

 Index-based algorithm, Nested-loop algorithm, Cell-based algorithm

 Index-based algorithm

Given a data set, the index-based algorithm uses multidimensional indexing structures, such as
R-trees or k-d trees, to search for neighbours of each object o within radius dmin around that
object.

o Nested-loop algorithm

This algorithm avoids index structure construction and tries to minimize the number of I/Os. It
divides the memory buffer space into two halves and the data set into several logical blocks. I/O
efficiency can be achieved by choosing the order in which blocks are loaded into each half.

o Cell-based algorithm: A cell-based algorithm was developed for memory-resident data sets.
Its complexity is O(ck +n), where c is a constant depending on the number of cells and k is
the dimensionality.
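A simple nested-loop style Python sketch of the DB(pct, dmin) definition above: an object is reported as an outlier if at least a fraction pct of the remaining objects lie farther than dmin from it. The data set and parameter values are assumed for illustration.

from math import dist            # Euclidean distance (Python 3.8+)

def db_outliers(points, pct, dmin):
    outliers = []
    for i, o in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        far = sum(1 for p in others if dist(o, p) > dmin)
        if far / len(others) >= pct:      # at least a fraction pct lies beyond dmin
            outliers.append(o)
    return outliers

data = [(1, 1), (1.5, 1.2), (0.8, 1.1), (1.2, 0.9), (9, 9)]   # assumed objects
print(db_outliers(data, pct=0.95, dmin=3.0))                  # [(9, 9)]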

3) Density-Based Local Outlier Detection

o An object is a local outlier if it is outlying relative to its local neighbourhood, particularly


with respect to the density of the neighbourhood.
o For example, with two clusters C1 and C2 of different densities, o2 is a local outlier relative to the density of C2. Object o1 is an outlier as well, and no objects in C1 are mislabelled as outliers. This forms the basis of density-based local outlier detection.

4) Deviation-Based Outlier Detection

It identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, the term deviations is sometimes used to refer to outliers.

Techniques
o Sequential Exception Technique
o OLAP Data Cube Technique

Sequential Exception Technique


o It simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects. It uses the implicit redundancy of the data.
o Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, ..., Dm}, of these objects with 2 <= m <= n, such that

    Dj-1 ⊂ Dj, where Dj ⊆ D
The technique introduces the following key terms.

 Exception set: This is the set of deviations or outliers.

 Dissimilarity function: It is any function that, if given a set of objects, returns a low value if
the objects are similar to one another. The greater the dissimilarity among the objects, the
higher the value returned by the function.

 Cardinality function: This is typically the count of the number of objects in a given set.

 Smoothing factor: This function is computed for each subset in the sequence. It assesses
how much the dissimilarity can be reduced by removing the subset from the original set of
objects.

OLAP Data Cube Technique

o An OLAP approach to deviation detection uses data cubes to identify regions of differences in large multidimensional data.
o A cell value in the cube is considered an exception if it is different from the expected
value, based on a statistical model.
o The method uses visual cues such as background colour to reflect the degree of
exception of each cell.
o The user can choose to drill down on cells that are flagged as exceptions.
o The measure value of a cell may reflect exceptions occurring at more detailed or
lower levels of the cube, where these exceptions are not visible from the current
level.

Data Mining Applications

 Data mining is an interdisciplinary field with wide and various applications


o There exist nontrivial gaps between data mining principles and domain- specific
applications
 Some application domains
o Financial data analysis
o Retail industry
o Telecommunication industry
o Biological data analysis
I. Data Mining for Financial Data Analysis

 Financial data collected in banks and financial institutions are often relatively complete,
reliable, and of high quality
 Design and construction of data warehouses for multidimensional data analysis and data
mining
o View the debt and revenue changes by month, by region, by sector, and by other
factors
o Access statistical information such as max, min, total, average, trend, etc.
 Loan payment prediction/consumer credit policy analysis
o feature selection and attribute relevance ranking
o Loan payment performance
o Consumer credit rating
 Classification and clustering of customers for targeted marketing
o multidimensional segmentation by nearest-neighbor, classification, decision trees,
etc. to identify customer groups or associate a new customer to an appropriate
customer group
 Detection of money laundering and other financial crimes
o integration of data from multiple DBs (e.g., bank transactions, federal/state crime
history DBs)
o Tools: data visualization, linkage analysis, classification, clustering tools, outlier
analysis, and sequential pattern analysis tools (find unusual access sequences)

II. Data Mining for Retail Industry

 Retail industry: huge amounts of data on sales, customer shopping history, etc.
 Applications of retail data mining
o Identify customer buying behaviors
o Discover customer shopping patterns and trends
o Improve the quality of customer service
o Achieve better customer retention and satisfaction
o Enhance goods consumption ratios
o Design more effective goods transportation and distribution policies
Examples

 Ex. 1. Design and construction of data warehouses based on the benefits of data
mining
 Ex. 2.Multidimensional analysis of sales, customers, products, time, and region
 Ex. 3. Analysis of the effectiveness of sales campaigns
 Ex. 4. Customer retention: Analysis of customer loyalty
o Use customer loyalty card information to register sequences of purchases of
particular customers
o Use sequential pattern mining to investigate changes in customer
consumption or loyalty
o Suggest adjustments on the pricing and variety of goods
 Ex. 5. Purchase recommendation and cross-reference of items
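The purchase-recommendation example (Ex. 5) can be illustrated with a small market-basket count in plain Python; the transactions below are hypothetical, and a real system would use a proper frequent-pattern algorithm such as Apriori or FP-growth.

# Illustrative market-basket sketch: count how often pairs of items appear in
# the same (hypothetical) transaction and report the strongest co-occurrences.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
    {"bread", "milk", "diapers"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2          # pairs appearing in at least 2 of the 5 baskets
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
for (a, b), count in sorted(frequent.items(), key=lambda x: -x[1]):
    print(f"{a} & {b}: bought together in {count} transactions")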

III. Data Mining for Telecommunication Industry

 The telecommunication industry is rapidly expanding and highly competitive, creating a great demand for data mining to
o Understand the business involved
o Identify telecommunication patterns
o Catch fraudulent activities
o Make better use of resources
o Improve the quality of service

The following are a few scenarios in which data mining may improve telecommunication services:

 Multidimensional analysis of telecommunication data


o Intrinsically multidimensional: calling-time, duration, location of caller, location
of callee, type of call, etc.
 Fraudulent pattern analysis and the identification of unusual patterns
o Identify potentially fraudulent users and their atypical usage patterns
o Detect attempts to gain fraudulent entry to customer accounts
o Discover unusual patterns which may need special attention
 Multidimensional association and sequential pattern analysis
o Find usage patterns for a set of communication services by customer group, by
month, etc.
o Promote the sales of specific services
o Improve the availability of particular services in a region
 Mobile telecommunication services
 Use of visualization tools in telecommunication data analysis
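As a toy illustration of fraudulent/unusual-pattern analysis (not a production fraud detector), the sketch below flags subscribers whose daily call minutes deviate strongly from the population mean; the subscriber IDs and usage figures are hypothetical.

# Illustrative flagging of atypical usage via a simple z-score test.
import statistics

daily_minutes = {
    "subscriber_01": 42,
    "subscriber_02": 55,
    "subscriber_03": 38,
    "subscriber_04": 61,
    "subscriber_05": 480,   # unusually heavy usage
    "subscriber_06": 47,
}

values = list(daily_minutes.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)

for subscriber, minutes in daily_minutes.items():
    z = (minutes - mean) / stdev
    if abs(z) > 2:
        print(f"{subscriber}: {minutes} min/day (z = {z:.1f}) -- review for possible fraud")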

IV. Biomedical Data Analysis

 DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).
 Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
 Humans have around 30,000 genes
 Tremendous number of ways that the nucleotides can be ordered and sequenced to form
distinct genes

Data mining may contribute to biological data analysis in the following aspects

 Semantic integration of heterogeneous, distributed genome databases

Data Mining - Mining World Wide Web
The World Wide Web contains huge amounts of information that provides a rich source for data mining.

Challenges in Web Mining


The web poses great challenges for resource and knowledge discovery based on the following observations −

 The web is too huge − The size of the web is enormous and increasing rapidly, so the web appears too large for conventional data warehousing and data mining.

 Complexity of Web pages − Web pages have no unifying structure and are far more complex than traditional text documents. The web's digital libraries hold a huge number of documents, and these documents are not arranged in any particular sorted order.

 The web is a dynamic information source − Information on the web is updated rapidly; data such as news, stock market figures, weather, sports and shopping content change continually.

 Diversity of user communities − The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. More than 100 million workstations are connected to the Internet, and the number is still growing rapidly.

 Relevancy of information − A particular user is generally interested in only a small portion of the web; the rest contains information that is not relevant to the user and may swamp the desired results.

Mining Web page layout structure


The basic structure of a web page is based on the Document Object Model (DOM). The DOM is a tree-like structure in which each HTML tag in the page corresponds to a node in the DOM tree. A web page can be segmented using predefined HTML tags. However, because HTML syntax is flexible, many web pages do not follow the W3C specifications, and this may cause errors in the DOM tree structure.

The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of a web page, so it cannot correctly identify the semantic relationships between the different parts of a page.

Vision-based page segmentation (VIPS)


 The purpose of VIPS is to extract the semantic structure of a web page based on its visual presentation.

 Such a semantic structure corresponds to a tree structure. In this tree each node corresponds to a block.

 A value called the Degree of Coherence is assigned to each node to indicate how coherent the content of the block is, based on visual perception.

 The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that it finds the
separators between these blocks.

 The separators are the horizontal or vertical lines in a web page that visually cross no blocks.

 The semantics of the web page is constructed on the basis of these blocks.

(Figure omitted: the procedure of the VIPS algorithm.)
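The tag-based segmentation mentioned earlier (splitting a page on predefined HTML tags, which is much simpler than VIPS) can be sketched with Python's standard html.parser module; the sample page and the chosen set of block-level tags are illustrative assumptions.

# Illustrative tag-based page segmentation (not the VIPS algorithm): text is
# grouped into a new block whenever a block-level tag such as <div> or <p> opens.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3"}

class BlockSegmenter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self.current.append(data.strip())

    def _flush(self):
        if self.current:
            self.blocks.append(" ".join(self.current))
            self.current = []

    def close(self):
        super().close()
        self._flush()

page = "<div><h2>News</h2><p>First story text.</p><p>Second story text.</p></div>"
segmenter = BlockSegmenter()
segmenter.feed(page)
segmenter.close()
print(segmenter.blocks)   # ['News', 'First story text.', 'Second story text.']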

Data Mining - Mining Text Data

Text databases consist of huge collections of documents, gathered from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Because the amount of available information keeps increasing, text databases are growing rapidly. In many text databases the data is semi-structured.

For example, a document may contain a few structured fields, such as title, author, publishing_date, etc., but along with the structured data it also contains unstructured text components, such as the abstract and contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users require tools to compare documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.

Information Retrieval

Information retrieval deals with the retrieval of information from a large number of text-based documents. Because information retrieval systems and database systems handle different kinds of data, several typical database problems (such as transaction management and concurrency control) are usually not present in information retrieval systems. Examples of information retrieval systems include −

 Online library catalogue systems
 Online document management systems
 Web search systems, etc.
Note − The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. Such a query typically consists of some keywords describing an information need.

In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad hoc, short-term information need. If the user has a long-term information need instead, the retrieval system can take the initiative and push any newly arrived, relevant information items to the user.

This kind of access to information is called Information Filtering. And the corresponding systems are known as
Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval


We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user's input. Let the set of documents relevant to a query be denoted {Relevant} and the set of retrieved documents {Retrieved}. The set of documents that are both relevant and retrieved is {Relevant} ∩ {Retrieved}; it can be pictured as the overlap of the two sets in a Venn diagram.

There are three fundamental measures for assessing the quality of text retrieval −

 Precision
 Recall
 F-score

Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-score
The F-score is a commonly used trade-off measure: an information retrieval system often needs to trade off precision for recall or vice versa. The F-score is defined as the harmonic mean of recall and precision:

F-score = (2 × precision × recall) / (precision + recall)
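A small worked example of the three measures in Python; the document-ID sets are hypothetical.

# Worked example of precision, recall and F-score on toy document-ID sets.
relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = {"d3", "d4", "d5", "d6", "d7", "d8"}

hits = relevant & retrieved                     # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)          # 3 / 6 = 0.50
recall = len(hits) / len(relevant)              # 3 / 5 = 0.60
f_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
# precision=0.50 recall=0.60 F=0.55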

Text mining applications: 10 examples


Text mining is a relatively new area of computer science, and its use has grown as the unstructured data
available continues to increase exponentially in both relevance and quantity.

Text mining can be used to make large quantities of unstructured data accessible and useful, thereby generating value and delivering ROI from unstructured data management, as seen in applications of text mining for risk management and cybercrime prevention.

Through techniques such as categorization, entity extraction, sentiment analysis and others, text mining extracts the useful information and knowledge hidden in text content. In the business world, this translates into the ability to reveal insights, patterns and trends in even very large volumes of unstructured data. In fact, it is this ability to push aside non-relevant material and provide answers that is driving its rapid adoption, especially in large organizations.

These 10 text mining examples can give you an idea of how this technology is helping organizations today.

1 – Risk management

No matter the industry, insufficient risk analysis is often a leading cause of failure. This is especially true in the financial industry, where the adoption of risk management software based on text mining technology can dramatically increase the ability to mitigate risk, enabling management of thousands of sources and petabytes of text documents and providing the ability to link information together and access the right information at the right time.

2 – Knowledge management

Not being able to find important information quickly is always a challenge when managing large volumes of text
documents—just ask anyone in the healthcare industry. Here, organizations are challenged with a tremendous
amount of information—decades of research in genomics and molecular techniques, for example, as well as
volumes of clinical patient data—that could potentially be useful for their largest profit center: new product
development. Here, knowledge management software based on text mining offers a clear and reliable
solution for the “info-glut” problem.

3 – Cybercrime prevention

The anonymous nature of the internet and the many communication features operated through it contribute to the
increased risk of  internet-based crimes. Today, text mining intelligence and anti-crime applications are making
internet crime prevention easier for any enterprise and law enforcement or intelligence agencies.

4 – Customer care service

Text mining and natural language processing are frequent applications in customer care. Today, text analytics software is widely adopted to improve the customer experience, using different sources of valuable information such as surveys, trouble tickets, and customer call notes to raise the quality, effectiveness and speed of problem resolution. Text analysis is also used to provide a rapid, automated response to the customer, dramatically reducing reliance on call-center operators to solve problems.

5 – Fraud detection through claims investigation

Text analytics is a tremendously effective technology in any domain where the majority of information is
collected as text. Insurance companies are taking advantage of text mining technologies by combining the results
of text analysis with structured data to prevent fraud and swiftly process claims.

6 – Contextual Advertising

Digital advertising is a relatively new and growing field of application for text analytics. Here, companies such as Admantx have made text mining the core engine for contextual retargeting, with great success. Compared to the traditional cookie-based approach, contextual advertising provides better accuracy and completely preserves the user's privacy.

7 – Business intelligence

Business intelligence is used by large companies to support decision making. Here, text mining makes a real difference, enabling analysts to jump quickly to the answer even when analyzing petabytes of internal and open-source data. Applications such as the Cogito Intelligence Platform are able to monitor thousands of sources and analyze large data volumes, extracting only the relevant content.

8 – Content enrichment

While it’s true that working with text content still requires a bit of human effort, text analytics techniques make a
significant difference when it comes to being able to more effectively manage large volumes of information. Text
mining techniques enrich content, providing a scalable layer to tag, organize and summarize the available content, making it suitable for a variety of purposes.

9 – Spam filtering

E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a dark side: spam.
Today, spam is a major issue for internet service providers, increasing their costs for service management and hardware/software updates; for users, spam is an entry point for viruses and hurts productivity. Text mining techniques can be implemented to improve the effectiveness of statistical filtering methods.

10 – Social media data analysis

Today, social media is one of the most prolific sources of unstructured data; organizations have taken notice.
Social media is increasingly being recognized as a valuable source of market and customer intelligence, and companies are using it to analyze or predict customer needs and to understand how their brand is perceived. Text analytics can address both needs by analyzing large volumes of unstructured data, extracting opinions, emotions and sentiment and relating them to brands and products.
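In the spirit of the spam-filtering and social-media examples above, the following is a minimal bag-of-words Naive Bayes sketch in plain Python; the training messages are tiny and hypothetical, so it only shows the shape of a statistical text filter, not its real-world accuracy.

# Minimal Naive Bayes text classifier with add-one (Laplace) smoothing.
import math
from collections import Counter

train = [
    ("win a free prize now", "spam"),
    ("free offer claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status and next steps", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    scores = {}
    for label in class_counts:
        # log prior + log likelihood of each word under the class
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free prize"))   # spam
print(classify("status of the project"))   # ham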

Spatial Data Mining

A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data.

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic
information. The term geostatistics is often associated with continuous geographic space, whereas the term spatial
statistics is often associated with discrete space.

Spatial Data Cube Construction and Spatial OLAP

A spatial data warehouse can be constructed by integrating spatial data, and it facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.

There are three types of dimensions in a spatial data cube:

A nonspatial dimension contains only nonspatial data. For example, nonspatial dimensions for temperature and precipitation can be constructed for a weather warehouse, since each contains nonspatial data whose generalizations are nonspatial (such as “hot” for temperature and “wet” for precipitation).

A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose generalization,
starting at a certain high level, becomes nonspatial. For example, the spatial dimension city relays geographic data
for the U.S. map. Suppose that the dimension’s spatial representation of, say, Seattle is generalized to the string
“pacific northwest.” Although “pacific northwest” is a spatial concept, its representation is not spatial (since, in our example, it is a string). It therefore plays the role of a nonspatial dimension.

A spatial-to-spatial dimension is a dimension whose primitive-level data and all of its high-level generalized data are spatial. For example, the dimension equi-temperature region contains spatial data, as do all of its generalizations, such as regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.

Two types of measures in a spatial data cube:

A numerical measure contains only numerical data. For example, one measure in a spatial data warehouse could be
the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, and so on.
Numerical measures can be further classified into distributive, algebraic, and holistic.
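As a small illustration of rolling up a numerical measure, the following Python sketch aggregates hypothetical monthly revenue figures from the month level up to the year level; the sum is used because it is a distributive measure.

# Illustrative roll-up of a numerical measure along the time hierarchy (month -> year).
from collections import defaultdict

# (region, year, month) -> revenue, hypothetical figures
monthly_revenue = {
    ("North", 2023, 1): 120, ("North", 2023, 2): 135, ("North", 2024, 1): 150,
    ("South", 2023, 1):  90, ("South", 2023, 2):  95, ("South", 2024, 1): 110,
}

yearly_revenue = defaultdict(float)
for (region, year, _month), revenue in monthly_revenue.items():
    yearly_revenue[(region, year)] += revenue      # roll-up: sum is distributive

for (region, year), total in sorted(yearly_revenue.items()):
    print(f"{region} {year}: {total}")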

A spatial measure contains a collection of pointers to spatial objects. For example, in a generalization (or roll-up) in
the spatial data cube of Example 10.5, the regions with the same range of temperature and precipitation will be
grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.

A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains
spatial dimensions but no spatial measures, its OLAP operations, such as drilling or pivoting, can be implemented in
a manner similar to that for nonspatial data cubes.

For example, two different roll-ups on the BC weather map data may produce two different generalized region maps, each the result of merging a large number of small (probe) regions of the base map. Hierarchies are defined for each of the dimensions in the BC weather warehouse (the corresponding figures are not reproduced in these notes).
Mining Spatial Association and Co-location Patterns

Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following is a spatial association rule:

is_a(X, “school”) ∧ close_to(X, “sports center”) ⇒ close_to(X, “park”) [0.5%, 80%]

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case.
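A minimal sketch of how the support and confidence of such a rule are computed; the spatial predicates are represented here as hypothetical boolean flags rather than being derived from real geometry.

# Illustrative support/confidence computation for the spatial rule above.
objects = [
    {"is_school": True,  "close_to_sports_center": True,  "close_to_park": True},
    {"is_school": True,  "close_to_sports_center": True,  "close_to_park": True},
    {"is_school": True,  "close_to_sports_center": True,  "close_to_park": False},
    {"is_school": True,  "close_to_sports_center": False, "close_to_park": True},
    {"is_school": False, "close_to_sports_center": True,  "close_to_park": True},
]

antecedent = [o for o in objects if o["is_school"] and o["close_to_sports_center"]]
both = [o for o in antecedent if o["close_to_park"]]

support = len(both) / len(objects)            # share of all objects satisfying the whole rule
confidence = len(both) / len(antecedent)      # share of antecedent objects also satisfying the consequent
print(f"support={support:.0%} confidence={confidence:.0%}")   # support=40% confidence=67%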

Spatial Clustering Methods

Spatial data clustering identifies clusters, or densely populated regions, according to some distance measurement in
a large, multidimensional data set.

Spatial Classification and Spatial Trend Analysis

Spatial classification analyzes spatial objects to derive classification schemes with respect to certain spatial properties, such as the neighborhood of a district, highway, or river.

IV. Biomedical Data Analysis (continued)

o Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data

o Data cleaning and data integration methods developed in data mining will help
 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences
o Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)
o Identify gene sequence patterns that play roles in various diseases
 Discovery of structural patterns and analysis of genetic networks and protein pathways:
 Association analysis: identification of co-occurring gene sequences
o Most diseases are not triggered by a single gene but by a combination of genes
acting together
o Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
 Path analysis: linking genes to different disease development stages
o Different genes may become active at different stages of the disease
o Develop pharmaceutical interventions that target the
different stages separately
 Visualization tools and genetic data analysis
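As a toy illustration of the association-analysis point above, the sketch below counts how often pairs of genes are active together across a set of hypothetical diseased samples; a real analysis would work on far larger data and use proper significance measures.

# Illustrative co-occurrence counting of genes across toy diseased samples.
from itertools import combinations
from collections import Counter

diseased_samples = [
    {"TP53", "BRCA1", "EGFR"},
    {"TP53", "BRCA1"},
    {"TP53", "KRAS"},
    {"TP53", "BRCA1", "KRAS"},
]

pair_counts = Counter()
for genes in diseased_samples:
    pair_counts.update(combinations(sorted(genes), 2))

for (g1, g2), count in pair_counts.most_common(3):
    print(f"{g1} and {g2} co-occur in {count} of {len(diseased_samples)} samples")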

V. Data Mining in Other Scientific Applications

 Vast amounts of data have been collected from scientific domains (including geosciences,
astronomy, and meteorology) using sophisticated telescopes, multispectral high-resolution
remote satellite sensors, and global positioning systems.

 Large data sets are being generated due to fast numerical simulations in various fields, such
as climate and ecosystem modeling, chemical engineering, fluid dynamics, and structural
mechanics.

 Some of the challenges brought about by emerging scientific applications of data mining include the following:
o Data warehouses and data preprocessing:
o Mining complex data types:
o Graph-based mining:
o Visualization tools and domain-specific knowledge:

VI. Data Mining for Intrusion Detection
 The security of our computer systems and data is at constant risk. The extensive growth of
the Internet and increasing availability of tools and tricks for interrupting and attacking
networks have prompted intrusion detection to become a critical component of network
administration.

 An intrusion can be defined as any set of actions that threaten the integrity, confidentiality,
or availability of a network resource.

The following are areas in which data mining technology is applied or further developed for intrusion detection:

o Development of data mining algorithms for intrusion detection


o Association and correlation analysis, and aggregation to help select and build
discriminating attributes
o Analysis of stream data
o Distributed data mining
o Visualization and querying tools
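A minimal sketch of the stream-analysis idea for intrusion detection: failed-login events per source are counted over a sliding time window, and sources exceeding a threshold are flagged. The event log, window size, and threshold are all hypothetical.

# Illustrative sliding-window monitoring of failed-login events.
from collections import deque, Counter

WINDOW = 60          # seconds
THRESHOLD = 5        # failed logins per window considered suspicious

events = [            # (timestamp_seconds, source_ip) for failed logins
    (1, "10.0.0.7"), (3, "10.0.0.9"), (5, "10.0.0.7"), (8, "10.0.0.7"),
    (12, "10.0.0.7"), (15, "10.0.0.7"), (18, "10.0.0.7"), (90, "10.0.0.9"),
]

window = deque()
counts = Counter()
for ts, ip in events:
    window.append((ts, ip))
    counts[ip] += 1
    while window and ts - window[0][0] > WINDOW:     # expire events outside the window
        _old_ts, old_ip = window.popleft()
        counts[old_ip] -= 1
    if counts[ip] >= THRESHOLD:
        print(f"ALERT: {ip} has {counts[ip]} failed logins in the last {WINDOW}s")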

