
BHARATHIDASAN UNIVERSITY

CENTRE

M.Sc. COMPUTER SCIENCE


CORE COURSE - IX  SEMESTER - II
DATA MINING & DATA WAREHOUSING

Copy Rights Reserved. For Private Circulation Only.


SYLLABUS

CORE COURSE IX
DATA MINING AND DATA WAREHOUSING
Objective: In this course students shall learn the mathematical and algorithmic details of various data association techniques to discover patterns in underlying data (namely mining data). They also learn how to consolidate huge volumes of data in one place efficiently.

Unit - I
Introduction to data mining - Association Rule Mining.

Unit - II
Classification - Cluster analysis.

Unit - III
Web Data Mining - Search engines.

Unit - IV
Data warehousing - Algorithms & operations to create data warehouse - Designing data warehouse - Applications of data warehouse.

Unit - V
Online analytical processing - Information Privacy.

Text Book:
1. G.K. Gupta, Introduction to Data Mining with Case Studies, Prentice Hall of India, 2006 (ISBN 81-203-3053-6) [Unit 1: Chapters 1, 2; Unit 2: Chapters 3, 4; Unit 3: Chapters 5, 6; Unit 4: Chapter 7; Unit 5: Chapters 8, 9].

REFERENCE BOOKS
1. K.P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining: Theory and Practice, Prentice Hall of India, 2006 (ISBN 81-203-2897-3).
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, 2007 (ISBN 81-312-0535-5).

CHAPTER 1

INTRODUCTION TO DATA MINING
Learning Objectives
1. Explain what data mining is and where it may be useful
2. List the steps that a data mining process often involves
3. Discuss why data mining has become important
4. Introduce briefly some of the data mining techniques
5. Develop a good understanding of the data mining software available on the market
6. Identify data mining web resources and a bibliography of data mining

1.1 WHAT IS DATA MINING?


Data mining, or knowledge discovery in databases, aims at extracting useful information from large collections of data. The techniques can find novel patterns that may assist an enterprise in understanding the business better and in forecasting.

Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they may be used in an enterprise's decision making process.

Data mining is a complex process and may require a variety of steps before some useful results are obtained. Often data pre-processing including data cleaning may be needed. In some cases, sampling of data and testing of various hypotheses may be required before data mining can start.

1.2 WHY DATA MINING NOW?

Data mining has found many applications in the last few years for a number of reasons.

1. Growth of OLTP data: The first database systems were implemented in the 1960s and 1970s. Many enterprises therefore have more than 30 years of experience in using database systems and they have accumulated large amounts of data during that time.

2. Growth of data due to cards: The growing use of credit cards and loyalty cards is an important area of data growth. In the USA, there has been a tremendous growth in the use of loyalty cards. Even in Australia, the use of cards like FlyBuys has grown considerably.

Table 1.1 shows the total number of VISA and Mastercard credit cards in the top ten card holding countries.

Table 1.1 Top ten card holding countries

Rank   Country        Cards (millions)   Population (millions)   Cards per capita
1      USA            755                293                     2.6
2      China          177                1294                    0.14
3      Brazil         148                184                     0.80
4      UK             126                60                      2.1
5      Japan          121                127                     0.95
6      Germany        109                83                      1.31
7      South Korea    95                 47                      2.02
8      Taiwan         60                 22                      2.72
9      Spain          56                 39                      1.44
10     Canada         51                 31                      1.65
       Total Top Ten  1700               2180                    0.78
       Total Global   2362               6443                    0.43

3. Growth in data due to the web: E-commerce developments have resulted in information about visitors to Web sites being captured, once again resulting in mountains of data for some companies.

4. Growth in data due to other sources: There are many other sources of data. Some of them are:
• Telephone transactions
• Frequent flyer transactions
• Medical transactions
• Immigration and customs transactions
• Banking transactions
• Motor vehicle transactions
• Utilities (e.g. electricity and gas) transactions
• Shopping transactions

5. Growth in data storage capacity: Another way of illustrating data growth is to consider annual disk storage sales over the last few years. As shown in Figure 1.1, the total annual disk sales in 2003 were 16,000 petabytes (or 16 exabytes) of storage compared to only around 4,000 petabytes in 2000.

Figure 1.1 Annual disk storage sales in petabytes (bar chart; x-axis: year, from 1996 onwards; y-axis: annual sales in petabytes)

6. Decline in the cost of processing: The cost of computing hardware has declined rapidly over the last 30 years, coupled with an increase in hardware performance. Not only do the prices of processors continue to decline, but the prices of computer peripherals have also been declining.

7. Competitive environment: Owing to increased globalization of trade, the business environment in most countries has become very competitive. For example, in many countries the telecommunications industry used to be a state monopoly but it has mostly been privatized now, leading to intense competition in this industry. Businesses have to work harder to find new customers and to retain old ones.

8. Availability of software: A number of companies have developed useful data mining software in the last few years. Companies that were already operating in the statistics software market and were familiar with statistical algorithms, some of which are now used in data mining, have developed some of this software.

1.3 THE DATA MINING PROCESS


The data mining process involves much hard work, including building a data warehouse.

The data mining process includes the following steps:

1. Requirement analysis: The enterprise decision makers need to formulate goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for, since the outcomes determine which techniques are to be used and what data is required.
2. Data selection and collection: This step includes finding the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. If the data is not available in the warehouse or the enterprise does not have a warehouse, the source Online Transaction Processing (OLTP) systems need to be identified and the required information extracted and stored in some temporary system.

3. Cleaning and preparing data: This may not be an onerous task if a data warehouse containing the required data exists, since most of this must have already been done when data was loaded in the warehouse. Otherwise this task can be very resource intensive and sometimes more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.

4. Data mining exploration and validation: Once appropriate data has been collected and cleaned, it is possible to start data mining exploration. Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the needs of the enterprise. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is likely to be an iterative process which should lead to the selection of one or more techniques that are suitable for further exploration, testing and validation.

5. Implementing, evaluating and monitoring: Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualization and explanation, for managers. It may be that more than one technique is available for the given data mining task. It is then important to evaluate the results and choose the best technique. Evaluation may involve checking the accuracy and effectiveness of the technique.
There is a need for regular monitoring of the performance of the techniques that have been implemented. It is essential that use of the tools by the managers be monitored and the results evaluated regularly. Every enterprise evolves with time and so too must the data mining system.

6. Results visualization: Explaining the results of data mining to the decision makers is an important step of the data mining process. Most commercial data mining tools include data visualization modules. These tools are vital in communicating the data mining results to the managers, although a problem dealing with a number of dimensions must be visualized using a two-dimensional computer screen or printout. Clever data visualization tools are being developed to display results that deal with more than two dimensions. The visualization tools available should be tried and used if found effective for the given problem.
Figure 1.2 CRISP data mining process model
1.4 DATA MINING APPLICATIONS

Data mining is being used for a wide variety of applications. We group the applications into the following six groups. These are related groups, not disjoint groups.

1. Prediction and description: Data mining may be used to answer questions like "would this customer buy a product?" or "is this customer likely to leave?" Data mining techniques may also be used for sales forecasting and analysis. Usually the techniques involve selecting some or all the attributes of the objects available in a database to predict other variables of interest.

2. Relationship marketing: Data mining can help in analyzing customer profiles, discovering sales triggers, and in identifying critical issues that determine client loyalty and help in improving customer retention. This also includes analyzing customer profiles and improving direct marketing plans. It may be possible to use cluster analysis to identify customers suitable for cross-selling other products.

3. Customer profiling: It is the process of using the relevant and available information to describe the characteristics of a group of customers, to identify their discriminators from other customers or ordinary consumers, and to identify the drivers for their purchasing decisions. Profiling can help an enterprise identify its most valuable customers so that the enterprise may differentiate their needs and values.

4. Outlier identification and detecting fraud: There are many uses of data mining in identifying outliers, fraud or unusual cases. These might be as simple as identifying unusual expense claims by staff, identifying anomalies in expenditure between similar units of an enterprise, perhaps during auditing, or identifying fraud, for example, involving credit or phone cards.

5. Customer segmentation: It is a way to assess and view individuals in the market based on their status and needs. Data mining can be used for customer segmentation, for promoting the cross-selling of services, and in increasing customer retention. Data mining may also be used for branch segmentation and for evaluating the performance of various banking channels, such as phone or online banking. Furthermore, data mining may be used to understand and predict customer behavior and profitability, to develop new products and services, and to effectively market new offerings.

6. Web site design and promotion: Web mining may be used to discover how users navigate a Web site, and the results can help in improving the site design and making it more visible on the Web. Data mining may also be used in cross-selling by suggesting to a Web customer items that he/she may be interested in, through correlating properties about the customers, or the items the person has ordered, with a database of items that other customers might have ordered previously.

1.5 DATA MINING TECHNIQUES

Data mining employs a number of techniques including the following:

Association rules mining or market basket analysis

Association rules mining is a technique that analyses a set of transactions at a supermarket checkout, each transaction being a list of products or items purchased by one customer. The aim of association rule mining is to determine which items are purchased together frequently so that they may be grouped together on store shelves or the information may be used for cross-selling. Sometimes the term lift is used to measure the power of association between items that are purchased together. Lift essentially indicates how much more likely an item is to be purchased if the customer has bought the other item that has been identified.

Association rules mining has many applications other than market basket analysis, including applications in marketing, customer segmentation, medicine, electronic commerce, classification, clustering, Web mining, bioinformatics, and finance. A simple algorithm called the Apriori algorithm is used to find associations.

Supervised classification

Supervised classification is appropriate to use if the data is known to have a small number of classes, the classes are known, and some training data with their classes known is available. The model built based on the training data may then be used to assign a new object to a predefined class.
Supervised classification can be used in predicting the class to which an object or individual is likely to belong. This is useful, for example, in predicting whether an individual is likely to respond to a direct mail solicitation, in identifying a good candidate for a surgical procedure, or in identifying a good risk for granting a loan or insurance. One of the most widely used supervised classification techniques is the decision tree. The decision tree technique is widely used because it generates easily understandable rules for classifying data.
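As an illustration of the decision tree approach, the short sketch below builds a tree on a tiny invented loan-risk table; it assumes the scikit-learn library is available and is not an example from the text.

# A minimal decision tree sketch (assumes scikit-learn; the loan-risk data is invented).
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [income in thousands, years in current job]; label 1 = good risk, 0 = bad risk.
X = [[25, 1], [40, 3], [60, 5], [80, 10], [30, 2], [90, 8]]
y = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# The fitted tree can be printed as easily understandable if-then rules.
print(export_text(model, feature_names=["income", "years_employed"]))

# Assign a new applicant to a predefined class.
print(model.predict([[55, 4]]))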

Cluster analysis

Cluster analysis or clustering is similar to classification but, in contrast to supervised classification, cluster analysis is useful when the classes in the data are not already known and training data is not available. The aim of cluster analysis is to find groups that are very different from each other in a collection of data. Cluster analysis breaks up a single collection of diverse data into a number of groups. Often these techniques require that the user specifies how many groups are expected.

One of the most widely used cluster analysis methods is called the K-means algorithm, which requires that the user specify not only the number of clusters but also their starting seeds. The algorithm assigns each object in the given data to the closest seed, which provides the initial clusters.
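A minimal K-means sketch, again assuming scikit-learn and using invented two-dimensional points, shows how the user supplies the number of clusters (explicit starting seeds could be passed through the init parameter):

# K-means sketch (assumes scikit-learn; the points are invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],   # one compact group
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])  # another compact group

# The user specifies the number of clusters; starting seeds could be given via init=.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # final cluster centres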

Web data mining

The last decade has witnessed the Web revolution, which has ushered in a new information retrieval age. The revolution has had a profound impact on the way we search and find information at home and at work. Searching the Web has become an everyday experience for millions of people from all over the world (some estimates suggest over 500 million users). From its beginning in the early 1990s, the Web had grown to more than four billion pages in 2004, and perhaps would grow to more than eight billion pages by the end of 2006.

Search engines

The search engine databases of Web pages are built and updated automatically by Web crawlers. When one searches the Web using one of the search engines, one is not searching the entire Web. Instead, one is only searching the database that has been compiled by the search engine.

Data warehousing and OLAP

Data warehousing is a process by which an enterprise collects data from the whole enterprise to build a single version of the truth. This information is useful for decision makers and may also be used for data mining. A data warehouse can be of real help in data mining since data cleaning and other problems of collecting data would have already been overcome. OLAP tools are decision support tools that are often built on top of a data warehouse or another database (called a multidimensional database).

1.6 DATA MINING CASE STUDIES

There are a number of case studies from a variety of data mining applications.

Aviation - Wipro's Frequent Flyer Program

Wipro has reported a study of frequent flyer data from an Indian airline. Before carrying out data mining, the data was selected and prepared. It was decided to use only the three most common sectors flown by each customer and the three most common sectors on which points were redeemed by each customer. It was discovered that much of the data supplied by the airline was incomplete or inaccurate. Also it was found that the customer data captured by the company could have been more complete. For example, the airline did not know customers' marital status or their income or their reasons for taking a journey.

Astronomy

Astronomers produce huge amounts of data every night on the fluctuating intensity of around 20 million stars, which are classified by their spectra and their surface temperature. Some 90% of stars are called main sequence stars, including some stars that are very large, very hot, and blue in color. The main sequence stars are fuelled by nuclear fusion and are very stable, lasting billions of years. The smaller main sequence stars include the Sun. There are a number of classes, including stars called yellow dwarf, red dwarf and white dwarf.

Banking and Finance

Banking and finance is a rapidly changing, competitive industry. The industry is using data mining for a variety of tasks, including building customer profiles to better understand the customers, to identify fraud, to evaluate risks in personal and home loans, and to better forecast stock prices, interest rates, exchange rates and commodity prices. In the field of credit evaluation, data mining can assist in establishing an automated decision support system which would allow credit card or loan providing companies to quickly and accurately assess risk and approve or reject an application.

Climate

A study has been reported on atmospheric and oceanic parameters that cause drought in the state of Nebraska in the USA. Many variables were considered, including the following.
1. Standardized precipitation index (SPI)
2. Palmer drought severity index (PDSI)
3. Southern oscillation index (SOI)
4. Multivariate ENSO (El Nino Southern Oscillation) index (MEI)
5. Pacific/North American index (PNA)
6. North Atlantic oscillation index (NAO)
7. Pacific decadal oscillation index (PDO)
It was concluded that SOI, MEI and PDO, rather than SPI and PDSI, have relatively stronger relationships with drought episodes over selected stations in Nebraska.

Crime Prevention

A number of case studies have been published about the use of data mining techniques in analyzing crime data. In one particular study, the data mining techniques were used to link serious sexual crimes to other crimes that might have been committed by the same offenders. The data used related to more than 2000 offences (the number of offenders was much smaller since most offenders committed multiple crimes) involving a variety of sexual crimes.

Direct Mail Service

In this case study, a direct mail company held a list of a large number of potential customers. The response rate of the company had been only 1%, which the company wanted to improve. To carry out data mining, the company had to first prepare the data, which included sampling the data to select a subset of customers including those that responded to direct mail and those that did not.

Healthcare

Much research is being carried out in applying data mining to a variety of applications in healthcare. It has been found, for example, that in drug testing, data mining may assist in isolating those patients for whom the drug is most effective or for whom the drug is having unintended side effects.

1.7 FUTURE OF DATA MINING

The use of data mining in business is growing as data mining techniques move from research algorithms to business applications, as storage prices continue to decline, and as enterprise data continues to grow. Nevertheless, data mining is still not being used widely, so there is considerable potential for data mining to continue to grow.

Since most time spent in data mining is actually spent in data extraction, data cleaning and data manipulation, it is expected that technologies like data warehousing will grow in importance. It has been found that as much as 40% of all collected data contains errors. To deal with such large error rates, there is likely to be more emphasis in the future on building data warehouses using data cleaning and extraction. Data mining efficiency would improve if these tasks could be carried out before the mining process begins.

Data mining techniques depend upon a lot of careful analysis of the business and a good understanding of the techniques and software available. Often a model needs to be built, tested and validated before it can be used. This needs considerable expertise and time. The team engaged in building a data mining application should have the business expertise as well as the data mining expertise.

Data mining techniques that are likely to become important in the future include techniques that better determine the "interestingness" of a discovered pattern and are able to compare current data with an earlier set of data to determine if there is a change of pattern in the data. Other techniques that are likely to receive more attention in the future are text and web-content mining, bioinformatics, and multimedia data mining.

The issues related to information privacy and data mining will continue to attract serious concern in the community in the future. In particular, privacy concerns related to the use of data mining techniques by governments, in particular the US Government, in fighting terrorism are likely to grow.

1.8 GUIDELINES FOR SUCCESSFUL DATA MINING

Every data mining project is different, but the projects do have some common features. The following are some basic requirements for a successful data mining project:

• The data must be available
• The data must be relevant, adequate, and clean
• There must be a well-defined problem
• The problem should not be solvable by means of ordinary query or OLAP tools
• The result must be actionable

Once the basic prerequisites have been met, the following guidelines may be appropriate for a data
mining project.

1. Data mining projects should be carried out by small teams with a strong internal integration and a loose management style.

2. Before starting a major data mining project, it is recommended that a small pilot project be carried out. This may involve a steep learning curve for the project team. This is of vital importance.

3. A clear problem owner should be identified who is responsible for the project. Preferably such a person should not be a technical analyst or a consultant but someone with direct business responsibility, for example someone in a sales or marketing environment. This will benefit the external integration.

4. The positive return on investment should be realized within 6 to 12 months.

5. Since the roll-out of the results of a data mining application involves larger groups of people and is technically less complex, it should be a separate and more strictly managed project.

6. The whole project should have the support of the top management of the company.

1.9 DATA MINING SOFTWARE


There is considerable data mining software available on the market. Most major computing companies, for example IBM, Oracle and Microsoft, provide data mining packages. It should be understood that different software packages provide different types of tools. A list of software packages is now provided.
• Angoss Software - Angoss has data mining software called KnowledgeSTUDIO. It is a complete data mining package that includes facilities for classification, cluster analysis and prediction. KnowledgeSTUDIO claims to provide a visual, easy-to-use interface. Angoss also has another package called KnowledgeSEEKER that is designed to support decision tree classification.
• CART and MARS - This software from Salford Systems includes CART decision trees, MARS predictive modeling, automated regression, TreeNet classification and regression, data access, preparation, cleaning and reporting modules, and RandomForests predictive modeling, clustering and anomaly detection.
• Clementine - This software from SPSS is a well-known and comprehensive package that provides association rules, classification, cluster analysis, factor analysis, forecasting, prediction and sequence discovery. Clementine provides a GUI approach to data mining, providing icons on the desktop for various data mining steps including preparing data, visualization and data mining techniques.
• Data Miner Software Kit - It is a collection of data mining tools.
• DBMiner Technologies - DBMiner provides techniques for association rules, classification and cluster analysis. It interfaces with SQL Server and is able to use some of the facilities of SQL Server.
• Enterprise Miner - SAS Institute has a comprehensive integrated data mining package. Enterprise Miner provides a user-friendly, icon-based GUI front-end using their process model called SEMMA (Sample, Explore, Modify, Model, Assess).
• GhostMiner - It is a complete data mining suite, including data preprocessing, feature selection, k-nearest neighbours, neural nets, decision tree, SVM, PCA, clustering, and visualization.

• Intelligent Miner - This is a comprehensive data mining package from IBM. Intelligent Miner uses DB2 but can access data from other databases. Its functionality includes association rules, classification, cluster analysis, prediction, sequential patterns, and time series. It also includes Intelligent Miner for Text for text mining, including mining of email and web pages. Intelligent Miner provides support for processes from data preparation to mining and presentation.
• JDA Intellect - JDA Software Group has a comprehensive package called JDA Intellect that provides facilities for association rules, classification, cluster analysis, and prediction.
• Mantas - Mantas Software is a small company that was a spin-off from SRA International. The Mantas suite is designed to focus on detecting and analyzing suspicious behavior in financial markets and to assist in complying with global regulations.
• MCubiX from Diagnos - It is a complete and affordable data mining tool box, including decision tree, neural networks, association rules and visualization.
• MineSet - Originally developed by SGI, MineSet specializes in visualization and provides a variety of visualization tools including the scatter visualizer, the statistics visualizer and the map visualizer.
• A further package supports data cleaning and provides a graphical tool for data preprocessing. It provides mining on relational databases and supports development, documentation, reuse and exchange of complete KDD processes.
• Oracle - Oracle 10g has data mining facilities embedded in it. Users of Oracle have access to techniques for association rules, classification and prediction. Oracle Data Miner is a graphical user interface for Oracle Data Mining.
• Weka 3 - A collection of machine learning algorithms for solving data mining problems. It is written in Java and runs on any platform.

Software Evaluation and Selection

1. Product and vendor information
(a) Is the vendor reliable?
(b) What is the vendor's background? Industry expertise? How long has the vendor been in business?
(c) Does the software run on the platform that is to be used? Are there satisfied customers already using the software?
(d) Was the software originally designed for this platform?
(e) Does the software use client-server architecture, if that is the environment in which the software is to be used?
(f) How easy is it to input information from a relational database system?
2. Total cost of ownership
a) How much will the product cost?
b) Does it include the cost of maintenance?
c) What kind of contract of sale is offered? Can the software be used by more than one user, if required?
3. Performance
a) How does the product perform compared to its competitors?
b) If performance information is provided, is the information reliable?
c) How does the software deal with large datasets? Is the performance linear or quadratic or exponential?
4. Functionality and modularity
a) What techniques and algorithms does the software provide? Do they include the ones needed and likely to be needed in the near future?
b) Do the techniques accept a variety of data, for example, numeric, categorical, etc.?
c) Is the software modular? Can the software be customized? Is customization going to be easy?

5. Training and support
a) What documentation is provided?
b) How current is the documentation?
c) Does the vendor provide training and help in installation?
d) Is computer-based training for the software available?
e) Are there any articles that have been written about the product by third parties?
f) Is technical support available?
g) Is this support available in case problems arise?
h) How far away is the nearest branch of the vendor?
i) Is the vendor large enough to provide adequate support?
j) Does the vendor have knowledgeable support staff?
6. Reporting facilities and visualization
a) Are the results of data mining presented in a variety of ways?
b) Are good graphical and visualization tools available to communicate results?
c) Are tools available to provide a summary of results?
d) Can reporting be customized to meet the user's needs?
7. Usability
a) Is the user interface intuitive, given the machine platform?
b) Is the software easy to learn? Is the documentation available simple, clear and concise?
c) Is the software flexible? Can it be easily adapted to a variety of problems?

CONCLUSION

In this chapter, we explained what data mining is and also presented a definition of data mining. The reasons for current interest in data mining were discussed and a number of areas of growth in data were described. The data mining process was discussed and the techniques to be covered in this book were introduced. A number of application areas of data mining were presented and several case studies from different application areas were briefly described. We also presented a list of some data mining software available and a list of issues that should be considered when purchasing data mining software.
REVIEW QUESTIONS

1. Briefly explain data mining and define it.
2. Discuss some of the reasons for growth in enterprise data.
3. What kind of tasks is data mining suitable for? Discuss.
4. List three main reasons that would motivate an enterprise to take an interest in data mining.
5. Describe the steps that are required in a typical data mining process.
6. List three important data mining techniques and present an application example for each of them.
7. Are you able to see some disadvantages or dangers of using data mining in business or government? Explain.
8. Briefly describe a data mining case study in each of the following areas:
a) Astronomy
b) Marketing
c) Telecommunications
9. Explain some factors that one must take into account when selecting data mining software.
10. Explain the salient differences between the major data mining techniques.

CHAPTER 2
ASSOCIATION RULES MINING
Learning Objectives
1. Explain what association rules mining is and present a naive algorithm
2. Explain the basic terminology and the Apriori algorithm
3. Provide examples of the Apriori algorithm
4. Discuss the efficiency of the Apriori algorithm and find ways to improve it
5. Discuss a number of more efficient algorithms
6. Summarize the major issues in association rules and get acquainted with a bibliography on this technique

2.1 INTRODUCTION

A huge amount of data is stored electronically in most enterprises. In particular, in all retail outlets the amount of data stored has grown enormously due to bar coding of all goods sold. As an extreme example presented earlier, Wal-Mart, with more than 4000 stores, collects about 20 million point-of-sale transactions each day.
Analyzing a large database of supermarket transactions with the aim of finding association rules is called association rules mining or market basket analysis. It involves searching for interesting customer habits by looking at associations. Association rules mining has many applications other than market basket analysis, including applications in marketing, customer segmentation, medicine, electronic commerce, classification, clustering, web mining, bioinformatics and finance.

2.2 BASICS

Let us first describe the association rule task, and also define some of the terminology by using an
example of a small shop. We assume that the shop sells:
Bread Cheese Coffee
Juice Milk Tea
Biscuits Newspaper Sugar
We assume that the shopkeeper keeps records of what each customer purchases. Such records of ten customers are given in Table 2.1. Each row in the table gives the set of items that one customer bought.

Table 2.1 Transactions for a simple example

Transaction ID   Items
10               Bread, Cheese, Newspaper
20               Bread, Cheese, Juice
30               Bread, Milk
40               Cheese, Juice, Milk, Coffee
50
60
70               Bread, Cheese
80               Bread, Cheese, Juice
90               Bread, Milk
100

The shopkeeper wants to find which products (call them items) are sold together frequently. If, for example, sugar and tea are two items that are sold together frequently, then the shopkeeper might consider having a sale on one of them in the hope that it will not only increase the sale of that item but also the sale of the other.

Association rules are written as X => Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items (in which case the same item does not appear in both sets). X is referred to as the antecedent of the rule and Y as the consequent.
X => Y is a probabilistic relationship found empirically. It indicates only that X and Y have been found together frequently in the given data and does not show a causal relationship implying that buying of X by a customer causes him/her to buy Y.
As noted above, we assume that we have a set of transactions, each transaction being a list of items. Suppose items (or itemsets) X and Y appear together in only 10% of the transactions but whenever X appears there is an 80% chance that Y also appears. The 10% presence of X and Y together is called the support (or prevalence) of the rule and the 80% chance is called the confidence (or predictability) of the rule.
Let us define support and confidence more formally. The total number of transactions is N. Support of X is the number of times it appears in the database divided by N, and support for X and Y together is the number of times they appear together divided by N. Therefore, using P(X) to mean the probability of X in the database, we have:

Support(X) = (Number of times X appears) / N = P(X)
Support(XY) = (Number of times X and Y appear together) / N = P(X ∩ Y)

Confidence for X => Y is defined as the ratio of the support for X and Y together to the support for X. Therefore if X appears much more frequently than X and Y appear together, the confidence will be low. It does not depend on how frequently Y appears.

Confidence of (X => Y) = Support(XY) / Support(X) = P(X ∩ Y) / P(X) = P(Y|X)

P(Y|X) is the probability of Y once X has taken place, also called the conditional probability of Y.
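These definitions translate directly into code. The sketch below (the function names are my own, not the book's) computes support and confidence for a rule X => Y over a list of transactions, and also the lift measure mentioned in Chapter 1, using the small transaction set of Table 2.2.

# Support, confidence and lift computed from a list of transactions (illustrative names).

def support(itemset, transactions):
    """P(itemset): fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of X => Y = support(X and Y together) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Lift of X => Y = confidence(X => Y) / support(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)

transactions = [{"Bread", "Cheese"},
                {"Bread", "Cheese", "Juice"},
                {"Bread", "Milk"},
                {"Cheese", "Juice", "Milk"}]
print(support({"Cheese", "Juice"}, transactions))       # 0.5
print(confidence({"Juice"}, {"Cheese"}, transactions))  # 1.0
print(lift({"Juice"}, {"Cheese"}, transactions))        # about 1.33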

2.3 THE TASK AND A NAIVE ALGORITHM

Given a large set of transactions, we seek a procedure to discover all association rules which have at least p% support with at least q% confidence, such that all rules satisfying these constraints are found and, of course, found efficiently.

Example 2.1 - A Naive Algorithm

Let us consider a naive brute force algorithm to do the task. Consider the following example (Table 2.2) which is even simpler than what we considered earlier in Table 2.1. We now have only the four transactions given in Table 2.2, each transaction showing the purchases of one customer. We are interested in finding association rules with a minimum "support" of 50% and a minimum "confidence" of 75%.
Table 2.2 Transactions for Example 2.1

Transaction ID   Items
100              Bread, Cheese
200              Bread, Cheese, Juice
300              Bread, Milk
400              Cheese, Juice, Milk

If we can list all the combinations of the items that we have in stock and find which of these combinations are frequent, then we can find the association rules that have the required "confidence" from these frequent combinations.
The four items, all the combinations of these four items, and their frequencies of occurrence in the transaction "database" in Table 2.2 are given in Table 2.3.

Table 2.3 The list of all itemsets and their frequencies

Itemsets                       Frequency
Bread                          3
Cheese                         3
Juice                          2
Milk                           2
(Bread, Cheese)                2
(Bread, Juice)                 1
(Bread, Milk)                  1
(Cheese, Juice)                2
(Cheese, Milk)                 1
(Juice, Milk)                  1
(Bread, Cheese, Juice)         1
(Bread, Cheese, Milk)          0
(Bread, Juice, Milk)           0
(Cheese, Juice, Milk)          1
(Bread, Cheese, Juice, Milk)   0

Given the required minimum support of 50%, we find the itemsets that occur in at least two transactions. Such itemsets are called frequent. The list of frequencies shows that all four items Bread, Cheese, Juice and Milk are frequent. The frequency goes down as we look at 2-itemsets, 3-itemsets and 4-itemsets.
The frequent itemsets are given in Table 2.4.
Table 2.4 The set of all frequent itemsets

Itemsets          Frequency
Bread             3
Cheese            3
Juice             2
Milk              2
(Bread, Cheese)   2
(Cheese, Juice)   2

We can now proceed to determine if the two 2-itemsets (Bread, Cheese) and (Cheese, Juice) lead to association rules with the required confidence of 75%. Every 2-itemset (A, B) can lead to two rules A => B and B => A if both satisfy the required confidence. As defined earlier, the confidence of A => B is given by the support for A and B together divided by the support for A.
We therefore have four possible rules and their confidence as follows:

Bread => Cheese with confidence of 2/3 = 67%
Cheese => Bread with confidence of 2/3 = 67%
Cheese => Juice with confidence of 2/3 = 67%
Juice => Cheese with confidence of 2/2 = 100%

Therefore only the last rule, Juice => Cheese, has confidence above the minimum 75% required and qualifies. Rules that have more than the user-specified minimum confidence are called confident.
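The naive procedure of Example 2.1 can be coded almost word for word: enumerate every combination of the items, count its frequency, keep the frequent ones, and test each candidate rule. The sketch below reproduces the numbers in Tables 2.3 and 2.4 (the names used are illustrative only).

# Brute-force association rule search for the four transactions of Table 2.2.
from itertools import combinations

transactions = [{"Bread", "Cheese"},
                {"Bread", "Cheese", "Juice"},
                {"Bread", "Milk"},
                {"Cheese", "Juice", "Milk"}]
items = sorted(set.union(*transactions))
min_support, min_confidence = 0.5, 0.75
N = len(transactions)

# Count every possible itemset (Table 2.3) and keep the frequent ones (Table 2.4).
frequent = {}
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        count = sum(1 for t in transactions if set(itemset) <= t)
        if count / N >= min_support:
            frequent[frozenset(itemset)] = count

# Form the rules A => B from each frequent 2-itemset and keep the confident ones.
for itemset, count in frequent.items():
    if len(itemset) == 2:
        for a in itemset:
            conf = count / frequent[frozenset([a])]
            if conf >= min_confidence:
                print(a, "=>", set(itemset - {a}), "confidence", conf)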
Improved Naive Algorithm
Rather than counting all the possible item combinations, we can look at each transaction and count only the combinations that actually occur (that is, we do not count itemsets with zero frequency). For example, Table 2.5 lists all the actual combinations occurring within the transactions given in Table 2.2.
Table 2.5 All possible combinations with nonzero frequencies

Transaction ID   Items                  Combinations
100              Bread, Cheese          (Bread, Cheese)
200              Bread, Cheese, Juice   (Bread, Cheese), (Bread, Juice), (Cheese, Juice), (Bread, Cheese, Juice)
300              Bread, Milk            (Bread, Milk)
400              Cheese, Juice, Milk    (Cheese, Juice), (Cheese, Milk), (Juice, Milk), (Cheese, Juice, Milk)

The frequencies of these combinations are given in Table 2.6.

Table 2.6 Frequencies of all itemsets with nonzero frequencies

Itemsets                 Frequency
Bread                    3
Cheese                   3
Juice                    2
Milk                     2
(Bread, Cheese)          2
(Bread, Juice)           1
(Bread, Milk)            1
(Cheese, Juice)          2
(Juice, Milk)            1
(Bread, Cheese, Juice)   1
(Cheese, Juice, Milk)    1

We can now proceed as before. This would work better since the list of item combinations is reduced (from 15 to 11), and this reduction is likely to be much larger for bigger problems. Regardless of the extent of the reduction, this list will also become very large for, say, 1000 items.

2.4 THE APRIORI ALGORITHM


The basic algorithm for finding association rules was first proposed in 1993. In 1994, an improved algorithm was proposed. Our discussion is based on the 1994 algorithm, called the Apriori algorithm. This algorithm may be considered to consist of two parts. In the first part, those itemsets that exceed the minimum support requirement are found. As noted earlier, such itemsets are called frequent itemsets. In the second part, the association rules that meet the minimum confidence requirement are found from the frequent itemsets. The second part is relatively straightforward, so much of the focus of the research in this field has been on improving the first part.
First Part - Frequent Itemsets
The first part of the algorithm itself may be divided into two steps (Steps 2 and 3 below). The first step essentially finds itemsets that are likely to be frequent, or candidates for frequent itemsets. The second step finds a subset of these candidate itemsets that are actually frequent.
The algorithm given below works on a given set of transactions (it is assumed that we require a minimum support of p%):
Step 1: Scan all transactions and find all frequent items that have support above p%. Let these frequent items be L1.
Step 2: Build potential sets of k items from Lk-1 by using pairs of itemsets in Lk-1 such that each pair has the first k-2 items in common. Now the k-2 common items and the one remaining item from each of the two itemsets are combined to form a k-itemset. The set of such potentially frequent k-itemsets is the candidate set Ck. (For k = 2, build the potential frequent pairs by using the frequent item set L1 so that every item in L1 appears with every other item in L1. The set so generated is the candidate set C2.) This step is called Apriori-gen.
Step 3: Scan all transactions and find all k-itemsets in Ck that are frequent. The frequent set so obtained is Lk. (For k = 2, C2 is the set of candidate pairs. The frequent pairs are L2.)
Terminate when no further frequent itemsets are found, otherwise continue with Step 2.
The main notation for association rule mining that is used in the Apriori algorithm is the following:
• A k-itemset is a set of k items.
• The set Ck is a set of candidate k-itemsets that are potentially frequent.
• The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.
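Using the Ck / Lk notation just introduced, the frequent-itemset part of the algorithm can be sketched as below. This is an illustrative outline only (the pruning refinement discussed later in this section is omitted), using the data of Example 2.2.

# Sketch of the frequent-itemset part of Apriori (illustrative, not the book's code).
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    N = len(transactions)
    min_count = min_support * N

    # Step 1: one scan of the database to count items and obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)

    k = 2
    while L:
        # Step 2 (Apriori-gen): join pairs of (k-1)-itemsets sharing their first k-2 items.
        prev = sorted(tuple(sorted(s)) for s in L)
        Ck = {frozenset(a) | frozenset(b)
              for a, b in combinations(prev, 2) if a[:k - 2] == b[:k - 2]}
        # Step 3: scan all transactions and keep the candidates that are frequent.
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        L = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L)
        k += 1
    return frequent

# The five transactions of Example 2.2 (Tables 2.10 and 2.12), 50% minimum support.
transactions = [{"Bread", "Cheese", "Eggs", "Juice"},
                {"Bread", "Cheese", "Juice"},
                {"Bread", "Milk", "Yogurt"},
                {"Bread", "Juice", "Milk"},
                {"Cheese", "Juice", "Milk"}]
for itemset, count in apriori_frequent_itemsets(transactions, 0.5).items():
    print(sorted(itemset), count)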
It is now worthwhile to discuss the algorithmic aspects of the Apriori algorithm. Some of the issues that need to be considered are:
1. Computing L1: We scan the disk-resident database only once to obtain L1. An item vector of length n with a count for each item stored in the main memory may be used. Once the scan of the database is finished and the count for each item found, the items that meet the support criterion can be identified and L1 determined.

2. Apriori-gen function: This is Step 2 of the Apriori algorithm. It takes an argument Lk-1 and returns a set of all candidate k-itemsets. In computing C3 from L2, we organize L2 so that the itemsets are stored in their lexicographic order. Observe that if an itemset in C3 is (a, b, c), then L2 must have the itemsets (a, b) and (a, c), since all subsets of itemsets in C3 must be frequent. Therefore to find C3 we only need to look at pairs in L2 that have the same first item. Once we find two such matching pairs in L2, they are combined to form a candidate itemset in C3. Similarly, when forming Ci from Li-1, we sort the itemsets in Li-1 and look for a pair of itemsets in Li-1 that have the same first i-2 items. If we find such a pair, we can combine them to produce a candidate itemset for Ci (a sketch of this join, together with the pruning step described next, is given after this list).

3. Pruning: Once a candidate set Ck has been produced, we can prune some of the candidate itemsets by checking that all subsets of every itemset in the set are frequent. For example, if we have derived (a, b, c) from (a, b) and (a, c), then we check that (b, c) is also in L2. If it is not, (a, b, c) may be removed from C3. The task of such pruning becomes harder as the number of items in the itemsets grows, but the number of large itemsets tends to be small.

4. Apriori subset function: To improve the efficiency of searching, the candidate itemsets Ck are stored in a hash tree. The leaves of the hash tree store itemsets while the internal nodes provide a roadmap to reach the leaves. Each leaf node is reached by traversing the tree whose root is at depth 1. Each internal node of depth d points to all the related nodes at depth d+1, and the branch to be taken is determined by applying a hash function on the dth item. All nodes are initially created as leaf nodes, and when the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an internal node.

5. Transaction storage: We assume the data is too large to be stored in the main memory. Should it be stored as a set of transactions, each transaction being a sequence of item numbers? Alternatively, should each transaction be stored as a Boolean vector of length n (n being the number of items in the store) with 1s showing for the items purchased?

6. Computing L2 (and more generally Lk): Assuming that C2 is available in the main memory, each candidate pair needs to be tested to find if the pair is frequent. Given that C2 is likely to be large, this testing must be done efficiently. In one scan, each transaction can be checked for the candidate pairs.
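As indicated in items 2 and 3 above, candidate generation is a lexicographic join of Lk-1 with itself followed by a pruning pass over the subsets. The sketch below is illustrative only; it uses the (a, b), (a, c) example from the pruning discussion, where the candidate (a, b, c) is generated by the join and then removed because (b, c) is not frequent.

# Apriori-gen join followed by subset pruning (illustrative sketch).
from itertools import combinations

def apriori_gen(Lk_minus_1, k):
    """Join (k-1)-itemsets that share their first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in Lk_minus_1)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:                  # same first k-2 items
            candidates.add(frozenset(a) | frozenset(b))
    # Prune: every (k-1)-subset of a surviving candidate must itself be frequent.
    return {c for c in candidates
            if all(frozenset(s) in Lk_minus_1 for s in combinations(sorted(c), k - 1))}

L2 = {frozenset({"a", "b"}), frozenset({"a", "c"})}   # (b, c) is not frequent
print(apriori_gen(L2, 3))                             # -> set(): (a, b, c) is pruned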

Second Part - Finding the Rules

To find the association rules from the frequent itemsets, we take a large frequent itemset, say p, and find each nonempty subset a. The rule a => (p - a) is possible if it satisfies the required confidence. The confidence of this rule is given by support(p) / support(a).
It should be noted that when considering rules like a => (p - a), it is possible to make the rule generation process more efficient as follows. We only want rules that have the minimum confidence required. Since confidence is given by support(p)/support(a), it is clear that if for some a the rule a => (p - a) does not have minimum confidence, then all rules like b => (p - b), where b is a subset of a, will also not have the confidence, since support(b) cannot be smaller than support(a).
Another way to improve rule generation is to consider rules like (p - a) => a. If this rule has the minimum confidence then all rules (p - b) => b will also have minimum confidence if b is a subset of a, since (p - b) has more items than (p - a) (given that b is smaller than a) and so cannot have support higher than that of (p - a). As an example, if A => BCD has the minimum confidence then all rules like AB => CD, AC => BD and ABC => D will also have the minimum confidence. Once again this can be used in improving the efficiency of rule generation.
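The second part can then be sketched as follows: for each frequent itemset p and each nonempty proper subset a, the rule a => (p - a) is accepted when support(p)/support(a) reaches the confidence threshold. The support counts used below are those found for Example 2.2; the code is an illustrative sketch, not the book's implementation.

# Generate confident rules a => (p - a) from frequent itemsets (illustrative sketch).
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent maps frozenset -> support count (as produced by the frequent-itemset phase)."""
    rules = []
    for p, p_count in frequent.items():
        if len(p) < 2:
            continue
        for size in range(1, len(p)):
            for a in combinations(sorted(p), size):
                a = frozenset(a)
                conf = p_count / frequent[a]          # support(p) / support(a)
                if conf >= min_confidence:
                    rules.append((set(a), set(p - a), conf))
    return rules

# Frequent itemsets and support counts from Example 2.2 (five transactions, 50% support).
frequent = {frozenset({"Bread"}): 4, frozenset({"Cheese"}): 3,
            frozenset({"Juice"}): 4, frozenset({"Milk"}): 3,
            frozenset({"Bread", "Juice"}): 3, frozenset({"Cheese", "Juice"}): 3}
for lhs, rhs, conf in generate_rules(frequent, 0.75):
    print(lhs, "=>", rhs, round(conf, 2))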
Implementation Issue - Transaction Storage
To illustrate the different options for representing the transactions, let the number of items be six. Let them be {A, B, C, D, E, F}. Let there be only eight transactions, with transaction IDs (10, 20, 30, 40, 50, 60, 70, 80). This set of eight transactions with six items can be represented in at least three different ways as follows.
The first representation (Table 2.7) is the most obvious horizontal one. Each row in the table provides the transaction ID and the items that were purchased.
Table 2.7 A simple representation of transactions as an item list

Transaction ID   Items
10               A, B, D
20               D, E, F
30               A, F
40               B, C, D
50               E, F
60               D, E, F
70               C, D, F
80               A, C, D, F

In the second horizontal representation (Table 2.8), rather than listing the items that were purchased, we may list all the items and indicate purchases by putting a 1 against each item that occurs in a transaction and 0 against the rest. Each row is still a transaction, but the items purchased are represented by a binary string.

Table 2.8 Representing transactions as a binary item list

TID   A   B   C   D   E   F
10    1   1   0   1   0   0
20    0   0   0   1   1   1
30    1   0   0   0   0   1
40    0   1   1   1   0   0
50    0   0   0   0   1   1
60    0   0   0   1   1   1
70    0   0   1   1   0   1
80    1   0   1   1   0   1

In the third representation (Table 2.9), call it the vertical representation, the transaction list is turned around. Rather than using each row to represent a transaction of the items purchased, each row now represents an item and indicates the transactions in which the item appears. The columns now represent the transactions. This representation is also called a TID-list, since for each item it provides a list of TIDs.

Table 2.9 Representing transactions as binary columns (TID-lists)

Item   10   20   30   40   50   60   70   80
A       1    0    1    0    0    0    0    1
B       1    0    0    1    0    0    0    0
C       0    0    0    1    0    0    1    1
D       1    1    0    1    0    1    1    1
E       0    1    0    0    1    1    0    0
F       0    1    1    0    1    1    1    1

How the data is represented can have an impact on the efficiency of an algorithm. A vertical representation can facilitate counting of items by counting the number of 1s in each row and, for example, the number of occurrences of a 2-itemset can be counted by finding the intersection of the two TID lists, but the representation is not storage efficient if there is a very large number of transactions involved.
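As a small illustration of the vertical representation, the sketch below inverts the horizontal transactions of Table 2.7 into TID-lists and counts the support of a 2-itemset by intersecting two lists.

# Building TID-lists (vertical representation) and counting a 2-itemset
# by intersecting the two lists (illustrative; data from Table 2.7).
transactions = {10: {"A", "B", "D"}, 20: {"D", "E", "F"}, 30: {"A", "F"},
                40: {"B", "C", "D"}, 50: {"E", "F"}, 60: {"D", "E", "F"},
                70: {"C", "D", "F"}, 80: {"A", "C", "D", "F"}}

# Invert the horizontal representation into item -> set of TIDs.
tid_lists = {}
for tid, items in transactions.items():
    for item in items:
        tid_lists.setdefault(item, set()).add(tid)

print(sorted(tid_lists["D"]))                    # [10, 20, 40, 60, 70, 80]
print(sorted(tid_lists["F"]))                    # [20, 30, 50, 60, 70, 80]
# The support count of the pair {D, F} is the size of the intersection.
print(len(tid_lists["D"] & tid_lists["F"]))      # 4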
Example 2.2 - A Simple Apriori Example
Let us first consider an example of only five transactions and six items. The example is similar to Example 2.1 in Table 2.2, but we have added two more items and another transaction. We still want to find association rules with 50% support and 75% confidence. The transactions are given in Table 2.10.

Table 2.10 Transactions for Example 2.2

Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk

Scanning the transactions, we find that Bread appears 4 times, Cheese 3 times, Juice 4 times, Milk 3 times, and Eggs and Yogurt only once. We require 50% support, so a frequent item must appear in at least three transactions. Therefore L1 is given in Table 2.11.

Table 2.11 Frequent items L1 for Example 2.2

Item     Frequency
Bread    4
Cheese   3
Juice    4
Milk     3

The candidate 2-itemsets in C2 and their frequencies are:

Itemset            Frequency
(Bread, Cheese)    2
(Bread, Juice)     3
(Bread, Milk)      2
(Cheese, Juice)    3
(Cheese, Milk)     1
(Juice, Milk)      2

We therefore have only two frequent item pairs, which are {Bread, Juice} and {Cheese, Juice}. This is L2. From these two frequent 2-itemsets, we do not obtain a candidate 3-itemset since we do not have two 2-itemsets that have the same first item.
The two frequent 2-itemsets above lead to the following possible rules:
Bread => Juice
Juice => Bread
Cheese => Juice
Juice => Cheese

The confidence of these rules is obtained by dividing the support for both items in the rule by the support for the item on the left-hand side of the rule. The confidences of the four rules are therefore 3/4 = 75%, 3/4 = 75%, 3/3 = 100%, and 3/4 = 75% respectively. Since all of them have the minimum confidence, they all qualify.

2.5 IMPROVING THE EFFICIENCY OF THE APRIORI ALGORITHM

The Apriori algorithm is resource intensive for large sets of transactions that have a large set of frequent items. The major reasons for this may be summarized as follows:
1. The number of candidate itemsets grows quickly and can result in huge candidate sets. For example, the size of the candidate sets, in particular C2, is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost for scanning the transaction database to find the frequent itemsets. Given that the early sets of candidate itemsets are very large, the initial iterations dominate the cost.
2. The Apriori algorithm requires many scans of the database. If n is the length of the longest frequent itemset, the database must be scanned n times.
3. Many trivial rules (e.g. buying milk with Tic Tacs) are derived, and it can often be difficult to extract the most interesting rules from all the rules derived. For example, one may wish to remove all the rules involving very frequently sold items.
4. Some rules can be inexplicable and very fine grained, for example, that toothbrushes were the most frequently sold item on Thursday mornings.
5. Redundant rules are generated. For example, if A => B is a rule then any rule AC => B is redundant. A number of approaches have been suggested to avoid generating such redundant rules.
6. The Apriori algorithm assumes sparseness, since the number of items in each transaction is small compared with the total number of items. The algorithm works better with sparsity. Some applications produce dense data which may also have many frequently occurring items.

A number of techniques for improving the performance of the Apriori algorithm have been suggested. They can be classified into four categories:
• Reduce the number of candidate itemsets. For example, use pruning to reduce the number of candidate 3-itemsets and, if necessary, larger itemsets.
• Reduce the number of transactions. This may involve scanning the transaction data after L1 has been computed and deleting all the transactions that do not have at least two frequent items. More transaction reduction may be done if the frequent 2-itemset L2 is small.

• Reduce the number of comparisons. There may be no need to compare every candidate against every transaction if we use an appropriate data structure.
• Generate candidate sets efficiently. For example, it may be possible to compute Ck and from it compute Ck+1 rather than wait for Lk to be available. One could search for both k-itemsets and (k+1)-itemsets in one pass.

We now discuss a number of algorithms that use one or more of the above approaches to improve the Apriori algorithm. The last method, Frequent Pattern Growth, does not generate candidate itemsets and is not based on the Apriori algorithm.
1. Apriori-TID
2. Direct Hashing and Pruning (DHP)
3. Dynamic Itemset Counting (DIC)
4. Frequent Pattern Growth

2.6 APRIORI-TID

The Apriori-TID algorithm is outlined below:

1. The entire transaction database is scanned to obtain T1 in terms of itemsets (i.e. each entry of T1 contains all the items in the transaction along with the corresponding TID).
2. The frequent 1-itemset L1 is calculated with the help of T1.
3. C2 is obtained by applying the Apriori-gen function.
4. The support for the candidates in C2 is then calculated by using T1.
5. Entries in T2 are then calculated.
6. L2 is then generated from C2 by the usual means, and then C3 can be generated from L2.
7. T3 is then generated with the help of T2 and C3. This process is repeated until the set of candidate k-itemsets is an empty set.
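A simplified sketch of this outline is given below: Tk records, for each transaction, only the candidate k-itemsets it contains, and the supports for the next level are counted from Tk rather than by rescanning the raw database. The code is illustrative, omits the pruning step, and uses the data of Example 2.2.

# Simplified Apriori-TID sketch: supports for candidate k-itemsets are counted
# from T(k-1), the surviving itemsets per transaction, not from the raw database.
from itertools import combinations

transactions = {100: {"Bread", "Cheese", "Eggs", "Juice"},
                200: {"Bread", "Cheese", "Juice"},
                300: {"Bread", "Milk", "Yogurt"},
                400: {"Bread", "Juice", "Milk"},
                500: {"Cheese", "Juice", "Milk"}}
min_count = 3   # 50% support over 5 transactions means at least 3 occurrences

# Steps 1-2: T1 is the database with each item as a 1-itemset; L1 is counted from it.
T = {tid: {frozenset([i]) for i in items} for tid, items in transactions.items()}
counts = {}
for itemsets in T.values():
    for s in itemsets:
        counts[s] = counts.get(s, 0) + 1
L = {s for s, c in counts.items() if c >= min_count}

k = 2
while L:
    # Steps 3 and 6: Apriori-gen join of L(k-1) with itself (pruning omitted for brevity);
    # each candidate is remembered together with the two itemsets that generated it.
    prev = sorted(tuple(sorted(s)) for s in L)
    gen = {frozenset(a) | frozenset(b): (frozenset(a), frozenset(b))
           for a, b in combinations(prev, 2) if a[:k - 2] == b[:k - 2]}
    # Steps 4-5 and 7: count candidates using T(k-1) and build T(k) at the same time.
    counts, newT = {}, {}
    for tid, itemsets in T.items():
        present = {c for c, (x, y) in gen.items() if x in itemsets and y in itemsets}
        if present:
            newT[tid] = present
        for c in present:
            counts[c] = counts.get(c, 0) + 1
    L = {s for s, c in counts.items() if c >= min_count}
    T = newT
    print("L%d:" % k, [sorted(s) for s in L])
    k += 1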
Example 2.3 - Apriori-TID
We consider the transactions in Example 2.2 again. As a first step, T1 is generated by scanning the database. It is assumed throughout the algorithm that the itemsets in each transaction are stored in lexicographical order. T1 is essentially the same as the whole database, the only difference being that each of the items in a transaction is represented as a set of one item.

Step 1
First scan the entire database and obtain T1 by treating each item as a 1-itemset. This is given in Table 2.12.

Table 2.12 The transaction database T1

Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk

Steps 2 and 3
The next step is to generate L1. This is generated with the help of T1. C2 is then calculated as previously in the Apriori algorithm. See Table 2.13.

Table 2.13 The sets L1 and C2

L1                        C2
Itemset      Support      Itemset
{Bread}      4            {B, C}
{Cheese}     3            {B, J}
{Juice}      4            {B, M}
{Milk}       3            {C, J}
                          {C, M}
                          {J, M}

In Table 2.13, we have used the single letters B (Bread), C (Cheese), J (Juice) and M (Milk) for C2.

Step 4
The support for the itemsets in C2 is now calculated with the help of T1, instead of scanning the actual database as in the Apriori algorithm, and the result is shown in Table 2.14.

Table 2.14 Frequency of itemsets in C2

Itemset    Frequency
{B, C}     2
{B, J}     3
{B, M}     2
{C, J}     3
{C, M}     1
{J, M}     2

Step 5
We now find T2 by using C2 and T1, as shown in Table 2.15.

Table 2.15 Transaction database T2

TID    Set-of-Itemsets
100    {{B, C}, {B, J}, {C, J}}
200    {{B, C}, {B, J}, {C, J}}
300    {{B, M}}
400    {{B, J}, {B, M}, {J, M}}
500    {{C, J}, {C, M}, {J, M}}

{B, J} and {C, J} are the frequent pairs and they make up L2. C3 may now be generated, but we find that C3 is empty. If it was not empty, we would have used it to find T3 with the help of the transaction set T2, and that would result in a smaller T3. This is the end of this simple example.

The generation of association rules from the derived frequent itemsets can be done in the usual way. The main advantage of the Apriori-TID algorithm is that, for larger values of k, each entry in Tk is usually smaller than the corresponding transaction in the database. Since the support for each candidate k-itemset is counted with the help of the corresponding Tk, the algorithm is often faster than the basic Apriori algorithm.

It should be noted that both Apriori and Apriori-TID use the same candidate generation algorithm and therefore they count the same itemsets. Experiments have shown that the Apriori algorithm runs more efficiently during the earlier passes because, for small values of k, each entry in Tk may be larger than the corresponding entry in the transaction database.

2.7 DIRECT HASHING AND PRUNING (DHP)

This algorithm proposes overcoming some of the weaknesses of the Apriori algorithm by reducing the number of candidate k-itemsets, in particular the 2-itemsets, since that is the key to improving performance. Also, as noted earlier, as k increases, not only is there a smaller number of frequent k-itemsets but there are fewer transactions containing these itemsets. Thus it should not be necessary to scan the whole transaction database as k becomes larger than 2.
The direct hashing and pruning (DHP) algorithm claims to be efficient in the generation of frequent itemsets and effective in trimming the transaction database by discarding items from the transactions or removing whole transactions that do not need to be scanned. The algorithm uses a hash-based technique to reduce the number of candidate itemsets generated in the first pass (that is, a significantly smaller C2 is constructed). It is claimed that the size of C2 generated using DHP can be orders of magnitude smaller than with Apriori, so that the scan required to determine L2 is much more efficient.

The algorithm may be divided into the following three parts. The first part finds all the frequent 1-itemsets and all the candidate 2-itemsets. The second part is the more general part including hashing and the third part is without the hashing. Both the second and third parts include pruning. Part 2 is used for early iterations and Part 3 for later iterations.
Part 1 - Essentially the algorithm goes through each transaction counting all the 1-itemsets. At the same time all the possible 2-itemsets in the current transaction are hashed to a hash table. The algorithm uses the hash table in the next pass to reduce the number of candidate itemsets. Each bucket in the hash table has a count, which is increased by one each time an itemset is hashed to that bucket. Collisions can occur when different itemsets are hashed to the same bucket. A bit vector is associated with the hash table to provide a flag for each bucket. If the bucket count is equal to or above the minimum support count, the corresponding flag in the bit vector is set to 1, otherwise it is set to 0.
Part 2 - This part has two phases. In the first phase, Ck is generated. In the Apriori algorithm Ck is generated by Lk-1 × Lk-1, but the DHP algorithm uses the hash table to reduce the number of candidate itemsets in Ck. An itemset is included in Ck only if the corresponding bit in the hash table bit vector has been set, that is, the number of itemsets hashed to that location is at least the minimum support count. Although having the corresponding bit vector bit set does not guarantee that the itemset is frequent, due to collisions, the hash table filtering does reduce Ck. The reduced Ck is stored in a hash tree, which is used to count the support for each itemset in the second phase of this part.
In the second phase, the hash table for the next step is generated. Both in the support counting and when the hash table is generated, pruning of the database is carried out. Only itemsets that are important to future steps are kept in the database. A k-itemset is not considered useful for a frequent (k+1)-itemset unless it appears at least k times in a transaction. The pruning not only trims each transaction by removing the unwanted itemsets but also removes transactions that have no itemsets that could be frequent.
Part 3 - The third part of the algorithm continues until there are no more candidate itemsets. Instead of using a hash table to find the frequent itemsets, the transaction database is now scanned to find the support count for each itemset. The dataset is likely to be significantly smaller by now because of the pruning. When the support counts have been established, the algorithm determines the frequent itemsets as before by checking against the minimum support. The algorithm then generates candidate itemsets as the Apriori algorithm does.
Example 2.4 - DHP Algorithm
We now use an example to illustrate the DHP algorithm. The transaction database is the same as we used in Example 2.2. We want to find association rules that satisfy 50% support and 75% confidence. Table 2.16 presents the transaction database and Table 2.17 presents the possible 2-itemsets for each transaction.

Table 2.16 Transaction database for Example 2.4


Transaction ID Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

We will use letters B(Bread), C(Cheese), E(Egg), J(Juice), M(Milk) and Y(Yogurt) in Tables
2.17 to 2.19
Table 2.17 Possible 2-itemsets

100  (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200  (B, C) (B, J) (C, J)
300  (B, M) (B, Y) (M, Y)
400  (B, J) (B, M) (J, M)
500  (C, J) (C, M) (J, M)

The possible 2-itemsets in Table 2.17 are now hashed to a hash table. The last column shown in Table 2.18 is not required in the hash table but we have included it for the purpose of explaining the technique.
Assuming a hash table of size 8 and using the very simple hash function described below leads to the hash table in Table 2.18.
Table 2.18 Hash table for 2-itemsets

Bit vector  Bucket number  Count  Pairs                                C2
1           0              5      (C, J) (C, J) (B, Y) (M, Y) (C, J)   (C, J)
0           1              1      (C, M)
0           2              1      (E, J)
0           3              0
0           4              2      (B, C) (B, C)
1           5              3      (B, E) (J, M) (J, M)                 (J, M)
1           6              3      (B, J) (B, J) (B, J)                 (B, J)
1           7              3      (C, E) (B, M) (B, M)                 (B, M)

The simple hash function is obtained as follows (a small Python sketch is given after the description):

• For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6, and then representing each pair by a two-digit number, for example, (B, E) by 13 and (C, M) by 25.

• The two-digit number is then reduced modulo 8 (dividing by 8 and using the remainder). This is the bucket address.
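The hashing step can be illustrated with a few lines of Python. The item codes, the modulo-8 bucket function and the minimum support count of 3 follow the description above, while the transaction data is that of Table 2.16; the variable and function names are our own.

from itertools import combinations

code = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}   # item codes used in the text

def bucket(pair):
    """Hash a 2-itemset: form the two-digit number and take it modulo 8."""
    a, b = sorted(pair, key=lambda x: code[x])
    return (10 * code[a] + code[b]) % 8

transactions = {100: ["B", "C", "E", "J"], 200: ["B", "C", "J"],
                300: ["B", "M", "Y"], 400: ["B", "J", "M"], 500: ["C", "J", "M"]}

counts = [0] * 8
for items in transactions.values():
    for pair in combinations(sorted(items, key=lambda x: code[x]), 2):
        counts[bucket(pair)] += 1

min_support_count = 3      # 50% of 5 transactions, rounded up
bit_vector = [1 if c >= min_support_count else 0 for c in counts]
print(counts)              # bucket 6, for example, collects the three occurrences of (B, J)
print(bit_vector)          # buckets 0, 5, 6 and 7 have their flags set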
For a support of 50%, the frequent items are B, C, J and M. This is L1, which leads to a C2 of (B, C), (B, J), (B, M), (C, J), (C, M) and (J, M). These candidate pairs are then hashed to the hash table and the pairs that hash to locations where the bit vector bit is not set are removed. Table 2.18 shows that (B, C) and (C, M) can be removed from C2. We are therefore left with the four candidate item pairs, the reduced C2, given in the last column of the hash table in Table 2.18. We now look at the transaction database and modify it to include only these candidate pairs (Table 2.19).

Table 2.19 Transaction database with candidate 2-itemsets

100  (B, J) (C, J)
200  (B, J) (C, J)
300  (B, M)
400  (B, J) (B, M)
500  (C, J) (J, M)

It is now necessary to count the support for each pair and, while doing so, we further trim the database by removing items and deleting transactions that cannot appear in frequent 3-itemsets. The frequent pairs are (B, J) and (C, J). A candidate 3-itemset must have two pairs with the same first item. Only transaction 400 qualifies since it has candidate pairs (B, J) and (B, M). The others can therefore be deleted and the transaction database now looks like Table 2.20.
Table 2.20 Reduced transaction database

(B, J, M)

In this simple example we can now conclude that (B, J, M) is the only potential frequent 3-itemset, but it cannot qualify since transaction 400 does not have the pair (J, M) and the pairs (J, M) and (B, M) are not frequent pairs. That concludes this example.
2.8 DYNAMIC ITEMSET COUNTING (DIC)
The Apriori algorithm must do as many scans of the transaction database as the number of items in the largest candidate itemset that was checked for its support. The Dynamic Itemset Counting (DIC) algorithm reduces the number of scans required by not just doing one scan for the frequent 1-itemsets and another for the frequent 2-itemsets, but by combining the counting for a number of itemsets, starting to count an itemset as soon as it appears that it might be necessary to count it.
The basic algorithm is as follows:
1. Divide the transaction database into a number of, say q, partitions.

2. Start counting the 1-itemsets in the first partition of the transaction database.
3. At the beginning of the second partition, continue counting the 1-itemsets but also start counting the 2-itemsets using the frequent 1-itemsets from the first partition.
4. At the beginning of the third partition, continue counting the 1-itemsets and the 2-itemsets but also start counting the 3-itemsets using results from the first two partitions.
5. Continue like this until the whole database has been scanned once. We now have the final set of frequent 1-itemsets.
6. Go back to the beginning of the transaction database and continue counting the 2-itemsets and the 3-itemsets.
7. At the end of the first partition in the second scan of the database, we have scanned the whole database for 2-itemsets and thus have the final set of frequent 2-itemsets.
8. Continue the process in a similar way until no frequent k-itemsets are found.
The DIC algorithm works well when the data is relatively homogeneous throughout the file, since it starts the 2-itemset count before having a final 1-itemset count. If the data distribution is not homogeneous, the algorithm may not identify an itemset as large until most of the database has been scanned. In such cases it may be possible to randomize the order of the transactions, although this is not always possible. Essentially, DIC attempts to finish the itemset counting in two scans of the database while Apriori would often take three or more scans.
2.9 MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION (FP-GROWTH)
The FP-growth algorithm uses an approach that is different from that used by methods based on the Apriori algorithm. The major difference between frequent pattern growth (FP-growth) and the other algorithms is that FP-growth does not generate candidates, it only tests. In contrast, the Apriori algorithm generates the candidate itemsets and then tests them.
The motivation for the FP-tree method is as follows:
• Only the frequent items are needed to find the association rules, so it is best to find the frequent items and ignore the others.
• If the frequent items can be stored in a compact structure, then the original transaction database does not need to be used repeatedly.
• If multiple transactions share a set of frequent items, it may be possible to merge the shared sets with the number of occurrences registered as a count.
To be able to do this, the algorithm involves generating a frequent pattern tree (FP-tree).
Generating FP-trees
The algorithm works as follows (a small Python sketch is given after the steps):
1. Scan the transaction database once, as in the Apriori algorithm, to find all the frequent items and their support.
2. Sort the frequent items in descending order of their support.
3. Initially, start creating the FP-tree with a root "null".
4. Get the first transaction from the transaction database. Remove all non-frequent items and list the remaining items according to the order in the sorted frequent items.
5. Use the transaction to construct the first branch of the tree, with each node corresponding to a frequent item and showing that item's frequency, which is 1 for the first transaction.
6. Get the next transaction from the transaction database. Remove all non-frequent items and list the remaining items according to the order in the sorted frequent items.
7. Insert the transaction in the tree using any common prefix that may appear. Increase the item counts.
8. Continue with step 6 until all transactions in the database are processed.
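A minimal Python sketch of the tree-building steps is given below. It assumes transactions are supplied as sets of items and that the minimum support is an absolute count; ties in the support ordering are broken arbitrarily, and the node-links and header table used later for mining are omitted for brevity. All names are our own.

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, minsup):
    # Pass 1 (steps 1 and 2): find the frequent items and sort them by descending support
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    frequent = {i: s for i, s in support.items() if s >= minsup}
    order = sorted(frequent, key=lambda i: -frequent[i])

    # Pass 2 (steps 3 to 8): insert each trimmed, reordered transaction into the tree
    root = FPNode(None, None)
    for t in transactions:
        items = [i for i in order if i in t]      # keep frequent items in sorted order
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1    # shared prefix: increase the count
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, frequent

# The database of Table 2.21 with a minimum support count of 3
db = [{"Bread", "Cheese", "Eggs", "Juice"}, {"Bread", "Cheese", "Juice"},
      {"Bread", "Milk", "Yogurt"}, {"Bread", "Juice", "Milk"}, {"Cheese", "Juice", "Milk"}]
tree, freq = build_fp_tree(db, 3)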
Let us see an example (Example 2.5). The minimum support required is 50% and the confidence is 75%.
Table 2.21 Transaction database for Example 2.5
Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

The frequent items sorted by their frequency are shown in Table 2.22.
Table 2.22 Frequent items for the database in Table 2.21
Item    Frequency
Bread   4
Juice   4
Cheese  3
Milk    3

Now we remove the items that are not frequent from the transactions and order the items according to their frequency as in the table above.

Table 2.23 Database after removing the non-frequent items and reordering
Transaction ID  Items
100  Bread, Juice, Cheese
200  Bread, Juice, Cheese
300  Bread, Milk
400  Bread, Juice, Milk
500  Juice, Cheese, Milk

Figure 2.3 FP-tree for Example 2.5


Mining the FP-tree for frequent items
To find the frequent itemsets we should note that for any frequent item a, all the frequent itemsets containing a can be obtained by following a's node-links, starting from a's head in the FP-tree header.
The mining of the FP-tree structure is done using an algorithm called frequent pattern growth (FP-growth). This algorithm starts with the least frequent item, that is, the last item in the header table. It then finds all the paths from the root to this item and adjusts the counts along each path according to this item's support count.
We first use the FP-tree in Figure 2.3, built in the example earlier, to find the frequent itemsets. We start with the item M and find the following patterns:

BM(1)
BJM(1)
JCM(1)
No frequent itemset is discovered from these since no itemset appears three times. Next we look at C and find the following:

BJC(2)
JC(1)
These two patterns give us a frequent itemset JC(3). Looking at J, the next frequent item in the table, we obtain:
BJ(3)
J(1)
Again we obtain a frequent itemset, BJ(3). There is no need to follow links from item B as there are
no other frequent itemsets.

The process above may be represented by the "conditional" trees for M, C and J in Figures 2.4, 2.5
and 2.6 respectively.

Figure 2.4 Conditional tree for M
Figure 2.5 Conditional tree for C
Figure 2.6 Conditional tree for J


Advantages of the FP-tree approach
One advantage of the FP-tree algorithm is that it avoids scanning the database more than twice to find the support counts. Another advantage is that it completely eliminates the costly candidate generation, which can be expensive, in particular for the Apriori algorithm for the candidate set C2. The FP-growth algorithm is better than the Apriori algorithm when the transaction database is huge and the minimum support count is low. A low minimum support count means that a large number of items will satisfy the support count and hence the size of the candidate sets for Apriori will be large. FP-growth uses a more efficient structure to mine patterns when the database grows.

2.10 PERFORMANCE EVALUATION OF ALGORITHMS


Performance evaluation has been carried out on a number of implementations of different association rule mining algorithms. One study compared methods including Apriori, CHARM and FP-growth using real-world data as well as artificial data. It was concluded that:
1. The FP-growth method was usually better than the best implementation of the Apriori algorithm.
2. CHARM was usually better than Apriori. In some cases, CHARM was better than the FP-growth method.
3. Apriori was generally better than the other algorithms if the support required was high, since a high support leads to a smaller number of frequent items, which suits the Apriori algorithm.
4. At very low support, the number of frequent items became large and none of the algorithms was able to handle a large frequent itemset search gracefully.
There were two evaluations held in 2003 and November 2004. These evaluations have provided many new and surprising insights into association rule mining. In the 2003 performance evaluation of programs, it was found that two algorithms were the best. These were:
1. An efficient implementation of the FP-tree algorithm.
2. An algorithm that combined a number of algorithms using multiple heuristics.

The performance evaluation also included algorithms for closed itemset mining as well as for maximal itemset mining. The performance evaluation in 2004 found an implementation of an algorithm that involves a tree traversal to be the most efficient algorithm for finding frequent, frequent closed and maximal frequent itemsets.

2.11 SOFTWARE FOR ASSOCIATION RULE MINING

Packages like Clementine and IBM Intelligent Miner include comprehensive association rule mining software. We present some software designed specifically for association rules.

• Apriori, FP-growth, Eclat and DIC implementations by Bart Goethals. The algorithms generate all frequent itemsets for a given minimal support threshold and for a given minimal confidence threshold (free). For detailed particulars visit:
http://www.adrem.ua.ac.be/~goethals/software/index.html

• ARMiner is a client-server data mining application specialized in finding association rules. ARMiner has been written in Java and it is distributed under the GNU General Public License. ARMiner was developed at UMass/Boston as a Software Engineering project in Spring 2000. For a detailed study visit:
http://www.cs.umb.edu/~laur/ARMiner

• ARtool has also been developed at UMass/Boston. It offers a collection of algorithms and tools for the mining of association rules in binary databases. It is distributed under the GNU General Public License. For more information visit:
http://www.cs.umb.edu/~laur/ARtool

• DMII (Data Mining II) association rule software from NUS Singapore. For more information visit:
http://www.comp.nus.edu.sg/~dm2

• FIMI, the Frequent Itemset Mining Implementations repository, is the result of the workshops on Frequent Itemset Mining Implementations, FIMI'03 and FIMI'04, which took place at IEEE ICDM'03 and IEEE ICDM'04 respectively. For more information visit:
http://fimi.cs.helsinki.fi/

CONCLUSION
This chapter introduced the association rule mining problem and presented the classical Apriori algorithm. Association rule mining is an interesting problem with many applications. The algorithm used is conceptually simple and the resulting rules are clear and understandable. The algorithms work on data of variable length.

The process of mining association rules can be made more efficient than the Apriori algorithm. Some of the proposed algorithms make changes to the existing Apriori algorithm, like Apriori-TID and DHP, while others present completely new solutions, like FP-growth. In performance evaluations of many algorithms it is becoming clear that tree-based algorithms perform the best.

It should be noted that lowering the support threshold results in many more frequent itemsets and often increases the number of candidate itemsets and the maximum length of frequent itemsets, resulting in cost increases. Also, in the case of denser datasets (i.e. those with many items in each transaction) the maximum length of frequent itemsets is often higher and therefore the cost of finding the association rules is higher.
REVIEW QUESTIONS
1. Define support and confidence for an association rule.
2. Define lift. What is the relation between support, confidence and lift of an association rule X → Y?
3. Prove that all nonempty subsets of a frequent itemset must also be frequent.
4. The efficiency of the Apriori method for association rules may be improved by using any of the following techniques:
• Pruning
• Transaction reduction
• Partitioning
• Sampling
Explain two of these approaches using the grocery data.
5. Explain how the hashing method DHP works. Estimate how much work will be needed to compute association rules compared to Apriori. Make suitable assumptions.
6. Which step of the Apriori algorithm is the most expensive? Explain the reasons for your answer.

CHAPTER 3

CLASSIFICATION
Learning Objectives
1. Explain the concept of classification
2. Describe the Decision Tree method
3. Describe the Naive Bayes method
4. Discuss accuracy of classification methods and how accuracy may be improved

3.1 INTRODUCTION
Classification is a classical problem extensively studied by statisticians and machine learning researchers. The word classification is difficult to define precisely. According to one definition, classification is the separation or ordering of objects (or things) into classes. If the classes are created without looking at the data (non-empirically), the classification is called apriori classification. If however the classes are created empirically (by looking at the data), the classification is called posteriori classification. In most literature on classification it is assumed that the classes have been defined apriori and classification then consists of training the system so that when a new object is presented to the trained system it is able to assign the object to one of the existing classes. This approach is also called supervised learning.

Data mining has generated renewed interest in classification. Since the datasets in data mining are often large, new classification techniques have been developed to deal with millions of objects having perhaps dozens or even hundreds of attributes.

3.2 DECISION TREE


A decision tree is a popular classification method that results in a flow-chart like tree structure where each node denotes a test on an attribute value and each branch represents an outcome of the test. The tree leaves represent the classes.

Let us imagine that we wish to classify Australian animals. We have some training data in Table 3.1 which has already been classified. We want to build a model based on this data.

Table 3.1 Training data for a classification problem

Name        Eggs  Pouch  Flies  Feathers  Class
Cockatoo    Yes   No     Yes    Yes       Bird
Dugong      No    No     No     No        Mammal
Echidna     Yes   Yes    No     No        Marsupial
Emu         Yes   No     No     Yes       Bird
Kangaroo    No    Yes    No     No        Marsupial
Koala       No    Yes    No     No        Marsupial
Kookaburra  Yes   No     Yes    Yes       Bird
Owl         No    No     Yes    Yes       Bird
Penguin     Yes   No     No     Yes       Bird
Platypus    Yes   No     No     No        Mammal
Possum      No    Yes    No     No        Marsupial
Wombat      No    Yes    No     No        Marsupial

As an example of a decision tree, we show a possible result in Figure 3.1 of classifying the data in Table 3.1.

Figure 3.1 A decision tree for the data in Table 3.1


A decision tree is a model that is both predictive and descriptive. A decision tree is a tree that displays relationships found in the training data. The tree consists of zero or more internal nodes and one or more leaf nodes, with each internal node being a decision node having two or more child nodes. Using the training data, the decision tree method generates a tree that consists of nodes that are rules, very similar to those used in "20 questions", to determine the class of an object after the training is completed. Each node of the tree represents a choice between a number of alternatives and each leaf node represents a classification or a decision. The training process that generates the tree is called induction.

Normally, the complexity of a decision tree increases as the number of attributes increases, although in some situations it has been found that only a small number of attributes can determine the class to which an object belongs and the rest of the attributes have little or no impact.

3.3 BUILDING A DECISION TREE - THE TREE INDUCTION ALGORITHM


The decision tree algorithm is a relatively simple top-down greedy algorithm. The aim of the algorithm is to build a tree that has leaves that are as homogeneous as possible. The major step of the algorithm is to continue to divide leaves that are not homogeneous into leaves that are as homogeneous as possible until no further division is possible. The decision tree algorithm is given below (a small Python sketch follows the algorithm):
1. Let the set of training data be S. If some of the attributes are continuous-valued, they should be discretized. For example, age values may be binned into the categories (under 18), (18-40), (41-65) and (over 65) and transformed into A, B, C and D, or more descriptive labels may be chosen. Once that is done, put all of S in a single tree node.
2. If all instances in S are in the same class, then stop.
3. Split the next node by selecting an attribute A from amongst the independent attributes that best divides or splits the objects in the node into subsets, and create a decision tree node.
4. Split the node according to the values of A.
5. Stop if either of the following conditions is met, otherwise continue with step 3.
(a) If this partition divides the data into subsets that belong to a single class and no other node needs splitting.
(b) If there are no remaining attributes on which the sample may be further divided.

In the decision tree algorithm, decisions are made locally and the algorithm at no stage tries to find a
globally optimum tree.
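A compact Python sketch of this greedy induction algorithm is given below. It assumes the attributes are already categorical (step 1 has been done), uses the information gain criterion of the next section to choose the split attribute, and labels a leaf with the majority class; all names are our own, so this is an illustration rather than a definitive implementation.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Choose the attribute whose split leaves the least information (highest gain)."""
    def info_after_split(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [r[target] for r in rows if r[a] == v]
            total += len(subset) / len(rows) * entropy(subset)
        return total
    return min(attributes, key=info_after_split)

def build_tree(rows, attributes, target):
    """rows is a list of dictionaries mapping attribute names to values."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:          # steps 2 and 5: stop splitting
        return Counter(labels).most_common(1)[0][0]      # leaf labelled with the majority class
    split = best_attribute(rows, attributes, target)     # step 3: pick the split attribute
    remaining = [a for a in attributes if a != split]
    branches = {}
    for v in set(r[split] for r in rows):                # step 4: one branch per value of the attribute
        branches[v] = build_tree([r for r in rows if r[split] == v], remaining, target)
    return (split, branches)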

3.4 SPLIT ALGORITHM BASED ON INFORMATION THEORY

One of the techniques for selecting an attribute to split a node is based on the concept of information theory or entropy. The concept is quite simple, although often difficult to understand for many. It is based on Claude Shannon's idea that if you have uncertainty then you have information, and if there is no uncertainty there is no information. For example, if a coin has a head on both sides, then the result of tossing it does not provide any information, but if a coin is normal, with a head and a tail, then the result of the toss provides information.

Essentially, information is defined as -pi log pi, where pi is the probability of some event. Since the probability pi is always less than 1, log pi is always negative and -pi log pi is always positive. For those who cannot recollect their high school mathematics, we note that the log of 1 is always zero whatever the base, the log of any number greater than 1 is always positive and the log of any number smaller than 1 is always negative. Also,

log2(2) = 1
log2(2^n) = n
log2(1/2) = -1
log2(1/2^n) = -n

Information of any event that is likely to have several possible outcomes is given by
I = Σi (-pi log pi)
Consider an event that can have one of two possible values. Let the probabilities of the two values be p1 and p2. Obviously if p1 is 1 and p2 is zero, then there is no information in the outcome and I = 0. If p1 = 0.5, then the information is
I = -0.5 log(0.5) - 0.5 log(0.5)
This comes out to 1.0 (using log base 2), which is the maximum information that you can have for an event with two possible outcomes. This is also called entropy and is in effect a measure of the minimum number of bits required to encode the information.
If we consider the case of a die (singular of dice) with six possible outcomes with equal probability, then the information is given by:
I = 6(-(1/6) log(1/6)) = 2.585
Therefore three bits are required to represent the outcome of rolling a die. Of course, if the die was loaded so that there was a 50% or a 75% chance of getting a 6, then the information content of rolling the die would be lower, as given below. Note that we assume that the probability of getting any of 1 to 5 is equal (that is, equal to 10% for the 50% case and 5% for the 75% case).

50%: I = 5(-(0.1) log(0.1)) - 0.5 log(0.5) = 2.16
75%: I = 5(-(0.05) log(0.05)) - 0.75 log(0.75) = 1.39

Therefore we will need three bits to represent the outcome of throwing a die that has a 50% probability of throwing a six, but only two bits when the probability is 75%.
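The calculations above can be checked with a few lines of Python (a small sketch; log base 2 is used and 0 log 0 is treated as 0):

import math

def information(probs):
    """I = sum of -p * log2(p) over the possible outcomes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(information([1/6] * 6))              # fair die: about 2.585 bits
print(information([0.1] * 5 + [0.5]))      # die loaded 50% towards six: about 2.16
print(information([0.05] * 5 + [0.75]))    # die loaded 75% towards six: about 1.39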
Information Gain
Information gain is a measure of how good an attribute is for predicting the class of each of the training data. We will select the attribute with the highest information gain as the next split attribute.
Perhaps the term information gain is somewhat confusing, because what we really mean is that information gain is a measure of the reduction in uncertainty once the value of an attribute is known. If the uncertainty is reduced by a large amount, knowing the value of the attribute has provided a lot of information and thus we have a large information gain.

Assume there are two classes, P and N, and let the set of training data S (with a total number of objects s) contain p elements of class P and n elements of class N. The amount of information is defined as
I = -(n/s) log(n/s) - (p/s) log(p/s)
Obviously if p = n, I is equal to 1 and if p = s then I = 0. Therefore if there was an attribute for which almost all the objects had the same value (for example, gender when most people are male), using the attribute would lead to little or no information gain (that is, it would not reduce uncertainty) because for gender = female there would be almost no objects while gender = male would have almost all the objects even before we knew the gender. If, on the other hand, an attribute divided the training sample such that gender = female resulted in objects that all belong to Class A and gender = male all belong to Class B, then uncertainty has been reduced to zero and we have a large information gain.
We define information gain for sample S using attribute A as follows:

Gain(S, A) = I - Σ i∈values(A) (ti/s) Ii

where I is the information before the split, Ii is the information of node i and ti is the number of objects in node i. Thus the total information after the split is Σ i∈values(A) (ti/s) Ii, the weighted sum of the information of each node.
Once we have computed the information gain for every remaining attribute, the attribute with
the highest information gain is selected.
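The following Python sketch computes the information gain of each attribute for the loan data of Table 3.2 (the training data is reproduced as a list of tuples; the function and variable names are our own):

import math
from collections import Counter

def info(labels):
    """Information (entropy) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Risk Class"):
    """Gain(S, A) = I(S) minus the weighted information of the subsets for each value of A."""
    n = len(rows)
    after = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / n * info(subset)
    return info([r[target] for r in rows]) - after

# Training data of Table 3.2: (Owns Home, Married, Gender, Employed, Credit Rating, Risk Class)
data = [("Yes","Yes","Male","Yes","A","B"), ("No","No","Female","Yes","A","A"),
        ("Yes","Yes","Female","Yes","B","C"), ("Yes","No","Male","No","B","B"),
        ("No","Yes","Female","Yes","B","C"), ("No","No","Female","Yes","B","A"),
        ("No","No","Male","No","B","B"), ("Yes","No","Female","Yes","A","A"),
        ("No","Yes","Female","Yes","A","C"), ("Yes","Yes","Female","Yes","A","C")]
cols = ["Owns Home", "Married", "Gender", "Employed", "Credit Rating", "Risk Class"]
rows = [dict(zip(cols, r)) for r in data]
for a in cols[:-1]:
    print(a, round(information_gain(rows, a), 2))
# Gender gives the largest gain (about 0.88), so it is chosen as the split attribute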
Example 3.1 - Using the Information Measure
We consider an artificial example of building a decision tree classification model to classify bank
loan applications by assigning applications to one of three risk classes (Table 3.2).
Table 3.2 Training data for Example 3.1

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes  Yes  Male    Yes  A  B
No   No   Female  Yes  A  A
Yes  Yes  Female  Yes  B  C
Yes  No   Male    No   B  B
No   Yes  Female  Yes  B  C
No   No   Female  Yes  B  A
No   No   Male    No   B  B
Yes  No   Female  Yes  A  A
No   Yes  Female  Yes  A  C
Yes  Yes  Female  Yes  A  C

There are 10 (s = 10) samples and three classes. The frequencies of these classes are:
A = 3
B = 3
C = 4
The information in the data due to uncertainty of outcome regarding the risk class each person belongs to is

I = -(3/10) log(3/10) - (3/10) log(3/10) - (4/10) log(4/10) = 1.57


Let us now consider using each attribute in turn as a candidate to split the sample.
1. Attribute "Owns Home"
Value = Yes. There are five applicants who own their home. They are in classes A=1, B=2, C=2.
Value = No. There are five applicants who do not own their home. They are in classes A=2, B=1, C=2.
Given the above values, it does not appear as if this attribute will reduce the uncertainty by much. Let us compute the information gain by using this attribute. We divide persons into those who own their home and those who do not. Computing the information for each of these two subtrees,
I(Yes) = I(y) = -(1/5) log(1/5) - (2/5) log(2/5) - (2/5) log(2/5) = 1.52
I(No) = I(n) = -(2/5) log(2/5) - (1/5) log(1/5) - (2/5) log(2/5) = 1.52
Total information of the two subtrees = 0.5 I(y) + 0.5 I(n) = 1.52

2. Attribute "Married"
There are five applicants who are married and five who are not.
Value = Yes has A = 0, B = 1, C = 4, total 5
Value = No has A = 3, B = 2, C = 0, total 5
Looking at the values above, it appears that this attribute will reduce the uncertainty by more than the last attribute. Computing the information gain by using this attribute, we have
I(y) = -(1/5) log(1/5) - (4/5) log(4/5) = 0.722
I(n) = -(3/5) log(3/5) - (2/5) log(2/5) = 0.971
Information of the subtrees = 0.5 I(y) + 0.5 I(n) = 0.846

3. Attribute "Gender"
There are three applicants who are male and seven who are female.
Value = Male has A = 0, B = 3, C = 0, total 3
Value = Female has A = 3, B = 0, C = 4, total 7
The values above show that the uncertainty is reduced even more by using this attribute, since for Value = Male we have only one class. Let us compute the information gain by using this attribute.
I(Male) = 0
I(Female) = -(3/7) log(3/7) - (4/7) log(4/7) = 0.985
Total information of the subtrees = 0.3 I(Male) + 0.7 I(Female) = 0.69

4. Attribute "Employed"
There are eight applicants who are employed and two who are not.
Value = Yes has A = 3, B = 1, C = 4, total 8
Value = No has A = 0, B = 2, C = 0, total 2
The values above show that this attribute will reduce uncertainty, but most attribute values are Yes while the No value leads to only one class. Computing the information gain by using this attribute, we have
I(y) = -(3/8) log(3/8) - (1/8) log(1/8) - (4/8) log(4/8) = 1.41
I(n) = 0
Total information of the subtrees = 0.8 I(y) + 0.2 I(n) = 1.12

5. Attribute "Credit Rating"
There are five applicants who have credit rating A and five who have B.
Value = A has A = 2, B = 1, C = 2, total 5
Value = B has A = 1, B = 2, C = 2, total 5
Looking at the values above, we can see that this is like the first attribute, which does not reduce uncertainty by much. The information gain for this attribute is the same as for the first attribute.
I(A) = -(2/5) log(2/5) - (1/5) log(1/5) - (2/5) log(2/5) = 1.52
I(B) = -(1/5) log(1/5) - (2/5) log(2/5) - (2/5) log(2/5) = 1.52
Total information of the subtrees = 0.5 I(A) + 0.5 I(B) = 1.52
The values for information gain can now be computed. See Table 3.3.
Table 3.3 Information gain for the five attributes

Potential split attribute  Information before split  Information after split  Information gain
Owns Home      1.57  1.52  0.05
Married        1.57  0.85  0.72
Gender         1.57  0.69  0.88
Employed       1.57  1.12  0.45
Credit Rating  1.57  1.52  0.05

Hence the largest information gain is provided by the attribute "Gender" and that is the attribute that is used for the split.
Now we can reduce the data by removing the attribute Gender and removing Class B, since all Class B objects have Gender = Male. See Table 3.4.


Table 3.4 Data after removing attribute "Gender" and Class B

Owns Home?  Married  Employed  Credit Rating  Risk Class
No   No   Yes  A  A
Yes  Yes  Yes  B  C
No   Yes  Yes  B  C
No   No   Yes  B  A
Yes  No   Yes  A  A
No   Yes  Yes  A  C
Yes  Yes  Yes  A  C

The information in this data of two classes, due to uncertainty of outcome regarding the class each person belongs to, is given by
I = -(3/7) log(3/7) - (4/7) log(4/7) = 0.985
Let us now consider each attribute in turn as a candidate to split the sample.
1. Attribute "Owns Home"
Value = Yes. There are three applicants who own their home. They are in classes A=1 and C=2.
Value = No. There are four applicants who do not own their home. They are in classes A=2 and C=2.
Given the above values, it does not appear as if this attribute will reduce the uncertainty by much. Computing the information for each of these two subtrees,
I(Yes) = I(y) = -(1/3) log(1/3) - (2/3) log(2/3) = 0.92
I(No) = I(n) = -(2/4) log(2/4) - (2/4) log(2/4) = 1.00
Total information of the two subtrees = (3/7) I(y) + (4/7) I(n) = 0.96
2. Attribute "Married"
There are four applicants who are married and three who are not.
Value = Yes has A = 0, C = 4, total 4
Value = No has A = 3, C = 0, total 3
Looking at the values above, it appears that this attribute will reduce the uncertainty by more than the last attribute, since for each value the persons belong to only one class and therefore the information is zero. Computing the information gain by using this attribute, we have
I(y) = -(4/4) log(4/4) = 0.00
I(n) = -(3/3) log(3/3) = 0.00
Information of the subtrees = 0.00

There is no need to consider other attributes now, since no other attribute can be better. The split attribute therefore is "Married" and we now obtain the decision tree in Figure 3.2, which concludes this very simple example. It should be noted that a number of attributes that were available in the data were not required in classifying it.

Figure 3.2 Decision tree for Example 3.1


3.5 SPLIT ALGORITHM BASED ON THE GINI INDEX
Another commonly used split approach is called the Gini index, which is used in the widely used packages CART and IBM Intelligent Miner.
Figure 3.3 shows the Lorenz curve, which is the basis of the Gini index. The index is the ratio of the area between the Lorenz curve and the 45-degree line to the area under the 45-degree line. The smaller the ratio, the less is the area between the two curves and the more evenly distributed is the wealth. When wealth is evenly distributed, asking any person about his/her wealth provides no information at all since every person has the same wealth, while in a situation where wealth is very unevenly distributed, finding out how much wealth a person has provides information because of the uncertainty of the wealth distribution.

Figure 3.3 Lorenz curve (cumulative share of wealth plotted against the cumulative share of people starting from lower income; the Gini Index corresponds to the area between the Lorenz curve and the line of perfect distribution)


Example 3.2 - Using the Gini Index
We use the same example (Table 3.2) as we have used before to illustrate the Gini index.

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes  Yes  Male    Yes  A  B
No   No   Female  Yes  A  A
Yes  Yes  Female  Yes  B  C
Yes  No   Male    No   B  B
No   Yes  Female  Yes  B  C
No   No   Female  Yes  B  A
No   No   Male    No   B  B
Yes  No   Female  Yes  A  A
No   Yes  Female  Yes  A  C
Yes  Yes  Female  Yes  A  C

There are 10 (s = 10) samples and three classes. The frequencies of these classes are:
A = 3
B = 3
C = 4

The Gini index for the distribution of applicants in the three classes is
G = 1 - (3/10)² - (3/10)² - (4/10)² = 0.66

Let us now consider using each of the attributes to split the sample.
1. Attribute "Owns Home"
Value = Yes. There are five applicants who own their home. They are in classes A=1, B=2, C=2.
Value = No. There are five applicants who do not own their home. They are in classes A=2, B=1, C=2.
Using this attribute will divide the objects into those who own their home and those who do not. Computing the Gini index for each of these two subtrees,
G(y) = 1 - (1/5)² - (2/5)² - (2/5)² = 0.64
G(n) = G(y) = 0.64
Total value of the Gini index = G = 0.5 G(y) + 0.5 G(n) = 0.64
2. Attribute "Married"
There are five applicants who are married and five who are not.
Value = Yes has A = 0, B = 1, C = 4, total 5
Value = No has A = 3, B = 2, C = 0, total 5
Looking at the values above, it appears that this attribute will reduce the uncertainty by more than the last attribute. Computing the Gini index by using this attribute, we have
G(y) = 1 - (1/5)² - (4/5)² = 0.32
G(n) = 1 - (3/5)² - (2/5)² = 0.48
Total value of the Gini index = G = 0.5 G(y) + 0.5 G(n) = 0.40
3. Attribute "Gender"
There are three applicants who are male and seven who are female.
Value = Male has A = 0, B = 3, C = 0, total 3
Value = Female has A = 3, B = 0, C = 4, total 7
G(Male) = 1 - 1 = 0
G(Female) = 1 - (3/7)² - (4/7)² = 0.49
Total value of the Gini index = G = 0.3 G(Male) + 0.7 G(Female) = 0.34
4. Attribute "Employed"
There are eight applicants who are employed and two who are not.
Value = Yes has A = 3, B = 1, C = 4, total 8
Value = No has A = 0, B = 2, C = 0, total 2
G(y) = 1 - (3/8)² - (1/8)² - (4/8)² = 0.594
G(n) = 0
Total value of the Gini index = G = 0.8 G(y) + 0.2 G(n) = 0.475
5. Attribute "Credit Rating"
There are five applicants who have credit rating A and five who have B.
Value = A has A = 2, B = 1, C = 2, total 5
Value = B has A = 1, B = 2, C = 2, total 5
G(A) = 1 - 2(2/5)² - (1/5)² = 0.64
G(B) = G(A)
Total value of the Gini index = G = 0.5 G(A) + 0.5 G(B) = 0.64
Table 3.4 summarizes the values of the Gini index obtained for the five attributes Owns Home, Married, Gender, Employed and Credit Rating.
Table 3.4 Gini Index for the five attributes

Attribute       Gini Index before split  Gini Index after split  Reduction
Owns Home       0.66  0.64   0.02
Married         0.66  0.40   0.26
Gender          0.66  0.34   0.32
Employed        0.66  0.475  0.185
Credit Rating   0.66  0.64   0.02

The attribute with the largest reduction in the Gini index is selected as the split attribute, so the split attribute is Gender.
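The same selection can be reproduced with a short Python sketch using the data of Table 3.2 (a rough illustration; all names are our own):

from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(rows, attribute, target="Risk Class"):
    total = 0.0
    for v in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == v]
        total += len(subset) / len(rows) * gini(subset)
    return total

data = [("Yes","Yes","Male","Yes","A","B"), ("No","No","Female","Yes","A","A"),
        ("Yes","Yes","Female","Yes","B","C"), ("Yes","No","Male","No","B","B"),
        ("No","Yes","Female","Yes","B","C"), ("No","No","Female","Yes","B","A"),
        ("No","No","Male","No","B","B"), ("Yes","No","Female","Yes","A","A"),
        ("No","Yes","Female","Yes","A","C"), ("Yes","Yes","Female","Yes","A","C")]
cols = ["Owns Home", "Married", "Gender", "Employed", "Credit Rating", "Risk Class"]
rows = [dict(zip(cols, r)) for r in data]

before = gini([r["Risk Class"] for r in rows])                 # 0.66
for a in cols[:-1]:
    print(a, round(before - gini_after_split(rows, a), 2))     # largest reduction: Gender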

3.6 OVERFITTING AND PRUNING


The decision tree building algorithm given earlier continues until either all leaf nodes are single class nodes or no more attributes are available for splitting a node that has objects of more than one class. When the objects being classified have a large number of attributes and a tree of maximum possible depth is built, the tree quality may not be high since the tree is built to deal correctly with the training set. In fact, in order to do so, it may become quite complex, with long and very uneven paths. Some branches of the tree may reflect anomalies due to noise or outliers in the training samples. Such decision trees are a result of overfitting the training data and may result in poor accuracy for unseen samples.

According to Occam's razor principle (due to the medieval philosopher William of Occam) it is best to posit that the world is inherently simple and to choose the simplest model from similar models, since the simplest model is more likely to be a better model. We can therefore "shave off" nodes and branches of a decision tree, essentially replacing a whole subtree by a leaf node, if it

can be established that the expected error rate in the subtree is greater than that in the single leaf.
This makes the classifier simpler. A simpler model has less chance of introducing inconsistencies,
ambiguities and redundancies.

Pruning is a technique to make an overfitted decision tree simpler and more general.

There are a number of techniques for pruning a decision tree by removing some splits and the subtrees created by them. One approach involves removing branches from a "fully grown" tree to obtain a sequence of progressively pruned trees. The accuracy of these trees is then computed and a pruned tree that is accurate enough and simple enough is selected. It is advisable to use a set of data different from the training data to decide which is the "best pruned tree".

Another approach is called pre-pruning, in which tree construction is halted early. Essentially a node is not split if this would result in the goodness measure of the tree falling below a threshold. It is, however, quite difficult to choose an appropriate threshold.

3.7 DECISION TREE RULES


The decision tree method is a popular and relatively simple supervised classification method in which each node of the tree specifies a test of some attribute and each branch from the node corresponds to one of the values of the attribute. Each path from the root to a leaf of the decision tree therefore consists of attribute tests, finally reaching a leaf that describes the class. The popularity of decision trees is partly due to the ease of understanding the rules that the nodes specify. One could even use the rules specified by a decision tree to retrieve data from a relational database satisfying the rules using SQL.

There are a number of advantages in converting a decision tree to rules. Decision rules make it easier to make pruning decisions since it is easier to see the context of each rule. Also, converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur further down, and the rules are easier for people to understand.

IF-THEN rules may be derived based on the various paths from the root to the leaf nodes. Although this simple approach will lead to as many rules as there are leaf nodes, rules can often be combined to produce a smaller set of rules. For example:

If Gender = "Male" then Class = B
If Gender = "Female" and Married then Class = C, else Class = A

Once all the rules have been generated, it may be possible to simplify the rules. Rules with only one antecedent (e.g. if Gender = "Male" then Class = B) cannot be further simplified, so we only consider those with two or more antecedents. It may be possible to eliminate unnecessary rule antecedents that have no effect on the conclusion reached by the rule. Some rules may be unnecessary and these may be removed. In some cases a number of rules that lead to the same class may be combined.

3.8 NAIVE BAYES METHOD


The Naive Bayes method is based on the work of Thomas Bayes. Bayes was a British minister and his theory was published only after his death. It is a mystery what Bayes wanted to do with such calculations.

Bayesian classification is quite different from the decision tree approach. In Bayesian classification we have a hypothesis that the given data belongs to a particular class. We then calculate the probability for the hypothesis to be true. This is among the most practical approaches for certain types of problems. The approach requires only one scan of the whole data. Also, if at some stage there are additional training data, then each training example can incrementally increase or decrease the probability that a hypothesis is correct.

Before we define the Bayes theorem, we will define some notation. The expression P(A) refers to the probability that event A will occur. P(A|B) stands for the probability that event A will happen, given that event B has already happened. In other words, P(A|B) is the conditional probability of A based on the condition that B has already happened. For example, A and B may be the events of passing course A and passing course B respectively. P(A|B) then is the probability of passing A when we know that B has been passed.

Now here is the Bayes theorem:

P(A|B) = P(B|A) P(A) / P(B)

One might wonder where this theorem came from. Actually it is rather easy to derive, since we know the following:

P(A|B) = P(A & B) / P(B)
and
P(B|A) = P(A & B) / P(A)

Dividing the first equation by the second gives us Bayes' theorem.
Continuing with A and B being courses, we can compute the conditional probabilities if we knew what the probability of passing both courses was, that is P(A & B), and what the probabilities of passing A and B separately were. If an event has already happened then we divide the joint probability P(A & B) by the probability of what has just happened and obtain the conditional probability.

If we consider X to be an object to be classified, then Bayes' theorem may be read as giving the probability of it belonging to one of the classes C1, C2, C3, etc. by calculating P(Ci|X). Once these probabilities have been computed for all the classes, we simply assign X to the class that has the highest conditional probability.

Let us now consider how the probabilities P(Ci|X) may be calculated. We have



P(Ci|X) = [P(X|Ci) P(Ci)] / P(X)

• P(Ci|X) is the probability of the object X belonging to class Ci.
• P(X|Ci) is the probability of obtaining attribute values X if we know that it belongs to class Ci.
• P(Ci) is the probability of any object belonging to class Ci without any other information.
• P(X) is the probability of obtaining attribute values X whatever class the object belongs to.

Given the attribute values X, what probabilities in the formula can we compute? The probabilities we need to compute are P(X|Ci), P(Ci) and P(X). Actually the denominator P(X) is independent of Ci and is not required to be known, since we are interested only in comparing the probabilities P(Ci|X). Therefore we only need to compute P(X|Ci) and P(Ci) for each class. Computing P(Ci) is rather easy, since we count the number of instances of each class in the training data and divide each by the total number of instances. This may not be the most accurate estimation of P(Ci), but we have very little information, the training sample, and we have no other information to obtain a better estimate. This estimate will be reasonable if the training sample is large and was randomly chosen.

To compute P(X|Ci) we use a naive approach (that is why it is called the Naive Bayes model) by assuming that all attributes of X are independent, which is often not true.

Using the independence of attributes assumption and based on the training data, we compute an estimate of the probability of obtaining the data X that we have by estimating the probability of each of the attribute values, counting the frequency of those values for class Ci.

We then determine the class allocation of X by computing [P(X|Ci) P(Ci)] for each of the classes and allocating X to the class with the largest value.
The beauty of the Bayesian approach is that the probability of the dependent attribute can be estimated by computing estimates of the probabilities of the independent attributes.

We should also note that it is possible to use this approach even if the values of all the independent attributes are not known, since we can still estimate the probabilities of the attribute values that we do know. This is a significant advantage of the Bayesian approach.

Example 3.3 - Naive Bayes Method

Once again we go back to the example in Table 3.2 that we have used before.

Owns Home?  Married  Gender  Employed  Credit Rating  Risk Class
Yes  Yes  Male    Yes  A  B
No   No   Female  Yes  A  A
Yes  Yes  Female  Yes  B  C
Yes  No   Male    No   B  B
No   Yes  Female  Yes  B  C
No   No   Female  Yes  B  A
No   No   Male    No   B  B
Yes  No   Female  Yes  A  A
No   Yes  Female  Yes  A  C
Yes  Yes  Female  Yes  A  C
There are 10 (s = 10) samples and three classes. The frequencies of these classes are:
Credit risk Class A = 3
Credit risk Class B = 3
Credit risk Class C = 4

The prior probabilities are obtained by dividing these frequencies by the total number in the training data (that is, 10):
P(A) = 0.3, P(B) = 0.3, and P(C) = 0.4
If the data that is presented to us is {yes, no, female, yes, A} for the five attributes, we can compute the posterior probability for each class as noted earlier. For example:
P(X|Ci) = P({yes, no, female, yes, A}|Ci) = P(Owns Home = yes|Ci) × P(Married = no|Ci) × P(Gender = female|Ci) × P(Employed = yes|Ci) × P(Credit Rating = A|Ci)
Using expressions like that given above, we are able to compute the three posterior probabilities for the three classes, namely that the person with attribute values X has credit risk class A, class B or class C. We compute P(X|Ci) P(Ci) for each of the three classes given P(A) = 0.3, P(B) = 0.3 and P(C) = 0.4, and these values are the basis for comparing the three classes.
To compute P(X|Ci) = P({yes, no, female, yes, A}|Ci) for each of the classes, we need the following probabilities for each:
P(Owns Home = yes|Ci)
P(Married = no|Ci)
P(Gender = female|Ci)
P(Employed = yes|Ci)
P(Credit Rating = A|Ci)

These probabilities are given in Table 3.5. We order the data by risk class to make it convenient. Given the estimates of the probabilities in Table 3.5, we can compute the posterior probabilities as
P(X|A) = 2/9
P(X|B) = 0
P(X|C) = 0
Table 3.5 Probability of events in the Naive Bayes method

Owns Home  Married  Gender  Employed  Credit Rating  Class
No    No   Female  Yes  A  A
No    No   Female  Yes  B  A
Yes   No   Female  Yes  A  A
1/3   1    1       1    2/3   Probability of having {yes, no, female, yes, A} attribute values given the risk Class A
Yes   Yes  Male    Yes  A  B
Yes   No   Male    No   B  B
No    No   Male    No   B  B
2/3   2/3  0       1/3  1/3   Probability of having {yes, no, female, yes, A} attribute values given the risk Class B
Yes   Yes  Female  Yes  B  C
No    Yes  Female  Yes  B  C
No    Yes  Female  Yes  A  C
Yes   Yes  Female  Yes  A  C
0.5   0    1       1    0.5   Probability of having {yes, no, female, yes, A} attribute values given the risk Class C

Therefore the values of P(X|Ci) P(Ci) are zero for Classes B and C and 0.3 × 2/9 = 0.067 for Class A. Therefore X is assigned to Class A. It is unfortunate that in this example two of the probabilities came out to be zero. This is most unlikely in practice.
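The computation above can be reproduced with a short Python sketch in which all frequencies are estimated directly from the ten training records of Table 3.2 (the names are our own):

# Training data of Table 3.2: (Owns Home, Married, Gender, Employed, Credit Rating, Class)
data = [("Yes","Yes","Male","Yes","A","B"), ("No","No","Female","Yes","A","A"),
        ("Yes","Yes","Female","Yes","B","C"), ("Yes","No","Male","No","B","B"),
        ("No","Yes","Female","Yes","B","C"), ("No","No","Female","Yes","B","A"),
        ("No","No","Male","No","B","B"), ("Yes","No","Female","Yes","A","A"),
        ("No","Yes","Female","Yes","A","C"), ("Yes","Yes","Female","Yes","A","C")]

def score(x, c):
    """Estimate P(X|C)P(C) by relative frequencies in the training data."""
    class_rows = [r for r in data if r[-1] == c]
    prior = len(class_rows) / len(data)                    # P(C)
    likelihood = 1.0
    for i, value in enumerate(x):                          # naive independence assumption
        likelihood *= sum(1 for r in class_rows if r[i] == value) / len(class_rows)
    return prior * likelihood

x = ("Yes", "No", "Female", "Yes", "A")                    # the unseen applicant
print({c: round(score(x, c), 4) for c in ("A", "B", "C")}) # A: 0.0667, B: 0.0, C: 0.0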

Bayes' theorem assumes that all attributes are independent and that the training sample is a good sample to estimate probabilities. These assumptions are not always true in practice, as attributes are often correlated, but in spite of this the Naive Bayes method performs reasonably well. Other techniques have been designed to overcome this limitation. One approach is to use Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.

3.9 ESTIMATING PREDICTIVE ACCURACY OF CLASSIFICATION METHODS


The accuracy of a classification method is the ability of the method to correctly determine the class of a randomly selected data instance. It may be expressed as the probability of correctly classifying unseen data. Estimating the accuracy of a supervised classification method can be difficult if only the training data is available and all of that data has been used in building the model. In such situations, overoptimistic predictions are often made regarding the accuracy of the model.

Methods for estimating the accuracy of a classification method:

1. Holdout Method

The holdout method (sometimes called the test sample method) requires a training set and a test set. The sets are mutually exclusive. It may be that only one dataset is available, which has been divided into two subsets (perhaps 2/3 and 1/3), the training subset and the test or holdout subset. Once the classification method produces the model using the training set, the test set can be used to estimate the accuracy. Interesting questions arise in this estimation, since a larger training set would produce a better classifier while a larger test set would produce a better estimate of the accuracy. A balance must be achieved. Since none of the test data is used in training, the estimate is not biased, but a good estimate is obtained only if the training and test sets are large enough and representative of the whole population.

2. Random Sub-sampling Method

Random sub-sampling is very much like the holdout method except that it does not rely on a single test set. Essentially, the holdout estimation is repeated several times and the accuracy estimate is obtained by computing the mean of the several trials. Random sub-sampling is likely to produce better error estimates than the holdout method.

3. k-fold Cross-validation Method


In k-fold cross-validation, the available data is randomly divided into k disjoint subsets of approximately equal size. One of the subsets is then used as the test set and the remaining k-1 sets are used for building the classifier. The test set is then used to estimate the accuracy. This is done repeatedly k times so that each subset is used as a test subset once. The accuracy estimate is then the mean of the estimates for each of the classifiers. Cross-validation has been tested extensively and has been found to generally work well when sufficient data is available. A value of 10 for k has been found to be adequate and accurate.
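A minimal Python sketch of the k-fold procedure is given below; build_model and accuracy are placeholders to be supplied by the caller (for example, a decision tree builder and a function returning the fraction of correctly classified test objects), so this is an outline rather than a complete program.

import random

def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k roughly equal random folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(build_model, accuracy, data, k=10):
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = build_model(train)             # build the classifier on the k-1 training folds
        scores.append(accuracy(model, test))   # estimate accuracy on the held-out fold
    return sum(scores) / k                     # mean of the k accuracy estimates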

4. Leave-one-out Method

Leave-one-out is a simpler version of k-fold cross-validation. In this method, one of the training samples is taken out and the model is generated using the remaining training data. Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not. The average of such results provides an estimate of the accuracy. The leave-one-out method is useful when the dataset is small. For large training datasets, leave-one-out can become expensive since many iterations are required. Leave-one-out is unbiased but has high variance and is therefore not particularly reliable.

5. Bootstrap Method

In this method, given a dataset of size n, a bootstrap sample is randomly selected uniformly with replacement (that is, a sample may be selected more than once) by sampling n times and used to build a model. It can be shown that only 63.2% of these samples are unique. The error in building the model is estimated by using the remaining 36.8% of objects that are not in the bootstrap sample. The final error is then computed as a weighted combination of this test error and the error on the training sample, where the weights 0.632 and 0.368 are based on the assumption that if there were n samples available initially, from which n samples with replacement were randomly selected for training data, then the expected percentage of unique samples in the training data would be 0.632 or 63.2% and the number of remaining unique objects used in the test data would be 0.368 or 36.8% of the initial sample. This is repeated and the average of the error estimates is obtained. The bootstrap method is unbiased and, in contrast to leave-one-out, has low variance, but many iterations are needed for good error estimates if the sample is small, say 25 or less.
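The 63.2%/36.8% split can be illustrated with a few lines of Python (a small sketch; the function name is our own):

import random

def bootstrap_sample(n, seed=None):
    """Draw n indices uniformly with replacement; the rest form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

n = 1000
train, test = bootstrap_sample(n, seed=1)
print(len(set(train)) / n)   # about 0.632 of the objects are unique in the bootstrap sample
print(len(test) / n)         # about 0.368 are left out and can be used for testing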

3.10 IMPROVING ACCURACY OF CLASSIFICATION METHODS

Bootstrapping, bagging and boosting are techniques for improving the accuracy of classification results. They have been shown to be very successful for certain models, for example decision trees. All three involve combining several classification results from the same training data that has been perturbed in some way. The aim of building several decision trees by using training data that has been perturbed is to find out how these decision trees differ from those that have been obtained earlier.

In the bootstrapping method, it can be shown that all the samples selected are somewhat different
from each other since, on the average, only 63.2% of the objects in them are unique. The bootstrap
samples are then used for building decision trees which are then combined to form a single decision
tree.

Bagging (the name is derived from bootstrap aggregating) combines classification
results from multiple models or results of using the same method on several different sets of training
data. Bagging may also be used to improve the stability and accuracy of a complex classification
model with limited training data by using sub-samples obtained by resampling, with replacement, for
generating models.

Bagging essentially involves simple voting (with no weights), so the final model is the one
that is predicted by the majority of the trees. The different decision trees obtained during bagging
should not be so different if the training data is good and large enough. If the trees obtained are very
different, it only indicates instability, perhaps due to the training data being random. In such a
situation, bagging will often provide a better result than using the training data only once to build a
decision tree.
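
A minimal sketch of bagging by simple majority voting is given below; build_tree is a hypothetical function that builds one decision tree from a bootstrap sample and returns an object with a predict method.

    import random
    from collections import Counter

    def bagging_predict(data, labels, build_tree, new_object, n_trees=25, seed=1):
        # build_tree(train_data, train_labels) is an assumed helper returning a
        # model with a predict(obj) method; the final class is chosen by voting.
        rng = random.Random(seed)
        n = len(data)
        votes = []
        for _ in range(n_trees):
            idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample
            tree = build_tree([data[i] for i in idx], [labels[i] for i in idx])
            votes.append(tree.predict(new_object))
        return Counter(votes).most_common(1)[0][0]        # majority class wins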

There is a lot of literature available on bootstrapping, bagging, and boosting. This brief
introduction only provides a glimpse into these techniques but some of the points made in the
literature regarding the benefits of these methods are:

• These techniques can provide a level of accuracy that usually cannot be obtained by a large
single-tree model.
• Creating a single decision tree from a collection of trees in bagging and boosting is not
difficult.
• These methods can often help in avoiding the problem of overfitting since a number of trees
based on random samples are used.
• Boosting appears to be on the average better than bagging although it is not always so. On
some problems, bagging does better than boosting.

3.11 OTHER EVALUATION CRITERIA FOR CLASSIFICATION METHODS


The criteria for evaluation of classification methods are as follows:
1. Speed
2. Robustness
3. Scalability
4. Interpretability
5. Goodness of the model
6. Flexibility
7. Time complexity
Speed
Speed involves not just the time or computation cost of constructing a model (e.g. a decision
tree), it also includes the time required to learn to use the model. Obviously, a user wishes to
minimize both times although it has to be understood that any significant data mining project will
take time to plan and prepare the data. If the problem to be solved is large, a careful study of the
methods available may need to be carried out so that an efficient classification method may be
chosen.

Robustness
Data errors are common, in particular when data is being collected from a number of sources
and errors may remain even after data cleaning. It is therefore desirable that a method be able to
produce good results in spite of some errors and missing values in datasets.

Scalability
Many data mining methods were originally designed for small datasets. Many have been
modified to deal with large problems. Given that large datasets are becoming common, it is desirable
that a method continues to work efficiently for large disk-resident databases as well.

Interpretability
The task of a data mining professional is to ensure that the results of data mining are explained to the
decision makers. It is therefore desirable that the end-user be able to understand and gain insight
from the results produced by the classification method.

Goodness of the Model

For a model to be effective, it needs to fit the problem that is being solved. For example, in a
decision tree classification, it is desirable to find a decision tree of the "right" size and compactness
with high accuracy.

3.12 CLASSIFICATION SOFTWARE


A more comprehensive classification software list is available at the kdnuggets site:
http://www.kdnuggets.com/software/classification.html
• C4.5, version 8 of the "classic" decision-tree tool, developed by J. R. Quinlan (free, restricted
distribution), is available at: http://www.rulequest.com/Personal/
• CART 5.0 and TreeNet from Salford Systems are well-known decision tree software
packages. TreeNet provides boosting. CART is the decision tree software. The packages
incorporate facilities for data pre-processing and predictive modeling including bagging and
arcing. For more details visit: http://www.salford-systems.com/
• DTREG, from a company with the same name, generates classification trees when the classes
are categorical, and regression decision trees when the classes are numerical intervals, and
finds the optimal tree size. In both cases, the attribute values may be discrete or numerical.
Software modules TreeBoost and Decision Tree Forest generate an ensemble of decision
trees. In TreeBoost, each tree is generated based on input from a previous tree while Decision
Tree Forest generates the trees of an ensemble independently of each other. The ensemble is
then combined. For more details visit: http://www.dtreg.com/
• SMILES provides new splitting criteria, non-greedy search, new partitions, and extraction of
several different solutions. It offers a quite effective handling of (misclassification and
test) costs. SMILES also uses boosting and cost-sensitive learning. For more details visit:
http://www.dsic.upv.es/~flin/smiles/
• NBC: a simple Naive Bayes Classifier, written in awk. For more details visit:
http://scant.org/nbc/nbc.html

CONCLUSION

In this chapter we discussed supervised classification, which is an extensively studied problem in
statistics and machine learning. Classification is probably the most widely used data mining
technique.

We described the decision tree approach to classification using the information measure and the Gini
index for splitting attributes. In the decision tree approach, decisions are made locally by considering
one attribute at a time, thus considering the most important attribute first. We discussed decision tree
pruning and testing. Another classification method, the Naive Bayes method, has also been described.

REVIEW QUESTIONS
1. What kind of data is the decision tree method most suitable for?
2. Briefly outline the major steps of the algorithm to construct a decision tree.
3. Assume that we have 10 training samples. There are four classes A, B, C and D. Compute the
information in the samples for the five training datasets given below (each row is a dataset
and each dataset has 10 objects) when the number of samples in each class are:

Class       A   B   C   D
Dataset 1   1   1   1   7
Dataset 2   2   2   2   4
Dataset 3   3   3   3   1
Dataset 4   1   2   3   4
Dataset 5   0   0   1   9

Use the log values from the following table:

p      log(p)   -p log(p)
0.1    -3.32     0.332
0.2    -2.32     0.464
0.3    -1.74     0.521
0.4    -1.32     0.529
0.5    -1.00     0.500
0.6    -0.74     0.442
0.7    -0.51     0.360
0.8    -0.32     0.258
0.9    -0.15     0.137

4. What is meant by overfitting a decision tree? What problems can overfitting lead to?


5. List five criteria for evaluating the classification methods. Discuss them briefly.
6. Describe three methods of estimating accuracy of a classification method.
7. Explain the terms bootstrapping, bagging, and boosting for improving the accuracy of a
classification method.
8. What is the difference between bootstrapping, bagging and boosting?


CHAPTER 4

CLUSTER ANALYSIS
Learning Objectives
1. Explain what cluster analysis is

2. Describe some desirable features of a cluster analysis method


3. Describe the types of cluster analysis techniques available
4. Describe the K-means method, a partitioning technique
5. Describe two hierarchical techniques - the Agglomerative method and the Divisive method.
4.1 WHAT IS CLUSTER ANALYSIS?
We like to organize observations or objects or things (e.g. plants, animals, chemicals) into
meaningful groups so that we are able to make comments about the groups rather than individual
objects. Such groupings are often rather convenient since we can talk about a small number of
groups rather than a large number of objects although certain details are necessarily lost because
objects in each group are not identical. A classical example of a grouping is the chemical periodic
table where chemical elements are grouped into rows and columns such that elements adjacent to
each other within a group have similar physical properties. For example, the elements in the periodic
table are grouped as:
1. Alkali metals
2. Actinide series
3. Alkaline earth metals
4. Other metals
5. Transition metals
6. Nonmetals
7. Lanthanide series
8. Noble gases
Elements in each of these groups are similar but dissimilar to elements in other groups.
The aim of cluster analysis is exploratory, to find if data naturally falls into meaningful groups
with small within-group variations and large between-group variation. Often we may not have a
hypothesis that we are trying to test. The aim is to find any interesting grouping of the data. It is
possible to define cluster analysis as an optimization problem in which a given function consisting of
within cluster (intra-cluster) similarity and between clusters (inter-cluster) dissimilarity needs to be
optimized. This function can be difficult to define and the optimization of any such function is a
challenging task; clustering methods therefore only try to find an approximate or local optimum
solution.

4.2 DESIRED FEATURES OF CLUSTER ANALYSIS


Given that there are a large number of cluster analysis methods on offer, we make a list of desired
features that an ideal cluster analysis method should have. The list is given below:
1. (For large datasets) Scalability: Data mining problems can be large and therefore it is
desirable that a cluster analysis method be able to deal with small as well as large problems
gracefully. Ideally, the performance should be linear with the size of the data. The method
should also scale well to datasets in which the number of attributes is large.
2. (For large datasets) Only one scan of the dataset: For large problems, the data must be
stored on the disk and the cost of I/O from the disk can then become significant in solving
the problem. It is therefore desirable that a cluster analysis method not require more than
one scan of the disk-resident data.
3. (For large datasets) Ability to stop and resume: When the dataset is very large, cluster
analysis may require considerable processor time to complete the task. In such cases, it is
desirable that the task be able to be stopped and then resumed when convenient.
4. Minimal input parameters: The cluster analysis method should not expect too much
guidance from the user. A data mining analyst may be working with a dataset about which
his/her knowledge is limited. It is therefore desirable that the user not be expected to have
domain knowledge of the data and not be expected to possess insight into clusters that
might exist in the data.
5. Robustness: Most data obtained from a variety of sources has errors. It is therefore
desirable that a cluster analysis method be able to deal with noise, outliers and missing
values gracefully.
6. Ability to discover different cluster shapes: Clusters come in different shapes and not all
clusters are spherical. It is therefore desirable that a cluster analysis method be able to
discover cluster shapes other than spherical. Some applications require that various shapes
be considered.
7. Different data types: Many problems have a mixture of data types, for example, numerical,
categorical and even textual. It is therefore desirable that a cluster analysis method be able
to deal with not only numerical data but also Boolean and categorical data.
8. Result independent of data input order: Although this is a simple requirement, not all
methods satisfy it. It is therefore desirable that a cluster analysis method not be sensitive to
data input order; whatever the order, the result of cluster analysis of the same data should
be the same.

4.3 TYPES OF DATA


Datasets come in a number of different forms. The data may be quantitative, binary, nominal or
ordinal.

1. Quantitative (or numerical) data is quite common, for example, weight, marks, height,
price, salary, and count. There are a number of methods for computing similarity between
quantitative data.
2. Binary data is also quite common, for example, gender, and marital status. Computing
similarity or distance between categorical variables is not as simple as for quantitative data
but a number of methods have been proposed. A simple method involves counting how many
attribute values of the two objects are different amongst n attributes and using this as an
indication of distance.
3. Qualitative nominal data is similar to binary data except that it may take more than two values but
has no natural order, for example, religion, food or colours. For nominal data too, an
approach similar to that suggested for computing distance for binary data may be used.
4. Qualitative ordinal (or ranked) data is similar to nominal data except that the data has an
order associated with it, for example, grades A, B, C, D, and sizes S, M, L, and XL. The problem
of measuring distance between ordinal variables is different than for nominal variables since
the order of the values is important. One method of computing distance involves transferring
the values to numeric values according to their rank. For example, grades A, B, C, D could be
transformed to 4.0, 3.0, 2.0 and 1.0 and then one of the methods in the next section may be
used.
4.4 COMPUTING DISTANCE
Distance is a well understood concept that has a number of simple properties.
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance from x to
some other point z and the distance from z to y.
4. Distance from x to y is always the same as from y to x.
Let the distance between two points x and y (both vectors) be D(x,y). We now define a number of
distance measures.
Euclidean distance
Euclidean distance, or the L2 norm of the difference vector, is most commonly used to
compute distances and has an intuitive appeal, but the largest valued attribute may dominate the
distance. It is therefore essential that the attributes are properly scaled.

D(x,y) = ( Σi (xi - yi)^2 )^(1/2)

It is possible to use this distance measure without the square root if one wanted to place
greater weight on differences that are large.
A Euclidean distance measure is more appropriate when the data is not standardized, but as
noted above the distance measure can be greatly affected by the scale of the data.

Manhattan distance
Another commonly used distance metric is the Manhattan distance or the L1 norm of the
difference vector. In most cases, the results obtained by the Manhattan distance are similar to those
obtained by using the Euclidean distance. Once again the largest-valued attribute can dominate the
distance, although not as much as in the Euclidean distance.

D(x,y) = Σi |xi - yi|

Chebychev distance
This distance metric is based on the maximum attribute difference. It is also called the L∞ norm of
the difference vector.

D(x,y) = maxi |xi - yi|

Categorical data distance

This distance measure may be used if many attributes have categorical values with only a small
number of values (e.g. binary values). Let N be the total number of categorical attributes.

D(x,y) = ( number of attributes for which xi and yi differ ) / N
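
These distance measures can be written directly in Python; the short sketch below assumes that the objects are given as equal-length sequences of attribute values.

    from math import sqrt

    def euclidean(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def chebychev(x, y):
        return max(abs(a - b) for a, b in zip(x, y))

    def categorical(x, y):
        # fraction of attributes on which the two objects disagree
        return sum(a != b for a, b in zip(x, y)) / len(x)

    # The two objects from review question 3 of this chapter:
    print(euclidean((1, 6, 2, 5, 3), (3, 5, 2, 6, 6)))   # about 3.87
    print(manhattan((1, 6, 2, 5, 3), (3, 5, 2, 6, 6)))   # 7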


4.5 TYPES OF CLUSTER ANALYSIS METHODS
The cluster analysis methods may be divided into the following categories:
Partitional methods
Partitional methods obtain a single level partition of objects. These methods usually are based on
greedy heuristics that are used iteratively to obtain a local optimum solution. Given n objects, these
methods make k ≤ n clusters of data and use an iterative relocation method. It is assumed that each
cluster has at least one object and each object belongs to only one cluster. Objects may be relocated between clusters
as the clusters are refined. Often these methods require that the number of clusters be specified
a priori, and this number usually does not change during the processing.
Hierarchical methods
Hierarchical methods obtain a nested partition of the objects resulting in a tree of clusters. These
methods either start with one cluster and then split it into smaller and smaller clusters (called divisive
or top down) or start with each object in an individual cluster and then try to merge similar clusters
into larger and larger clusters (called agglomerative or bottom up). In this approach, in contrast to
partitioning, tentative clusters may be merged or split based on some criteria.
Density-based methods
In this class of methods, typically for each data point in a cluster, at least a minimum number of
points must exist within a given radius. Density-based methods can deal with arbitrary shape clusters
since the major requirement of such methods is that each cluster be a dense region of points
surrounded by regions of low density.

Grid-based methods
In this class of methods, the object space rather than the data is divided into a grid. Grid partitioning
is based on characteristics of the data and such methods can deal with non-numeric data more easily.
Grid-based methods are not affected by data ordering.
Model-based methods
A model is assumed, perhaps based on a probability distribution. Essentially, the algorithm tries to
build clusters with a high level of similarity within them and a low level of similarity between them.
Similarity measurement is based on the mean values and the algorithm tries to minimize the squared-
error function.
A simple taxonomy of cluster analysis methods is presented in Figure 4.1.

Figure 4.1 Taxonomy of cluster analysis methods


4.6 PARTITIONAL METHODS

Partitional methods are popular since they tend to be computationally efficient and are more
easily adapted for very large datasets. The hierarchical methods tend to be computationally more
expensive.

The aim of partitional methods is to reduce the variance within each cluster as much as
possible and have large variance between the clusters. Since the partitional methods do not normally
explicitly control the inter-cluster variance, heuristics (e.g. choosing seeds as far apart as possible)
may be used for ensuring large inter-cluster variance. One may therefore consider the aim to be
minimizing a ratio like a/b where a is some measure of within cluster variance and b is some measure
of between cluster variation.

The K-Means Method

K-Means is the simplest and most popular classical clustering method that is easy to implement. The
classical method can only be used if the data about all the objects is located in the main memory. The
method is called K-Means since each of the K clusters is represented by the mean of the objects
(called the centroid) within it. It is also called the centroid method since at each step the centroid
point of each cluster is assumed to be known and each of the remaining points is allocated to the
cluster whose centroid is closest to it. Once this allocation is completed, the centroids of the clusters
are recomputed using simple means and the process of allocating points to each cluster is repeated
until there is no change in the clusters (or some other stopping criterion, e.g. no significant reduction
in the squared error, is met). The method may also be viewed as a search problem where the aim is
essentially to find the optimum clusters given the number of clusters and seeds specified by the user.
Obviously, we cannot use a brute-force or exhaustive search method to find the optimum, so we
consider solutions that may not be optimal but may be computed efficiently.

The K-means method uses the Euclidean distance measure, which appears to work well with
compact clusters. If instead of the Euclidean distance, the Manhattan distance is used the method is
called the K-median method. The K-median method can be less sensitive to outliers.

The K-means method may be described as follows:


1. Select the number of clusters. Let this number be k.
2. Pick k seeds as centroids of the k clusters. The seeds may be picked randomly unless
the user has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the
centroids.
4. Allocate each object to the cluster it is nearest to based on the distances computed in
the previous step.
5. Compute the centroids of the clusters by computing the means of the attribute values of
the objects in each cluster.
6. Check if the stopping criterion has been met (e.g. the cluster membership is unchanged).
If yes, go to Step 7. If not, go to Step 3.
7. [Optional] One may decide to stop at this stage or to split a cluster or combine two
clusters heuristically until a stopping criterion is met.
The method is scalable and efficient (the time complexity is of the order O(n)) and is guaranteed to
find a local minimum.
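
The steps above translate directly into code. A minimal sketch of the K-means method for points given as tuples of numbers is shown below; taking the first k points as seeds is an assumption made only for brevity.

    from math import sqrt

    def kmeans(points, k, max_iter=100):
        def dist(a, b):
            return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        centroids = [list(p) for p in points[:k]]           # Step 2: first k points as seeds
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:                                 # Steps 3-4: allocate to nearest centroid
                nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
                clusters[nearest].append(p)
            new_centroids = []
            for i, cluster in enumerate(clusters):           # Step 5: recompute the centroids
                if cluster:
                    new_centroids.append([sum(c) / len(cluster) for c in zip(*cluster)])
                else:
                    new_centroids.append(centroids[i])       # keep old centroid if cluster is empty
            if new_centroids == centroids:                   # Step 6: stop when membership is stable
                break
            centroids = new_centroids
        return clusters, centroids

    print(kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], k=2))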

Scaling and weighting


For clustering to be effective, all attributes should be converted to a similar scale unless
we want to give more weight to some attributes that are relatively large in scale. There are a number
of ways to transform the attributes. One possibility is to transform them all to a normalized score or
to a range (0,1). Such transformations are called scaling. Some other approaches to scaling are given
below:
1. Divide each attribute by the mean value of that attribute. This reduces the mean of each
attribute to 1. It does not control the variation; some values may still be large, others small.
2. Divide each attribute by the difference between the largest value and the smallest value. This
scaling increases the mean of attributes that have a small range of values but does not reduce
each attribute's mean to the same value. It reduces the difference between the largest value
and the smallest value to 1 and therefore does control the variation.
3. Convert the attribute values to "standardized scores" by subtracting the mean of the attribute
from each attribute value and dividing it by the standard deviation. Now the mean of each
attribute will be zero and the standard deviation one. This not only scales the magnitude of each
attribute to a similar range but also scales the standard deviation (a short sketch of this
approach follows the list).
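
A brief sketch of the third approach (standardized scores) for a single numeric attribute might look as follows; it assumes the attribute is not constant.

    def standardize(values):
        # convert attribute values to standardized scores: mean 0, standard deviation 1
        # (assumes the standard deviation is non-zero, i.e. the attribute is not constant)
        n = len(values)
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        return [(v - mean) / std for v in values]

    print(standardize([10, 20, 30, 40, 50]))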

Starting values for the K-means method

Often the user has little basis for specifying the number of clusters and starting seeds. This problem
may be overcome by using an iterative approach. For example, one may first try three clusters and
choose three starting seeds randomly. Once the final clusters have been obtained, the process may be
repeated with a different set of seeds. Attempts should be made to select seeds that are as far away
from each other as possible. Also, during the iterative process if two clusters are found to be close
together, it may be desirable to merge them. Also, a large cluster may be split into two if the
variance within the cluster is above some threshold value.

Another approach involves finding the centroid of the whole dataset and then perturbing this
centroid value to find seeds. Yet another approach recommends using a hierarchical method like the
agglomerative method on the data first, since that method does not require starting values, and then
using the results of that method as the basis for specifying the number of clusters and starting seeds.

Summary of the K-means method

K-means is an iterative-improvement greedy method. A number of iterations are normally needed for
convergence and therefore the dataset is processed a number of times. If the data is very large and
cannot be accommodated in the main memory the process may become inefficient.

Although the K-means method is most widely known and used, there are a number of issues
related to the method that should be understood:

1. The K-means method needs to compute Euclidean distances and means of the attribute
values of objects within a cluster. The classical algorithm therefore is only suitable for
continuous data. K-means variations that deal with categorical data are available but
not widely used.
2. The K-means method implicitly assumes spherical probability distributions.
3. The results of the K-means method depend strongly on the initial guesses of the seeds.
4. The K-means method can be sensitive to outliers. If an outlier is picked as a starting seed, it
may end up in a cluster of its own. Also, if an outlier moves from one cluster to another
during iterations, it can have a major impact on the clusters because the means of the two
clusters are likely to change significantly.
5. Although some local optimum solutions discovered by the K-means method are satisfactory,
often the local optimum is not as good as the global optimum.
6. The K-means method does not consider the size of the clusters. Some clusters may be large
and some very small.
7. The K-means method does not deal with overlapping clusters.

Expectation Maximization Method

The K-means method does not explicitly assume any probability distribution for the
attribute values. It only assumes that the dataset consists of groups of objects that are similar and the
groups can be discovered because the user has provided cluster seeds.

In contrast to the K-means method, the Expectation Maximization (EM) method is based on
the assumption that the objects in the dataset have attributes whose values are distributed according
to some (unknown) linear combination (or mixture) of simple probability distributions. While the K-
means method involves assigning objects to clusters to minimize within-group variation, the EM
method assigns objects to different clusters with certain probabilities in an attempt to maximize
expectation (or likelihood) of assignment.

The simplest situation is when there are only two distributions. For every individual we may
assume that it comes from distribution 1 with probability p and therefore from distribution 2 with
probability 1-p. Such mixture models appear to be widely used because they provide more
parameters and therefore more flexibility in modeling.

The EM method consists of a two-step iterative algorithm. The first step, called the
Estimation step or the E-step, involves estimating the probability distributions of the clusters given
the data. The second step, called the Maximization step or the M-step, involves finding the model
parameters that maximize the likelihood of the solution.

The EM method assumes that all attributes are independent random variables. In a simple
case of just two clusters with objects having only a single attribute, we may assume that the attribute
values vary according to a normal distribution. The EM method requires that we now estimate the
following parameters:

1. The mean and standard deviation of the normal distribution for cluster 1
2. The mean and standard deviation of the normal distribution for cluster 2
3. The probability p of a sample belonging to cluster 1 and therefore the probability 1-p of
belonging to cluster 2
The EM method then works as follows (a sketch of this procedure is given after the steps):
1. Guess the initial values of the five parameters (the two means, the two standard deviations
and the probability p) given above.
2. Use the two normal distributions (given the two guesses of means and two guesses of
standard deviations) and compute the probability of each object belonging to each of the two
clusters.
3. Compute the likelihood of the data coming from these two clusters by multiplying the sum of
the probabilities of each object.
4. Re-estimate the five parameters and go to Step 2 until a stopping criterion has been met.
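
A minimal sketch of this procedure for a single attribute and two clusters is given below; the initial guesses and the fixed number of iterations are simplifying assumptions, not part of the method's definition.

    from math import exp, pi, sqrt

    def normal_pdf(x, mean, std):
        return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

    def em_two_clusters(data, iterations=50):
        # Step 1: initial guesses for the five parameters
        m1, m2 = min(data), max(data)
        s1 = s2 = (max(data) - min(data)) / 4 or 1.0
        p = 0.5
        for _ in range(iterations):
            # E-step: probability of each object belonging to cluster 1
            r = []
            for x in data:
                a = p * normal_pdf(x, m1, s1)
                b = (1 - p) * normal_pdf(x, m2, s2)
                r.append(a / (a + b))
            # M-step: re-estimate the five parameters from these probabilities
            n1 = sum(r)
            n2 = len(data) - n1
            m1 = sum(ri * x for ri, x in zip(r, data)) / n1
            m2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
            s1 = sqrt(sum(ri * (x - m1) ** 2 for ri, x in zip(r, data)) / n1) or 1e-6
            s2 = sqrt(sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, data)) / n2) or 1e-6
            p = n1 / len(data)
        return (m1, s1), (m2, s2), p

    print(em_two_clusters([1.0, 2.0, 1.5, 8.0, 9.0, 8.5]))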

4.7 HIERARCHICAL METHODS
Hierarchical methods produce a nested series of clusters as opposed to the partitional methods
which produce only a flat set of clusters. Essentially the hierarchical methods attempt to capture the
structure of the data by constructing a tree of clusters. This approach allows clusters to be found at
different levels of granularity.
There are two types of hierarchical approaches possible. In one approach, called the
agglomerative approach for merging groups (or bottom-up approach), each object at the start is a
cluster by itself and the nearby clusters are repeatedly merged resulting in larger and larger clusters
until some stopping criterion (often a given number of clusters) is met or all the objects are merged
into a single large cluster which is the highest level of the hierarchy. In the second approach, called
the divisive approach (or the top-down approach), all the objects are put in a single cluster to start.
The method then repeatedly performs splitting of clusters resulting in smaller and smaller clusters
until a stopping criterion is reached or each cluster has only one object in it.

Distance Between Clusters

The hierarchical clustering methods require distances between clusters to be computed. These
distance metrics are often called linkage metrics.

Computing distances between large clusters can be expensive. Suppose one cluster has 50
objects and another has 100, then computing most of the distance metrics listed below would require
computing distances between each object in the first cluster with every object in the second.
Therefore 5000 distances would need to be computed just to compute a distance between two
clusters. This can be expensive if each object has many attributes.

We will discuss the following methods for computing distances between clusters:
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average-link algorithm
5. Ward's minimum-variance algorithm

Single-link

The single-link (or the nearest neighbour) algorithm is perhaps the simplest algorithm for
computing distance between two clusters. The algorithm determines the distance between two
clusters as the minimum of the distances between all pairs of points (a,x) where a is from the first
cluster and x is from the second. The algorithm therefore requires that all pairwise distances be
computed and the smallest distance (or the shortest link) found. The algorithm can form chains and
can form elongated clusters.
Figure 4.2 shows two clusters A and B and the single-link distance between them.

Figure 4.2 Single-link distance between two clusters.



Complete-link

The complete-link algorithm is also called the farthest neighbour algorithm. In this
algorithm, the distance between two clusters is defined as the maximum of the pairwise distances
(a,x). Therefore if there are m elements in one cluster and n in the other, all mn pairwise distances
must be computed and the largest chosen.

Complete-link is strongly biased towards compact clusters. Figure 4.3 shows two clusters A
and B and the complete-link distance between them. Complete-link can be distorted by moderate
outliers in one or both of the clusters.

Figure 4.3 Complete-link distance between two clusters.


Both single-link and complete-link measures have their difficulties. In the single-link
algorithm, each cluster may have an outlier and the two outliers may be nearby and so the distance
between the two clusters would be computed to be small. Single-link can form a chain of objects as
clusters are combined, since there is no constraint on the distance between objects that are far away
from each other.

Centroid

In the centroid algorithm the distance between two clusters is determined as the distance
between the centroids of the clusters as shown below. The centroid algorithm computes the distance
between two clusters as the distance between the average point of each of the two clusters. Usually
the squared Euclidean distance between the centroids is used. This approach is easy and generally
works well and is more tolerant of somewhat longer clusters than the complete-link algorithm.
Figure 4.4 shows two clusters A and B and the centroid distance between them.

Figure 4.4 The Centroid distance between two clusters.


Average-link
The average-link algorithm on the other hand computes the distance between two clusters as
the average of all pairwise distances between an object from one cluster and another from the other
cluster. Therefore if there are m elements in one cluster and n in the other, there are mn distances to
be computed, added and divided by mn. This approach also generally works well. It tends to join
clusters with small variances although it is more tolerant of somewhat longer clusters than the
complete-link algorithm. Figure 4.5 shows two clusters A and B and the average-link distance
between them.

Figure 4.5 The Average-link distance between two clusters.



Ward's minimum-variance method

Ward's minimum-variance distance measure on the other hand is different. The method
generally works well and results in creating small tight clusters. Ward's distance is the difference
between the total within-cluster sum of squares for the two clusters separately and the within-cluster
sum of squares resulting from merging the two clusters.
An expression for Ward's distance may be derived. It may be written as follows:
DW(A,B) = NA NB DC(A,B) / (NA + NB)
where DW(A,B) is the Ward's minimum-variance distance between clusters A and B with NA and
NB objects in them respectively. DC(A,B) is the centroid distance between the two clusters computed
as the squared Euclidean distance between the centroids. It has been observed that the Ward's method
tends to join clusters with a small number of objects and is biased towards producing clusters with
roughly the same number of objects. The distance measure can be sensitive to outliers.
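
For example, the formula can be evaluated directly from the two cluster sizes and the squared Euclidean distance between the centroids, as in the short sketch below.

    def ward_distance(centroid_a, centroid_b, n_a, n_b):
        # Ward's distance from the cluster sizes and the squared Euclidean
        # distance between the two centroids
        dc = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
        return n_a * n_b * dc / (n_a + n_b)

    print(ward_distance((1.0, 2.0), (4.0, 6.0), 10, 5))   # 10 * 5 * 25 / 15 = 83.33...
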
Agglomerative Method
Some applications naturally have a hierarchical structure. For example, the world's fauna
and flora have a hierarchical structure. The agglomerative clustering method tries to discover such
structure given a dataset.
The basic idea of the agglomerative method is to start out with n clusters for n data points,
that is, each cluster consisting of a single data point. Using a measure of distance, at each step of the
method, the method merges the two nearest clusters, thus reducing the number of clusters and building
larger and larger clusters until the required number of clusters has been obtained or all the data points
are in one cluster. The agglomerative method leads to hierarchical
clusters in which at each step we build larger and larger clusters that include increasingly dissimilar
objects.
The agglomerative method is basically a bottom-up approach which involves the following steps (a
sketch of these steps is given after the list):
1. Allocate each point to a cluster of its own. Thus we start with n clusters for n objects.
2. Create a distance matrix by computing distances between all pairs of clusters either using, for
example, the single-link metric or the complete-link metric. Some other metric may also be
used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of clusters and merge them.
5. If there is only one cluster left then stop.
6. Compute all distances from the new cluster and update the distance matrix after the merger
and go to Step 3.
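
A minimal sketch of these steps using the single-link metric is given below; it simply merges the closest pair of clusters until the requested number of clusters remains, without maintaining an explicit sorted distance matrix.

    from math import sqrt

    def single_link(c1, c2):
        # distance between two clusters = smallest pairwise distance
        return min(sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
                   for p in c1 for q in c2)

    def agglomerative(points, n_clusters=2):
        clusters = [[p] for p in points]                    # Step 1: one cluster per point
        while len(clusters) > n_clusters:
            # Steps 2-3: find the pair of clusters with the smallest distance
            i, j = min(((a, b) for a in range(len(clusters))
                               for b in range(a + 1, len(clusters))),
                       key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
            clusters[i] = clusters[i] + clusters[j]         # Step 4: merge the pair
            del clusters[j]
        return clusters                                     # Steps 5-6: repeat until done

    print(agglomerative([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)], n_clusters=2))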

Divisive Hierarchical Method


The divisive method is the opposite of the agglomerative method in that the method starts
with the whole dataset as one cluster and then proceeds to recursively divide the cluster into two sub-
clusters and continues until each cluster has only one object or some other stopping criterion has
been reached. There are two types of divisive methods:
1. Monothetic: It splits a cluster using only one attribute at a time. An attribute that has
the most variation could be selected.
2. Polythetic: It splits a cluster using all of the attributes together. Two clusters far apart
could be built based on distance between objects.
A typical polythetic divisive method works as follows:
1. Decide on a method of measuring the distance between two objects. Also decide a
threshold distance.
2. Create a distance matrix by computing distances between all pairs of objects within
the cluster. Sort these distances in ascending order.
3. Find the two objects that have the largest distance between them. They are the most
dissimilar objects.
4. If the distance between the two objects is smaller than the pre-specified threshold and
there is no other cluster that needs to be divided then stop, otherwise continue.
5. Use the pair of objects as seeds of a K-means method to create two new clusters.
6. If there is only one object in each cluster then stop, otherwise continue with Step 2.
In the above method, we need to resolve the following two issues:
• Which cluster to split next?
• How to split a cluster?
Which cluster to split next?
There are a number of possibilities when selecting the next cluster to split:
1. Split the clusters in some sequential order.
2. Split the cluster that has the largest number of objects.
3. Split the cluster that has the largest variation within it.
How to split a cluster?
A distance matrix is created and the two most dissimilar objects are selected as seeds of two
new clusters. The K-means method is then used to split the cluster.
Advantages of the hierarchical approach
1. The hierarchical approach can provide more insight into the data by showing a hierarchy of
clusters, in contrast to the flat cluster structure created by a partitioning method like the K-means
method.
2. Hierarchical methods are conceptually simpler and can be implemented easily.
3. In some applications only proximity data is available and then the hierarchical approach may
be better.
4. Hierarchical methods can provide clusters at different levels of granularity.



Disadvantages of the hierarchical approach

1. The hierarchical methods do not include a mechanism by which objects that have been
incorrectly put in a cluster may be reassigned to another cluster.
2. The time complexity of hierarchical methods can be shown to be O(n³).
3. The distance matrix requires O(n²) space and becomes very large for a large number of
objects.
4. Different distance metrics and scaling of data can significantly change the results.

4.8 DENSITY-BASED METHODS


The density-based methods are based on the assumption that clusters are high density collections
of data of arbitrary shape that are separated by a large space of low density data (which is assumed to
be noise).

DBSCAN (density based spatial clustering of applications with noise) is one example of a
density-based method for clustering. The method was designed for spatial databases but can be used
in other applications. It requires two input parameters: the size of the neighbourhood (R) and the
minimum points in the neighbourhood (N). Essentially these two parameters determine the density
within the clusters the user is willing to accept since they specify how many points must be in a
region. The number of points not only determines the density of acceptable clusters but it also
determines which objects will be labeled outliers or noise. Objects are declared to be outliers if there
are few other objects in their neighbourhood. The size parameter R determines the size of the clusters
found. If R is big enough, there would be one big cluster and no outliers. If R is small, there will be
small dense clusters and there might be many outliers.
We now define a number of concepts that are required in the DBSCAN method:
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that are
within the radius R from y.
2. Core object: An object y is called a core object if there are N objects within its
neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong to the
same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied:
(a) The objects are close enough to each other, i.e. within a distance of R.
(b) x2 is a core object as defined above.
4. Connectivity: Two objects x1 and xn are connected if there is a path or chain of objects x1,
x2, ..., xn from x1 to xn such that each xi+1 is in proximity to object xi.
We now outline the basic algorithm for density-based clustering (a simplified sketch is given after
the steps):
1. Select values of R and N.
2. Arbitrarily select an object p.

3. Retrieve all objects that are connected to p, given R and N.
4. If p is a core object, a cluster is formed.
5. If p is a border object, no objects are in its proximity. Choose another object. Go to Step 3.
6. Continue the process until all of the objects have been processed.
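
A simplified sketch of this procedure is given below; it labels each object with a cluster number or marks it as noise (-1), and is only meant to illustrate how R and N are used.

    from math import sqrt

    def dbscan(points, R, N):
        def neighbours(i):
            return [j for j in range(len(points))
                    if sqrt(sum((a - b) ** 2
                                for a, b in zip(points[i], points[j]))) <= R]

        labels = [None] * len(points)          # None = unvisited, -1 = noise/outlier
        cluster = 0
        for i in range(len(points)):
            if labels[i] is not None:
                continue
            nb = neighbours(i)
            if len(nb) < N:                    # not a core object
                labels[i] = -1
                continue
            cluster += 1                       # a core object starts a new cluster
            labels[i] = cluster
            queue = [j for j in nb if j != i]
            while queue:                       # grow the cluster through connected objects
                j = queue.pop()
                if labels[j] == -1:
                    labels[j] = cluster        # a border object reached from a core object
                if labels[j] is not None:
                    continue
                labels[j] = cluster
                nb_j = neighbours(j)
                if len(nb_j) >= N:             # j is itself a core object: expand further
                    queue.extend(k for k in nb_j if labels[k] is None)
        return labels

    print(dbscan([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (50, 50)], R=2.0, N=2))
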
4.9 DEALING WITH LARGE DATABASES
Most clustering methods implicitly assume that all data is accessible in the main memory. Often
the size of the database is not considered, but a method requiring multiple scans of data that is disk-
resident could be quite inefficient for large problems.
One possible approach to deal with large datasets that could be used with any type of
clustering method is to draw as large a sample from the large dataset as could be accommodated in
the main memory. The sample is then clustered. Each remaining object is then assigned to the nearest
cluster obtained from the sample. This process could be repeated several times and the clusters that
lead to the smallest within-cluster variance would be chosen.
K-Means Method for Large Databases
This method first picks the number of clusters and their seed centroids and then attempts to classify
each object into one of the following three groups:
(a) Those that are certain to belong to a cluster. These objects together are called the discard
set. Some information about these objects is computed and saved. This includes the number
of objects n, a vector sum of all attribute values of the n objects (a vector S) and a vector
sum of squares of all attribute values of the n objects (a vector Q). These values are
sufficient to recompute the centroid of the new cluster and its variance.
(b) Those that are sufficiently close to each other to be replaced by their summary. The objects
are however sufficiently far away from each cluster's centroid that they cannot yet be put in
the discard set of objects. These objects together are called the compression set.
(c) The remaining objects are too difficult to assign to either of the two groups above. These
objects are called the retained set and are stored as individual objects. They cannot be
replaced by a summary.
Hierarchical Method for Large Databases - Concept of Fractionation
Dealing with large datasets is difficult using hierarchical methods since the methods require an
N x N distance matrix to be computed for N objects. If N is large, say 100,000, the matrix has 10^10
elements making it impractical to use a classical hierarchical method.

A modification of classical hierarchical methods that deals with large datasets was proposed
in 1992. It is based on the idea of splitting the data into manageable subsets called "fractions" and
then applying a hierarchical method to each fraction. The concept is called fractionation. The basic
algorithm used in the method is as follows, assuming that M is the largest number of objects that the
hierarchical method may be applied to. The size M may be determined perhaps based on the size of
the main memory.
Now the algorithm:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction. Let the number
of clusters so obtained from all the fractions be C.
3. For each of the C clusters, compute the mean of the attribute values of the objects in it. Let
this mean vector be mi, i = 1, ..., C. These cluster means are called meta-observations. The
meta-observations now become the data values that represent the fractions.
4. If the number of meta-observations C is too large (greater than M), go to Step 1, otherwise apply the
same hierarchical clustering technique to the meta-observations obtained in Step 3.
5. Allocate each object of the original dataset to the cluster with the nearest mean obtained in
Step 4.

4.10 QUALITY OF CLUSTER ANALYSIS METHODS

Assessing the quality of the clustering methods or the results of a cluster analysis is a challenging
task. The quality of a method involves a number of criteria:
1. Efficiency of the method.
2. Ability of the method to deal with noisy and missing data.
3. Ability of the method to deal with large problems.
4. Ability of the method to deal with a variety of attribute types and magnitudes.

4.11 CLUSTER ANALYSIS SOFTWARE

A more comprehensive list of cluster analysis software is available at
http://www.kdnuggets.com/software/clustering.html.

• ClustanGraphics7 from Clustan offers a variety of clustering methods including K-
means, density-based and hierarchical cluster analysis. The software provides facilities to
display results of clustering including dendrograms and scatterplots. For more details visit:
http://www.clustan.com/index.html
• CViz Cluster Visualization from IBM is a visualization tool designed for analyzing high-
dimensional data in large, complex data sets. It displays the most important factors relating
clusters of records. For more details visit: http://www.alphaworks.ibm.com/tech/cviz
• Autoclass, an unsupervised Bayesian classification system, based on SNOB mentioned
below, is available from NASA (free). For more details visit:
http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/
• Cluster 3.0, open source software originally developed by Michael Eisen at Stanford, uses
the K-means method, which includes multiple trials to find the best clustering solution. This
is crucial for the K-means method to be reliable. For more details visit:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
• CLUTO provides a set of clustering methods including partitional, agglomerative, and graph-
partitioning based on a variety of similarity/distance metrics. For more details visit:
http://www-users.cs.umn.edu/~karypis/cluto/ (free)

CONCLUSION

Cluster analysis is a collection of methods that assists the user in putting different objects
from a collection of objects into different groups. In some ways one could say that cluster analysis is
best used as an exploratory data analysis exercise when the user has no hypothesis to test. Cluster
analysis, therefore, can be used to uncover hidden structure which may assist further exploration.

We have discussed a number of clustering methods. In the K-means method it was required
that the user specify the number of clusters and starting seeds for each cluster. This may be difficult
to do without some insight into the data that the user may not have. One possible approach is to
combine a partitioning method like K-means with a hierarchical method like the
agglomerative method. The agglomerative method can then be used to understand the data better and
help in estimating the number of clusters and the starting seeds. The strength of cluster analysis is
that it works well with numeric data. Techniques that work well with categorical and
textual data are available as well. Cluster analysis is easy to use.

We have noted that the performance of cluster analysis methods can be dependent on the
choice of the distance metric. It can be difficult to devise a suitable distance metric for data that
contains a mixture of variable types. It can also be difficult to determine a proper weighting scheme
for disparate variable types. Furthermore, since cluster analysis is exploratory, the results of cluster
analysis sometimes can be difficult to interpret. On the other hand, quite unexpected results may be
obtained. For example, at NASA two subgroups of stars were distinguished, where previously no
difference was suspected.

REVIEW QUESTIONS

1. List four desirable features of a cluster analysis method. Which of them are important for large
databases? Discuss.

2. Discuss the different types of data which one might encounter in practice. What data type is
clustering most suitable for?
3. Given two objects represented by the attribute values (1, 6, 2, 5, 3) and (3, 5, 2, 6, 6)
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
4. Suppose that a data mining task is to cluster the following eight points (with (x, y) values):
A1(4, 6), A2(2, 5), A4(6, 9), A5(7, 5), A6(5, 7), ...
CHAPTER 5

WEB DATA MINING

Learning Objectives
1. Explain what web mining is all about
2. Define the relevant web terminology

information from the web


4. Analyze the structure of the web using the HITS algorithm designed for web structure
mining
5. Understand user behavior in interacting with the web or with a web site in order to
improve the quality of service

Definition:
Web mining is the application of data mining techniques to find interesting and potentially
useful knowledge from Web data. It is normally expected that either the hyperlink structure of the
Web or the Web log data or both have been used in the mining process.

Web mining can be divided into several categories:

1. Web content mining: It deals with discovering useful information or knowledge from Web page
contents. In contrast to Web usage mining and Web structure mining, Web content mining
focuses on the Web page content rather than the links.

2. Web structure mining: It deals with discovering and modeling the link structure of the Web.
Work has been carried out to model the Web based on the topology of the hyperlinks. This can help
in discovering similarity between sites or in discovering important sites for a particular topic or
discipline or in discovering Web communities.

3. Web usage mining: It deals with understanding user behavior in interacting with the Web or with
a Web site. One of the aims is to obtain information that may assist Web site reorganization or assist
site adaptation to better suit the user. The mined data often includes data logs of users' interactions
with the Web. The logs include the Web server logs, proxy server logs, and browser logs. The logs
include information about the referring pages, user identification, time a user spends at a site and the
sequence of pages visited.

The three categories above are not independent since Web structure mining is closely related
to Web content mining and both are related to Web usage mining.
1. Hyperlink: The text documents do not have hyperlinks, while the links are very important
components of Web documents. In hard copy documents in a library, the documents are usually
structured (e.g. books) and they have been catalogued by cataloguing experts. No linkage between
these documents is identified except that two documents may have been catalogued in the same
classification and therefore deal with similar topics.

2. Types of Information: As noted above, Web pages differ widely in structure, quality and their
usefulness. Web pages can consist of text, frames, multimedia objects, animation and other types of
information quite different from text documents which mainly consist of text but may have some
other objects like tables, diagrams, figures and some images.

3. Dynamics: The text documents do not change unless a new edition of a book appears, while Web
pages change frequently because the information on the Web including linkage information is
updated all the time (although some Web pages are out of date and never seem to change!) and new
pages appear every second. Finding a previous version of a page is almost impossible on the Web
and links pointing to a page may work today but not tomorrow.

4. Quality: The text documents are usually of high quality since they usually go through some
quality control process because they are very expensive to produce. In contrast, much of the
information on the Web is of low quality. Compared to the size of the Web, it may be that less than
10% of Web pages are really useful and of high quality.

5. Huge size: Although some of the libraries are very large, the Web in comparison is much larger;
perhaps its size is approaching 100 terabytes. That is equivalent to about 200 million books.

6. Document use: Compared to the use of conventional documents, the use of Web documents is
very different. The Web users tend to pose short queries, browse perhaps the first page of the results
and then move on.

5.2 WEB TERMINOLOGY AND CHARACTERISTICS

The World Wide Web (WWW) is the set of all the nodes which are interconnected by
hypertext links.
A link expresses one or more relationships between two or more resources. Links may also be
established within a document by using anchors.

A Web page is a collection of information, consisting of one or more Web resources,
intended to be rendered simultaneously, and identified by a single URL. A Web site is a collection of
interlinked Web pages, including a homepage, residing at the same network location.
A Uniform Resource Locator (URL) is an identifier for an abstract or physical resource, for
example a server and a file path or index. URLs are location dependent and each URL contains four
distinct parts, namely the protocol type (usually http), the name of the Web server, the directory
path and the file name. If a file name is not specified, index.html is assumed.
A Web server serves Web pages using http to client machines so that a browser can display
them.
A client is the role adopted by an application when it is retrieving a Web resource.

A proxy is an intermediary which acts as both a server and a client for the purpose of
retrieving resources on behalf of other clients. Clients using a proxy know that the proxy is present
and that it is an intermediary.
A domain name server is a distributed database of name to address mappings. When a
DNS server looks up a computer name, it either finds it in its list, or asks another DNS server which
knows more names.
A cookie is the data sent by a Web server to a Web client, to be stored locally by the client
and sent back to the server on subsequent requests.
Obtaining information from the Web using a search engine is called information "pull" while
information sent to users is called information "push". For example, users may register with a site and
then information is sent ("pushed") to such users without their requesting it.
Graph Terminology
A directed graph is a set of nodes (pages) denoted by V and edges (links) denoted by E. Thus
a graph is (V, E) where all edges are directed, just like a link that points from one page to another,
and each edge may be considered an ordered pair of nodes, the nodes that they link.
An undirected graph is also represented by nodes and edges (V, E) but the edges have no
direction specified. Therefore an undirected graph is not like the pages and links on the Web unless
we assume the possibility of traversal in both directions. The back button on the browsers does
provide the possibility of back traversal once a link has been traversed in one direction, but in
general both-way traversal of links is not possible on the Web.
A graph may be searched either by a breadth-first search or by a depth-first search. The
breadth-first search is based on first searching all the nodes that can be reached from the node where
the search is starting and, once these nodes have been searched, searching the nodes at the next level
that can be reached from those nodes and so on. The depth-first search is based on first
searching any unvisited descendants of a given node before visiting any brother
nodes. Essentially, the search algorithm involves going down before going across to a brother node.
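
The two search strategies can be sketched for a small directed graph stored as an adjacency list; the page names used here are purely illustrative.

    from collections import deque

    graph = {                                  # hypothetical pages and their out-links
        "home": ["students", "staff", "research"],
        "students": ["courses", "scholarships"],
        "staff": ["research"],
        "research": [],
        "courses": [],
        "scholarships": [],
    }

    def breadth_first(start):
        visited, queue = [start], deque([start])
        while queue:
            node = queue.popleft()             # visit nodes level by level
            for nxt in graph[node]:
                if nxt not in visited:
                    visited.append(nxt)
                    queue.append(nxt)
        return visited

    def depth_first(start, visited=None):
        visited = visited if visited is not None else []
        visited.append(start)                  # go down before going across
        for nxt in graph[start]:
            if nxt not in visited:
                depth_first(nxt, visited)
        return visited

    print(breadth_first("home"))   # home, students, staff, research, courses, scholarships
    print(depth_first("home"))     # home, students, courses, scholarships, staff, research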

The diameter of the graph is defined as the maximum of the minimum distances between all
possible ordered node pairs (u,v), that is, it is the maximum number of links that one would need to
follow starting from any page u to reach any page v, assuming that the best path has been followed.

• The Strongly Connected Core (SCC) - This part of the Web was found to consist of about
30% of the Web, which is still very large given more than four billion pages on the Web in
2004. This core may be considered the heart of the Web and its main property is that pages in
the core can reach each other following directed edges (i.e. hyperlinks).

• The IN Group - This part of the Web was found to consist of about 20% of the Web. The
main property of the IN group is that pages in the group can reach the SCC but cannot be
reached from it.

• The OUT Group - This part of the Web was found to consist of about 20% of the Web. The
main property of the OUT group is that pages in the group can be reached from the SCC but
cannot reach the SCC.

• Tendrils - This part of the Web was found to consist of about 20% of the Web. The main
property of pages in this group is that the pages cannot be reached by the SCC and cannot
reach the SCC. It does not imply that these pages have no linkages to pages outside the group
since they could well have linkages from the IN Group and to the OUT Group.

• The Disconnected Group - This part of the Web was found to be less than 10% of the Web
and is essentially disconnected from the rest of the Web world. These pages could include,
for example, personal pages at many sites that link to no other page and have no links to
them.

Size of the Web
The deep Web includes information stored in searchable databases that is often inaccessible to search engines. This information can often only be accessed by using the interface of each website. Some of this information may be available only to subscribers. The shallow Web (the indexable Web) is the information on the Web that a search engine can access without accessing the Web databases.

In many cases use of the Web makes good sense. For example, it is better to put even short announcements in an enterprise on the Web rather than send them by email, since emails sit in many mailboxes wasting disk storage, while putting information on the Web can be more effective as well as help in maintaining a record of communications.
If such uses grow, which appears likely, then a very large number of Web pages with a short life span and low connectivity to other pages are likely to be generated each day. The large numbers of Web sites that disappear every day do create enormous problems on the Web. Links from well-known sites do not always work. Not all results of a search engine are guaranteed to work. The URLs cited in scholarly publications also cannot be relied upon to be still available. A study of papers presented at the WWW conferences found that links cited in them had a decay rate that grew with the age of the papers. Abandoned sites are therefore a nuisance.

To overcome these problems, it may become necessary to categorize Web pages. The following categorization is one possibility:
1. a Web page that is guaranteed not to change ever

2. a Web page that will not delete any content and will not disappear, but may add content/links

3. a Web page that may change content/links but the page will not disappear

4. a Web page without any guarantee


Web Metrics
There have been a number of studies that have tried to measure the Web, for example, its size and its structure. There are a number of other properties of the Web that are useful to measure. It is possible to define how well connected a node is by using the concept of the centrality of a node. Centrality may be out-centrality, based on distances measured from the node to other nodes using the out-links, or in-centrality, based on distances measured from other nodes that are connected to the node using the in-links. Based on these metrics, it is possible to define the concept of compactness, which varies from 0 for a completely disconnected Web graph to 1 for a fully connected Web graph.
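The exact formulations of these metrics are not reproduced here. As a rough illustration only, the sketch below computes an out-centrality and a compactness value for a tiny hypothetical link graph; giving unreachable pairs a "converted distance" equal to the number of nodes, and the particular normalization used, are assumptions of this sketch (broadly following hypertext metrics in the literature) rather than definitions taken from the text.

from collections import deque

# Hypothetical directed link graph: page -> out-links
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}

def distances_from(page):
    """Shortest out-link distances from `page`, found by breadth-first search."""
    dist = {page: 0}
    queue = deque([page])
    while queue:
        p = queue.popleft()
        for q in LINKS.get(p, []):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist

def out_centrality(page):
    """Sum of converted out-distances from `page` (smaller = better connected).
    Unreachable nodes get a converted distance of n (an assumption of this sketch)."""
    n = len(LINKS)
    d = distances_from(page)
    return sum(d.get(v, n) for v in LINKS if v != page)

def compactness():
    """0 for a graph with no edges, 1 for a graph where every pair is directly linked."""
    nodes = list(LINKS)
    n = len(nodes)
    total = sum(out_centrality(u) for u in nodes)
    max_total = (n * n - n) * n      # every ordered pair unreachable
    min_total = n * n - n            # every ordered pair directly linked
    return (max_total - total) / (max_total - min_total)

print(out_centrality("a"))           # 6 for this toy graph
print(round(compactness(), 2))       # about 0.44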

5.3 LOCALITY AND HIERARCHY IN THE WEB

A Web site of any enterprise usually has the homepage as the root of the tree as in any
hierarchical structure. For example, if one looks at a typical university Web site the homepage will
provide some basic information about the institution and then provide links, for example, to:
Prospective students
Staff
Research

Information for current students


Information for current staff
The Prospective students' node will have a number of links, for example, to:
Courses offered
Admission requirements
Information for intemational students
Information for graduate students
Scholarships available
Semester dates
A similar structure would be expected for other nodes at this level of the tree.

Many Web sites fetch information from a database to ensure that the information is accurate and timely. A recent study found that almost 40% of all URLs fetched information from a database.

It is possible to classify Web pages into several types:


1. Homepage or the head page: These pages represent an entry point for the Web site of an enterprise or a section within the enterprise, or an individual's Web page.

2. Index page: These pages assist the user to navigate through the enterprise Web site. A homepage in some cases may also act as an index page.

3. Reference page.' These pages provide some basic information that is used by a number of other
pages. For example, each page in a Web site may have a link to a page that provides the enterprise's
privacy policy.

4. Content page: These pages only provide content and have little role in assisting a user's navigation. Often these pages are larger in size, have few out-links, and are well down the tree in a Web site. They are the leaf nodes of a tree.

A number ofprinciples have been developed to help design the structure and content ofa Web site.
For example, three basic principles are:

1. Relevant linkage principle: It is assumed that links from a page point to other relevant resources. This is similar to the assumption that is made for citations in scholarly publications, where it is assumed that a publication cites only relevant publications. Links are often assumed to reflect the judgement of the page creator. By providing a link to another page, it is assumed that the creator is making a recommendation for the other relevant page.

2. Topical unity principle: It is assumed that Web pages that are co-cited (i.e. linked from the same pages) are related. Many Web mining algorithms make use of this relevance assumption as a measure of mutual relevance between Web pages.

3. Lexical affinity principle: It is assumed that the text and the links within a page are relevant to each other. Once again, it is assumed that the text on a page has been chosen carefully by the creator to be related to a theme.

5.4 WEB CONTENT MINING

The area of Web mining deals with discovering useful information from the Web. Normally
when we need to search for content on the Web, we use one of the search engines like Google or a
subject directory like Yahoo! Some search engines find pages based on location and frequency of
keywords on the page although some now use the concept of page rank.

The algorithm proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as follows (a schematic sketch of the loop is given after the steps):
1. Sample: Start with a sample S provided by the user.
2. Occurrences: Find occurrences of tuples starting with those in S. Once tuples are found, the context of every occurrence is saved. Let these be O.
O ← S
3. Patterns: Generate patterns based on the set of occurrences O. This requires generating patterns with similar contexts.
P ← O
4. Match patterns: The Web is now searched for the patterns.
5. Stop if enough matches are found, else go to step 2.
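The sketch below shows only the shape of this loop: the Web-dependent steps (finding occurrences with their contexts, generating patterns and matching them on the Web) are passed in as placeholder functions that would have to be supplied, so this is a schematic of DIPRE rather than the original implementation.

def dipre(seed_tuples, find_occurrences, generate_patterns, match_patterns,
          enough=1000, max_rounds=5):
    """Schematic DIPRE loop.

    seed_tuples       -- the initial sample S supplied by the user
    find_occurrences  -- tuples -> occurrences of the tuples plus surrounding context
    generate_patterns -- occurrences -> patterns built from similar contexts
    match_patterns    -- patterns -> new tuples found by searching the Web
    """
    tuples = set(seed_tuples)                      # S
    for _ in range(max_rounds):
        occurrences = find_occurrences(tuples)     # O <- S : occurrences and contexts
        patterns = generate_patterns(occurrences)  # P <- O : patterns with similar contexts
        new_tuples = match_patterns(patterns)      # search the Web for the patterns
        tuples |= set(new_tuples)
        if len(tuples) >= enough:                  # stop if enough matches are found
            break
    return tuples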
Web document clustering
Web document clustering is another approach to find relevant documents on a topic or about query
keywords. The popular search engines often return a huge, unmanageable list of documents which
contain the keywords that the user specified. Finding the most useful documents from a large list is
usually tedious, often impossible. The user could apply clustering to a set of documents returned by a search engine in response to a query, with the aim of finding semantically meaningful clusters rather than a list of ranked documents.
Among the cluster analysis techniques discussed earlier, we considered in particular the K-means method and the agglomerative method. These methods can be used for Web document cluster analysis as well, but they assume that each document has a fixed set of attributes that appear in all documents. Similarity between documents can then be computed based on these values. One could possibly have a set of words and their frequencies in each document and then use those values for clustering them.
Suffix Tree Clustering (STC) is an approach that takes a different path and is designed specifically for Web document cluster analysis; it uses a phrase-based clustering approach rather than single word frequency.
In STC, the key requirements of a Web document clustering algorithm include the following (a simplified sketch of the phrase-based idea follows this list):
1. Relevance: This is the most obvious requirement. We want clusters that are relevant to the user query and that cluster similar documents together.

2. Browsable summaries: The clusters must be easy to understand. The user should be able to quickly browse the description of a cluster and work out whether the cluster is relevant to the query.

3. Snippet tolerance: The clustering method should not require whole documents and should be able to produce relevant clusters based only on the information that the search engine returns.

4. Performance: The clustering method should be able to process the results of the search engine quickly and provide the resulting clusters to the user.
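Full STC builds a suffix tree over the phrases occurring in the result snippets and then merges overlapping base clusters. The sketch below is a much simplified stand-in that only illustrates the phrase-based idea: it groups snippets that share a fixed-length word phrase, which corresponds loosely to STC's base-cluster step; the snippets shown are made up.

from collections import defaultdict

def phrases(text, length=2):
    """All consecutive word phrases of the given length in a snippet."""
    words = text.lower().split()
    return {" ".join(words[i:i + length]) for i in range(len(words) - length + 1)}

def phrase_clusters(snippets, min_docs=2, length=2):
    """Group snippets that share a phrase: a crude stand-in for STC's base
    clusters (a real implementation uses a suffix tree and merges clusters)."""
    clusters = defaultdict(set)
    for i, snippet in enumerate(snippets):
        for phrase in phrases(snippet, length):
            clusters[phrase].add(i)
    # keep only phrases shared by at least min_docs snippets
    return {p: docs for p, docs in clusters.items() if len(docs) >= min_docs}

snippets = [
    "data mining on the web",
    "web usage mining from server logs",
    "mining on the web with search engines",
]
print(phrase_clusters(snippets))   # e.g. {'mining on': {0, 2}, 'on the': {0, 2}, 'the web': {0, 2}}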

Finding Similar Web pages


There is a proliferation of similar or identical documents on the Web. It has been found that almost 30% of all Web pages are very similar to other pages and about 22% are virtually identical to other pages. There are many reasons for identical pages. For example:

1. A local copy may have been made to enable faster access to the material.
2. FAQs on important topics are duplicated since such pages may be used frequently locally.
3. Online documentation of popular software like Unix or LaTeX may be duplicated for local use.
4. There are mirror sites that copy highly accessed sites to reduce traffic (e.g. to reduce international traffic from India or Australia).
In some cases, documents are not exactly identical because different formatting might be used at different sites. There may be some customization or use of templates at different sites. A large document may be split into smaller documents, or a composite document may be joined together to build a single document.
Copying a single Web page is often called replication; on the other hand, copying an entire Web site is called mirroring.
The discussion here is focused on content-based similarity, which is based on comparing the textual content of the Web pages. Web pages also have non-text content but we will not consider it.
We define two concepts:
1. Resemblance: The resemblance of two documents is defined to be a number between 0 and 1, with 1 indicating that the two documents are virtually identical and any value close to 1 indicating that the documents are very similar.
2. Containment: The containment of one document in another is also defined as a number between 0 and 1, with 1 indicating that the first document is completely contained in the second.
There are a number of ways by which the similarity of documents can be assessed. One brute force approach is to compare two documents using software like the tool diff available in the Unix operating system, which essentially compares the two documents as files. Other string comparison algorithms can be used to find how many characters need to be deleted, changed or added to transform one document into the other, but these approaches are very expensive if one wishes to compare millions of documents.
There are other issues that must be considered in document matching. Firstly, if we are
looking to compare millions of documents then the storage requirement of the method should not be
large. Secondly, documents may be in HTML, PDF, Postscript, FrameMaker, TeX, PageMaker or
MS Word. They need to be converted to text for comparison. The conversion can introduce some
errors. Finally, the method should be robust, that is, it should not be possible to circumvent the
matching process with modest changes to a document.
Fingerprinting
An approach for comparing a large number of documents is based on the idea of fingerprinting documents.
A document may be divided into all possible substrings of length L. These substrings are called shingles. Based on the shingles one can define the resemblance R(X, Y) and containment C(X, Y) between two documents X and Y as follows. We assume S(X) and S(Y) to be the sets of shingles for documents X and Y respectively.

R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|

C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|

The following algorithm may be used to find similar documents:

1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each document.
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.
Full fingerprinting: The Web is very large and this algorithm requires enormous storage for the shingles and a very long processing time to finish pairwise comparison for, say, even 100 million documents. This approach is called full fingerprinting.
Example 5,1
Consider a simple example in which we wish to find if the two "documents" with the following contents are similar:
Document 1: "the Web continues to grow at a fast rate"
Document 2: "the Web is growing at a fast rate"
Using shingles that are three words long, the shingles of the two documents are listed in Table 5.1.
Table 5.1 Shingles of length 3

Shingles in Document 1      Shingles in Document 2
the Web continues           the Web is
Web continues to            Web is growing
continues to grow           is growing at
to grow at                  growing at a
grow at a                   at a fast
at a fast                   a fast rate
a fast rate

Comparing the two sets of shingles we find that only two of them are identical. Thus, for this simple example, the documents are not very similar.

Shingles in Document 1    Number of letters    Shingles in Document 2    Number of letters
the Web continues         17                   the Web is                10
Web continues to          16                   Web is growing            14
continues to grow         17                   is growing at             13
to grow at                10                   growing at a              12
grow at a                 9                    at a fast                 9
at a fast                 9                    a fast rate               11
a fast rate               11

Table 5.2 Number of letters in shingles


For comparison, we select the three shortest shingles. For the first document, these are "to
grow at", "glow at a" and "at a fast". For the second document the shortest shingles are "the Web is",
"at a fast" and "a fast rate". There is only one match out of the three shingles, providing a
resemblance ratio 0.33.
False positives with the original document would be obtained for documents like "the
Australian economy is growing at a fast rate". False negatives would be obtained for strings like "the
Web is growing at quite a rapid rate".
It has been found that small length shingles cause many false positives while large shingles
result in more false negatives.
The following issues have been ignored in this discussion of comparing documents using fingerprinting:
l. How long should the shingle be for good performance?
2. Should shingle length be in number of words or number of characters?
3. How much storage would we need to store shingles?
4. Should upper-case and lower-case letters be treated differently?
5. Should spaces between words be removed?
6. Should punctuation marks be ignored?
7. Should end of line marker be ignored?
8. Should end ofparagraph be ignored?
9. Should stop words like "to", "a" and "the" be removed?
10. Should stemming be carried out to transform "growing" into "grow"?
This is one approach to finding similar pages.
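As a concrete illustration, the sketch below computes word shingles of width three and the resemblance and containment measures defined earlier for the two short documents of Example 5.1; questions such as case folding, stop word removal and stemming are left open, as the list above suggests.

def shingles(text, width=3):
    """The set of all consecutive word sequences ("shingles") of the given width."""
    words = text.split()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}

def resemblance(x, y, width=3):
    """R(X, Y) = |intersection of shingle sets| / |union of shingle sets|."""
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx | sy)

def containment(x, y, width=3):
    """C(X, Y) = |intersection of shingle sets| / |shingle set of X|."""
    sx, sy = shingles(x, width), shingles(y, width)
    return len(sx & sy) / len(sx)

doc1 = "the Web continues to grow at a fast rate"
doc2 = "the Web is growing at a fast rate"

print(len(shingles(doc1)), len(shingles(doc2)))   # 7 and 6 shingles, as in Table 5.1
print(round(resemblance(doc1, doc2), 2))          # 2 shared shingles out of 11 distinct: 0.18
print(round(containment(doc1, doc2), 2))          # 2 / 7, about 0.29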

5.5 WEB USAGE MINING


Web usage mining involves analyzing the data collected in Web server logs when users visit a site. Even simple log analysis software can provide information such as the following (a short sketch of deriving some of these figures from a raw log follows this list):
1. Number of hits - the number of times each page in the Web site has been viewed
2. Number of visitors - the number of users who came to the site
3. Visitor referring Web site - the Web site URL of the site the user came from
4. Visitor exit Web site - the Web site URL of the site where the user went when he/she left the Web site
5. Entry point - which Web site page the user entered from
6. Visitor time and duration - the time and day of visit and how long the visitor browsed the site
7. Path analysis - a list of the path of pages that the user took
8. Visitor IP address - this helps in finding which part of the world the user came from
9. Browser type
10. Platform
11. Cookies
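The sketch promised above shows how a couple of these figures can be derived from a raw server log. The field layout assumed here is the common log format; real logs, and the extra fields needed for referrers, browsers and cookies, vary.

from collections import Counter

# A few (made up) lines in common log format: IP - - [time] "request" status bytes
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2006:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [01/Jan/2006:10:00:05 +0000] "GET /courses.html HTTP/1.0" 200 2311',
    '10.0.0.1 - - [01/Jan/2006:10:00:09 +0000] "GET /index.html HTTP/1.0" 200 1043',
]

def parse(line):
    """Very naive parser: returns (visitor IP address, requested page)."""
    ip = line.split()[0]
    page = line.split('"')[1].split()[1]   # the path inside the quoted request
    return ip, page

hits = Counter()
visitors = set()
for line in LOG_LINES:
    ip, page = parse(line)
    hits[page] += 1           # number of hits per page
    visitors.add(ip)          # distinct visitor IP addresses

print(hits.most_common())     # [('/index.html', 2), ('/courses.html', 1)]
print(len(visitors))          # approximate number of visitors, here 2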

Even this simple information about a Web site and pages within it can assist an enterprise to achieve
the following:
1. Shorten the paths to high visit pages

2. Eliminate or combine low visit pages


3. Redesign some pages including the homepage to help user navigation
4. Redesign some pages so that the search engines can find them
5. Help evaluate the effectiveness of an advertising campaign
Web usage mining may also involve collecting much more information than has been listed. For example, it may be desirable to collect information on:
1. Paths traversed: What paths do the customers traverse? What are the most commonly traversed paths through a Web site? These patterns need to be interpreted, analyzed, visualized and acted upon.

2. Conversion rates: What are the look-to-click, click-to-basket and basket-to-buy rates for each product? Are there significant differences in these rates for different products?

3. Impact of advertising: Which banners are pulling in the most traffic? What is their conversion rate?

4. Impact of promotions: Which promotions generate the most sales? Is there a particular level in the site where promotions are most effective?

5. Web site design: Which links do the customers click most frequently? Which links do they buy from most frequently? Are there some features of these links that can be identified?

6. Customer segmentation: What are the features of customers who "abandon their trolley" without buying? Where do the most profitable customers come from?

7. Enterprise search: Which customers use enterprise search? Are they more likely to purchase? What do they search for? How frequently does the search engine return a failed result? How frequently does the search engine return too many results?

Log data analysis has been investigated using the following techniques (a small illustration follows the list):

• Using association rules
• Using composite association rules
• Using cluster analysis
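The illustration below hints at how association-rule style counting can be applied to log data: requests are first grouped into sessions (here simply by visitor IP with a 30-minute timeout, which is an assumption of the sketch) and pairs of pages that frequently occur in the same session are then counted as candidate associations.

from collections import Counter
from itertools import combinations

# (visitor, time in minutes, page) tuples, assumed already extracted from a log
REQUESTS = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 2, "/courses.html"),
    ("10.0.0.1", 90, "/index.html"),     # a much later visit starts a new session
    ("10.0.0.2", 5, "/index.html"),
    ("10.0.0.2", 7, "/courses.html"),
]

def sessionize(requests, timeout=30):
    """Split each visitor's requests into sessions separated by `timeout` minutes."""
    sessions = []
    last_seen = {}
    current = {}
    for visitor, t, page in sorted(requests):
        if visitor not in current or t - last_seen[visitor] > timeout:
            current[visitor] = set()
            sessions.append(current[visitor])
        current[visitor].add(page)
        last_seen[visitor] = t
    return sessions

def frequent_pairs(sessions, min_support=2):
    """Count page pairs occurring in the same session and keep the frequent ones."""
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(frequent_pairs(sessionize(REQUESTS)))
# {('/courses.html', '/index.html'): 2}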

5.6 WEB STRUCTURE MINING


The aim of Web structure mining is to discover the link structure or the model that is assumed to underlie the Web. The model may be based on the topology of the hyperlinks. This can help in discovering similarity between sites, or in discovering authority sites for a particular topic or discipline, or in discovering overview or survey sites that point to many authority sites (such sites are called hubs). Link structure is only one kind of information that may be used in analyzing the structure of the Web.

The HITS (Hyperlink-Induced Topic Search) algorithm has two major steps:


1. Sampling step - It collects a set of relevant Web pages given a topic.
2. Iterative step - It finds hubs and authorities using the information collected during sampling.

The HITS method uses the following algorithm.


Step 1 - Sampling Step
The first step involves finding a subset of nodes, or a subgraph S, which is rich in relevant authoritative pages. To obtain such a subgraph, the algorithm starts with a root set of, say, 200 pages selected from the result of searching for the query in a traditional search engine. Let the root set be R. Starting from the root set R, we wish to obtain a set S that has the following properties:

1. S is relatively small

2. S is rich in relevant pages given the query

3. S contains most (or many) of the strongest authorities.
The HITS algorithm expands the root set R into a base set S by using the following algorithm:
1. Let S = R
2. For each page in S, do steps 3 to 5
3. Let T be the set of all pages S points to
4. Let F be the set of all pages that point to S
5. Let S = S + T + some or all of F (some if F is large)

6. Delete all links with the same domain name


7. This S is returned
Step 2 - Finding Hubs and Authorities
The algorithm for finding hubs and authorities works as follows:
1. Let a page p have a non-negative authority weight x_p and a non-negative hub weight y_p. Pages with relatively large weights x_p will be classified to be the authorities (similarly for the hubs with large weights y_p).

2. The weights are normalized so that their squared sum for each type of weight is 1, since only the relative weights are important.
3. For a page p, the value of x_p is updated to be the sum of y_q over all pages q that link to p.
4. For a page p, the value of y_p is updated to be the sum of x_q over all pages q that p links to.
5. Continue with step 2 unless a termination condition has been reached.
6. On termination, the output of the algorithm is a set of pages with the largest x_p weights that can be assumed to be the authorities and those with the largest y_p weights that can be assumed to be the hubs.
Kleinberg provides examples of how the HITS algorithm works and it is shown to perform well.

Theorem: The sequences of weights x_p and y_p converge.

Proof: Let G = (V, E). The graph can be represented by an adjacency matrix A where each element (i, j) is 1 if there is an edge between the two vertices, and 0 otherwise.
The weights are modified according to the simple operations x = A^T y and y = Ax. Therefore x = A^T Ax and, similarly, y = AA^T y. The iterations therefore converge to the principal eigenvectors of A^T A and AA^T respectively.
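A minimal sketch of the iterative step is given below, assuming the base set and its links have already been collected; the pages and links are hypothetical and a fixed number of iterations stands in for a proper convergence test.

from math import sqrt

# Hypothetical base set: page -> pages it links to
LINKS = {
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}

def hits(links, iterations=50):
    """Iteratively update authority weights x and hub weights y:
    x[p] <- sum of y[q] over pages q that link to p,
    y[p] <- sum of x[q] over pages q that p links to,
    normalizing each vector so that its squared sum is 1."""
    pages = list(links)
    x = {p: 1.0 for p in pages}   # authority weights
    y = {p: 1.0 for p in pages}   # hub weights
    for _ in range(iterations):
        x = {p: sum(y[q] for q in pages if p in links[q]) for p in pages}
        y = {p: sum(x[q] for q in links[p]) for p in pages}
        for w in (x, y):
            norm = sqrt(sum(v * v for v in w.values())) or 1.0
            for p in pages:
                w[p] /= norm
    return x, y

authorities, hubs = hits(LINKS)
print(sorted(authorities.items(), key=lambda kv: -kv[1]))   # p4, then p3, are the top authorities
print(sorted(hubs.items(), key=lambda kv: -kv[1]))          # p1 and p2 are the best hubs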
Problems with the HITS Algorithm
There has been much research done in evaluating the HITS algorithm and it has been shown that while the algorithm works well for most queries, it does not work well for some others. There are a number of reasons for this:
1. Hubs and authorities: A clear-cut distinction between hubs and authorities may not be appropriate since many sites are hubs as well as authorities.

2. Topic drift: Certain clusters of tightly connected documents, perhaps due to mutually reinforcing relationships between hosts, can dominate the HITS computation. These documents in some instances may not be the most relevant to the query that was posed. It has been reported that in one case when the search item was "jaguar" the HITS algorithm converged to a football team called the Jaguars. Other examples of topic drift have been found on topics like "gun control", "abortion", and "movies".
3. Automatically generated links: Some of the links are computer generated and represent no human judgement, but HITS still gives them equal importance.


4. Non-relevant documents: Some queries can return non-relevant documents among the highly ranked results and this can lead to erroneous results from the HITS algorithm.

5. Efficiency: The real-time performance of the algorithm is not good, given the steps that involve finding sites that are pointed to by pages in the root set.

A number of proposals have been made for modifying HITS. These include:
• More careful selection of the base set to reduce the possibility of topic drift. One possible approach might be to modify the HITS algorithm so that the hub and authority weights are modified only based on the best hubs and the best authorities.
• One may argue that the in-link information is more important than the out-link information, since a hub can become important merely by pointing to a lot of authorities.
Web Communities
A Web community is generated by a group of individuals that share a common interest. It manifests on the Web as a collection of Web pages with a common interest as the theme. These could, for example, be communities about a sub-discipline, a religious group, a sport or a sports team, a hobby, an event, a country, a state, or whatever. The communities include newsgroups and portals, and the large ones may include directories in sites like Yahoo!
The HITS algorithm finds authorities and hubs for a specified broad topic. The idea of cyber communities is to find all such Web communities.
5.7 WEB MINING SOFTWARE


Many general purpose data mining software packages include Web mining software. For example, Clementine from SPSS includes Web mining modules. The following list includes a variety of Web mining software:
• 123LogAnalyzer, from a company with the same name, is low-cost Web mining software that provides an overview of a Web site's performance: statistics about Web server activity including the number of visitors, the number of unique IPs, Web pages viewed, files downloaded, directories that were accessed, the number of hits broken down by day of the week and hour of the day, and images that were accessed.
• Analog claims to be ultra-fast, scalable, highly configurable, works on any operating system and is free.
• Azure Web Log Analyzer from Azure Desktop claims to find the usual Web information including the most popular pages, the number of visitors and where they are from, and what browser and computer they used.

• ClickTracks, from a company by the same name, is Web mining software offering a number of modules, including Analyzer, Optimizer and Pro, that use log files to provide Web site analysis. It allows desktop data mining.
• Datanautics G2 and Insight 5 from Datanautics are Web mining software for data collection, processing, analysis and reporting.
• LiveStats.NET and LiveStats.BIZ from DeepMetrix provide website analysis, data visualization and statistics on distinct visitors, repeat visits, popular entry and exit pages, time spent on pages, geographic reports which break down visits by country and continent, click paths, keywords by search engine and more.
• NetTracker Web analytics from Sane Solutions claims to analyze log files (from Web servers, proxy servers and firewalls), data gathered by JavaScript page tags, or a hybrid of both.
• Nihuo Web Log Analyzer from LogAnalyser provides reports on how many visitors came to the website, where they came from, which pages they viewed, and how long they spent on the site.
• WebAnalyst from Megaputer is based on the PolyAnalyst text mining software.
• Weblog Expert 3.5, from a company with the same name, produces reports that include the following information: activity statistics, accessed files, paths through the site, information about referring pages, search engines, browsers, operating systems and more.
• WebTrends 7 from NetIQ is a collection of modules that provide a variety of Web data including navigation analysis, customer segmentation and more.
• WUM: Web Utilization Miner is an open source project. WUMprep is a collection of Perl scripts for data preprocessing tasks such as sessionizing, robot detection and mapping of URLs onto concepts. WUM is integrated Java-based Web mining software for log file preparation, basic reporting, discovery of sequential patterns and visualization.

CONCLUSION

The World Wide Web has become an extremely valuable resource for a large number of
people all around the world. During the last decade, the Web revolution has had a profound impact
on the way we search and find information at home and at work. Although information resources like
libraries have been available to the public for a long time, the Web provides instantaneous access to a huge variety of information. From its beginning in the early 1990s the Web has grown to perhaps
more than eight billion Web pages which are accessed all over the world every day. Millions of Web
pages are added every day and millions of others are modified or deleted.

The Web is an open medium with no controls on who puts up what kind of material. The
openness has meant that the Web has grown exponentially, which is its strength as well as its

weakness. The strength is that one can find information on just about any topic. The weakness is the
problem of abundance of information.

REVIEW QUESTIONS
1. Define the three types of Web mining. What are their major differences?
2. Define the following terms:
a) Browser
b) Uniform resource locator
c) Domain name server
d) Cookie
3. Describe three major differences between the conventional textual documents and Web
documents.
4. What is Lotka's Inverse-Square law regarding scholarly publications? What relation does it
have to the power laws of distribution of in-links and out-links from Web pages?
5. Describe the "bow-tie" structure of the Web. What percentage of pages from the Strongly
ConnectedCore? What is the main property of the Core?
6. What is the difference between the deep and shallow Web? Why do we need the deep Web?
7. How can clustering be used in Web usage mining?
8. What is the basis of Kleinberg's HITS algorithm?
9. Provide a step-by-step description of Kleinberg's HITS algorithm for finding authorities and
hubs for topic "data mining".
10. Discuss the advantages and disadvantages ofthe HITS algorithm.
11. What is a Web Community? How do you discover them?
12. Use the HITS algorithm to find hubs and authorities from the following five web pages:
Page A (out-links to B, C, D)
Page B (out-links to A, C, D)
Page C (out-links to D)
Page D (out-links to C, E)
Page E (out-links to B, C, D)
13. What are the major differences between classical information retrieval and Web search?
