CORE COURSE IX
DATA MINING AND DATA WAREHOUSING
Objective: In this course students shall learn the mathematical and algorithmic details of various data association techniques to discover patterns in underlying data (namely data mining). They also learn how to consolidate huge volumes of data in one place efficiently.
Unit - I
Introduction to data mining - Association Rule Mining.
Unit - II
Classification - Cluster analysis.
Unit - III
Web Data Mining - Search engines.
Unit - IV
Data warehousing - Algorithms & operations to create data warehouse - Designing data
warehouse - Applications of data warehouse.
Unit - V
Online analytical processing - Information Privacy.
Text Book:
1. G.K. Gupta, Introduction to Data Mining with Case Studies, Prentice Hall India, 2006 (ISBN 81-203-3053-6) [Unit 1: Chapters 1, 2; Unit 2: Chapters 3, 4; Unit 3: Chapters 5, 6; Unit 4: Chapter 7; Unit 5: Chapters 8, 9].
REFERENCE BOOKS
1. K.P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining: Theory and Practice, Prentice Hall of India, 2006 (ISBN 81-203-2897-3).
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier, Second Edition, 2007 (ISBN 81-312-0535-5).
CHAPTER 1
INTRODUCTION TO DATA MINING
Learning Objectives
1. Explain what data mining is and where it may be useful
2. List the steps that a data mining process often involves
3. Discuss why data mining has become important
4. Introduce briefly some of the data mining techniques
5. Develop a good understanding of the data mining software available on the market
Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they may be used in an enterprise's decision making process. The techniques can find novel patterns that may assist an enterprise in understanding and making better use of its data.
Data mining is a complex process and may require a variety of steps before some useful results are obtained. Often data pre-processing, including data cleaning, may be needed. In some cases, sampling of data and testing of various hypotheses may be required before data mining can start.
Data mining has found many applications in the last few years for a number of reasons.
1. Growth of OLTP data: The first database systems were implemented in the 1960s and 1970s. Many enterprises therefore have more than 30 years of experience in using database systems and they have accumulated large amounts of data during that time.
2. Growth of data due to cards: The growing use of credit cards and loyalty cards is an important area of data growth. In the USA, there has been a tremendous growth in the use of loyalty cards. Even in Australia, the use of cards like FlyBuys has grown considerably. Table 1.1 shows the total number of VISA and Mastercard credit cards in the top ten card holding countries.
3. Growth in data due to the web: E-commerce developments have resulted in information about visitors to Web sites being captured, once again resulting in mountains of data for some companies.
4. Growth in data due to other sources: There are many other sources of data. Some of them are:
• Telephone transactions
• Frequent flyer transactions
• Medical transactions
• Immigration and customs transactions
• Banking transactions
• Motor vehicle transactions
• Utilities (e.g., electricity and gas) transactions
• Shopping transactions
[Figure: growth by year, 1996-2001]
7. Competitive environment
Owing to the increased globalization of trade, the business environment in most countries has become very competitive. For example, in many countries the telecommunications industry used to be a state monopoly but it has mostly been privatized now, leading to intense competition in this industry. Businesses have to work harder to find new customers and to retain old ones.
8. Availability of software
A number of companies have developed useful data mining software in the last few years.
Companies that were already operating in the statistics software market and were familiar with
statistical algorithms, some of which are now used in data mining, have developed some of the
software.
1. Requirement analysis: The enterprise decision makers need to formulate goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for, since the outcomes determine which techniques are to be used and what data is required.
2. Data selection and collection: This step includes finding the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. If the data is not available in the warehouse or the enterprise does not have a warehouse, the source Online Transaction Processing (OLTP) systems need to be identified and the required information extracted and stored in some temporary system.
3. Cleaning and preparing data: This may not be an onerous task if a data warehouse containing the required data exists, since most of this must already have been done when the data was loaded in the warehouse. Otherwise this task can be very resource intensive and sometimes more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
4. Data mining exploration and validation: Once appropriate data has been collected and cleaned, it is possible to start data mining exploration. Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the needs of the enterprise. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is to be an iterative process which should lead to the selection of one or more techniques that are suitable for further exploration, testing and validation.
5. Implementing, evaluating and monitoring: Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualization and explanation, for managers. It may be that more than one technique is available for the given data mining task. It is then important to evaluate the results and choose the best technique. Evaluation may involve checking the accuracy and effectiveness of the technique.
There is a need for regular monitoring of the performance of the techniques that have been implemented. It is essential that use of the tools by the managers be monitored and the results evaluated regularly. Every enterprise evolves with time and so too must the data mining system.
6. Results visualization: Explaining the results of data mining to the decision makers is an important step of the data mining process. Most commercial data mining tools include data visualization modules. These tools are vital in communicating the data mining results to the managers, although a problem dealing with a number of dimensions must be visualized using a two-dimensional computer screen or printout. Clever data visualization tools are being developed to display results that deal with more than two dimensions. The visualization tools available should be tried and used if found effective for the given problem.
Figure 1.2 CRISP data mining process model
1.4 DATA MINING APPLICATIONS
Data mining is being used for a wide variety of applications. We group the applications into the following six groups. These are related groups, not disjoint groups.
1. Prediction and description: Data mining is used to answer questions like "would this customer buy a product?" or "is this customer likely to leave?" Data mining techniques may also be used for sales forecasting and analysis. Usually the techniques involve selecting some or all the attributes of the objects available in a database to predict other variables of interest.
2. Relationship marketing: Data mining can help in analyzing customer profiles, discovering sales triggers, and in identifying critical issues that determine client loyalty and help in improving customer retention. This also includes analyzing customer profiles and improving direct marketing plans. It may be possible to use cluster analysis to identify customers suitable for cross-selling other products.
3. Customer profiling: It is the process of using the relevant and available information to describe the characteristics of a group of customers, to identify their discriminators from other customers or ordinary consumers, and to identify the drivers for their purchasing decisions. Profiling can help an enterprise identify its most valuable customers so that the enterprise may differentiate their needs and values.
4. Outliers identification and detecting fraud: There are many uses of data mining in identifying outliers, fraud or unusual cases. These might be as simple as identifying unusual expense claims by staff, identifying anomalies in expenditure between similar units of an enterprise, perhaps during auditing, or identifying fraud, for example, involving credit or phone cards.
5. Customer segmentation: It is a way to assess and view individuals in the market based on their status and needs. Data mining can be used for customer segmentation, for promoting the cross-selling of services, and in increasing customer retention. Data mining may also be used for branch segmentation and for evaluating the performance of various banking channels, such as phone or online banking. Furthermore, data mining may be used to understand and predict customer behavior and profitability, to develop new products and services, and to effectively market new offerings.
6. Web site design and promotion: Web mining may be used to discover how users navigate a Web site, and the results can help in improving the site design and making it more visible on the Web. Data mining may also be used in cross-selling by suggesting to a Web customer items that he or she may be interested in, through correlating properties about the customers, or the items the person had ordered, with a database of items that other customers might have ordered previously.
Association rules mining has many applications other than market basket analysis, including applications in marketing, customer segmentation, medicine, electronic commerce, classification, clustering, Web mining, bioinformatics, and finance. A simple algorithm called the Apriori algorithm is used to find associations.
Supervised classification: Supervised classification is appropriate to use if the data is known to have a small number of classes, the classes are known, and some training data with their classes known is available. The model built based on the training data may then be used to assign a new object to a predefined class.
Supervised classification can be used in predicting the class to which an object or individual is likely to belong. This is useful, for example, in predicting whether an individual is likely to respond to a direct mail solicitation, in identifying a good candidate for a surgical procedure, or in identifying a good risk for granting a loan or insurance. One of the most widely used supervised classification techniques is the decision tree. The decision tree technique is widely used because it generates easily understandable rules for classifying data.
Cluster analysis
One of the most widely used cluster analysis methods is called the K-means algorithm, which requires that the user specify not only the number of clusters but also their starting seeds. The algorithm assigns each object in the given data to the closest seed, which provides the initial clusters.
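A minimal Python sketch of this procedure is given below. It is only an illustration, not the book's own implementation: after the initial assignment to the user-supplied seeds, it follows the usual K-means iteration of recomputing the seeds as cluster means; the function name and the two-dimensional sample data are ours.

    # Minimal K-means sketch: the user supplies the number of clusters (via the
    # seeds) and the starting seeds; each object is assigned to its closest seed,
    # and the seeds are recomputed as cluster means until assignments stabilize.
    import math

    def kmeans(points, seeds, max_iter=100):
        centroids = [list(s) for s in seeds]
        for _ in range(max_iter):
            clusters = [[] for _ in centroids]
            for p in points:
                distances = [math.dist(p, c) for c in centroids]
                clusters[distances.index(min(distances))].append(p)
            new_centroids = [
                [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else centroids[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centroids == centroids:   # converged: centroids no longer move
                break
            centroids = new_centroids
        return centroids, clusters

    # Illustrative data: two obvious groups in the plane and two starting seeds.
    data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9)]
    centroids, clusters = kmeans(data, seeds=[(0, 0), (10, 10)])
    print(centroids)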
The last decade has witnessed the Web revolution, which has ushered in a new information retrieval age. The revolution has had a profound impact on the way we search and find information at home and at work. Searching the Web has become an everyday experience for millions of people from all over the world (some estimates suggest over 500 million users). From its beginning in the early 1990s, the Web had grown to more than four billion pages in 2004, and perhaps would grow to more than eight billion pages by the end of 2006.
Search engines
The search engine databases of Web pages are built and updated automatically by Web
crawlers. When one searches the Web using one of the search engines, one is not searching the entire
Web. Instead one is only searching the database that has been compiled by the search engine.
Wipro has reported a study of frequent flyer data from an Indian airline. Before carrying out data mining, the data was selected and prepared. It was decided to use only the three most common sectors flown by each customer and the three most common sectors for which points were redeemed by each customer. It was discovered that much of the data supplied by the airline was incomplete or inaccurate. Also it was found that the customer data captured by the company could have been more complete. For example, the airline did not know customers' marital status or their income or their reasons for taking a journey.
Astronomy
Astronomers produce huge amounts of data every night on the fluctuating intensity of around 20 million stars, which are classified by their spectra and their surface temperature. Some 90% of stars are called main sequence stars, including some stars that are very large, very hot, and blue in color. The main sequence stars are fuelled by nuclear fusion and are very stable, lasting billions of years. The smaller main sequence stars include the Sun. There are a number of classes, including stars called yellow dwarf, red dwarf and white dwarf.
Banking and Finance
Banking and finance is a rapidly changing, competitive industry. The industry is using data mining for a variety of tasks including building customer profiles to better understand the customers, to identify fraud, to evaluate risks in personal and home loans, and to better forecast stock prices, interest rates, exchange rates and commodity prices. In the field of credit evaluation, data mining can assist in establishing an automated decision support system which would allow credit card or loan providing companies to quickly and accurately assess risk and approve or reject an application.
Climate
A study has been reported on atmospheric and oceanic parameters that cause drought in the state of Nebraska in the USA. Many variables were considered including the following:
1. Standardized precipitation index (SPI)
2. Palmer drought severity index (PDSI)
3. Southern oscillation index (SOI)
Crime Prevention
A number of case studies have been published about the use of data mining techniques in analyzing crime data. In one particular study, the data mining techniques were used to link serious sexual crimes to other crimes that might have been committed by the same offenders. The data used related to more than 2000 offences (the number of offenders was much smaller since most offenders committed multiple crimes) involving a variety of sexual crimes.
In this case study, a direct mail company held a list of a large number of potential customers. The response rate of the company had been only 1%, which the company wanted to improve. To carry out data mining, the company had to first prepare data, which included sampling the data to select a subset of customers including those that responded to direct mail and those that did not.
Much research is being carried out in applying data mining to a variety of applications in healthcare. It has been found, for example, that in drug testing, data mining may assist in isolating those patients where the drug is most effective or where the drug is having unintended side effects.
Data mining has been used in determining
Although data mining algorithms have been successfully applied to business applications, and although storage prices continue to decline and enterprise data continues to grow, data mining is still not being used widely. Thus, there is considerable potential for data mining to continue to grow.
Since most time spent in data mining is actually spent in data extraction, data cleaning and data manipulation, it is expected that technologies like data warehousing will grow in importance. It has been found that as much as 40% of all collected data contains errors. To deal with such large error rates, there is likely to be more emphasis in the future on building data warehouses using data
cleaning and extraction. Data mining efficiency would improve if these tasks could be carried out in advance, for example when a data warehouse is built.
Data mining techniques depend upon a lot of careful analysis of the business and a good
understanding of the techniques and software available. Often a model needs to be built, tested and
validated before it can be used. This needs considerable expertise and time. The team engaged on
building a data mining application should have the business expertise as well as the data mining
expertise.
Data mining techniques that will become important in the future include techniques that better determine the "interestingness" of a discovered pattern and are able to compare current data with an earlier set of data to determine if there is a change of pattern in the data. Other techniques that are likely to receive more attention in the future are text and web-content mining, bioinformatics, and multimedia data mining.
The issues related to information privacy and data mining will continue to attract serious concern in the community in the future. In particular, privacy concerns related to the use of data mining techniques by governments, in particular the US Government, in fighting terrorism are likely to grow.
Every data mining project is different but the projects do have some common features. Following are some basic requirements for a successful data mining project.
Once the basic prerequisites have been met, the following guidelines may be appropriate for a data
mining project.
1. Data mining projects should be carried out by small teams with strong internal integration. Preferably such a person should not be a technical analyst or a consultant but someone with direct business responsibility, for example someone in a sales or marketing environment. This will benefit the external integration.
6. The whole project should have the support of the top management of the company.
• Intelligent Miner – This is a comprehensive data mining package from IBM. Intelligent Miner uses DB2 but can access data from other databases. Its functionality includes association rules, classification, cluster analysis, prediction, sequential patterns, and time series. It also includes Intelligent Miner for Text for text mining, including mining of email and web pages. Intelligent Miner provides support for processes from data preparation to mining and presentation.
• JDA Intellect – JDA Software Group has a comprehensive package called JDA Intellect that provides facilities for association rules, classification, cluster analysis, and prediction.
• Mantas – Mantas Software is a small company that was a spin-off from SRA International. The Mantas suite is designed to focus on detecting and analyzing suspicious behavior in financial markets and to assist in complying with global regulations.
• MCubiX from Diagnos – It is a complete and affordable data mining toolbox, including decision trees, neural networks, association rules and visualization.
• MineSet – Originally developed by SGI, MineSet specializes in visualization and provides a variety of visualization tools including the scatter visualizer, the statistics visualizer and the map visualizer.
5. Training and support
a) What documentation is provided?
b) How current is the documentation?
c) Does the vendor provide training and help in installation?
d) Is computer-based training for the software available?
e) Are there any articles that have been written about the product by third parties?
7. Usability
a) Is the user interface intuitive, given the machine platform?
b) Is the software easy to learn? Is the documentation available simple, clear and concise?
c) Is the software flexible? Can it be easily adapted to a variety of problems?
CONCLUSION
In this chapter, we explained what data mining is and also presented a definition of data mining. The reasons for current interest in data mining are discussed and a number of areas of growth in data are described. The data mining process is discussed and the techniques to be covered in this book are introduced. A number of application areas of data mining are presented and several case studies from different application areas are briefly described. We also presented a list of some data mining software available and a list of issues that should be considered when purchasing data mining software.
REVIEW QUESTIONS
CHAPTER 2
ASSOCIATION RULES MINING
Learning Objectives
1. Explain what association rules mining is and present a naïve algorithm
2. Explain the basic terminology and the Apriori Algorithm
3. Provide examples of the Apriori Algorithm
4. Discuss the efficiency of the Apriori Algorithm and find ways to improve it
5. Discuss a number of more efficient algorithms
6. Summarize the major issues in association rules and get acquainted with a bibliography on this technique
2.1 INTRODUCTION
A huge amount of data is stored electronically in most enterprises. In particular, in all retail outlets the amount of data stored has grown enormously due to bar coding of all goods sold. As an extreme example presented earlier, Wal-Mart, with more than 4000 stores, collects about 20 million point-of-sale transaction records each day.
Analyzing a large database of supermarket transactions with the aim of finding association rules is called association rules mining or market basket analysis. It involves searching for interesting customer habits by looking at associations. Association rules mining has many applications other than market basket analysis, including applications in marketing, customer segmentation, medicine, electronic commerce, classification, clustering, web mining, bioinformatics and finance.
2.2 BASICS
Let us first describe the association rule task, and also define some of the terminology by using an
example of a small shop. We assume that the shop sells:
Bread Cheese Coffee
Juice Milk Tea
Biscuits Newspaper Sugar
We assume that the shopkeeper keeps records of what each customer purchases. Such records of ten
customers are given in Table 2.1. Each row in the table gives the set of items that one customer
bought.
Table 2.1 Transactions of the ten customers (only partially reproduced in this copy)
70    Bread, Cheese
80    Bread, Cheese, Juice, Coffee
90    Bread, Milk
The shopkeeper wants to find which products (call them items) are sold together frequently. If, for example, sugar and tea are two items that are sold together frequently, then the shopkeeper might consider having a sale on one of them in the hope that it will not only increase the sale of that item but also the sale of the other.

Confidence for X → Y is defined as the ratio of the support for X and Y together to the support for X. Therefore if X appears much more frequently than X and Y appear together, the confidence will be low. It does not depend on how frequently Y appears.
Given a large set of transactions, we seek a procedure to discover all association rules which have at least p% support with at least q% confidence, such that all rules satisfying these constraints are found and, of course, found efficiently.
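As a quick illustration of these two definitions, the following minimal Python sketch computes support and confidence directly from a list of transactions. It is our own illustration (the function names are ours), run here on the four purchases of Example 2.1 below.

    # Support of an itemset = fraction of transactions containing it;
    # confidence of X -> Y = support of X and Y together / support of X.
    transactions = [
        {"Bread", "Cheese"},
        {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"},
        {"Cheese", "Juice", "Milk"},
    ]

    def support(itemset):
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y):
        return support(set(x) | set(y)) / support(x)

    print(support({"Bread", "Cheese"}))        # 0.5, i.e. 50% support
    print(confidence({"Bread"}, {"Cheese"}))   # about 0.67, below a 75% threshold
    print(confidence({"Cheese"}, {"Bread"}))   # about 0.67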
Example 2.1 – A Naïve Algorithm
Let us consider a naïve brute-force algorithm to do the task. Consider the following example (Table 2.2), which is even simpler than what we considered earlier in Table 2.1. We now have only the four transactions given in Table 2.2, each transaction showing the purchases of one customer. We are interested in finding association rules with a minimum support of 50% and a minimum confidence of 75%.
Table 2,2 Transactions for Example 2.1
Transaction ID Items
Bread, Cheese
Bread, Cheese, Juice
Bread, Milk
Cheese, Juice, Milk
If we can list all the combinations of the items that we have in stock and find which of these combinations are frequent, then we can find the association rules that have the required confidence from these frequent combinations.
The four items and all the combinations of these four items and their frequencies of
occurrence in the transaction "database" in Table 2.2 are given in Table 2.3.
Given the required minimum support of 50%, we find the itemsets that occur in at least two transactions. Such itemsets are called frequent. The list of frequencies shows that all four items Bread, Cheese, Juice and Milk are frequent. The frequency goes down as we look at 2-itemsets, 3-itemsets and 4-itemsets.
The frequent itemsets are given in Table 2.4.

Table 2.4 The set of all frequent itemsets
Itemset           Frequency
Bread             3
Cheese            3
Juice             2
Milk              2
Bread, Cheese     2
Cheese, Juice     2
We can now proceed to determine if the two 2-itemsets (Bread, Cheese) and (Cheese, Juice) lead to association rules with the required confidence of 75%. Every 2-itemset (A, B) can lead to two rules A → B and B → A if both satisfy the required confidence. As defined earlier, the confidence of A → B is given by the support for A and B together divided by the support for A.
We therefore have four possible rules and their confidences as follows.
(Bread, Milk)     1
(Cheese, Juice)   2
(Juice, Milk)     1
We can now proceed as before. This would work better since the list of item combinations is reduced (from 15 to 11), and this reduction is likely to be much larger for bigger problems. Regardless of the extent of the reduction, this list will also become very large for, say, 1000 items.
Step 3: Scan all transactions and find all k-itemsets in Ck that are frequent. The frequent set so obtained is Lk. (For k = 2, C2 is the set of candidate pairs. The frequent pairs are L2.)
Terminate when no further frequent itemsets are found, otherwise continue with Step 2.
The main notation for association rules mining that is used in the Apriori algorithm is the following (a small code sketch using this notation is given after the list):
• A k-itemset is a set of k items.
• The set Ck is a set of candidate k-itemsets that are potentially frequent.
• The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.
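The following is a minimal Python sketch of this level-wise procedure. It is our own illustration of the Ck/Lk idea, not the book's implementation; it is run here on the four transactions of Table 2.2, and the function and variable names are ours.

    # Level-wise Apriori sketch: L1 from single items, then repeatedly build the
    # candidate set C_k from L_{k-1}, count support in one pass over the data,
    # and keep the frequent k-itemsets as L_k until no candidates survive.
    from itertools import combinations

    def apriori(transactions, minsup):
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})

        def freq(cands):
            counts = {c: sum(set(c) <= t for t in transactions) for c in cands}
            return {c: s for c, s in counts.items() if s / n >= minsup}

        frequent = {}
        level = freq([(i,) for i in items])          # L1
        k = 2
        while level:
            frequent.update(level)
            prev = set(level)
            # join L_{k-1} itemsets sharing their first k-2 items, then prune
            # candidates that have an infrequent (k-1)-subset
            cands = {tuple(sorted(set(a) | set(b))) for a in prev for b in prev
                     if a[:-1] == b[:-1] and a[-1] < b[-1]}
            cands = [c for c in cands
                     if all(tuple(s) in prev for s in combinations(c, k - 1))]
            level = freq(cands)
            k += 1
        return frequent

    transactions = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
                    {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
    # prints the frequent itemsets with their counts, e.g. ('Bread', 'Cheese'): 2
    print(apriori(transactions, minsup=0.5))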
It is now worthwhile to discuss the algorithmic aspects of the Apriori algorithm. Some of the issues that need to be considered are:
1. Computing L1: We scan the disk-resident database only once to obtain L1. An item vector of length equal to the number of items, with a count for each item, stored in the main memory may be used. Once the scan of the database is finished and the count for each item found, the items that meet the support criterion can be identified and L1 determined.
3. Pruning: Once a candidate set Ck has been produced, we can prune some of the candidate itemsets by checking that all subsets of every itemset in the set are frequent. For example, if we have derived {a, b, c} from {a, b} and {a, c}, then we check that {b, c} is also in L2. If it is not, {a, b, c} may be removed from C3. The task of such pruning becomes harder as the number of items in the itemsets grows, but the number of large itemsets tends to be small.
5. Transactions storage: We assume the data is too large to be stored in the main memory. Should it be stored as a set of transactions, each transaction being a list of items? Given that C2 is likely to be large, this testing must be done efficiently. In one scan, each transaction can be checked against the candidate pairs.
Transaction ID    Items
10                A, B, D
20                D, E, F
30                A, F
40                B, C, D
50                E, F
60                D, E, F
70                C, D, F
80                A, C, D, F
In the second horizontal representation (Table 2.8), rather than listing the items that were purchased, we may list all the items and indicate purchases by putting a 1 against each item that occurs in a transaction and a 0 against the rest. Each row is still a transaction, but the items purchased are represented by a binary string.
TID    A  B  C  D  E  F
10     1  1  0  1  0  0
20     0  0  0  1  1  1
30     1  0  0  0  0  1
40     0  1  1  1  0  0
50     0  0  0  0  1  1
60     0  0  0  1  1  1
70     0  0  1  1  0  1
80     1  0  1  1  0  1
In the third representation (Table 2.9), call it the vertical representation, the transaction list is turned around. Rather than using each row to represent a transaction of the items purchased, each row now represents an item and indicates the transactions in which the item appears. The columns now represent the transactions. This representation is also called a TID-list since for each item it provides a list of TIDs.
Item    TID: 10  20  30  40  50  60  70  80
A            1   0   1   0   0   0   0   1
B            1   0   0   1   0   0   0   0
C            0   0   0   1   0   0   1   1
D            1   1   0   1   0   1   1   1
E            0   1   0   0   1   1   0   0
F            0   1   1   0   1   1   1   1
How the data is represented can have an impact on the efficiency of an algorithm. A vertical representation can facilitate the counting of items by counting the number of 1s in each row and, for example, the number of occurrences of a 2-itemset can be counted by finding the intersection of the two TID lists, but the representation is not storage efficient if there is a very large number of transactions involved.
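A minimal Python sketch of this counting idea, using the TID lists of Table 2.9 (the function name is ours):

    # Vertical (TID-list) representation: each item maps to the set of
    # transaction IDs containing it. The support of a 2-itemset is the size of
    # the intersection of the two TID lists.
    tid_lists = {
        "A": {10, 30, 80},
        "B": {10, 40},
        "C": {40, 70, 80},
        "D": {10, 20, 40, 60, 70, 80},
        "E": {20, 50, 60},
        "F": {20, 30, 50, 60, 70, 80},
    }

    def pair_support(x, y):
        return len(tid_lists[x] & tid_lists[y])

    print(pair_support("D", "F"))   # 4 -> transactions 20, 60, 70 and 80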
Example 2.2 – A Simple Apriori Example
Let us first consider an example of only five transactions and six items. The example is similar to Example 2.1 in Table 2.2, but with two more items and another transaction added. We still want to find association rules with 50% support and 75% confidence. The transactions are given in Table 2.10.
Table 2.10 Transactions for Example 2.2
Transaction ID    Items
100               Bread, Cheese, Eggs, Juice
200               Bread, Cheese, Juice
300               Bread, Milk, Yogurt
400               Bread, Juice, Milk
500               Cheese, Juice, Milk

Scanning the database shows that Bread appears 4 times, Cheese 3 times, Juice 4 times, Milk 3 times, and Eggs and Yogurt only once. We require 50% support, so a frequent item must appear in at least three transactions. Therefore L1 is {Bread, Cheese, Juice, Milk}. The candidate 2-itemsets and their frequencies are:
Itemset             Frequency
(Bread, Cheese)     2
(Bread, Juice)      3
(Bread, Milk)       2
(Cheese, Juice)     3
(Cheese, Milk)      1
(Juice, Milk)       2
We therefore have only two frequent item pairs, which are {Bread, Juice} and {Cheese, Juice}. This is L2. From these two frequent 2-itemsets, we do not obtain a candidate 3-itemset since we do not have two 2-itemsets with the same first item.
The two frequent 2-itemsets above lead to the following possible rules:
Bread → Juice      Cheese → Juice
Juice → Cheese     Juice → Bread
The confidence of these rules is obtained by dividing the support for both items in the rule by the support for the item on the left-hand side of the rule. The confidences of the four rules are therefore 3/4 = 75%, 3/4 = 75%, 3/3 = 100%, and 3/4 = 75% respectively. Since all of them have at least the minimum 75% confidence, they all qualify.
The Apriori algorithm is resource intensive for large sets of transactions that have a large set of frequent items. The major reasons for this may be summarized as follows:
1. The number of candidate itemsets grows quickly and can result in huge candidate sets. For example, the size of the candidate sets, in particular C2, is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost for scanning the transaction database to find the frequent itemsets. Given that the early sets of candidate itemsets are very large, the initial iterations dominate the cost.
2. The Apriori algorithm requires many scans of the database. If n is the length of the longest frequent itemset, up to n + 1 passes over the data may be needed.
3. Many trivial rules (e.g., buying milk with Tic Tacs) are derived and it can often be difficult to extract the most interesting rules from all the rules derived. For example, one may wish to remove all the rules involving very frequently sold items.
4. Some rules can be inexplicable and very fine-grained, for example, that a toothbrush was the most frequently sold item on Thursday mornings.
5. Redundant rules are generated. For example, if A → B is a rule, then many of the further rules that can be derived from it convey little additional information.
6. The Apriori algorithm assumes sparseness, since the number of items in each transaction is small compared with the total number of items. The algorithm works better with sparsity. Some applications produce dense data which may also have many frequently occurring items.
A number of techniques for improving the performance of the Apriori algorithm have been suggested. They can be classified into four categories.
• Reduce the number of candidate itemsets. For example, use pruning to reduce the number of candidate 3-itemsets and, if necessary, larger itemsets.
• Reduce the number of transactions. This may involve scanning the transaction data after L1 has been computed and deleting all the transactions that do not have at least two frequent items. More transaction reduction may be done if the frequent 2-itemset set L2 is small.
• Reduce the number of comparisons. There may be no need to compare every candidate against every transaction if we use an appropriate data structure.
• Generate candidate sets efficiently. For example, it may be possible to compute Ck and from it compute Ck+1 rather than wait for Lk to be available. One could search for both k-itemsets and (k+1)-itemsets in one pass.
We now discuss a number of algorithms that use one or more of the above approaches to improve the Apriori algorithm. The last method, Frequent Pattern Growth, does not generate candidate itemsets and is not based on the Apriori algorithm.
1. Apriori-TID
2. Direct Hashing and Pruning (DHP)
3. Dynamic Itemset Counting (DIC)
4. Frequent Pattern Growth
2.6 APRIORI-TID
Step 1
First scan the entire database and obtain T1 by treating each item as a 1-itemset. This is given in Table 2.12.
Steps 2 and 3
The next step is to generate L1. This is generated with the help of T1; C2 is calculated as previously in the Apriori algorithm. See Table 2.13.
Table 2.13 The sets L1 and C2

L1:
Itemset      Support
{Bread}      4
{Cheese}     3
{Juice}      4
{Milk}       3

C2:
{B, C}, {B, J}, {B, M}, {C, J}, {C, M}, {J, M}

In Table 2.13, we have used the single letters B (Bread), C (Cheese), J (Juice) and M (Milk) for C2.
Step 4
The support for the itemsets in C2 is now calculated with the help of T1, instead of scanning the actual database as in the Apriori algorithm, and the result is shown in Table 2.14.

Table 2.14 Support counts for C2
Itemset    Support
{B, C}     2
{B, J}     3
{B, M}     2
{C, J}     3
{C, M}     1
{J, M}     2
Step 5
We now find T2 by using C2 and T1 as shown in Table 2.15.

Table 2.15 Transaction database T2
TID    Set-of-Itemsets
100    {{B, C}, {B, J}, {C, J}}
200    {{B, C}, {B, J}, {C, J}}
300    {{B, M}}
400    {{B, J}, {B, M}, {J, M}}
500    {{C, J}, {C, M}, {J, M}}
{B, J} and {C, J} are the frequent pairs and they make up L2. C3 may now be generated, but we find that C3 is empty. If it was not empty, we would have used it to find T3 with the help of the transaction set T2. That would result in a smaller T3. This is the end of this simple example.
The generation of association rules from the derived frequent itemsets can be done in the usual way. The main advantage of the Apriori-TID algorithm is that, for larger values of k, the size of Tk is usually smaller than the original transaction database and each entry in Tk may be smaller than the corresponding transaction. Since the support for each candidate k-itemset is counted with the help of the corresponding Tk, the algorithm is often faster than the basic Apriori algorithm.
It should be noted that both Apriori and Apriori-TID use the same candidate generation algorithm and therefore they count the same itemsets. Experiments have shown that the Apriori algorithm runs more efficiently during the earlier phases of the algorithm because, for small values of k, each entry in Tk may be larger than the corresponding entry in the transaction database.
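The counting step just described can be sketched in a few lines of Python. This is our own illustration, assuming the T1 and C2 of the example above; the function and variable names are ours.

    # Apriori-TID counting: support for each candidate 2-itemset is found from
    # T1 (each transaction stored as its frequent 1-itemsets) rather than by
    # rescanning the database; T2 keeps only the candidates each transaction has.
    def apriori_tid_step(t_prev, candidates):
        support = {c: 0 for c in candidates}
        t_next = {}
        for tid, itemsets in t_prev.items():
            items = set().union(*itemsets)          # items present in this transaction
            present = [c for c in candidates if set(c) <= items]
            for c in present:
                support[c] += 1
            if present:
                t_next[tid] = present
        return support, t_next

    t1 = {100: [{"B"}, {"C"}, {"J"}], 200: [{"B"}, {"C"}, {"J"}],
          300: [{"B"}, {"M"}], 400: [{"B"}, {"J"}, {"M"}], 500: [{"C"}, {"J"}, {"M"}]}
    c2 = [("B", "C"), ("B", "J"), ("B", "M"), ("C", "J"), ("C", "M"), ("J", "M")]
    support, t2 = apriori_tid_step(t1, c2)
    print(support)   # ("B", "J") -> 3, ("C", "J") -> 3, as in Table 2.14
    print(t2[400])   # [("B", "J"), ("B", "M"), ("J", "M")], as in Table 2.15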
2.7 DIRECT HASHING AND PRUNING (DHP)
This algorithm proposes overcoming some of the weaknesses of the Apriori algorithm by reducing the number of candidate k-itemsets, in particular the 2-itemsets, since that is the key to improving performance. Also, as noted earlier, as k increases, not only is there a smaller number of frequent k-itemsets but there are fewer transactions containing these itemsets. Thus it should not be necessary to scan the whole transaction database as k becomes larger than 2.
The direct hashing and pruning (DHP) algorithm claims to be efficient in the generation of frequent itemsets and effective in trimming the transaction database by discarding items from the transactions or removing whole transactions that do not need to be scanned. The algorithm uses a hash-based technique to reduce the number of candidate itemsets generated in the first pass (that is, a significantly smaller C2 is constructed). It is claimed that the size of C2 generated using DHP can be orders of magnitude smaller, so that the scan required to determine L2 is much more efficient.
The algorithm may be divided into the following three parts. The first part finds all the frequent 1-itemsets and all the candidate 2-itemsets. The second part is the more general part including hashing and the third part is without the hashing. Both the second and third parts include pruning; Part 2 is used for early iterations and Part 3 for later iterations.
Part 1 – Essentially the algorithm goes through each transaction counting all the 1-itemsets. At the same time all the possible 2-itemsets in the current transaction are hashed to a hash table. The algorithm uses the hash table in the next pass to reduce the number of candidate itemsets. Each bucket in the hash table has a count, which is increased by one each time an itemset is hashed to that bucket. Collisions can occur when different itemsets are hashed to the same bucket. A bit vector is associated with the hash table to provide a flag for each bucket. If the bucket count is equal to or above the minimum support count, the corresponding flag in the bit vector is set to 1, otherwise it is set to 0.
Part 2 – This part has two phases. In the first phase, Ck is generated. In the Apriori algorithm Ck is generated by Lk-1 × Lk-1, but the DHP algorithm uses the hash table to reduce the number of candidate itemsets in Ck. An itemset is included in Ck only if the corresponding bit in the hash table bit vector has been set, that is, the number of itemsets hashed to that location is at least the minimum support count. Although having the corresponding bit vector bit set does not guarantee that the itemset is frequent, due to collisions, the hash table filtering does reduce Ck, which is stored in a hash tree and used to count the support for each itemset in the second phase of this part.
In the second phase, the hash table for the next step is generated. Both in the support counting and when the hash table is generated, pruning of the database is carried out. Only itemsets that are important to future steps are kept in the database. A k-itemset is not considered useful in a frequent (k+1)-itemset unless it appears at least k times in a transaction. The pruning not only trims each transaction by removing the unwanted itemsets but also removes transactions that have no itemsets that could be frequent.
Part 3 – The third part of the algorithm continues until there are no more candidate itemsets. Instead of using a hash table to find the frequent itemsets, the transaction database is now scanned to find the support count for each itemset. The dataset is likely to be significantly smaller now because of the pruning. When the support count is established, the algorithm determines the frequent itemsets as before by checking against the minimum support. The algorithm then generates candidate itemsets as the Apriori algorithm does.
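The hash-filtering idea of Parts 1 and 2 can be sketched as follows. This is only an illustration under our own assumptions: the hash function here is a generic one (the text's own digit-based modulo-8 function is only partly reproduced in this copy), and the function names and data are ours.

    # DHP hash-filter sketch: while counting 1-itemsets, hash every possible
    # pair in each transaction to a small table; a candidate pair generated
    # from L1 is kept only if its bucket count reaches the minimum support.
    from itertools import combinations

    def dhp_filter(transactions, l1, min_count, table_size=8):
        buckets = [0] * table_size

        def bucket(pair):
            # illustrative hash function, not the one used in the text
            return hash(frozenset(pair)) % table_size

        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[bucket(pair)] += 1

        bit_vector = [count >= min_count for count in buckets]
        # truly frequent pairs always survive; some infrequent pairs may also
        # survive because of collisions, exactly as the text notes
        return [p for p in combinations(sorted(l1), 2) if bit_vector[bucket(p)]]

    transactions = [{"B", "C", "E", "J"}, {"B", "C", "J"}, {"B", "M", "Y"},
                    {"B", "J", "M"}, {"C", "J", "M"}]
    print(dhp_filter(transactions, l1={"B", "C", "J", "M"}, min_count=3))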
Example 2.4 – DHP Algorithm
We now use an example to illustrate the DHP algorithm. The transaction database is the same as we used in Example 2.2. We want to find association rules that satisfy 50% support and 75% confidence. Table 2.16 presents the transaction database and Table 2.17 presents the possible 2-itemsets for each transaction.
We will use letters B(Bread), C(Cheese), E(Egg), J(Juice), M(Milk) and Y(Yogurt) in Tables
2.17 to 2.19
Table 2.17 Possible 2-itemsets
The possible 2-itemsets in Table 2.17 are now hashed to a hash table. The last column shown in Table 2.17 is not required in the hash table, but we have included it for the purpose of explaining the technique.
Assume a hash table of size 8; using the very simple hash function described below leads to the hash table shown in Table 2.18.
Table 2.18 Hash table for 2-itemsets
. The two digits are then coded as a modulo 8 number (dividing by 8 and using the remainder).
This is the bucket address.
For a support of 50%, the frequent items are B, C, J, and M. This is L1, which leads to a C2 of (B, C), (B, J), (B, M), (C, J), (C, M) and (J, M). These candidate pairs are then hashed to the hash table and the pairs that hash to locations where the bit vector bit is not set are removed. Table 2.19 shows that (B, C) and (C, M) can be removed from C2. We are therefore left with the four candidate item pairs, or the reduced C2, given in the last column of the hash table in Table 2.19. We now look at the transaction database and modify it to include only these candidate pairs (Table 2.19).
It is now necessary to count the support for each pair and, while doing so, we further trim the database by removing items and deleting transactions that will not appear in frequent 3-itemsets. The frequent pairs are (B, J) and (C, J). A candidate 3-itemset must have two pairs with the first item being the same. Only transaction 400 qualifies since it has the candidate pairs (B, J) and (B, M). The others can therefore be deleted and the transaction database now looks like Table 2.20.
Table 2.20 Reduced transaction database
400    (B, J, M)
In this simple example we can now conclude that (B, J, M) is the only potential frequent 3-itemset, but it cannot qualify since transaction 400 does not have the pair (J, M), and the pairs (J, M) and (B, M) are not frequent pairs. That concludes this example.
2.8 DYNAMIC ITEMSET COUNTING (DIC)
The Apriori algorithm must do as many scans of the transaction database as the number of items in the last candidate itemset that was checked for its support. The Dynamic Itemset Counting (DIC) algorithm reduces the number of scans required by not just doing one scan for the frequent 1-itemsets and another for the frequent 2-itemsets, but by combining the counting for a number of itemsets as soon as it appears that it might be necessary to count them.
The basic algorithm is as follows:
1. Divide the transaction database into a number of partitions, say q.
2. Start counting the 1-itemsets in the first partition of the transaction database.
3. At the beginning of the second partition, continue counting the 1-itemsets but also start counting the 2-itemsets using the frequent 1-itemsets from the first partition.
4. At the beginning of the third partition, continue counting the 1-itemsets and the 2-itemsets but also start counting the 3-itemsets using results from the first two partitions.
5. Continue like this until the whole database has been scanned once. We now have the final set of frequent 1-itemsets.
6. Go back to the beginning of the transaction database and continue counting the 2-itemsets and the 3-itemsets.
7. At the end of the first partition in the second scan of the database, we have scanned the whole database for 2-itemsets and thus have the final set of frequent 2-itemsets.
8. Continue the process in a similar way until no frequent k-itemsets are found.
The DIC algorithm works well when the data is relatively homogeneous throughout the file, since it starts the 2-itemset count before having a final 1-itemset count. If the data distribution is not homogeneous, the algorithm may not identify an itemset to be large until most of the database has been scanned. In such cases it may be possible to randomize the order of the transaction data, although this is not always possible. Essentially, DIC attempts to finish the itemset counting in two scans of the database while Apriori would often take three or more scans.
2.9 MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION (FP-GROWTH)
The algorithm uses an approach that is different from that used by methods based on the Apriori algorithm. The major difference between frequent pattern-growth (FP-growth) and the other algorithms is that FP-growth does not generate candidate itemsets at all. In contrast, the Apriori algorithm generates the candidate itemsets and then tests them.
The motivation for the FP-tree method is as follows:
• Only the frequent items are needed to find the association rules, so it is best to find the frequent items and ignore the others.
• If the frequent items can be stored in a compact structure, then the original transaction database does not need to be used repeatedly.
• If multiple transactions share a set of frequent items, it may be possible to merge the shared sets, with the number of occurrences registered as a count.
To be able to do this, the algorithm involves generating a frequent pattem tree (FP-tree).
Generating FP-trees
The algorithm works as follows (a small code sketch is given after the list of steps):
1. Scan the transaction database once, as in the Apriori algorithm, to find all the frequent items and their support.
33
6. Get the next transaction from the transaction database. Remove all non-frequent items and list
the remaining items according to the order in the sorted frequent items.
7. Insert the transaction in the tree using any common prefix that may appear. Increase the item counts.
8. Continue with step 6 until all transactions in the database are processed.
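The following is a minimal Python sketch of this construction, written by us as an illustration (the node structure, names and sample data are ours, not the book's). It is run on the transactions of Example 2.5 below.

    # Minimal FP-tree construction: count item frequencies, keep the frequent
    # items, order each transaction by descending overall frequency, then insert
    # it into a prefix tree, sharing common prefixes and incrementing counts.
    from collections import Counter

    class Node:
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent, self.children = item, 0, parent, {}

    def build_fp_tree(transactions, min_count):
        freq = Counter(item for t in transactions for item in t)
        frequent = {i for i, c in freq.items() if c >= min_count}
        root = Node(None)
        for t in transactions:
            ordered = sorted((i for i in t if i in frequent),
                             key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                node = node.children.setdefault(item, Node(item, node))
                node.count += 1
        return root, freq

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    transactions = [{"Bread", "Cheese", "Eggs", "Juice"}, {"Bread", "Cheese", "Juice"},
                    {"Bread", "Milk", "Yogurt"}, {"Bread", "Juice", "Milk"},
                    {"Cheese", "Juice", "Milk"}]
    root, _ = build_fp_tree(transactions, min_count=3)
    show(root)   # Bread:4 (Juice:3 -> Cheese:2, Milk:1; Milk:1) and Juice:1 -> Cheese:1 -> Milk:1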
Let us see one example.
The minimum support required is 50% and confidence is 75%.
Table 2.21 Transaction database for Example 2.5
Transaction ID    Items
100               Bread, Cheese, Eggs, Juice
200               Bread, Cheese, Juice
300               Bread, Milk, Yogurt
400               Bread, Juice, Milk
500               Cheese, Juice, Milk
The frequent items sorted by their frequency are shown in Table 2.22.

Table 2.22 Frequent items for the database in Table 2.21
Item      Frequency
Bread     4
Juice     4
Cheese    3
Milk      3

Now we remove the items that are not frequent from the transactions and order the remaining items according to their frequency, as in the table above.
Table 2.23 Database after removing the non-frequent items and reordering
Transaction ID    Items
100               Bread, Juice, Cheese
200               Bread, Juice, Cheese
300               Bread, Milk
400               Bread, Juice, Milk
500               Juice, Cheese, Milk
[Figure: the FP-tree built from Table 2.23 is not reproduced in this copy.]
Following the node links for the least frequent item, M, we find the patterns:
BM(1)
BJM(1)
JCM(1)
No frequent itemset is discovered from these since no itemset appears three times.
Next we look at C and find the following:
BJC(2)
JC(1)
These two patterns give us a frequent itemset JC(3). Looking at J, the next frequent item in the table, we obtain:
BJ(3)
J(1)
Again we obtain a frequent itemset, BJ(3). There is no need to follow links from item B as there are
no other frequent itemsets.
The process above may be represented by the "conditional" trees for M, C and J in Figures 2.4, 2.5
and 2.6 respectively.
The performance evaluation also included algorithms for closed itemset mining as well as for maximal itemset mining. The performance evaluation in 2004 found an implementation of an algorithm that involves a tree traversal to be the most efficient algorithm for finding frequent, frequent closed and maximal frequent itemsets.
Packages like Clementine and IBM Intelligent Miner include comprehensive association rule mining software. We present some software designed for association rules.
Apriori, FP-growth, Eclat and DIC implementations by Bart Goethals. The algorithms generate all frequent itemsets for a given minimal support threshold and for a given minimal confidence threshold (free). For detailed particulars visit:
http://www.adrem.ua.ac.be/~goethals/software/index.html
ARtool has also been developed at UMass/Boston. It offers a collection of algorithms and tools for the mining of association rules in binary databases. It is distributed under the GNU General Public License. For more information visit:
http://www.cs.umb.edu/~laur/ARtool
DMII (Data Mining II) association rule software from NUS Singapore. For more information visit:
http://www.comp.nus.edu.sg/~dm2
FIMI, the Frequent Itemset Mining Implementations repository, is the result of the workshops on Frequent Itemset Mining Implementations, FIMI'03 and FIMI'04, which took place at IEEE ICDM'03 and IEEE ICDM'04 respectively. For more information visit:
http://fimi.cs.helsinki.fi/
CONCLUSION
This chapter introduced the association rules mining problem and presented the classical
Apriori algorithm. Association rule mining is an interesting problem with many applications. The
algorithm used is conceptually simple and the resulting rules are clear and understandable. The
algorithms work on data of variable length.
The process of mining association rules can be made more efficient than the Apriori
algorithm. Some of the proposed algorithms make changes to the existing Apriori algorithm like
Apriori-TlD and DHP, while others present completely new solutions like FP-growth. In
performance evaluation of many algorithms it is becoming clear that tree-based algorithms perform
the best.
It should be noted that lowering the support threshold results in many more frequent itemsets and often increases the number of candidate itemsets and the maximum length of frequent itemsets, resulting in cost increases. Also, in the case of denser datasets (i.e. those with many items in each transaction) the maximum length of frequent itemsets is often higher and therefore the cost of finding the association rules is higher.
REVIEW QUESTIONS
1. Define support and confidence for an association rule.
2. Define lift. What is the relation between support, confidence and lift of an association rule X → Y?
3. Prove that all nonempty subsets of a frequent itemset must also be frequent.
4. The efficiency of the Apriori method for association rules may be improved by using any of the following techniques:
• Pruning
• Transaction reduction
• Partitioning
• Sampling
Explain two of these approaches using the grocery data.
5. Explain how the hashing method DHP works. Estimate how much work will be needed to compute association rules compared to Apriori. Make suitable assumptions.
6. Which step of the Apriori algorithm is the most expensive? Explain the reasons for your
answer.
CHAPTER 3
CLASSIFICATION
Learning Objectives
1. Explain the concept of classification
2. Describe the Decision Tree method
3. Describe the Naive Bayes method
4. Discuss accuracy of classification methods and how accuracy may be improved
3.1 INTRODUCTION
Classification is a classical problem extensively studied by statisticians and machine learning researchers. The word classification is difficult to define precisely. According to one definition, classification is the separation or ordering of objects (or things) into classes. If the classes are created without looking at the data (non-empirically), the classification is called apriori classification. If, however, the classes are created empirically (by looking at the data), the classification is called posteriori classification. In most literature on classification it is assumed that the classes have been determined apriori and classification then consists of training the system so that when a new object is presented to the trained system it is able to assign the object to one of the existing classes. This approach is also called supervised learning.
Data mining has generated renewed interest in classification. Since the datasets in data mining are often large, new classification techniques have been developed to deal with millions of objects having perhaps dozens or even hundreds of attributes.
Let us imagine that we wish to classify Australian animals. We have some training data in Table 3.1 which has already been classified. We want to build a model based on this data.
As an example of a decision tree, we show in Figure 3.1 a possible result of classifying the data in Table 3.1.
Normally, the complexity of a decision tree increases as the number of attributes increases, although in some situations it has been found that only a small number of attributes determine the class to which an object belongs and the rest of the attributes have little or no impact.
In the decision tree algorithm, decisions are made locally and the algorithm at no stage tries to find a
globally optimum tree.
One of the techniques for selecting an attribute to split a node is based on the concept of information theory or entropy. The concept is simple, although often quite difficult for many to understand. It is based on Claude Shannon's idea that if you have uncertainty then you have information, and if there is no uncertainty there is no information. For example, if a coin has a head on both sides, then the result of tossing it does not produce any information, but if a coin is normal with a head and a tail then the result of the toss provides information.
The logarithm of 1 is always zero whatever the base, the log of any number greater than 1 is always positive, and the log of any number smaller than 1 is always negative. Also,
log2(2) = 1
log2(2^n) = n
log2(1/2) = -1
log2(1/2^n) = -n
The information of any event that can have several possible outcomes is given by
I = Σi (-pi log pi)
Consider an event that can have one of two possible values. Let the probabilities of the two values be p1 and p2. Obviously if p1 is 1 and p2 is zero, then there is no information in the outcome and I = 0. If p1 = p2 = 0.5, then the information is
I = -0.5 log(0.5) - 0.5 log(0.5)
This comes out to 1.0 (using log base 2), which is the maximum information that you can have for an event with two possible outcomes. This is also called entropy and is in effect a measure of the minimum number of bits required to encode the information.
If we consider the case of a die (singular of dice) with six possible outcomes with equal probability, then the information is given by:
I = 6 × (-(1/6) log(1/6)) = 2.585
Therefore three bits are required to represent the outcome of rolling a die. Of course, if the die was loaded so that there was a 50% or a 75% chance of getting a 6, then the information content of rolling the die would be lower, as given below. Note that we assume that the probability of getting any of 1 to 5 is equal (that is, equal to 10% for the 50% case and 5% for the 75% case).
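These figures are easy to check with a few lines of Python (a small sketch of ours, using log base 2):

    # Information (entropy) of an event with the given outcome probabilities,
    # reproducing the coin and die figures above.
    import math

    def information(probabilities):
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(information([0.5, 0.5]))          # 1.0 bit for a fair coin
    print(information([1/6] * 6))           # 2.585 bits for a fair die
    # loaded die: 50% chance of a six, the other five faces 10% each
    print(information([0.5] + [0.1] * 5))   # about 2.16 bits, lower than 2.585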
Assume there are two classes, P and N, and let the set of training data S (with a total number of objects s) contain p elements of class P and n elements of class N. The amount of information is defined as
I = -(n/s) log(n/s) - (p/s) log(p/s)
Obviously if p = n, I is equal to 1 and if p = s then I = 0. Therefore if there was an attribute for which almost all the objects had the same value (for example, gender when most people are male), using the attribute would lead to no information gain (that is, not reduce uncertainty) because for gender = female there will be almost no objects while gender = male would have almost all the objects, even before we knew the gender. If, on the other hand, an attribute divided the training sample such that gender = female resulted in objects that all belong to Class A and gender = male all belong to Class B, then uncertainty has been reduced to zero and we have a large information gain.
We define the information gain for sample S using attribute A as follows:
Gain(S, A) = I - Σ i∈values(A) (ti/s) Ii
I is the information before the split and Σ i∈values(A) (ti/s) Ii is the sum of the information after the split, where Ii is the information of node i and ti is the number of objects in node i. Thus the total information after the split is the weighted sum of the information of each node.
Once we have computed the information gain for every remaining attribute, the attribute with
the highest information gain is selected.
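A small Python sketch of this computation is given below. It is our own illustration; the class counts used in the demonstration are those of the "Married" attribute in Example 3.1, which follows.

    # Information gain for a split: information of the class distribution before
    # the split minus the weighted information of each branch after the split.
    import math

    def info(class_counts):
        total = sum(class_counts)
        return -sum((c / total) * math.log2(c / total) for c in class_counts if c)

    def information_gain(before, branches):
        total = sum(before)
        after = sum(sum(b) / total * info(b) for b in branches)
        return info(before) - after

    # class counts (A, B, C) before the split and for the two values of the
    # "Married" attribute in Example 3.1: Yes -> (0, 1, 4), No -> (3, 2, 0)
    print(info([3, 3, 4]))                                      # about 1.57
    print(information_gain([3, 3, 4], [[0, 1, 4], [3, 2, 0]]))  # about 0.72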
Example 3.1 - Using the Information Measure
We consider an artificial example of building a decision tree classification model to classify bank
loan applications by assigning applications to one of three risk classes (Table 3.2).
Table 3.2 Training data for Example 3.1
There are 10 (s = 10) samples and three classes. The frequencies of these classes are:
A: 3
B: 3
C: 4
The information in the data, due to uncertainty of outcome regarding the risk class each person belongs to, is given by
I = -(3/10) log(3/10) - (3/10) log(3/10) - (4/10) log(4/10) = 1.57
2. Attribute "Married"
There are five applicants who are married and five who are not.
Value = Yes has A = 0, B = 1, C = 4, total 5
Value = No has A = 3, B = 2, C = 0, total 5
Looking at the values above, it appears that this attribute will reduce the uncertainty by more than the last attribute. Computing the information gain by using this attribute, we have
I(y) = -(1/5) log(1/5) - (4/5) log(4/5) = 0.722
I(n) = -(3/5) log(3/5) - (2/5) log(2/5) = 0.971
Information of the subtrees = 0.5 I(y) + 0.5 I(n) = 0.846
3. Attribute "Gender"
There are three applicants who are male and seven who are female.
Value = Male has A = 0, B = 3, C = 0, total 3
Value = Female has A = 3, B = 0, C = 4, total 7
I(Male) = 0 and I(Female) = -(3/7) log(3/7) - (4/7) log(4/7) = 0.99, so the total information of the subtrees = 0.3 I(Male) + 0.7 I(Female) = 0.69
4. Attribute "Employed"
There are eight applicants who are employed and two who are not.
Value = Yes has A = 3, B = 1, C = 4, total 8
Value = No has A = 0, B = 2, C = 0, total 2
The values above show that this attribute will reduce uncertainty, but most attribute values are Yes while the No value leads to only one class. Computing the information gain by using this attribute, we have
I(y) = -(3/8) log(3/8) - (1/8) log(1/8) - (4/8) log(4/8) = 1.41
I(n) = 0
Total information of the subtrees = 0.8 I(y) + 0.2 I(n) = 1.12
5. Attribute "Credit Rating"
There are five applicants who have credit rating A and five that have B.
Value : A has A = 2, B = I, C : 2,total 5
Value=BhasA = 1,8=2,C:2,total 5
Looking at the values above, we can see that this is like the first attribute that does not reduce
uncertainty by much. The information gain for this attribute is the same as for the first attribute.
(A): -(2/s) tog(z/s)-(1/s) log(l/5) - \2/5)log\2/51: t.s2
(B) = -( 1 /s) loe( I /s) - (2 t s) to g(2/ s) - (2/ s) log(2 t 5) : 1 .s2
Total information of the subtrees = 0.5I(A) + 0.5(B) = 1.52
The values for information gain can now be computed. See Table 3.3.
Table 3.3 Information gain for the five attributes
Potential split attribute    Information before split    Information after split    Information gain
Owns Home                    1.57                        1.52                       0.05
Married                      1.57                        0.85                       0.72
Gender                       1.57                        0.69                       0.88
Employed                     1.57                        1.12                       0.45
Credit Rating                1.57                        1.52                       0.05
Hence the largest information gain is provided by the attribute "Gender" and that is the attribute that
is used for the split.
Now we can reduce the data by removing the attribute Gender and removing Class B, since all Class B objects have Gender = Male. See Table 3.4.
Table 3.4 Training data remaining after the split on Gender
Owns Home    Married    Employed    Credit Rating    Risk Class
No           No         Yes         A                A
Yes          Yes        Yes         B                C
No           Yes        Yes         B                C
No           No         Yes         B                A
Yes          No         Yes         A                A
No           Yes        Yes         A                C
Yes          Yes        Yes         A                C
The information in this data of two classes, due to the uncertainty of outcome regarding the class each person belongs to, is given by
I = -(3/7) log(3/7) - (4/7) log(4/7) = 0.985
Let us now consider each attribute in turn as a candidate to split the sample.
1. Attribute "Owns Home"
Value = Yes. There are three applicants who own their home. They are in classes A = 1 and C = 2.
Value = No. There are four applicants who do not own their home. They are in classes A = 2 and C = 2.
Given the above values, it does not appear as if this attribute will reduce the uncertainty by much. Computing the information for each of these two subtrees,
I(Yes) = I(y) = -(1/3) log(1/3) - (2/3) log(2/3) = 0.92
I(No) = I(n) = -(2/4) log(2/4) - (2/4) log(2/4) = 1.00
Total information of the two subtrees = (3/7) I(y) + (4/7) I(n) = 0.96
2. Attribute "Married"
There are four applicants who are married and three who are not.
Value = Yes has A = 0, C = 4, total 4
Value = No has A = 3, C = 0, total 3
Looking at the values above, it appears that this attribute will reduce the uncertainty by much more than the last attribute, since for each value the persons belong to only one class and therefore the information is zero. Computing the information by using this attribute, we have
I(y) = -(4/4) log(4/4) = 0.00
I(n) = -(3/3) log(3/3) = 0.00
Information of the subtrees = 0.00
There is no need to consider other attributes now, since no other attribute can be better. The split attribute therefore is "Married" and we now obtain the decision tree in Figure 3.2, which concludes this very simple example. It should be noted that a number of attributes that were available in the data were not required in classifying it.
[Figure: the Lorenz curve. The area between the diagonal line of perfect distribution and the Lorenz curve is the Gini Index; both axes run from 0 to 100%.]
[Figure 3.2: the decision tree for Example 3.1, splitting first on Gender and then on Married. Table 3.2 attributes: Owns Home, Married, Gender, Employed, Credit Rating, Risk Class.]
There are 10 (s = 10) samples and three classes. The frequencies of these classes are:
A = 3
B = 3
C = 4
The Gini index for the distribution of applicants in the three classes is
G = 1 - (3/10)^2 - (3/10)^2 - (4/10)^2 = 0.66
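The same computation can be expressed as a small helper (illustrative Python, not taken from the text): gini([3, 3, 4]) returns 0.66, and the weighted Gini of a split is obtained from the per-branch class counts.

    def gini(counts):
        """Gini index of a class distribution given as a list of class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def gini_after_split(groups):
        """Weighted Gini index after a split; `groups` is a list of per-branch
        class-count lists, e.g. [[1, 2, 2], [2, 1, 2]] -> 0.64."""
        total = sum(sum(g) for g in groups)
        return sum((sum(g) / total) * gini(g) for g in groups)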
Let us now consider using each of the attributes to split the sample.
1. Attribute "Owns Home"
Value = Yes. There are five applicants who own their home. They are in classes A = 1, B = 2, C = 2.
Value = No. There are five applicants who do not own their home. They are in classes A = 2, B = 1, C = 2.
Using this attribute will divide the objects into those who own their home and those who do not. Computing the Gini index for each of these two subtrees,
G(y) = 1 - (1/5)^2 - (2/5)^2 - (2/5)^2 = 0.64
G(n) = G(y) = 0.64
Total value of the Gini Index = G = 0.5 G(y) + 0.5 G(n) = 0.64
2. Attribute "Married"
There are five applicants who are married and five who are not.
Value = Yes has A = 0, B = 1, C = 4, total 5
Value = No has A = 3, B = 2, C = 0, total 5
Looking at the values above, it appears that this attribute will reduce the uncertainty by more than the last attribute. Computing the Gini index by using this attribute, we have
G(y) = 1 - (1/5)^2 - (4/5)^2 = 0.32
G(n) = 1 - (3/5)^2 - (2/5)^2 = 0.48
Total value of the Gini Index = G = 0.5 G(y) + 0.5 G(n) = 0.40
3. Attribute "Gender"
There are three applicants who are male and seven who are female.
Value = Male has A = 0, B = 3, C = 0, total 3
A subtree may be pruned, and replaced by a leaf, when it can be established that the expected error rate in the subtree is greater than that in the single leaf. This makes the classifier simpler. A simpler model has less chance of introducing inconsistencies, ambiguities and redundancies.
Pruning is a technique for making an overfitted decision tree simpler and more general.
There are a number of techniques for pruning a decision tree by removing some splits and the subtrees created by them. One approach involves removing branches from a "fully grown" tree to obtain a sequence of progressively pruned trees. The accuracy of these trees is then computed and a pruned tree that is accurate enough and simple enough is selected. It is advisable to use a set of data different from the training data to decide which is the "best pruned tree".
Another approach is called pre-pruning, in which tree construction is halted early. Essentially, a node is not split if this would result in the goodness measure of the tree falling below a threshold. It is, however, quite difficult to choose an appropriate threshold.
There are a number of advantages in converting a decision tree to rules. Decision rules make it easier to make pruning decisions, since it is easier to see the context of each rule. Converting to rules also removes the distinction between attribute tests that occur near the root of the tree and those that occur further down, and the rules are easier for people to understand.
IF-THEN rules may be derived based on the various paths from the root to the leaf nodes. Although this simple approach leads to as many rules as there are leaf nodes, rules can often be combined to produce a smaller set of rules. For example, the tree of Figure 3.2 gives:
IF Gender = Male THEN Risk Class = B
IF Gender = Female AND Married = Yes THEN Risk Class = C
IF Gender = Female AND Married = No THEN Risk Class = A
Bayesian classification is quite different from the decision tree approach. In Bayesian classification we have a hypothesis that the given data belongs to a particular class. We then calculate the probability of the hypothesis being true. This is among the most practical approaches for certain types of problems. The approach requires only one scan of the whole data. Also, if at some stage there are additional training data, then each training example can incrementally increase or decrease the probability that a hypothesis is correct.
Before we define the Bayes theorem, we will define some notation. The expression P(A) refers to the probability that event A will occur. P(A|B) stands for the probability that event A will happen given that event B has already happened; in other words, P(A|B) is the conditional probability of A based on the condition that B has already happened. For example, A and B may be the events of passing course A and passing another course B respectively. P(A|B) then is the probability of passing A when we know that B has been passed.
P(Ci|X) = [P(X|Ci) P(Ci)] / P(X)
• P(Ci|X) is the probability of the object X belonging to class Ci.
• P(X|Ci) is the probability of obtaining attribute values X if we know that the object belongs to class Ci.
• P(Ci) is the probability of any object belonging to class Ci without any other information.
• P(X) is the probability of obtaining attribute values X whatever class the object belongs to.
Given the attribute values X, what probabilities in the formula can we compute? The probabilities we need to compute are P(X|Ci), P(Ci) and P(X). Actually the denominator P(X) is independent of Ci and is not required to be known, since we are interested only in comparing the probabilities P(Ci|X). Therefore we only need to compute P(X|Ci) and P(Ci) for each class. Computing P(Ci) is rather easy, since we count the number of instances of each class in the training data and divide each by the total number of instances. This may not be the most accurate estimate of P(Ci), but we have very little information, just the training sample, and no other information with which to obtain a better estimate. The estimate will be reasonable if the training sample is large and randomly chosen.
To compute P(X|Ci) we use a naive approach (that is why it is called the Naive Bayes model) by assuming that all attributes of X are independent, which is often not true.
Using the independence-of-attributes assumption, and based on the training data, we compute an estimate of the probability of obtaining the data X by estimating the probability of each of the attribute values, counting the frequency of those values for class Ci.
We then determine the class allocation of X by computing [P(X|Ci) P(Ci)] for each of the classes and allocating X to the class with the largest value.
The beauty of the Bayesian approach is that the probability of the dependent attribute can be estimated by computing estimates of the probabilities of the independent attributes.
We should also note that it is possible to use this approach even if the values of all the independent attributes are not known, since we can still estimate the probabilities of the attribute values that we do know. This is a significant advantage of the Bayesian approach.
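The counting described above can be sketched as follows (plain Python; the record layout, a dict of attribute values per object, is our own illustration). Training simply tallies relative frequencies; classification multiplies the estimated P(X|Ci) by P(Ci) and picks the largest product.

    from collections import Counter, defaultdict

    def train_naive_bayes(records, labels):
        """records: list of dicts (attribute -> value); labels: one class per record."""
        class_counts = Counter(labels)
        priors = {c: n / len(labels) for c, n in class_counts.items()}
        cond = defaultdict(float)          # (class, attribute, value) -> P(value | class)
        for rec, c in zip(records, labels):
            for attr, val in rec.items():
                cond[(c, attr, val)] += 1.0 / class_counts[c]
        return priors, cond

    def classify(x, priors, cond):
        """Allocate x to the class with the largest P(X|Ci) P(Ci)."""
        scores = {}
        for c, prior in priors.items():
            p = prior
            for attr, val in x.items():
                p *= cond.get((c, attr, val), 0.0)   # naive independence assumption
            scores[c] = p
        return max(scores, key=scores.get)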
These probabilities are given in Table 3.5. We order the data by risk class to make it convenient. Given the estimates of the probabilities in Table 3.5, we can compute the posterior probabilities as
P(X|A) = 2/9
P(X|B) = 0
P(X|C) = 0
Table 3.5 Probability of events in the Naive Bayes method
Therefore the values of P(X|Ci) P(Ci) are zero for Classes B and C and 0.3 x 2/9 = 0.067 for Class A. Therefore X is assigned to Class A. It is unfortunate that in this example two of the probabilities came out to be zero. This is most unlikely in practice.
Bayes' theorem assumes that all attributes are independent and that the training sample is a good sample for estimating probabilities. These assumptions are not always true in practice, as attributes are often correlated, but in spite of this the Naive Bayes method performs reasonably well. Other techniques have been designed to overcome this limitation; one approach is to use Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.
The quality of a classification method is finally judged by its accuracy in classifying unseen data. Estimating the accuracy of a supervised classification method can be difficult if only the training data is available and all of that data has been used in building the model. In such situations, overoptimistic predictions are often made regarding the accuracy of the model.
1. Holdout Method
The holdout method (sometimes called the test sample method) requires a training set and a test set. The sets are mutually exclusive. It may be that only one dataset is available, which is then divided into two subsets (perhaps 2/3 and 1/3), the training subset and the test or holdout subset. Once the classification method produces the model using the training set, the test set can be used to estimate the accuracy. Interesting questions arise in this estimation, since a larger training set would produce a better classifier while a larger test set would produce a better estimate of the accuracy; a balance must be achieved. Since none of the test data is used in training, the estimate is not biased, but a good estimate is obtained only if the test set and the training set are large enough and representative of the whole population.
2. Random Sub-sampling
Random sub-sampling is very much like the holdout method except that it does not rely on a single test set. Essentially, the holdout estimation is repeated several times and the accuracy estimate is obtained by computing the mean of the several trials. Random sub-sampling is likely to produce better error estimates than the holdout method.
4. Leave-one-out Method
Leave-one-out is a simpler version of k-fold cross-validation. In this method, one of the training samples is taken out and the model is generated using the remaining training data. Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not. The average of such results provides an estimate of the accuracy. The leave-one-out method is useful when the dataset is small. For large training datasets, leave-one-out can become expensive, since many iterations are required. Leave-one-out is unbiased but has high variance and is therefore not particularly reliable.
5. Bootstrap Method
In this method, given a dataset of size n, a bootstrap sample is randomly selected uniformly with replacement (that is, a sample may be selected more than once) by sampling n times, and is used to build a model. It can be shown that, on average, only 63.2% of the objects in such a sample are unique. The error in building the model is estimated using the remaining 36.8% of objects that are not in the bootstrap sample; the final error is then computed as a weighted combination of the two error rates, 0.632 times the error on the left-out objects plus 0.368 times the error on the bootstrap sample. The weights 0.632 and 0.368 are based on the observation that, if n samples are randomly selected with replacement from n available samples to form the training data, the expected percentage of unique objects in the training data is 63.2% and the remaining unique objects used as test data amount to 36.8% of the initial sample. This is repeated and the average of the error estimates is obtained. The bootstrap method is unbiased and, in contrast to leave-one-out, has low variance, but many iterations are needed for good error estimates if the sample is small, say 25 or less.
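A sketch of one way to implement this estimate (illustrative Python; train_fn and error_fn stand for whatever model builder and error measure are being used, and the 0.632/0.368 weighting follows the description above):

    import random

    def bootstrap_error(records, labels, train_fn, error_fn, iterations=50, seed=0):
        """0.632 bootstrap estimate of the classification error."""
        rng = random.Random(seed)
        n = len(records)
        estimates = []
        for _ in range(iterations):
            sample = [rng.randrange(n) for _ in range(n)]   # sample n times with replacement
            in_bag = set(sample)
            out_of_bag = [i for i in range(n) if i not in in_bag]
            if not out_of_bag:
                continue
            model = train_fn([records[i] for i in sample], [labels[i] for i in sample])
            err_train = error_fn(model, [records[i] for i in sample],
                                 [labels[i] for i in sample])
            err_test = error_fn(model, [records[i] for i in out_of_bag],
                                [labels[i] for i in out_of_bag])
            estimates.append(0.632 * err_test + 0.368 * err_train)
        return sum(estimates) / len(estimates)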
Bootstrapping, bagging and boosting are techniques for improving the accuracy of classification results. They have been shown to be very successful for certain models, for example decision trees. All three involve combining several classification results obtained from the same training data perturbed in some way. The aim of building several decision trees from perturbed training data is to find out how these decision trees differ from those obtained earlier.
With the bootstrapping method it can be shown that the samples selected are all somewhat different from each other, since on average only 63.2% of the objects in each of them are unique. The bootstrap samples are then used for building decision trees, which are then combined to form a single decision tree.
Bagging (the name is derived from bootstrap aggregating) combines classification results from multiple models, or the results of using the same method on several different sets of training data. Bagging may also be used to improve the stability and accuracy of a complex classification model with limited training data, by using sub-samples obtained by resampling, with replacement, to generate the models.
Bagging essentially involves simple voting (with no weights), so the final prediction is the one made by the majority of the trees. The different decision trees obtained during bagging should not be very different if the training data is good and large enough. If the trees obtained are very different, it only indicates instability, perhaps due to the training data being random. In such a situation, bagging will often provide a better result than using the training data only once to build a decision tree.
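A minimal sketch of bagging by unweighted voting (illustrative Python; build_tree and predict stand for the decision tree learner and its prediction function, which are assumed rather than shown):

    import random
    from collections import Counter

    def bag(records, labels, build_tree, predict, n_trees=25, seed=0):
        """Build n_trees models on bootstrap samples; return a majority-vote classifier."""
        rng = random.Random(seed)
        n = len(records)
        trees = []
        for _ in range(n_trees):
            idx = [rng.randrange(n) for _ in range(n)]     # resample with replacement
            trees.append(build_tree([records[i] for i in idx], [labels[i] for i in idx]))

        def vote(x):
            # the final prediction is the class chosen by the majority of the trees
            return Counter(predict(t, x) for t in trees).most_common(1)[0][0]

        return vote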
• These techniques can provide a level of accuracy that usually cannot be obtained by a large single-tree model.
• Creating a single decision tree from a collection of trees in bagging and boosting is not difficult.
• These methods can often help in avoiding the problem of overfitting, since a number of trees based on random samples are used.
• Boosting appears to be, on average, better than bagging, although this is not always so; on some problems bagging does better than boosting.
1. Speed
2. Robustness
3. Scalability
4. Interpretability
5. Goodness of the model
6. Flexibility
7. Time complexity
Speed
Speed involves not just the time or computation cost of constructing a model (e.g. a decision tree); it also includes the time required to learn to use the model. Obviously, a user wishes to minimize both times, although it has to be understood that any significant data mining project will take time to plan and prepare the data. If the problem to be solved is large, a careful study of the methods available may need to be carried out so that an efficient classification method can be chosen.
Robustness
Data errors are common, in particular when data is being collected from a number of sources, and errors may remain even after data cleaning. It is therefore desirable that a method be able to produce good results in spite of some errors and missing values in the data.
Scalability
Many data mining methods were originally designed for small datasets. Many have been modified to deal with large problems. Given that large datasets are becoming common, it is desirable that a method continue to work efficiently for large disk-resident databases as well.
Interpretability
A task of the data mining professional is to ensure that the results of data mining are explained to the decision makers. It is therefore desirable that the end-user be able to understand and gain insight from the results produced by the classification method.
Goodness of the Model
For a model to be effective, it needs to fit the problem that is being solved. For example, in decision tree classification it is desirable to find a decision tree of the "right" size and compactness with high accuracy.
test) costs. SMILES also uses boosting and cost-sensitive learning. For more details visit:
http://www.dsic.upv.es/~flip/smiles/
• NBC: a simple Naive Bayes Classifier, written in awk. For more details visit:
http://scant.org/nbc/nbc.html
CONCLUSION
REVIEW QUESTIONS
1. What kind of data is the decision tree method most suitable for?
2. Briefly outline the major steps of the algorithm to construct a decision tree.
3. Assume that we have 10 training samples. There are four classes A, B, C and D. Compute the information in the samples for the five training datasets given below (each row is a dataset and each dataset has 10 objects) when the numbers of samples in the classes are:
Class        A    B    C    D
Dataset 1    1    1    1    7
Dataset 2    2    2    2    4
Dataset 3    3    3    3    1
Dataset 4    1    2    3    4
Dataset 5    0    0    1    9
CHAPTER 4
CLUSTER ANALYSIS
Learning Objectives
1. Explain what cluster analysis is
7. Different data types: Many problems have a mixture of data types, for example numerical, categorical and even textual. It is therefore desirable that a cluster analysis method be able to deal with not only numerical data but also Boolean and categorical data.
8. Result independent of data input order: Although this is a simple requirement, not all methods satisfy it. It is therefore desirable that a cluster analysis method not be sensitive to data input order; whatever the order, the result of cluster analysis of the same data should be the same.
1. Quantitative (or numerical) data is quite common, for example weight, marks, height, price, salary, and count. There are a number of methods for computing similarity between quantitative data.
2. Binary data is also quite common, for example gender and marital status. Computing similarity or distance between binary variables is not as simple as for quantitative data, but a number of methods have been proposed. A simple method involves counting how many attribute values of the two objects are different amongst n attributes and using this as an indication of distance.
3. Qualitative nominal data is similar to binary data but may take more than two values and has no natural order, for example religion, food or colours. For nominal data too, an approach similar to that suggested for computing distance for binary data may be used.
4. Qualitative ordinal (or ranked) data is similar to nominal data except that the data has an order associated with it, for example grades A, B, C, D or sizes S, M, L and XL. The problem of measuring distance between ordinal variables is different from that for nominal variables, since the order of the values is important. One method of computing distance involves transforming the values to numeric values according to their rank. For example, grades A, B, C, D could be transformed to 4.0, 3.0, 2.0 and 1.0 and then one of the methods of the next section may be used.
4.4 COMPUTING DISTANCE
Distance is a well understood concept that has a number of simple properties:
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance from x to some other point z and the distance from z to y.
4. Distance from x to y is always the same as from y to x.
Let the distance between two points x and y (both vectors) be D(x,y). We now define a number of distance measures.
Euclidean distance
Euclidean distance, or the L2 norm of the difference vector, is the most commonly used way of computing distances and has an intuitive appeal, but the largest-valued attribute may dominate the distance. It is therefore essential that the attributes be properly scaled.
D(x,y) = (Sum (xi - yi)^2)^1/2
It is possible to use this distance measure without the square root if one wants to place greater weight on differences that are large.
A Euclidean distance measure is more appropriate when the data is not standardized, but as noted above the distance measure can be greatly affected by the scale of the data.
Manhattan distance
Another commonly used distance metric is the Manhattan distance, or the L1 norm of the difference vector. In most cases, the results obtained by the Manhattan distance are similar to those obtained by using the Euclidean distance. Once again the largest-valued attribute can dominate the distance, although not as much as with the Euclidean distance.
D(x,y) = Sum |xi - yi|
Chebychev distance
This distance metric is based on the maximum attribute difference. It is also called the L-infinity norm of the difference vector.
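The three measures follow directly from their definitions; a short Python sketch over two equal-length numeric vectors (the names are ours):

    import math

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # L2 norm

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))                # L1 norm

    def chebychev(x, y):
        return max(abs(a - b) for a, b in zip(x, y))                # L-infinity norm

    # e.g. for x = (1, 2) and y = (4, 6): euclidean = 5.0, manhattan = 7, chebychev = 4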
Grid-based methods
In this class of methods, the object space, rather than the data, is divided into a grid. Grid partitioning is based on characteristics of the data, and such methods can deal with non-numeric data more easily. Grid-based methods are not affected by data ordering.
Model-based methods
A model is assumed, perhaps based on a probability distribution. Essentially, the algorithm tries to build clusters with a high level of similarity within them and a low level of similarity between them. Similarity measurement is based on the mean values and the algorithm tries to minimize the squared-error function.
A simple taxonomy of cluster analysis methods is presented in Figure 4.1.
Partitional methods are popular since they tend to be computationally efficient and are more easily adapted for very large datasets. The hierarchical methods tend to be computationally more expensive.
The aim of partitional methods is to reduce the variance within each cluster as much as possible and to have large variance between the clusters. Since the partitional methods do not normally control the inter-cluster variance explicitly, heuristics (e.g. choosing seeds as far apart as possible) may be used for ensuring large inter-cluster variance. One may therefore consider the aim to be minimizing a ratio like a/b, where a is some measure of within-cluster variance and b is some measure of between-cluster variance.
K-means is the simplest and most popular classical clustering method, and it is easy to implement. The classical method can only be used if the data about all the objects is located in the main memory. The method is called K-means since each of the K clusters is represented by the mean of the objects (called the centroid) within it. It is also called the centroid method since at each step the centroid point of each cluster is assumed to be known and each of the remaining points is allocated to the cluster whose centroid is closest to it. Once this allocation is completed, the centroids of the clusters are recomputed using simple means and the process of allocating points to each cluster is repeated until there is no change in the clusters (or some other stopping criterion, e.g. no significant reduction in the squared error, is met). The method may also be viewed as a search problem where the aim is essentially to find the optimum clusters given the number of clusters and the seeds specified by the user. Obviously, we cannot use a brute-force or exhaustive search method to find the optimum, so we consider solutions that may not be optimal but can be computed efficiently.
The K-means method uses the Euclidean distance measure, which appears to work well with compact clusters. If the Manhattan distance is used instead of the Euclidean distance, the method is called the K-median method. The K-median method can be less sensitive to outliers.
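A minimal K-means sketch along the lines just described (illustrative Python; the user supplies the starting seeds, Euclidean distance is used, and iteration stops when the centroids no longer change):

    import math

    def kmeans(points, seeds, max_iter=100):
        """points: list of numeric tuples; seeds: K initial centroids."""
        centroids = [tuple(s) for s in seeds]
        clusters = [[] for _ in centroids]
        for _ in range(max_iter):
            # allocate each point to the cluster whose centroid is closest
            clusters = [[] for _ in centroids]
            for p in points:
                d = [math.dist(p, c) for c in centroids]
                clusters[d.index(min(d))].append(p)
            # recompute each centroid as the mean of the points allocated to it
            new_centroids = [tuple(sum(v) / len(members) for v in zip(*members)) if members else c
                             for c, members in zip(centroids, clusters)]
            if new_centroids == centroids:        # no change in the clusters: stop
                break
            centroids = new_centroids
        return centroids, clusters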
Often the user has little basis for specifying the number of clusters and the starting seeds. This problem may be overcome by using an iterative approach. For example, one may first try three clusters and choose three starting seeds randomly. Once the final clusters have been obtained, the process may be repeated with a different set of seeds. Attempts should be made to select seeds that are as far away from each other as possible. Also, during the iterative process, if two clusters are found to be close together it may be desirable to merge them, and a large cluster may be split in two if the variance within the cluster is above some threshold value.
Another approach involves finding the centroid of the whole dataset and then perturbing this centroid value to find seeds. Yet another approach recommends using a hierarchical method like the agglomerative method on the data first, since that method does not require starting values, and then using the results of that method as the basis for specifying the number of clusters and the starting seeds.
K-means is an iterative-improvement greedy method. A number of iterations are normally needed for convergence and therefore the dataset is processed a number of times. If the data is very large and cannot be accommodated in the main memory, the process may become inefficient.
Although the K-means method is the most widely known and used, there are a number of issues related to the method that should be understood:
1. The K-means method needs to compute Euclidean distances and means of the attribute values of objects within a cluster. The classical algorithm is therefore only suitable for continuous data. K-means variations that deal with categorical data are available but not widely used.
2. The K-means method implicitly assumes spherical probability distributions.
3. The results of the K-means method depend strongly on the initial guesses of the seeds.
4. The K-means method can be sensitive to outliers. If an outlier is picked as a starting seed, it may end up in a cluster of its own. Also, if an outlier moves from one cluster to another during the iterations, it can have a major impact on the clusters because the means of the two clusters are likely to change significantly.
5. Although some local optimum solutions discovered by the K-means method are satisfactory, often the local optimum is not as good as the global optimum.
6. The K-means method does not consider the size of the clusters. Some clusters may be large and some very small.
7. The K-means method does not deal with overlapping clusters.
The K-means method does not explicitly assume any probability distribution for the attribute values. It only assumes that the dataset consists of groups of objects that are similar, and that the groups can be discovered because the user has provided cluster seeds.
In contrast to the K-means method, the Expectation Maximization (EM) method is based on the assumption that the objects in the dataset have attributes whose values are distributed according to some (unknown) linear combination (or mixture) of simple probability distributions. While the K-means method involves assigning objects to clusters to minimize within-group variation, the EM method assigns objects to different clusters with certain probabilities in an attempt to maximize the expectation (or likelihood) of the assignment.
The simplest situation is when there are only two distributions. For every individual we may assume that it comes from distribution 1 with probability p and therefore from distribution 2 with probability 1 - p. Such mixture models are widely used because they provide more parameters and therefore more flexibility in modelling.
The EM method consists of a two-step iterative algorithm. The first step, called the Estimation step or the E-step, involves estimating the probability distributions of the clusters given the data. The second step, called the Maximization step or the M-step, involves finding the model parameters that maximize the likelihood of the solution.
The EM method assumes that all attributes are independent random variables. In the simple case of just two clusters with objects having only a single attribute, we may assume that the attribute values vary according to a normal distribution. The EM method then requires that we estimate the following parameters:
1. The mean and standard deviation of the normal distribution for cluster 1
2. The mean and standard deviation of the normal distribution for cluster 2
3. The probability p of a sample belonging to cluster 1 and therefore the probability 1 - p of belonging to cluster 2
The EM method then works as follows (a minimal sketch of these steps in code is given after the list):
1. Guess the initial values of the five parameters (the two means, the two standard deviations and the probability p) given above.
2. Use the two normal distributions (given the two guesses of means and the two guesses of standard deviations) to compute the probability of each object belonging to each of the two clusters.
3. Compute the likelihood of the data coming from these two clusters by multiplying together the (summed) probabilities of the individual objects.
4. Re-estimate the five parameters and go to Step 2 until a stopping criterion has been met.
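For the single-attribute, two-cluster case the four steps can be sketched as follows (illustrative Python; the initial guesses and the fixed number of iterations are our simplifications, not the text's algorithm):

    import math
    import random

    def normal_pdf(x, mean, sd):
        return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    def em_two_clusters(data, iterations=50, seed=0):
        rng = random.Random(seed)
        # Step 1: guess the five parameters
        m1, m2 = rng.choice(data), rng.choice(data)
        s1 = s2 = (max(data) - min(data)) / 4 or 1.0
        p = 0.5
        for _ in range(iterations):
            # Step 2 (E-step): probability of each object belonging to cluster 1
            w = []
            for x in data:
                a = p * normal_pdf(x, m1, s1)
                b = (1 - p) * normal_pdf(x, m2, s2)
                w.append(a / (a + b) if a + b > 0 else 0.5)
            # Steps 3-4 (M-step): re-estimate the five parameters
            n1 = max(sum(w), 1e-9)
            n2 = max(len(data) - n1, 1e-9)
            m1 = sum(wi * x for wi, x in zip(w, data)) / n1
            m2 = sum((1 - wi) * x for wi, x in zip(w, data)) / n2
            s1 = math.sqrt(sum(wi * (x - m1) ** 2 for wi, x in zip(w, data)) / n1) or 1e-6
            s2 = math.sqrt(sum((1 - wi) * (x - m2) ** 2 for wi, x in zip(w, data)) / n2) or 1e-6
            p = n1 / len(data)
        return (m1, s1), (m2, s2), p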
4.7 HIERARCHICAL METHODS
Hierarchical methods produce a nested series of clusters, as opposed to the partitional methods which produce only a flat set of clusters. Essentially, the hierarchical methods attempt to capture the structure of the data by constructing a tree of clusters. This approach allows clusters to be found at different levels of granularity.
Two types of hierarchical approaches are possible. In one approach, called the agglomerative approach for merging groups (or bottom-up approach), each object at the start is a cluster by itself and nearby clusters are repeatedly merged, resulting in larger and larger clusters, until some stopping criterion (often a given number of clusters) is met or all the objects are merged into a single large cluster, which is the highest level of the hierarchy. In the second approach, called the divisive approach (or top-down approach), all the objects are put in a single cluster to start. The method then repeatedly splits clusters, resulting in smaller and smaller clusters, until a stopping criterion is reached or each cluster has only one object in it.
The hierarchical clustering methods require distances between clusters to be computed. These distance metrics are often called linkage metrics.
Computing distances between large clusters can be expensive. Suppose one cluster has 50 objects and another has 100; then computing most of the distance metrics listed below would require computing distances between each object in the first cluster and every object in the second. Therefore 5000 distances would need to be computed just to obtain a single distance between the two clusters. This can be expensive if each object has many attributes.
We will discuss the following methods for computing distances between clusters:
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average-link algorithm
5. Ward's minimum-variance algorithm
Single-link
The single-link (or nearest neighbour) algorithm is perhaps the simplest algorithm for computing the distance between two clusters. The algorithm determines the distance between two clusters as the minimum of the distances between all pairs of points (a, x), where a is from the first cluster and x is from the second. The algorithm therefore requires that all pairwise distances be computed and the smallest distance (or shortest link) found. The algorithm can form chains and can produce elongated clusters.
Figure 4.2 shows two clusters A and B and the single-link distance between them.
Complete-link
The complete-link algorithm is also called the farthest neighbour algorithm. In this algorithm, the distance between two clusters is defined as the maximum of the pairwise distances (a, x). Therefore if there are m elements in one cluster and n in the other, all mn pairwise distances must be computed and the largest chosen.
Complete-link is strongly biased towards compact clusters. Figure 4.3 shows two clusters A and B and the complete-link distance between them. Complete-link can be distorted by moderate outliers in one or both of the clusters.
Centroid
In the centroid algorithm the distance between two clusters is determined as the distance between the centroids of the clusters. The centroid algorithm computes the distance between two clusters as the distance between the average point of each of the two clusters. Usually the squared Euclidean distance between the centroids is used. This approach is easy, generally works well, and is more tolerant of somewhat elongated clusters than the complete-link algorithm.
Figure 4.4 shows two clusters A and B and the centroid distance between them.
Ward's minimum-variance distance measure, on the other hand, is different. The method generally works well and results in small, tight clusters. Ward's distance is the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters.
An expression for Ward's distance may be derived. It may be expressed as follows:
DW(A,B) = NA NB DC(A,B) / (NA + NB)
where DW(A,B) is Ward's minimum-variance distance between clusters A and B, with NA and NB objects in them respectively, and DC(A,B) is the centroid distance between the two clusters, computed as the squared Euclidean distance between the centroids. It has been observed that Ward's method tends to join clusters with a small number of objects and is biased towards producing clusters with roughly the same number of objects. The distance measure can be sensitive to outliers.
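Treating each cluster as a list of numeric tuples, the single-link, complete-link, centroid and Ward distances can be written directly from their definitions (illustrative Python; the squared Euclidean distance is used for the centroid distance, as described above):

    import math

    def single_link(A, B):
        return min(math.dist(a, b) for a in A for b in B)    # nearest pair of points

    def complete_link(A, B):
        return max(math.dist(a, b) for a in A for b in B)    # farthest pair of points

    def centroid(C):
        return tuple(sum(v) / len(C) for v in zip(*C))

    def centroid_link(A, B):
        return math.dist(centroid(A), centroid(B)) ** 2      # squared centroid distance

    def ward(A, B):
        # DW(A,B) = NA * NB * DC(A,B) / (NA + NB)
        return len(A) * len(B) * centroid_link(A, B) / (len(A) + len(B))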
Agglomerative Method
Some applications naturally have a hierarchical structure. For example, the world's fauna and flora have a hierarchical structure. The agglomerative clustering method tries to discover such structure in a given dataset.
The basic idea of the agglomerative method is to start out with n clusters for n data points, that is, each cluster consisting of a single data point. Using a measure of distance, at each step the method merges the two nearest clusters, thus reducing the number of clusters and building successively larger clusters. The process continues until the required number of clusters has been obtained or all the data points are in one cluster. The agglomerative method leads to hierarchical clusters in which at each step we build larger and larger clusters that include increasingly dissimilar objects.
The agglomerative method is basically a bottom-up approach which involves the following steps (a sketch in code follows the list):
1. Allocate each point to a cluster of its own. Thus we start with n clusters for n objects.
2. Create a distance matrix by computing distances between all pairs of clusters, using, for example, the single-link metric or the complete-link metric (some other metric may also be used). Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove this pair of clusters from the matrix and merge them.
5. If there is only one cluster left, stop.
6. Compute all distances from the new cluster, update the distance matrix after the merger, and go to Step 3.
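A compact sketch of these steps (illustrative Python; `linkage` is any cluster-distance function, for example the single-link or complete-link functions sketched earlier, and the distance matrix is simply recomputed on each pass rather than updated incrementally):

    def agglomerate(points, linkage, target_clusters=1):
        """Bottom-up clustering: merge the two nearest clusters until
        only `target_clusters` remain (or a single cluster is left)."""
        clusters = [[p] for p in points]                    # step 1: one cluster per point
        while len(clusters) > max(target_clusters, 1):      # step 5: stopping criterion
            best = None
            for i in range(len(clusters)):                  # steps 2-3: smallest distance
                for j in range(i + 1, len(clusters)):
                    d = linkage(clusters[i], clusters[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            merged = clusters[i] + clusters[j]              # step 4: merge the pair
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)                         # step 6: repeat with the new cluster
        return clusters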
DBSCAN (density based spatial clustering of applications with noise) is one example of a density-based method for clustering. The method was designed for spatial databases but can be used in other applications. It requires two input parameters: the size of the neighbourhood (R) and the minimum number of points in the neighbourhood (N). Essentially these two parameters determine the density within the clusters the user is willing to accept, since they specify how many points must be in a region. The number of points not only determines the density of acceptable clusters, it also determines which objects will be labelled outliers or noise. Objects are declared to be outliers if there are few other objects in their neighbourhood. The size parameter R determines the size of the clusters found. If R is big enough, there would be one big cluster and no outliers. If R is small, there will be small dense clusters and there might be many outliers.
We now define a number of concepts that are required in the DBSCAN method:
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that are within the radius R from y.
2. Core object: An object y is called a core object if there are at least N objects within its neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong to the same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied:
(a) the objects are close enough to each other, i.e. within a distance of R;
(b) x2 is a core object as defined above.
4. Connectivity: Two objects x1 and xn are connected if there is a path or chain of objects x1, x2, ..., xn from x1 to xn such that each xi+1 is in proximity to object xi.
We now outline the basic algorithm for density-based clustering (a fuller sketch in code is given below):
1. Select values of R and N.
2. Arbitrarily select an object p.
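Using the neighbourhood radius R, the minimum-points threshold N and the proximity/connectivity ideas defined above, a complete density-based clustering pass might be sketched as follows (illustrative Python only; it is not the text's listing):

    import math

    def dbscan(points, R, N):
        """points: list of numeric tuples. Returns index -> cluster id (-1 = outlier)."""
        labels = {}
        def neighbours(i):
            return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= R]
        cluster_id = 0
        for i in range(len(points)):
            if i in labels:
                continue
            nbrs = neighbours(i)
            if len(nbrs) < N:
                labels[i] = -1                  # provisionally an outlier (noise)
                continue
            cluster_id += 1                     # i is a core object: start a new cluster
            labels[i] = cluster_id
            frontier = list(nbrs)
            while frontier:                     # grow the cluster through connected objects
                j = frontier.pop()
                if labels.get(j) == -1:
                    labels[j] = cluster_id      # a former outlier becomes a border point
                if j in labels:
                    continue
                labels[j] = cluster_id
                j_nbrs = neighbours(j)
                if len(j_nbrs) >= N:            # j is also a core object: keep expanding
                    frontier.extend(j_nbrs)
        return labels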
Most clustering methods implicitly assume that all the data is accessible in the main memory. Often the size of the database is not considered, but a method requiring multiple scans of data that is disk-resident could be quite inefficient for large problems.
One possible approach to dealing with large datasets, which could be used with any type of clustering method, is to draw as large a sample from the dataset as can be accommodated in the main memory. The sample is then clustered. Each remaining object is then assigned to the nearest cluster obtained from the sample. This process could be repeated several times and the clusters that lead to the smallest within-cluster variance could be chosen.
K-Means Method for Large Databases
This method first picks the number of clusters and their seed centroids and then attempts to classify each object into one of the following three groups:
(a) Those that are certain to belong to a cluster. These objects together are called the discard set. Some information about these objects is computed and saved; this includes the number of objects n, a vector sum of all attribute values of the n objects (a vector S) and a vector sum of squares of all attribute values of the n objects (a vector Q). These values are sufficient to recompute the centroid of the cluster and its variance (see the sketch after this list).
(b) Those that are sufficiently close to each other to be replaced by their summary, but sufficiently far away from each cluster's centroid that they cannot yet be put in the discard set. These objects together are called the compression set.
(c) The remaining objects, which are too difficult to assign to either of the two groups above. These objects are called the retained set and are stored as individual objects; they cannot be replaced by a summary.
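For the discard set, the triple (n, S, Q) described in (a) is all that needs to be stored; a small sketch of how the centroid and per-attribute variance are recovered from it (illustrative Python):

    class DiscardSet:
        """Summary of the objects that are certain to belong to one cluster."""
        def __init__(self, dimensions):
            self.n = 0
            self.S = [0.0] * dimensions      # vector sum of attribute values
            self.Q = [0.0] * dimensions      # vector sum of squares of attribute values

        def add(self, x):
            self.n += 1
            for i, v in enumerate(x):
                self.S[i] += v
                self.Q[i] += v * v

        def centroid(self):
            return [s / self.n for s in self.S]

        def variance(self):
            # per-attribute variance: mean of squares minus square of mean
            return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.S, self.Q)]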
Hierarchical Method for Large Databases - Concept of Fractionation
Dealing with large datasets is difficult using hierarchical methods, since the methods require an N x N distance matrix to be computed for N objects. If N is large, say 100,000, the matrix has 10^10 entries.
A modification of classical hierarchical methods that deals with large datasets was proposed in 1992. It is based on the idea of splitting the data into manageable subsets called "fractions" and then applying a hierarchical method to each fraction. The concept is called fractionation. The basic algorithm used in the method is as follows, assuming that M is the largest number of objects that the hierarchical method may be applied to. The size M may be determined, perhaps, based on the size of the main memory.
Now the algorithm:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction. Let the number of clusters obtained from all the fractions be C.
3. For each of the C clusters, compute the mean of the attribute values of the objects in it. Let this mean vector be mi, i = 1, ..., C. These cluster means are called meta-observations. The meta-observations now become the data values that represent the fractions.
4. If the number of meta-observations C is too large (greater than M), go to Step 1; otherwise apply the same hierarchical clustering technique to the meta-observations obtained in Step 3.
5. Allocate each object of the original dataset to the cluster with the nearest mean obtained in Step 4.
is crucial for the K-means method to be reliable. For more details visit: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
• CLUTO provides a set of clustering methods including partitional, agglomerative, and graph-partitioning algorithms based on a variety of similarity/distance metrics. For more details visit: http://www-users.cs.umn.edu/~karypis/cluto/ (Free)
CONCLUSION
Cluster analysis is a collection of methods that assists the user in putting different objects from a collection of objects into different groups. In some ways one could say that cluster analysis is best used as an exploratory data analysis exercise, when the user has no hypothesis to test. Cluster analysis, therefore, can be used to uncover hidden structure which may assist further exploration.
We have discussed a number of clustering methods. The K-means method requires that the user specify the number of clusters and the starting seeds for each cluster. This may be difficult to do without some insight into the data, which the user may not have. One possible approach is to combine a partitioning method like K-means with a hierarchical method like the agglomerative method: the agglomerative method can then be used to understand the data better and to help in estimating the number of clusters and the starting seeds. The strength of cluster analysis is that it works well with numeric data; techniques that work well with categorical and textual data are also available. Cluster analysis is easy to use.
We have noted that the performance of cluster analysis methods can depend on the choice of the distance metric. It can be difficult to devise a suitable distance metric for data that contains a mixture of variable types. It can also be difficult to determine a proper weighting scheme for disparate variable types. Furthermore, since cluster analysis is exploratory, the results can sometimes be difficult to interpret. On the other hand, quite unexpected results may be obtained: for example, at NASA two subgroups of stars were distinguished where previously no difference was suspected.
REVIEW QUESTIONS
1. List four desirable features of a cluster analysis method. Which of them are important for large databases? Discuss.
2. Discuss the different types of data which one might encounter in practice. What data type is clustering most suitable for?
3. Given two objects represented by the attribute values (1, 6, 2, 5, 3) and (3, 5, 2, 6, 6):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
4. Suppose that a data mining task is to cluster the following eight points (with (x, y)
CHAPTER 5
WEB DATA MINING
Learning Objectives
1. Explain what web mining is all about
2. Define the relevant Web terminology
Definition:
Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from Web data. It is normally expected that either the hyperlink structure of the Web or the Web log data, or both, have been used in the mining process.
1. Web content mining: This deals with discovering useful information or knowledge from Web page contents. In contrast to Web usage mining and Web structure mining, Web content mining focuses on the Web page content rather than the links.
2. Web structure mining: This deals with discovering and modelling the link structure of the Web. Work has been carried out to model the Web based on the topology of the hyperlinks. This can help in discovering similarity between sites, in discovering important sites for a particular topic or discipline, or in discovering Web communities.
3. Web usage mining: This deals with understanding user behaviour in interacting with the Web or with a Web site. One of the aims is to obtain information that may assist Web site reorganization or assist site adaptation to better suit the user. The mined data often includes the logs of users' interactions with the Web. The logs include the Web server logs, proxy server logs, and browser logs, and contain information about the referring pages, user identification, the time a user spends at a site and the sequence of pages visited.
The three categories above are not independent, since Web structure mining is closely related to Web content mining and both are related to Web usage mining.
1. Hyperlinks: Conventional text documents do not have hyperlinks, while links are very important components of Web documents. Hard copy documents in a library are usually structured (e.g. books) and have been catalogued by cataloguing experts. No linkage between these documents is identified, except that two documents may have been catalogued in the same classification and therefore deal with similar topics.
2. Types of information: As noted above, Web pages differ widely in structure, quality and usefulness. Web pages can consist of text, frames, multimedia objects, animation and other types of information, quite different from text documents which mainly consist of text but may have some other objects like tables, diagrams, figures and some images.
3. Dynamics: Text documents do not change unless a new edition of a book appears, while Web pages change frequently, because the information on the Web, including linkage information, is updated all the time (although some Web pages are out of date and never seem to change!) and new pages appear every second. Finding a previous version of a page is almost impossible on the Web, and links pointing to a page may work today but not tomorrow.
4. Quality: Text documents are usually of high quality, since they usually go through some quality control process because they are very expensive to produce. In contrast, much of the information on the Web is of low quality. Compared to the size of the Web, it may be that less than 10% of Web pages are really useful and of high quality.
5. Huge size: Although some libraries are very large, the Web in comparison is much larger; its size is perhaps approaching 100 terabytes. That is equivalent to about 200 million books.
6. Document use: Compared to the use of conventional documents, the use of Web documents is very different. Web users tend to pose short queries, browse perhaps the first page of the results, and then move on.
The World Wide Web (WWW) is the set of all the nodes which are interconnected by hypertext links.
A link expresses one or more relationships between two or more resources. Links may also be established within a document by using anchors.
A URL (uniform resource locator) consists of distinct parts, namely the protocol type (usually http), the name of the Web server, the directory path and the file name. If a file name is not specified, index.html is assumed.
A Web server serves Web pages using http to client machines so that a browser can display them.
A client is the role adopted by an application when it is retrieving a Web resource.
A proxy is an intermediary which acts as both a server and a client for the purpose of retrieving resources on behalf of other clients. Clients using a proxy know that the proxy is present and that it is an intermediary.
A domain name server (DNS) is a distributed database of name-to-address mappings. When a DNS server looks up a computer name, it either finds it in its list or asks another DNS server which knows more names.
A cookie is data sent by a Web server to a Web client, to be stored locally by the client and sent back to the server on subsequent requests.
Obtaining information from the Web using a search engine is called information "pull", while information sent to users is called information "push". For example, users may register with a site and then information is sent ("pushed") to such users without their requesting it.
Graph Terminology
A directed graph is a set of nodes (pages), denoted by V, and edges (links), denoted by E. Thus a graph is (V, E) where all edges are directed, just like a link that points from one page to another, and each edge may be considered an ordered pair of nodes, namely the nodes that it links.
An undirected graph is also represented by nodes and edges (V, E), but the edges have no direction specified. Therefore an undirected graph is not like the pages and links on the Web unless we assume the possibility of traversal in both directions. The back button on browsers does provide the possibility of backward traversal once a link has been traversed in one direction, but in general two-way traversal of links is not possible on the Web.
A graph may be searched either by a breadth-first search or by a depth-first search. The breadth-first search is based on first searching all the nodes that can be reached from the node where the search starts and, once these nodes have been searched, searching the nodes at the next level that can be reached from those nodes, and so on. The depth-first search is based on visiting a node and then first searching any unvisited descendants of that node before visiting any brother nodes. Essentially, the search algorithm involves going down before going across to a brother node.
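A breadth-first traversal of a link graph held as an adjacency list can be sketched in a few lines (illustrative Python; replacing popleft() with pop() turns it into a stack-based depth-first search):

    from collections import deque

    def breadth_first(graph, start):
        """graph: dict mapping a page to the list of pages it links to."""
        visited, queue, order = {start}, deque([start]), []
        while queue:
            page = queue.popleft()            # use queue.pop() for a depth-first search
            order.append(page)
            for linked in graph.get(page, []):
                if linked not in visited:
                    visited.add(linked)
                    queue.append(linked)
        return order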
The diameter of the graph is defined as the maximum of the minimum distances between all possible ordered node pairs (u, v); that is, it is the maximum number of links that one would need to follow, starting from any page u to reach any page v, assuming that the best path has been followed.
• The Strongly Connected Core (SCC): This part of the Web was found to consist of about 30% of the Web, which is still very large given the more than four billion pages on the Web in 2004. This core may be considered the heart of the Web, and its main property is that pages in the core can reach each other following directed edges (i.e. hyperlinks).
• The IN Group: This part of the Web was found to consist of about 20% of the Web. The main property of the IN group is that pages in the group can reach the SCC but cannot be reached from it.
• The OUT Group: This part of the Web was found to consist of about 20% of the Web. The main property of the OUT group is that pages in the group can be reached from the SCC but cannot reach the SCC.
• Tendrils: This part of the Web was found to consist of about 20% of the Web. The main property of pages in this group is that they cannot be reached by the SCC and cannot reach the SCC. This does not imply that these pages have no linkages to pages outside the group, since they could well have linkages from the IN Group and to the OUT Group.
• The Disconnected Group: This part of the Web was found to be less than 10% of the Web and is essentially disconnected from the rest of the Web. These pages could include, for example, personal pages at many sites that link to no other page and have no links to them.
Size of the Web
The deep Web includes information stored in searchable databases that is often inaccessible to search engines. This information can often only be accessed through the interface of each website, and some of it may be available only to subscribers. The shallow Web (the indexable Web) is the information on the Web that the search engines can access without accessing the Web databases.
In many cases use of the Web makes good sense. For example, it is better to put even short announcements in an enterprise on the Web rather than send them by email, since emails sit in many mailboxes wasting disk storage, while putting information on the Web can be more effective as well as help in maintaining a record of communications.
If such uses grow, which appears likely, then a very large number of Web pages with a short life span and low connectivity to other pages are likely to be generated each day. The large number of Web sites that disappear every day does create enormous problems on the Web. Links from well-known sites do not always work. Not all results of a search engine are guaranteed to work. The URLs cited in scholarly publications also cannot be relied upon to be still available. A study of papers presented at the WWW conferences found that links cited in them had a decay rate that grew with the age of the papers. Abandoned sites are therefore a nuisance.
To overcome these problems, it may become necessary to categorize Web pages. The following categorization is one possibility:
1. a Web page that is guaranteed not to change ever
2. a Web page that will not delete any content, may add content/links, and will not disappear
3. a Web page that may change content/links but will not disappear
It is possible to define how well connected a node is by using the concept of the centrality of a node. Out-centrality is based on the distances measured from the node using its out-links, while in-centrality is based on the distances measured from other nodes that are connected to the node using the in-links. Based on these metrics, it is possible to define the concept of compactness, which varies from 0 for a completely disconnected Web graph to 1 for a fully connected Web graph.
A Web site of any enterprise usually has the homepage as the root of the tree, as in any hierarchical structure. For example, if one looks at a typical university Web site, the homepage will provide some basic information about the institution and then provide links, for example, to:
Prospective students
Staff
Research
Many Web sites fetch information from a database to ensure that the information is accurate and timely. A recent study found that almost 40% of all URLs fetched information from databases.
2. Index page: These pages assist the user in navigating through the enterprise's Web site. A homepage may in some cases also act as an index page.
3. Reference page: These pages provide some basic information that is used by a number of other pages. For example, each page in a Web site may have a link to a page that provides the enterprise's privacy policy.
4. Content page: These pages only provide content and have little role in assisting a user's navigation. Often these pages are larger in size, have few out-links, and are well down the tree in a Web site. They are the leaf nodes of the tree.
A number of principles have been developed to help design the structure and content of a Web site. For example, three basic principles are:
1. Relevant linkage principle: It is assumed that links from a page point to other relevant resources. This is similar to the assumption made for citations in scholarly publications, where it is assumed that a publication cites only relevant publications. Links are often assumed to reflect the judgment of the page creator: by providing a link to another page, the creator is assumed to be making a recommendation for the other, relevant, page.
2. Topical unity principle: It is assumed that Web pages that are co-cited (i.e. linked from the same pages) are related. Many Web mining algorithms make use of this assumption as a measure of mutual relevance between Web pages.
3. Lexical affinity principle: It is assumed that the text and the links within a page are relevant to each other. Once again, it is assumed that the text on a page has been chosen carefully by the creator to be related to a theme.
The area of Web mining deals with discovering useful information from the Web. Normally,
when we need to search for content on the Web, we use one of the search engines like Google or a
subject directory like Yahoo! Some search engines find pages based on the location and frequency of
keywords on the page, although some now use the concept of page rank.
The algorithm proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as
follows:
1. Sample: Start with a sample S provided by the user.
2. Occurrences: Find occurrences of tuples starting with those in S. Once tuples are found, the
context of every occurrence is saved. Let these occurrences be O:
O ← S
3. Patterns: Generate patterns based on the set of occurrences O. This requires generating patterns
with similar contexts:
P ← O
4. Match patterns: The Web is now searched for the patterns.
5. Stop if enough matches are found, else go to step 2.
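A schematic sketch of this loop follows, assuming a small in-memory corpus in place of the Web and a deliberately simplified pattern representation (the prefix, middle and suffix of each occurrence turned into a regular expression). The page texts and the (title, author) relation are invented for illustration and are not from the original algorithm's experiments:

```python
import re

# Toy corpus standing in for Web pages; real DIPRE would crawl or query the Web.
pages = [
    "A review of the book Foundation, by Isaac Asimov.",
    "A review of the book Dune, by Frank Herbert.",
    "Some notes on the book Neuromancer, by William Gibson.",
]

# Step 1: Sample S provided by the user.
sample = {("Foundation", "Isaac Asimov")}

def find_occurrences(tuples, pages):
    """Step 2: locate known tuples and save the context (prefix, middle, suffix) of each occurrence."""
    occurrences = []
    for title, author in tuples:
        for page in pages:
            t, a = page.find(title), page.find(author)
            if t != -1 and a > t:
                occurrences.append((page[:t], page[t + len(title):a], page[a + len(author):]))
    return occurrences

def build_patterns(occurrences):
    """Step 3: turn occurrence contexts into (very naive) regular-expression patterns."""
    return {re.escape(p) + r"(.+?)" + re.escape(m) + r"(.+?)" + re.escape(s)
            for p, m, s in occurrences}

def match_patterns(patterns, pages):
    """Step 4: search the corpus with the patterns to harvest new tuples."""
    found = set()
    for pattern in patterns:
        for page in pages:
            for title, author in re.findall(pattern, page):
                found.add((title, author))
    return found

# Step 5: iterate -- new tuples yield new occurrences, which yield new patterns.
tuples = set(sample)
for _ in range(3):                                  # a fixed round count stands in for "enough matches"
    occurrences = find_occurrences(tuples, pages)   # O <- S
    patterns = build_patterns(occurrences)          # P <- O
    tuples |= match_patterns(patterns, pages)
print(tuples)
```

In this toy run the third page is never found because its context differs from the learned pattern; on the real Web, with many more pages and iterations, the growing tuple set keeps producing new patterns, which is the point of the iterative design.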
Web document clustering
Web document clustering is another approach to find relevant documents on a topic or about query
keywords. The popular search engines often return a huge, unmanageable list of documents which
contain the keywords that the user specified. Finding the most useful documents from a large list is
usually tedious, often impossible. The user could apply clustering to a set of documents returned by a
search engine in response to a query, with the aim of finding semantically meaningful clusters rather
than a list of ranked documents.
Among cluster analysis techniques we discussed, in particular, the K-means method and the
agglomerative method. These methods can be used for Web document cluster analysis as well, but
they assume that each document has a fixed set of attributes that appear in all documents.
Similarity between documents can then be computed based on these values. One could possibly have
a set of words and their frequencies in each document and then use those values for clustering them.
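As a minimal sketch of this word-frequency approach, the example below builds term-frequency vectors for a few invented documents and computes cosine similarities; these similarity values are the kind of input a K-means or agglomerative method could then cluster on:

```python
import math
from collections import Counter

# Toy document set; in practice these would be Web pages or search-result snippets.
docs = {
    "d1": "web mining discovers useful patterns from web data",
    "d2": "data mining finds patterns in large databases",
    "d3": "the football team won the match on saturday",
}

# Word-frequency vector for each document.
vectors = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Pairwise similarities; these values could drive K-means or agglomerative clustering.
names = list(docs)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, round(cosine(vectors[x], vectors[y]), 2))
```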
Suffix Tree Clustering (STC) is an approach that takes a different path and is designed
specifically for Web document cluster analysis; it uses a phrase-based clustering approach rather
than single-word frequencies (a simplified sketch is given after the list of requirements below).
In STC, the key requirements of a Web document clustering algorithm include the following:
1. Relevance: This is the most obvious requirement. We want clusters that are relevant to the user
query and that cluster similar documents together.
2. Browsable summaries: The clusters must be easy to understand. The user should quickly be able to
browse the description of a cluster and work out whether the cluster is relevant to the query.
3. Snippet tolerance: The clustering method should not require whole documents and should be able
to produce relevant clusters based only on the information that the search engine returns.
4. Performance: The clustering method should be able to process the results of the search engine
quickly and provide the resulting clusters to the user.
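The following is only a simplified sketch of the phrase-based idea behind STC: it uses shared word n-grams in place of a proper generalized suffix tree and omits STC's merging of overlapping base clusters. The snippets and the scoring heuristic are illustrative assumptions:

```python
from collections import defaultdict

# Toy search-result snippets (STC is designed to work on snippets, not full documents).
snippets = {
    "d1": "data mining of web documents",
    "d2": "clustering of web documents returned by a search engine",
    "d3": "data mining algorithms for large databases",
}

def phrases(text, max_len=3):
    """All word n-grams (phrases) of length 1..max_len in a snippet."""
    words = text.split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

# Base clusters: each phrase shared by at least two documents defines a candidate cluster.
by_phrase = defaultdict(set)
for doc, text in snippets.items():
    for p in set(phrases(text)):
        by_phrase[p].add(doc)

base_clusters = {p: docs for p, docs in by_phrase.items() if len(docs) >= 2}

# Score favours clusters covering more documents with longer (more specific) phrases,
# roughly in the spirit of STC's base-cluster scoring.
for p, docs in sorted(base_clusters.items(), key=lambda kv: -len(kv[1]) * len(kv[0].split())):
    print(f"phrase: '{p}'  documents: {sorted(docs)}")
```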
1. A local copy may have been made to enable faster access to the material.
2. FAQs on the important topics are duplicated since such pages may be used frequently locally.
3. Online documentation of popular software like Unix or LaTeX may be duplicated for local use.
4. There are mirror sites that copy highly accessed sites to reduce traffic (e.g. to reduce international
traffic from India or Australia).
In some cases, documents are not exactly identical because different formatting might be used at
different sites. There may be some customization or use of templates at different sites. A large
document may be split into smaller documents, or a composite document may be joined together to
build a single document.
Copying a single Web page is often called replication; copying an entire Web site, on the other hand,
is called mirroring.
The discussion here is focused on content-based similarity, which is based on comparing the textual
content of the Web pages. Web pages also have non-text content, but we will not consider it.
We define two concepts:
1. Resemblance: The resemblance of two documents is defined to be a number between 0 and 1, with 1
indicating that the two documents are virtually identical and any value close to 1 indicating that the
documents are very similar.
2. Containment: The containment of one document in another is also defined as a number between 0 and
1, with 1 indicating that the first document is completely contained in the second.
There are a number of ways by which the similarity of documents can be assessed. One brute-force
approach is to compare two documents using software like the tool diff available in the Unix
operating system, which essentially compares the two documents as files. Other string comparison
algorithms can be used to find how many characters need to be deleted, changed or added to
transform one document into the other, but these approaches are very expensive if one wishes to
compare millions of documents.
There are other issues that must be considered in document matching. Firstly, if we are
looking to compare millions of documents then the storage requirement of the method should not be
large. Secondly, documents may be in HTML, PDF, Postscript, FrameMaker, TeX, PageMaker or
MS Word. They need to be converted to text for comparison. The conversion can introduce some
errors. Finally, the method should be robust, that is, it should not be possible to circumvent the
matching process with modest changes to a document.
Fingerprinting
An approach for comparing a large number of documents is based on the idea of
fingerprinting documents.
A document may be divided into all possible substrings of length L. These substrings are
called shingles. Based on the shingles one can define the resemblance R(X,Y) and the containment C(X,Y)
between two documents X and Y as follows. We assume S(X) and S(Y) to be the sets of shingles for
documents X and Y respectively.
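The usual shingle-based definitions, stated in terms of the sizes of the shingle sets, are:

R(X,Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X,Y) = |S(X) ∩ S(Y)| / |S(X)|

That is, resemblance is the proportion of shingles the two documents share out of all their shingles taken together, while containment is the proportion of X's shingles that also appear in Y.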
(Listing of the shingles of the two example documents, including shingles such as "is growing at", "to grow at" and "at a fast".)
Comparing the two sets of shingles, we find that only two of them are identical. Thus, for this simple
example, the documents are not very similar.
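A minimal sketch of this comparison is shown below, assuming word-level shingles of length three; the two sentences are invented to be consistent with the shingle fragments listed above and share exactly two shingles:

```python
def shingles(text, length=3):
    """All word-level shingles (substrings of `length` consecutive words) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + length]) for i in range(len(words) - length + 1)}

# Illustrative documents (stand-ins for Web page text).
doc_x = "the web is growing at a fast rate"
doc_y = "the web continues to grow at a fast rate"

sx, sy = shingles(doc_x), shingles(doc_y)
common = sx & sy

resemblance = len(common) / len(sx | sy)   # R(X, Y)
containment = len(common) / len(sx)        # C(X, Y)
print(sorted(common), round(resemblance, 2), round(containment, 2))
```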
5. Entry point - which Web site page the user entered from
6. Visit time and duration - the time and day of the visit and how long the visitor
browsed the site
7. Path analysis - a list of the path of pages that the user took
8. Visitor IP address - this helps in finding which part of the world the user came from
9. Browser type
10. Platform
11. Cookies
Even this simple information about a Web site and the pages within it can assist an enterprise to achieve
the following:
1. Shorten the paths to high-visit pages
2. Conversion rates: What are the look-to-click, click-to-basket-to-buy rates for each product? Are
there significant differences in these rates for different products? (A small sketch of computing such
rates follows this list.)
3. Impact of advertising: Which banners are pulling in the most traffic? What is their conversion
rate?
4. Impact of promotions: Which promotions generate the most sales? Is there a particular level in the
site where promotions are most effective?
5. Web site design: Which links do the customers click most frequently? Which links do they buy
from most frequently? Are there some features of these links that can be identified?
6. Customer segmentation: What are the features of customers who "abandon their trolley" without
buying? Where do the most profitable customers come from?
7. Enterprise search: Which customers use enterprise search? Are they more likely to purchase?
What do they search for? How frequently does the search engine return a failed result? How
frequently does the search engine return too many results?
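As a small illustration of the conversion rates mentioned in point 2, the sketch below computes look-to-click, click-to-basket and basket-to-buy rates per product from an invented clickstream; the event format and field names are assumptions, not a standard log layout:

```python
from collections import Counter

# Toy clickstream events: (visitor, product, action), with actions "view", "click", "basket", "buy".
events = [
    ("v1", "p1", "view"), ("v1", "p1", "click"), ("v1", "p1", "basket"), ("v1", "p1", "buy"),
    ("v2", "p1", "view"), ("v2", "p1", "click"),
    ("v3", "p2", "view"),
    ("v4", "p2", "view"), ("v4", "p2", "click"), ("v4", "p2", "basket"),
]

# Count how many times each (product, action) pair occurs.
counts = Counter((product, action) for _, product, action in events)

for product in sorted({p for _, p, _ in events}):
    views, clicks = counts[(product, "view")], counts[(product, "click")]
    baskets, buys = counts[(product, "basket")], counts[(product, "buy")]
    look_to_click = clicks / views if views else 0.0
    click_to_basket = baskets / clicks if clicks else 0.0
    basket_to_buy = buys / baskets if baskets else 0.0
    print(product, round(look_to_click, 2), round(click_to_basket, 2), round(basket_to_buy, 2))
```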
2. Topic drift: Certain clusters of tightly connected documents, perhaps due to mutually
reinforcing relationships between hosts, can dominate the HITS computation. These documents in
some instances may not be the most relevant to the query that was posed. It has been reported that in
one case, when the search term was "jaguar", the HITS algorithm converged to a football team called
the Jaguars. Other examples of topic drift have been found on topics like "gun control", "abortion", and
"movies".
3. Automatically generated links: Some of the links are computer generated and represent no human
judgement, but HITS still gives them equal importance.
4. Non-relevant documents: Some queries can return non-relevant documents among the highly ranked
results, and this can lead to erroneous results from the HITS algorithm.
5. Efficiency: The real-time performance of the algorithm is not good, given the steps that involve
finding sites that are pointed to by pages in the root set.
A number of proposals have been made for modifying HITS. These include:
• More careful selection of the base set will reduce the possibility of topic drift. One possible
approach might be to modify the HITS algorithm so that the hub and authority weights are
modified only based on the best hubs and the best authorities.
• One may argue that the in-link information is more important than the out-link information.
A hub can become important by pointing to a lot of authorities.
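To make the hub and authority weights concrete, the following is a minimal sketch of the iterative update at the heart of HITS, run on an invented four-page link graph; it omits the construction of the root and base sets from a search-engine result, which the full algorithm requires:

```python
import math

# A tiny directed link graph given as out-links; page names are illustrative only.
links = {
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}

pages = list(links)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

def normalise(scores):
    """Scale scores so their squares sum to 1, as in Kleinberg's formulation."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()}

for _ in range(20):  # iterate until the scores stabilise (a fixed count suffices here)
    # Authority update: sum of hub scores of the pages that point to each page.
    auth = normalise({p: sum(hub[q] for q in pages if p in links[q]) for p in pages})
    # Hub update: sum of authority scores of the pages each page points to.
    hub = normalise({p: sum(auth[q] for q in links[p]) for p in pages})

print({p: round(v, 2) for p, v in auth.items()})
print({p: round(v, 2) for p, v in hub.items()})
```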
Web Communities
A Wcb community is generated by a group of individuals that share a common interest. It manifests
on the Web as a collection of Web pages with a common interest as the theme. These could, for
example, be communities about a sub-discipline, a religious group, about sport or a sport team, a
hobby, an event. a country. a state, or whatever. The communities include in them newsgroups,
portals and the large ones may include directolies in sites like Yahoo!
The HITS algorithm finds authorities and hubs for a specified broad topic; the idea of cyber
communities is to find all such Web communities.
• Click Tracks, from a company of the same name, is Web mining software offering a number of
modules, including Analyzer, Optimizer and Pro, that use log files to provide Web site
analysis. It allows desktop data mining.
• Datanautics G2 and Insight 5 from Datanautics are Web mining software for data collection,
processing, analysis and reporting.
• LiveStats.NET and LiveStats.BIZ from DeepMetrix provide website analysis, data
visualization and statistics on distinct visitors, repeat visits, popular entry and exit pages,
time spent on pages, geographic reports which break down visits by country and continent,
click paths, keywords by search engine and more.
• NetTracker Web analytics from Sane Solutions claims to analyze log files (from Web
servers, proxy servers and firewalls), data gathered by JavaScript page tags, or a hybrid of
both.
• Nihuo Web Log Analyzer from LogAnalyser provides reports on how many visitors came to
the website, where they came from, which pages they viewed, and how long they spent on the
site.
• WebAnalyst from Megaputer is based on the PolyAnalyst text mining software.
• Weblog Expert 3.5, from a company with the same name, produces reports that include the
following information: activity statistics, accessed files, paths through the site, information
about referring pages, search engines, browsers, operating systems and more.
• WebTrends 7 from NetIQ is a collection of modules that provide a variety of Web data
including navigation analysis, customer segmentation and more.
• WUM: Web Utilization Miner is an open source project. WUMprep is a collection of Perl
scripts for data preprocessing tasks such as sessionizing, robot detection and mapping of
URLs onto concepts. WUM is integrated Java-based Web mining software for log file
preparation, basic reporting, discovery of sequential patterns and visualization.
CONCLUSION
The World Wide Web has become an extremely valuable resource for a large number of
people all around the world. During the last decade, the Web revolution has had a profound impact
on the way we search and find information at home and at work. Although information resources like
libraries have been available to the public for a long time, the Web provides instantaneous access to a
huge variety of information. From its beginning in the early 1990s, the Web has grown to perhaps
more than eight billion Web pages which are accessed all over the world every day. Millions of Web
pages are added every day and millions of others are modified or deleted.
The Web is an open medium with no controls on who puts up what kind of material. The
openness has meant that the Web has grown exponentially, which is its strength as well as its
weakness. The strength is that one can find information on just about any topic. The weakness is the
problem of abundance of information.
REVIEW QUESTIONS
1. Define the three types of Web mining. What are their major differences?
2. Define the following terms:
a) Browser
b) Uniform resource locator
c) Domain name server
d) Cookie
3. Describe three major differences between the conventional textual documents and Web
documents.
4. What is Lotka's Inverse-Square Law regarding scholarly publications? What relation does it
have to the power laws of distribution of in-links and out-links from Web pages?
5. Describe the "bow-tie" structure of the Web. What percentage of pages form the Strongly
Connected Core? What is the main property of the Core?
6. What is the difference between the deep and shallow Web? Why do we need the deep Web?
7. How can clustering be used in Web usage mining?
8. What is the basis of Kleinberg's HITS algorithm?
9. Provide a step-by-step description of Kleinberg's HITS algorithm for finding authorities and
hubs for topic "data mining".
10. Discuss the advantages and disadvantages ofthe HITS algorithm.
11. What is a Web Community? How do you discover them?
12. Use the HITS algorithm to find hubs and authorities from the following five web pages:
Page A (out-links to B, C, D)
Page B (out-links to A, C, D)
Page C (out-links to D)
Page D (out-links to C, E)
Page E (out-links to B, C, D)
13. What are the major differences between classical information retrieval and Web search?