Contents

7 Classification and Prediction
7.1 What is classification? What is prediction?
7.2 Issues regarding classification and prediction
7.3 Classification by decision tree induction
    7.3.1 Decision tree induction
    7.3.2 Tree pruning
    7.3.3 Extracting classification rules from decision trees
    7.3.4 Enhancements to basic decision tree induction
    7.3.5 Scalability and decision tree induction
    7.3.6 Integrating data warehousing techniques and decision tree induction
7.4 Bayesian classification
    7.4.1 Bayes theorem
    7.4.2 Naive Bayesian classification
    7.4.3 Bayesian belief networks
    7.4.4 Training Bayesian belief networks
7.5 Classification by backpropagation
    7.5.1 A multilayer feed-forward neural network
    7.5.2 Defining a network topology
    7.5.3 Backpropagation
    7.5.4 Backpropagation and interpretability
7.6 Association-based classification
7.7 Other classification methods
    7.7.1 k-nearest neighbor classifiers
    7.7.2 Case-based reasoning
    7.7.3 Genetic algorithms
    7.7.4 Rough set theory
    7.7.5 Fuzzy set approaches
7.8 Prediction
    7.8.1 Linear and multiple regression
    7.8.2 Nonlinear regression
    7.8.3 Other regression models
7.9 Classifier accuracy
    7.9.1 Estimating classifier accuracy
    7.9.2 Increasing classifier accuracy
    7.9.3 Is accuracy enough to judge a classifier?
7.10 Summary
© J. Han and M. Kamber, 1998. DRAFT!! DO NOT COPY!! DO NOT DISTRIBUTE!!
September 15, 1999
Chapter 7
Classification and Prediction

Databases are rich with hidden information that can be used for making intelligent business decisions. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Whereas classification predicts categorical labels (or discrete values), prediction models continuous-valued functions. For example, a classification model may be built to categorize bank loan applications as either safe or risky, while a prediction model may be built to predict the expenditures of potential customers on computer equipment given their income and occupation.

Many classification and prediction methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology. Most algorithms are memory resident, typically assuming a small data size. Recent database mining research has built on such work, developing scalable classification and prediction techniques capable of handling large, disk-resident data. These techniques often consider parallel and distributed processing.

In this chapter, you will learn basic techniques for data classification such as decision tree induction, Bayesian classification and Bayesian belief networks, and neural networks. The integration of data warehousing technology with classification is also discussed, as well as association-based classification. Other approaches to classification, such as k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, rough sets, and fuzzy logic techniques, are introduced. Methods for prediction, including linear, nonlinear, and generalized linear regression models, are briefly discussed. Where applicable, you will learn of modifications, extensions, and optimizations to these techniques for their application to data classification and prediction for large databases.
7.1 What is classification? What is prediction?

Data classification is a two-step process (Figure 7.1). In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is "supervised" in that it is told to which class each training sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels of the training samples are not known, and the number or set of classes to be learned may not be known in advance. Clustering is the topic of Chapter 8.

Typically, the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. For example, given a database of customer credit information, classification rules can be learned to identify customers as having either excellent or fair credit ratings (Figure 7.1a). The rules can be used to categorize future data samples, as well as provide a better understanding of the database contents.

In the second step (Figure 7.1b), the model is used for classification. First, the predictive accuracy of the model (or classifier) is estimated. Section 7.9 of this chapter describes several methods for estimating classifier accuracy. The holdout method is a simple technique which uses a test set of class-labeled samples. These samples are
randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. For each test sample, the known class label is compared with the learned model's class prediction for that sample. Note that if the accuracy of the model were estimated based on the training data set, this estimate could be optimistic, since the learned model tends to overfit the data (that is, it may have incorporated some particular anomalies of the training data which are not present in the overall sample population). Therefore, a test set is used.

If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known. (Such data are also referred to in the machine learning literature as "unknown" or "previously unseen" data.) For example, the classification rules learned in Figure 7.1a from the analysis of data from existing customers can be used to predict the credit rating of new or future (i.e., previously unseen) customers.

(a) Learning: Training Data -> Classification Algorithm -> Classification Rules

    name            age      income   credit_rating
    Sandy Jones     <30      low      fair
    Bill Lee        <30      low      excellent
    Courtney Fox    30-40    high     excellent
    Susan Lake      >40      med      fair
    Claire Phips    >40      med      fair
    Andre Beau      30-40    high     excellent
    ...

    Learned rule: IF age = 30-40 AND income = high THEN credit_rating = excellent

(b) Classification: Test Data and New Data -> Classification Rules

    name            age      income   credit_rating
    Frank Jones     >40      high     fair
    Sylvia Crest    <30      low      fair
    Anne Yee        30-40    high     excellent
    ...

    New data: (John Henri, 30-40, high). Credit rating? excellent

Figure 7.1: The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is credit_rating, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

"How is prediction different from classification?" Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have. In this view, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. In our view, however, we refer to the use of prediction to predict class labels as classification and the use of prediction to predict continuous values (e.g., using regression techniques) as prediction. This view is commonly accepted in data mining.

Classification and prediction have numerous applications including credit approval, medical diagnosis, performance prediction, and selective marketing.
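The accuracy estimate on a test set described in this section can be illustrated with a small sketch. It applies the single rule shown in Figure 7.1a to the test tuples of Figure 7.1b and reports the fraction classified correctly. The function names and the fallback to fair when the rule does not fire are assumptions made for illustration; the text does not specify a default class here.

```python
def classify(age, income):
    """Apply the learned rule of Figure 7.1(a).

    The fallback to "fair" is an assumption made for this sketch.
    """
    if age == "30-40" and income == "high":
        return "excellent"
    return "fair"  # assumed default when the rule does not fire


def accuracy(test_samples):
    """Fraction of test samples whose known label matches the prediction."""
    correct = sum(
        1 for age, income, label in test_samples
        if classify(age, income) == label
    )
    return correct / len(test_samples)


# Test data of Figure 7.1(b): (age, income, known credit_rating)
test_data = [
    (">40", "high", "fair"),         # Frank Jones
    ("<30", "low", "fair"),          # Sylvia Crest
    ("30-40", "high", "excellent"),  # Anne Yee
]

print(accuracy(test_data))        # 1.0 on these three tuples
print(classify("30-40", "high"))  # John Henri -> excellent
```

Because the test tuples are kept separate from the training tuples, the reported fraction is not inflated by anomalies the rule may have picked up from the training data.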
Example 7.1 Suppose that we have a database of customers on the AllElectronics mailing list. The mailing list is used to send out promotional literature describing new products and upcoming price discounts. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics. Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. To send out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.

Suppose instead that you would like to predict the number of major purchases that a customer will make at AllElectronics during a fiscal year. Since the predicted value here is ordered, a prediction model can be constructed for this purpose.
7.2 Issues regarding classification and prediction

Preparing the data for classification and prediction. The following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

Data cleaning. This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example), and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis. Many of the attributes in the data may be irrelevant to the classification or prediction task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" feature subset, should be less than the time that would have been spent on learning from the original set of features. Hence, such analysis can help improve classification efficiency and scalability.

Data transformation. The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium, and high. Similarly, nominal-valued attributes, like street, can be generalized to higher-level concepts, like city. Since generalization compresses the original training data, fewer input/output operations may be involved during learning. The data may also be normalized, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. In methods which use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes).

Data cleaning, relevance analysis, and data transformation are described in greater detail in Chapter 3 of this book.

Comparing classification methods. Classification and prediction methods can be compared and evaluated according to the following criteria:

1. Predictive accuracy. This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
2. Speed. This refers to the computation costs involved in generating and using the model.
3. Robustness. This is the ability of the model to make correct predictions given noisy data or data with missing values.
4. Scalability. This refers to the ability of the learned model to perform efficiently on large amounts of data.
5. Interpretability. This refers to the level of understanding and insight that is provided by the learned model.

These issues are discussed throughout the chapter. The database research community's contributions to classification and prediction for data mining have strongly emphasized the scalability aspect, particularly with respect to decision tree induction.

[Figure 7.2 (tree diagram): the root node tests age?, with branches <30, 30-40, and >40. The <30 branch leads to a student? test (branches no and yes), the 30-40 branch leads to a leaf labeled yes, and the >40 branch leads to a credit_rating? test (branches excellent and fair).]

Figure 7.2: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
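Two of the preprocessing steps described in this section, replacing a missing value with the most commonly occurring value and normalizing an attribute to a specified range, can be sketched as follows. This is a minimal illustration: the function names, the use of None to mark a missing value, and the sample income figures are assumptions, and min-max scaling is only one possible way to carry out the normalization the text describes.

```python
from collections import Counter


def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most commonly occurring known value."""
    known = [v for v in values if v is not missing]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map every value to new_min
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


incomes = [20000, 50000, 80000]  # made-up values for illustration
print(min_max_normalize(incomes))             # [0.0, 0.5, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # [-1.0, 0.0, 1.0]
print(fill_missing_with_mode(["low", None, "low", "high"]))
```

After scaling, an attribute such as income contributes to a distance computation on the same footing as a binary attribute, which is the effect the normalization step is meant to achieve.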
7.3 Classification by decision tree induction

"What is a decision tree?"

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node. A typical decision tree is shown in Figure 7.2. It represents the concept buys_computer, that is, it predicts whether or not a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.

In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node which holds the class prediction for that sample. Decision trees can easily be converted to classification rules.

In Section 7.3.1, we describe a basic algorithm for learning decision trees. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. Tree pruning is described in Section 7.3.2. The extraction of classification rules from decision trees is discussed in Section 7.3.3. Enhancements of the basic decision tree algorithm are given in Section 7.3.4. Scalability issues for the induction of decision trees from large databases are discussed in Section 7.3.5. Section 7.3.6 describes the integration of decision tree induction with data warehousing facilities, such as data cubes, allowing the mining of decision trees at multiple levels of granularity.

Decision trees have been used in many application areas ranging from medicine to game theory and business. Decision trees are the basis of several commercial rule induction systems.
Algorithm 7.3.1 (Generate_decision_tree) Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:

 1) create a node N;
 2) if samples are all of the same class, C, then
 3)     return N as a leaf node labeled with the class C;
 4) if attribute-list is empty then
 5)     return N as a leaf node labeled with the most common class in samples; // majority voting
 6) select test-attribute, the attribute among attribute-list with the highest information gain;
 7) label node N with test-attribute;
 8) for each known value a_i of test-attribute // partition the samples
 9)     grow a branch from node N for the condition test-attribute = a_i;
10)     let s_i be the set of samples in samples for which test-attribute = a_i; // a partition
11)     if s_i is empty then
12)         attach a leaf labeled with the most common class in samples;
13)     else attach the node returned by Generate_decision_tree(s_i, attribute-list - test-attribute);

Figure 7.3: Basic algorithm for inducing a decision tree from training samples.
7.3.1 Decision tree induction

The basic algorithm for decision tree induction is a greedy algorithm which constructs decision trees in a top-down, recursive, divide-and-conquer manner. The algorithm, summarized in Figure 7.3, is a version of ID3, a well-known decision tree induction algorithm. Extensions to the algorithm are discussed in Sections 7.3.2 to 7.3.6. The basic strategy is as follows:

The tree starts as a single node representing the training samples (step 1).

If the samples are all of the same class, then the node becomes a leaf and is labeled with that class (steps 2 and 3).

Otherwise, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the attribute that will best separate the samples into individual classes (step 6). This attribute becomes the "test" or "decision" attribute at the node (step 7). In this version of the algorithm, all attributes are categorical, i.e., discrete-valued. Continuous-valued attributes must be discretized.

A branch is created for each known value of the test attribute, and the samples are partitioned accordingly (steps 8-10).

The algorithm uses the same process recursively to form a decision tree for the samples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants (step 13).

The recursive partitioning stops only when any one of the following conditions is true:

1. All samples for a given node belong to the same class (steps 2 and 3), or
2. There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting the given node into a leaf and labeling it with the class in majority among samples. Alternatively, the class distribution of the node samples may be stored; or
3. There are no samples for the branch test-attribute = a_i (step 11). In this case, a leaf is created with the majority class in samples (step 12).
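The recursive strategy of Figure 7.3 can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the dictionary-based tree encoding, the function names, the explicit `domains` argument (listing the known values of each attribute, so that empty partitions can arise), and the five toy samples are all assumptions. Information gain is computed from class entropy in the standard way.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, attribute, target):
    """Entropy of the class labels minus the weighted entropy after splitting."""
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy([s[target] for s in samples]) - remainder


def majority_class(samples, target):
    return Counter(s[target] for s in samples).most_common(1)[0][0]


def generate_decision_tree(samples, attribute_list, target, domains):
    classes = {s[target] for s in samples}
    if len(classes) == 1:                        # steps 2-3: pure node
        return classes.pop()
    if not attribute_list:                       # steps 4-5: majority voting
        return majority_class(samples, target)
    test_attribute = max(                        # step 6: highest information gain
        attribute_list, key=lambda a: information_gain(samples, a, target)
    )
    node = {"attribute": test_attribute, "branches": {}}   # step 7
    remaining = [a for a in attribute_list if a != test_attribute]
    for value in domains[test_attribute]:        # steps 8-10: partition the samples
        partition = [s for s in samples if s[test_attribute] == value]
        if not partition:                        # steps 11-12: empty partition
            node["branches"][value] = majority_class(samples, target)
        else:                                    # step 13: recurse
            node["branches"][value] = generate_decision_tree(
                partition, remaining, target, domains
            )
    return node


# Made-up toy samples for illustration (not Table 7.1).
toy = [
    {"age": "<30", "student": "no", "buys": "no"},
    {"age": "<30", "student": "yes", "buys": "yes"},
    {"age": "30-40", "student": "no", "buys": "yes"},
    {"age": "30-40", "student": "yes", "buys": "yes"},
    {"age": "30-40", "student": "no", "buys": "yes"},
]
domains = {"age": ["<30", "30-40", ">40"], "student": ["no", "yes"]}
tree = generate_decision_tree(toy, ["age", "student"], "buys", domains)
print(tree)
```

On the toy samples, age is selected at the root, the <30 partition is split again on student, the 30-40 partition is pure, and the empty >40 partition receives the majority class of the samples at the root.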
Attribute selection measure. The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Such an information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found.

Let $S$ be a set consisting of $s$ data samples. Suppose the class label attribute has $m$ distinct values defining $m$ distinct classes, $C_i$ (for $i = 1, \ldots, m$). Let $s_i$ be the number of samples of $S$ in class $C_i$. The expected information needed to classify a given sample is given by:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad (7.1)$$

where $p_i$ is the probability that an arbitrary sample belongs to class $C_i$ and is estimated by $s_i / s$. Note that a log function to the base 2 is used since the information is encoded in bits.

Let attribute $A$ have $v$ distinct values, $\{a_1, a_2, \ldots, a_v\}$. Attribute $A$ can be used to partition $S$ into $v$ subsets, $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ contains those samples in $S$ that have value $a_j$ of $A$. If $A$ were selected as the test attribute (i.e., the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set $S$. Let $s_{ij}$ be the number of samples of class $C_i$ in a subset $S_j$. The entropy, or expected information based on the partitioning into subsets by $A$, is given by:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj}). \qquad (7.2)$$

The term $\frac{s_{1j} + \cdots + s_{mj}}{s}$ acts as the weight of the $j$th subset and is the number of samples in the subset (i.e., having value $a_j$ of $A$) divided by the total number of samples in $S$. Here $I(s_{1j}, \ldots, s_{mj})$ is Equation (7.1) applied to the subset $S_j$, with each probability estimated by $s_{ij} / (s_{1j} + \cdots + s_{mj})$. The smaller the entropy value is, the greater the purity of the subset partitions. The encoding information that would be gained by branching on $A$ is

$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A). \qquad (7.3)$$

In other words, $Gain(A)$ is the expected reduction in entropy caused by knowing the value of attribute $A$.

The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set $S$. A node is created and labeled with the attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly.
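Equations (7.1) to (7.3) translate directly into code when each subset is described by its per-class sample counts. The sketch below is illustrative only: the function names and the two example partitions are assumptions, chosen so that one split yields perfectly pure subsets and the other yields subsets as mixed as the original set.

```python
import math


def expected_information(class_counts):
    """I(s_1, ..., s_m) of Equation (7.1); zero counts contribute nothing."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)


def entropy_of_partition(subsets):
    """E(A) of Equation (7.2); each subset is a tuple (s_1j, ..., s_mj)."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * expected_information(counts) for counts in subsets)


def gain(subsets):
    """Gain(A) of Equation (7.3); overall class counts are the column sums."""
    overall = [sum(column) for column in zip(*subsets)]
    return expected_information(overall) - entropy_of_partition(subsets)


# Ten samples, five per class, split two different ways (made-up counts).
print(gain([(5, 0), (0, 5)]))  # pure subsets: gain of 1 bit
print(gain([(3, 3), (2, 2)]))  # subsets as mixed as the whole set: gain of 0
```

The two extremes show why the measure works as a splitting heuristic: a split that isolates the classes recovers all of the expected information, while a split that leaves every subset evenly mixed recovers none.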
Example 7.2 Induction of a decision tree. Table 7.1 presents a training set of data tuples taken from the AllElectronics customer database. (The data are adapted from Quinlan [1986b].) The class label attribute, buys_computer, has two distinct values (namely {yes, no}); therefore, there are two distinct classes ($m = 2$). Let class $C_1$ correspond to yes and class $C_2$ correspond to no. There are 9 samples of class yes and 5 samples of class no.

    rid   age     income   student   credit_rating   Class: buys_computer
    1     <30     high     no        fair            no
    2     <30     high     no        excellent       no
    3     30-40   high     no        fair            yes
    4     >40     medium   no        fair            yes
    5     >40     low      yes       fair            yes
    6     >40     low      yes       excellent       no
    7     30-40   low      yes       excellent       yes
    8     <30     medium   no        fair            no
    9     <30     low      yes       fair            yes
    10    >40     medium   yes       fair            yes
    11    <30     medium   yes       excellent       yes
    12    30-40   medium   no        excellent       yes
    13    30-40   high     yes       fair            yes
    14    >40     medium   no        excellent       no

Table 7.1: Training data tuples from the AllElectronics customer database.

To compute the information gain of each attribute, we first use Equation (7.1) to compute the expected information needed to classify a given sample. This is:

$$I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

Next, we need to compute the entropy of each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no samples for each value of age. We compute the expected information for each of these distributions.

    for age = "<30":     s_11 = 2   s_21 = 3   I(s_11, s_21) = 0.971
    for age = "30-40":   s_12 = 4   s_22 = 0   I(s_12, s_22) = 0
    for age = ">40":     s_13 = 3   s_23 = 2   I(s_13, s_23) = 0.971

Using Equation (7.2), the expected information needed to classify a given sample if the samples are partitioned according to age is:

$$E(\text{age}) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = 0.694.$$

Hence, the gain in information from such a partitioning would be:

$$Gain(\text{age}) = I(s_1, s_2) - E(\text{age}) = 0.246$$

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values. The samples are then partitioned accordingly, as shown in Figure 7.4. Notice that the samples falling into the partition for age = 30-40 all belong to the same class. Since they all belong to class yes, a leaf should therefore be created at the end of this branch and labeled with yes. The final decision tree returned by the algorithm is shown in Figure 7.2.

In summary, decision tree induction algorithms have been used for classification in a wide range of application domains. Such systems do not use domain knowledge. The learning and classification steps of decision tree induction are generally fast. Classification accuracy is typically high for data where the mapping of classes consists of long and thin regions in concept space.
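The gains reported in Example 7.2 can be checked by computing them directly from the tuples of Table 7.1. In the sketch below the tuple layout, the function names, and the age labels "<30", "30-40", and ">40" (following the branch labels of the figures) are assumptions. Computed at full precision, the gains for age and student come out about 0.001 higher than the figures in the example, which subtract already rounded intermediate values.

```python
import math
from collections import Counter

# Table 7.1 as (age, income, student, credit_rating, buys_computer), rid 1..14.
ROWS = [
    ("<30", "high", "no", "fair", "no"),
    ("<30", "high", "no", "excellent", "no"),
    ("30-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("30-40", "low", "yes", "excellent", "yes"),
    ("<30", "medium", "no", "fair", "no"),
    ("<30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<30", "medium", "yes", "excellent", "yes"),
    ("30-40", "medium", "no", "excellent", "yes"),
    ("30-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = ["age", "income", "student", "credit_rating"]


def info(labels):
    """Expected information (Equation 7.1) for a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, column):
    """Information gain (Equation 7.3) of the attribute stored in `column`."""
    expected = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[-1] for r in rows if r[column] == value]
        expected += len(subset) / len(rows) * info(subset)  # Equation 7.2
    return info([r[-1] for r in rows]) - expected


gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, g in gains.items():
    print(f"{name}: {g:.4f}")
best = max(gains, key=gains.get)
print("test attribute:", best)
```

The ranking of the attributes is the same as in the example, so age is again the attribute selected for the root node.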
7.3.2 Tree pruning

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.

"How does tree pruning work?" There are two common approaches to tree pruning.

In the prepruning approach, a tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training samples at a given node). Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples, or the probability distribution of those samples. When constructing a tree, measures such as statistical significance, χ², information gain, etc., can be used to assess the goodness of a split. If partitioning the samples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted. There are difficulties, however,
in choosing an appropriate threshold. High thresholds could result in oversimplified trees, while low thresholds could result in very little simplification.

The postpruning approach removes branches from a "fully grown" tree. A tree node is pruned by removing its branches. The cost complexity pruning algorithm is an example of the postpruning approach. The pruned node becomes a leaf and is labeled by the most frequent class among its former branches. For each non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if the subtree at that node were pruned. Next, the expected error rate occurring if the node were not pruned is calculated using the error rates for each branch, combined by weighting according to the proportion of observations along each branch. If pruning the node leads to a greater expected error rate, then the subtree is kept. Otherwise, it is pruned. After generating a set of progressively pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision tree that minimizes the expected error rate is preferred.

Rather than pruning trees based on expected error rates, we can prune trees based on the number of bits required to encode them. The "best pruned tree" is the one that minimizes the number of encoding bits. This method adopts the Minimum Description Length (MDL) principle, which follows the notion that the simplest solution is preferred. Unlike cost complexity pruning, it does not require an independent set of samples.

Alternatively, prepruning and postpruning may be interleaved for a combined approach. Postpruning requires more computation than prepruning, yet generally leads to a more reliable tree.

    age = <30:
        income   student   credit_rating   Class
        high     no        fair            no
        high     no        excellent       no
        medium   no        fair            no
        low      yes       fair            yes
        medium   yes       excellent       yes

    age = 30-40:
        income   student   credit_rating   Class
        high     no        fair            yes
        low      yes       excellent       yes
        medium   no        excellent       yes
        high     yes       fair            yes

    age = >40:
        income   student   credit_rating   Class
        medium   no        fair            yes
        low      yes       fair            yes
        low      yes       excellent       no
        medium   yes       fair            yes
        medium   no        excellent       no

Figure 7.4: The attribute age has the highest information gain and therefore becomes a test attribute at the root node of the decision tree. Branches are grown for each value of age. The samples are shown partitioned according to each branch.
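The comparison made at each non-leaf node during postpruning can be sketched as follows. The sketch assumes that expected error rates are already available for the node treated as a leaf and for each of its branches; how those estimates are obtained is outside its scope. The function names and the numbers are hypothetical.

```python
def error_if_kept(branches):
    """Expected error of the unpruned subtree.

    `branches` is a list of (n_observations, expected_error_rate) pairs; the
    branch error rates are weighted by the proportion of observations.
    """
    total = sum(n for n, _ in branches)
    return sum(n / total * err for n, err in branches)


def should_prune(error_if_pruned, branches):
    """Keep the subtree only if pruning would give a greater expected error."""
    return error_if_pruned <= error_if_kept(branches)


# Hypothetical node with two branches covering 60 and 40 observations.
branches = [(60, 0.10), (40, 0.30)]
print(error_if_kept(branches))       # weighted error, about 0.18
print(should_prune(0.15, branches))  # True: the leaf is at least as good
print(should_prune(0.25, branches))  # False: the subtree is kept
```

Applying this test bottom-up over the non-leaf nodes yields the progressively pruned trees that are then compared on an independent test set.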
7.3.3 Extracting classification rules from decision trees

"Can I get classification rules out of my decision tree? If so, how?"

The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node. Each attribute-value pair along a given path forms a conjunction in the rule antecedent ("IF" part). The leaf node holds the class prediction, forming the rule consequent ("THEN" part). The IF-THEN rules may be easier for humans to understand, particularly if the given tree is very large.
Example 7.3 Generating classification rules from a decision tree. The decision tree of Figure 7.2 can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 7.2 are:

    IF age = "<30" AND student = no THEN buys_computer = no
    IF age = "<30" AND student = yes THEN buys_computer = yes
    IF age = "30-40" THEN buys_computer = yes
    IF age = ">40" AND credit_rating = excellent THEN buys_computer = no
    IF age = ">40" AND credit_rating = fair THEN buys_computer = yes
C4.5, a later version of the ID3 algorithm, uses the training samples to estimate the accuracy of each rule. Since this would result in an optimistic estimate of rule accuracy, C4.5 employs a pessimistic estimate to compensate for the bias. Alternatively, a set of test samples independent from the training set can be used to estimate rule accuracy.

A rule can be "pruned" by removing any condition in its antecedent that does not improve the estimated accuracy of the rule. For each class, rules within a class may then be ranked according to their estimated accuracy. Since it is possible that a given test sample will not satisfy any rule antecedent, a default rule assigning the majority class is typically added to the resulting rule set.
7.3.4 Enhancements to basic decision tree induction

"What are some enhancements to basic decision tree induction?"

Many enhancements to the basic decision tree induction algorithm of Section 7.3.1 have been proposed. In this section, we discuss several major enhancements, many of which are incorporated into C4.5, a successor algorithm to ID3.

The basic decision tree induction algorithm of Section 7.3.1 requires all attributes to be categorical or discretized. The algorithm can be modified to allow for continuous-valued attributes. A test on a continuous-valued attribute $A$ results in two branches, corresponding to the conditions $A \le V$ and $A > V$ for some numeric value, $V$, of $A$. Given $v$ values of $A$, then $v - 1$ possible splits are considered in determining $V$. Typically, the midpoints between each pair of adjacent values are considered. If the values are sorted in advance, then this requires only one pass through the values.

The basic algorithm for decision tree induction creates one branch for each value of a test attribute, and then distributes the samples accordingly. This partitioning can result in numerous small subsets. As the subsets become smaller and smaller, the partitioning process may end up using sample sizes that are statistically insufficient. The detection of useful patterns in the subsets may become impossible due to insufficiency of the data. One alternative is to allow for the grouping of categorical attribute values. A tree node may test whether the value of an attribute belongs to a given set of values, such as $A_i \in \{a_1, a_2, \ldots, a_n\}$. Another alternative is to create binary decision trees, where each branch holds a boolean test on an attribute. Binary trees result in less fragmentation of the data. Some empirical studies have found that binary decision trees tend to be more accurate than traditional decision trees.

The information gain measure is biased in that it tends to prefer attributes with many values. Many alternatives have been proposed, such as gain ratio, which considers the probability of each attribute value. Various other selection measures exist, including the gini index, the $\chi^2$ contingency table statistic, and the G-statistic.

Many methods have been proposed for handling missing attribute values. A missing or unknown value for an attribute $A$ may be replaced by the most common value for $A$, for example. Alternatively, the apparent information gain of attribute $A$ can be reduced by the proportion of samples with unknown values of $A$. In this way, "fractions" of a sample having a missing value can be partitioned into more than one branch at a test node. Other methods may look for the most probable value of $A$, or make use of known relationships between $A$ and other attributes.

Incremental versions of decision tree induction have been proposed. When given new training data, these restructure the decision tree acquired from learning on previous training data, rather than relearning a new tree "from scratch".

Additional enhancements to basic decision tree induction which address scalability, and the integration of data warehousing techniques, are discussed in Sections 7.3.5 and 7.3.6, respectively.
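The generation of candidate split points for a continuous-valued attribute can be sketched as below. The function name candidate_splits and the sample values are hypothetical; the sketch only enumerates the midpoints and does not evaluate which of them gives the best split.

```python
def candidate_splits(values):
    """Midpoints between adjacent distinct sorted values.

    For v distinct values this yields v - 1 candidate thresholds V,
    each defining the two branches A <= V and A > V.
    """
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


ages = [38, 26, 35, 49]          # illustrative values of a continuous attribute
print(candidate_splits(ages))    # [30.5, 36.5, 43.5]
```

Once the values are sorted, the candidates are produced in a single pass over adjacent pairs, which is the one-pass property mentioned above.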
7.3.5 Scalability and decision tree induction

"How scalable is decision tree induction?"

The efficiency of existing decision tree algorithms, such as ID3 and C4.5, has been well established for relatively small data sets. Efficiency and scalability become issues of concern when these algorithms are applied to the mining of very large, real-world databases. Most decision tree algorithms have the restriction that the training samples should reside in main memory. In data mining applications, very large training sets of millions of samples are common. Hence, this restriction limits the scalability of such algorithms, where the decision tree construction can become inefficient due to swapping of the training samples in and out of main and cache memories.

Early strategies for inducing decision trees from large databases include discretizing continuous attributes, and sampling data at each node. These, however, still assume that the training set can fit in memory. An alternative method first partitions the data into subsets which individually can fit into memory, and then builds a decision tree from each subset. The final output classifier combines each classifier obtained from the subsets. Although this method allows for the classification of large data sets, its classification accuracy is not as high as the single classifier that would have been built using all of the data at once.

    rid   credit_rating   age   buys_computer
    1     excellent       38    yes
    2     excellent       26    yes
    3     fair            35    no
    4     excellent       49    no

Table 7.2: Sample data for the class buys_computer.
Disk-resident attribute lists:

    credit_rating   rid        age   rid
    excellent       1          26    2
    excellent       2          35    3
    fair            3          38    1
    excellent       4          49    4
    ...             ...        ...   ...

Memory-resident class list:

    rid   buys_computer   node
    1     yes             5
    2     yes             2
    3     no              3
    4     no              6
    ...   ...             ...

Figure 7.5: Attribute list and class list data structures used in SLIQ for the sample data of Table 7.2.

More recent decision tree algorithms which address the scalability issue have been proposed. Algorithms for the induction of decision trees from very large training sets include SLIQ and SPRINT, both of which can handle categorical and continuous-valued attributes. Both algorithms propose pre-sorting techniques on disk-resident data sets that are too large to fit in memory. Both define the use of new data structures to facilitate the tree construction.

SLIQ employs disk-resident attribute lists and a single memory-resident class list. The attribute lists and class list generated by SLIQ for the sample data of Table 7.2 are shown in Figure 7.5. Each attribute has an associated attribute list, indexed by rid (a record identifier). Each tuple is represented by a linkage of one entry from each attribute list to an entry in the class list (holding the class label of the given tuple), which in turn is linked to its corresponding leaf node in the decision tree. The class list remains in memory since it is often accessed and modified in the building and pruning phases. The size of the class list grows proportionally with the number of tuples in the training set. When a class list cannot fit into memory, the performance of SLIQ decreases.

SPRINT uses a different attribute list data structure which holds the class and rid information, as shown in Figure 7.6. When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly. When a list is partitioned, the order of the records in the list is maintained. Hence, partitioning lists does not require resorting. SPRINT was designed to be easily parallelized, further contributing to its scalability.

    credit_rating   buys_computer   rid        age   buys_computer   rid
    excellent       yes             1          26    yes             2
    excellent       yes             2          35    no              3
    fair            no              3          38    yes             1
    excellent       no              4          49    no              4
    ...             ...             ...        ...   ...             ...

Figure 7.6: Attribute list data structure used in SPRINT for the sample data of Table 7.2.

While both SLIQ and SPRINT handle disk-resident data sets that are too large to fit into memory, the scalability of SLIQ is limited by the use of its memory-resident data structure. SPRINT removes all memory restrictions, yet requires the use of a hash tree proportional in size to the training set. This may become expensive as the training set size grows.

RainForest is a framework for the scalable induction of decision trees. The method adapts to the amount of main memory available, and applies to any decision tree induction algorithm. It maintains an AVC-set (Attribute-Value, Class label) indicating the class distribution for each attribute. RainForest reports a speed-up over SPRINT.
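The attribute lists of Figures 7.5 and 7.6 can be rebuilt from Table 7.2 with a few lines of code. This is a rough in-memory sketch only: the variable names are hypothetical, the lists are ordinary Python lists rather than disk-resident structures, and the leaf-node pointers of the SLIQ class list are omitted.

```python
# Table 7.2: rid -> (credit_rating, age, buys_computer)
table = {
    1: ("excellent", 38, "yes"),
    2: ("excellent", 26, "yes"),
    3: ("fair", 35, "no"),
    4: ("excellent", 49, "no"),
}

# SLIQ: attribute lists hold (value, rid); the continuous list is pre-sorted.
sliq_age = sorted((age, rid) for rid, (_, age, _) in table.items())
sliq_credit = [(credit, rid) for rid, (credit, _, _) in table.items()]
# SLIQ: a single class list, indexed by rid (node pointers omitted here).
class_list = {rid: label for rid, (_, _, label) in table.items()}

# SPRINT: each attribute list entry also carries the class label.
sprint_age = sorted((age, label, rid) for rid, (_, age, label) in table.items())

print(sliq_age)     # [(26, 2), (35, 3), (38, 1), (49, 4)]
print(sprint_age)   # [(26, 'yes', 2), (35, 'no', 3), (38, 'yes', 1), (49, 'no', 4)]

# Partitioning a sorted list preserves its order, so no re-sorting is needed.
threshold = 36.5    # hypothetical split point on age
left = [entry for entry in sprint_age if entry[0] <= threshold]
right = [entry for entry in sprint_age if entry[0] > threshold]
print(left, right)
```

Because each SPRINT entry is self-contained, a child node can be handed its partition of every attribute list without consulting a shared class list.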
7.3.6 Integrating data warehousing techniques and decision tree induction

Decision tree induction can be integrated with data warehousing techniques for data mining. In this section we discuss the method of attribute-oriented induction to generalize the given data, and the use of multidimensional data cubes to store the generalized data at multiple levels of granularity. We then discuss how these approaches can be integrated with decision tree induction in order to facilitate interactive multilevel mining. The use of a data mining query language to specify classification tasks is also discussed. In general, the techniques described here are applicable to other forms of learning as well.

Attribute-oriented induction (AOI) uses concept hierarchies to generalize the training data by replacing lower level data with higher level concepts (Chapter 5). For example, numerical values for the attribute income may be generalized to the ranges "<30K", "30K-40K", ">40K", or the categories low, medium, or high. This allows the user to view the data at more meaningful levels. In addition, the generalized data are more compact than the original training set, which may result in fewer input/output operations. Hence, AOI also addresses the scalability issue by compressing the training data.

The generalized training data can be stored in a multidimensional data cube, such as the structure typically used in data warehousing (Chapter 2). The data cube is a multidimensional data structure, where each dimension represents an attribute or a set of attributes in the data schema, and each cell stores the value of some aggregate measure (such as count). Figure 7.7 shows a data cube for customer information data, with the dimensions income, age, and occupation. The original numeric values of income and age have been generalized to ranges. Similarly, original values for occupation, such as accountant and banker, or nurse and X-ray technician, have been generalized to finance and medical, respectively.
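The generalization step and the storage of counts in cube cells can be sketched as follows. Everything concrete here is an assumption for illustration: the customer tuples are invented, the treatment of the boundary values 30K and 40K is a guess, and the cube is reduced to two dimensions and held in a Counter rather than a real warehouse structure.

```python
from collections import Counter


def generalize_income(income):
    """Map a numeric income to a range (boundary handling is an assumption)."""
    if income < 30_000:
        return "<30K"
    if income <= 40_000:
        return "30K-40K"
    return ">40K"


# Concept hierarchy for occupation, following the examples in the text.
occupation_hierarchy = {
    "accountant": "finance", "banker": "finance",
    "nurse": "medical", "X-ray technician": "medical",
}

# Hypothetical customer tuples: (income, occupation).
customers = [
    (28_000, "nurse"), (45_000, "banker"),
    (52_000, "accountant"), (35_000, "X-ray technician"),
    (47_000, "banker"),
]

# Each cube cell stores the aggregate measure count.
cube = Counter(
    (generalize_income(inc), occupation_hierarchy[occ]) for inc, occ in customers
)
print(cube[(">40K", "finance")])   # 3
```

Five input tuples collapse into three non-empty cells, which illustrates how generalization compresses the training data.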
The advantage of the multidimensional structure is that it allows fast indexing to cells (or slices) of the cube. For instance, one may easily and quickly access the total count of customers in occupations relating to finance who have an income greater than $40K, or the number of customers who work in the area of medicine and are less than 40 years old.

Data warehousing systems provide a number of operations that allow mining on the data cube at multiple levels of granularity. To review, the roll-up operation performs aggregation on the cube, either by climbing up a concept hierarchy (e.g., replacing the value banker for occupation by the more general finance), or by removing a dimension in the cube. Drill-down performs the reverse of roll-up, by either stepping down a concept hierarchy or adding a dimension (e.g., time). A slice performs a selection on one dimension of the cube. For example, we may obtain a data slice for the generalized value accountant of occupation, showing the corresponding income and age data. A dice performs a selection on two or more dimensions. The pivot or rotate operation rotates the data axes in view in order to provide an alternative presentation of the data. For example, pivot may be used to transform a 3-D cube into a series of 2-D planes.

[Figure 7.7 diagram: a cube with the dimensions Occupation (Finance, Medical, Government), Income (<30K, 30K-40K, >40K), and Age (<30, 30-40, >40).]

Figure 7.7: A multidimensional data cube.

The above approaches can be integrated with decision tree induction to provide interactive multilevel mining of decision trees. The data cube and knowledge stored in the concept hierarchies can be used to induce decision trees at different levels of abstraction. Furthermore, once a decision tree has been derived, the concept hierarchies can be used to generalize or specialize individual nodes in the tree, allowing attribute roll-up or drill-down, and reclassification of the data for the newly specified abstraction level. This interactive feature will allow users to focus their attention on areas of the tree or data which they find interesting.

When integrating AOI with decision tree induction, generalization to a very low (specific) concept level can result in quite large and bushy trees. Generalization to a very high concept level can result in decision trees of little use, where interesting and important subconcepts are lost due to overgeneralization. Instead, generalization should be to some intermediate concept level, set by a domain expert or controlled by a user-specified threshold. Hence, the use of AOI may result in classification trees that are more understandable, smaller, and therefore easier to interpret than trees obtained from methods operating on ungeneralized (larger) sets of low-level data (such as SLIQ or SPRINT).

A criticism of typical decision tree generation is that, because of the recursive partitioning, some resulting data subsets may become so small that partitioning them further would have no statistically significant basis. The maximum size of such "insignificant" data subsets can be statistically determined. To deal with this problem, an exception threshold may be introduced.
If the portion of samples in a given subset is less than the threshold, further partitioning of the subset is halted. Instead, a leaf node is created which stores the subset and class distribution of the subset samples.

Owing to the large amount and wide diversity of data in large databases, it may not be reasonable to assume that each leaf node will contain samples belonging to a common class. This problem may be addressed by employing a precision or classification threshold. Further partitioning of the data subset at a given node is terminated if the percentage of samples belonging to any given class at that node exceeds this threshold.

A data mining query language may be used to specify and facilitate the enhanced decision tree induction method. Suppose that the data mining task is to predict the credit risk of customers aged 30-40, based on their income and occupation. This may be specified as the following data mining query:

    mine classification
    analyze credit_risk
    in relevance to income, occupation
    from Customer_db
    where (age >= 30) and (age < 40)
    display as rules

The above query, expressed in DMQL (see footnote 1), executes a relational query on Customer_db to retrieve the task-relevant data. Tuples not satisfying the where clause are ignored, and only the data concerning the attributes specified in the in relevance to clause, and the class label attribute (credit_risk) are collected. AOI is then performed on this data. Since the query has not specified which concept hierarchies to employ, default hierarchies are used. A graphical user interface may be designed to facilitate user specification of data mining tasks via such a data mining query language. In this way, the user can help guide the automated data mining process.

Hence, many ideas from data warehousing can be integrated with classification algorithms, such as decision tree induction, in order to facilitate data mining. Attribute-oriented induction employs concept hierarchies to generalize data to multiple abstraction levels, and can be integrated with classification methods in order to perform multilevel mining. Data can be stored in multidimensional data cubes to allow quick access to aggregate data values. Finally, a data mining query language can be used to assist users in interactive data mining.
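The exception threshold and the classification threshold described earlier in this section can be read as two stopping tests applied at each node. The sketch below is one possible interpretation: the function name is hypothetical, and "portion of samples" is assumed to mean the subset size relative to the whole training set.

```python
from collections import Counter


def stop_partitioning(labels, total_samples, exception_threshold,
                      classification_threshold):
    """Return True if the subset at a node should become a leaf.

    labels: class labels of the samples in the subset at this node.
    total_samples: size of the whole training set (assumed reference).
    """
    if not labels:
        return True
    # Exception threshold: the subset is too small a portion of the data.
    if len(labels) / total_samples < exception_threshold:
        return True
    # Classification threshold: one class already dominates the subset.
    majority = Counter(labels).most_common(1)[0][1]
    return majority / len(labels) > classification_threshold


# Hypothetical subset: 18 low-risk and 2 high-risk customers out of 1000.
labels = ["low"] * 18 + ["high"] * 2
print(stop_partitioning(labels, 1000, 0.01, 0.85))   # 90% "low" exceeds 85%
print(stop_partitioning(labels, 1000, 0.05, 0.95))   # 2% of data is below 5%
print(stop_partitioning(labels, 1000, 0.01, 0.95))   # neither test fires
```

A leaf created by either test would store the class distribution of its subset rather than a single pure class.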
7.4 Bayesian classification

"What are Bayesian classifiers?"

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

Bayesian classification is based on Bayes theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved, and in this sense, is considered "naive". Bayesian belief networks are graphical models, which unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

Section 7.4.1 reviews basic probability notation and Bayes theorem. You will then learn naive Bayesian classification in Section 7.4.2. Bayesian belief networks are described in Section 7.4.3.
7.4.1 Bayes theorem

Let $X$ be a data sample whose class label is unknown. Let $H$ be some hypothesis, such as that the data sample $X$ belongs to a specified class $C$. For classification problems, we want to determine $P(H|X)$, the probability that the hypothesis $H$ holds given the observed data sample $X$.

$P(H|X)$ is the posterior probability, or a posteriori probability, of $H$ conditioned on $X$. For example, suppose the world of data samples consists of fruits, described by their color and shape. Suppose that $X$ is red and round, and that $H$ is the hypothesis that $X$ is an apple. Then $P(H|X)$ reflects our confidence that $X$ is an apple given that we have seen that $X$ is red and round. In contrast, $P(H)$ is the prior probability, or a priori probability, of $H$. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability, $P(H|X)$, is based on more information (such as background knowledge) than the prior probability, $P(H)$, which is independent of $X$.

Similarly, $P(X|H)$ is the posterior probability of $X$ conditioned on $H$. That is, it is the probability that $X$ is red and round given that we know that it is true that $X$ is an apple. $P(X)$ is the prior probability of $X$. Using our example, it is the probability that a data sample from our set of fruits is red and round.

"How are these probabilities estimated?" $P(X)$, $P(H)$, and $P(X|H)$ may be estimated from the given data, as we shall see below. Bayes theorem is useful in that it provides a way of calculating the posterior probability, $P(H|X)$, from $P(H)$, $P(X)$, and $P(X|H)$. Bayes theorem is:

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)} \qquad (7.4)$$

In the next section, you will learn how Bayes theorem is used in the naive Bayesian classifier.
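As a small numeric illustration of Bayes theorem for the fruit example, the sketch below computes a posterior from a prior, a likelihood, and the probability of the evidence. The three probability values are invented for the illustration and do not come from any data set.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence


# Hypothetical values for the fruit example.
p_apple = 0.2                    # P(H): a fruit is an apple
p_red_round_given_apple = 0.9    # P(X|H): an apple is red and round
p_red_round = 0.3                # P(X): a fruit is red and round

print(posterior(p_red_round_given_apple, p_apple, p_red_round))
```

With these assumed numbers, seeing that the fruit is red and round raises the probability of "apple" from the prior 0.2 to roughly 0.6.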
1 The use of a data mining query language to specify data mining queries is discussed in Chapter 4, using the SQL-based DMQL language.
7.4.2 Naive Bayesian classification

The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an $n$-dimensional feature vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the sample from $n$ attributes, respectively $A_1, A_2, \ldots, A_n$.

2. Suppose that there are $m$ classes, $C_1, C_2, \ldots, C_m$. Given an unknown data sample, $X$ (i.e., having no class label), the classifier will predict that $X$ belongs to the class having the highest posterior probability, conditioned on $X$. That is, the naive Bayesian classifier assigns an unknown sample $X$ to the class $C_i$ if and only if

$$P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i.$$

Thus we maximize $P(C_i|X)$. The class $C_i$ for which $P(C_i|X)$ is maximized is called the maximum posteriori hypothesis. By Bayes theorem (Equation 7.4),

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)} \qquad (7.5)$$

3. As $P(X)$ is constant for all classes, only $P(X|C_i)P(C_i)$ need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., $P(C_1) = P(C_2) = \ldots = P(C_m)$, and we would therefore maximize $P(X|C_i)$. Otherwise, we maximize $P(X|C_i)P(C_i)$. Note that the class prior probabilities may be estimated by $P(C_i) = s_i / s$, where $s_i$ is the number of training samples of class $C_i$, and $s$ is the total number of training samples.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute $P(X|C_i)$. In order to reduce computation in evaluating $P(X|C_i)$, the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample, i.e., that there are no dependence relationships among the attributes. Thus,

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) \qquad (7.6)$$

The probabilities $P(x_1|C_i), P(x_2|C_i), \ldots, P(x_n|C_i)$ can be estimated from the training samples, where:

(a) If $A_k$ is categorical, then $P(x_k|C_i) = s_{ik} / s_i$, where $s_{ik}$ is the number of training samples of class $C_i$ having the value $x_k$ for $A_k$, and $s_i$ is the number of training samples belonging to $C_i$.

(b) If $A_k$ is continuous-valued, then the attribute is assumed to have a Gaussian distribution. Therefore,

$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}}\, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}} \qquad (7.7)$$

where $g(x_k, \mu_{C_i}, \sigma_{C_i})$ is the Gaussian (normal) density function for attribute $A_k$, while $\mu_{C_i}$ and $\sigma_{C_i}$ are the mean and standard deviation, respectively, given the values for attribute $A_k$ for training samples of class $C_i$.

5. In order to classify an unknown sample $X$, $P(X|C_i)P(C_i)$ is evaluated for each class $C_i$. Sample $X$ is then assigned to the class $C_i$ if and only if

$$P(X|C_i)P(C_i) > P(X|C_j)P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i.$$

In other words, it is assigned to the class, $C_i$, for which $P(X|C_i)P(C_i)$ is the maximum.
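For a continuous-valued attribute, the estimate in step 4(b) amounts to fitting a mean and a spread to the values seen in one class and evaluating the Gaussian density. The sketch below assumes that the spread parameter is the standard deviation and uses the population form (statistics.pstdev); the ages are hypothetical.

```python
import math
import statistics


def gaussian(x, mean, std):
    """Gaussian (normal) density g(x, mean, std), as in Equation 7.7."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


# Hypothetical ages of the training samples belonging to one class C_i.
ages_in_class = [25, 30, 35, 40, 45]
mu = statistics.mean(ages_in_class)
sigma = statistics.pstdev(ages_in_class)   # population standard deviation

# Density used as the estimate of P(age = 35 | C_i) and P(age = 45 | C_i).
print(gaussian(35, mu, sigma))
print(gaussian(45, mu, sigma))
```

The returned value is a density, not a probability, but since only the comparison across classes matters in step 5, it can be used directly in the product of Equation 7.6.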
"How effective are Bayesian classifiers?"

In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. In practice, however, this is not always the case, owing to inaccuracies in the assumptions made for their use, such as class conditional independence, and the lack of available probability data. Nevertheless, various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains.

Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers which do not explicitly use Bayes theorem. For example, under certain assumptions, it can be shown that many neural network and curve fitting algorithms output the maximum posteriori hypothesis, as does the naive Bayesian classifier.

Example 7.4 Predicting a class label using naive Bayesian classification. We wish to predict the class label of an unknown sample using naive Bayesian classification, given the same training data as in Example 7.2 for decision tree induction. The training data are in Table 7.1. The data samples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely {yes, no}). Let $C_1$ correspond to the class buys_computer = yes and $C_2$ correspond to buys_computer = no. The unknown sample we wish to classify is

    X = (age = "<30", income = medium, student = yes, credit_rating = fair).
We need to maximize $P(X|C_i)P(C_i)$, for $i = 1, 2$. $P(C_i)$, the prior probability of each class, can be computed based on the training samples:

    P(buys_computer = yes) = 9/14 = 0.643
    P(buys_computer = no)  = 5/14 = 0.357

To compute $P(X|C_i)$, for $i = 1, 2$, we compute the following conditional probabilities:

    P(age = "<30" | buys_computer = yes)           = 2/9 = 0.222
    P(age = "<30" | buys_computer = no)            = 3/5 = 0.600
    P(income = medium | buys_computer = yes)       = 4/9 = 0.444
    P(income = medium | buys_computer = no)        = 2/5 = 0.400
    P(student = yes | buys_computer = yes)         = 6/9 = 0.667
    P(student = yes | buys_computer = no)          = 1/5 = 0.200
    P(credit_rating = fair | buys_computer = yes)  = 6/9 = 0.667
    P(credit_rating = fair | buys_computer = no)   = 2/5 = 0.400

Using the above probabilities, we obtain

    P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
    P(X | buys_computer = no)  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
    P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 x 0.643 = 0.028
    P(X | buys_computer = no) P(buys_computer = no)   = 0.019 x 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts "buys_computer = yes" for sample X.
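The arithmetic of Example 7.4 can be checked with a few lines of code. The sketch takes the priors and conditional probabilities exactly as the fractions listed in the example; the dictionary layout and variable names are choices made for this illustration.

```python
from fractions import Fraction as F
from math import prod

# Class priors P(C_i) from the example.
priors = {"yes": F(9, 14), "no": F(5, 14)}

# P(x_k | C_i) for X = (age "<30", income medium, student yes, credit_rating fair).
cond = {
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],
    "no": [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}

# Equation 7.6 followed by multiplication with the prior.
scores = {c: prod(cond[c]) * priors[c] for c in priors}
prediction = max(scores, key=scores.get)

for c, score in scores.items():
    print(c, round(float(score), 3))
print("predicted buys_computer =", prediction)
```

Exact fractions are used so that rounding happens only when printing; the rounded scores agree with the 0.028 and 0.007 obtained in the example.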
7.4.3 Bayesian belief networks

The naive Bayesian classifier makes the assumption of class conditional independence, i.e., that given the class label of a sample, the values of the attributes are conditionally independent of one another. This assumption simplifies computation. When the assumption holds true, then the naive Bayesian classifier is the most accurate in comparison with all other classifiers. In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint conditional probability distributions. They allow class conditional independencies to be defined between subsets of variables. They provide a graphical model of causal relationships, on which learning can be performed. These networks are also known as belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to them as belief networks.
[Figure 7.8(a) diagram: a network over the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.]

Figure 7.8(b), conditional probability table for LungCancer:

            FH, S    FH, ~S    ~FH, S    ~FH, ~S
    LC      0.8      0.5       0.7       0.1
    ~LC     0.2      0.5       0.3       0.9

Figure 7.8: (a) A simple Bayesian belief network; (b) The conditional probability table for the values of the variable LungCancer (LC) showing each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).

A belief network is defined by two components. The first is a directed acyclic graph, where each node represents a random variable, and each arc represents a probabilistic dependence. If an arc is drawn from a node $Y$ to a node $Z$, then $Y$ is a parent or immediate predecessor of $Z$, and $Z$ is a descendant of $Y$. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may be discrete or continuous-valued. They may correspond to actual attributes given in the data, or to "hidden variables" believed to form a relationship (such as medical syndromes in the case of medical data).

Figure 7.8(a) shows a simple belief network, adapted from Russell et al. 1995a, for six Boolean variables. The arcs allow a representation of causal knowledge. For example, having lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. Furthermore, the arcs also show that the variable LungCancer is conditionally independent of Emphysema, given its parents, FamilyHistory and Smoker. This means that once the values of FamilyHistory and Smoker are known, then the variable Emphysema does not provide any additional information regarding LungCancer.

The second component defining a belief network consists of one conditional probability table (CPT) for each variable. The CPT for a variable $Z$ specifies the conditional distribution $P(Z|Parents(Z))$, where $Parents(Z)$ are the parents of $Z$. Figure 7.8(b) shows a CPT for LungCancer. The conditional probability for each value of LungCancer is given for each possible combination of values of its parents. For instance, from the upper leftmost and bottom rightmost entries, respectively, we see that

    P(LungCancer = Yes | FamilyHistory = Yes, Smoker = Yes) = 0.8, and
    P(LungCancer = No | FamilyHistory = No, Smoker = No) = 0.9.

The joint probability of any tuple $(z_1, \ldots, z_n)$ corresponding to the variables or attributes $Z_1, \ldots, Z_n$ is computed by

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i | Parents(Z_i)) \qquad (7.8)$$

where the values for $P(z_i | Parents(Z_i))$ correspond to the entries in the CPT for $Z_i$.

A node within the network can be selected as an "output" node, representing a class label attribute. There may be more than one output node. Inference algorithms for learning can be applied on the network. The classification process, rather than returning a single class label, can return a probability distribution for the class label attribute, i.e., predicting the probability of each class.
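Equation 7.8 can be illustrated on the three-node fragment FamilyHistory, Smoker, LungCancer. The CPT for LungCancer is taken from Figure 7.8(b); the prior probabilities for FamilyHistory and Smoker are not given in the text and are invented here purely for illustration, as are the helper names.

```python
# CPT for LungCancer from Figure 7.8(b): P(LC = True | FH, S).
p_lc = {(True, True): 0.8, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}

# Hypothetical priors for the parentless nodes (NOT given in the text).
p_fh = 0.1   # assumed P(FamilyHistory = True)
p_s = 0.3    # assumed P(Smoker = True)


def bernoulli(p_true, value):
    """Probability of a Boolean value, given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(fh, s, lc):
    """P(fh, s, lc) as the product of P(z_i | Parents(Z_i)), Equation 7.8."""
    return (bernoulli(p_fh, fh) * bernoulli(p_s, s)
            * bernoulli(p_lc[(fh, s)], lc))


print(joint(True, True, True))   # 0.1 * 0.3 * 0.8

# The joint distribution sums to 1 over all value combinations.
total = sum(joint(a, b, c)
            for a in (True, False) for b in (True, False) for c in (True, False))
print(total)
```

Only one CPT lookup per variable is needed for each tuple, which is what makes the factored representation compact compared with a full joint table.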
7.4.4 Training Bayesian belief networks

"How does a Bayesian belief network learn?"

In learning or training a belief network, a number of scenarios are possible. The network structure may be given in advance, or inferred from the data. The network variables may be observable or hidden in all or some of the training samples. The case of hidden data is also referred to as missing values or incomplete data.

If the network structure is known and the variables are observable, then learning the network is straightforward. It consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naive Bayesian classification.

When the network structure is given and some of the variables are hidden, then a method of gradient descent can be used to train the belief network. The object is to learn the values for the CPT entries. Let $S$ be a set of $s$ training samples, $X_1, X_2, \ldots, X_s$. Let $w_{ijk}$ be a CPT entry for the variable $Y_i = y_{ij}$ having the parents $U_i = u_{ik}$. For example, if $w_{ijk}$ is the upper leftmost CPT entry of Figure 7.8(b), then $Y_i$ is LungCancer; $y_{ij}$ is its value, Yes; $U_i$ lists the parent nodes of $Y_i$, namely {FamilyHistory, Smoker}; and $u_{ik}$ lists the values of the parent nodes, namely {Yes, Yes}. The $w_{ijk}$ are viewed as weights, analogous to the weights in hidden units of neural networks (Section 7.5). The weights, $w_{ijk}$, are initialized to random probability values. The gradient descent strategy performs greedy hill-climbing. At each iteration, the weights are updated, and will eventually converge to a local optimum solution.

The method aims to maximize $P(S|H)$. This is done by following the gradient of $\ln P(S|H)$, which makes the problem simpler. Given the network structure and initialized $w_{ijk}$, the algorithm proceeds as follows.

1. Compute the gradients: For each $i, j, k$, compute

$$\frac{\partial \ln P(S|H)}{\partial w_{ijk}} = \sum_{d=1}^{s} \frac{P(Y_i = y_{ij}, U_i = u_{ik} \mid X_d)}{w_{ijk}} \qquad (7.9)$$

The probability in the right-hand side of Equation 7.9 is to be calculated for each training sample $X_d$ in $S$. For brevity, let's refer to this probability simply as $p$. When the variables represented by $Y_i$ and $U_i$ are hidden for some $X_d$, then the corresponding probability $p$ can be computed from the observed variables of the sample using standard algorithms for Bayesian network inference (such as those available in the commercial software package, Hugin).

2. Take a small step in the direction of the gradient: The weights are updated by

$$w_{ijk} \leftarrow w_{ijk} + (l)\,\frac{\partial \ln P(S|H)}{\partial w_{ijk}} \qquad (7.10)$$

where $l$ is the learning rate representing the step size, and $\frac{\partial \ln P(S|H)}{\partial w_{ijk}}$ is computed from Equation 7.9. The learning rate is set to a small constant.

3. Renormalize the weights: Because the weights $w_{ijk}$ are probability values, they must be between 0 and 1.0, and $\sum_j w_{ijk}$ must equal 1 for all $i, k$. These criteria are achieved by renormalizing the weights after they have been updated by Equation 7.10.
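One iteration of the three steps above, restricted to a single variable $Y_i$ and a single parent configuration $u_{ik}$, might look like the sketch below. The function name and the posterior values are hypothetical; in particular, the per-sample probabilities $P(Y_i = y_{ij}, U_i = u_{ik} | X_d)$ are assumed to be supplied by a separate inference routine that is not shown.

```python
def update_cpt_column(weights, posteriors, learning_rate):
    """One gradient step for the CPT entries w_ijk with i and k fixed.

    weights[j]: current w_ijk = P(Y_i = y_ij | U_i = u_ik).
    posteriors[j][d]: P(Y_i = y_ij, U_i = u_ik | X_d) for training sample X_d
        (assumed to come from a Bayesian network inference routine).
    """
    # Step 1 (Equation 7.9): gradient of ln P(S|H) for each w_ijk.
    grads = [sum(p_d) / w for p_d, w in zip(posteriors, weights)]
    # Step 2 (Equation 7.10): small step in the direction of the gradient.
    stepped = [w + learning_rate * g for w, g in zip(weights, grads)]
    # Step 3: renormalize so that the entries sum to 1 over j.
    total = sum(stepped)
    return [w / total for w in stepped]


weights = [0.5, 0.5]                                # initial w_ijk for j = 0, 1
posteriors = [[0.6, 0.7, 0.2], [0.1, 0.1, 0.3]]     # hypothetical, 3 samples
new_weights = update_cpt_column(weights, posteriors, learning_rate=0.01)
print(new_weights)
```

The value $y_{i0}$, which is better supported by the samples, gains weight after the step, while the renormalization keeps the column a valid probability distribution.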
Several algorithms exist for learning the network structure from the training data given observable variables. The problem is one of discrete optimization. For solutions, please see the bibliographic notes at the end of this chapter.
7.5 Classification by backpropagation

"What is backpropagation?"

Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units.

[Figure 7.9 diagram: an input layer with inputs x_1, x_2, ..., x_i, a hidden layer with outputs O_j, and an output layer with outputs O_k, connected by the weights w_ij and w_kj.]

Figure 7.9: A multilayer feed-forward neural network: A training sample, $X = (x_1, x_2, \ldots, x_i)$, is fed to the input layer. Weighted connections exist between each layer, where $w_{ij}$ denotes the weight from a unit $j$ in one layer to a unit $i$ in the previous layer.

Neural networks involve long training times, and are therefore more suitable for applications where this is feasible. They require a number of parameters which are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.

Advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute towards the usefulness of neural networks for classification in data mining.

The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980's. In Section 7.5.1 you will learn about multilayer feed-forward networks, the type of neural network on which the backpropagation algorithm performs. Section 7.5.2 discusses defining a network topology. The backpropagation algorithm is described in Section 7.5.3. Rule extraction from trained neural networks is discussed in Section 7.5.4.
7.5.1 A multilayer feed-forward neural network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. An example of such a network is shown in Figure 7.9. The inputs correspond to the attributes measured for each training sample. The inputs are fed simultaneously into a layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of "neuron-like" units, known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given samples.

The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer neural network shown in Figure 7.9 has two layers of output units. Therefore, we say that it is a two-layer neural network. Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights cycle back to an input unit or to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next forward layer.

Multilayer feed-forward networks of linear threshold functions, given enough hidden units, can closely approximate any function.
7.5.2 Defining a network topology

"How can I design the topology of the neural network?" Before training can begin, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.

Normalizing the input values for each attribute measured in the training samples will help speed up the learning phase. Typically, input values are normalized so as to fall between 0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value. For example, if the domain of an attribute A is $\{a_0, a_1, a_2\}$, then we may assign three input units to represent A. That is, we may have, say, $I_0, I_1, I_2$ as input units. Each unit is initialized to 0. If $A = a_0$, then $I_0$ is set to 1. If $A = a_1$, then $I_1$ is set to 1, and so on. One output unit may be used to represent two classes (where the value 1 represents one class, and the value 0 represents the other). If there are more than two classes, then one output unit per class is used.

There are no clear rules as to the "best" number of hidden layer units. Network design is a trial-and-error process and may affect the accuracy of the resulting trained network. The initial values of the weights may also affect the resulting accuracy. If a network has been trained and its accuracy is not considered acceptable, it is common to repeat the training process with a different network topology or a different set of initial weights.
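The input encoding described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names `min_max_normalize` and `one_hot` and the sample attribute values are assumptions made here, and linear min-max rescaling is only one way to bring inputs into the range 0 to 1.0.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that they fall between 0.0 and 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, domain):
    """One input unit per domain value; only the unit matching `value` is set to 1."""
    if value not in domain:
        raise ValueError(f"{value!r} is not in the attribute domain {domain!r}")
    return [1 if value == d else 0 for d in domain]


ages = [20, 35, 50]                       # hypothetical numeric attribute
print(min_max_normalize(ages))            # [0.0, 0.5, 1.0]
print(one_hot("a1", ["a0", "a1", "a2"]))  # [0, 1, 0]
```

The second call mirrors the example in the text: with A = a1, only the unit I1 is set to 1.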
7.5.3 Backpropagation
"How does backpropagation work?" Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, i.e., from the output layer, through each hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized in Figure 7.10. Each step is described below.

Initialize the weights. The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.

Each training sample, X, is processed by the following steps.

Propagate the inputs forward. In this step, the net input and output of each unit in the hidden and output layers are computed. First, the training sample is fed to the input layer of the network. The net input to each unit in the hidden and output layers is then computed as a linear combination of its inputs. To help illustrate this, a hidden layer or output layer unit is shown in Figure 7.11. The inputs to the unit are, in fact, the outputs of the units connected to it in the previous layer. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and the products are summed. Given a unit j in a hidden or output layer, the net input, $I_j$, to unit j is:

$$I_j = \sum_i w_{ij} O_i + \theta_j \tag{7.11}$$

where $w_{ij}$ is the weight of the connection from unit i in the previous layer to unit j; $O_i$ is the output of unit i from the previous layer; and $\theta_j$ is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input, and then applies an activation function to it, as illustrated in Figure 7.11. The function symbolizes the activation of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the net input $I_j$ to unit j, then $O_j$, the output of unit j, is computed as:

$$O_j = \frac{1}{1 + e^{-I_j}} \tag{7.12}$$
This function is also referred to as a squashing function, since it maps a large input domain onto the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.

Algorithm 7.5.1 (Backpropagation) Neural network learning for classification, using the backpropagation algorithm.
Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.
Output: A neural network trained to classify the samples.
Method:
    Initialize all weights and biases in network;
    while terminating condition is not satisfied {
        for each training sample X in samples {
            // Propagate the inputs forward:
            for each hidden or output layer unit j
                I_j = Σ_i w_ij O_i + θ_j;              // compute the net input of unit j
            for each hidden or output layer unit j
                O_j = 1 / (1 + e^(-I_j));              // compute the output of each unit j
            // Backpropagate the errors:
            for each unit j in the output layer
                Err_j = O_j (1 - O_j)(T_j - O_j);      // compute the error
            for each unit j in the hidden layers
                Err_j = O_j (1 - O_j) Σ_k Err_k w_jk;  // compute the error
            for each weight w_ij in network {
                Δw_ij = (l) Err_j O_i;                 // weight increment
                w_ij = w_ij + Δw_ij; }                 // weight update
            for each bias θ_j in network {
                Δθ_j = (l) Err_j;                      // bias increment
                θ_j = θ_j + Δθ_j; }                    // bias update
        }}

Figure 7.10: Backpropagation algorithm.

Backpropagate the error. The error is propagated backwards by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error $Err_j$ is computed by:

$$Err_j = O_j (1 - O_j)(T_j - O_j) \tag{7.13}$$
where $O_j$ is the actual output of unit j, and $T_j$ is the true output, based on the known class label of the given training sample. Note that $O_j (1 - O_j)$ is the derivative of the logistic function.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is:

$$Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} \tag{7.14}$$

where $w_{jk}$ is the weight of the connection from unit j to a unit k in the next higher layer, and $Err_k$ is the error of unit k.

The weights and biases are updated to reflect the propagated errors. Weights are updated by Equations 7.15 and 7.16 below, where $\Delta w_{ij}$ is the change in weight $w_{ij}$.

$$\Delta w_{ij} = (l)\, Err_j\, O_i \tag{7.15}$$

$$w_{ij} = w_{ij} + \Delta w_{ij} \tag{7.16}$$

Figure 7.11: A hidden or output layer unit (the input vector X with components $x_0, x_1, \ldots, x_n$; the weights $w_0, w_1, \ldots, w_n$; the bias; the weighted sum; and the activation function f producing the output). The inputs are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias associated with the unit. A nonlinear activation function is applied to the net input.
"What is the 'l' in Equation 7.15?" The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0. Backpropagation learns using a method of gradient descent to search for a set of weights which can model the given classification problem so as to minimize the mean squared distance between the network's class predictions and the actual class labels of the samples. The learning rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution), and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.

Biases are updated by Equations 7.17 and 7.18 below, where $\Delta\theta_j$ is the change in bias $\theta_j$.

$$\Delta\theta_j = (l)\, Err_j \tag{7.17}$$

$$\theta_j = \theta_j + \Delta\theta_j \tag{7.18}$$

Note that here we are updating the weights and biases after the presentation of each sample. This is referred to as case updating. Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the samples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice, case updating is more common since it tends to yield more accurate results.

Terminating condition. Training stops when either
1. all $\Delta w_{ij}$ in the previous epoch were so small as to be below some specified threshold, or
2. the percentage of samples misclassified in the previous epoch is below some threshold, or
3. a prespecified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights will converge.
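The steps above can be sketched in plain Python for a network with one hidden layer, using case updating. This is an illustrative sketch under stated assumptions rather than a reference implementation: the names `backprop_step` and `train`, the list-of-lists weight layout, and the simplified stopping test are choices made here. The usage lines take the initial values of Table 7.3 with a target output of 1 and a learning rate of 0.9, which are the values the calculations of Example 7.5 appear to use.

```python
import math
import random


def sigmoid(x):
    """Logistic (squashing) function of Equation 7.12."""
    return 1.0 / (1.0 + math.exp(-x))


def backprop_step(x, target, w_h, b_h, w_o, b_o, rate):
    """One case update. w_h[i][j]: input unit i to hidden unit j;
    w_o[j][k]: hidden unit j to output unit k. Lists are modified in place.
    Returns the outputs computed before the update."""
    n_in, n_hid, n_out = len(x), len(b_h), len(b_o)
    # Propagate the inputs forward (Equations 7.11 and 7.12).
    o_h = [sigmoid(sum(x[i] * w_h[i][j] for i in range(n_in)) + b_h[j])
           for j in range(n_hid)]
    o_o = [sigmoid(sum(o_h[j] * w_o[j][k] for j in range(n_hid)) + b_o[k])
           for k in range(n_out)]
    # Backpropagate the errors (Equations 7.13 and 7.14), using the old weights.
    err_o = [o_o[k] * (1 - o_o[k]) * (target[k] - o_o[k]) for k in range(n_out)]
    err_h = [o_h[j] * (1 - o_h[j]) * sum(err_o[k] * w_o[j][k] for k in range(n_out))
             for j in range(n_hid)]
    # Update weights and biases (Equations 7.15 to 7.18).
    for j in range(n_hid):
        for k in range(n_out):
            w_o[j][k] += rate * err_o[k] * o_h[j]
        for i in range(n_in):
            w_h[i][j] += rate * err_h[j] * x[i]
        b_h[j] += rate * err_h[j]
    for k in range(n_out):
        b_o[k] += rate * err_o[k]
    return o_o


def train(samples, targets, n_hidden, rate=0.5, max_epochs=1000, seed=0):
    """Case updating. Stops when no sample was misclassified during an epoch
    (condition 2 with a 0% threshold, measured before each update) or after
    max_epochs epochs (condition 3)."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0]), len(targets[0])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    b_o = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    for epoch in range(1, max_epochs + 1):
        misclassified = 0
        for x, t in zip(samples, targets):
            out = backprop_step(x, t, w_h, b_h, w_o, b_o, rate)
            misclassified += [round(o) for o in out] != list(t)
        if misclassified == 0:
            break
    return (w_h, b_h, w_o, b_o), epoch


# Usage with the initial values of Table 7.3 (hidden units 4 and 5, output unit 6).
w_h = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]]
b_h = [-0.4, 0.2]
w_o = [[-0.3], [-0.2]]
b_o = [0.1]
out = backprop_step([1, 0, 1], [1], w_h, b_h, w_o, b_o, rate=0.9)
print(round(out[0], 3), round(w_o[0][0], 3), round(b_o[0], 3))  # 0.474 -0.261 0.218
```

Epoch updating would instead accumulate the increments inside the loop over samples and apply them once per epoch.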
Figure 7.12: An example of a multilayer feed-forward neural network (input units 1, 2, and 3 receive $x_1$, $x_2$, and $x_3$; units 4 and 5 form the hidden layer; unit 6 is the output unit; the connections carry the weights $w_{14}, w_{15}, w_{24}, w_{25}, w_{34}, w_{35}, w_{46}, w_{56}$).

Example 7.5 Sample calculations for learning by the backpropagation algorithm. Figure 7.12 shows a multilayer feed-forward neural network. The initial weight and bias values of the network are given in Table 7.3, along with the first training sample, X = (1, 0, 1).

x1  x2  x3  w14   w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2   -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Table 7.3: Initial input, weight, and bias values.

This example shows the calculations for backpropagation, given the first training sample, X. The sample is fed into the network, and the net input and output of each unit are computed. These values are shown in Table 7.4.

Unit j   Net input, I_j                                 Output, O_j
4        0.2 + 0 - 0.5 - 0.4 = -0.7                     1/(1 + e^(0.7)) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                     1/(1 + e^(-0.1)) = 0.525
6        (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105    1/(1 + e^(0.105)) = 0.474

Table 7.4: The net input and output calculations.

The error of each unit is computed and propagated backwards. The error values are shown in Table 7.5. The weight and bias updates are shown in Table 7.6. Both tables use a target output of 1 for X and a learning rate of 0.9.

Unit j   Err_j
6        (0.474)(1 - 0.474)(1 - 0.474) = 0.1311
5        (0.525)(1 - 0.525)(0.1311)(-0.2) = -0.0065
4        (0.332)(1 - 0.332)(0.1311)(-0.3) = -0.0087

Table 7.5: Calculation of the error at each node.

Weight or bias   New value
w46              -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56              -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14              0.2 + (0.9)(-0.0087)(1) = 0.192
w15              -0.3 + (0.9)(-0.0065)(1) = -0.306
w24              0.4 + (0.9)(-0.0087)(0) = 0.4
w25              0.1 + (0.9)(-0.0065)(0) = 0.1
w34              -0.5 + (0.9)(-0.0087)(1) = -0.508
w35              0.2 + (0.9)(-0.0065)(1) = 0.194
θ6               0.1 + (0.9)(0.1311) = 0.218
θ5               0.2 + (0.9)(-0.0065) = 0.194
θ4               -0.4 + (0.9)(-0.0087) = -0.408

Table 7.6: Calculations for weight and bias updating. □

Several variations and alternatives to the backpropagation algorithm have been proposed for classification in neural networks. These may involve the dynamic adjustment of the network topology, and of the learning rate or other parameters, or the use of different error functions.

7.5.4 Backpropagation and interpretability

"How can I 'understand' what the backpropagation network has learned?" A major disadvantage of neural networks lies in their knowledge representation. Acquired knowledge in the form of a network of units connected by weighted links is difficult for humans to interpret. This factor has motivated research in extracting the knowledge embedded in trained neural networks and in representing that knowledge symbolically. Methods include extracting rules from networks and sensitivity analysis.

Various algorithms for the extraction of rules have been proposed. The methods typically impose restrictions regarding procedures used in training the given neural network, the network topology, and the discretization of input values.

Fully connected networks are difficult to articulate. Hence, often, the first step towards extracting rules from neural networks is network pruning. This consists of removing weighted links that do not result in a decrease in the classification accuracy of the given network.

Once the trained network has been pruned, some approaches will then perform link, unit, or activation value clustering. In one method, for example, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network (Figure 7.13). The combinations of these activation values for each hidden unit are analyzed. Rules are derived relating combinations of activation values with corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input and hidden unit layers. Finally, the two sets of rules may be combined to form IF-THEN rules. Other algorithms may derive rules of other forms, including M-of-N rules (where M out of a given N conditions in the rule antecedent must be true in order for the rule consequent to be applied), decision trees with M-of-N tests, fuzzy rules, and finite automata.

Sensitivity analysis is used to assess the impact that a given input variable has on a network output. The input to the variable is varied while the remaining input variables are fixed at some value. Meanwhile, changes in the network output are monitored. The knowledge gained from this form of analysis can be represented in rules such as "IF X decreases 5% THEN Y increases 8%".
Figure 7.13: Rules can be extracted from trained neural networks. (The network has input nodes I_1 to I_7, hidden nodes H_1 to H_3, and output nodes O_1 and O_2.)

    Identify sets of common activation values for each hidden node, H_i:
        for H_1: (-1, 0, 1)
        for H_2: (0, 1)
        for H_3: (-1, 0.24, 1)
    Derive rules relating common activation values with output nodes, O_j:
        IF (a_2 = 0 AND a_3 = -1) OR
           (a_1 = -1 AND a_2 = 1 AND a_3 = -1) OR
           (a_1 = -1 AND a_2 = 0 AND a_3 = 0.24)
        THEN O_1 = 1, O_2 = 0
        ELSE O_1 = 0, O_2 = 1
    Derive rules relating input nodes, I_i, to hidden node activation values, a_i:
        IF (I_2 = 0 AND I_7 = 0) THEN a_2 = 0
        IF (I_4 = 1 AND I_6 = 1) THEN a_3 = -1
        IF (I_5 = 0) THEN a_3 = -1
        ...
    Obtain rules relating inputs and output classes:
        IF (I_2 = 0 AND I_7 = 0 AND I_4 = 1 AND I_6 = 1) THEN class = 1
        IF (I_2 = 0 AND I_7 = 0 AND I_5 = 0) THEN class = 1

7.6 Association-based classification

"Can association rule mining be used for classification?" Association rule mining is an important and highly active area of data mining research. Chapter 6 of this book described many algorithms for association rule mining. Recently, data mining techniques have been developed which apply association rule mining to the problem of classification. In this section, we study such association-based classification.

"How does associative classification work?" One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.

Let D be the training data, and Y be the set of all classes in D. The algorithm maps categorical attributes to consecutive positive integers. Continuous attributes are discretized and mapped accordingly. Each data sample d in D then is represented by a set of (attribute, integer-value) pairs called items, and a class label y. Let I be the set of all items in D. A class association rule (CAR) is of the form $condset \Rightarrow y$, where condset is a set of items ($condset \subseteq I$) and $y \in Y$. Such rules can be represented by ruleitems of the form <condset, y>.

A CAR has confidence c if c% of the samples in D that contain condset belong to class y. A CAR has support s if s% of the samples in D contain condset and belong to class y. The support count of a condset (condsupCount) is the number of samples in D that contain the condset. The rule count of a ruleitem (rulesupCount) is the number of samples in D that contain the condset and are labeled with class y. Ruleitems that satisfy minimum support are frequent ruleitems. If a set of ruleitems has the same condset, then the rule with the highest confidence is selected as the possible rule (PR) to represent the set. A rule satisfying minimum confidence is called accurate.

The first step of the associative classification method finds the set of all PRs that are both frequent and accurate. These are the class association rules (CARs). A ruleitem whose condset contains k items is a k-ruleitem. The algorithm employs an iterative approach, similar to that described for Apriori in Section 5.2.1, where ruleitems are processed rather than itemsets. The algorithm scans the database, searching for the frequent k-ruleitems, for k = 1, 2, ..., until all frequent k-ruleitems have been found. One scan is made for each value of k. The k-ruleitems are used to explore (k+1)-ruleitems. In the first scan of the database, the support count of 1-ruleitems is determined, and the frequent 1-ruleitems are retained. The frequent 1-ruleitems, referred to as the set $F_1$, are used to generate candidate 2-ruleitems, $C_2$. Knowledge of frequent ruleitem properties is used to prune candidate ruleitems that cannot be frequent. This knowledge states that all non-empty subsets of a frequent ruleitem must also be frequent. The database is scanned a second time to compute the support counts of each candidate, so that the frequent 2-ruleitems ($F_2$) can be determined. This process repeats, where $F_k$ is used to generate $C_{k+1}$, until no more frequent ruleitems are found. The frequent ruleitems that satisfy minimum confidence form the set of CARs. Pruning may be applied to this rule set.
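The first step can be illustrated with a simplified level-wise sketch. It is not the algorithm as published: instead of generating candidates by joining $F_k$ with itself, it enumerates the k-item condsets present in each sample and discards those with a subset that was not frequent at the previous level. The function name `mine_cars`, the `max_len` cutoff, and the toy data in the usage lines are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_cars(samples, min_support, min_confidence, max_len=2):
    """Return {condset: (label, support, confidence)} for frequent, accurate rules.

    samples: list of (items, label), where items is a set of (attribute, value) pairs.
    min_support and min_confidence are fractions between 0 and 1.
    """
    n = len(samples)
    cars = {}
    frequent = None                       # condsets that were frequent at level k-1
    for k in range(1, max_len + 1):
        cond_count, rule_count = Counter(), Counter()
        for items, label in samples:      # one scan of the data per value of k
            for condset in combinations(sorted(items), k):
                # Prune: every (k-1)-subset of a frequent condset must be frequent.
                if frequent is not None and any(
                        sub not in frequent for sub in combinations(condset, k - 1)):
                    continue
                cond_count[condset] += 1          # condsupCount
                rule_count[condset, label] += 1   # rulesupCount
        frequent = set()
        for (condset, label), count in rule_count.items():
            if count / n < min_support:
                continue                  # this ruleitem is not frequent
            frequent.add(condset)
            confidence = count / cond_count[condset]
            best = cars.get(condset)
            # Keep the highest-confidence rule per condset, if it is accurate.
            if confidence >= min_confidence and (best is None or confidence > best[2]):
                cars[condset] = (label, count / n, confidence)
        if not frequent:
            break
    return cars


data = [  # hypothetical training samples
    ({("age", "young"), ("income", "high")}, "no"),
    ({("age", "young"), ("income", "low")}, "no"),
    ({("age", "old"), ("income", "high")}, "yes"),
    ({("age", "old"), ("income", "low")}, "yes"),
]
for condset, rule in mine_cars(data, min_support=0.5, min_confidence=0.9).items():
    print(condset, "=>", rule)
```

On this toy data only the two rules on age survive: each income value is split evenly between the classes, so no ruleitem on income reaches the minimum support.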
The second step of the associative classification method processes the generated CARs in order to construct the classifier. Since the total number of rule subsets that would be examined in order to determine the most accurate set of rules can be huge, a heuristic method is employed. A precedence ordering among rules is defined where a rule $r_i$ has greater precedence over a rule $r_j$ (i.e., $r_i \succ r_j$) if
1. the confidence of $r_i$ is greater than that of $r_j$, or
2. the confidences are the same, but $r_i$ has greater support, or
3. the confidences and supports of $r_i$ and $r_j$ are the same, but $r_i$ is generated earlier than $r_j$.
In general, the algorithm selects a set of high precedence CARs to cover the samples in D. The algorithm requires slightly more than one pass over D in order to determine the final classifier. The classifier maintains the selected rules from high to low precedence order. When classifying a new sample, the first rule satisfying the sample is used to classify it. The classifier also contains a default rule, having lowest precedence, which specifies a default class for any new sample that is not satisfied by any other rule in the classifier.

In general, the above associative classification method was empirically found to be more accurate than C4.5 on several data sets. Each of the above two steps was shown to have linear scale-up.

Association rule mining based on clustering has also been applied to classification. The ARCS, or Association Rule Clustering System (Section 6.4.3), mines association rules of the form $A_{quan1} \wedge A_{quan2} \Rightarrow A_{cat}$, where $A_{quan1}$ and $A_{quan2}$ are tests on quantitative attribute ranges (where the ranges are dynamically determined), and $A_{cat}$ assigns a class label for a categorical attribute from the given training data. Association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for rectangular clusters of rules. In this way, adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS were applied to classification, and their accuracy was compared to C4.5. In general, ARCS is slightly more accurate when there are outliers in the data. The accuracy of ARCS is related to the degree of discretization used. In terms of scalability, ARCS requires a constant amount of memory, regardless of the database size. C4.5 has exponentially higher execution times than ARCS, requiring the entire database, multiplied by some factor, to fit entirely in main memory. Hence, association rule mining is an important strategy for generating accurate and scalable classifiers.
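The precedence ordering and first-match classification described above for associative classification can be sketched as follows. The sketch assumes the rules have already been selected; it does not perform the coverage-based selection of CARs over D. The dictionary layout of a rule and the names `build_classifier` and `classify` are illustrative assumptions.

```python
def build_classifier(rules, default_class):
    """Order rules by precedence: higher confidence first, then higher support,
    then earlier generation (position in the input list)."""
    indexed = list(enumerate(rules))
    indexed.sort(key=lambda pair: (-pair[1]["confidence"], -pair[1]["support"], pair[0]))
    return [rule for _, rule in indexed], default_class


def classify(classifier, items):
    """Return the class of the first rule whose condset is contained in the
    sample's items, or the default class if no rule is satisfied."""
    ordered_rules, default_class = classifier
    for rule in ordered_rules:
        if rule["condset"] <= items:      # subset test on sets of items
            return rule["label"]
    return default_class


rules = [  # hypothetical CARs
    {"condset": frozenset({("income", "high")}), "label": "yes",
     "confidence": 0.8, "support": 0.4},
    {"condset": frozenset({("age", "young")}), "label": "no",
     "confidence": 0.9, "support": 0.3},
]
clf = build_classifier(rules, default_class="yes")
print(classify(clf, {("age", "young"), ("income", "high")}))  # no
print(classify(clf, {("age", "old"), ("income", "low")}))     # yes (default rule)
```

In the first call both rules are satisfied, and the rule with the higher confidence decides the class.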
7.7 Other classification methods

In this section, we give a brief description of a number of other classification methods. These methods include k-nearest neighbor classification, case-based reasoning, genetic algorithms, and rough set and fuzzy set approaches. In general, these methods are less commonly used for classification in commercial data mining systems than the methods described earlier in this chapter. Nearest-neighbor classification, for example, stores all training samples, which may present difficulties when learning from very large data sets. Furthermore, many applications of case-based reasoning, genetic algorithms, and rough sets for classification are still in the prototype phase. These methods, however, are enjoying increasing popularity, and hence we include them here.
7.7.1 k-nearest neighbor classifiers

Nearest neighbor classifiers are based on learning by analogy. The training samples are described by n-dimensional numeric attributes. Each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. When given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample. These k training samples are the k "nearest neighbors" of the unknown sample. "Closeness" is defined in terms of Euclidean distance, where the Euclidean distance between two points, $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, is:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{7.19}$$

The unknown sample is assigned the most common class among its k nearest neighbors. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in pattern space.

Nearest neighbor classifiers are instance-based since they store all of the training samples. They can incur expensive computational costs when the number of potential neighbors (i.e., stored training samples) with which to compare a given unlabeled sample is great. Therefore, efficient indexing techniques are required. Unlike decision tree induction and backpropagation, nearest neighbor classifiers assign equal weight to each attribute. This may cause confusion when there are many irrelevant attributes in the data.

Nearest neighbor classifiers can also be used for prediction, i.e., to return a real-valued prediction for a given unknown sample. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown sample.
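A brute-force version of the classifier and of the real-valued predictor described above fits in a few lines. This is a minimal sketch that compares the query against every stored sample, with none of the indexing techniques the text calls for; the function names and the toy data are assumptions made here.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance of Equation 7.19."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_nearest(train, query, k):
    """Return the k (point, label) pairs of `train` that are closest to `query`."""
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Most common class among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train, query, k):
    """Average of the real-valued labels of the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)


points = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_classify(points, (1, 0), k=3))     # a
measurements = [((0,), 1.0), ((1,), 2.0), ((10,), 30.0)]
print(knn_predict(measurements, (0.5,), k=2))  # 1.5
```

Because every attribute enters the distance with equal weight, rescaling or dropping irrelevant attributes before the search changes which neighbors are found.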
7.7.2 Case-based reasoning
Case-based reasoning (CBR) classifiers are instance-based. Unlike nearest neighbor classifiers, which store training samples as points in Euclidean space, the samples or "cases" stored by CBR are complex symbolic descriptions. Business applications of CBR include problem resolution for customer service help desks, for example, where cases describe product-related diagnostic problems. CBR has also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-based reasoner will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs which are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving strategies in order to propose a feasible combined solution.

Challenges in case-based reasoning include finding a good similarity metric (e.g., for matching subgraphs), developing efficient techniques for indexing training cases, and methods for combining solutions.
7.7.3 Genetic algorithms
Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given training set are described by two Boolean attributes, $A_1$ and $A_2$, and that there are two classes, $C_1$ and $C_2$. The rule "IF $A_1$ and not $A_2$ THEN $C_2$" can be encoded as the bit string "100", where the two leftmost bits represent attributes $A_1$ and $A_2$, respectively, and the rightmost bit represents the class. Similarly, the rule "IF not $A_1$ and not $A_2$ THEN $C_1$" can be encoded as "001". If an attribute has k values where k > 2, then k bits may be used to encode the attribute's values. Classes can be encoded in a similar fashion.

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.

The process of generating new populations based on prior populations of rules continues until a population P "evolves" where each rule in P satisfies a prespecified fitness threshold.

Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms.
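The two genetic operators and a fitness measure can be sketched on the bit-string encoding of the example above. The crossover point and the mutated positions are passed in explicitly so that the results are reproducible; in a genetic algorithm they would be drawn at random. The function names are illustrative, and the fitness definition (accuracy over the samples a rule covers) is one assumed reading of "classification accuracy on a set of training samples".

```python
def crossover(rule_a, rule_b, point):
    """Swap the substrings after position `point` between two rules."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]


def mutate(rule, positions):
    """Invert the bits of `rule` at the given positions."""
    bits = list(rule)
    for p in positions:
        bits[p] = "1" if bits[p] == "0" else "0"
    return "".join(bits)


def fitness(rule, samples):
    """Fraction of the covered samples that the rule classifies correctly.
    Samples use the same encoding as rules: attribute bits, then the class bit."""
    covered = [s for s in samples if s[:-1] == rule[:-1]]
    if not covered:
        return 0.0
    return sum(s[-1] == rule[-1] for s in covered) / len(covered)


print(crossover("100", "001", point=1))   # ('101', '000')
print(mutate("100", positions=[2]))       # 101
print(fitness("100", ["100", "100", "101", "001"]))
```

The last call evaluates the rule "IF A1 and not A2 THEN C2" on four hypothetical samples, three of which it covers and two of which it classifies correctly.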
7.7.4 Rough set theory

Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized prior to its use.

Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data. Given real-world data, it is common that some classes cannot be distinguished in terms of the available attributes. Rough sets can be used to approximately or "roughly" define such classes. A rough set definition for a given class C is approximated by two sets: a lower approximation of C and an upper approximation of C. The lower approximation of C consists of all of the data samples which, based on the knowledge of the attributes, are certain to belong to C without ambiguity. The upper approximation of C consists of all of the samples which, based on the knowledge of the attributes, cannot be described as not belonging to C. The lower and upper approximations for a class C are shown in Figure 7.14, where each rectangular region represents an equivalence class. Decision rules can be generated for each class. Typically, a decision table is used to represent the rules.

Figure 7.14: A rough set approximation of the set of samples of the class C using lower and upper approximation sets of C. The rectangular regions represent equivalence classes.

Rough sets can also be used for feature reduction (where attributes that do not contribute towards the classification of the given training data can be identified and removed), and relevance analysis (where the contribution or significance of each attribute is assessed with respect to the classification task). The problem of finding the minimal subsets (reducts) of attributes that can describe all of the concepts in the given data set is NP-hard. However, algorithms to reduce the computation intensity have been proposed. In one method, for example, a discernibility matrix is used which stores the differences between attribute values for each pair of data samples. Rather than searching on the entire training set, the matrix is instead searched to detect redundant attributes.
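The lower and upper approximations can be computed directly from the equivalence classes. The sketch below groups samples with identical attribute values and then tests each group against the class C; the function name `approximations` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def approximations(samples, target_class):
    """samples: list of (attribute_tuple, class_label).
    Returns (lower, upper) as sets of sample indices."""
    groups = defaultdict(list)            # equivalence classes of indiscernible samples
    for index, (attributes, _) in enumerate(samples):
        groups[attributes].append(index)
    lower, upper = set(), set()
    for members in groups.values():
        labels = {samples[i][1] for i in members}
        if labels == {target_class}:      # every member certainly belongs to C
            lower.update(members)
        if target_class in labels:        # membership in C cannot be ruled out
            upper.update(members)
    return lower, upper


data = [  # hypothetical samples: (attribute values, class)
    (("high", "yes"), "C"),
    (("high", "yes"), "C"),
    (("low", "yes"), "C"),
    (("low", "yes"), "D"),
    (("low", "no"), "D"),
]
print(approximations(data, "C"))  # ({0, 1}, {0, 1, 2, 3})
```

Samples 2 and 3 are indiscernible but carry different classes, so they fall in the upper approximation of C only.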
7.7.5 Fuzzy set approaches

Rule-based systems for classification have the disadvantage that they involve sharp cut-offs for continuous attributes. For example, consider Rule 7.20 below for customer credit application approval. The rule essentially says that applications for customers who have had a job for two or more years, and who have a high income (i.e., of more than $50K) are approved.

IF (years_employed ≥ 2) ∧ (income > 50K) THEN credit = approved. (7.20)

By Rule 7.20, a customer who has had a job for at least 2 years will receive credit if her income is, say, $51K, but not if it is $50K. Such harsh thresholding may seem unfair. Instead, fuzzy logic can be introduced into the system to allow "fuzzy" thresholds or boundaries to be defined. Rather than having a precise cutoff between categories or sets, fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership that a certain value has in a given category. Hence, with fuzzy logic, we can capture the notion that an income of $50K is, to some degree, high, although not as high as an income of $51K.

Fuzzy logic is useful for data mining systems performing classification. It provides the advantage of working at a high level of abstraction. In general, the use of fuzzy logic in rule-based systems involves the following:
1. Attribute values are converted to fuzzy values. Figure 7.15 shows how values for the continuous attribute income are mapped into the discrete categories {low, medium, high}, as well as how the fuzzy membership or truth values are calculated. Fuzzy logic systems typically provide graphical tools to assist users in this step.
2. For a given new sample, more than one fuzzy rule may apply. Each applicable rule contributes a vote for membership in the categories. Typically, the truth values for each predicted category are summed.
3. The sums obtained above are combined into a value that is returned by the system. This process may be done by weighting each category by its truth sum and multiplying by the mean truth value of each category. The calculations involved may be more complex, depending on the complexity of the fuzzy membership graphs.

Figure 7.15: Fuzzy values for income. (Horizontal axis: income, from 10K to 70K; vertical axis: fuzzy membership, from 0 to 1.0; membership curves for low, medium, and high, with the regions "somewhat low" and "borderline high" marked.)

Fuzzy logic systems have been used in numerous areas for classification, including health care and finance.
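The conversion of an income value into fuzzy membership values can be sketched with piecewise-linear curves. The breakpoints used below (20K, 40K, and 60K) are hypothetical and are not read off Figure 7.15; the function names are likewise assumptions. The sketch only illustrates the idea that membership changes gradually rather than at a sharp threshold.

```python
def ramp_up(x, start, end):
    """0.0 up to `start`, 1.0 from `end` onwards, and linear in between."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def income_membership(income_k):
    """Truth values of an income (in thousands of dollars) for the categories
    low, medium, and high, using hypothetical breakpoints."""
    high = ramp_up(income_k, 40, 60)
    low = 1.0 - ramp_up(income_k, 20, 40)
    medium = max(0.0, 1.0 - low - high)
    return {"low": low, "medium": medium, "high": high}


print(income_membership(50))          # {'low': 0.0, 'medium': 0.5, 'high': 0.5}
print(income_membership(51)["high"] > income_membership(50)["high"])  # True
```

Under these assumed curves an income of $50K is high to degree 0.5, and an income of $51K is high to a slightly greater degree, in contrast to the sharp cut-off of Rule 7.20.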
7.8 Prediction
"What if we would like to predict a continuous value, rather than a categorical label?" The prediction of continuous values can be modeled by statistical techniques of regression. For example, we may like to develop a model to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one. For reasons of space, we cannot give a fully detailed treatment of regression. Instead, this section provides an intuitive introduction to the topic. By the end of this section, you will be familiar with the ideas of linear, multiple, and nonlinear regression, as well as generalized linear models.

Several software packages exist to solve regression problems. Examples include SAS (http://www.sas.com), SPSS (http://www.spss.com), and S-Plus (http://www.mathsoft.com).
7.8.1 Linear and multiple regression

"What is linear regression?" In linear regression, data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable, Y (called a response variable), as a linear function of another random variable, X (called a predictor variable), i.e.,

$$Y = \alpha + \beta X, \qquad (7.21)$$

where the variance of Y is assumed to be constant, and $\alpha$ and $\beta$ are regression coefficients specifying the Y-intercept and slope of the line, respectively. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line. Given s samples or data points of the form $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, the regression coefficients can be estimated using this method with Equations 7.22 and 7.23,

$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}, \qquad (7.22)$$

$$\alpha = \bar{y} - \beta \bar{x}, \qquad (7.23)$$

where $\bar{x}$ is the average of $x_1, x_2, \ldots, x_s$, and $\bar{y}$ is the average of $y_1, y_2, \ldots, y_s$. The coefficients $\alpha$ and $\beta$ often provide good approximations to otherwise complicated regression equations.

X (years experience)    Y (salary in $1000)
 3                      30
 8                      57
 9                      64
13                      72
 3                      36
 6                      43
11                      59
21                      90
 1                      20
16                      83

Table 7.7: Salary data.

[Figure 7.16 plot: scatter plot of salary (in $1000, vertical axis, 0 to 100) versus years experience (horizontal axis, 0 to 25).]
Figure 7.16: Plot of the data in Table 7.7 for Example 7.6. Although the points do not fall on a straight line, the overall pattern suggests a linear relationship between X (years experience) and Y (salary).
Example 7.6 Linear regression using the method of least squares. Table 7.7 shows a set of paired data where X is the number of years of work experience of a college graduate and Y is the corresponding salary of the graduate. A plot of the data is shown in Figure 7.16, suggesting a linear relationship between the two variables, X and Y. We model the relationship that salary may be related to the number of years of work experience with the equation $Y = \alpha + \beta X$. Given the above data, we compute $\bar{x} = 9.1$ and $\bar{y} = 55.4$. Substituting these values into Equations 7.22 and 7.23, we get

$$\beta = \frac{(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + \cdots + (16 - 9.1)(83 - 55.4)}{(3 - 9.1)^2 + (8 - 9.1)^2 + \cdots + (16 - 9.1)^2} = \frac{1269.6}{358.9} \approx 3.54$$

$$\alpha = 55.4 - (3.54)(9.1) \approx 23.2$$

Thus, the equation of the least squares line is estimated by $Y = 23.2 + 3.54X$. Using this equation, we can predict that the salary of a college graduate with, say, 10 years of experience is $58.6K.

Multiple regression is an extension of linear regression involving more than one predictor variable. It allows response variable Y to be modeled as a linear function of a multidimensional feature vector. An example of a multiple regression model based on two predictor attributes or variables, $X_1$ and $X_2$, is shown in Equation 7.24.

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 \qquad (7.24)$$

The method of least squares can also be applied here to solve for $\alpha$, $\beta_1$, and $\beta_2$.
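The least squares estimates of Equations 7.22 and 7.23 translate directly into code. The following is a minimal sketch that uses the salary data of Table 7.7; the function and variable names are illustrative choices, not part of the text.

```python
def least_squares_line(xs, ys):
    """Estimate the intercept (alpha) and slope (beta) of Y = alpha + beta * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimate (Equation 7.22).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    # Intercept estimate (Equation 7.23).
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Table 7.7: years of experience and salary (in $1000).
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

alpha, beta = least_squares_line(years, salary)
predicted = alpha + beta * 10  # predicted salary for 10 years of experience
print(f"Y = {alpha:.1f} + {beta:.2f} X; prediction at X = 10: {predicted:.1f}")
```

Working from the unrounded coefficients, as the code does, avoids compounding rounding error when the fitted line is later used for prediction.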
7.8.2 Nonlinear regression

"How can we model data that does not show a linear dependence? For example, what if a given response variable and predictor variables have a relationship that may be modeled by a polynomial function?"

Polynomial regression can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.

Example 7.7 Transformation of a polynomial regression model to a linear regression model. Consider a cubic polynomial relationship given by Equation 7.25.

$$Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \qquad (7.25)$$

To convert this equation to linear form, we define new variables as shown in Equation 7.26.

$$X_1 = X, \qquad X_2 = X^2, \qquad X_3 = X^3 \qquad (7.26)$$

Equation 7.25 can then be converted to linear form by applying the above assignments, resulting in the equation $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, which is solvable by the method of least squares.

In Exercise 7, you are asked to find the transformations required to convert a nonlinear model involving a power function into a linear regression model. Some models are intractably nonlinear (such as the sum of exponential terms, for example) and cannot be converted to a linear model. For such cases, it may be possible to obtain least-square estimates through extensive calculations on more complex formulae.
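The transformation in Example 7.7 can be tried out with a small sketch. The code below maps each X to the new variables of Equation 7.26 and then fits an ordinary multiple regression. Solving the normal equations by Gaussian elimination is one assumed way of carrying out the least squares step (the text does not prescribe a solver), and the data points are hypothetical, generated from a known cubic so that the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - known) / m[r][r]
    return w


def fit_multiple_regression(rows, ys):
    """Least squares fit of Y = alpha + beta_1 X_1 + ... via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def cubic_features(x):
    """New variables of Equation 7.26: X_1 = X, X_2 = X^2, X_3 = X^3."""
    return [x, x ** 2, x ** 3]


# Hypothetical data generated from Y = 1 + 2X - X^2 + 0.5X^3.
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]

coefficients = fit_multiple_regression([cubic_features(x) for x in xs], ys)
print([round(c, 6) for c in coefficients])  # alpha, beta_1, beta_2, beta_3
```

Because the model is linear in the transformed variables, the same `fit_multiple_regression` routine would also serve for a two-predictor model of the form of Equation 7.24.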
7.8.3 Other regression models

Linear regression is used to model continuous-valued functions. It is widely used, owing largely to its simplicity. "Can it also be used to predict categorical labels?" Generalized linear models represent the theoretical foundation on which linear regression can be applied to the modeling of categorical response variables. In generalized linear models, the variance of the response variable Y is a function of the mean value of Y, unlike in linear regression, where the variance of Y is constant. Common types of generalized linear models include logistic regression and Poisson regression. Logistic regression models the probability of some event occurring as a linear function of a set of predictor variables. Count data frequently exhibit a Poisson distribution and are commonly modeled using Poisson regression.

Log-linear models approximate discrete multidimensional probability distributions. They may be used to estimate the probability value associated with data cube cells. For example, suppose we are given data for the attributes city, item, year, and sales. In the log-linear method, all attributes must be categorical, hence continuous-valued attributes (like sales) must first be discretized. The method can then be used to estimate the probability of each cell in the 4-D base cuboid for the given attributes, based on the 2-D cuboids for city and item, city and year, city and sales, and the 3-D cuboid for item, year, and sales. In this way, an iterative technique can be used to build higher order data cubes from lower order ones. The technique scales up well to allow for many dimensions. Aside from prediction, the log-linear model is useful for data compression (since the smaller order cuboids together typically occupy less space than the base cuboid) and data smoothing (since cell estimates in the smaller order cuboids are less subject to sampling variations than cell estimates in the base cuboid).
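As a brief illustration, the two generalized linear models named above are commonly written as follows. The notation is an assumption: $\alpha$ and $\beta_i$ denote regression coefficients as in Section 7.8.1, $p$ is the probability of the event, and $\mu$ is the expected count. In the usual formulation it is the log-odds of the event (for logistic regression) or the log of the expected count (for Poisson regression) that is linear in the predictors, and in both cases the variance of Y is determined by its mean.

```latex
\[
\log\frac{p}{1-p} = \alpha + \beta_1 X_1 + \cdots + \beta_n X_n,
\qquad p = P(Y = 1 \mid X_1, \ldots, X_n),
\qquad \mathrm{Var}(Y) = p\,(1 - p)
\]
\[
\log \mu = \alpha + \beta_1 X_1 + \cdots + \beta_n X_n,
\qquad \mu = E[\,Y \mid X_1, \ldots, X_n\,],
\qquad \mathrm{Var}(Y) = \mu
\]
```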
[Figure 7.17 diagram: the data are partitioned into a training set, which is used to derive the classifier, and a test set, which is used to estimate its accuracy.]
Figure 7.17: Estimating classifier accuracy with the holdout method.
7.9 Classifier accuracy

Estimating classifier accuracy is important in that it allows one to evaluate how accurately a given classifier will correctly label future data, i.e., data on which the classifier has not been trained. For example, if data from previous sales are used to train a classifier to predict customer purchasing behavior, we would like some estimate of how accurately the classifier can predict the purchasing behavior of future customers. Accuracy estimates also help in the comparison of different classifiers. In Section 7.9.1, we discuss techniques for estimating classifier accuracy, such as the holdout and k-fold cross-validation methods. Section 7.9.2 describes bagging and boosting, two strategies for increasing classifier accuracy. Section 7.9.3 discusses additional issues relating to classifier selection.
7.9.1 Estimating classifier accuracy

Using training data to derive a classifier and then to estimate the accuracy of the classifier can result in misleading over-optimistic estimates due to overspecialization of the learning algorithm (or model) to the data. Holdout and cross-validation are two common techniques for assessing classifier accuracy, based on randomly-sampled partitions of the given data.

In the holdout method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two thirds of the data are allocated to the training set, and the remaining one third is allocated to the test set. The training set is used to derive the classifier, whose accuracy is estimated with the test set (Figure 7.17). The estimate is pessimistic since only a portion of the initial data is used to derive the classifier. Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", $S_1, S_2, \ldots, S_k$, each of approximately equal size. Training and testing are performed k times. In iteration i, the subset $S_i$ is reserved as the test set, and the remaining subsets are collectively used to train the classifier. That is, the classifier of the first iteration is trained on subsets $S_2, \ldots, S_k$, and tested on $S_1$; the classifier of the second iteration is trained on subsets $S_1, S_3, \ldots, S_k$, and tested on $S_2$; and so on. The accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of samples in the initial data. In stratified cross-validation, the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that in the initial data.

Other methods of estimating classifier accuracy include bootstrapping, which samples the given training instances uniformly with replacement, and leave-one-out, which is k-fold cross-validation with k set to s, the number of initial samples. In general, stratified 10-fold cross-validation is recommended for estimating classifier accuracy (even if computation power allows using more folds) due to its relatively low bias and variance. The use of such techniques to estimate classifier accuracy increases the overall computation time, yet is useful for selecting among several classifiers.
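The k-fold procedure described above can be sketched as follows. This is a minimal, unstratified version; the `train` and `predict` callables, the majority-class learner used as a stand-in classifier, and the toy data are all hypothetical and serve only to make the example self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition sample indices into k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(samples, labels, k, train, predict, seed=0):
    """Total correct classifications over the k iterations, divided by sample count."""
    folds = k_fold_indices(len(samples), k, seed)
    correct = 0
    for test_fold in folds:
        held_out = set(test_fold)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(predict(model, samples[i]) == labels[i] for i in test_fold)
    return correct / len(samples)


# Hypothetical stand-in learner: always predicts the majority training class.
def train_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


samples = list(range(10))
labels = ["yes"] * 8 + ["no"] * 2
estimate = cross_validated_accuracy(samples, labels, 5,
                                    train_majority, predict_majority)
print(estimate)
```

Setting k to the number of samples turns the same routine into leave-one-out. A stratified variant would additionally balance the class distribution when assigning indices to folds.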
[Figure 7.18 diagram: classifiers $C_1, C_2, \ldots, C_T$ are derived from the data; each classifies a new data sample, and their votes are combined into a class prediction.]
Figure 7.18: Increasing classifier accuracy: Bagging and boosting each generate a set of classifiers, $C_1, C_2, \ldots, C_T$. Voting strategies are used to combine the class predictions for a given unknown sample.
7.9.2 Increasing classifier accuracy

In the previous section, we studied methods of estimating classifier accuracy. In Section 7.3.2, we saw how pruning can be applied to decision tree induction to help improve the accuracy of the resulting decision trees. Are there general techniques for improving classifier accuracy? The answer is yes. Bagging (or bootstrap aggregation) and boosting are two such techniques (Figure 7.18). Each combines a series of T learned classifiers, $C_1, C_2, \ldots, C_T$, with the aim of creating an improved composite classifier, $C^*$.

"How do these methods work?" Suppose that you are a patient and would like to have a diagnosis made based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than the others, you may choose this as the final or best diagnosis. Now replace each doctor by a classifier, and you have the intuition behind bagging. Suppose instead, that you assign weights to the "value" or worth of each doctor's diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting. Let us have a closer look at these two techniques.

Given a set S of s samples, bagging works as follows. For iteration t (t = 1, 2, ..., T), a training set $S_t$ is sampled with replacement from the original set of samples, S. Since sampling with replacement is used, some of the original samples of S may not be included in $S_t$, while others may occur more than once. A classifier $C_t$ is learned for each training set, $S_t$. To classify an unknown sample, X, each classifier $C_t$ returns its class prediction, which counts as one vote. The bagged classifier, $C^*$, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each vote, rather than the majority.

In boosting, weights are assigned to each training sample. A series of classifiers is learned. After a classifier $C_t$ is learned, the weights are updated to allow the subsequent classifier, $C_{t+1}$, to "pay more attention" to the misclassification errors made by $C_t$. The final boosted classifier, $C^*$, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.
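Both techniques can be illustrated with a compact sketch. Several choices below are assumptions made only for the example and are not specified in the text: the base learner is a one-attribute decision stump, class labels are encoded as +1 and -1, the boosting weight update and vote weights follow the AdaBoost scheme, and the data sets are hypothetical.

```python
import math
import random


def stump_predict(stump, x):
    """A stump votes `polarity` when x exceeds its threshold, else the opposite class."""
    threshold, polarity = stump
    return polarity if x > threshold else -polarity


def train_stump(xs, ys, weights):
    """Pick the (threshold, polarity) pair with the smallest weighted error."""
    best_stump, best_error = None, float("inf")
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            stump = (threshold, polarity)
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(stump, x) != y)
            if error < best_error:
                best_stump, best_error = stump, error
    return best_stump, best_error


def bagging(xs, ys, n_classifiers, seed=0):
    """Learn one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_classifiers):
        picks = [rng.randrange(n) for _ in range(n)]
        stump, _error = train_stump([xs[i] for i in picks],
                                    [ys[i] for i in picks], [1.0 / n] * n)
        stumps.append(stump)
    return stumps


def bagged_predict(stumps, x):
    """Each classifier casts one vote; the class with the most votes wins."""
    return 1 if sum(stump_predict(s, x) for s in stumps) >= 0 else -1


def boosting(xs, ys, n_rounds):
    """AdaBoost-style boosting: reweight samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump, error = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        vote = 0.5 * math.log((1.0 - error) / error)  # more accurate, larger vote
        ensemble.append((vote, stump))
        # Misclassified samples gain weight; correctly classified ones lose it.
        weights = [w * math.exp(-vote * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(ensemble, x):
    """Combine the classifiers' votes, weighted by their accuracy-based vote weights."""
    score = sum(vote * stump_predict(stump, x) for vote, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data: no single stump classifies all ten points correctly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = boosting(xs, ys, 3)
print([boosted_predict(ensemble, x) for x in xs])

bag_x = [1, 2, 3, 4, 10, 11, 12, 13]
bag_y = [-1, -1, -1, -1, 1, 1, 1, 1]
stumps = bagging(bag_x, bag_y, 25)
print(bagged_predict(stumps, 1), bagged_predict(stumps, 13))
```

The boosting data show the benefit of the weighted combination: each individual stump misclassifies some of the ten points, yet three rounds of reweighting yield a composite that labels all of them correctly.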
7.9.3 Is accuracy enough to judge a classifier?

In addition to accuracy, classifiers can be compared with respect to their speed, robustness (e.g., accuracy on noisy data), scalability, and interpretability. Scalability can be evaluated by assessing the number of I/O operations involved for a given classification algorithm on data sets of increasingly large size. Interpretability is subjective, although we may use objective measurements such as the complexity of the resulting classifier (e.g., number of tree nodes for decision trees, or number of hidden units for neural networks, etc.) in assessing it.

"Is it always possible to assess accuracy?" In classification problems, it is commonly assumed that all objects are uniquely classifiable, i.e., that each training sample can belong to only one class. As we have discussed above, classification algorithms can then be compared according to their accuracy. However, owing to the wide diversity of data in large databases, it is not always reasonable to assume that all objects are uniquely classifiable. Rather, it is more probable to assume that each object may belong to more than one class. How then, can the accuracy of classifiers on large databases be measured? The accuracy measure is not appropriate, since it does not take into account the possibility of samples belonging to more than one class.

Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, in some degree, the non-unique classification of objects, it is not a complete solution.
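The second guess heuristic is straightforward to express in code. In the sketch below, each prediction is assumed to be a dictionary mapping class labels to probabilities; that representation, the `guesses` parameter, and the sample data are hypothetical.

```python
def accuracy(distributions, true_labels, guesses=1):
    """Fraction of samples whose true class is among the `guesses` most probable classes."""
    correct = 0
    for distribution, label in zip(distributions, true_labels):
        ranked = sorted(distribution, key=distribution.get, reverse=True)
        if label in ranked[:guesses]:
            correct += 1
    return correct / len(true_labels)


# Hypothetical probability class distributions returned by a classifier.
predicted = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.5, "b": 0.4, "c": 0.1},
    {"a": 0.2, "b": 0.1, "c": 0.7},
    {"a": 0.1, "b": 0.2, "c": 0.7},
]
actual = ["a", "b", "c", "a"]

strict = accuracy(predicted, actual, guesses=1)   # conventional accuracy
lenient = accuracy(predicted, actual, guesses=2)  # second guess heuristic
print(strict, lenient)
```

On this toy data the second sample counts as correct only under the second guess heuristic, since its true class is the second most probable one.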
7.10 Summary
Classification and prediction are two forms of data analysis which can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models continuous-valued functions.

Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher level concepts, or normalizing the data.

Predictive accuracy, computational speed, robustness, scalability, and interpretability are five criteria for the evaluation of classification and prediction methods.

ID3 and C4.5 are greedy algorithms for the induction of decision trees. Each algorithm uses an information theoretic measure to select the attribute tested for each non-leaf node in the tree. Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident, a limitation to data mining on large databases. Since then, several scalable algorithms have been proposed to address this issue, such as SLIQ, SPRINT, and RainForest. Decision trees can easily be converted to classification IF-THEN rules.

Naive Bayesian classification and Bayesian belief networks are based on Bayes theorem of posterior probability. Unlike naive Bayesian classification (which assumes class conditional independence), Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

Backpropagation is a neural network algorithm for classification which employs a method of gradient descent. It searches for a set of weights which can model the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of data samples. Rules may be extracted from trained neural networks in order to help improve the interpretability of the learned network.

Association mining techniques, which search for frequently occurring patterns in large databases, can be applied to and used for classification.

Nearest neighbor classifiers and case-based reasoning classifiers are instance-based methods of classification in that they store all of the training samples in pattern space. Hence, both require efficient indexing techniques.

In genetic algorithms, populations of rules "evolve" via operations of crossover and mutation until all rules within a population satisfy a specified threshold. Rough set theory can be used to approximately define classes that are not distinguishable based on the available attributes. Fuzzy set approaches replace "brittle" threshold cutoffs for continuous-valued attributes with degree of membership functions.

Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables.

Data warehousing techniques, such as attribute-oriented induction and the use of multidimensional data cubes, can be integrated with classification methods in order to allow fast multilevel mining. Classification tasks may be specified using a data mining query language, promoting interactive data mining.

Stratified k-fold cross-validation is a recommended method for estimating classifier accuracy. Bagging and boosting methods can be used to increase overall classification accuracy by learning and combining a series of individual classifiers.
Exercises
1. Table 7.8 consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.
department   status   age     salary   count
sales        senior   31-35   45-50K   30
sales        junior   26-30   25-30K   40
sales        junior   31-35   30-35K   40
systems      junior   21-25   45-50K   20
systems      senior   31-35   65-70K   5
systems      junior   26-30   45-50K   3
systems      senior   41-45   65-70K   3
marketing    senior   36-40   45-50K   10
marketing    junior   31-35   40-45K   4
secretary    senior   46-50   35-40K   4
secretary    junior   26-30   25-30K   6

Table 7.8: Generalized relation from an employee database.

Let salary be the class label attribute.
(a) How would you modify the ID3 algorithm to take into consideration the count of each data tuple (i.e., of each row entry)?
(b) Use your modified version of ID3 to construct a decision tree from the given data.
(c) Given a data sample with the values "systems", "junior", and "20-24" for the attributes department, status, and age, respectively, what would a naive Bayesian classification of the salary for the sample be?
(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers.
(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one iteration of the backpropagation algorithm given the training instance "(sales, senior, 31-35, 45-50K)". Indicate your initial weight values and the learning rate used.

2. Write an algorithm for k-nearest neighbor classification given k, and n, the number of attributes describing each sample.

3. What is a drawback of using a separate set of samples to evaluate pruning?

4. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

5. ADD QUESTIONS ON OTHER CLASSIFICATION METHODS.

6. Table 7.9 shows the mid-term and final exam grades obtained for students in a database course.
(a) Plot the data. Do X and Y seem to have a linear relationship?
(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's mid-term grade in the course.
(c) Predict the final exam grade of a student who received an 86 on the mid-term exam.

7. Some nonlinear regression models can be converted to linear models by applying transformations to the predictor variables. Show how the nonlinear regression equation $Y = \alpha X^{\beta}$ can be converted to a linear regression equation solvable by the method of least squares.
X (mid-term exam)   Y (final exam)
72                  84
50                  63
81                  77
74                  78
94                  90
86                  75
59                  49
83                  79
65                  77
33                  52
88                  74
81                  90

Table 7.9: Mid-term and final exam grades.

8. It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
Bibliographic Notes
Classification from a machine learning perspective is described in several books, such as Weiss and Kulikowski [136], Michie, Spiegelhalter, and Taylor [88], Langley [67], and Mitchell [91]. Weiss and Kulikowski [136] compare classification and prediction methods from many different fields, in addition to describing practical techniques for the evaluation of classifier performance. Many of these books describe each of the basic methods of classification discussed in this chapter. Edited collections containing seminal articles on machine learning can be found in Michalski, Carbonell, and Mitchell [85, 86], Kodratoff and Michalski [63], Shavlik and Dietterich [123], and Michalski and Tecuci [87]. For a presentation of machine learning with respect to data mining applications, see Michalski, Bratko, and Kubat [84].

The C4.5 algorithm is described in a book by J. R. Quinlan [108]. The book gives an excellent presentation of many of the issues regarding decision tree induction, as does a comprehensive survey on decision tree induction by Murthy [94]. Other algorithms for decision tree induction include the predecessor of C4.5, ID3 (Quinlan [104]), CART (Breiman et al. [11]), FACT (Loh and Vanichsetakul [76]), QUEST (Loh and Shih [75]), and PUBLIC (Rastogi and Shim [111]). Incremental versions of ID3 include ID4 (Schlimmer and Fisher [120]) and ID5 (Utgoff [132]). In addition, INFERULE (Uthurusamy, Fayyad, and Spangler [133]) learns decision trees from inconclusive data. KATE (Manago and Kodratoff [80]) learns decision trees from complex structured data. Decision tree algorithms that address the scalability issue in data mining include SLIQ (Mehta, Agrawal, and Rissanen [81]), SPRINT (Shafer, Agrawal, and Mehta [121]), RainForest (Gehrke, Ramakrishnan, and Ganti [43]), and Kamber et al. [61]. Earlier approaches described include [16, 17, 18]. For a comparison of attribute selection measures for decision tree induction, see Buntine and Niblett [15], and Murthy [94]. For a detailed discussion on such measures, see Kononenko and Hong [65].
There are numerous algorithms for decision tree pruning, including cost complexity pruning (Breiman et al. [11]), reduced error pruning (Quinlan [105]), and pessimistic pruning (Quinlan [104]). PUBLIC (Rastogi and Shim [111]) integrates decision tree construction with tree pruning. MDL-based pruning methods can be found in Quinlan and Rivest [110], Mehta, Agrawal, and Rissanen [82], and Rastogi and Shim [111]. Other methods include Niblett and Bratko [96], and Hosking, Pednault, and Sudan [55]. For an empirical comparison of pruning methods, see Mingers [89], and Malerba, Floriana, and Semeraro [79].

For the extraction of rules from decision trees, see Quinlan [105, 108]. Rather than generating rules by extracting them from decision trees, it is also possible to induce rules directly from the training data. Rule induction algorithms include CN2 (Clark and Niblett [21]), AQ15 (Hong, Mozetic, and Michalski [54]), ITRULE (Smyth and Goodman [126]), FOIL (Quinlan [107]), and Swap-1 (Weiss and Indurkhya [134]). Decision trees, however, tend to be superior in terms of computation time and predictive accuracy. Rule refinement strategies which identify the most interesting rules among a given rule set can be found in Major and Mangano [78].

For descriptions of data warehousing and multidimensional data cubes, see Harinarayan, Rajaraman, and Ullman [48], and Berson and Smith [8], as well as Chapter 2 of this book. Attribute-oriented induction (AOI) is presented in Han and Fu [45], and summarized in Chapter 5. The integration of AOI with decision tree induction is proposed in Kamber et al. [61]. The precision or classification threshold described in Section 7.3.6 is used in Agrawal et al. [2] and Kamber et al. [61].

Thorough presentations of Bayesian classification can be found in Duda and Hart [32], a classic textbook on pattern recognition, as well as machine learning textbooks such as Weiss and Kulikowski [136] and Mitchell [91]. For an analysis of the predictive power of naive Bayesian classifiers when the class conditional independence assumption is violated, see Domingos and Pazzani [31]. Experiments with kernel density estimation for continuous-valued attributes, rather than Gaussian estimation, have been reported for naive Bayesian classifiers in John [59]. Algorithms for inference on belief networks can be found in Russell and Norvig [118] and Jensen [58]. The method of gradient descent, described in Section 7.4.4 for learning Bayesian belief networks, is given in Russell et al. [117]. The example given in Figure 7.8 is adapted from Russell et al. [117]. Alternative strategies for learning belief networks with hidden variables include the EM algorithm (Lauritzen [68]), and Gibbs sampling (York and Madigan [139]). Solutions for learning the belief network structure from training data given observable variables are proposed in [22, 14, 50].
The backpropagation algorithm was presented in Rumelhart, Hinton, and Williams [115]. Since then, many variations have been proposed involving, for example, alternative error functions (Hanson and Burr [47]), dynamic adjustment of the network topology (Fahlman and Lebiere [35], Le Cun, Denker, and Solla [70]), and dynamic adjustment of the learning rate and momentum parameters (Jacobs [56]). Other variations are discussed in Chauvin and Rumelhart [19]. Books on neural networks include [116, 49, 51, 40, 19, 9, 113]. Many books on machine learning, such as [136, 91], also contain good explanations of the backpropagation algorithm. There are several techniques for extracting rules from neural networks, such as [119, 42, 131, 40, 7, 77, 25, 69]. The method of rule extraction described in Section 7.5.4 is based on Lu, Setiono, and Liu [77]. Critiques of techniques for rule extraction from neural networks can be found in Andrews, Diederich, and Tickle [5], and Craven and Shavlik [26]. An extensive survey of applications of neural networks in industry, business, and science is provided in Widrow, Rumelhart, and Lehr [137].

The method of associative classification described in Section 7.6 was proposed in Liu, Hsu, and Ma [74]. ARCS was proposed in Lent, Swami, and Widom [73], and is also described in Chapter 6.

Nearest neighbor methods are discussed in many statistical texts on classification, such as Duda and Hart [32], and James [57]. Additional information can be found in Cover and Hart [24] and Fukunaga and Hummels [41]. References on case-based reasoning (CBR) include the texts [112, 64, 71], as well as [1]. For a survey of business applications of CBR, see Allen [4]. Examples of other applications include [6, 129, 138]. For texts on genetic algorithms, see [44, 83, 90]. Rough sets were introduced in Pawlak [97, 99]. Concise summaries of rough set theory in data mining include [141, 20]. Rough sets have been used for feature reduction and expert system design in many applications, including [98, 72, 128].
Algorithms to reduce the computation intensity in finding reducts have been proposed in [114, 125]. General descriptions of fuzzy logic can be found in [140, 8, 20].

There are many good textbooks which cover the techniques of regression. Examples include [57, 30, 60, 28, 52, 95, 3]. The book by Press et al. [101] and accompanying source code contain many statistical procedures, such as the method of least squares for both linear and multiple regression. Recent nonlinear regression models include projection pursuit and MARS (Friedman [39]). Log-linear models are also known in the computer science literature as multiplicative models. For log-linear models from a computer science perspective, see Pearl [100]. Regression trees (Breiman et al. [11]) are often comparable in performance with other regression methods, particularly when there exist many higher order dependencies among the predictor variables.

Methods for data cleaning and data transformation are discussed in Pyle [102], Kennedy et al. [62], Weiss and Indurkhya [134], and Chapter 3 of this book. Issues involved in estimating classifier accuracy are described in Weiss and Kulikowski [136]. The use of stratified 10-fold cross-validation for estimating classifier accuracy is recommended over the holdout, cross-validation, leave-one-out (Stone [127]), and bootstrapping (Efron and Tibshirani [33]) methods, based on a theoretical and empirical study by Kohavi [66]. Bagging is proposed in Breiman [10]. The boosting technique of Freund and Schapire [38] has been applied to several different classifiers, including decision tree induction (Quinlan [109]), and naive Bayesian classification (Elkan [34]).
The University of California at Irvine (UCI) maintains a Machine Learning Repository of data sets for the development and testing of classification algorithms. For information on this repository, see http://www.ics.uci.edu/~mlearn/MLRepository.html.

No classification method is superior over all others for all data types and domains. Empirical comparisons on classification methods include [106, 37, 135, 122, 130, 12, 23, 27, 92, 29].
Bibliography
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Comm., 7:39-52, 1994.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. 18th Int. Conf. Very Large Data Bases, pages 560-573, Vancouver, Canada, August 1992.
[3] A. Agresti. An Introduction to Categorical Data Analysis. John Wiley & Sons, 1996.
[4] B. P. Allen. Case-based reasoning: Business applications. Comm. ACM, 37:40-42, 1994.
[5] R. Andrews, J. Diederich, and A. B. Tickle. A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8, 1995.
[6] K. D. Ashley. Modeling Legal Argument: Reasoning with Cases and Hypotheticals. Cambridge, MA: MIT Press, 1990.
[7] S. Avner. Discovery of comprehensible symbolic rules in a neural network. In Intl. Symposium on Intelligence in Neural and Biological Systems, pages 64-67, 1995.
[8] A. Berson and S. J. Smith. Data Warehousing, Data Mining, and OLAP. New York: McGraw-Hill, 1997.
[9] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.
[10] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[11] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[12] C. E. Brodley and P. E. Utgoff. Multivariate versus univariate decision trees. In Technical Report 8, Department of Computer Science, Univ. of Massachusetts, 1992.
[13] W. Buntine. Graphical models for discovering knowledge. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 59-82. AAAI/MIT Press, 1996.
[14] W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225, 1994.
[15] W. L. Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8:75-85, 1992.
[16] J. Catlett. Megainduction: Machine Learning on Very Large Databases. Ph.D. thesis, University of Sydney, 1991.
[17] P. K. Chan and S. J. Stolfo. Experiments on multistrategy learning by metalearning. In Proc. 2nd Int. Conf. Information and Knowledge Management, pages 314-323, 1993.
[18] P. K. Chan and S. J. Stolfo. Metalearning for multistrategy and parallel learning. In Proc. 2nd Int. Workshop on Multistrategy Learning, pages 150-165, 1993.
[19] Y. Chauvin and D. Rumelhart. Backpropagation: Theory, Architectures, and Applications. Hillsdale, NJ: Lawrence Erlbaum Assoc., 1995.
[20] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Boston: Kluwer Academic Publishers, 1998.
[21] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.
[22] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
[23] V. Corruble, D. E. Brown, and C. L. Pittard. A comparison of decision classifiers with backpropagation neural networks for multimodal classification problems. Pattern Recognition, 26:953-961, 1993.
[24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13:21-27, 1967.
[25] M. W. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1996.
[26] M. W. Craven and J. W. Shavlik. Using neural networks in data mining. Future Generation Computer Systems, 13:211-229, 1997.
[27] S. P. Curram and J. Mingers. Neural networks, decision tree induction and discriminant analysis: An empirical comparison. J. Operational Research Society, 45:440-450, 1994.
[28] J. L. Devore. Probability and Statistics for Engineering and the Sciences, 4th ed. Duxbury Press, 1995.
[29] T. G. Dietterich, H. Hild, and G. Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18:51-80, 1995.
[30] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
[31] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proc. 13th Intl. Conf. Machine Learning, pages 105-112, 1996.
[32] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley: New York, 1973.
[33] B. Efron and R. Tibshirani.
An Introduction to the Bootstrap. New York: Chapman & Hall, 1993. 34 C. Elkan. Boosting and naive bayesian learning. In Technical Report CS97-557, Dept. of Computer Science and Engineering, Univ. Calif. at San Diego, Sept. 1997. 35 S. Fahlman and C. Lebiere. The cascade-correlation learning algorithm. In Technical Report CMU-CS-90-100, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 1990. 36 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy eds.. Advances in Knowledge Discovery and Data Mining. AAAI MIT Press, 1996. 37 D. H. Fisher and K. B. McKusick. An empirical comparison of ID3 and back-propagation. In Proc. 11th Intl. Joint Conf. AI, pages 788 793, San Mateo, CA: Morgan Kaufmann, 1989. 38 Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119 139, 1997. 39 J. Friedman. Multivariate adaptive regression. Annalsof Statistics, 19:1 141, 1991. 40 L. Fu. Neural Networks in Computer Intelligence. McGraw-Hill, 1994. 41 K. Fukunaga and D. Hummels. Bayes error estimation using parzen and k-nn procedure. In IEEE Trans. Pattern Analysis and Machine Learning, pages 634 643, 1987.
BIBLIOGRAPHY
43
42 S. I. Gallant. Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press, 1993. 43 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416 427, New York, NY, August 1998. 44 D. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley: Reading, MA, 1989. 45 J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399 421. AAAI MIT Press, 1996. 46 J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, and O. R. Za
ane. DBMiner: A system for mining knowledge in large relational databases. In Proc. 1996 Int. Conf. Data Mining and Knowledge Discovery KDD'96, pages 250 255, Portland, Oregon, August 1996. 47 S. J. Hanson and D. J. Burr. Minkowski back-propagation: Learning in connectionist models with non-euclidean error signals. In Neural Information Processing Systems, American Institute of Physics, 1988. 48 V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes e ciently. In Proc. 1996 ACMSIGMOD Int. Conf. Management of Data, pages 205 216, Montreal, Canada, June 1996. 49 R. Hecht-Nielsen. Neurocomputing. Reading, MA: Addison Wesley, 1990. 50 D. Heckerman, D. Geiger, and D. Chickering. Learning bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197, 1995. 51 J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison Wesley: Reading, MA., 1991. 52 R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics, 5th ed. Prentice Hall, 1995. 53 L. B. Holder. Intermediate decision trees. In Proc. 14th Int. Joint Conf. Arti cial Intelligence IJCAI95, pages 1056 1062, Montreal, Canada, Aug. 1995. 54 J. Hong, I. Mozetic, and R. S. Michalski. AQ15: Incremental learning of attribute-based descriptions from examples, the method and user's guide. In Report ISG 85-5, UIUCDCS-F-86-949,, Department of Computer Science, University of Illinois at Urbana-Champagin, 1986. 55 J. Hosking, E. Pednault, and M. Sudan. A statistical perspective on data mining. In Future Generation Computer Systems, pages 117 134, ???, 1997. 56 R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295 307, 1988. 57 M. James. Classi cation Algorithms. John Wiley, 1985. 58 F. V. Jensen. An Introduction to Bayesian Networks. Springer Verlag, 1996. 59 G. H. John. Enhancements to the Data Mining Process. Ph.D. Thesis, Computer Science Dept., Stanford Univeristy, 1997. 60 R. A. 
Johnson and D. W. Wickern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice Hall, 1992. 61 M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: E cient classi cation in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering RIDE'97, pages 111 120, Birmingham, England, April 1997. 62 R. L Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippman. Solving Data Mining Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice Hall, 1998. 63 Y. Kodrato and R. S. Michalski. Machine Learning, An Arti cial Intelligence Approach, Vol. 3. Morgan Kaufmann, 1990.
44
BIBLIOGRAPHY
64 J. L. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993. 65 I. Kononenko and S. J. Hong. Attribute selection for modeling. In Future Generation Computer Systems, pages 181 195, ???, 1997. 66 K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proc. 4th Int. Symp. Large Spatial Databases SSD'95, pages 47 66, Portland, Maine, Aug. 1995. 67 P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996. 68 S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191 201, 1995. 69 S. Lawrence, C. L Giles, and A. C. Tsoi. Symbolic conversion, grammatical inference and rule extraction for foreign exchange rate prediction. In Y. Abu-Mostafa and A. S. Weigend P. N Refenes, editors, Neural Networks in the Captial Markets. Singapore: World Scienti c, 1997. 70 Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in neural Information Processing Systems, 2, pages San Mateo, CA: Morgan Kaufmann. Cambridge, MA: MIT Press, 1990. 71 D. B. Leake. CBR in context: The present and future. In D. B. Leake, editor, Cased-Based Reasoning: Experience, Lessons, and Future Directions, pages 3 30. Menlo Park, CA: AAAI Press, 1996. 72 A. Lenarcik and Z. Piasta. Probabilistic rough classi ers with mixture of discrete and continuous variables. In T. Y. Lin N. Cercone, editor, Rough Sets and Data Mining: Analysis for Imprecise Data, pages 373 383. Boston: Kluwer, 1997. 73 B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering ICDE'97, pages 220 231, Birmingham, England, April 1997. 74 B. Liu, W. Hsu, and Y. Ma. Integrating classi cation and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining KDD'98, pages 80 86, New York, NY, August 1998. 75 W. Y. Loh and Y.S. Shih. Split selection methods for classi cation trees. 
Statistica Sinica, 7:815 840, 1997. 76 W. Y. Loh and N. Vanichsetakul. Tree-structured classi caiton via generalized discriminant analysis. Journal of the American Statistical Association, 83:715 728, 1988. 77 H. Lu, R. Setiono, and H. Liu. Neurorule: A connectionist approach to data mining. In Proc. 21st Int. Conf. Very Large Data Bases, pages 478 489, Zurich, Switzerland, Sept. 1995. 78 J. Major and J. Mangano. Selecting among rules induced from a hurricane database. Journal of Intelligent Information Systems, 4:39 52, 1995. 79 D. Malerba, E. Floriana, and G. Semeraro. A further comparison of simpli cation methods for decision tree induction. In D.Fisher H. Lenz, editor, Learning from Data: AI and Statistics. Springer-Verlag, 1995. 80 M. Manago and Y. Kodrato . Induction of decision trees from complex structured data. In G. PiatetskyShapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 289 306. AAAI MIT Press, 1991. 81 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classi er for data mining. In Proc. 1996 Int. Conf. Extending Database Technology EDBT'96, Avignon, France, March 1996. 82 M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc. 1st Intl. Conf. Knowledge Discovery and Data Mining KDD95, Montreal, Canada, Aug. 1995. 83 Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer Verlag, 1992. 84 R. S. Michalski, I. Bratko, and M. Kubat. Machine Learning and Data Mining: Methods and Applications. John Wiley & Sons, 1998.
BIBLIOGRAPHY
45
85 R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning, An Arti cial Intelligence Approach, Vol. 1. Morgan Kaufmann, 1983. 86 R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning, An Arti cial Intelligence Approach, Vol. 2. Morgan Kaufmann, 1986. 87 R. S. Michalski and G. Tecuci. Machine Learning, A Multistrategy Approach, Vol. 4. Morgan Kaufmann, 1994. 88 D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classi cation. Ellis Horwood, 1994. 89 J. Mingers. An empirical comparison of pruning methods for decision-tree induction. Machine Learning, 4:227 243, 1989. 90 M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996. 91 T. M. Mitchell. Machine Learning. McGraw Hill, 1997. 92 D. Mitchie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classi cation. New York: Ellis Horwood, 1994. 93 R. Mooney, J. Shavlik, G. Towell, and A. Grove. An experimental comparison of symbolic and connectionist learning algorithms. In Proc. 11th Int. Joint Conf. on Arti cial Intelligence IJCAI'89, pages 775 787, Detroit, MI, Aug. 1989. 94 S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2:345 389, 1998. 95 J. Neter, M. H. Kutner, C. J. Nachtsheim, and L. Wasserman. Applied Linear Statistical Models, 4th ed. Irwin: Chicago, 1996. 96 T. Niblett and I. Bratko. Learning decision rules in noisy domains. In M. A. Bramer, editor, Expert Systems '86: Research and Development in Expert Systems III, pages 25 34. British Computer Society Specialist Group on Expert Systems, Dec. 1986. 97 Z. Pawlak. Rough sets. Intl. J. Computer and Information Sciences, 11:341 356, 1982. 98 Z. Pawlak. On learning - rough set approach. In Lecture Notes 208, pages 197 227, New York: Springer-Verlag, 1986. 99 Z. Pawlak. Rough Sets, Theoretical Aspects of Reasonign about Data. Boston: Kluwer, 1991. 100 J. Pearl. 
Probabilistic Reasoning in Intelligent Systems. Palo Alto, CA: Morgan Kau man, 1988. 101 W. H. Press, S. A. Teukolsky, V. T. Vetterling, and B. P. Flannery. Numerical Recipes in C, The Art of Scienti c Computing. Cambridge, MA: Cambridge University Press, 1996. 102 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999. 103 J. R. Quinlan. The e ect of noise on concept learning. In Michalski et al., editor, Machine Learning: An Arti cial Intelligence Approach, Vol. 2, pages 149 166. Morgan Kaufmann, 1986. 104 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81 106, 1986. 105 J. R. Quinlan. Simplifying decision trees. Internation Journal of Man-Machine Studies, 27:221 234, 1987. 106 J. R. Quinlan. An empirical comparison of genetic and decision-tree classi ers. In Proc. 5th Intl. Conf. Machine Learning, pages 135 141, San Mateo, CA: Morgan Kaufmann, 1988. 107 J. R. Quinlan. Learning logic de nitions from relations. Machine Learning, 5:139 166, 1990. 108 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
46
BIBLIOGRAPHY
109 J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Arti cial Intelligence AAAI'96, volume 1, pages 725 730, Portland, OR, Aug. 1996. 110 J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227 248, March 1989. 111 R. Rastogi and K. Shim. Public: A decision tree classifer that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404 415, New York, NY, August 1998. 112 C. Riesbeck and R. Schank. Inside Case-Based Reasoning. Hillsdale, NJ: Lawrence Erlbaum, 1989. 113 B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press, 1996. 114 S. Romanski. Operations on families of sets for exhaustive search, given a monotonic function. In Proc. 3rd Intl. Conf on Data and Knowledge Bases, C. Beeri et. al. eds.,, pages 310 322, Jerusalem, Israel, 1988. 115 D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart J. L. McClelland, editor, Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986. 116 D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986. 117 S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In Proc. 14th Joint Int. Conf. on Arti cial Intelligence IJCAI'95, volume 2, pages 1146 1152, Montreal, Canada, Aug. 1995. 118 S. Russell and P. Norvig. Arti cial Intelligence: A Modern Approach. Prentice-Hall, 1995. 119 K. Saito and R. Nakano. Medical diagnostic expert system based on PDP model. In Proc. IEEE International Conf. on Neural Networks, volume 1, pages 225 262, San Mateo, CA, 1988. 120 J. C. Schlimmer and D. Fisher. A case study of incremental concept induction. In Proc. 5th Natl. Conf. Arti cial Intelligence, pages 496 501, Phildelphia, PA: Morgan Kaufmann, 1986. 121 J. Shafer, R. Agrawal, and M. 
Mehta. SPRINT: A scalable parallel classi er for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544 555, Bombay, India, Sept. 1996. 122 J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6:111 144, 1991. 123 J.W. Shavlik and T.G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990. 124 A. Skowron, L. Polowski, and J. Komorowski. Learning tolerance relations by boolean descriptors: Automatic feature extraction from data tables. In Proc. 4th Intl. Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, S. Tsumoto et. al. eds., pages 11 17, University of Tokyo, 1996. 125 A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems. In R. Slowinski, editor, Intelligent Decision Support, Handbook of Applications and Advances of the Rough Set Theory, pages 331 362. Boston: Kluwer, 1992. 126 P. Smyth and R.M. Goodman. Rule induction using information theory. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 159 176. AAAI MIT Press, 1991. 127 M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111 147, 1974. 128 R. Swiniarski. Rough sets and principal component analysis and their applications in feature extraction and seletion, data model building and classi cation. In S. Pal A. Skowron, editor, Fuzzy Sets, Rough Sets and Decision Making Processes. New York: Springer-Verlag, 1998. 129 K. Sycara, R. Guttal, J. Koning, S. Narasimhan, and D. Navinchandra. CADET: A case-based synthesis tool for engineering design. Int. Journal of Expert Systems, 4:157 188, 1992.
BIBLIOGRAPHY
47
130 S. B. Thrun et al. The monk's problems: A performance comparison of di erent learning algorithms. In Technical Report CMU-CS-91-197, Department of Computer Science, Carnegie Mellon Univ., Pittsburgh, PA, 1991. 131 G. G. Towell and J. W. Shavlik. Extracting re ned rules from knowledge-based neural networks. Machine Learning, 13:71 101, Oct. 1993. 132 P. E. Utgo . An incremental ID3. In Proc. Fifth Int. Conf. Machine Learning, pages 107 120, San Mateo, California, 1988. 133 R. Uthurusamy, U. M. Fayyad, and S. Spangler. Learning useful rules from inconclusive data. In G. PiatetskyShapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 141 157. AAAI MIT Press, 1991. 134 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998. 135 S. M. Weiss and I. Kapouleas. An empirical comparison of pattern recognition, neural nets, and machine learning classi cation methods. In Proc. 11th Int. Joint Conf. Arti cial Intelligence, pages 781 787, Detroit, MI, Aug. 1989. 136 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classi cation and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991. 137 B. Widrow, D. E. Rumelhart, and M. A. Lehr. Neural networks: Applications in industry, business and science. Comm. ACM, 37:93 105, 1994. 138 K. D. Wilson. Chemreg: Using case-based reasoning to support health and safety compliance in the chemical industry. AI Magazine, 19:47 57, 1998. 139 J. York and D. Madigan. Markov chaine monte carlo methods for hierarchical bayesian expert systems. In Cheesman and Oldford, pages 445 452, 1994. 140 L. A. Zadeh. Fuzzy sets. Information and Control, 8:338 353, 1965. 141 W. Ziarko. The discovery, analysis, and representation of data dependencies in databases. In G. PiatetskyShapiro W. J. Frawley, editor, Knowledge Discovery in Databases, pages 195 209. AAAI Press, 1991.


© J. Han and M. Kamber, 1998. DRAFT!! DO NOT COPY!! DO NOT DISTRIBUTE!!

September 15, 1999
