7 Classification and Prediction
Chapter 7
number or set of classes to be learned may not be known in advance. Clustering is the topic of Chapter 8. Typically, the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. For example, given a database of customer credit information, classification rules can be learned to identify customers as having either excellent or fair credit ratings (Figure 7.1a). The rules can be used to categorize future data samples, as well as provide a better understanding of the database contents.

In the second step (Figure 7.1b), the model is used for classification. First, the predictive accuracy of the model (or classifier) is estimated. Section 7.9 of this chapter describes several methods for estimating classifier accuracy. The holdout method is a simple technique which uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. For each test sample, the known class label is compared with the learned model's class prediction for that sample. Note that if the accuracy of the model were estimated based on the training data set, this estimate could be optimistic since the learned model tends to overfit the data (that is, it may have incorporated some particular anomalies of the training data which are not present in the overall sample population). Therefore, a test set is used.

Figure 7.1: The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is credit_rating, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known. (Such data are also referred to in the machine learning literature as "unknown" or "previously unseen" data.) For example, the classification rules learned in Figure 7.1a from the analysis of data from existing customers can be used to predict the credit rating of new or future (i.e., previously unseen) customers.

"How is prediction different from classification?" Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have. In this view, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. In our view, however, we refer to the use of prediction to predict class labels as classification, and the use of prediction to predict continuous values (e.g., using regression techniques) as prediction. This view is commonly accepted in data mining. Classification and prediction have numerous applications, including credit approval, medical diagnosis, performance prediction, and selective marketing.
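The holdout estimate of classifier accuracy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the one-third test fraction, the income-based toy data, and the hand-written rule standing in for a learned classifier are all hypothetical and do not come from the text.

```python
import random


def holdout_split(samples, test_fraction=1 / 3, seed=0):
    """Randomly partition class-labeled samples into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, test_set):
    """Percentage of test samples whose known label equals the predicted label."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical labeled data: income (in thousands) -> credit rating.
samples = [({"income": inc}, "excellent" if inc >= 50 else "fair")
           for inc in range(20, 80, 5)]
train, test = holdout_split(samples)


def rule(features):
    """Hypothetical stand-in for a classifier learned from `train`."""
    return "excellent" if features["income"] >= 50 else "fair"


test_accuracy = accuracy(rule, test)
print(len(train), len(test), test_accuracy)
```

Because the test samples are drawn at random and kept out of training, the resulting percentage is not inflated by anomalies that the model may have absorbed from the training data.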
Example 7.1 Suppose that we have a database of customers on the AllElectronics mailing list. The mailing list is used to send out promotional literature describing new products and upcoming price discounts. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics. Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose. Suppose instead that you would like to predict the number of major purchases that a customer will make at AllElectronics during a fiscal year. Since the predicted value here is ordered, a prediction model can be constructed for this purpose.
Data cleaning. This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example), and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis. Many of the attributes in the data may be irrelevant to the classification or prediction task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" feature subset, should be less than the time that would have been spent on learning from the original set of features. Hence, such analysis can help improve classification efficiency and scalability.

Data transformation. The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium, and high. Similarly, nominal-valued attributes, like street, can be generalized to higher-level concepts, like city. Since generalization compresses the original training data, fewer input/output operations may be involved during learning. The data may also be normalized, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. In methods which use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes).

Data cleaning, relevance analysis, and data transformation are described in greater detail in Chapter 3 of this book.

Comparing classification methods. Classification and prediction methods can be compared and evaluated according to the following criteria:
1. Predictive accuracy. This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
2. Speed. This refers to the computation costs involved in generating and using the model.
3. Robustness. This is the ability of the model to make correct predictions given noisy data or data with missing values.
4. Scalability. This refers to the ability of the learned model to perform efficiently on large amounts of data.
5. Interpretability. This refers to the level of understanding and insight that is provided by the learned model.

These issues are discussed throughout the chapter. The database research community's contributions to classification and prediction for data mining have strongly emphasized the scalability aspect, particularly with respect to decision tree induction.

Figure 7.2: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no). [Tree: the root tests age?, with branches <30, 30-40, and >40; the <30 branch tests student?, the 30-40 branch is a yes leaf, and the >40 branch tests credit_rating?.]
denoted by rectangles, and leaf nodes are denoted by ovals. In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node which holds the class prediction for that sample. Decision trees can easily be converted to classification rules.

In Section 7.3.1, we describe a basic algorithm for learning decision trees. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. Tree pruning is described in Section 7.3.2. The extraction of classification rules from decision trees is discussed in Section 7.3.3. Enhancements of the basic decision tree algorithm are given in Section 7.3.4. Scalability issues for the induction of decision trees from large databases are discussed in Section 7.3.5. Section 7.3.6 describes the integration of decision tree induction with data warehousing facilities, such as data cubes, allowing the mining of decision trees at multiple levels of granularity. Decision trees have been used in many application areas ranging from medicine to game theory and business. Decision trees are the basis of several commercial rule induction systems.
Algorithm 7.3.1 (Generate_decision_tree) Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:
(1)  create a node N;
(2)  if samples are all of the same class, C, then
(3)      return N as a leaf node labeled with the class C;
(4)  if attribute-list is empty then
(5)      return N as a leaf node labeled with the most common class in samples; // majority voting
(6)  select test-attribute, the attribute among attribute-list with the highest information gain;
(7)  label node N with test-attribute;
(8)  for each known value a_i of test-attribute // partition the samples
(9)      grow a branch from node N for the condition test-attribute = a_i;
(10)     let s_i be the set of samples in samples for which test-attribute = a_i; // a partition
(11)     if s_i is empty then
(12)         attach a leaf labeled with the most common class in samples;
(13)     else attach the node returned by Generate_decision_tree(s_i, attribute-list - test-attribute);

Figure 7.3: Basic algorithm for inducing a decision tree from training samples.
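The pseudocode of Figure 7.3 can be turned into a short runnable Python sketch. Several things here are assumptions made for illustration and are not part of the algorithm as stated: samples are dictionaries, the tree is a nested dictionary, a `domains` argument supplies the known values of each attribute (needed for the empty-partition case in steps 11 and 12), and the six training samples are hypothetical. The `middle` value is included with no samples precisely to exercise that case.

```python
import math
from collections import Counter


def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(samples, attribute, target):
    """Reduction in entropy obtained by partitioning samples on attribute."""
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy([s[target] for s in samples]) - remainder


def generate_decision_tree(samples, attribute_list, target, domains):
    labels = [s[target] for s in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                       # steps 2-3: pure node
        return labels[0]
    if not attribute_list:                          # steps 4-5: majority voting
        return majority
    test_attribute = max(                           # step 6: highest information gain
        attribute_list, key=lambda a: information_gain(samples, a, target))
    branches = {}                                   # step 7: node labeled test_attribute
    remaining = [a for a in attribute_list if a != test_attribute]
    for value in domains[test_attribute]:           # step 8: one branch per known value
        partition = [s for s in samples if s[test_attribute] == value]   # step 10
        if not partition:                           # steps 11-12: empty partition
            branches[value] = majority
        else:                                       # step 13: recurse
            branches[value] = generate_decision_tree(partition, remaining, target, domains)
    return {test_attribute: branches}


# Hypothetical training samples (not Table 7.1).
samples = [
    {"age": "young", "student": "yes", "buys": "yes"},
    {"age": "young", "student": "no", "buys": "no"},
    {"age": "young", "student": "no", "buys": "no"},
    {"age": "old", "student": "yes", "buys": "yes"},
    {"age": "old", "student": "no", "buys": "yes"},
    {"age": "old", "student": "no", "buys": "yes"},
]
domains = {"age": ["young", "middle", "old"], "student": ["yes", "no"]}
tree = generate_decision_tree(samples, ["age", "student"], "buys", domains)
print(tree)
```

Representing leaves as plain class labels and internal nodes as single-key dictionaries keeps the sketch compact; a production implementation would more likely use an explicit node class.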
Attribute selection measure. The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Such an information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found.

Let $S$ be a set consisting of $s$ data samples. Suppose the class label attribute has $m$ distinct values defining $m$ distinct classes, $C_i$ (for $i = 1, \ldots, m$). Let $s_i$ be the number of samples of $S$ in class $C_i$. The expected information needed to classify a given sample is given by:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad (7.1)$$

where $p_i$ is the probability that an arbitrary sample belongs to class $C_i$ and is estimated by $s_i/s$. Note that a log function to the base 2 is used since the information is encoded in bits.

Let attribute $A$ have $v$ distinct values, $\{a_1, a_2, \ldots, a_v\}$. Attribute $A$ can be used to partition $S$ into $v$ subsets, $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ contains those samples in $S$ that have value $a_j$ of $A$. If $A$ were selected as the test attribute (i.e., best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set $S$. Let $s_{ij}$ be the number of samples of class $C_i$ in a subset $S_j$. The entropy, or expected information based on the partitioning into subsets by $A$, is given by:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj}). \qquad (7.2)$$

The term $\frac{s_{1j} + \cdots + s_{mj}}{s}$ acts as the weight of the $j$th subset and is the number of samples in the subset (i.e., having value $a_j$ of $A$) divided by the total number of samples in $S$. The smaller the entropy value is, the greater the purity of the subset partitions. The encoding information that would be gained by branching on $A$ is

$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A). \qquad (7.3)$$

In other words, $Gain(A)$ is the expected reduction in entropy caused by knowing the value of attribute $A$.

The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set $S$. A node is created and labeled with the attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly.
Example 7.2 Induction of a decision tree. Table 7.1 presents a training set of data tuples taken from the AllElectronics customer database. (The data are adapted from Quinlan 1986b.) The class label attribute, buys_computer, has two distinct values (namely {yes, no}); therefore, there are two distinct classes ($m = 2$). Let $C_1$ correspond to the class yes and $C_2$ correspond to the class no. There are 9 samples of class yes and 5 samples of class no. To compute the information gain of each attribute, we first use Equation 7.1 to compute the expected information needed to classify a given sample. This is:

$$I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

Table 7.1: Training data tuples from the AllElectronics customer database.

Next, we need to compute the entropy of each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no samples for each value of age. We compute the expected information for each of these distributions:

for age = "<30":   $s_{11} = 2$, $s_{21} = 3$, $I(s_{11}, s_{21}) = 0.971$
for age = "30-40": $s_{12} = 4$, $s_{22} = 0$, $I(s_{12}, s_{22}) = 0$
for age = ">40":   $s_{13} = 3$, $s_{23} = 2$, $I(s_{13}, s_{23}) = 0.971$

Using Equation 7.2, the expected information needed to classify a given sample, if the samples are partitioned according to age, is:

$$E(age) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = 0.694.$$

Hence, the gain in information from such a partitioning would be:

$$Gain(age) = I(s_1, s_2) - E(age) = 0.246$$

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values. The samples are then partitioned accordingly, as shown in Figure 7.4. Notice that the samples falling into the partition for age = "30-40" all belong to the same class. Since they all belong to class yes, a leaf should therefore be created at the end of this branch and labeled with yes. The final decision tree returned by the algorithm is shown in Figure 7.2.

In summary, decision tree induction algorithms have been used for classification in a wide range of application domains. Such systems do not use domain knowledge. The learning and classification steps of decision tree induction are generally fast. Classification accuracy is typically high for data where the mapping of classes consists of long and thin regions in concept space.
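The arithmetic of Example 7.2 can be checked with a small Python sketch that works directly from the class counts quoted in the example (9 yes and 5 no overall; 2/3, 4/0, and 3/2 for the three age partitions). The function names `info` and `expected_info` are hypothetical labels for Equations 7.1 and 7.2.

```python
import math


def info(*counts):
    """Equation 7.1: expected information for the given per-class sample counts."""
    total = sum(counts)
    # A class with zero samples contributes nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """Equation 7.2: weighted information over the subsets induced by an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


class_counts = (9, 5)                        # (yes, no) over all 14 samples
age_partitions = [(2, 3), (4, 0), (3, 2)]    # (yes, no) for "<30", "30-40", ">40"

i_all = info(*class_counts)
e_age = expected_info(age_partitions)
gain_age = i_all - e_age                     # Equation 7.3
print(f"I(9,5)={i_all:.4f}  E(age)={e_age:.4f}  Gain(age)={gain_age:.4f}")
```

Carried to four decimal places the gain is about 0.2467, which the example reports as 0.246.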
Figure 7.4: The attribute age has the highest information gain and therefore becomes a test attribute at the root node of the decision tree. Branches are grown for each value of age (<30, 30-40, >40). The samples are shown partitioned according to each branch.

in choosing an appropriate threshold. High thresholds could result in oversimplified trees, while low thresholds could result in very little simplification.

The postpruning approach removes branches from a "fully grown" tree. A tree node is pruned by removing its branches. The cost complexity pruning algorithm is an example of the postpruning approach. The pruned node becomes a leaf and is labeled by the most frequent class among its former branches. For each non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if the subtree at that node were pruned. Next, the expected error rate occurring if the node were not pruned is calculated using the error rates for each branch, combined by weighting according to the proportion of observations along each branch. If pruning the node leads to a greater expected error rate, then the subtree is kept. Otherwise, it is pruned. After generating a set of progressively pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision tree that minimizes the expected error rate is preferred.

Rather than pruning trees based on expected error rates, we can prune trees based on the number of bits required to encode them. The "best pruned tree" is the one that minimizes the number of encoding bits. This method adopts the Minimum Description Length (MDL) principle, which follows the notion that the simplest solution is preferred. Unlike cost complexity pruning, it does not require an independent set of samples.

Alternatively, prepruning and postpruning may be interleaved for a combined approach. Postpruning requires more computation than prepruning, yet generally leads to a more reliable tree.
Example 7.3 Generating classification rules from a decision tree. The decision tree of Figure 7.2 can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 7.2 are:

IF age = "<30" AND student = no THEN buys_computer = no
IF age = "<30" AND student = yes THEN buys_computer = yes
IF age = "30-40" THEN buys_computer = yes
IF age = ">40" AND credit_rating = excellent THEN buys_computer = yes
IF age = ">40" AND credit_rating = fair THEN buys_computer = no
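Tracing each root-to-leaf path, as Example 7.3 describes, is a short recursive traversal. The sketch below assumes a nested-dictionary encoding of the tree (an illustrative choice, not something the text prescribes) and fills in the leaves of Figure 7.2 from the rules listed in the example; the function name `extract_rules` is hypothetical.

```python
def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule per leaf by tracing root-to-leaf paths."""
    if not isinstance(tree, dict):                      # reached a leaf: emit a rule
        antecedent = " AND ".join(f'{attr} = "{val}"' for attr, val in conditions)
        return [f"IF {antecedent} THEN {target} = {tree}"]
    (attribute, branches), = tree.items()               # an internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, target, conditions + ((attribute, value),)))
    return rules


# Figure 7.2, encoded from the rules given in Example 7.3.
figure_7_2 = {
    "age": {
        "<30": {"student": {"no": "no", "yes": "yes"}},
        "30-40": "yes",
        ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
    }
}

rules = extract_rules(figure_7_2, "buys_computer")
for rule in rules:
    print(rule)
```

Each condition along a path becomes one conjunct of the rule antecedent, and the leaf supplies the consequent, so the number of rules equals the number of leaves.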
C4.5, a later version of the ID3 algorithm, uses the training samples to estimate the accuracy of each rule. Since this would result in an optimistic estimate of rule accuracy, C4.5 employs a pessimistic estimate to compensate for the bias. Alternatively, a set of test samples independent from the training set can be used to estimate rule accuracy. A rule can be "pruned" by removing any condition in its antecedent that does not improve the estimated accuracy of the rule. For each class, rules within a class may then be ranked according to their estimated accuracy. Since it is possible that a given test sample will not satisfy any rule antecedent, a default rule assigning the majority class is typically added to the resulting rule set.
Figure 7.5: Attribute list and class list data structures used in SLIQ for the sample data of Table 7.2.

More recent decision tree algorithms which address the scalability issue have been proposed. Algorithms for the induction of decision trees from very large training sets include SLIQ and SPRINT, both of which can handle categorical and continuous-valued attributes. Both algorithms propose pre-sorting techniques on disk-resident data sets that are too large to fit in memory. Both define the use of new data structures to facilitate the tree construction. SLIQ employs disk-resident attribute lists and a single memory-resident class list. The attribute lists and class list generated by SLIQ for the sample data of Table 7.2 are shown in Figure 7.5. Each attribute has an associated attribute list, indexed by rid (a record identifier). Each tuple is represented by a linkage of one entry from each attribute list to an entry in the class list (holding the class label of the given tuple), which in turn is linked to its corresponding leaf node in the decision tree. The class list remains in memory since it is often accessed and modified in the building and pruning phases. The size of the class list grows proportionally with the number of tuples in the training set. When a class list cannot fit into memory, the performance of SLIQ decreases.

SPRINT uses a different attribute list data structure which holds the class and rid information, as shown in Figure 7.6. When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly. When a list is partitioned, the order of the records in the list is maintained. Hence, partitioning lists does not require resorting. SPRINT was designed to be easily parallelized, further contributing to its scalability.

Figure 7.6: Attribute list data structure used in SPRINT for the sample data of Table 7.2.

While both SLIQ and SPRINT handle disk-resident data sets that are too large to fit into memory, the scalability of SLIQ is limited by the use of its memory-resident data structure. SPRINT removes all memory restrictions, yet requires the use of a hash tree proportional in size to the training set. This may become expensive as the training set size grows.

RainForest is a framework for the scalable induction of decision trees. The method adapts to the amount of main memory available, and applies to any decision tree induction algorithm. It maintains an AVC-set (Attribute-Value, Class label) indicating the class distribution for each attribute. RainForest reports a speed-up over SPRINT.
Figure 7.7: A multidimensional data cube. [Dimensions: Occupation, Income, and Age.]

in order to provide an alternative presentation of the data. For example, pivot may be used to transform a 3-D cube into a series of 2-D planes.

The above approaches can be integrated with decision tree induction to provide interactive multilevel mining of decision trees. The data cube and knowledge stored in the concept hierarchies can be used to induce decision trees at different levels of abstraction. Furthermore, once a decision tree has been derived, the concept hierarchies can be used to generalize or specialize individual nodes in the tree, allowing attribute roll-up or drill-down, and reclassification of the data for the newly specified abstraction level. This interactive feature will allow users to focus their attention on areas of the tree or data which they find interesting.

When integrating AOI with decision tree induction, generalization to a very low (specific) concept level can result in quite large and bushy trees. Generalization to a very high concept level can result in decision trees of little use, where interesting and important subconcepts are lost due to overgeneralization. Instead, generalization should be to some intermediate concept level, set by a domain expert or controlled by a user-specified threshold. Hence, the use of AOI may result in classification trees that are more understandable, smaller, and therefore easier to interpret than trees obtained from methods operating on ungeneralized (larger) sets of low-level data (such as SLIQ or SPRINT).

A criticism of typical decision tree generation is that, because of the recursive partitioning, some resulting data subsets may become so small that partitioning them further would have no statistically significant basis. The maximum size of such "insignificant" data subsets can be statistically determined. To deal with this problem, an exception threshold may be introduced. If the portion of samples in a given subset is less than the threshold, further partitioning of the subset is halted. Instead, a leaf node is created which stores the subset and class distribution of the subset samples.

Owing to the large amount and wide diversity of data in large databases, it may not be reasonable to assume that each leaf node will contain samples belonging to a common class. This problem may be addressed by employing a precision or classification threshold. Further partitioning of the data subset at a given node is terminated if the percentage of samples belonging to any given class at that node exceeds this threshold.

A data mining query language may be used to specify and facilitate the enhanced decision tree induction method. Suppose that the data mining task is to predict the credit risk of customers aged 30-40, based on their income and occupation. This may be specified as the following data mining query:
mine classification
analyze credit_risk
in relevance to income, occupation
from Customer_db
where age >= 30 and age < 40
display as rules
The above query, expressed in DMQL, executes a relational query on Customer_db to retrieve the task-relevant data. Tuples not satisfying the where clause are ignored, and only the data concerning the attributes specified in the in relevance to clause and the class label attribute (credit_risk) are collected. AOI is then performed on this data. Since the query has not specified which concept hierarchies to employ, default hierarchies are used. A graphical user interface may be designed to facilitate user specification of data mining tasks via such a data mining query language. In this way, the user can help guide the automated data mining process.

Hence, many ideas from data warehousing can be integrated with classification algorithms, such as decision tree induction, in order to facilitate data mining. Attribute-oriented induction employs concept hierarchies to generalize data to multiple abstraction levels, and can be integrated with classification methods in order to perform multilevel mining. Data can be stored in multidimensional data cubes to allow quick access to aggregate data values. Finally, a data mining query language can be used to assist users in interactive data mining.
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Each data sample is represented by an $n$-dimensional feature vector, $X = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the sample from $n$ attributes, respectively $A_1, A_2, \ldots, A_n$.

2. Suppose that there are $m$ classes, $C_1, C_2, \ldots, C_m$. Given an unknown data sample, $X$ (i.e., having no class label), the classifier will predict that $X$ belongs to the class having the highest posterior probability, conditioned on $X$. That is, the naive Bayesian classifier assigns an unknown sample $X$ to the class $C_i$ if and only if

$$P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i.$$

Thus we maximize $P(C_i|X)$. The class $C_i$ for which $P(C_i|X)$ is maximized is called the maximum posteriori hypothesis. By Bayes theorem (Equation 7.4),

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}. \qquad (7.5)$$

3. As $P(X)$ is constant for all classes, only $P(X|C_i)P(C_i)$ need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., $P(C_1) = P(C_2) = \cdots = P(C_m)$, and we would therefore maximize $P(X|C_i)$. Otherwise, we maximize $P(X|C_i)P(C_i)$. Note that the class prior probabilities may be estimated by $P(C_i) = s_i/s$, where $s_i$ is the number of training samples of class $C_i$, and $s$ is the total number of training samples.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute $P(X|C_i)$. In order to reduce computation in evaluating $P(X|C_i)$, the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample, i.e., that there are no dependence relationships among the attributes. Thus,

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i). \qquad (7.6)$$

The probabilities $P(x_1|C_i), P(x_2|C_i), \ldots, P(x_n|C_i)$ can be estimated from the training samples, where:

(a) If $A_k$ is categorical, then $P(x_k|C_i) = s_{ik}/s_i$, where $s_{ik}$ is the number of training samples of class $C_i$ having the value $x_k$ for $A_k$, and $s_i$ is the number of training samples belonging to $C_i$.

(b) If $A_k$ is continuous-valued, then the attribute is assumed to have a Gaussian distribution. Therefore,

$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}}\, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}, \qquad (7.7)$$

where $g(x_k, \mu_{C_i}, \sigma_{C_i})$ is the Gaussian (normal) density function for attribute $A_k$, while $\mu_{C_i}$ and $\sigma_{C_i}$ are the mean and standard deviation, respectively, given the values for attribute $A_k$ for training samples of class $C_i$.

5. In order to classify an unknown sample $X$, $P(X|C_i)P(C_i)$ is evaluated for each class $C_i$. Sample $X$ is then assigned to the class $C_i$ if and only if

$$P(X|C_i)P(C_i) > P(X|C_j)P(C_j) \quad \text{for } 1 \le j \le m,\ j \ne i.$$

In other words, it is assigned to the class, $C_i$, for which $P(X|C_i)P(C_i)$ is the maximum.
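For the continuous-valued case in step 4(b), the Gaussian density of Equation 7.7 is straightforward to evaluate. The sketch below is illustrative only: the function names, the use of the population (divide-by-n) variance when estimating the standard deviation, and the sample ages are assumptions, not details given in the text.

```python
import math


def gaussian_density(x, mean, std):
    """Equation 7.7: normal density with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)


def estimate_mean_std(values):
    """Mean and standard deviation of an attribute over the samples of one class.

    Assumption: the population variance (dividing by n) is used here.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


# Hypothetical ages of the training samples belonging to one class C_i.
ages_in_class = [25.0, 32.0, 38.0, 41.0, 45.0]
mean, std = estimate_mean_std(ages_in_class)

# P(age = 35 | C_i) under the Gaussian assumption.
likelihood = gaussian_density(35.0, mean, std)
print(mean, std, likelihood)
```

The resulting value plays the role of $P(x_k|C_i)$ in the product of Equation 7.6, alongside the frequency-based estimates used for categorical attributes.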
In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case owing to inaccuracies in the assumptions made for its use, such as class conditional independence, and the lack of available probability data. However, various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains.

Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers which do not explicitly use Bayes theorem. For example, under certain assumptions, it can be shown that many neural network and curve fitting algorithms output the maximum posteriori hypothesis, as does the naive Bayesian classifier.

Example 7.4 Predicting a class label using naive Bayesian classification. We wish to predict the class label of an unknown sample using naive Bayesian classification, given the same training data as in Example 7.2 for decision tree induction. The training data are in Table 7.1. The data samples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely {yes, no}). Let $C_1$ correspond to the class buys_computer = yes and $C_2$ correspond to buys_computer = no. The unknown sample we wish to classify is

X = (age = "<30", income = medium, student = yes, credit_rating = fair).
We need to maximize $P(X|C_i)P(C_i)$, for $i = 1, 2$. $P(C_i)$, the prior probability of each class, can be computed based on the training samples:

P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

To compute $P(X|C_i)$, for $i = 1, 2$, we compute the following conditional probabilities:

P(age = "<30" | buys_computer = yes) = 2/9 = 0.222
P(age = "<30" | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Using the above probabilities, we obtain

P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = yes for sample X.
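The computation in Example 7.4 can be reproduced with exact fractions, using only the priors and conditional probabilities stated in the example. The dictionary layout and the `score` function are illustrative assumptions.

```python
from fractions import Fraction as F

# Class priors and conditional probabilities quoted in Example 7.4 for the sample
# X = (age = "<30", income = medium, student = yes, credit_rating = fair).
priors = {"yes": F(9, 14), "no": F(5, 14)}
conditionals = {
    "yes": {"age": F(2, 9), "income": F(4, 9), "student": F(6, 9), "credit_rating": F(6, 9)},
    "no": {"age": F(3, 5), "income": F(2, 5), "student": F(1, 5), "credit_rating": F(2, 5)},
}


def score(label):
    """P(X | C_i) * P(C_i) under class conditional independence (Equation 7.6)."""
    product = priors[label]
    for probability in conditionals[label].values():
        product *= probability
    return product


scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)
print(prediction, {label: round(float(s), 3) for label, s in scores.items()})
```

Working with fractions avoids accumulating the rounding of the three-decimal intermediate values shown in the example; the final scores still round to 0.028 and 0.007.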
18
a) FamilyHistory Smoker
PositiveXRay
Dyspnea
Figure 7.8: a A simple Bayesian belief network; b The conditional probability table for the values of the variable LungCancer LC showing each possible combination of the values of its parent nodes, Family History FH and Smoker S. A belief network is de ned by two components. The rst is a directed acyclic graph, where each node represents a random variable, and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendent of Y . Each variable is conditionally independent of its nondescendents in the graph, given its parents. The variables may be discrete or continuous-valued. They may correspond to actual attributes given in the data, or to hidden variables" believed to form a relationship such as medical syndromes in the case of medical data. Figure 7.8a shows a simple belief network, adapted from Russell et al. 1995a for six Boolean variables. The arcs allow a representation of causal knowledge. For example, having lung cancer is in uenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. Furthermore, the arcs also show that the variable LungCancer is conditionally independent of Emphysema, given its parents, FamilyHistory and Smoker. This means that once the values of FamilyHistory and Smoker are known, then the variable Emphysema does not provide any additional information regarding LungCancer. The second component de ning a belief network consists of one conditional probability table CPT for each variable. The CPT for a variable Z speci es the conditional distribution PZ jParentsZ, where P arentsZ are the parents of Z. Figure 7.8b showns a CPT for LungCancer. The conditional probability for each value of LungCancer is given for each possible combination of values of its parents. 
For instance, from the upper leftmost and bottom rightmost entries, respectively, we see that $P(LungCancer = Yes \mid FamilyHistory = Yes, Smoker = Yes) = 0.8$, and $P(LungCancer = No \mid FamilyHistory = No, Smoker = No) = 0.9$.

The joint probability of any tuple $(z_1, \ldots, z_n)$ corresponding to the variables or attributes $Z_1, \ldots, Z_n$ is computed by

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid Parents(Z_i)), \qquad (7.8)$$

where the values for $P(z_i \mid Parents(Z_i))$ correspond to the entries in the CPT for $Z_i$.

A node within the network can be selected as an "output" node, representing a class label attribute. There may be more than one output node. Inference algorithms for learning can be applied on the network. The classification process, rather than returning a single class label, can return a probability distribution for the class label attribute, i.e., predicting the probability of each class.
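To make the product in Equation 7.8 concrete, here is a small sketch restricted to three of the network's variables (FamilyHistory, Smoker, LungCancer). Only two CPT entries are stated in the text (0.8, and 0.9 for the complementary "No" case); the priors and the two mixed CPT entries below are hypothetical values chosen purely for illustration.

```python
# Priors for the two parent nodes: HYPOTHETICAL values, not from the text.
P_FH = 0.1   # P(FamilyHistory = yes)
P_S = 0.3    # P(Smoker = yes)

# P(LungCancer = yes | FamilyHistory, Smoker), keyed by (fh, s).
# (True, True) -> 0.8 and (False, False) -> 0.1 (= 1 - 0.9) follow from the
# two CPT entries quoted in the text; the two mixed entries are HYPOTHETICAL.
CPT_LC = {
    (True, True): 0.8,
    (True, False): 0.5,   # hypothetical
    (False, True): 0.7,   # hypothetical
    (False, False): 0.1,
}


def bernoulli(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(fh, s, lc):
    """P(fh, s, lc) as the product of each node's CPT entry given its parents."""
    return (bernoulli(P_FH, fh)
            * bernoulli(P_S, s)
            * bernoulli(CPT_LC[(fh, s)], lc))


def lung_cancer_distribution(fh, s):
    """Class distribution for the output node LungCancer, given its parents."""
    return {"yes": CPT_LC[(fh, s)], "no": 1.0 - CPT_LC[(fh, s)]}
```

Summing `joint` over all eight Boolean combinations gives 1, which is a quick sanity check that the tables form a valid distribution.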
The probability in the right-hand side of Equation 7.9 is to be calculated for each training sample $X_d$ in $S$. For brevity, let's refer to this probability simply as p. When the variables represented by $Y_i$ and $U_i$ are hidden for some $X_d$, then the corresponding probability p can be computed from the observed variables of the sample using standard algorithms for Bayesian network inference (such as those available in the commercial software package Hugin).

2. Take a small step in the direction of the gradient: The weights are updated by

$$w_{ijk} \leftarrow w_{ijk} + (l)\,\frac{\partial \ln P(S \mid H)}{\partial w_{ijk}}, \qquad (7.10)$$

where l is the learning rate representing the step size, and $\frac{\partial \ln P(S \mid H)}{\partial w_{ijk}}$ is computed from Equation 7.9. The learning rate is set to a small constant.

3. Renormalize the weights: Because the weights $w_{ijk}$ are probability values, they must be between 0 and 1.0, and $\sum_j w_{ijk}$ must equal 1 for all i, k. These criteria are achieved by renormalizing the weights after they have been updated by Equation 7.10.
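Steps 2 and 3 can be sketched for a single CPT column, that is, for the weights $w_{ijk}$ with i and k fixed and j ranging over the values of the variable. The gradient is taken as an input here rather than computed from Equation 7.9. The function name, the default learning rate, and the small positive floor applied before renormalizing are assumptions of this sketch, not part of the method as described.

```python
def step_and_renormalize(weights, gradient, l=0.01):
    """Gradient step (Equation 7.10) followed by renormalization.

    weights[j]  holds w_ijk for one fixed (i, k).
    gradient[j] holds d ln P(S|H) / d w_ijk for the same (i, k).
    """
    stepped = [w + l * g for w, g in zip(weights, gradient)]
    # ASSUMPTION: clip to a tiny positive floor so that a large negative
    # gradient cannot leave a weight below zero before renormalizing.
    floor = 1e-9
    stepped = [max(w, floor) for w in stepped]
    total = sum(stepped)
    # After division the weights lie in [0, 1] and sum to 1.
    return [w / total for w in stepped]


updated = step_and_renormalize([0.8, 0.2], [1.0, 4.0], l=0.05)
```

In the call above the stepped weights are 0.85 and 0.40, which renormalize to 0.68 and 0.32.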
Several algorithms exist for learning the network structure from the training data given observable variables. The problem is one of discrete optimization. For solutions, please see the bibliographic notes at the end of this chapter.
Figure 7.9: A multilayer feed-forward neural network: A training sample, $X = (x_1, x_2, \ldots, x_i)$, is fed to the input layer. Weighted connections exist between each layer, where $w_{ij}$ denotes the weight from a unit j in one layer to a unit i in the previous layer.

speaking, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units.

Neural networks involve long training times, and are therefore more suitable for applications where this is feasible. They require a number of parameters which are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.

Advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute towards the usefulness of neural networks for classification in data mining.

The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s. In Section 7.5.1 you will learn about multilayer feed-forward networks, the type of neural network on which the backpropagation algorithm performs. Section 7.5.2 discusses defining a network topology. The backpropagation algorithm is described in Section 7.5.3. Rule extraction from trained neural networks is discussed in Section 7.5.4.
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. An example of such a network is shown in Figure 7.9. The inputs correspond to the attributes measured for each training sample. The inputs are fed simultaneously into a layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of "neuron-like" units, known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given samples.

The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer neural network shown in Figure 7.9 has two layers of output units. Therefore, we say that it is a two-layer neural network. Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights cycle back to an input unit or to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next forward layer.

Multilayer feed-forward networks of linear threshold functions, given enough hidden units, can closely approximate any function.
7.5.3 Backpropagation
"How does backpropagation work?" Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, i.e., from the output layer, through each hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized in Figure 7.10. Each step is described below.

Initialize the weights. The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.

Each training sample, X, is processed by the following steps.

Propagate the inputs forward. In this step, the net input and output of each unit in the hidden and output layers are computed. First, the training sample is fed to the input layer of the network. The net input to each unit in the hidden and output layers is then computed as a linear combination of its inputs. To help illustrate this, a hidden layer or output layer unit is shown in Figure 7.11. The inputs to the unit are, in fact, the outputs of the units connected to it in the previous layer. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output layer, the net input, $I_j$, to unit j is:

$$I_j = \sum_i w_{ij} O_i + \theta_j \qquad (7.11)$$

where $w_{ij}$ is the weight of the connection from unit i in the previous layer to unit j; $O_i$ is the output of unit i from the previous layer; and $\theta_j$ is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input, and then applies an activation function to it, as illustrated in Figure 7.11. The function symbolizes the activation of the neuron represented by the unit. The logistic, or sigmoid, function is used. Given the net input $I_j$ to unit j, then $O_j$, the output of unit j, is computed as:

$$O_j = \frac{1}{1 + e^{-I_j}} \qquad (7.12)$$
This function is also referred to as a squashing function, since it maps a large input domain onto the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.

Algorithm 7.5.1 (Backpropagation) Neural network learning for classification, using the backpropagation algorithm.
Input: The training samples, samples; the learning rate, l; a multilayer feed-forward network, network.
Output: A neural network trained to classify the samples.
Method:

    Initialize all weights and biases in network;
    while terminating condition is not satisfied {
        for each training sample X in samples {
            // Propagate the inputs forward:
            for each hidden or output layer unit j
                I_j = sum_i w_ij O_i + theta_j;           // compute the net input of unit j
            for each hidden or output layer unit j
                O_j = 1 / (1 + e^(-I_j));                 // compute the output of each unit j
            // Backpropagate the errors:
            for each unit j in the output layer
                Err_j = O_j (1 - O_j)(T_j - O_j);         // compute the error
            for each unit j in the hidden layers
                Err_j = O_j (1 - O_j) sum_k Err_k w_jk;   // compute the error
            for each weight w_ij in network {
                delta_w_ij = (l) Err_j O_i;               // weight increment
                w_ij = w_ij + delta_w_ij; }               // weight update
            for each bias theta_j in network {
                delta_theta_j = (l) Err_j;                // bias increment
                theta_j = theta_j + delta_theta_j; }      // bias update
        } }

Figure 7.10: Backpropagation algorithm.

Backpropagate the error. The error is propagated backwards by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error $Err_j$ is computed by:

$$Err_j = O_j (1 - O_j)(T_j - O_j) \qquad (7.13)$$
where $O_j$ is the actual output of unit j, and $T_j$ is the true output, based on the known class label of the given training sample. Note that $O_j (1 - O_j)$ is the derivative of the logistic function.

To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is:

$$Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk} \qquad (7.14)$$

where $w_{jk}$ is the weight of the connection from unit j to a unit k in the next higher layer, and $Err_k$ is the error of unit k.

The weights and biases are updated to reflect the propagated errors. Weights are updated by Equations 7.15 and 7.16 below, where $\Delta w_{ij}$ is the change in weight $w_{ij}$.

$$\Delta w_{ij} = (l)\,Err_j\,O_i \qquad (7.15)$$
$$w_{ij} = w_{ij} + \Delta w_{ij} \qquad (7.16)$$

Figure 7.11: A hidden or output layer unit: The inputs are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias associated with the unit. A nonlinear activation function is applied to the net input.
"What is the 'l' in Equation 7.15?" The variable l is the learning rate, a constant typically having a value between 0 and 1.0. Backpropagation learns using a method of gradient descent to search for a set of weights which can model the given classification problem so as to minimize the mean squared distance between the network's class predictions and the actual class label of the samples. The learning rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution), and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far.

Biases are updated by Equations 7.17 and 7.18 below, where $\Delta \theta_j$ is the change in bias $\theta_j$.

$$\Delta \theta_j = (l)\,Err_j \qquad (7.17)$$
$$\theta_j = \theta_j + \Delta \theta_j \qquad (7.18)$$

Note that here we are updating the weights and biases after the presentation of each sample. This is referred to as case updating. Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all of the samples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of backpropagation employs epoch updating, yet in practice, case updating is more common since it tends to yield more accurate results.

Terminating condition. Training stops when either
1. all $\Delta w_{ij}$ in the previous epoch were so small as to be below some specified threshold, or
2. the percentage of samples misclassified in the previous epoch is below some threshold, or
3. a prespecified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights will converge.
Example 7.5 Sample calculations for learning by the backpropagation algorithm. Figure 7.12 shows a multilayer feed-forward neural network. The initial weight and bias values of the network are given in Table 7.3, along with the first training sample, X = (1, 0, 1).

    x1   x2   x3   w14   w15   w24   w25   w34   w35   w46   w56   theta4   theta5   theta6
    1    0    1    0.2   -0.3  0.4   0.1   -0.5  0.2   -0.3  -0.2  -0.4     0.2      0.1

Table 7.3: Initial input, weight, and bias values.

This example shows the calculations for backpropagation, given the first training sample, X. The sample is fed into the network, and the net input and output of each unit are computed. These values are shown in Table 7.4.

    Unit j   Net input, I_j                                Output, O_j
    4        0.2 + 0 - 0.5 - 0.4 = -0.7                    1/(1 + e^0.7) = 0.33
    5        -0.3 + 0 + 0.2 + 0.2 = 0.1                    1/(1 + e^-0.1) = 0.52
    6        (-0.3)(0.33) - (0.2)(0.52) + 0.1 = -0.103     1/(1 + e^0.103) = 0.47

Table 7.4: The net input and output calculations.

The error of each unit is computed and propagated backwards. The error values are shown in Table 7.5. The weight and bias updates are shown in Table 7.6.

Several variations and alternatives to the backpropagation algorithm have been proposed for classification in neural networks. These may involve the dynamic adjustment of the network topology, and of the learning rate or other parameters, or the use of different error functions.
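The forward and backward passes of Example 7.5 can be replayed directly from the values in Table 7.3 using Equations 7.11 to 7.18. This is a sketch under two assumptions: the learning rate 0.9 is read off the factor that appears in Table 7.6, and the target output T = 1 for the class of X is not stated in this excerpt and is assumed here. Because the target is an assumption, the resulting errors and updates need not match the tabulated entries.

```python
import math


def logistic(net_input):
    """Logistic (sigmoid) activation, Equation 7.12."""
    return 1.0 / (1.0 + math.exp(-net_input))


# Initial values from Table 7.3. Units 1-3 are inputs, 4-5 hidden, 6 output.
outputs = {1: 1.0, 2: 0.0, 3: 1.0}          # the sample X = (1, 0, 1)
weights = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
           (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
biases = {4: -0.4, 5: 0.2, 6: 0.1}
l = 0.9        # learning rate, as it appears in Table 7.6
target = 1.0   # ASSUMPTION: true output T for unit 6 (not given in the text)

# Propagate the inputs forward (Equations 7.11 and 7.12).
for j in (4, 5, 6):
    net = sum(w * outputs[i] for (i, k), w in weights.items() if k == j)
    outputs[j] = logistic(net + biases[j])

# Backpropagate the errors (Equations 7.13 and 7.14).
errors = {6: outputs[6] * (1 - outputs[6]) * (target - outputs[6])}
for j in (4, 5):
    errors[j] = outputs[j] * (1 - outputs[j]) * errors[6] * weights[(j, 6)]

# Case updating of weights and biases (Equations 7.15 to 7.18).
new_weights = {(i, j): w + l * errors[j] * outputs[i]
               for (i, j), w in weights.items()}
new_biases = {j: b + l * errors[j] for j, b in biases.items()}
```

Keeping the old weights in `weights` while building `new_weights` ensures that the hidden-unit errors are computed from the weights used in the forward pass, as Equation 7.14 requires.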
Table 7.5: Calculation of the error at each node.

    Weight or bias   New value
    w46              -0.3 + (0.9)(0.495)(0.33) = -0.153
    w56              -0.2 + (0.9)(0.495)(0.52) = 0.032
    w14              0.2 + (0.9)(-0.022)(1) = 0.180
    w15              -0.3 + (0.9)(0.037)(1) = -0.267
    w24              0.4 + (0.9)(-0.022)(0) = 0.4
    w25              0.1 + (0.9)(0.037)(0) = 0.1
    w34              -0.5 + (0.9)(-0.022)(1) = -0.520
    w35              0.2 + (0.9)(0.037)(1) = 0.233
    theta6           0.1 + (0.9)(0.495) = 0.546
    theta5           0.2 + (0.9)(0.037) = 0.233
    theta4           -0.4 + (0.9)(-0.022) = -0.420

Table 7.6: Calculations for weight and bias updating.

in extracting the knowledge embedded in trained neural networks and in representing that knowledge symbolically. Methods include extracting rules from networks and sensitivity analysis.

Various algorithms for the extraction of rules have been proposed. The methods typically impose restrictions regarding procedures used in training the given neural network, the network topology, and the discretization of input values.

Fully connected networks are difficult to articulate. Hence, often, the first step towards extracting rules from neural networks is network pruning. This consists of removing weighted links that do not result in a decrease in the classification accuracy of the given network.

Once the trained network has been pruned, some approaches will then perform link, unit, or activation value clustering. In one method, for example, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network (Figure 7.13). The combinations of these activation values for each hidden unit are analyzed. Rules are derived relating combinations of activation values with corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input and hidden unit layers. Finally, the two sets of rules may be combined to form IF-THEN rules. Other algorithms may derive rules of other forms, including M-of-N rules (where M out of a given N conditions in the rule antecedent must be true in order for the rule consequent to be applied), decision trees with M-of-N tests, fuzzy rules, and finite automata.

Sensitivity analysis is used to assess the impact that a given input variable has on a network output. The input to the variable is varied while the remaining input variables are fixed at some value. Meanwhile, changes in the network output are monitored. The knowledge gained from this form of analysis can be represented in rules such as "IF X decreases 5% THEN Y increases 8%".
[Network diagram: input nodes I_1 to I_7, hidden nodes H_1 to H_3, output nodes O_1 and O_2.]

Identify sets of common activation values for each hidden node, H_i:
    for H_1: (-1, 0, 1)
    for H_2: (0, 1)
    for H_3: (-1, 0.24, 1)

Derive rules relating common activation values with output nodes, O_j:
    IF (a_2 = 0 AND a_3 = -1) OR
       (a_1 = -1 AND a_2 = 1 AND a_3 = -1) OR
       (a_1 = -1 AND a_2 = 0 AND a_3 = 0.24)
    THEN O_1 = 1, O_2 = 0
    ELSE O_1 = 0, O_2 = 1

Derive rules relating input nodes, I_i, to output nodes, O_j:
    IF (I_2 = 0 AND I_7 = 0) THEN a_2 = 0
    IF (I_4 = 1 AND I_6 = 1) THEN a_3 = -1
    IF (I_5 = 0) THEN a_3 = -1
    ...

Obtain rules relating inputs and output classes:
    IF (I_2 = 0 AND I_7 = 0 AND I_4 = 1 AND I_6 = 1) THEN class = 1
    IF (I_2 = 0 AND I_7 = 0 AND I_5 = 0) THEN class = 1
Figure 7.13: Rules can be extracted from training neural networks.

classification.

"How does associative classification work?" One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.

Let D be the training data, and Y be the set of all classes in D. The algorithm maps categorical attributes to consecutive positive integers. Continuous attributes are discretized and mapped accordingly. Each data sample d in D then is represented by a set of (attribute, integer-value) pairs called items, and a class label y. Let I be the set of all items in D. A class association rule (CAR) is of the form $condset \Rightarrow y$, where condset is a set of items ($condset \subseteq I$) and $y \in Y$. Such rules can be represented by ruleitems of the form $\langle condset, y \rangle$.

A CAR has confidence c if c% of the samples in D that contain condset belong to class y. A CAR has support s if s% of the samples in D contain condset and belong to class y. The support count of a condset (condsupCount) is the number of samples in D that contain the condset. The rule count of a ruleitem (rulesupCount) is the number of samples in D that contain the condset and are labeled with class y. Ruleitems that satisfy minimum support are frequent ruleitems. If a set of ruleitems has the same condset, then the rule with the highest confidence is selected as the possible rule (PR) to represent the set. A rule satisfying minimum confidence is called accurate.

The first step of the associative classification method finds the set of all PRs that are both frequent and accurate. These are the class association rules (CARs). A ruleitem whose condset contains k items is a k-ruleitem. The algorithm employs an iterative approach, similar to that described for Apriori in Section 5.2.1, where ruleitems are processed rather than itemsets. The algorithm scans the database, searching for the frequent k-ruleitems, for k = 1, 2, ..., until all frequent k-ruleitems have been found. One scan is made for each value of k. The k-ruleitems are used to explore (k+1)-ruleitems. In the first scan of the database, the count support of 1-ruleitems is determined, and the frequent 1-ruleitems are retained. The frequent 1-ruleitems, referred to as the set $F_1$, are used to generate candidate 2-ruleitems, $C_2$. Knowledge of frequent ruleitem properties is used to prune candidate ruleitems that cannot be frequent. This knowledge states that all non-empty subsets of a frequent ruleitem must also be frequent. The database is scanned a second time to compute the support counts of each candidate, so that the frequent 2-ruleitems ($F_2$) can be determined. This process repeats, where $F_k$ is used to generate $C_{k+1}$, until no more frequent ruleitems are found. The frequent ruleitems that satisfy minimum confidence form the set of CARs. Pruning may be applied to this rule set.
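The counting behind support and confidence can be illustrated on a toy data set. The sketch below is not the full iterative algorithm: it computes condsupCount, rulesupCount, support, and confidence for one ruleitem, and finds the frequent 1-ruleitems in a single scan. The data set `D` and the function names are hypothetical, and support and confidence are returned as fractions rather than percentages.

```python
from collections import Counter

# Hypothetical training data D: each sample is (set of items, class label),
# where an item is an (attribute, integer-value) pair.
D = [
    ({("A", 1), ("B", 1)}, "yes"),
    ({("A", 1), ("B", 2)}, "yes"),
    ({("A", 1), ("B", 2)}, "no"),
    ({("A", 2), ("B", 1)}, "no"),
]


def ruleitem_stats(D, condset, y):
    """Return (condsupCount, rulesupCount, support, confidence) for <condset, y>."""
    condset = frozenset(condset)
    condsup_count = sum(1 for items, _ in D if condset <= items)
    rulesup_count = sum(1 for items, label in D
                        if condset <= items and label == y)
    support = rulesup_count / len(D)
    confidence = rulesup_count / condsup_count if condsup_count else 0.0
    return condsup_count, rulesup_count, support, confidence


def frequent_1_ruleitems(D, min_support):
    """One scan of D: keep the 1-ruleitems whose support meets min_support."""
    counts = Counter((item, label) for items, label in D for item in items)
    return {ruleitem for ruleitem, n in counts.items()
            if n / len(D) >= min_support}


stats = ruleitem_stats(D, {("A", 1)}, "yes")
f1 = frequent_1_ruleitems(D, min_support=0.5)
```

Here the condset {(A, 1)} occurs in three samples, two of which are labeled "yes", so the ruleitem has support 0.5 and confidence 2/3.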
The second step of the associative classification method processes the generated CARs in order to construct the classifier. Since the total number of rule subsets that would be examined in order to determine the most accurate set of rules can be huge, a heuristic method is employed. A precedence ordering among rules is defined where a rule $r_i$ has greater precedence over a rule $r_j$ (i.e., $r_i \succ r_j$) if (1) the confidence of $r_i$ is greater than that of $r_j$, or (2) the confidences are the same, but $r_i$ has greater support, or (3) the confidences and supports of $r_i$ and $r_j$ are the same, but $r_i$ is generated earlier than $r_j$. In general, the algorithm selects a set of high precedence CARs to cover the samples in D. The algorithm requires slightly more than one pass over D in order to determine the final classifier. The classifier maintains the selected rules from high to low precedence order. When classifying a new sample, the first rule satisfying the sample is used to classify it. The classifier also contains a default rule, having lowest precedence, which specifies a default class for any new sample that is not satisfied by any other rule in the classifier.

In general, the above associative classification method was empirically found to be more accurate than C4.5 on several data sets. Each of the above two steps was shown to have linear scale-up.

Association rule mining based on clustering has also been applied to classification. The ARCS, or Association Rule Clustering System (Section 6.4.3), mines association rules of the form $A_{quan1} \wedge A_{quan2} \Rightarrow A_{cat}$, where $A_{quan1}$ and $A_{quan2}$ are tests on quantitative attribute ranges (where the ranges are dynamically determined), and $A_{cat}$ assigns a class label for a categorical attribute from the given training data. Association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for rectangular clusters of rules. In this way, adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS were applied to classification, and their accuracy was compared to C4.5. In general, ARCS is slightly more accurate when there are outliers in the data. The accuracy of ARCS is related to the degree of discretization used. In terms of scalability, ARCS requires a constant amount of memory, regardless of the database size. C4.5 has exponentially higher execution times than ARCS, requiring the entire database, multiplied by some factor, to fit entirely in main memory. Hence, association rule mining is an important strategy for generating accurate and scalable classifiers.
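The precedence ordering and the first-match classification used by the associative classifier can be sketched as follows. This only illustrates the ordering and the default rule; it does not implement the heuristic that selects which CARs cover D. The `Rule` class, its field names, and the example rules are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condset: frozenset
    label: str
    confidence: float
    support: float
    order: int  # generation order; a smaller value means generated earlier


def precedence_key(rule):
    """Sort key: higher confidence first, then higher support, then earlier rule."""
    return (-rule.confidence, -rule.support, rule.order)


def classify(sample_items, rules, default_label):
    """Apply the first rule, in precedence order, whose condset the sample contains."""
    for rule in sorted(rules, key=precedence_key):
        if rule.condset <= sample_items:
            return rule.label
    return default_label  # the default rule has lowest precedence


# Hypothetical rules for illustration only.
rules = [
    Rule(frozenset({"a"}), "x", confidence=0.90, support=0.20, order=0),
    Rule(frozenset({"a"}), "y", confidence=0.90, support=0.30, order=1),
    Rule(frozenset({"b"}), "z", confidence=0.95, support=0.10, order=2),
]
```

With these rules, a sample containing only item "a" is labeled "y", because the two matching rules tie on confidence and the second has greater support.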
7.7.1
Nearest neighbor classifiers are based on learning by analogy. The training samples are described by n-dimensional numeric attributes. Each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. When given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample. These k training samples are the k "nearest neighbors" of the unknown sample. "Closeness" is defined in terms of Euclidean distance, where the Euclidean distance between two points, $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, is:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}. \qquad (7.19)$$
The unknown sample is assigned the most common class among its k nearest neighbors. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in pattern space.

Nearest neighbor classifiers are instance-based since they store all of the training samples. They can incur expensive computational costs when the number of potential neighbors (i.e., stored training samples) with which to compare a given unlabeled sample is great. Therefore, efficient indexing techniques are required. Unlike decision tree induction and backpropagation, nearest neighbor classifiers assign equal weight to each attribute. This may cause confusion when there are many irrelevant attributes in the data.

Nearest neighbor classifiers can also be used for prediction, i.e., to return a real-valued prediction for a given unknown sample. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown sample.
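A brute-force version of the k-nearest neighbor scheme, covering both classification (majority vote) and prediction (averaging), might look like the sketch below. It scans every stored sample, so it has none of the indexing the text calls for; the function names and the default k are assumptions.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def k_nearest(train, query, k):
    """Return the k training samples (point, label) closest to query."""
    return sorted(train, key=lambda sample: dist(sample[0], query))[:k]


def knn_classify(train, query, k=3):
    """Assign the most common class among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_predict(train, query, k=3):
    """Return the average real-valued label of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)
```

For example, with training points (0, 0) and (0, 1) labeled "a" and three points near (5, 5) labeled "b", a query at (0.2, 0.1) with k = 3 receives two votes for "a" and one for "b".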
Case-based reasoning (CBR) classifiers are instance-based. Unlike nearest neighbor classifiers, which store training samples as points in Euclidean space, the samples or "cases" stored by CBR are complex symbolic descriptions. Business applications of CBR include problem resolution for customer service help desks, for example, where cases describe product-related diagnostic problems. CBR has also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-based reasoner will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs which are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving strategies in order to propose a feasible combined solution.

Challenges in case-based reasoning include finding a good similarity metric (e.g., for matching subgraphs), developing efficient techniques for indexing training cases, and methods for combining solutions.
Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given training set are described by two Boolean attributes, $A_1$ and $A_2$, and that there are two classes, $C_1$ and $C_2$. The rule "IF $A_1$ and not $A_2$ THEN $C_2$" can be encoded as the bit string "100", where the two leftmost bits represent attributes $A_1$ and $A_2$, respectively, and the rightmost bit represents the class. Similarly, the rule "IF not $A_1$ and not $A_2$ THEN $C_1$" can be encoded as "001". If an attribute has k values where k > 2, then k bits may be used to encode the attribute's values. Classes can be encoded in a similar fashion.

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.

The process of generating new populations based on prior populations of rules continues until a population P "evolves" where each rule in P satisfies a prespecified fitness threshold.

Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms.
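The two genetic operators can be shown on the bit-string rules of the example. This is a minimal sketch: single-point crossover is one common choice of crossover and is an assumption here, as are the function names and the mutation rate.

```python
import random


def crossover(rule_a, rule_b, point):
    """Single-point crossover: swap the substrings after position `point`."""
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]


def mutate(rule, positions):
    """Invert the bits of `rule` at the given positions."""
    bits = list(rule)
    for p in positions:
        bits[p] = "1" if bits[p] == "0" else "0"
    return "".join(bits)


def random_offspring(rule_a, rule_b, rng, mutation_rate=0.1):
    """Crossover at a random point, then mutate each bit with a small probability."""
    point = rng.randrange(1, len(rule_a))
    children = crossover(rule_a, rule_b, point)
    return tuple(
        mutate(child, [p for p in range(len(child)) if rng.random() < mutation_rate])
        for child in children
    )


# The two rules from the text: "100" and "001".
pair = crossover("100", "001", point=1)   # ("101", "000")
flipped = mutate("100", [2])              # "101"
children = random_offspring("100", "001", random.Random(0))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing fitness across generations.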
Figure 7.14: A rough set approximation of the set of samples of the class C using lower and upper approximation sets of C. The rectangular regions represent equivalence classes.

in terms of the available attributes. Rough sets can be used to approximately or "roughly" define such classes. A rough set definition for a given class C is approximated by two sets: a lower approximation of C and an upper approximation of C. The lower approximation of C consists of all of the data samples which, based on the knowledge of the attributes, are certain to belong to C without ambiguity. The upper approximation of C consists of all of the samples which, based on the knowledge of the attributes, cannot be described as not belonging to C. The lower and upper approximations for a class C are shown in Figure 7.14, where each rectangular region represents an equivalence class. Decision rules can be generated for each class. Typically, a decision table is used to represent the rules.

Rough sets can also be used for feature reduction (where attributes that do not contribute towards the classification of the given training data can be identified and removed), and relevance analysis (where the contribution or significance of each attribute is assessed with respect to the classification task). The problem of finding the minimal subsets (reducts) of attributes that can describe all of the concepts in the given data set is NP-hard. However, algorithms to reduce the computation intensity have been proposed. In one method, for example, a discernibility matrix is used which stores the differences between attribute values for each pair of data samples. Rather than searching on the entire training set, the matrix is instead searched to detect redundant attributes.
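The lower and upper approximations can be computed by grouping samples into equivalence classes, where two samples are equivalent when they agree on every chosen attribute. The sketch below assumes samples are given as (attribute dictionary, class label) pairs and returns sample indices; these representation choices and the function name are illustrative assumptions.

```python
from collections import defaultdict


def rough_approximations(samples, attributes, target_class):
    """Return (lower, upper) approximations of target_class as sets of sample indices.

    samples is a list of (attribute dict, class label) pairs. Samples with
    identical values on `attributes` fall into the same equivalence class.
    """
    blocks = defaultdict(list)
    for index, (attrs, _) in enumerate(samples):
        blocks[tuple(attrs[a] for a in attributes)].append(index)

    lower, upper = set(), set()
    for members in blocks.values():
        labels = [samples[i][1] for i in members]
        if all(label == target_class for label in labels):
            lower.update(members)   # certain to belong to the class
        if any(label == target_class for label in labels):
            upper.update(members)   # cannot be ruled out of the class
    return lower, upper


# Hypothetical data: one attribute "a" and two classes "C" and "D".
samples = [({"a": 1}, "C"), ({"a": 1}, "C"), ({"a": 2}, "C"),
           ({"a": 2}, "D"), ({"a": 3}, "D")]
lower, upper = rough_approximations(samples, ["a"], "C")
```

In this toy data, the samples with a = 2 carry both labels, so they appear in the upper approximation of C but not in the lower one.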
[Plot: fuzzy membership (0 to 1.0) versus income (10K to 70K), with membership curves including low, somewhat low, and medium.]

Figure 7.15: Fuzzy values for income.

The sums obtained above are combined into a value that is returned by the system. This process may be done by weighting each category by its truth sum and multiplying by the mean truth value of each category. The calculations involved may be more complex, depending on the complexity of the fuzzy membership graphs.

Fuzzy logic systems have been used in numerous areas for classification, including health care and finance.
7.8 Prediction
"What if we would like to predict a continuous value, rather than a categorical label?" The prediction of continuous values can be modeled by statistical techniques of regression. For example, we may like to develop a model to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one. For reasons of space, we cannot give a fully detailed treatment of regression. Instead, this section provides an intuitive introduction to the topic. By the end of this section, you will be familiar with the ideas of linear, multiple, and nonlinear regression, as well as generalized linear models.

Several software packages exist to solve regression problems. Examples include SAS (http://www.sas.com), SPSS (http://www.spss.com), and S-Plus (http://www.mathsoft.com).
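$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2} \qquad (7.22)$$

$$\alpha = \bar{y} - \beta \bar{x},$$

where $\bar{x}$ is the average of $x_1, x_2, \ldots, x_s$, and $\bar{y}$ is the average of $y_1, y_2, \ldots, y_s$. The coefficients $\alpha$ and $\beta$ often provide good approximations to otherwise complicated regression equations.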
    X (years experience)   Y (salary in $1000)
    3                      30
    8                      57
    9                      64
    13                     72
    3                      36
    6                      43
    11                     59
    21                     90
    1                      20
    16                     83
[Plot: Salary (in $1000) versus Years experience.]

Figure 7.16: Plot of the data in Table 7.7 for Example 7.6. Although the points do not fall on a straight line, the overall pattern suggests a linear relationship between X (years experience) and Y (salary).
Example 7.6 Linear regression using the method of least squares. Table 7.7 shows a set of paired data where X is the number of years of work experience of a college graduate and Y is the corresponding salary of the graduate. A plot of the data is shown in Figure 7.16, suggesting a linear relationship between the two variables, X and Y. We model the relationship between salary and the number of years of work experience with the equation \( Y = \alpha + \beta X \). Given the above data, we compute \( \bar{x} = 9.1 \) and \( \bar{y} = 55.4 \). Substituting these values into Equation (7.22), we get

\[ \beta = \frac{(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + \cdots + (16 - 9.1)(83 - 55.4)}{(3 - 9.1)^2 + (8 - 9.1)^2 + \cdots + (16 - 9.1)^2} = \frac{1269.6}{358.9} \approx 3.5 \]

\[ \alpha = 55.4 - (3.5)(9.1) \approx 23.6 \]

Thus, the equation of the least squares line is estimated by \( Y = 23.6 + 3.5X \). Using this equation, we can predict that the salary of a college graduate with, say, 10 years of experience is $58.6K.

Multiple regression is an extension of linear regression involving more than one predictor variable. It allows response variable Y to be modeled as a linear function of a multidimensional feature vector. An example of a multiple regression model based on two predictor attributes or variables, \( X_1 \) and \( X_2 \), is shown in Equation (7.24).

\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 \qquad (7.24) \]

The method of least squares can also be applied here to solve for \( \alpha \), \( \beta_1 \), and \( \beta_2 \).
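As a minimal sketch of the single-predictor case in Example 7.6, the following Python script computes the least squares slope and intercept directly from the sums of deviations. The function name `least_squares_line` and the variable names are illustrative assumptions, not part of the text. Because the script carries full precision, its intercept may differ in the last digit from a hand calculation that rounds the slope first.

```python
def least_squares_line(xs, ys):
    """Return (alpha, beta) for the least squares line Y = alpha + beta * X."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator and denominator of the slope estimate.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Paired data from Table 7.7: years of experience and salary in $1000.
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

alpha, beta = least_squares_line(years, salary)
print(f"Y = {alpha:.2f} + {beta:.2f}X")
print(f"predicted salary (in $1000) at 10 years: {alpha + beta * 10:.1f}")
```

The same deviations-based formulas do not extend directly to several predictors; that case is usually handled by solving a small linear system instead.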
Polynomial regression can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.

Example 7.7 Transformation of a polynomial regression model to a linear regression model. Consider a cubic polynomial relationship given by Equation (7.25).

\[ Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \qquad (7.25) \]

To convert this equation to linear form, we define new variables as shown in Equation (7.26).

\[ X_1 = X, \quad X_2 = X^2, \quad X_3 = X^3 \qquad (7.26) \]

Equation (7.25) can then be converted to linear form by applying the above assignments, resulting in the equation \( Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \), which is solvable by the method of least squares.

In Exercise 7, you are asked to find the transformations required to convert a nonlinear model involving a power function into a linear regression model. Some models are intractably nonlinear (such as the sum of exponential terms, for example) and cannot be converted to a linear model. For such cases, it may be possible to obtain least-squares estimates through extensive calculations on more complex formulae.
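The transformation in Example 7.7 can be sketched in Python as follows. Each sample X is expanded into the new variables X1 = X, X2 = X^2, X3 = X^3, and the resulting multiple regression is solved through the normal equations, which is one standard route to a least squares solution and is assumed here for illustration. The helper names and the noise-free sample data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_cubic(xs, ys):
    """Fit Y = alpha + b1*X + b2*X^2 + b3*X^3 as a linear model in X1, X2, X3."""
    # Each row is [1, X1, X2, X3] with X1 = X, X2 = X^2, X3 = X^3.
    rows = [[1.0, x, x ** 2, x ** 3] for x in xs]
    k = 4
    # Normal equations: (A^T A) w = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(ata, aty)


# Hypothetical noise-free data generated from Y = 1 + 2X - 0.5X^2 + 0.25X^3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x - 0.5 * x ** 2 + 0.25 * x ** 3 for x in xs]

alpha, b1, b2, b3 = fit_cubic(xs, ys)
print(alpha, b1, b2, b3)
```

Because the model is linear in the new variables, the same solver handles any multiple regression once its design rows are built; for larger or ill-conditioned problems a numerically sturdier method than the normal equations would normally be preferred.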
[Figure: diagram with the labels "derive classifier", "estimate accuracy", and "test set".]

Figure 7.18: Increasing classifier accuracy: Bagging and boosting each generate a set of classifiers, \( C_1, C_2, \ldots, C_T \). Voting strategies are used to combine the class predictions for a given unknown sample.
of data in large databases, it is not always reasonable to assume that all objects are uniquely classifiable. Rather, it is more probable to assume that each object may belong to more than one class. How then, can the accuracy of classifiers on large databases be measured? The accuracy measure is not appropriate, since it does not take into account the possibility of samples belonging to more than one class. Rather than returning a class label, it is useful to return a probability class distribution. Accuracy measures may then use a second guess heuristic whereby a class prediction is judged as correct if it agrees with the first or second most probable class. Although this does take into consideration, in some degree, the non-unique classification of objects, it is not a complete solution.
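The second guess heuristic can be illustrated with a short Python sketch. It assumes that each prediction is available as a dictionary mapping class labels to probabilities; this representation, the function name, and the sample class names are hypothetical choices made for the illustration.

```python
def second_guess_accuracy(class_distributions, true_labels):
    """Fraction of samples whose true label is the first or second most probable class."""
    if len(class_distributions) != len(true_labels):
        raise ValueError("one class distribution is needed per labeled sample")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for distribution, label in zip(class_distributions, true_labels):
        # Rank the class labels by decreasing probability and keep the top two.
        top_two = sorted(distribution, key=distribution.get, reverse=True)[:2]
        if label in top_two:
            correct += 1
    return correct / len(true_labels)


# Hypothetical class distributions returned by a classifier for three samples.
predictions = [
    {"excellent": 0.6, "fair": 0.3, "poor": 0.1},
    {"excellent": 0.2, "fair": 0.5, "poor": 0.3},
    {"excellent": 0.7, "fair": 0.2, "poor": 0.1},
]
labels = ["fair", "poor", "poor"]

print(second_guess_accuracy(predictions, labels))
```

In this sample, the first two predictions count as correct because the true label is the second most probable class, while the third does not, so two of the three samples are judged correct.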
7.10 Summary
Classification and prediction are two forms of data analysis which can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models continuous-valued functions.

Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher level concepts, or normalizing the data.

Predictive accuracy, computational speed, robustness, scalability, and interpretability are five criteria for the evaluation of classification and prediction methods.

ID3 and C4.5 are greedy algorithms for the induction of decision trees. Each algorithm uses an information theoretic measure to select the attribute tested for each non-leaf node in the tree. Pruning algorithms attempt to improve accuracy by removing tree branches reflecting noise in the data. Early decision tree algorithms typically assume that the data are memory resident, a limitation to data mining on large databases. Since then, several scalable algorithms have been proposed to address this issue, such as SLIQ, SPRINT, and RainForest. Decision trees can easily be converted to classification IF-THEN rules.

Naive Bayesian classification and Bayesian belief networks are based on Bayes' theorem of posterior probability. Unlike naive Bayesian classification (which assumes class conditional independence), Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

Backpropagation is a neural network algorithm for classification which employs a method of gradient descent. It searches for a set of weights which can model the data so as to minimize the mean squared distance between the network's class prediction and the actual class label of data samples. Rules may be extracted from trained neural networks in order to help improve the interpretability of the learned network.

Association mining techniques, which search for frequently occurring patterns in large databases, can be applied to and used for classification.

Nearest neighbor classifiers and case-based reasoning classifiers are instance-based methods of classification in that they store all of the training samples in pattern space. Hence, both require efficient indexing techniques.

In genetic algorithms, populations of rules "evolve" via operations of crossover and mutation until all rules within a population satisfy a specified threshold. Rough set theory can be used to approximately define classes that are not distinguishable based on the available attributes. Fuzzy set approaches replace "brittle" threshold cutoffs for continuous-valued attributes with degree of membership functions.

Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables.

Data warehousing techniques, such as attribute-oriented induction and the use of multidimensional data cubes, can be integrated with classification methods in order to allow fast multilevel mining. Classification tasks may be specified using a data mining query language, promoting interactive data mining.

Stratified k-fold cross-validation is a recommended method for estimating classifier accuracy. Bagging and boosting methods can be used to increase overall classification accuracy by learning and combining a series of individual classifiers.
Exercises
1. Table 7.8 consists of training data from an employee database. The data have been generalized. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row.

department   status   age     salary    count
sales        senior   31-35   45-50K    30
sales        junior   26-30   25-30K    40
sales        junior   31-35   30-35K    40
systems      junior   21-25   45-50K    20
systems      senior   31-35   65-70K     5
systems      junior   26-30   45-50K     3
systems      senior   41-45   65-70K     3
marketing    senior   36-40   45-50K    10
marketing    junior   31-35   40-45K     4
secretary    senior   46-50   35-40K     4
secretary    junior   26-30   25-30K     6

Table 7.8: Generalized relation from an employee database.

Let salary be the class label attribute.
(a) How would you modify the ID3 algorithm to take into consideration the count of each data tuple (i.e., of each row entry)?
(b) Use your modified version of ID3 to construct a decision tree from the given data.
(c) Given a data sample with the values "systems", "junior", and "20-24" for the attributes department, status, and age, respectively, what would a naive Bayesian classification of the salary for the sample be?
(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and output layers.
(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one iteration of the backpropagation algorithm given the training instance (sales, senior, 31-35, 45-50K). Indicate your initial weight values and the learning rate used.

2. Write an algorithm for k-nearest neighbor classification given k, and n, the number of attributes describing each sample.

3. What is a drawback of using a separate set of samples to evaluate pruning?

4. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?

5. ADD QUESTIONS ON OTHER CLASSIFICATION METHODS.

6. Table 7.9 shows the mid-term and final exam grades obtained for students in a database course.
(a) Plot the data. Do X and Y seem to have a linear relationship?
(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's mid-term grade in the course.
(c) Predict the final exam grade of a student who received an 86 on the mid-term exam.

X (mid-term exam)    Y (final exam)
72                   84
50                   63
81                   77
74                   78
94                   90
86                   75
59                   49
83                   79
65                   77
33                   52
88                   74
81                   90

Table 7.9: Mid-term and final exam grades.

7. Some nonlinear regression models can be converted to linear models by applying transformations to the predictor variables. Show how the nonlinear regression equation \( Y = \alpha X^{\beta} \) can be converted to a linear regression equation solvable by the method of least squares.

8. It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different classifiers modeled after the same data.
Bibliographic Notes
Classification from a machine learning perspective is described in several books, such as Weiss and Kulikowski [136], Michie, Spiegelhalter, and Taylor [88], Langley [67], and Mitchell [91]. Weiss and Kulikowski [136] compare classification and prediction methods from many different fields, in addition to describing practical techniques for the evaluation of classifier performance. Many of these books describe each of the basic methods of classification discussed in this chapter. Edited collections containing seminal articles on machine learning can be found in Michalski, Carbonell, and Mitchell [85, 86], Kodratoff and Michalski [63], Shavlik and Dietterich [123], and Michalski and Tecuci [87]. For a presentation of machine learning with respect to data mining applications, see Michalski, Bratko, and Kubat [84].

The C4.5 algorithm is described in a book by J. R. Quinlan [108]. The book gives an excellent presentation of many of the issues regarding decision tree induction, as does a comprehensive survey on decision tree induction by Murthy [94]. Other algorithms for decision tree induction include the predecessor of C4.5, ID3 (Quinlan [104]), CART (Breiman et al. [11]), FACT (Loh and Vanichsetakul [76]), QUEST (Loh and Shih [75]), and PUBLIC (Rastogi and Shim [111]). Incremental versions of ID3 include ID4 (Schlimmer and Fisher [120]) and ID5 (Utgoff [132]). In addition, INFERULE (Uthurusamy, Fayyad, and Spangler [133]) learns decision trees from inconclusive data. KATE (Manago and Kodratoff [80]) learns decision trees from complex structured data. Decision tree algorithms that address the scalability issue in data mining include SLIQ (Mehta, Agrawal, and Rissanen [81]), SPRINT (Shafer, Agrawal, and Mehta [121]), RainForest (Gehrke, Ramakrishnan, and Ganti [43]), and Kamber et al. [61]. Earlier approaches described include [16, 17, 18]. For a comparison of attribute selection measures for decision tree induction, see Buntine and Niblett [15], and Murthy [94]. For a detailed discussion on such measures, see Kononenko and Hong [65].

There are numerous algorithms for decision tree pruning, including cost complexity pruning (Breiman et al. [11]), reduced error pruning (Quinlan [105]), and pessimistic pruning (Quinlan [104]). PUBLIC (Rastogi and Shim [111]) integrates decision tree construction with tree pruning. MDL-based pruning methods can be found in Quinlan and Rivest [110], Mehta, Agrawal, and Rissanen [82], and Rastogi and Shim [111]. Other methods include Niblett and Bratko [96], and Hosking, Pednault, and Sudan [55]. For an empirical comparison of pruning methods, see Mingers [89], and Malerba, Floriana, and Semeraro [79].

For the extraction of rules from decision trees, see Quinlan [105, 108]. Rather than generating rules by extracting them from decision trees, it is also possible to induce rules directly from the training data. Rule induction algorithms include CN2 (Clark and Niblett [21]), AQ15 (Hong, Mozetic, and Michalski [54]), ITRULE (Smyth and Goodman [126]), FOIL (Quinlan [107]), and Swap-1 (Weiss and Indurkhya [134]). Decision trees, however, tend to be superior in terms of computation time and predictive accuracy. Rule refinement strategies which identify the most interesting rules among a given rule set can be found in Major and Mangano [78].

For descriptions of data warehousing and multidimensional data cubes, see Harinarayan, Rajaraman, and Ullman [48], and Berson and Smith [8], as well as Chapter 2 of this book. Attribute-oriented induction (AOI) is presented in Han and Fu [45], and summarized in Chapter 5. The integration of AOI with decision tree induction is proposed in Kamber et al. [61]. The precision or classification threshold described in Section 7.3.6 is used in Agrawal et al. [2] and Kamber et al. [61].

Thorough presentations of Bayesian classification can be found in Duda and Hart [32], a classic textbook on pattern recognition, as well as machine learning textbooks such as Weiss and Kulikowski [136] and Mitchell [91]. For an analysis of the predictive power of naive Bayesian classifiers when the class conditional independence assumption is violated, see Domingos and Pazzani [31]. Experiments with kernel density estimation for continuous-valued attributes, rather than Gaussian estimation, have been reported for naive Bayesian classifiers in John [59]. Algorithms for inference on belief networks can be found in Russell and Norvig [118] and Jensen [58]. The method of gradient descent, described in Section 7.4.4 for learning Bayesian belief networks, is given in Russell et al. [117]. The example given in Figure 7.8 is adapted from Russell et al. [117]. Alternative strategies for learning belief networks with hidden variables include the EM algorithm (Lauritzen [68]), and Gibbs sampling (York and Madigan [139]). Solutions for learning the belief network structure from training data given observable variables are proposed in [22, 14, 50].

The backpropagation algorithm was presented in Rumelhart, Hinton, and Williams [115]. Since then, many variations have been proposed involving, for example, alternative error functions (Hanson and Burr [47]), dynamic adjustment of the network topology (Fahlman and Lebiere [35], Le Cun, Denker, and Solla [70]), and dynamic adjustment of the learning rate and momentum parameters (Jacobs [56]). Other variations are discussed in Chauvin and Rumelhart [19]. Books on neural networks include [116, 49, 51, 40, 19, 9, 113]. Many books on machine learning, such as [136, 91], also contain good explanations of the backpropagation algorithm. There are several techniques for extracting rules from neural networks, such as [119, 42, 131, 40, 7, 77, 25, 69]. The method of rule extraction described in Section 7.5.4 is based on Lu, Setiono, and Liu [77]. Critiques of techniques for rule extraction from neural networks can be found in Andrews, Diederich, and Tickle [5], and Craven and Shavlik [26]. An extensive survey of applications of neural networks in industry, business, and science is provided in Widrow, Rumelhart, and Lehr [137].

The method of associative classification described in Section 7.6 was proposed in Liu, Hsu, and Ma [74]. ARCS was proposed in Lent, Swami, and Widom [73], and is also described in Chapter 6.

Nearest neighbor methods are discussed in many statistical texts on classification, such as Duda and Hart [32], and James [57]. Additional information can be found in Cover and Hart [24] and Fukunaga and Hummels [41]. References on case-based reasoning (CBR) include the texts [112, 64, 71], as well as [1]. For a survey of business applications of CBR, see Allen [4]. Examples of other applications include [6, 129, 138]. For texts on genetic algorithms, see [44, 83, 90]. Rough sets were introduced in Pawlak [97, 99]. Concise summaries of rough set theory in data mining include [141, 20]. Rough sets have been used for feature reduction and expert system design in many applications, including [98, 72, 128]. Algorithms to reduce the computation intensity in finding reducts have been proposed in [114, 125]. General descriptions of fuzzy logic can be found in [140, 8, 20].

There are many good textbooks which cover the techniques of regression. Examples include [57, 30, 60, 28, 52, 95, 3]. The book by Press et al. [101] and accompanying source code contain many statistical procedures, such as the method of least squares for both linear and multiple regression. Recent nonlinear regression models include projection pursuit and MARS (Friedman [39]). Log-linear models are also known in the computer science literature as multiplicative models. For log-linear models from a computer science perspective, see Pearl [100]. Regression trees (Breiman et al. [11]) are often comparable in performance with other regression methods, particularly when there exist many higher order dependencies among the predictor variables.

Methods for data cleaning and data transformation are discussed in Pyle [102], Kennedy et al. [62], Weiss and Indurkhya [134], and Chapter 3 of this book. Issues involved in estimating classifier accuracy are described in Weiss and Kulikowski [136]. The use of stratified 10-fold cross-validation for estimating classifier accuracy is recommended over the holdout, cross-validation, leave-one-out (Stone [127]), and bootstrapping (Efron and Tibshirani [33]) methods, based on a theoretical and empirical study by Kohavi [66]. Bagging is proposed in Breiman [10]. The boosting technique of Freund and Schapire [38] has been applied to several different classifiers, including decision tree induction (Quinlan [109]), and naive Bayesian classification (Elkan [34]).
The University of California at Irvine (UCI) maintains a Machine Learning Repository of data sets for the development and testing of classification algorithms. For information on this repository, see http://www.ics.uci.edu/~mlearn/MLRepository.html.

No classification method is superior over all others for all data types and domains. Empirical comparisons on classification methods include [106, 37, 135, 122, 130, 12, 23, 27, 92, 29].
Bibliography
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Comm., 7:39-52, 1994.
[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proc. 18th Int. Conf. Very Large Data Bases, pages 560-573, Vancouver, Canada, August 1992.
[3] A. Agresti. An Introduction to Categorical Data Analysis. John Wiley & Sons, 1996.
[4] B. P. Allen. Case-based reasoning: Business applications. Comm. ACM, 37:40-42, 1994.
[5] R. Andrews, J. Diederich, and A. B. Tickle. A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8, 1995.
[6] K. D. Ashley. Modeling Legal Argument: Reasoning with Cases and Hypotheticals. Cambridge, MA: MIT Press, 1990.
[7] S. Avner. Discovery of comprehensible symbolic rules in a neural network. In Intl. Symposium on Intelligence in Neural and Biological Systems, pages 64-67, 1995.
[8] A. Berson and S. J. Smith. Data Warehousing, Data Mining, and OLAP. New York: McGraw-Hill, 1997.
[9] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.
[10] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[11] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[12] C. E. Brodley and P. E. Utgoff. Multivariate versus univariate decision trees. In Technical Report 8, Department of Computer Science, Univ. of Massachusetts, 1992.
[13] W. Buntine. Graphical models for discovering knowledge. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 59-82. AAAI/MIT Press, 1996.
[14] W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225, 1994.
[15] W. L. Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8:75-85, 1992.
[16] J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD Thesis, University of Sydney, 1991.
[17] P. K. Chan and S. J. Stolfo. Experiments on multistrategy learning by metalearning. In Proc. 2nd Int. Conf. Information and Knowledge Management, pages 314-323, 1993.
[18] P. K. Chan and S. J. Stolfo. Metalearning for multistrategy and parallel learning. In Proc. 2nd Int. Workshop on Multistrategy Learning, pages 150-165, 1993.
[19] Y. Chauvin and D. Rumelhart. Backpropagation: Theory, Architectures, and Applications. Hillsdale, NJ: Lawrence Erlbaum Assoc., 1995.
[20] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Boston: Kluwer Academic Publishers, 1998.
[21] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.
[22] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
[23] V. Corruble, D. E. Brown, and C. L. Pittard. A comparison of decision classifiers with backpropagation neural networks for multimodal classification problems. Pattern Recognition, 26:953-961, 1993.
[24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Information Theory, 13:21-27, 1967.
[25] M. W. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1996.
[26] M. W. Craven and J. W. Shavlik. Using neural networks in data mining. Future Generation Computer Systems, 13:211-229, 1997.
[27] S. P. Curram and J. Mingers. Neural networks, decision tree induction and discriminant analysis: An empirical comparison. J. Operational Research Society, 45:440-450, 1994.
[28] J. L. Devore. Probability and Statistics for Engineering and the Sciences, 4th ed. Duxbury Press, 1995.
[29] T. G. Dietterich, H. Hild, and G. Bakiri. A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18:51-80, 1995.
[30] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
[31] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proc. 13th Intl. Conf. Machine Learning, pages 105-112, 1996.
[32] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley: New York, 1973.
[33] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.
[34] C. Elkan. Boosting and naive Bayesian learning. In Technical Report CS97-557, Dept. of Computer Science and Engineering, Univ. Calif. at San Diego, Sept. 1997.
[35] S. Fahlman and C. Lebiere. The cascade-correlation learning algorithm. In Technical Report CMU-CS-90-100, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 1990.
[36] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[37] D. H. Fisher and K. B. McKusick. An empirical comparison of ID3 and back-propagation. In Proc. 11th Intl. Joint Conf. AI, pages 788-793, San Mateo, CA: Morgan Kaufmann, 1989.
[38] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.
[39] J. Friedman. Multivariate adaptive regression. Annals of Statistics, 19:1-141, 1991.
[40] L. Fu. Neural Networks in Computer Intelligence. McGraw-Hill, 1994.
[41] K. Fukunaga and D. Hummels. Bayes error estimation using Parzen and k-nn procedure. In IEEE Trans. Pattern Analysis and Machine Learning, pages 634-643, 1987.
[42] S. I. Gallant. Neural Network Learning and Expert Systems. Cambridge, MA: MIT Press, 1993.
[43] J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
[44] D. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley: Reading, MA, 1989.
[45] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
[46] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, and O. R. Zaïane. DBMiner: A system for mining knowledge in large relational databases. In Proc. 1996 Int. Conf. Data Mining and Knowledge Discovery (KDD'96), pages 250-255, Portland, Oregon, August 1996.
[47] S. J. Hanson and D. J. Burr. Minkowski back-propagation: Learning in connectionist models with non-Euclidean error signals. In Neural Information Processing Systems, American Institute of Physics, 1988.
[48] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal, Canada, June 1996.
[49] R. Hecht-Nielsen. Neurocomputing. Reading, MA: Addison Wesley, 1990.
[50] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197, 1995.
[51] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison Wesley: Reading, MA, 1991.
[52] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics, 5th ed. Prentice Hall, 1995.
[53] L. B. Holder. Intermediate decision trees. In Proc. 14th Int. Joint Conf. Artificial Intelligence (IJCAI95), pages 1056-1062, Montreal, Canada, Aug. 1995.
[54] J. Hong, I. Mozetic, and R. S. Michalski. AQ15: Incremental learning of attribute-based descriptions from examples, the method and user's guide. In Report ISG 85-5, UIUCDCS-F-86-949, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.
[55] J. Hosking, E. Pednault, and M. Sudan. A statistical perspective on data mining. In Future Generation Computer Systems, pages 117-134, ???, 1997.
[56] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.
[57] M. James. Classification Algorithms. John Wiley, 1985.
[58] F. V. Jensen. An Introduction to Bayesian Networks. Springer Verlag, 1996.
[59] G. H. John. Enhancements to the Data Mining Process. Ph.D. Thesis, Computer Science Dept., Stanford University, 1997.
[60] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice Hall, 1992.
[61] M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.
[62] R. L. Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippman. Solving Data Mining Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice Hall, 1998.
[63] Y. Kodratoff and R. S. Michalski. Machine Learning, An Artificial Intelligence Approach, Vol. 3. Morgan Kaufmann, 1990.
[64] J. L. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.
[65] I. Kononenko and S. J. Hong. Attribute selection for modeling. In Future Generation Computer Systems, pages 181-195, ???, 1997.
[66] K. Koperski and J. Han. Discovery of spatial association rules in geographic information databases. In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pages 47-66, Portland, Maine, Aug. 1995.
[67] P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.
[68] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
[69] S. Lawrence, C. L. Giles, and A. C. Tsoi. Symbolic conversion, grammatical inference and rule extraction for foreign exchange rate prediction. In Y. Abu-Mostafa, A. S. Weigend, and P. N. Refenes, editors, Neural Networks in the Capital Markets. Singapore: World Scientific, 1997.
[70] Y. Le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, 2, pages San Mateo, CA: Morgan Kaufmann. Cambridge, MA: MIT Press, 1990.
[71] D. B. Leake. CBR in context: The present and future. In D. B. Leake, editor, Case-Based Reasoning: Experience, Lessons, and Future Directions, pages 3-30. Menlo Park, CA: AAAI Press, 1996.
[72] A. Lenarcik and Z. Piasta. Probabilistic rough classifiers with mixture of discrete and continuous variables. In T. Y. Lin and N. Cercone, editors, Rough Sets and Data Mining: Analysis for Imprecise Data, pages 373-383. Boston: Kluwer, 1997.
[73] B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), pages 220-231, Birmingham, England, April 1997.
[74] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 80-86, New York, NY, August 1998.
[75] W. Y. Loh and Y. S. Shih. Split selection methods for classification trees. Statistica Sinica, 7:815-840, 1997.
[76] W. Y. Loh and N. Vanichsetakul. Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association, 83:715-728, 1988.
[77] H. Lu, R. Setiono, and H. Liu. Neurorule: A connectionist approach to data mining. In Proc. 21st Int. Conf. Very Large Data Bases, pages 478-489, Zurich, Switzerland, Sept. 1995.
[78] J. Major and J. Mangano. Selecting among rules induced from a hurricane database. Journal of Intelligent Information Systems, 4:39-52, 1995.
[79] D. Malerba, E. Floriana, and G. Semeraro. A further comparison of simplification methods for decision tree induction. In D. Fisher and H. Lenz, editors, Learning from Data: AI and Statistics. Springer-Verlag, 1995.
[80] M. Manago and Y. Kodratoff. Induction of decision trees from complex structured data. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 289-306. AAAI/MIT Press, 1991.
[81] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
[82] M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Proc. 1st Intl. Conf. Knowledge Discovery and Data Mining (KDD95), Montreal, Canada, Aug. 1995.
[83] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer Verlag, 1992.
[84] R. S. Michalski, I. Bratko, and M. Kubat. Machine Learning and Data Mining: Methods and Applications. John Wiley & Sons, 1998.
85 R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning, An Artificial Intelligence Approach, Vol. 1. Morgan Kaufmann, 1983.
86 R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning, An Artificial Intelligence Approach, Vol. 2. Morgan Kaufmann, 1986.
87 R. S. Michalski and G. Tecuci. Machine Learning, A Multistrategy Approach, Vol. 4. Morgan Kaufmann, 1994.
88 D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
89 J. Mingers. An empirical comparison of pruning methods for decision-tree induction. Machine Learning, 4:227-243, 1989.
90 M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.
91 T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
92 D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994.
93 R. Mooney, J. Shavlik, G. Towell, and A. Grove. An experimental comparison of symbolic and connectionist learning algorithms. In Proc. 11th Int. Joint Conf. on Artificial Intelligence (IJCAI'89), pages 775-787, Detroit, MI, Aug. 1989.
94 S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2:345-389, 1998.
95 J. Neter, M. H. Kutner, C. J. Nachtsheim, and L. Wasserman. Applied Linear Statistical Models, 4th ed. Irwin: Chicago, 1996.
96 T. Niblett and I. Bratko. Learning decision rules in noisy domains. In M. A. Bramer, editor, Expert Systems '86: Research and Development in Expert Systems III, pages 25-34. British Computer Society Specialist Group on Expert Systems, Dec. 1986.
97 Z. Pawlak. Rough sets. Intl. J. Computer and Information Sciences, 11:341-356, 1982.
98 Z. Pawlak. On learning - rough set approach. In Lecture Notes 208, pages 197-227, New York: Springer-Verlag, 1986.
99 Z. Pawlak. Rough Sets, Theoretical Aspects of Reasoning about Data. Boston: Kluwer, 1991.
100 J. Pearl. Probabilistic Reasoning in Intelligent Systems. Palo Alto, CA: Morgan Kaufmann, 1988.
101 W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C, The Art of Scientific Computing. Cambridge, MA: Cambridge University Press, 1996.
102 D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
103 J. R. Quinlan. The effect of noise on concept learning. In Michalski et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. 2, pages 149-166. Morgan Kaufmann, 1986.
104 J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
105 J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27:221-234, 1987.
106 J. R. Quinlan. An empirical comparison of genetic and decision-tree classifiers. In Proc. 5th Intl. Conf. Machine Learning, pages 135-141, San Mateo, CA: Morgan Kaufmann, 1988.
107 J. R. Quinlan. Learning logic definitions from relations. Machine Learning, 5:139-166, 1990.
108 J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
109 J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), volume 1, pages 725-730, Portland, OR, Aug. 1996.
110 J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248, March 1989.
111 R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
112 C. Riesbeck and R. Schank. Inside Case-Based Reasoning. Hillsdale, NJ: Lawrence Erlbaum, 1989.
113 B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press, 1996.
114 S. Romanski. Operations on families of sets for exhaustive search, given a monotonic function. In Proc. 3rd Intl. Conf. on Data and Knowledge Bases, C. Beeri et al., eds., pages 310-322, Jerusalem, Israel, 1988.
115 D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
116 D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. Cambridge, MA: MIT Press, 1986.
117 S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In Proc. 14th Joint Int. Conf. on Artificial Intelligence (IJCAI'95), volume 2, pages 1146-1152, Montreal, Canada, Aug. 1995.
118 S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.
119 K. Saito and R. Nakano. Medical diagnostic expert system based on PDP model. In Proc. IEEE International Conf. on Neural Networks, volume 1, pages 225-262, San Mateo, CA, 1988.
120 J. C. Schlimmer and D. Fisher. A case study of incremental concept induction. In Proc. 5th Natl. Conf. Artificial Intelligence, pages 496-501, Philadelphia, PA: Morgan Kaufmann, 1986.
121 J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, Sept. 1996.
122 J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6:111-144, 1991.
123 J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
124 A. Skowron, L. Polowski, and J. Komorowski. Learning tolerance relations by boolean descriptors: Automatic feature extraction from data tables. In Proc. 4th Intl. Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, S. Tsumoto et al., eds., pages 11-17, University of Tokyo, 1996.
125 A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems. In R. Slowinski, editor, Intelligent Decision Support, Handbook of Applications and Advances of the Rough Set Theory, pages 331-362. Boston: Kluwer, 1992.
126 P. Smyth and R. M. Goodman. Rule induction using information theory. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 159-176. AAAI/MIT Press, 1991.
127 M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111-147, 1974.
128 R. Swiniarski. Rough sets and principal component analysis and their applications in feature extraction and selection, data model building and classification. In S. Pal and A. Skowron, editors, Fuzzy Sets, Rough Sets and Decision Making Processes. New York: Springer-Verlag, 1998.
129 K. Sycara, R. Guttal, J. Koning, S. Narasimhan, and D. Navinchandra. CADET: A case-based synthesis tool for engineering design. Int. Journal of Expert Systems, 4:157-188, 1992.
130 S. B. Thrun et al. The monk's problems: A performance comparison of different learning algorithms. In Technical Report CMU-CS-91-197, Department of Computer Science, Carnegie Mellon Univ., Pittsburgh, PA, 1991.
131 G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71-101, Oct. 1993.
132 P. E. Utgoff. An incremental ID3. In Proc. Fifth Int. Conf. Machine Learning, pages 107-120, San Mateo, California, 1988.
133 R. Uthurusamy, U. M. Fayyad, and S. Spangler. Learning useful rules from inconclusive data. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 141-157. AAAI/MIT Press, 1991.
134 S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.
135 S. M. Weiss and I. Kapouleas. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Proc. 11th Int. Joint Conf. Artificial Intelligence, pages 781-787, Detroit, MI, Aug. 1989.
136 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
137 B. Widrow, D. E. Rumelhart, and M. A. Lehr. Neural networks: Applications in industry, business and science. Comm. ACM, 37:93-105, 1994.
138 K. D. Wilson. Chemreg: Using case-based reasoning to support health and safety compliance in the chemical industry. AI Magazine, 19:47-57, 1998.
139 J. York and D. Madigan. Markov chain Monte Carlo methods for hierarchical Bayesian expert systems. In Cheesman and Oldford, pages 445-452, 1994.
140 L. A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965.
141 W. Ziarko. The discovery, analysis, and representation of data dependencies in databases. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 195-209. AAAI Press, 1991.