MODULE 3
Classification and Prediction
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends:
1. Classification
2. Prediction
• Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
• For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditure in dollars of potential customers on computer equipment given their income and occupation.
What is classification?
• Classification classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
• The following are examples of cases where the data analysis task is classification:
- A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
- A marketing manager at a company needs to analyze a customer with a given profile to determine whether that customer will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are "risky" or "safe" for the loan application data and "yes" or "no" for the marketing data.
What is prediction?
• Prediction models continuous-valued functions, i.e., it predicts unknown or missing values.
• The following is an example of a case where the data analysis task is prediction:
- Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are required to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
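A minimal sketch contrasting the two tasks, assuming scikit-learn is available; the tiny loan-risk and spending data sets below are illustrative placeholders, not data from this module.

    # Classification: predict a categorical label (e.g., loan application safe/risky).
    # Prediction (numeric prediction): estimate a continuous value (e.g., expenditure).
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # [income, years_employed] -> class label "safe"/"risky" (placeholder values)
    X_clf = [[50000, 10], [20000, 1], [80000, 15], [15000, 0]]
    y_clf = ["safe", "risky", "safe", "risky"]
    classifier = DecisionTreeClassifier().fit(X_clf, y_clf)
    print(classifier.predict([[60000, 8]]))        # -> a categorical label

    # [income, age] -> expected spending in dollars (placeholder values)
    X_reg = [[50000, 35], [20000, 22], [80000, 45], [15000, 19]]
    y_reg = [1200.0, 300.0, 2500.0, 150.0]
    predictor = DecisionTreeRegressor().fit(X_reg, y_reg)
    print(predictor.predict([[60000, 30]]))        # -> a continuous value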
How Does Classification Work?
• The data classification process includes two steps:
1. Building the classifier or model
2. Using the classifier for classification
Building the Classifier or Model
• This step is the learning step or learning phase.
• In this step the classification algorithm builds the classifier.
• The classifier is built from a training set made up of database tuples and their associated class labels.
• Each tuple that constitutes the training set is assumed to belong to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
• The data classification process, (a) Learning: training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.
Using the Classifier for Classification
• In this step, the classifier is used for classification. Here the test data are used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
• The data classification process, (b) Classification: test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Classification and Prediction Issues
• The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities:
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods:
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving measurements are used in the learning step (a small min-max normalization sketch follows this list).
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.
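A small sketch of min-max normalization as one way to carry out the scaling step described above; the income values and the target range [0, 1] are illustrative assumptions.

    # Min-max normalization: scale an attribute's values into a small specified range.
    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        span = old_max - old_min
        return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

    incomes = [12000, 35000, 58000, 98000]        # placeholder attribute values
    print(min_max_normalize(incomes))             # every value now falls within [0, 1]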
Comparison of Classification and Prediction Methods
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the level of understanding and insight that the classifier or predictor provides.
General Approach to Classification
• Data classification is a two-step process,
• consisting of a learning step (where a classification model is constructed) and
• a classification step (where the model is used to predict class labels for given data).
• In the first step, a classifier is built describing a predetermined set of data classes or concepts.
• This is the learning step (or training phase), where a classification algorithm builds the classifier by analysing or "learning from" a training set made up of database tuples and their associated class labels.
• A tuple, X, is represented by an n-dimensional attribute vector, X = {x1, x2, …, xn}.
• Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.
• The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis.
• In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.
• Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs).
• It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
• This first step of the classification process can also be viewed as the learning of a function, y = f(X), that can predict the associated class label y of a given tuple X.
• In this view, we wish to learn a mapping or function that separates the data classes.
• This mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
• In our example, the mapping is represented as classification rules that identify loan applications as being either safe or risky.
• The rules can be used to categorize future data tuples, as well as provide deeper insight into the data contents.
• They also provide a compressed data representation.
• In the second step, the model is used for classification.
• First, the predictive accuracy of the classifier is estimated.
• If we were to use the training set to measure the classifier's accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data (i.e., during learning it may incorporate particular anomalies of the training data that are not present in the general data set overall).
• Therefore, a test set is used, made up of test tuples and their associated class labels. They are independent of the training tuples, meaning that they were not used to construct the classifier.
• The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
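A brief sketch of the two-step process with an independent test set, assuming scikit-learn; the iris data set stands in for a class-labeled training set such as the loan data.

    # Learning step: build a classifier from training tuples.
    # Classification step: estimate accuracy on test tuples not used during training.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)             # stand-in for class-labeled tuples
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = DecisionTreeClassifier().fit(X_train, y_train)   # learning step
    y_pred = clf.predict(X_test)                           # classification step
    print("accuracy:", accuracy_score(y_test, y_pred))     # % of test tuples correctly classified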
Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• A typical decision tree is shown in the figure. It represents the concept buys_computer, that is, it predicts whether a customer at AllElectronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
• Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non-binary trees.
• "How are decision trees used for classification?"
• Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple (a small sketch of this appears at the end of this subsection).
• Decision trees can easily be converted to classification rules.
• The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
• Decision trees can handle multidimensional data.
• The learning and classification steps of decision tree induction are simple and fast.
• In general, decision tree classifiers have good accuracy.
• During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.
• When decision trees are built, many of the branches may reflect noise or outliers in the training data.
• Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
Decision Tree Algorithms
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3).
• In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book Classification and Regression Trees (CART), which described the generation of binary decision trees.
• ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner.
• Most algorithms for decision tree induction also follow a top-down approach, which starts with a training set of tuples and their associated class labels.
• The training set is recursively partitioned into smaller subsets as the tree is being built.
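Before looking at how such trees are induced, a minimal sketch of the classification step described above: a tuple's attribute values are tested against the tree and a path is traced from the root to a leaf. The nested-dictionary tree is an illustrative stand-in for the buys_computer tree in the figure (age at the root; middle_aged customers all classified yes), not the exact tree; the classify helper is hypothetical.

    # Illustrative buys_computer tree: internal nodes test an attribute,
    # branches are attribute values, leaves hold a class label.
    tree = {
        "age": {
            "youth":       {"student": {"yes": "yes", "no": "no"}},
            "middle_aged": "yes",
            "senior":      {"credit_rating": {"fair": "yes", "excellent": "no"}},
        }
    }

    def classify(node, tuple_x):
        """Trace a path from the root to a leaf; the leaf holds the class prediction."""
        if not isinstance(node, dict):        # leaf node: return its class label
            return node
        attribute = next(iter(node))          # attribute tested at this internal node
        branch = node[attribute][tuple_x[attribute]]
        return classify(branch, tuple_x)

    X = {"age": "senior", "credit_rating": "fair"}
    print(classify(tree, X))                  # -> "yes"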
The Decision Tree Algorithm
• The algorithm is called with three parameters: D, attribute_list, and Attribute_selection_method.
• D is a data partition: initially, it is the complete set of training tuples and their associated class labels.
• The parameter attribute_list is a list of attributes describing the tuples.
• Attribute_selection_method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class.
• It uses an attribute selection measure such as information gain or the Gini index.
• The tree starts as a single node, N, representing the training tuples in D (step 1).
• If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class (steps 2 and 3).
• Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion.
• The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes (step 6).
• The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test.
• More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split-point or a splitting subset.
• The resulting partitions at each branch should be as "pure" as possible.
• A partition is pure if all the tuples in it belong to the same class. In other words, if we split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion, we hope for the resulting partitions to be as pure as possible.
• The node N is labeled with the splitting criterion, which serves as a test at the node (step 7).
• A branch is grown from node N for each of the outcomes of the splitting criterion.
• The tuples in D are partitioned accordingly (steps 10 to 11); a sketch of this partitioning step follows the three scenarios below.
• There are three possible scenarios, as illustrated in the figure.
• Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that value.
• Partition Dj is the subset of class-labeled tuples in D having value aj of A.
• Because all the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples. It is therefore removed from attribute_list (steps 8 and 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively,
• where split_point is the split-point returned by Attribute_selection_method as part of the splitting criterion. (In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore may not actually be a pre-existing value of A from the training data.)
• Two branches are grown from N and labeled according to these two outcomes.
• The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which A <= split_point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection measure or algorithm being used):
• The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by Attribute_selection_method as part of the splitting criterion.
• It is a subset of the known values of A.
• If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied.
• Two branches are grown from N.
• By convention, the left branch out of N is labeled yes, so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test.
• The right branch out of N is labeled no, so that D2 corresponds to the subset of class-labeled tuples from D that do not satisfy the test.
• The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D (step 14).
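A small sketch of the partitioning step for the three scenarios above; the partition_tuples helper is hypothetical, not part of the algorithm's pseudocode.

    # Partition the tuples in D according to the outcomes of the splitting criterion.
    # D is a list of dicts (tuples); 'attribute' is the splitting attribute A.
    def partition_tuples(D, attribute, split_point=None, splitting_subset=None):
        if split_point is not None:                    # A is continuous-valued
            return {"<= split_point": [t for t in D if t[attribute] <= split_point],
                    "> split_point":  [t for t in D if t[attribute] > split_point]}
        if splitting_subset is not None:               # A discrete-valued, binary tree
            return {"yes": [t for t in D if t[attribute] in splitting_subset],
                    "no":  [t for t in D if t[attribute] not in splitting_subset]}
        partitions = {}                                # A discrete-valued: one branch per value
        for t in D:
            partitions.setdefault(t[attribute], []).append(t)
        return partitions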
• The recursive partitioning stops only when any one of the following terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting node N into a leaf and labeling it with the most common class in D. Alternatively, the class distribution of the node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a leaf is created with the majority class in D (step 13).
• The resulting decision tree is returned (step 15).
• The computational complexity of the algorithm, given training set D, is O(n × |D| × log(|D|)),
• where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D.
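A compact sketch of the top-down, recursive, divide-and-conquer procedure for discrete-valued attributes only, using information gain as the Attribute_selection_method. The helper names are illustrative, and the empty-partition case (terminating condition 3) is not exercised because branches are grown only for values present in D.

    import math
    from collections import Counter

    def info(D, label):                    # Info(D): entropy of the class distribution
        counts = Counter(t[label] for t in D)
        return -sum(c/len(D) * math.log2(c/len(D)) for c in counts.values())

    def info_gain(D, A, label):            # Gain(A) = Info(D) - Info_A(D)
        total = len(D)
        sizes = Counter(t[A] for t in D)
        info_A = sum(cnt/total * info([t for t in D if t[A] == v], label)
                     for v, cnt in sizes.items())
        return info(D, label) - info_A

    def generate_tree(D, attribute_list, label):
        classes = {t[label] for t in D}
        if len(classes) == 1:                          # all tuples in the same class
            return classes.pop()
        if not attribute_list:                         # no attributes left: majority voting
            return Counter(t[label] for t in D).most_common(1)[0][0]
        best = max(attribute_list, key=lambda A: info_gain(D, A, label))
        remaining = [A for A in attribute_list if A != best]
        node = {best: {}}
        for value in {t[best] for t in D}:             # one branch per known value of best
            Dj = [t for t in D if t[best] == value]
            node[best][value] = generate_tree(Dj, remaining, label)
        return node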
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition, D, of class-labeled training tuples into individual classes.
• If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).
• Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
• Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m).
• Let Ci,D be the set of tuples of class Ci in D.
• Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
Information Gain
• ID3 uses information gain as its attribute selection measure.
• Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
• The expected information needed to classify a tuple in D is
Info(D) = − Σ (i = 1 to m) pi log2(pi),
• where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
• A log function to the base 2 is used, because the information is encoded in bits.
• Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
• Info(D) is also known as the entropy of D.
• How much more information would we still need (after partitioning on an attribute A with v outcomes) to arrive at an exact classification? This amount is measured by
InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj).
• The term |Dj| / |D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
• Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − InfoA(D).
• The table presents a training set, D, of class-labeled tuples randomly selected from the AllElectronics customer database.
• The class label attribute, buys_computer, has two distinct values (namely, {yes, no});
• therefore, there are two distinct classes (i.e., m = 2).
• Let class C1 correspond to yes and class C2 correspond to no.
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute.
• Next, we need to compute the expected information requirement for each attribute.
• Let's start with the attribute age.
• We need to look at the distribution of yes and no tuples for each category of age.
• For the age category "youth," there are two yes tuples and three no tuples.
• For the category "middle_aged," there are four yes tuples and zero no tuples.
• For the category "senior," there are three yes tuples and two no tuples.
• Similarly, we can compute Gain(income) = 0.029 bits,
• Gain(student) = 0.151 bits,
• and Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
• Node N is labeled with age, and branches are grown for each of the attribute's values.
• The tuples are then partitioned accordingly.
• Notice that the tuples falling into the partition for age = middle_aged all belong to the same class.
• Because they all belong to class "yes," a leaf should therefore be created at the end of this branch and labeled "yes."
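A quick check of the information gain for age, using only the counts quoted above (nine yes and five no tuples overall; youth 2 yes / 3 no, middle_aged 4 yes / 0 no, senior 3 yes / 2 no):

    import math

    def entropy(counts):                   # Info(D) = -sum(p_i * log2(p_i))
        total = sum(counts)
        return -sum(c/total * math.log2(c/total) for c in counts if c)

    info_D = entropy([9, 5])                            # 9 yes, 5 no -> ~0.940 bits
    age_partitions = [[2, 3], [4, 0], [3, 2]]           # youth, middle_aged, senior
    info_age = sum(sum(p)/14 * entropy(p) for p in age_partitions)   # Info_age(D) -> ~0.694 bits
    print(round(info_D - info_age, 3))                  # Gain(age) ~ 0.247 bits, the highest gain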
Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio.
• It applies a kind of normalization to information gain using a "split information" value, defined analogously with Info(D) as
SplitInfoA(D) = − Σ (j = 1 to v) (|Dj| / |D|) × log2(|Dj| / |D|).
• This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
• The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D).
• The attribute with the maximum gain ratio is selected as the splitting attribute.
• Computation of the gain ratio for the attribute income: a test on income splits the data of the table into three partitions, namely low, medium, and high, containing four, six, and four tuples, respectively.
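A quick check of the split information for income using the partition sizes quoted above (four, six, and four tuples out of fourteen) and the Gain(income) value of 0.029 bits:

    import math

    # SplitInfo_income(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the low/medium/high partitions
    sizes, total = [4, 6, 4], 14
    split_info = -sum(s/total * math.log2(s/total) for s in sizes)
    print(round(split_info, 3))             # ~1.557 bits
    print(round(0.029 / split_info, 3))     # GainRatio(income) = Gain(income)/SplitInfo ~ 0.019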
Gini Index
• The Gini index is used in CART. The Gini index measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ (i = 1 to m) pi²,
• where pi is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Tree Pruning
• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
• Tree pruning methods address this problem of overfitting the data.
• Such methods typically use statistical measures to remove the least-reliable branches.
• Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
• They are usually faster and better at correctly classifying independent test data than unpruned trees.
• There are two common approaches to tree pruning: prepruning and postpruning (a library-level sketch follows this section).
• In the prepruning approach, a tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node).
• Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
• When constructing a tree, measures such as statistical significance, information gain, the Gini index, and so on can be used to assess the goodness of a split.
• If partitioning the tuples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted.
• There are difficulties, however, in choosing an appropriate threshold.
• High thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification.
• The second and more common approach is postpruning, which removes subtrees from a "fully grown" tree.
• A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
• The leaf is labeled with the most frequent class among the tuples of the subtree being replaced.
• For example, notice the subtree at node "A3?" in the unpruned tree of the figure.
• Suppose that the most common class within this subtree is "class B."
• In the pruned version of the tree, the subtree in question is pruned by replacing it with the leaf "class B."
• Although pruned trees tend to be more compact than their unpruned counterparts, they may still be rather large and complex.
• Decision trees can suffer from repetition and replication:
• Repetition occurs when an attribute is repeatedly tested along a given branch of the tree (e.g., "age < 60?," followed by "age < 45?," and so on).
• Replication occurs when duplicate subtrees exist within the tree (e.g., the subtree headed by the node "credit_rating?").
• These situations can impede the accuracy and comprehensibility of a decision tree.
• The use of multivariate splits (splits based on a combination of attributes) can prevent these problems.
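A brief sketch of how prepruning and postpruning map onto common library controls, assuming scikit-learn's DecisionTreeClassifier; the threshold values shown are illustrative, not prescribed by this module.

    # Prepruning: halt construction early by thresholding the goodness of a split.
    # Postpruning: grow the full tree, then remove subtrees (cost-complexity pruning).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)         # stand-in class-labeled data set

    prepruned = DecisionTreeClassifier(
        max_depth=3,                  # stop splitting below this depth
        min_samples_split=10,         # do not split very small partitions
        min_impurity_decrease=0.01,   # split only if it improves purity enough
    ).fit(X, y)

    postpruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)   # prune the fully grown tree
    print(prepruned.get_depth(), postpruned.get_depth())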
Bayes Classification Methods
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes' theorem.
• Studies comparing classification algorithms have found that a simple Bayesian classifier known as the naïve Bayesian classifier is comparable in performance with decision tree classifiers.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
Bayes' Theorem
• Let X be a data tuple.
• Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
• For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X.
• In other words, we are looking for the probability that tuple X belongs to class C.
• P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
• For example, suppose we have the attributes age and income,
• and X is a 35-year-old customer with an income of $40,000.
• Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.
• In contrast, P(H) is the prior probability, or a priori probability, of H.
• For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
• The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
• Similarly, P(X|H) is the posterior probability of X conditioned on H.
• That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
• P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
• Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X). Bayes' theorem is
P(H|X) = P(X|H) × P(H) / P(X).
Naïve Bayesian Classification
• The naïve Bayesian classifier, or simple Bayesian classifier, works as follows: it predicts that a given tuple X belongs to the class Ci having the highest posterior probability conditioned on X, that is, the class that maximizes P(Ci|X).
• Predicting a class label using naïve Bayesian classification: we wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as in the example for decision tree induction.
• The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely, {yes, no}).
• Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no.
• The tuple we wish to classify is
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
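A minimal sketch of the naïve Bayesian decision rule, which compares P(Ci) × P(X|Ci) across the classes. Only the statistics quoted in this module are used (priors 9/14 and 5/14, and the age = youth counts 2/9 and 3/5); in the full worked example the conditional probabilities of all four attribute values of X would be multiplied in the same way.

    # Naive Bayes decision rule: choose the class Ci maximizing P(Ci) * prod_k P(x_k | Ci).
    priors = {"yes": 9/14, "no": 5/14}                   # P(buys_computer = yes / no)
    p_youth_given = {"yes": 2/9, "no": 3/5}              # P(age = youth | class), from the counts above

    scores = {c: priors[c] * p_youth_given[c] for c in priors}
    print(scores)                                        # {'yes': ~0.143, 'no': ~0.214}
    print(max(scores, key=scores.get))                   # class with the larger score so far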