Data Mining-Unit-3

Lecture 1
What is classification and prediction?

Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples, or objects. The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is "supervised" in that it is told to which class each training sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels of the training samples are not known, and the number or set of classes to be learned may not be known in advance.

Typically, the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. For example, given a database of customer credit information, classification rules can be learned to identify customers as having either excellent or fair credit ratings (see Figure 7.1 below). The rules can be used to categorize future data samples, as well as provide a better understanding of the database contents.

In the second step, shown in the figure, the model is used for classification. First, the predictive accuracy of the model is estimated. The holdout method is a simple technique that uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples.

Classification process (1): model construction

NAME      RANK            YEARS   TENURED
Mike      Assistant Prof  3       no
Mary      Assistant Prof  7       yes
Bill      Professor       2       yes
Jim       Associate Prof  7       yes
Dave      Assistant Prof  6       no
Anne      Associate Prof  3       no

Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification process (2): use the model in prediction

NAME      RANK            YEARS   TENURED
Tom       Assistant Prof  2       no
Merlisa   Associate Prof  7       no
George    Professor       5       yes
Joseph    Assistant Prof  7       yes

Figure 7.1: The data classification process. (a) Learning: training data are analyzed by a classification algorithm; here the class label attribute is credit_rating, and the learned model or classifier is represented in the form of classification rules. (b) Classification: test data are used to estimate the accuracy of the classification rules; if the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model. For each test sample, the known class label is compared with the learned model's class prediction for that sample. Note that if the accuracy of the model were estimated based on the training data set, this estimate could be optimistic, since the learned model tends to overfit the data (that is, it may have incorporated some particular anomalies of the training data which are not present in the overall sample population). Therefore, a test set is used. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is not known. (Such data are also referred to in the machine learning literature as "unknown" or "previously unseen" data.)
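For illustration, here is a minimal Python sketch of the two-step process. The choice of scikit-learn and the numeric rank encoding are assumptions, not part of the notes; the toy data simply mirror the tenure tables above. Step 1 builds a classifier from class-labeled training samples; step 2 estimates accuracy on an independent test set (the holdout method) and then classifies a previously unseen tuple.

# Sketch of the two-step classification process (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction from class-labeled training samples (supervised learning).
# Rank encoding (illustrative): 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # (rank, years)
y_train = ["no", "yes", "yes", "yes", "no", "no"]            # class label: tenured
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate predictive accuracy on an independent, class-labeled test set
# (the holdout method), then classify future tuples whose class label is unknown.
X_test = [[0, 2], [1, 7], [2, 5], [0, 7]]
y_test = ["no", "no", "yes", "yes"]
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("unseen tuple ->", model.predict([[2, 1]])[0])         # e.g. a new professor with 1 year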
For example, the classification rules learned in Figure 7.1(a) from the analysis of data from existing customers can be used to predict the credit rating of new or future (i.e., previously unseen) customers.

"How is prediction different from classification?" Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have. In this view, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. In our view, however, we refer to the use of prediction to predict class labels as classification, and the use of prediction to predict continuous values (e.g., using regression techniques) as prediction. This view is commonly accepted in data mining.

Lecture 2
Issues regarding classification and prediction

Preparing the data for classification and prediction. The following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

1. Data cleaning. This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

2. Relevance analysis. Many of the attributes in the data may be irrelevant to the classification or prediction task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" feature subset, should be less than the time that would have been spent on learning from the original set of features. Hence, such analysis can help improve classification efficiency and scalability.

3. Data transformation. The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium, and high. Similarly, nominal-valued attributes, like street, can be generalized to higher-level concepts, like city. Since generalization compresses the original training data, fewer input/output operations may be involved during learning. The data may also be normalized, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. In methods that use distance measurements, this prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).
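For illustration, a minimal sketch of min-max normalization as described in step 3. The target range of 0 to 1 and the helper name are illustrative assumptions; the notes only state that values are rescaled to a small specified range.

# Rescale each value of an attribute from [old_min, old_max] to [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min            # assumes old_max > old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 35000, 58000, 98000]
print(min_max_normalize(incomes))       # [0.0, 0.267..., 0.534..., 1.0]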
Comparing classification methods. Classification and prediction methods can be compared and evaluated according to the following criteria:
1. Predictive accuracy. This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
2. Speed. This refers to the computation costs involved in generating and using the model.
3. Robustness. This is the ability of the model to make correct predictions given noisy data or data with missing values.
4. Scalability. This refers to the ability of the learned model to perform efficiently on large amounts of data.
5. Interpretability. This refers to the level of understanding and insight that is provided by the learned model.

Lecture 3
Classification by Decision Tree Induction

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node. A typical decision tree is shown in Figure 7.2. It represents the concept buys_computer, that is, it predicts whether or not a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.

Algorithm for decision tree induction (a basic greedy algorithm):
1. The tree is constructed in a top-down recursive divide-and-conquer manner.
2. At the start, all the training examples are at the root.
3. Attributes are categorical (if continuous-valued, they are discretized in advance).
4. Examples are partitioned recursively based on selected attributes.
5. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
1. All samples for a given node belong to the same class.
2. There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
3. There are no samples left.

Attribute selection measures (a sketch computing both measures follows this section):

1. Information gain (ID3/C4.5)
- All attributes are assumed to be categorical.
- Can be modified for continuous-valued attributes.
- Select the attribute with the highest information gain.
- Assume there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

2. Gini index (IBM Intelligent Miner)
- All attributes are assumed continuous-valued.
- Assume there exist several possible split values for each attribute.
- May need other tools, such as clustering, to get the possible split values.
- Can be modified for categorical attributes.
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
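A minimal sketch of the two attribute selection measures defined above. It assumes two-class counts p and n for I(p, n) and a list of class counts for the gini index; the function names and example counts are illustrative.

from math import log2

def info(p, n):
    # I(p, n): expected information needed to classify an example drawn from a
    # set containing p examples of class P and n examples of class N.
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            f = count / total
            result -= f * log2(f)
    return result

def gini(class_counts):
    # gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(info(9, 5))      # ~0.940 bits for a 9/5 class split
print(gini([9, 5]))    # ~0.459 for the same split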
Lecture 4
Bayesian Classification

"What are Bayesian classifiers?" Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier, known as the naive Bayesian classifier, to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naive". Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

Bayes' theorem. Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
  P(h|D) = P(D|h) P(h) / P(D)
The MAP (maximum a posteriori) hypothesis is
  h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)
Practical difficulty: this requires initial knowledge of many probabilities and involves significant computational cost.

Naive Bayes classifier. Simplifying assumption: the attributes are conditionally independent given the class:
  P(x1, ..., xk | C) = P(x1|C) · ... · P(xk|C)
This greatly reduces the computation cost, since only the class distribution and per-attribute frequencies need to be counted. Given a training set, we can compute the probabilities; for the play-tennis example (classes P and N):

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5

Estimating a-posteriori probabilities. P(C|X) is the probability that the sample tuple X = <x1, ..., xk> is of class C, e.g., P(class = N | outlook = sunny, windy = true, ...). The idea is to assign to sample X the class label C such that P(C|X) is maximal. By Bayes' theorem,
  P(C|X) = P(X|C) P(C) / P(X)
P(X) is constant for all classes, and P(C) is the relative frequency of class C samples, so we choose the class C such that P(X|C) P(C) is maximum. The problem is that computing P(X|C) directly is unfeasible.

Naive Bayesian classification. The naive assumption of attribute independence gives
  P(x1, ..., xk | C) = P(x1|C) · ... · P(xk|C)
If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C. If the i-th attribute is continuous, P(xi|C) is estimated through a Gaussian density function. The computation is easy in both cases.
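A minimal sketch of a naive Bayesian classifier for categorical attributes, using the relative-frequency estimates described above. The three-attribute toy data set is illustrative, and zero counts are not smoothed (a real implementation would typically add Laplace correction).

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    class_counts = Counter(labels)      # counts used for P(C)
    cond_counts = defaultdict(int)      # counts used for P(xi|C)
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            cond_counts[(i, v, c)] += 1
    return class_counts, cond_counts

def classify(x, class_counts, cond_counts):
    # Pick the class C maximizing P(C) * product_i P(xi|C).
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / total
        for i, v in enumerate(x):
            score *= cond_counts[(i, v, c)] / cc
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Attributes: (outlook, humidity, windy); classes: P = play, N = don't play.
X = [("sunny", "high", "false"), ("sunny", "high", "true"),
     ("overcast", "high", "false"), ("rain", "normal", "false"),
     ("rain", "normal", "true"), ("overcast", "normal", "true")]
y = ["N", "N", "P", "P", "N", "P"]
model = train_naive_bayes(X, y)
print(classify(("rain", "normal", "false"), *model))    # -> 'P'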
Lecture 5
Classification by Backpropagation

A multilayer feed-forward neural network. The backpropagation algorithm performs learning on a multilayer feed-forward neural network; an example of such a network is shown in the figure below. The inputs correspond to the attributes measured for each training sample. The inputs are fed simultaneously into a layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of "neuron-like" units, known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given samples. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. The multilayer neural network shown in Figure 7.9 has two layers of output units; therefore, we say that it is a two-layer neural network. Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights cycle back to an input unit or to an output unit of a previous layer. It is fully connected in that each unit provides input to each unit in the next forward layer. Multilayer feed-forward networks of linear threshold functions, given enough hidden units, can closely approximate any function.

Neural networks:
1. Advantages: prediction accuracy is generally high; robust, works when training examples contain errors; the output may be discrete, real-valued, or a vector of several discrete or real-valued attributes; fast evaluation of the learned target function.
2. Criticism: long training time; the learned function (weights) is difficult to understand; it is not easy to incorporate domain knowledge.

[Figure: a neuron computes the weighted sum of its input vector x and weight vector w, then applies an activation function to produce the output y.]
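A minimal sketch of the computation one such unit performs, and of a forward pass through a single hidden layer. The logistic activation and the tiny weight values are illustrative assumptions; backpropagation training itself (weight updates from errors) is not shown.

from math import exp

def sigmoid(s):
    # Logistic activation applied to the weighted sum.
    return 1.0 / (1.0 + exp(-s))

def unit_output(x, w, bias):
    # One neuron: weighted sum of input vector x with weight vector w, plus bias,
    # passed through the activation function.
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + bias)

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    # Feed-forward pass: inputs -> hidden layer -> output unit (fully connected).
    hidden = [unit_output(x, w, b) for w, b in zip(hidden_weights, hidden_biases)]
    return unit_output(hidden, output_weights, output_bias)

x = [0.2, 0.9]                                  # one training sample's attributes
hw = [[0.5, -0.3], [0.8, 0.1]]                  # weights of two hidden units
hb = [0.0, -0.2]
print(forward(x, hw, hb, [0.4, -0.6], 0.1))     # network's prediction for x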
Lecture 6
Classification Based on Concepts from Association Rule Mining

Several methods exist for association-based classification:
- ARCS: quantitative association mining and clustering of association rules (Lent et al., 1997). It beats C4.5 mainly in scalability, and also in accuracy.
- Associative classification (Liu et al., 1998): mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label.
- CAEP, classification by aggregating emerging patterns (Dong et al., 1999): emerging patterns (EPs) are item sets whose support increases significantly from one class to another; EPs are mined based on minimum support and growth rate.

Lecture 7
Other Classification Methods

1. k-nearest neighbor classifier. All instances correspond to points in the n-dimensional space, and the nearest neighbors are defined in terms of Euclidean distance. The target function may be discrete- or real-valued. For discrete-valued targets, the k-NN classifier returns the most common value among the k training examples nearest to the query point xq. The Voronoi diagram is the decision surface induced by 1-NN for a typical set of training examples. (A small distance-based sketch follows this list.)

2. Case-based reasoning. This also uses lazy evaluation and analyzes similar instances. The difference is that the instances are not "points in a Euclidean space". An example is the water faucet problem in CADET (Sycara et al., 1992). Methodology: instances are represented by rich symbolic descriptions (e.g., function graphs); multiple retrieved cases may be combined; there is tight coupling between case retrieval, knowledge-based reasoning, and problem solving; indexing is based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.

3. Genetic algorithms. GAs are based on an analogy to biological evolution. Each rule is represented by a string of bits, and an initial population is created consisting of randomly generated rules; e.g., "IF A1 and not A2 then C2" can be encoded as 100. Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring. The fitness of a rule is represented by its classification accuracy on a set of training examples. Offspring are generated by crossover and mutation.

4. Rough set approach. Rough sets are used to approximately, or "roughly", define equivalence classes. A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C). Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix can be used to reduce the computational intensity.

5. Fuzzy set approaches. Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, using a fuzzy membership graph). Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated. For a given new sample, more than one fuzzy value may apply. Each applicable rule contributes a vote for membership in the categories, and typically the truth values for each predicted category are summed.
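A minimal sketch of the k-nearest neighbor classifier from item 1 above: Euclidean distance to the query point and a majority vote among the k nearest training examples. The toy points are illustrative; math.dist requires Python 3.8 or later.

from collections import Counter
from math import dist

def knn_classify(xq, training_points, labels, k=3):
    # Sort training examples by Euclidean distance to the query point xq and
    # return the most common class label among the k nearest.
    nearest = sorted(zip(training_points, labels), key=lambda pl: dist(pl[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((1.1, 0.9), X, y, k=3))   # -> 'A'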
Lecture 8
Prediction and Classifier Accuracy

The prediction of continuous values can be modeled by statistical techniques of regression. For example, we may like to develop a model to predict the salary of college graduates with 10 years of work experience, or the potential sales of a new product given its price. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one.

Regression analysis and log-linear models in prediction:
- Linear regression: Y = α + βX. The two parameters, α and β, specify the line and are to be estimated from the data at hand, using the least squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g.
  p(a, b, c, d) = α_ab · β_ac · γ_ad · δ_bcd

[Figures: example screenshots titled "Prediction: Numerical Data" and "Prediction: Categorical Data"; the chart details are not legible in this copy.]
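A minimal sketch of fitting the simple linear regression Y = α + βX by least squares, using the standard closed-form estimates; the toy experience/salary values are illustrative.

def fit_linear_regression(xs, ys):
    # Least-squares estimates of Y = alpha + beta * X:
    #   beta  = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    #   alpha = y_mean - beta * x_mean
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    beta = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
           sum((x - x_mean) ** 2 for x in xs)
    alpha = y_mean - beta * x_mean
    return alpha, beta

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]           # years of work experience
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]    # salary in thousands
alpha, beta = fit_linear_regression(years, salary)
print(f"Y = {alpha:.1f} + {beta:.1f} X")             # predicted salary for given years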
Estimating error rates:
1. Partition: training-and-testing. Use two independent data sets, e.g., a training set (2/3 of the data) and a test set (1/3); this is used for data sets with a large number of samples.
2. Cross-validation. Divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data. This is k-fold cross-validation, used for data sets of moderate size.
3. Bootstrapping (leave-one-out), used for small data sets.
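A minimal sketch of k-fold cross-validation for estimating an error rate, as described in item 2 above. The train and error_rate callables are assumed placeholders for whichever learning scheme is being evaluated.

def k_fold_error(samples, labels, train, error_rate, k=10):
    # Split the data into k folds; train on the other k-1 folds, test on the
    # held-out fold, and average the k error rates.
    n = len(samples)
    fold_size = n // k
    errors = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test_X, test_y = samples[start:stop], labels[start:stop]
        train_X = samples[:start] + samples[stop:]
        train_y = labels[:start] + labels[stop:]
        model = train(train_X, train_y)
        errors.append(error_rate(model, test_X, test_y))
    return sum(errors) / k

Setting k equal to the number of samples gives the leave-one-out variant mentioned above.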