DWDM Module IV
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such analysis
can help provide us with a better understanding of the data at large.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky.
Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing
techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the
most commonly occurring value for that attribute, or with the most probable value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or missing data, this
step can help reduce confusion during learning.
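As a small illustration, the sketch below uses pandas to replace a missing categorical value with the most commonly occurring value (the mode) and a missing numeric value with the attribute mean; the column names and values are hypothetical, not taken from any table in these notes.

```python
# A minimal data-cleaning sketch, assuming a pandas DataFrame with
# hypothetical columns "income" (numeric) and "credit_rating" (categorical).
import pandas as pd

df = pd.DataFrame({
    "income": [42000, None, 58000, 61000, None],
    "credit_rating": ["fair", "excellent", None, "fair", "fair"],
})

# Replace a missing categorical value with the most commonly occurring value (the mode).
df["credit_rating"] = df["credit_rating"].fillna(df["credit_rating"].mode()[0])

# Replace a missing numeric value with the attribute mean (one simple choice);
# smoothing could instead use binning or regression.
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```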
Issues Regarding Classification and Prediction
Relevance analysis:
Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether
any two given attributes are statistically related. For example, a strong correlation between attributes A1
and A2 would suggest that one of the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to
find a reduced set of attributes such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to
detect attributes that do not contribute to the classification or prediction task.
Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting
“reduced” attribute (or feature) subset, should be less than the time that would have been spent on learning
from the original set of attributes. Hence, such analysis can help improve classification efficiency and
scalability.
Issues Regarding Classification and Prediction
The data may be transformed by normalization, particularly when neural networks or methods involving
distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0 or 0.0 to 1.0. In methods that use distance measurements, for example, this would
prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially
smaller ranges (such as binary attributes).
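A minimal sketch of min-max normalization to the range [0.0, 1.0] is shown below; the income values are illustrative.

```python
# Min-max normalization: rescale an attribute so a wide-ranged attribute such as
# income does not outweigh a binary attribute in distance computations.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

incomes = [12000, 35000, 58000, 98000]      # large original range
print(min_max_normalize(incomes))            # all values now fall in [0.0, 1.0]
```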
Issues Regarding Classification and Prediction
Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class
label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy
of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or
previously unseen data.
Speed: This refers to the computational costs involved in generating and using the given classifier or
predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or
data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts
of data.
Interpretability: This refers to the level of understanding and insight that is provided by the classifier or
predictor. Interpretability is subjective and therefore more difficult to assess.
Classification by Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on
concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented C4.5
(a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often
compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book
Classification and Regression Trees (CART), which described the generation of binary decision trees. ID3
and CART were invented independently of one another at around the same time, yet follow a similar
approach for learning decision trees from training tuples.
Classification by Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on
an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a
class label. The topmost node in a tree is the root node.
Classification by Decision Tree Induction
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
– Use of decision tree: classifying an unknown sample
• Test the attribute values of the sample against the decision tree
Classification by Decision Tree Induction
If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each
partition would be pure (i.e., all of the tuples that fall into a given partition would belong to the same class).
Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
Information gain
ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by
Claude Shannon on information theory, which studied the value or “information content” of messages. Let
node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen
as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in
the resulting partitions and reflects the least randomness or “impurity” in these partitions.
Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees
that a simple (but not necessarily the simplest) tree is found.
Info(D) = − Σ i=1..m pi log2(pi)
where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log
function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of
information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is
based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
How much more information would we still need (after the partitioning) in order to arrive at an exact
classification? This amount is measured by
InfoA(D) = Σ j=1..v (|Dj|/|D|) × Info(Dj)
The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) is the expected information required to classify
a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater
the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e., based on just
the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − InfoA(D)
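A minimal sketch of this computation is given below; data is a list of (attribute value, class label) pairs, and the toy values are illustrative rather than the AllElectronics table.

```python
# Information gain of a split on one attribute:
# Gain(A) = Info(D) - sum_j (|Dj|/|D|) * Info(Dj)
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i), the entropy of the class distribution."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(data):
    labels = [c for _, c in data]
    partitions = {}                               # tuples grouped by attribute value
    for value, label in data:
        partitions.setdefault(value, []).append(label)
    info_a = sum(len(part) / len(data) * info(part) for part in partitions.values())
    return info(labels) - info_a

data = [("youth", "no"), ("youth", "no"), ("middle", "yes"),
        ("senior", "yes"), ("senior", "no"), ("middle", "yes")]
print(round(info_gain(data), 3))
```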
Classification by Decision Tree Induction
For the AllElectronics training data, Gain(age) = 0.246 bits. Similarly, we can compute Gain(income) = 0.029
bits, Gain(student) = 0.151 bits, and Gain(credit rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as the splitting
attribute.
Classification by Decision Tree Induction
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select
attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier such as product ID. A split on product ID
would result in a large number of partitions (as many as there are values), each one containing just one
tuple. Clearly, such a partitioning is useless for classification.
Hence C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It applies a kind of normalization to information gain using a “split information” value defined analogously with
Info(D) as
SplitInfoA(D) = − Σ j=1..v (|Dj|/|D|) × log2(|Dj|/|D|)
The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfoA(D), and the attribute with the maximum
gain ratio is selected as the splitting attribute.
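For concreteness, a short sketch of the gain-ratio computation follows; the entropy and information-gain helpers are repeated so the snippet runs on its own, and the toy (attribute value, class label) pairs are illustrative.

```python
# GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo depends only on how
# many tuples fall into each branch of the split.
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n)

def info_gain(data):
    labels = Counter(c for _, c in data)
    parts = {}
    for v, c in data:
        parts.setdefault(v, Counter())[c] += 1
    info_a = sum(sum(p.values()) / len(data) * entropy(p.values()) for p in parts.values())
    return entropy(labels.values()) - info_a

def split_info(data):
    # SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)
    return entropy(Counter(v for v, _ in data).values())

def gain_ratio(data):
    return info_gain(data) / split_info(data)

data = [("youth", "no"), ("youth", "no"), ("middle", "yes"), ("senior", "yes")]
print(round(gain_ratio(data), 3))
```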
Classification by Decision Tree Induction
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index measures the
impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 − Σ i=1..m pi²
where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum is
computed over m classes. The Gini index considers a binary split for each attribute.
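A minimal sketch of the Gini impurity computation is below; the 9-yes / 5-no class distribution is just an illustrative partition.

```python
# Gini impurity of a partition: Gini(D) = 1 - sum_i p_i^2
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))   # impurity of a 9-yes / 5-no partition, ~0.459
```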
Bayesian Classification
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities such
as the probability that a given tuple belongs to a particular class. Bayesian classification is based on
Bayes’ theorem.
Studies comparing classification algorithms have found the naïve Bayesian classifier to be comparable in
performance with decision tree and selected neural network classifiers. Bayesian classifiers have also
exhibited high accuracy and speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved.
Bayesian Classification
Bayes’ Theorem
For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given
the “evidence” or observed data tuple X. In other words, we look for the probability that tuple X belongs to
class C, given that we know the attribute description of X.
For example, suppose our world of data tuples is confined to customers described by the attributes
age and income, respectively, and that X is a 35-year-old customer with an income of $40,000.
Bayesian Classification
Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the
probability that customer X will buy a computer given that we know the customer’s age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the
probability that any given customer will buy a computer, regardless of age, income, or any other
information, for that matter. The posterior probability, P(H|X), is based on more information (e.g.,
customer information) than the prior probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that
a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a
computer. P(X) is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given
data, as we shall see next. Bayes’ theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X).
Bayesian Classification
We wish to predict the class label of a tuple using naïve Bayesian classification, given the same
training data. The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The
tuple we wish to classify is X = (age = youth, income = medium, student = yes, credit rating = fair).
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be
computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643 and P(buys computer = no) = 5/14 = 0.357.
P(X|buys computer = yes) = P(age = youth | buys computer = yes)× P(income = medium | buys computer =
yes)× P(student = yes | buys computer = yes)× P(credit rating = fair | buys computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly,
P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.
To find the class Ci that maximizes P(X|Ci)P(Ci), we compute P(X|buys computer = yes)P(buys computer =
yes) = 0.044 × 0.643 = 0.028 and P(X|buys computer = no)P(buys computer = no) = 0.019 × 0.357 = 0.007.
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
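As a sanity check, the small sketch below reproduces the arithmetic of this example, using the conditional probabilities quoted above and the 9-yes / 5-no split of the 14 training tuples for the class priors.

```python
# Reproduce the naive Bayesian arithmetic for X = (youth, medium, yes, fair).
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X | buys_computer = yes) ~ 0.044
p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # P(X | buys_computer = no)  ~ 0.019

p_yes, p_no = 9 / 14, 5 / 14                    # class priors from the 14 training tuples

score_yes = p_x_given_yes * p_yes               # ~ 0.028
score_no  = p_x_given_no * p_no                 # ~ 0.007

print("predict:", "yes" if score_yes > score_no else "no")
```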
CLASSIFICATION BY BACK PROPAGATION
Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output
units in which each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct
class label of the input tuples. Neural network learning is also referred to as connectionist learning due to the
connections between units.
Each layer is made up of units. The inputs to the network correspond to the attributes measured for each
training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units,
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden
layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden
layer are input to units making up the output layer, which emits the network’s prediction for given tuples.
The units in the input layer are called input units.
The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic
biological basis, or as output units. The multilayer neural network shown in Figure has two layers of output
units. Therefore, we say that it is a two-layer neural network. Similarly, a network containing two hidden
layers is called a three-layer neural network, and so on. It is a feed-forward network since none of the
weights cycles back to an input unit or to a previous layer’s output unit. It is fully connected in that each unit
provides input to each unit in the next forward layer.
CLASSIFICATION BY BACK PROPAGATION
Each output unit takes, as input, a weighted sum of the outputs from units in the previous layer. It applies a
nonlinear (activation) function to the weighted input. Multilayer feed-forward neural networks are able to
model the class prediction as a nonlinear combination of the inputs. From a statistical point of view, they
perform nonlinear regression. Multilayer feed-forward networks, given enough hidden units and enough
training samples, can closely approximate any function.
CLASSIFICATION BY BACK PROPAGATION
Back Propagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s
prediction for each tuple with the actual known target value. The target value may be the known class
label of the training tuple (for classification problems) or a continuous value (for numeric prediction).
The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this
is your first look at neural network learning. However, once you become familiar with the process, you
will see that each step is inherently simple. The steps are described next.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging
from −1.0 to 1.0, or −0.5 to 0.5). Each unit has a bias associated with it, as explained later. The biases are
similarly initialized to small random numbers. Each training tuple, X, is processed by the following steps.
Propagate the inputs forward: First, the training tuple is fed to the network’s input layer. The inputs pass
through the input units, unchanged. That is, for an input unit, j, its output, Oj, is equal to its input value, Ij.
Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a
unit in the hidden or output layers is computed as a linear combination of its inputs.
CLASSIFICATION BY BACK PROPAGATION
Back Propagation
For each training tuple, the weights are modified so as to minimize the mean squared error between the
network’s prediction and the actual target value. These modifications are made in the “backwards”
direction, that is, from the output layer, through each hidden layer down to the first hidden layer (hence
the name backpropagation).
CLASSIFICATION BY BACK PROPAGATION
Next, the net input and output of each unit in the hidden and output layers are computed. The net input
to a unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in
the previous layer. Each connection has a weight.
To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding
weight, and this is summed.
Given a unit j in a hidden or output layer, the net input, Ij, to unit j is
Ij = Σ i wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of
unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves
to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation function to it,
as illustrated above. The function symbolizes the activation of the neuron represented by the unit. The
logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is
computed as
Oj = 1 / (1 + e^(−Ij))
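As a minimal sketch, the computation of a single unit's output under the logistic activation might look like the following; the input, weight, and bias values are illustrative.

```python
# Forward propagation for one unit: net input Ij = sum_i wij*Oi + theta_j,
# then the sigmoid output Oj = 1 / (1 + e^{-Ij}).
import math

def unit_output(inputs, weights, bias):
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net_input))

print(unit_output(inputs=[1, 0, 1], weights=[0.2, 0.4, -0.5], bias=-0.4))
```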
Backpropagate the error: The error is propagated backward by updating the weights and biases to
reflect the error of the network’s prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 − Oj)(Tj − Oj)
where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
Note that Oj(1 − Oj) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected
to unit j in the next layer is considered. The error of a hidden layer unit j is
Errj = Oj (1 − Oj) Σ k Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is
the error of unit k.
The weights and biases are updated to reflect the propagated errors. Weights are updated by the
following equations, where Δwij is the change in weight wij:
Δwij = (l) Errj Oi
wij = wij + Δwij
The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0.
Backpropagation learns using a method of gradient descent to search for a set of weights that fits
the training data so as to minimize the mean squared distance between the network’s class
prediction and the known target value of the tuples. The learning rate helps avoid getting stuck at a
local minimum in decision space (i.e., where the weights appear to converge, but are not the
optimum solution) and encourages finding the global minimum. If the learning rate is too small,
then learning will occur at a very slow pace. If the learning rate is too large, then oscillation
between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is
the number of iterations through the training set so far.
Biases are updated by the following equations below, where Δθj is the change in bias θj:
Δθj = (l) Errj
θj = θj + Δθj
CLASSIFICATION BY BACK PROPAGATION
Let the learning rate be 0.9. The initial weight and bias values of the network are given below, along
with the first training tuple, X = (1, 0, 1), whose class label is 1.
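Since the table of initial weights and biases is not reproduced in these notes, the sketch below uses placeholder values; it walks the training tuple X = (1, 0, 1) with target 1 through a small 3-2-1 feed-forward network and applies the update rules above with learning rate 0.9.

```python
# One backpropagation iteration for a 3-2-1 network; the initial weights and
# biases are placeholders chosen only to illustrate the update rules.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1, 0, 1]; target = 1; lr = 0.9

w_hidden = [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]]   # weights into hidden units 4 and 5
b_hidden = [-0.4, 0.2]                            # biases of hidden units
w_out = [-0.3, -0.2]                              # weights from units 4, 5 into output unit 6
b_out = 0.1

# forward pass
o_hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
            for ws, b in zip(w_hidden, b_hidden)]
o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)) + b_out)

# backward pass: Err_j = Oj(1-Oj)(Tj-Oj) at the output, weighted back to hidden units
err_out = o_out * (1 - o_out) * (target - o_out)
err_hidden = [o * (1 - o) * err_out * w for o, w in zip(o_hidden, w_out)]

# weight and bias updates: delta = (learning rate) * Err_j * O_i
w_out = [w + lr * err_out * o for w, o in zip(w_out, o_hidden)]
b_out += lr * err_out
w_hidden = [[w + lr * err * xi for w, xi in zip(ws, x)] for ws, err in zip(w_hidden, err_hidden)]
b_hidden = [b + lr * err for b, err in zip(b_hidden, err_hidden)]

print("output:", round(o_out, 3), "output-unit error:", round(err_out, 3))
```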
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING
Associative Classification
Associative classification is based on association rules, so we first review association rule mining in
general. Association rules are mined in a two-step process consisting of
frequent itemset mining followed by rule generation. The first step searches for patterns of attribute–
value pairs that occur repeatedly in a data set, where each attribute–value pair is considered an item.
The resulting attribute– value pairs form frequent itemsets (also referred to as frequent patterns). The
second step analyzes the frequent itemsets to generate association rules. All association rules must
satisfy certain criteria regarding their “accuracy” (or confidence) and the proportion of the data set that
they actually represent (referred to as support). For example, the following is an association rule mined
from a data set, D, shown with its confidence and support:
age = youth ∧ credit = OK ⇒ buys computer = yes [support = 20%, confidence = 93%]
One of the earliest and simplest algorithms for associative classification is CBA (Classification Based on Associations). CBA uses an
iterative approach to frequent itemset mining. CBA uses a heuristic method to construct the classifier,
where the rules are ordered according to decreasing precedence based on their confidence and support.
CMAR (Classification based on Multiple Association Rules) differs from CBA in its strategy for frequent itemset
mining and its construction of the classifier. It also employs several rule pruning strategies with the
help of a tree structure for efficient storage and retrieval of rules.
CMAR adopts a variant of the FP-growth
algorithm to find the complete set of rules satisfying the minimum confidence and minimum support
thresholds. CMAR employs another tree structure to store and retrieve rules efficiently and to prune rules
based on confidence, correlation, and database coverage.
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2),
then R2 is pruned. CMAR also prunes rules for which the rule antecedent and class are not positively
correlated, based on a χ2 test of statistical significance.
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING
An IF-THEN rule is an expression of the form IF condition THEN conclusion; the “IF” part is the rule
antecedent and the “THEN” part is the rule consequent. An example is rule R1: IF age = youth AND
student = yes THEN buys computer = yes.
R1 can also be written as R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a
given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied)
and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a class-labeled data set, D, let ncovers be
the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and |D| be
the number of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute
values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and
see what percentage of them the rule can correctly classify.
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING
Rule accuracy and coverage. Let’s go back to our training data table: these are class-labeled tuples from
the AllElectronics customer database, and our task is to predict whether a customer will buy a computer.
Consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples.
Therefore,
coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.
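A small sketch of the coverage and accuracy computation for rule R1 is shown below; the class-labeled tuples are a toy illustration, not the 14-tuple AllElectronics table.

```python
# coverage(R) = ncovers / |D|, accuracy(R) = ncorrect / ncovers for
# R1: IF age = youth AND student = yes THEN buys_computer = yes.
def rule_covers(t):
    return t["age"] == "youth" and t["student"] == "yes"

def coverage_and_accuracy(data):
    covered = [t for t in data if rule_covers(t)]
    correct = [t for t in covered if t["buys_computer"] == "yes"]
    return len(covered) / len(data), len(correct) / len(covered)

D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
]
cov, acc = coverage_and_accuracy(D)
print(f"coverage = {cov:.2%}, accuracy = {acc:.2%}")   # 50.00%, 100.00% on this toy data
```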
Prediction:
“What if we would like to predict a continuous value, rather than a categorical label?” Numeric
prediction is the task of predicting continuous (or ordered) values for given input. For example, we
may wish to predict the salary of college graduates with 10 years of work experience, or the potential
sales of a new product given its price. By far, the most widely used approach for numeric prediction
(hereafter referred to as prediction) is regression.
Regression analysis can be used to model the relationship between one or more independent or
predictor variables and a dependent or response variable (which is continuous-valued).
Prediction
Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is
the simplest form of regression, and models y as a linear function of x.
That is,
y = b+wx
where the variance of y is assumed to be constant, and b and w are regression coefficients specifying
the Y-intercept and slope of the line, respectively. The regression coefficients, w and b, can also be
thought of as weights, so that we can equivalently write,
y = w0+w1x
These coefficients can be solved for by the method of least squares, which estimates the best-fitting
straight line as the one that minimizes the error between the actual data and the estimate of the line. Let
D be a training set consisting of values of predictor variable, x, for some population and their
associated values for response variable, y. The training set contains |D| data points of the form (x1, y1),
(x2, y2), … , (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:
w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
w0 = ȳ − w1 x̄
where x̄ is the mean value of x1, x2, … , x|D| and ȳ is the mean value of y1, y2, … , y|D|.
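A minimal sketch of these least-squares estimates is given below; the (years of experience, salary in $1000s) pairs are illustrative values.

```python
# Least-squares estimates for straight-line regression y = w0 + w1*x.
def least_squares(points):
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in points)
          / sum((x - x_mean) ** 2 for x, _ in points))
    w0 = y_mean - w1 * x_mean
    return w0, w1

data = [(3, 30), (8, 57), (9, 64), (13, 72), (3, 36), (6, 43), (11, 59), (21, 90)]
w0, w1 = least_squares(data)
print(f"y = {w0:.1f} + {w1:.1f} x")
```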
Prediction
Linear Regression
Prediction
Example: consider a set of paired data where x is the number of years of work experience of a college
graduate and y is the corresponding salary of the graduate. The 2-D data can be graphed on a scatter plot.
The plot suggests a linear relationship between the two variables, x and y.
We model the relationship that salary may be related to the number of years of work experience with
the equation y = w0+w1x.
Prediction
Nonlinear Regression
“How can we model data that does not show a linear dependence? For example, what if a given
response variable and predictor variable have a relationship that may be modeled by a polynomial
function?” Think back to the straight-line linear regression case above where dependent response
variable, y, is modeled as a linear function of a single independent predictor variable, x.
What if we can get a more accurate model using a nonlinear model, such as a parabola or some other
higher-order polynomial? Polynomial regression is often of interest when there is just one predictor
variable. It can be modeled by adding polynomial terms to the basic linear model. By applying
transformations to the variables, we can convert the nonlinear model into a linear one that can then be
solved by the method of least squares.
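A short sketch of this idea is below, assuming NumPy is available: a roughly quadratic relationship is converted into a model that is linear in the transformed variables x1 = x and x2 = x², and then solved by least squares. The data points are illustrative.

```python
# Polynomial regression via variable transformation, then linear least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.9, 10.2, 17.1, 26.3, 37.0])      # roughly quadratic in x

# transform: x1 = x, x2 = x^2, so y = w0 + w1*x1 + w2*x2 is linear in (x1, x2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("w0, w1, w2 =", np.round(w, 2))
```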
Prediction
CLASSIFIER ACCURACY
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.
True Positives (TP): The positive tuples that were correctly labeled by the classifier.
TP - Number of true positives.
True Negatives(TN): The negative tuples that were correctly labeled by the classifier.
TN - Number of true negatives.
False Positives (FP): The negative tuples that were incorrectly labeled as positive.
E.g., Tuples of class buys computer = no for which the classifier predicted buys computer = yes.
FP - Number of false positives.
False Negatives (FN): The positive tuples that were incorrectly labeled as negative.
E.g., Tuples of class buys computer = yes for which the classifier predicted buys computer = no.
FN - Number of false negatives.
1) Holdout method
In this method, the given data are randomly partitioned into two independent sets, a training set and a
test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is
allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with
the test set as shown in figure.
The estimate is pessimistic because only a portion of the initial data is used to derive
the model.
CLASSIFIER ACCURACY
2) Random subsampling
is a variation of the holdout method in which the holdout method is repeated k times. The overall
accuracy estimate is taken as the average of the accuracies obtained from each iteration.
3) Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or
“folds,” D1, D2, …, Dk, each of approximately equal size. Training and testing is performed k times. In
iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to
train the model. That is, in the first iteration, subsets D2, … , Dk collectively serve as the training set in
order to obtain a first model, which is tested on D1; the second iteration is trained on subsets
D1, D3, …, Dk and tested on D2; and so on.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples.
That is, only one sample is “left out” at a time for the test set.
In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each
fold is approximately the same as that in the initial data.
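For illustration, a minimal sketch of k-fold cross-validation follows; the train(...) and accuracy(...) callables are assumed to be supplied by the caller, since the point here is the random partitioning into k folds and the averaging of the k accuracy estimates.

```python
# k-fold cross-validation: each fold Di is held out once as the test set while
# the remaining folds are used for training; the k accuracies are averaged.
import random

def k_fold_cross_validation(data, k, train, accuracy):
    data = list(data)
    random.shuffle(data)                                   # random partitioning
    folds = [data[i::k] for i in range(k)]                 # k mutually exclusive subsets
    scores = []
    for i in range(k):
        test_set = folds[i]                                # fold Di is the test set in iteration i
        train_set = [t for j, f in enumerate(folds) if j != i for t in f]
        model = train(train_set)
        scores.append(accuracy(model, test_set))
    return sum(scores) / k                                 # overall accuracy estimate
```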
CLASSIFIER ACCURACY
4) Bootstrap
The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a
tuple is selected, it is equally likely to be selected again and re-added to the training set.
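A minimal sketch of sampling with replacement is shown below; the tuples not drawn into the bootstrap sample (the "out-of-bag" tuples) are commonly used as the test set. The data values are illustrative.

```python
# Bootstrap sampling: draw |D| tuples uniformly with replacement, so a tuple
# may appear in the sample more than once.
import random

D = list(range(10))                        # stand-in for 10 training tuples
sample = [random.choice(D) for _ in D]     # sampled with replacement, same size as D
out_of_bag = [t for t in D if t not in sample]

print("bootstrap sample:", sample)
print("out-of-bag tuples (commonly used as the test set):", out_of_bag)
```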