
CLASSIFICATION AND PREDICTION

Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such analysis
can help provide us with a better understanding of the data at large.

Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.

For example, we can build a classification model to categorize bank loan applications as
either safe or risky.

A prediction model, in contrast, could be built to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
Issues Regarding Classification and Prediction

1. Preparing the Data for Classification and Prediction


The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.

Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing
techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the
most commonly occurring value for that attribute, or with the most probable value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or missing data, this
step can help reduce confusion during learning.
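
As a concrete illustration of the missing-value strategy described above, here is a minimal Python sketch that replaces missing values of one attribute with its most commonly occurring value; the attribute name and tuples are illustrative, not from the slides.

from collections import Counter

def fill_missing_with_mode(tuples, attribute):
    """Replace missing values (None) of one attribute with its most common value."""
    observed = [t[attribute] for t in tuples if t[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]   # most frequently occurring value
    for t in tuples:
        if t[attribute] is None:
            t[attribute] = mode
    return tuples

# Illustrative tuples: the two missing credit_rating values are replaced by "fair".
data = [{"credit_rating": "fair"}, {"credit_rating": None},
        {"credit_rating": "fair"}, {"credit_rating": "excellent"},
        {"credit_rating": None}]
print(fill_missing_with_mode(data, "credit_rating"))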
Issues Regarding Classification and Prediction

Relevance analysis:
Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether
any two given attributes are statistically related. For example, a strong correlation between attributes A1
and A2 would suggest that one of the two could be removed from further analysis.

A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to
find a reduced set of attributes such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes.

Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to
detect attributes that do not contribute to the classification or prediction task.

Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting
“reduced” attribute (or feature) subset, should be less than the time that would have been spent on learning
from the original set of attributes. Hence, such analysis can help improve classification efficiency and
scalability.
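
The correlation analysis described above can be sketched as follows for two numeric attributes; the 0.9 threshold and the sample values are assumptions for illustration, not values from the text.

import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Two strongly correlated attributes: one of them becomes a candidate for removal.
a1 = [20, 30, 40, 50, 60]   # e.g., age (illustrative values)
a2 = [22, 31, 39, 52, 61]   # e.g., years_employed (illustrative values)
r = pearson_correlation(a1, a2)
if abs(r) > 0.9:            # chosen threshold, not taken from the text
    print(f"r = {r:.3f}: the attributes look redundant; one could be dropped")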
Issues Regarding Classification and Prediction

Data transformation and reduction:

The data may be transformed by normalization, particularly when neural networks or methods involving
distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0 or 0.0 to 1.0. In methods that use distance measurements, for example, this would
prevent attributes with initially large ranges (such as income) from outweighing attributes with initially
smaller ranges (such as binary attributes).
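
A minimal sketch of the min-max style scaling described above; the income values are illustrative.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale an attribute's values into [new_min, new_max], e.g. 0.0-1.0 or -1.0-1.0."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 35000, 58000, 99000]        # attribute with an initially large range
print(min_max_normalize(incomes))             # now comparable with binary attributes
print(min_max_normalize(incomes, -1.0, 1.0))  # the alternative range mentioned above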
Issues Regarding Classification and Prediction

2. Comparing Classification and Prediction Methods


Classification and prediction methods can be compared and evaluated according to the following criteria:

Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class
label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy
of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or
previously unseen data.
Speed: This refers to the computational costs involved in generating and using the given classifier or
predictor.
Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or
data with missing values.
Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts
of data.
Interpretability: This refers to the level of understanding and insight that is provided by the classifier or
predictor. Interpretability is subjective and therefore more difficult to assess.
Classification by Decision Tree Induction

Decision Tree Induction


During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser).

This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P.
T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.

In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book
Classification and Regression Trees (CART), which described the generation of binary decision trees. ID3
and CART were invented independently of one another at around the same time, yet follow a similar
approach for learning decision trees from training tuples.
Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where each internal node (non leaf node) denotes a test on
an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a
class label. The topmost node in a tree is the root node.
Classification by Decision Tree Induction

• Decision tree generation consists of two phases

– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes

– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the
decision tree
Classification by Decision Tree Induction

Algorithm for Decision Tree Induction


Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information
gain)
Classification by Decision Tree Induction

Conditions for stopping partitioning


– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed for
classifying the leaf
– There are no samples left
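
Putting the algorithm and stopping conditions above together, here is a minimal sketch of information-gain-based tree induction, assuming categorical attributes and a nested-dict tree representation; the four training tuples are illustrative, not the AllElectronics data. (The "no samples left" case cannot arise here because branch values are drawn from the tuples themselves.)

import math
from collections import Counter

def entropy(tuples, label):
    """Expected information needed to classify a tuple in this partition."""
    counts = Counter(t[label] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(tuples, attr, label):
    """Information gain obtained by splitting the partition on attr."""
    total = len(tuples)
    remainder = 0.0
    for value in {t[attr] for t in tuples}:
        subset = [t for t in tuples if t[attr] == value]
        remainder += len(subset) / total * entropy(subset, label)
    return entropy(tuples, label) - remainder

def build_tree(tuples, attrs, label):
    classes = [t[label] for t in tuples]
    if len(set(classes)) == 1:                       # all tuples in the same class
        return classes[0]
    if not attrs:                                    # no attributes left: majority vote
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(tuples, a, label))
    branches = {}
    for value in {t[best] for t in tuples}:          # one branch per attribute value
        subset = [t for t in tuples if t[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best], label)
    return {best: branches}

data = [{"age": "youth", "student": "no", "buys": "no"},
        {"age": "youth", "student": "yes", "buys": "yes"},
        {"age": "senior", "student": "no", "buys": "yes"},
        {"age": "senior", "student": "yes", "buys": "yes"}]
print(build_tree(data, ["age", "student"], "buys"))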

Extracting Classification Rules from Trees


• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
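
Continuing with the nested-dict tree representation from the sketch above, one rule can be generated per root-to-leaf path; the example tree is the one produced there.

def extract_rules(tree, conditions=()):
    """Walk every root-to-leaf path and emit one IF-THEN rule per path."""
    if not isinstance(tree, dict):                        # leaf node: class prediction
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN class = {tree}"]
    rules = []
    for attr, branches in tree.items():
        for value, subtree in branches.items():           # one branch per test outcome
            rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules

tree = {"age": {"youth": {"student": {"no": "no", "yes": "yes"}}, "senior": "yes"}}
for rule in extract_rules(tree):
    print(rule)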
Classification by Decision Tree Induction

Avoid Overfitting in Classification


• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– This results in poor accuracy for unseen samples

• Two approaches to avoid overfitting


– Prepruning: Halt tree construction early; do not split a node if this would result in the goodness
measure falling below a threshold

• It is difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree to get a sequence of progressively
pruned trees
• Use a set of data different from the training data to decide which is the
“best pruned” tree
Classification by Decision Tree Induction

Attribute Selection Measures


An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given
data partition, D, of class-labeled training tuples into individual classes.

If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each
partition would be pure (i.e., all of the tuples that fall into a given partition would belong to the same class).

Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.

This section describes three popular attribute selection measures


1. information gain,
2. gain ratio,
3. gini index.
Classification by Decision Tree Induction

Information gain

ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by
Claude Shannon on information theory, which studied the value or “information content” of messages. Let
node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen
as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in
the resulting partitions and reflects the least randomness or “impurity” in these partitions.

Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees
that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

Info(D) = − Σ (i = 1 to m) pi log2(pi)

Classification by Decision Tree Induction

where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log
function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of
information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is
based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
How much more information would we still need (after the partitioning) in order to arrive at an exact
classification? This amount is measured by

InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)

The term |Dj |/|D| acts as the weight of the jth partition. InfoA(D) is the expected information required to classify
a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater
the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e., based on just
the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − InfoA(D)
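
As a quick check of the formulas above, this sketch computes Info(D) for a two-class partition, using the 9 "yes" / 5 "no" class counts that appear in the naïve Bayesian example later in this module.

import math

def info(class_counts):
    """Expected information (entropy) of a partition, given its class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c)

print(f"{info([9, 5]):.3f} bits")   # entropy of D with 9 "yes" and 5 "no" tuples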
Classification by Decision Tree Induction

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as the splitting
attribute.
Classification by Decision Tree Induction

Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select
attributes having a large number of values.

For example, consider an attribute that acts as a unique identifier such as product ID. A split on product ID
would result in a large number of partitions (as many as there are values), each one containing just one
tuple. Clearly, such a partitioning is useless for classification.

Hence C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It applies a kind of normalization to information gain using a “split information” value defined analogously with
Info(D) as

SplitInfoA(D) = − Σ (j = 1 to v) (|Dj| / |D|) × log2(|Dj| / |D|)

The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfoA(D), and the attribute with the maximum gain ratio is selected as the splitting attribute.
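
A small sketch of the split-information term; the partition sizes below are assumed for illustration and show why an identifier-like attribute that splits 14 tuples into 14 singleton partitions is heavily penalized.

import math

def split_info(partition_sizes):
    """SplitInfo of a candidate split, given the sizes |Dj| of the partitions it induces."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

print(f"{split_info([4, 6, 4]):.3f}")   # a three-way split of 14 tuples
print(f"{split_info([1] * 14):.3f}")    # 14 singleton partitions: very large SplitInfo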
Classification by Decision Tree Induction

Gini Index

The Gini index is used in CART. Using the notation previously described, the Gini index measures the
impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ (i = 1 to m) pi²

where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. The sum is
computed over m classes. The Gini index considers a binary split for each attribute.
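
A matching sketch for the Gini index, again using the 9 "yes" / 5 "no" class counts from the later example.

def gini(class_counts):
    """Gini index of a partition D, given its class counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(f"{gini([9, 5]):.3f}")   # impurity of D with 9 "yes" and 5 "no" tuples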
Bayesian Classification

Bayesian Classification

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities such
as the probability that a given tuple belongs to a particular class. Bayesian classification is based on
Bayes’ theorem.

Naïve Bayesian classifiers have been found to be comparable in performance with decision tree and selected
neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when
applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved.
Bayesian Classification

Bayes’ Theorem

Let X be a data tuple.

In Bayesian terms, X is considered “evidence.” It is described by measurements made on a set of n
attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.

For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given
the “evidence” or observed data tuple X. In other words, we want the probability that tuple X belongs to class
C, given that we know the attribute description of X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.

For example, suppose our world of data tuples is confined to customers described by the attributes
age and income, respectively, and that X is a 35-year-old customer with an income of $40,000.
Bayesian Classification

Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the
probability that customer X will buy a computer given that we know the customer’s age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the
probability that any given customer will buy a computer, regardless of age, income, or any other
information, for that matter. The posterior probability, P(H|X), is based on more information (e.g.,
customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that
a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a
computer. P(X) is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.

“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given
data, as we shall see next. Bayes’ theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X).
Bayesian Classification

Predicting a class label using naïve Bayesian classification.

We wish to predict the class label of a tuple using naïve Bayesian classification, given the same
training data as in the decision tree example (the AllElectronics customer data). The data tuples are described
by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two
distinct values (namely, {yes, no}).

Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The
tuple we wish to classify is X = (age = youth, income = medium, student = yes, credit rating = fair).
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be
computed based on the training tuples:

P(buys computer = yes) = 9/14 = 0.643

P(buys computer = no) = 5/14 = 0.357

To compute P(X|Ci), for i = 1, 2,
Bayesian Classification

we compute the following conditional probabilities:

P(age = youth | buys computer = yes) = 2/9 = 0.222


P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Bayesian Classification

Using these probabilities, we obtain

P(X|buys computer = yes) = P(age = youth | buys computer = yes)× P(income = medium | buys computer =
yes)× P(student = yes | buys computer = yes)× P(credit rating = fair | buys computer = yes)
= 0.222 × 0.444 × 0.667 × 0.667 = 0.044.

Similarly,
P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.

To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute

P(X|buys computer = yes)P(buys computer = yes) = 0.044 × 0.643 = 0.028


P(X|buys computer = no)P(buys computer = no) = 0.019 × 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
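
The whole worked example can be reproduced with a few lines of Python; the probabilities below are copied directly from the computation above.

# Prior and conditional probabilities taken from the worked example above.
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}
X = ["age=youth", "income=medium", "student=yes", "credit=fair"]

scores = {}
for label in prior:
    p = prior[label]
    for item in X:              # class-conditional independence: multiply the probabilities
        p *= cond[label][item]
    scores[label] = p           # P(X|Ci) * P(Ci)

print({k: round(v, 3) for k, v in scores.items()})   # {'yes': 0.028, 'no': 0.007}
print("prediction:", max(scores, key=scores.get))    # buys_computer = yes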
CLASSIFICATION BY BACK PROPAGATION

Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output
units in which each connection has a weight associated with it.

During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct
class label of the input tuples. Neural network learning is also referred to as connectionist learning due to the
connections between units.

Advantages of neural networks:


• High tolerance of noisy data, as well as the ability to classify patterns on which they have not been trained.
They can be used when there is little knowledge of the relationships between attributes and classes.
• They are well suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.
• They have been successful on a wide array of real-world data, including handwritten character recognition,
pathology and laboratory medicine, and training a computer to pronounce English text.
• Neural network algorithms are inherently parallel; parallelization techniques can be used to speed up the
computation process.
CLASSIFICATION BY BACK PROPAGATION
A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural network. It iteratively
learns a set of weights for prediction of the class label of tuples. A multilayer feed-forward neural network
consists of an input layer, one or more hidden layers, and an output layer. An example of a multilayer feed-
forward network is shown in Figure
CLASSIFICATION BY BACK PROPAGATION

Each layer is made up of units. The inputs to the network correspond to the attributes measured for each
training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units,
known as a hidden layer.

The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden
layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden
layer are input to units making up the output layer, which emits the network’s prediction for given tuples.
The units in the input layer are called input units.

The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic
biological basis, or as output units. The multilayer neural network shown in Figure has two layers of output
units. Therefore, we say that it is a two-layer neural network. Similarly, a network containing two hidden
layers is called a three-layer neural network, and so on. It is a feed-forward network since none of the
weights cycles back to an input unit or to a previous layer’s output unit. It is fully connected in that each unit
provides input to each unit in the next forward layer.
CLASSIFICATION BY BACK PROPAGATION

Each output unit takes, as input, a weighted sum of the outputs from units in the previous layer. It applies a
nonlinear (activation) function to the weighted input. Multilayer feed-forward neural networks are able to
model the class prediction as a nonlinear combination of the inputs. From a statistical point of view, they
perform nonlinear regression. Multilayer feed-forward networks, given enough hidden units and enough
training samples, can closely approximate any function.
CLASSIFICATION BY BACK PROPAGATION

Back Propagation
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s
prediction for each tuple with the actual known target value. The target value may be the known class
label of the training tuple (for classification problems) or a continuous value (for numeric prediction).
For each training tuple, the weights are modified so as to minimize the mean squared error between the
network’s prediction and the actual target value. These modifications are made in the “backwards”
direction, that is, from the output layer, through each hidden layer down to the first hidden layer (hence
the name backpropagation).

The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this
is your first look at neural network learning. However, once you become familiar with the process, you
will see that each step is inherently simple. The steps are described next.
CLASSIFICATION BY BACK PROPAGATION

Initialize the weights:


The weights in the network are initialized to small random numbers (e.g., ranging from −1.0 to 1.0, or
−0.5 to 0.5). Each unit has a bias associated with it, as explained later.

The biases are similarly initialized to small random numbers. Each training tuple, X, is processed by
the following steps.

Propagate the inputs forward:


First, the training tuple is fed to the network’s input layer. The inputs pass through the input units,
unchanged. That is, for an input unit, j, its output, Oj , is equal to its input value, Ij .

Next, the net input and output of each unit in the hidden and output layers are computed. The net input
to a unit in the hidden or output layers is computed as a linear combination of its inputs.

The diagram illustrating this computation is not reproduced here.

CLASSIFICATION BY BACK PROPAGATION

Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in
the previous layer. Each connection has a weight.
To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding
weight, and this is summed.
Given a unit j in a hidden or output layer, the net input, Ij, to unit j is

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of
unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves
to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input and then applies an activation function to it,
as illustrated above. The function symbolizes the activation of the neuron represented by the unit. The
logistic, or sigmoid, function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is
computed as

Oj = 1 / (1 + e^(−Ij))
Backpropagate the error: The error is propagated backward by updating the weights and biases to
reflect the error of the network’s prediction. For a unit j in the output layer, the error
Errj is computed by

Errj = Oj (1 − Oj) (Tj − Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
Note that Oj(1-Oj) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected
to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is
the error of unit k.
The weights and biases are updated to reflect the propagated errors. Weights are updated by the
following equations, where Δwij is the change in weight wij:

Δwij = (l) Errj Oi
wij = wij + Δwij

The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0.
Backpropagation learns using a method of gradient descent to search for a set of weights that fits
the training data so as to minimize the mean squared distance between the network’s class
prediction and the known target value of the tuples. The learning rate helps avoid getting stuck at a
local minimum in decision space (i.e., where the weights appear to converge, but are not the
optimum solution) and encourages finding the global minimum. If the learning rate is too small,
then learning will occur at a very slow pace. If the learning rate is too large, then oscillation
between inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/t, where t is
the number of iterations through the training set so far.
Biases are updated by the following equations, where Δθj is the change in bias θj:

Δθj = (l) Errj
θj = θj + Δθj
CLASSIFICATION BY BACK PROPAGATION

Sample calculations for learning by the backpropagation algorithm.

Let the learning rate be 0.9. The initial weight and bias values of the network are given in a table (not
reproduced here), along with the first training tuple, X = (1, 0, 1), whose class label is 1.
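
Below is a minimal sketch of one backpropagation iteration on a three-input, two-hidden-unit, one-output network, with learning rate 0.9 and X = (1, 0, 1), target 1, as in the example above. Since the slide's table of initial weights and biases is not reproduced, the starting values in the code are illustrative assumptions.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Layout: input units 1,2,3 -> hidden units 4,5 -> output unit 6.
# Initial weights/biases are assumed values, not the slide's (missing) table.
w = {(1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5, (1, 5): -0.3, (2, 5): 0.1, (3, 5): 0.2,
     (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
l = 0.9                                   # learning rate (from the slide)
O = {1: 1.0, 2: 0.0, 3: 1.0}              # input units pass X = (1, 0, 1) through unchanged
target = 1.0                              # class label of X

# Propagate the inputs forward: net input I_j = sum_i w_ij * O_i + theta_j, then sigmoid.
for j in (4, 5, 6):
    I_j = sum(w[i, j] * O[i] for i in O if (i, j) in w) + theta[j]
    O[j] = sigmoid(I_j)

# Backpropagate the error.
err = {6: O[6] * (1 - O[6]) * (target - O[6])}             # output unit
for j in (5, 4):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[j, 6]           # hidden units

# Update weights and biases: Delta_w = l * Err_j * O_i, Delta_theta = l * Err_j.
for (i, j) in w:
    w[i, j] += l * err[j] * O[i]
for j in theta:
    theta[j] += l * err[j]

print(f"network output O6 = {O[6]:.3f}, updated w(4,6) = {w[4, 6]:.3f}")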
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING

Associative Classification
Associative classification builds on association rule mining. Association rules are mined in a two-step process consisting of
frequent itemset mining followed by rule generation. The first step searches for patterns of attribute–
value pairs that occur repeatedly in a data set, where each attribute–value pair is considered an item.

The resulting attribute– value pairs form frequent itemsets (also referred to as frequent patterns). The
second step analyzes the frequent itemsets to generate association rules. All association rules must
satisfy certain criteria regarding their “accuracy” (or confidence) and the proportion of the data set that
they actually represent (referred to as support). For example, the following is an association rule mined
from a data set, D, shown with its confidence and support:

age = youth ∧ credit = OK ⇒ buys computer = yes [support = 20%, confidence = 93%]

where ∧ represents a logical “AND.”


In general, associative classification consists of the following steps:
1. Mine the data for frequent itemsets, that is, find commonly occurring attribute–value pairs in the data.
2. Analyze the frequent itemsets to generate association rules per class, which satisfy confidence and
support criteria.
3. Organize the rules to form a rule-based classifier.
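
A minimal sketch of step 3 follows. The first rule below is the example rule given above (support 20%, confidence 93%); the other rules and the default class are made up for illustration. Rules are ordered by decreasing confidence and support, CBA-style, and the first rule whose antecedent is satisfied classifies the tuple.

# (antecedent items, predicted class, support, confidence); only the first rule is from the text.
rules = [
    ({"age=youth", "credit=OK"}, "buys_computer=yes", 0.20, 0.93),
    ({"student=no"},             "buys_computer=no",  0.25, 0.70),
    ({"income=high"},            "buys_computer=yes", 0.15, 0.65),
]

# Order rules by decreasing precedence: confidence first, then support.
classifier = sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)

def classify(tuple_items, default="buys_computer=yes"):
    """Return the class of the first rule whose antecedent is contained in the tuple."""
    for antecedent, label, support, confidence in classifier:
        if antecedent <= tuple_items:      # every antecedent item appears in the tuple
            return label
    return default                         # no rule fires: fall back to a default class

print(classify({"age=youth", "credit=OK", "student=yes"}))   # -> buys_computer=yes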

One of the earliest algorithms for associative classification is CBA (Classification Based on Associations). CBA uses an
iterative approach to frequent itemset mining and a heuristic method to construct the classifier,
where the rules are ordered according to decreasing precedence based on their confidence and support.

CMAR (Classification based on Multiple Association Rules) differs from CBA in its strategy for frequent itemset
mining and its construction of the classifier. CMAR adopts a variant of the FP-growth algorithm to find the
complete set of rules satisfying the minimum confidence and minimum support thresholds, and it employs a tree
structure to store and retrieve rules efficiently and to prune rules based on confidence, correlation, and
database coverage.

Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2),
then R2 is pruned. CMAR also prunes rules for which the rule antecedent and class are not positively
correlated, based on a χ² test of statistical significance.
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING

Using IF-THEN Rules for Classification


Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys computer = yes.
The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition.
The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition
consists of one or more attribute tests (such as age = youth, and student = yes) that are logically
ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting whether a
customer will buy a computer).

R1 can also be written as R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING

If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a
given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied)
and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled
data set, D, let ncovers be the number of tuples covered by R; ncorrect be the number
of tuples correctly classified by R; and |D| be the number of tuples in D. We can define
the coverage and accuracy of R as

coverage(R) = ncovers / |D|

accuracy(R) = ncorrect / ncovers

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute
values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and
see what percentage of them the rule can correctly classify.
CLASSIFICATION BASED ON CONCEPTS FROM ASSOCIATION RULE MINING

Rule accuracy and coverage. Let’s go back to our class-labeled tuples from the
AllElectronics customer database. Our task is to predict whether a customer will buy a computer.
Consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples.

Therefore,

coverage(R1) = 2/14 = 14.28% and


accuracy (R1) = 2/2 = 100%.
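
The two numbers above can be reproduced directly from the definitions; the counts are the ones given in the example.

def evaluate_rule(n_covers, n_correct, n_total):
    """Coverage and accuracy of a rule, following the definitions above."""
    coverage = n_covers / n_total      # tuples covered by the rule / all tuples
    accuracy = n_correct / n_covers    # correctly classified / covered
    return coverage, accuracy

cov, acc = evaluate_rule(n_covers=2, n_correct=2, n_total=14)   # rule R1
print(f"coverage(R1) = {cov:.2%}, accuracy(R1) = {acc:.0%}")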
Prediction

Prediction:
“What if we would like to predict a continuous value, rather than a categorical label?” Numeric
prediction is the task of predicting continuous (or ordered) values for given input. For example, we
may wish to predict the salary of college graduates with 10 years of work experience, or the potential
sales of a new product given its price. By far, the most widely used approach for numeric prediction
(hereafter referred to as prediction) is regression.

Regression analysis can be used to model the relationship between one or more independent or
predictor variables and a dependent or response variable (which is continuous-valued).
Prediction

Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is
the simplest form of regression, and models y as a linear function of x.
That is,
y = b+wx
where the variance of y is assumed to be constant, and b and w are regression coefficients specifying
the Y-intercept and slope of the line, respectively. The regression coefficients, w and b, can also be
thought of as weights, so that we can equivalently write,
y = w0+w1x
These coefficients can be solved for by the method of least squares, which estimates the best-fitting
straight line as the one that minimizes the error between the actual data and the estimate of the line. Let
D be a training set consisting of values of predictor variable, x, for some population and their
associated values for response variable, y. The training set contains |D| data points of the form (x1, y1),
(x2, y2), …, (x|D|, y|D|).

The regression coefficients can be estimated using this method with the following equations:

w1 = Σ (i = 1 to |D|) (xi − x̄)(yi − ȳ) / Σ (i = 1 to |D|) (xi − x̄)²

w0 = ȳ − w1 x̄

where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
Prediction

Linear Regression
Example: The table (not reproduced here) lists paired data in which x is the number of years of work experience
of a college graduate and y is the corresponding salary of the graduate. The 2-D data can be graphed on a
scatter plot. The plot suggests a linear relationship between the two variables, x and y.

We model the relationship that salary may be related to the number of years of work experience with
the equation y = w0+w1x.
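
A minimal sketch of fitting y = w0 + w1 x with the least-squares equations above; the (x, y) pairs are illustrative years-of-experience/salary values, not the slide's table.

def least_squares_fit(xs, ys):
    """Estimate slope w1 and intercept w0 using the least-squares equations above."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]          # years of work experience (illustrative)
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]    # salary in $1000s (illustrative)
w0, w1 = least_squares_fit(xs, ys)
print(f"y = {w0:.1f} + {w1:.1f} x")              # predict a salary with w0 + w1 * x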
Prediction

Nonlinear Regression
“How can we model data that does not show a linear dependence? For example, what if a given
response variable and predictor variable have a relationship that may be modeled by a polynomial
function?” Think back to the straight-line linear regression case above where dependent response
variable, y, is modeled as a linear function of a single independent predictor variable, x.

What if we can get a more accurate model using a nonlinear model, such as a parabola or some other
higher-order polynomial? Polynomial regression is often of interest when there is just one predictor
variable. It can be modeled by adding polynomial terms to the basic linear model. By applying
transformations to the variables, we can convert the nonlinear model into a linear one that can then be
solved by the method of least squares.
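
A short sketch of the variable-transformation idea: a cubic model y = w0 + w1 x + w2 x² + w3 x³ becomes a linear model in the new variables x, x², and x³, which is then solved by least squares. numpy is used here for brevity, and the data points are made up for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # illustrative predictor values
y = np.array([2.1, 3.9, 9.2, 20.5, 38.0, 64.3])   # illustrative responses

# Transform the single predictor into polynomial terms, then fit a *linear* model.
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares solution
print(coeffs)                                     # estimates of w0, w1, w2, w3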
CLASSIFIER ACCURACY

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.

Metrics for evaluating classifier


Basic Terminology
• Positive tuples (P) – tuples of the main class of interest
• Negative tuples (N) – all other tuples
Example: In a two-class problem:
Positive tuples -> buys_computer = yes
Negative tuples -> buys_computer = no (all other tuples)

Classifier Evaluation Metrics: Confusion Matrix


For each tuple, we compare the classifier’s predicted class label with the tuple’s known class label.
This comparison yields the four “building blocks” used in computing many evaluation measures.
CLASSIFIER ACCURACY

Metrics for evaluating classifier

4 “building blocks” used in computing many evaluation measures.

True Positives (TP): The positive tuples that were correctly labeled by the classifier.
TP - Number of true positives.

True Negatives(TN): The negative tuples that were correctly labeled by the classifier.
TN - Number of true negatives.

False Positives (FP): The negative tuples that were incorrectly labeled as positive
E.g., Tuples of class buys computer = no for which the classifier predicted buys computer = yes.
FP - Number of false positives.

False negatives (FN): Positive tuples that were mislabeled as negative


E.g., tuples of class buys computer = yes for which the classifier predicted buys computer = no
FN - Number of false negatives.
CLASSIFIER ACCURACY
Classifier Evaluation Metrics: Confusion Matrix
The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different
classes.
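
A minimal sketch that derives the four building blocks (and accuracy) from predicted versus known labels; the label lists are illustrative.

def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP, FN by comparing predicted labels with known labels."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = ["yes", "yes", "no", "no", "yes", "no"]   # known buys_computer labels
predicted = ["yes", "no",  "no", "yes", "yes", "no"]  # classifier's predictions
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("accuracy =", (tp + tn) / (tp + tn + fp + fn))  # correctly classified / all tuples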
CLASSIFIER ACCURACY

Evaluating the Accuracy of a Classifier or Predictor


Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing
accuracy based on randomly sampled partitions of the given data.

1) Holdout method
In this method, the given data are randomly partitioned into two independent sets, a training set and a
test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is
allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with
the test set as shown in figure.

The estimate is pessimistic because only a portion of the initial data is used to derive
the model.
CLASSIFIER ACCURACY

2) Random subsampling

is a variation of the holdout method in which the holdout method is repeated k times. The overall
accuracy estimate is taken as the average of the accuracies obtained from each iteration.

3) Cross-validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or
“folds,” D1, D2, …, Dk, each of approximately equal size. Training and testing is performed k times. In
iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to
train the model. That is, in the first iteration, subsets D2, … , Dk collectively serve as the training set in
order to obtain a first model, which is tested on D1; the second iteration is trained on subsets
D1, D3, …, Dk and tested on D2; and so on.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples.
That is, only one sample is “left out” at a time for the test set.

In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each
fold is approximately the same as that in the initial data.
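
A minimal sketch of forming the k folds; the actual training and testing calls are left as a comment, and the tuple count of 14 is only illustrative.

import random

def k_fold_indices(n_tuples, k, seed=0):
    """Randomly partition tuple indices into k mutually exclusive folds of similar size."""
    idx = list(range(n_tuples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]        # fold i takes every k-th shuffled index

folds = k_fold_indices(n_tuples=14, k=5)
for i, test_fold in enumerate(folds, start=1):
    train = [j for fold in folds if fold is not test_fold for j in fold]
    # iteration i: train the model on `train`, test it on `test_fold`, record the accuracy
    print(f"iteration {i}: test tuples {sorted(test_fold)}")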
CLASSIFIER ACCURACY

4) Bootstrap
The bootstrap method samples the given training tuples uniformly with replacement. That is, each time a
tuple is selected, it is equally likely to be selected again and re-added to the training set.

A commonly used one is the .632 bootstrap
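
A minimal sketch of one bootstrap iteration. Sampling d tuples with replacement leaves roughly 63.2% of the distinct tuples in the training sample (hence the name); the accuracy weighting shown at the end follows the usual .632 formulation, and the two accuracy figures are placeholders, not results.

import random

def bootstrap_sample(n_tuples, seed=0):
    """Sample n tuples uniformly with replacement; tuples never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_tuples) for _ in range(n_tuples)]   # a tuple may be re-added
    test = [i for i in range(n_tuples) if i not in set(train)]
    return train, test

train_idx, test_idx = bootstrap_sample(1000)
print(len(set(train_idx)) / 1000)        # about 0.632 of the tuples end up in training

# One iteration's contribution under the .632 weighting (placeholder accuracies):
acc_test, acc_train = 0.80, 0.95
print(0.632 * acc_test + 0.368 * acc_train)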
