Unit-III Classification
3.1 Classification
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.
The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.
(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
Although most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce confusion during learning.
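As a small illustration of this step, the sketch below fills missing values with the most commonly occurring value of the attribute; the use of pandas, the toy DataFrame, and the column names are assumptions made only for the example.

```python
import pandas as pd

# Hypothetical toy data: the 'occupation' attribute has a missing value.
df = pd.DataFrame({
    "income":     [30000, 45000, 52000, 61000, 28000],
    "occupation": ["clerk", "engineer", None, "engineer", "clerk"],
})

# Replace each missing value with the most commonly occurring value
# (the mode) of that attribute.
most_common = df["occupation"].mode()[0]
df["occupation"] = df["occupation"].fillna(most_common)
print(df)
```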
(ii) Relevance analysis:
Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes are statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task.
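To make the idea concrete, the sketch below computes pairwise correlations and flags strongly correlated attribute pairs as candidates for removal; the use of pandas, the attribute names, and the 0.95 threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical numeric attributes; A1 and A2 are nearly redundant.
df = pd.DataFrame({
    "A1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "A2": [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * A1
    "A3": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Pairwise (Pearson) correlation matrix.
corr = df.corr()

# Flag attribute pairs whose absolute correlation exceeds a threshold;
# one attribute of each such pair is a candidate for removal.
threshold = 0.95
for i, a in enumerate(df.columns):
    for b in df.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} are strongly correlated "
                  f"(r = {corr.loc[a, b]:.2f}); consider dropping one.")
```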
(iii) Data transformation and reduction:
Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1 to +1 or 0 to 1.
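A minimal sketch of such min-max normalization, assuming plain Python lists of numeric attribute values (the function name and the sample incomes are hypothetical):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values into [new_min, new_max] (assumes not all values are equal)."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]

incomes = [28000, 30000, 45000, 52000, 61000]
print(min_max_normalize(incomes))          # values in [0, 1]
print(min_max_normalize(incomes, -1, 1))   # values in [-1, +1]
```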
Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.
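As one example of such a discretization technique, the sketch below performs simple equal-width binning; the function name, the number of bins, and the sample ages are illustrative assumptions.

```python
def equal_width_bins(values, k):
    """Discretize numeric values into k equal-width bins labeled 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value is placed in the last bin.
    return [min(int((v - lo) // width), k - 1) for v in values]

ages = [23, 25, 31, 38, 42, 47, 55, 63, 70]
print(equal_width_bins(ages, 3))   # [0, 0, 0, 0, 1, 1, 2, 2, 2]
```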
3.1.2 Comparing Classification and Prediction Methods:
Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information).
The accuracy of a predictor refers to how well a given predictor can guess the value of the
predicted attribute for new or previously unseen data.
Speed:
This refers to the computational costs involved in generating and using the given classifier or predictor.
Robustness:
This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
Scalability:
This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or predictor.
3.2 Decision Tree Induction:
The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
The learning and classification steps of decision tree induction are simple and fast.
Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Attribute list is the set of candidate attributes describing the tuples in the data partition D.
Attribute selection method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class.
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class.
All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes.
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.
1 A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known values of A.
A branch is created for each known value, aj, of A and labeled with that value.
2 A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split point and A > split point, respectively, where split point is the split-point returned by Attribute selection method as part of the splitting criterion.
3 A is discrete-valued and a binary tree must be produced:
The test at node N is of the form "A ∈ SA?". SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset of the known values of A.
Figure: the three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree must be produced.
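The following sketch illustrates decision tree induction in practice using scikit-learn (a library choice assumed here, not prescribed by these notes); the continuous attributes are split with binary "<= split point" tests, as in scenario 2, and the toy tuples and attribute names are made up for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples described by two continuous attributes.
X = [[25, 30000], [35, 45000], [45, 52000],
     [52, 61000], [23, 28000], [40, 70000]]   # [age, income]
y = ["no", "yes", "yes", "yes", "no", "yes"]  # class: buys_computer

# Attribute selection by information gain (entropy), with a small depth limit.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(X, y)

# Each internal node tests one attribute against a split point.
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 40000]]))  # classify a previously unseen tuple
```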
3.3 Bayesian Classification:
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X): P(H|X) = P(X|H) P(H) / P(X).
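As a worked illustration of this calculation (the probabilities below are made-up numbers, not taken from any data set):

```python
# Hypothetical values for a hypothesis H ("tuple X belongs to class C").
P_H = 0.4          # prior probability P(H)
P_X_given_H = 0.3  # likelihood P(X|H)
P_X = 0.2          # evidence P(X)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
P_H_given_X = P_X_given_H * P_H / P_X
print(P_H_given_X)   # 0.6
```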
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci). Otherwise,
we maximize P(X|Ci)P(Ci).
4. To compute P(X|Ci), the naïve assumption of class-conditional independence is made, so that P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci), where each P(xk|Ci) is estimated from the training tuples of class Ci. If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward: Ak is typically assumed to follow a Gaussian (normal) distribution whose mean and standard deviation are estimated from the tuples of class Ci (see the sketch after this list).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i.
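The sketch below puts steps 1 to 5 together for a toy data set; the attribute names, the tuples, and the Gaussian model for the continuous attribute are assumptions made only for illustration, and no Laplacian correction is applied to zero counts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training tuples: (age_group, income); class label: buys_computer.
train = [
    (("youth",  30000), "no"),
    (("youth",  45000), "no"),
    (("middle", 52000), "yes"),
    (("senior", 40000), "yes"),
    (("senior", 61000), "yes"),
    (("middle", 48000), "yes"),
]

labels = [label for _, label in train]
priors = {c: n / len(train) for c, n in Counter(labels).items()}   # P(Ci)

# Counts for the categorical attribute (index 0), used for P(xk|Ci).
cat_counts = defaultdict(Counter)
for (age_group, _), label in train:
    cat_counts[label][age_group] += 1

def gaussian(x, mean, std):
    """P(x|Ci) under a Gaussian model of the continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Per-class mean and standard deviation of the continuous attribute (index 1).
stats = {}
for c in priors:
    vals = [income for (_, income), label in train if label == c]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    stats[c] = (mean, std)

def predict(age_group, income):
    """Return the class Ci maximizing P(X|Ci) * P(Ci)."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        n_c = sum(1 for l in labels if l == c)
        p_cat = cat_counts[c][age_group] / n_c   # P(age_group | Ci)
        p_num = gaussian(income, *stats[c])      # P(income | Ci)
        score = p_cat * p_num * prior
        if score > best_score:
            best, best_score = c, score
    return best

print(predict("youth", 35000))   # classify a previously unseen tuple
```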
Classifier Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier, that is, it reflects how well the classifier recognizes tuples of the various
classes.
True positives are the positive tuples that were correctly labeled by the classifier, and true negatives are the negative tuples that were correctly labeled.
False positives are the negative tuples that were incorrectly labeled as positive, and false negatives are the positive tuples that were incorrectly labeled as negative.
To assess how well the classifier can recognize the tuples of each class, the sensitivity (true positive rate) and specificity (true negative rate) measures can be used.
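A short sketch of these measures (the helper name, the choice of positive label, and the toy label lists are assumptions for illustration):

```python
def evaluate(y_true, y_pred, positive="yes"):
    """Accuracy, sensitivity, and specificity from true vs. predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy    = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)       # true positive (recognition) rate
    specificity = tn / (tn + fp)       # true negative (recognition) rate
    return accuracy, sensitivity, specificity

# Hypothetical test-set labels and classifier predictions.
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no"]
print(evaluate(y_true, y_pred))   # (0.666..., 0.666..., 0.666...)
```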