
Unit-III

3.1 Classification
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.

Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.

For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.

A predictor is constructed that predicts a continuous-valued function, or ordered value, as opposed to a categorical label.

Regression analysis is a statistical methodology that is most often used for numeric
prediction.

Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.

Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.

3.1.1 Issues Regarding Classification and Prediction:

1. Preparing the Data for Classification and Prediction:

The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.
(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and the treatment of missing values (e.g., by replacing a missing value
with the most commonly occurring value for that attribute, or with the most probable value
based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or missing
data, this step can help reduce confusion during learning.
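
As a concrete illustration of the missing-value treatment described above, here is a minimal sketch that fills missing entries of a categorical attribute with its most frequent value using pandas; the column name income and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical training data with a missing value in the categorical attribute "income"
data = pd.DataFrame({
    "income": ["medium", "high", None, "medium", "low"],
    "class":  ["safe", "safe", "risky", "safe", "risky"],
})

# Replace the missing value with the most commonly occurring value (the mode)
most_common = data["income"].mode()[0]
data["income"] = data["income"].fillna(most_common)

print(data)
```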

(ii) Relevance analysis:
Many of the attributes in the data may be redundant.

Correlation analysis can be used to identify whether any two given attributes are statistically related.

For example, a strong correlation between attributes A1 and A2 would suggest that one of the
two could be removed from further analysis.

A database may also contain irrelevant attributes. Attribute subset selection can be used in
these cases to find a reduced set of attributes such that the resulting probability distribution of
the data classes is as close as possible to the original distribution obtained using all attributes.

Hence, relevance analysis, in the form of correlation analysis and attribute subset selection,
can be used to detect attributes that do not contribute to the classification or prediction task.

Such analysis can help improve classification efficiency and scalability.
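
To make the correlation analysis step concrete, here is a small sketch that computes pairwise correlations between numeric attributes with pandas and flags strongly correlated pairs as candidates for removal; the attribute names and the 0.9 cutoff are illustrative assumptions, not values prescribed by the text.

```python
import pandas as pd

# Hypothetical numeric attributes A1, A2, A3 for a handful of training tuples
df = pd.DataFrame({
    "A1": [20, 35, 50, 65, 80],
    "A2": [21, 34, 52, 63, 81],   # nearly identical to A1 (redundant)
    "A3": [ 5, 40, 10, 70, 30],
})

corr = df.corr()        # Pearson correlation matrix
threshold = 0.9         # assumed cutoff for a "strong" correlation

# Report attribute pairs whose absolute correlation exceeds the threshold
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} are strongly correlated ({corr.loc[a, b]:.2f}); one could be removed")
```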

(iii) Data Transformation and Reduction:

The data may be transformed by normalization, particularly when neural networks or methods
involving distance measurements are used in the learning step.

Normalization involves scaling all values for a given attribute so that they fall within a small
specified range, such as -1 to +1 or 0 to 1.
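
A brief sketch of min-max normalization, which rescales an attribute into the 0-to-1 range mentioned above; the attribute values are made up for illustration.

```python
import numpy as np

# Hypothetical raw values of a numeric attribute (e.g., income in thousands)
values = np.array([12.0, 45.0, 30.0, 98.0, 60.0])

# Min-max normalization: scale every value into the range [0, 1]
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)

print(normalized)   # each value now lies between 0 and 1
```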

The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes.

For example, numeric values for the attribute income can be generalized to discrete ranges,
such as low, medium, and high. Similarly, categorical attributes, like street, can be
generalized to higher-level concepts, like city.
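
The sketch below illustrates the income example: numeric values are generalized to the discrete ranges low, medium, and high. The cut-off points (40 and 80, in thousands) are assumptions chosen only for illustration.

```python
# Generalize numeric income values (in thousands) to higher-level concepts
def generalize_income(income):
    # Assumed cut-offs: below 40 -> low, 40-79 -> medium, 80 and above -> high
    if income < 40:
        return "low"
    elif income < 80:
        return "medium"
    else:
        return "high"

incomes = [12, 45, 30, 98, 60]
print([generalize_income(x) for x in incomes])   # ['low', 'medium', 'low', 'high', 'medium']
```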

Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such as
binning, histogram analysis, and clustering.
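
As one small example of the discretization techniques listed above, the following sketch performs equal-width binning with numpy; the number of bins and the data are assumptions for illustration.

```python
import numpy as np

# Hypothetical continuous attribute values
values = np.array([7, 12, 45, 30, 98, 60, 75, 22])

# Equal-width binning into 3 bins (assumed bin count)
bins = np.linspace(values.min(), values.max(), num=4)   # 3 bins => 4 edges
bin_ids = np.digitize(values, bins[1:-1])               # assign each value to bin 0, 1 or 2

print(bins)     # bin edges
print(bin_ids)  # bin index of each value
```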
3.1.2 Comparing Classification and Prediction Methods:
Accuracy:

The accuracy of a classifier refers to the ability of a given classifier to correctly predict the
class label of new or previously unseen data (i.e., tuples without class label information).

The accuracy of a predictor refers to how well a given predictor can guess the value of the
predicted attribute for new or previously unseen data.

Speed:

This refers to the computational costs involved in generating and using the given classifier or
predictor.
Robustness:

This is the ability of the classifier or predictor to make correct predictions given noisy data or
data with missing values.
Scalability:

This refers to the ability to construct the classifier or predictor efficiently given large amounts
of data.
Interpretability:

This refers to the level of understanding and insight that is provided by the classifier or
predictor.

Interpretability is subjective and therefore more difficult to assess.

3.2 Classification by Decision Tree Induction:


Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure, where:

Each internal node denotes a test on an attribute.

Each branch represents an outcome of the test.

Each leaf node holds a class label.

The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.

Decision trees can handle high-dimensional data.

Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans.

The learning and classification steps of decision tree induction are simple and fast.

In general, decision tree classifiers have good accuracy.

Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.
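
Before the induction algorithm itself, here is a brief sketch of building and using a decision tree classifier with scikit-learn; the tiny loan-style dataset and its numeric encoding are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples: [age, income] encoded numerically, with class labels
X = [[25, 30], [42, 80], [35, 45], [50, 90], [23, 20], [44, 60]]
y = ["risky", "safe", "risky", "safe", "risky", "safe"]

# Learn a decision tree from the class-labeled training tuples
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# The learned tree is a flowchart-like structure of attribute tests
print(export_text(tree, feature_names=["age", "income"]))

# Classify a new, previously unseen tuple
print(tree.predict([[30, 75]]))
```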

3.2.1 Algorithm For Decision Tree Induction:

The algorithm is called with three parameters:


Data partition

Attribute list

Attribute selection method


The parameter attribute list is a list of attributes describing the tuples.

Attribute selection method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class.

The tree starts as a single node, N, representing the training tuples in D.

If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that
class.

All of the terminating conditions are explained at the end of the algorithm.

Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the "best" way
to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values, {a1,
a2, ..., av}, based on the training data.

1. A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known values of A.

A branch is created for each known value, aj, of A and labeled with that value.

A need not be considered in any future partitioning of the tuples.

2. A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions A
<= split point and A > split point, respectively, where split point is the split-point returned by
Attribute selection method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:

The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by
Attribute selection method as part of the splitting criterion. It is a subset of the known values
of A.

[Figure: the three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree must be produced.]
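
The following Python sketch mirrors the algorithm just described. It is a simplified, hedged rendering (not the textbook's exact procedure): it grows the tree recursively, relies on an assumed attribute_selection_method helper, and handles only the discrete-valued splitting scenario.

```python
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a list of (tuple_dict, class_label) pairs; returns a nested dict tree.

    Simplified sketch: only the discrete-valued splitting scenario is handled,
    and attribute_selection_method is assumed to return the 'best' attribute.
    """
    labels = [label for _, label in D]

    # Terminating condition: all tuples in D belong to the same class -> leaf
    if len(set(labels)) == 1:
        return labels[0]

    # Terminating condition: no attributes remain -> leaf labeled by majority class
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]

    # Otherwise, let the selection method pick the splitting attribute A
    A = attribute_selection_method(D, attribute_list)
    node = {A: {}}
    remaining = [a for a in attribute_list if a != A]   # A is not reused (discrete case)

    # One branch per known value aj of A, each grown on its partition Dj
    for aj in {x[A] for x, _ in D}:
        Dj = [(x, label) for x, label in D if x[A] == aj]
        node[A][aj] = generate_decision_tree(Dj, remaining, attribute_selection_method)

    return node
```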
3.3 Bayesian Classification:
Bayesian classifiers are statistical classifiers.

They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.

Bayesian classification is based on Bayes’ theorem.

3.3.1 Bayes’ Theorem:


Let X be a data tuple. In Bayesian terms, X is considered "evidence", and it is described by
measurements made on a set of n attributes.

Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.

For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the "evidence" or observed data tuple X.

P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.

Bayes' theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
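
Written out as a display equation, Bayes' theorem relates these quantities as:

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}
```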

3.3.2 Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements
made on the tuple from n attributes, respectively, A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on
X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and
only if

P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise,
we maximize P(X|Ci)P(Ci).

4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive
assumption of class-conditional independence is made. This presumes that the values of the
attributes are conditionally independent of one another, given the class label of the tuple.
Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-valued.
For instance, to compute P(X|Ci), we consider the following:

If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk
for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty
straightforward.

A continuous-valued attribute is typically assumed to have a Gaussian distribution with a
mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (sqrt(2π) σ)) exp(-(x - μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i.
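
A small end-to-end sketch of the counting-based estimates described in steps 1-5, for categorical attributes only; the tiny dataset, the attribute names, and the absence of any smoothing for zero counts are all simplifying assumptions.

```python
from collections import defaultdict

# Hypothetical training tuples (categorical attributes) with class labels
D = [
    ({"income": "high",   "student": "no"},  "no"),
    ({"income": "high",   "student": "no"},  "no"),
    ({"income": "medium", "student": "no"},  "yes"),
    ({"income": "low",    "student": "yes"}, "yes"),
    ({"income": "medium", "student": "yes"}, "yes"),
]

# Estimate the prior P(Ci) and the conditionals P(xk|Ci) by counting
class_counts = defaultdict(int)
value_counts = defaultdict(int)           # keyed by (class, attribute, value)
for x, c in D:
    class_counts[c] += 1
    for attr, val in x.items():
        value_counts[(c, attr, val)] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(D)              # prior P(Ci)
        for attr, val in x.items():       # naive independence: multiply the P(xk|Ci)
            score *= value_counts[(c, attr, val)] / n_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"income": "medium", "student": "yes"}))   # expected: "yes"
```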
Classifier Accuracy:

The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.

In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier, that is, it reflects how well the classifier recognizes tuples of the various
classes.

The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is
the accuracy of M.
The confusion matrix is a useful tool for analyzing how well your classifier can recognize
tuples of different classes.

True positives refer to the positive tuples that were correctly labeled by the classifier.

True negatives are the negative tuples that were correctly labeled by the classifier.

False positives are the negative tuples that were incorrectly labeled as positive.

To evaluate how well the classifier can recognize tuples of each class, the sensitivity and
specificity measures can be used.

Accuracy is a function of sensitivity and specificity:

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity × (pos / (pos + neg)) + specificity × (neg / (pos + neg))

where t_pos is the number of true positives,
pos is the number of positive tuples,
t_neg is the number of true negatives,
neg is the number of negative tuples, and
f_pos is the number of false positives.
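
A brief sketch that computes these counts and measures directly from predicted and actual labels; the label lists are invented, and "yes" is assumed to be the positive class.

```python
# Hypothetical actual and predicted class labels ("yes" is the positive class)
actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

t_pos = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
t_neg = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "no")
f_pos = sum(1 for a, p in zip(actual, predicted) if a == "no"  and p == "yes")
pos   = actual.count("yes")
neg   = actual.count("no")

sensitivity = t_pos / pos
specificity = t_neg / neg
accuracy    = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```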
