UNIT-IV
DATA CLASSIFICATION (Alternative Techniques)
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky. Such analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine learning, pattern
recognition, and statistics.
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive
to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive
assumption of class conditional independence is made. This presumes that the values
of the attributes are conditionally independent of one another, given the class label of
the tuple. Thus,
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)
5. We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the
training tuples.
6. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Ak is continuous-valued, slightly more work is needed: Ak is typically assumed to
follow a Gaussian (normal) distribution, with mean μCi and standard deviation σCi
estimated from the values of Ak for the training tuples of class Ci, so that
P(xk|Ci) = g(xk, μCi, σCi). Both cases are sketched in the code below.
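The per-attribute estimates in steps 5 and 6 are mechanical enough to sketch in code. The following is a minimal Python sketch, not from the text: the helper names (gaussian, categorical_likelihood, continuous_likelihood) are ours. Categorical attributes use relative-frequency counts within a class; continuous attributes use the Gaussian density with the class’s sample mean and standard deviation.

```python
import math

def gaussian(x, mu, sigma):
    """The density g(x, mu, sigma) used for continuous attributes."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def categorical_likelihood(class_tuples, attr, value):
    """P(xk|Ci): fraction of class-Ci tuples having `value` for attribute `attr`."""
    matches = sum(1 for t in class_tuples if t[attr] == value)
    return matches / len(class_tuples)

def continuous_likelihood(class_tuples, attr, value):
    """P(xk|Ci) under the Gaussian assumption, with mu and sigma from class Ci."""
    xs = [t[attr] for t in class_tuples]
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))  # population std
    return gaussian(value, mu, sigma)
```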
Example:
We wish to predict the class label of a tuple using naïve Bayesian classification, given
the training data shown above in the table. The data tuples are described by the
attributes age, income, student, and credit_rating. The class label
attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the
class buys computer=yes and C2 correspond to buys computer=no. The tuple we wish to
classify is
X = {age = “youth”, income = “medium”, student = “yes”, credit_rating = “fair”}
We need to maximize P(X|Ci)P(Ci) for i = 1, 2. P(Ci), the prior probability of each
class, can be computed as the fraction of training tuples belonging to that class:
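The worked numbers depend on the training table, which is not reproduced here; the sketch below instead runs the whole procedure on a small hypothetical training set (the tuples, and therefore the resulting probabilities, are illustrative only). It computes P(Ci) and the product P(X|Ci)P(Ci) for each class and returns the maximizing class.

```python
from collections import Counter

# Hypothetical training tuples -- stand-ins for the table referenced above.
train = [
    {"age": "youth",  "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "youth",  "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "senior", "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "senior", "income": "low",    "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
]

X = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}

def classify(X, train, label="buys_computer"):
    class_counts = Counter(t[label] for t in train)
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        class_tuples = [t for t in train if t[label] == c]
        prior = n_c / len(train)                    # P(Ci)
        likelihood = 1.0                            # P(X|Ci) = prod_k P(xk|Ci)
        for attr, value in X.items():
            likelihood *= sum(1 for t in class_tuples if t[attr] == value) / n_c
        score = likelihood * prior                  # P(X|Ci) P(Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

print(classify(X, train))  # ('yes', 0.25) with this toy data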
Bayesian Belief Networks
A Bayesian belief network represents a set of variables as nodes in a directed acyclic
graph, where an arc from one node to another means that the first variable directly
influences the second. For example, having lung cancer is influenced by a person’s family
history of lung cancer, as well as whether or not the person is a smoker. Note that the
variable PositiveXRay is independent of whether the patient has a family history of lung
cancer or is a smoker, given that we know the patient has lung cancer.
In other words, once we know the outcome of the variable LungCancer, then the
variables FamilyHistory and Smoker do not provide any additional information regarding
PositiveXRay. The arcs also show that the variable LungCancer is conditionally independent
of Emphysema, given its parents, FamilyHistory and Smoker.
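These conditional independencies are what the network exploits: the joint probability of any tuple X = (x1, …, xn) factors over the graph, with Yi denoting the node for attribute xi, as

P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{Parents}(Y_i)\big)

where each factor is supplied by the conditional probability tables described next.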
A belief network has one conditional probability table (CPT) for each variable.
The CPT for a variable Y specifies the conditional distribution P(Y|Parents(Y)), where
Parents(Y) are the parents of Y. Figure (b) shows a CPT for the variable LungCancer. The
conditional probability for each known value of LungCancer is given for each possible
combination of the values of its parents. For instance, the upper leftmost entry gives
P(LungCancer = “yes” | FamilyHistory = “yes”, Smoker = “yes”), and the bottom
rightmost entry gives P(LungCancer = “no” | FamilyHistory = “no”, Smoker = “no”).
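In code, a CPT is just a lookup table keyed by the parents’ values. A minimal Python sketch (the node names and the probabilities below are illustrative, not taken from the figure):

```python
# Hypothetical CPT for LungCancer given (FamilyHistory, Smoker).
# Keys are (family_history, smoker); values are P(LungCancer = "yes" | parents).
cpt_lung_cancer = {
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no",  "yes"): 0.7,
    ("no",  "no"):  0.1,
}

def p_lung_cancer(value, family_history, smoker):
    """P(LungCancer = value | FamilyHistory, Smoker), read from the CPT."""
    p_yes = cpt_lung_cancer[(family_history, smoker)]
    return p_yes if value == "yes" else 1.0 - p_yes

# Bottom-rightmost-style entry: P(LungCancer = "no" | FH = "no", Smoker = "no").
print(p_lung_cancer("no", "no", "no"))  # 0.9 with these illustrative numbers
```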