Unit 4 - Data Mining and Warehousing
Classification
Classification is a technique for determining which class a dependent variable belongs to based on one
or more independent variables. It is a data mining function that assigns items in a
collection to target categories or classes. The goal of classification is to accurately predict the target
class for each case in the data.
For example, a classification model could be used to identify loan applicants as low, medium, or
high credit risks.
There are various applications of classification algorithms, such as:
1. Medical Diagnosis
2. Image and pattern recognition
3. Fault detection
4. Financial market analysis, etc.
There are two forms of data analysis that can be used to extract models describing important
classes or to predict future data trends. These two forms are as follows −
i) Classification
ii) Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment, given their income and occupation.
There are three main approaches to the classification problem (a minimal code sketch follows the list):
1. The first approach divides the space defined by the data points into regions, where each region
corresponds to a given class.
2. The second approach estimates the probability of an example belonging to each class, i.e., P(class | example).
3. The third approach estimates, for each class, the probability that the class would produce that example, i.e., P(example | class).
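As a minimal illustration, the sketch below trains a classifier to assign loan applicants to risk classes (assuming scikit-learn is available; the features, labels, and values are made up purely for illustration):

# A minimal classification sketch (assumption: scikit-learn is installed
# and the tiny loan data set below is purely illustrative).
from sklearn.tree import DecisionTreeClassifier

# Each row: [income_in_thousands, existing_debt_in_thousands]
X = [[30, 20], [80, 10], [45, 40], [90, 5], [25, 30], [70, 15]]
y = ["high", "low", "high", "low", "medium", "low"]   # credit-risk labels

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[50, 25]]))   # predicted risk class for a new applicant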
Statistical-based algorithms
Statistical Distribution-Based Outlier Detection: The statistical distribution-based approach to
outlier detection assumes a distribution or probability model for the given data set (e.g., a normal
or Poisson distribution) and then identifies outliers with respect to the model using a discordancy
test. Applying the test requires knowledge of the data set parameters (such as the assumed data
distribution), knowledge of distribution parameters (such as the mean and variance), and the
expected number of outliers.
The hypothesis is retained if there is no statistically significant evidence supporting its rejection.
A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the
distribution F. Different test statistics have been proposed for use as a discordancy test, depending
on the available knowledge of the data.
Assuming that some statistic T has been chosen for discordancy testing, and that the value of the
statistic for object oi is vi, the distribution of T is constructed. The significance probability
SP(vi) = Prob(T > vi) is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the
working hypothesis is rejected.
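To make this concrete, here is a minimal Python sketch of a discordancy check under an assumed normal distribution F; the z-score statistic T and the fixed cutoff are simplifying assumptions (textbook tests such as Grubbs' test use exact critical values):

# A minimal discordancy-test sketch (assumption: F is normal and a
# simple z-score with a fixed cutoff stands in for an exact test).
import statistics

def discordant(values, candidate, z_cutoff=3.0):
    """Flag candidate if its statistic T = |z| is so large that
    SP(v) = Prob(T > v) would be very small under F."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    z = abs(candidate - mean) / stdev
    return z > z_cutoff

data = [10, 12, 11, 13, 12, 11, 10, 12]
print(discordant(data, 40))   # True: 40 is discordant w.r.t. the model
print(discordant(data, 12))   # False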
An alternative hypothesis, H, which states that oi comes from another distribution model, G, is
adopted. The result is very much dependent on which model F is chosen because oi may be an
outlier under one model and a perfectly valid value under another. The alternative distribution is
very important in determining the power of the test, that is, the probability that the working
hypothesis is rejected when oi is really an outlier. There are different kinds of alternative
distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all the objects come from distribution F is rejected in
favor of the alternative hypothesis that all of the objects arise from another distribution, G:
H : oi ∈ G, where i = 1, 2, …, n
F and G may be different distributions or differ only in parameters of the same distribution.
There are constraints on the form of the G distribution in that it must have potential to produce
outliers. For example, it may have a different mean or dispersion.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population, but
contaminants from some other population, G. In this case, the alternative hypothesis is:
H : oi ∈ (1 − λ)F + λG, where i = 1, 2, …, n
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-
trees or k-d trees, to search for neighbors of each object o within radius dmin around that object.
Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore,
once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a
worst-case complexity of O(k·n²), where n is the number of objects in the data set and k is the
dimensionality. The index-based algorithm scales well as k increases. However, this complexity
evaluation takes only the search time into account, even though the task of building an index can
itself be computationally intensive.
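The sketch below illustrates the index-based idea, assuming SciPy's cKDTree serves as the multidimensional index; the dmin and M values are illustrative:

# A minimal index-based outlier sketch (assumptions: SciPy is installed
# and a k-d tree is the index; R-trees would work the same way).
import numpy as np
from scipy.spatial import cKDTree

def distance_outliers(points, dmin, M):
    """An object is flagged if fewer than M other objects lie
    within radius dmin of it."""
    tree = cKDTree(points)                   # build the spatial index
    outliers = []
    for i, p in enumerate(points):
        # query_ball_point returns indices within dmin (including p itself)
        neighbors = tree.query_ball_point(p, dmin)
        if len(neighbors) - 1 < M:           # exclude the point itself
            outliers.append(i)
    return outliers

pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]])
print(distance_outliers(pts, dmin=1.0, M=2))   # [4]: the isolated point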
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based algorithm
but avoids index structure construction and tries to minimize the number of I/Os. It divides the
memory buffer space into two halves and the data set into several logical blocks. By carefully
choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.
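A minimal in-memory sketch of the nested-loop idea follows; plain Python lists stand in for disk blocks, and the block I/O scheduling that the real algorithm optimizes is omitted:

# A minimal nested-loop outlier sketch (assumption: 1-D values and
# in-memory lists as blocks; real implementations schedule disk I/O).
def nested_loop_outliers(blocks, dmin, M):
    """Count the dmin-neighbors of each object by pairwise block passes."""
    outliers = []
    for block_a in blocks:                   # first half of the buffer
        counts = [0] * len(block_a)
        for block_b in blocks:               # second half of the buffer
            for i, a in enumerate(block_a):
                for b in block_b:
                    if a is not b and abs(a - b) <= dmin:
                        counts[i] += 1
        outliers += [a for i, a in enumerate(block_a) if counts[i] < M]
    return outliers

blocks = [[1.0, 1.1, 1.2], [0.9, 1.05, 9.0]]
print(nested_loop_outliers(blocks, dmin=0.5, M=2))   # [9.0]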
Neural network-based algorithms:
Neural network classifiers learn by iteratively adjusting connection weights (e.g., via perceptron
training or backpropagation) to reduce classification error on the training data.
➢ Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
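As a concrete, simplified stand-in for that weight-learning process, the perceptron sketch below adjusts its weights on every misclassified example and stops as soon as an epoch passes with no updates (backpropagation generalizes the same idea to multilayer networks):

# A minimal perceptron sketch (assumption: a single-layer perceptron
# stands in for the neural network learning described above).
def train_perceptron(samples, epochs=100, lr=0.1):
    """samples: list of (features, label) pairs with label in {-1, +1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        changed = False
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                     # misclassified: adjust weights
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                changed = True
        if not changed:                       # weights converged; stop
            break
    return w, b

data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]   # logical AND
print(train_perceptron(data))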
Rule-based algorithms:
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification.
Rule Format:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buys_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests are
logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) → (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied and the rule is said to cover the tuple.
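A minimal Python sketch of such a rule-based classifier follows; the second rule and the default class are made up for illustration, mirroring rule R1 above:

# A minimal IF-THEN rule classifier sketch (assumption: the second rule
# and the default class are illustrative additions).
rules = [
    # (antecedent as attribute tests, consequent)
    ({"age": "youth", "student": "yes"}, "buys_computer = yes"),   # rule R1
    ({"age": "senior", "credit": "fair"}, "buys_computer = no"),
]

def classify(tuple_, rules, default="buys_computer = no"):
    for antecedent, consequent in rules:
        # the rule fires when every ANDed attribute test holds
        if all(tuple_.get(attr) == val for attr, val in antecedent.items()):
            return consequent
    return default                       # no rule covers the tuple

print(classify({"age": "youth", "student": "yes"}, rules))
# -> buys_computer = yes (the tuple satisfies the antecedent of R1)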
Rule Extraction:
A rule-based classifier can be built by extracting IF-THEN rules from a decision tree; a short sketch follows the points below.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
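As an illustration, the sketch below fits a small decision tree and prints its root-to-leaf paths, each of which reads as one IF-THEN rule (assuming scikit-learn; the tiny data set is made up):

# A minimal rule-extraction sketch (assumption: scikit-learn's
# export_text is used to display the tree's root-to-leaf paths).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [0, 0], [1, 1], [1, 0]]   # features: [age_is_youth, student]
y = ["no", "no", "yes", "no"]          # class label: buys_computer

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["age_is_youth", "student"]))
# Each printed path is one rule: the ANDed split conditions form the
# antecedent, and the leaf's class forms the consequent.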
Rule Induction Using Sequential Covering Algorithm:
The sequential covering algorithm can be used to extract IF-THEN rules from the training data. It
does not require generating a decision tree first. In this algorithm, each rule for a given class covers
many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the
rule are removed, and the process continues for the remaining tuples. This is in contrast to decision
tree induction: because the path to each leaf in a decision tree corresponds to a rule, decision tree
induction can be viewed as learning a set of rules simultaneously.
The following is the sequential learning algorithm, where rules are learned for one class at a time.
When learning a rule for a class Ci, we would like the rule to cover all the tuples of class Ci and no
tuple from any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A Set of IF-THEN rules.
Method:
Rule_set = { };   // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;   // add the new rule to the rule set
    until terminating condition;
end for
return Rule_set;
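The following Python sketch implements a simplified version of this strategy; its greedy learn_one_rule adds one attribute test at a time, which is a simplification of what AQ, CN2, or RIPPER actually do:

# A minimal sequential-covering sketch (assumption: a greedy
# learn_one_rule that picks the single best attribute test per step).
def rule_accuracy(covered, target_class):
    if not covered:
        return 0.0
    return sum(r["class"] == target_class for r in covered) / len(covered)

def learn_one_rule(rows, target_class):
    """Greedily grow a conjunctive rule (dict of attribute tests)."""
    rule, candidates = {}, rows
    while True:
        best, best_acc = None, rule_accuracy(candidates, target_class)
        attrs = set(k for r in candidates for k in r if k != "class")
        for attr in attrs - set(rule):
            for val in set(r[attr] for r in candidates if attr in r):
                covered = [r for r in candidates if r.get(attr) == val]
                acc = rule_accuracy(covered, target_class)
                if covered and acc > best_acc:
                    best, best_acc = (attr, val), acc
        if best is None:                  # no test improves the rule: stop
            return rule
        rule[best[0]] = best[1]
        candidates = [r for r in candidates if r.get(best[0]) == best[1]]
        if best_acc == 1.0:               # the rule is already pure
            return rule

def sequential_covering(rows, target_class):
    rule_set, remaining = [], list(rows)
    while any(r["class"] == target_class for r in remaining):
        rule = learn_one_rule(remaining, target_class)
        if not rule:
            break                         # cannot learn a useful rule
        rule_set.append(rule)
        remaining = [r for r in remaining
                     if not all(r.get(a) == v for a, v in rule.items())]
    return rule_set

data = [
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "yes", "class": "yes"},
    {"age": "senior", "student": "no",  "class": "no"},
]
print(sequential_covering(data, "yes"))   # e.g. [{'student': 'yes'}]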
Probabilistic Classifiers:
A Bayes classifier is a probabilistic model that is used for supervised learning. A Bayes classifier
is based on the idea that the role of a class is to predict the values of features for members of that
class. Examples are grouped in classes because they have common values for some of the features.
Such classes are often called natural kinds. The learning agent learns how the features depend on
the class and uses that model to predict the classification of a new example.
The simplest case is the naive Bayes classifier, which makes the independence assumption
that the input features are conditionally independent of each other given the classification. The
independence of the naive Bayes classifier is embodied in a belief network where the features are
the nodes, the target feature (the classification) has no parents, and the target feature is the only
parent of each input feature. This belief network requires the probability distribution P(Y) for the
target feature, or class, Y, and P(Xi | Y) for each input feature Xi. A prediction is computed by
conditioning on observed values for the input features and querying the classification: by Bayes'
rule, P(Y | X1, …, Xn) is proportional to P(Y) · P(X1 | Y) · … · P(Xn | Y). Multiple target variables
can be modeled and learned separately.
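A minimal naive Bayes sketch over categorical features follows; training estimates P(Y) and P(Xi | Y) from counts, and prediction multiplies them (the Laplace-style smoothing and the toy rows are illustrative choices, not part of the definition):

# A minimal naive Bayes sketch (assumption: categorical features given
# as dicts with a 'class' key; the smoothing constants are illustrative).
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """Estimate class counts (for P(Y)) and per-class value counts
    (for P(Xi | Y)) from labeled rows."""
    class_counts = Counter(r["class"] for r in rows)
    cond_counts = defaultdict(Counter)    # (feature, class) -> value counts
    for r in rows:
        for feat, val in r.items():
            if feat != "class":
                cond_counts[(feat, r["class"])][val] += 1
    return class_counts, cond_counts

def predict(class_counts, cond_counts, example):
    """Return argmax over y of P(y) * product_i P(x_i | y)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for y, ny in class_counts.items():
        score = ny / total                          # prior P(y)
        for feat, val in example.items():
            counts = cond_counts[(feat, y)]
            # add-one (Laplace) smoothing avoids zero probabilities
            score *= (counts[val] + 1) / (ny + len(counts) + 1)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

rows = [
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "no",  "class": "no"},
]
cc, fc = train_naive_bayes(rows)
print(predict(cc, fc, {"age": "youth", "student": "yes"}))   # -> yes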
-----------------------------***-----------------------------