
MODULE 3

Classification and Prediction


• There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends:
Classification
Prediction
• Classification models predict categorical (class) labels, whereas prediction models predict continuous-valued functions.
• For example, we can build a classification model to categorize bank
loan applications as either safe or risky, or a prediction model to
predict the expenditures in dollars of potential customers on
computer equipment given their income and occupation.
What is classification?
• Classification classifies data (i.e., constructs a model) based on a training set and the values (class labels) of a classifying attribute, and then uses the model to classify new data.
• The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze the profile of a given customer to predict whether that customer will buy a new computer.
• In both of the above examples, a model or classifier is constructed to
predict the categorical labels. These labels are risky or safe for loan
application data and yes or no for marketing data.
What is prediction?
• Prediction models continuous-valued functions, i.e., it predicts unknown or missing numeric values.
• The following is an example of a case where the data analysis task is prediction:
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example, we are asked to predict a numeric value. Therefore, the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
How Does Classification Work?

• The Data Classification process includes two steps −


1. Building the Classifier or Model
2. Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and
their associated class labels.
• Each tuple that constitutes the training set is assumed to belong to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
The data classification process: (a) Learning: Training data are analyzed by a
classification algorithm. Here, the class label attribute is loan decision, and the
learned model or classifier is represented in the form of classification rules.
Using Classifier for Classification
• In this step, the classifier is used for classification. Here the test data is
used to estimate the accuracy of classification rules. The classification
rules can be applied to the new data tuples if the accuracy is
considered acceptable.
The data classification process: (b) Classification: Test data are used to
estimate the accuracy of the classification rules. If the accuracy is considered
acceptable, the rules can be applied to the classification of new data tuples.
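• As a rough, end-to-end illustration of these two steps, the sketch below builds a classifier from a made-up loan training set and then estimates its accuracy on held-out test tuples; the data, feature names, and the use of scikit-learn are assumptions for illustration only.

```python
# A rough sketch of the two-step process: (1) learn a classifier from training
# tuples, (2) estimate its accuracy on independent test tuples before applying
# it to new data. The loan data below is made up purely for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical tuples: [income_in_thousands, years_employed] -> 'safe' or 'risky'
X = [[25, 1], [40, 3], [60, 8], [32, 2], [80, 10], [28, 1], [55, 6], [45, 4]]
y = ['risky', 'risky', 'safe', 'risky', 'safe', 'risky', 'safe', 'safe']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1 (learning): build the classifier from the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (classification): estimate accuracy on the test set; if acceptable,
# apply the classifier to new tuples.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("new applicant     :", clf.predict([[38, 2]]))
```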
Classification and Prediction Issues
• The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing the noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods:
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step employs neural networks or methods involving distance measurements.
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose, we can use concept hierarchies.
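• As a concrete illustration of the normalization step above, a minimal min-max normalization sketch (with made-up income values) is:

```python
# Min-max normalization: rescale one attribute's values so they fall within [0, 1].
# The income values are made up for illustration.
incomes = [12000, 35000, 54000, 73600, 98000]

lo, hi = min(incomes), max(incomes)
normalized = [(v - lo) / (hi - lo) for v in incomes]
print(normalized)   # every value now lies in the small, specified range [0, 1]
```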
Comparison of Classification and Prediction Methods

• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well a given predictor can estimate the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the
classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions given noisy data.
• Scalability − This refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor can be understood, i.e., how much insight the model offers.
General Approach to Classification
• Data classification is a two-step process.
• Consisting of a learning step (where a classification model is
constructed) and
• a classification step (where the model is used to predict class labels
for given data).
• In the first step, a classifier is built describing a predetermined set of
data classes or concepts.
• This is the learning step (or training phase), where a classification
algorithm builds the classifier by analysing or “learning from” a
training set made up of database tuples and their associated class
labels.
• A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2, …, xn).
• Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class label
attribute.
• The individual tuples making up the training set are referred to as
training tuples and are randomly sampled from the database under
analysis.
• In the context of classification, data tuples can be referred to as
samples, examples, instances, data points, or objects.
• Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs).
• It contrasts with unsupervised learning (or clustering), in which the
class label of each training tuple is not known, and the number or set
of classes to be learned may not be known in advance.
• This first step of the classification process can also be viewed as the
learning of a function, y = f (X), that can predict the associated class
label y of a given tuple X.
• In this view, we wish to learn a mapping or function that separates
the data classes.
• Typically, this mapping is represented in the form of classification rules, decision trees, or mathematical formulae.
• In our example, the mapping is represented as classification rules that
identify loan applications as being either safe or risky.
• The rules can be used to categorize future data tuples, as well as
provide deeper insight into the data contents.
• They also provide a compressed data representation.
• In the second step, the model is used for classification.
• First, the predictive accuracy of the classifier is estimated.
• If we were to use the training set to measure the classifier’s accuracy,
this estimate would likely be optimistic, because the classifier tends to
overfit the data (i.e., during learning it may incorporate some
particular anomalies of the training data that are not present in the
general data set overall).
• Therefore, a test set is used, made up of test tuples and their
associated class labels. They are independent of the training tuples,
meaning that they were not used to construct the classifier.
• The accuracy of a classifier on a given test set is the percentage of
test set tuples that are correctly classified by the classifier.
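• In code, this estimate is simply the fraction of matching labels on the test set; a minimal sketch with placeholder labels:

```python
# Accuracy on a test set: the percentage of test tuples whose predicted label
# matches the true label. Both label lists are illustrative placeholders.
actual    = ['safe', 'risky', 'safe', 'safe', 'risky']
predicted = ['safe', 'risky', 'risky', 'safe', 'risky']

accuracy = 100 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(f"accuracy = {accuracy:.1f}%")   # 80.0%
```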
Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-
labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal
node (non leaf node) denotes a test on an attribute,
• each branch represents an outcome of the test, and each leaf node
(or terminal node) holds a class label.
• The topmost node in a tree is the root node.
• A typical decision tree is shown in Figure . It represents the concept
buys computer, that is, it predicts whether a customer at
AllElectronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are denoted
by ovals.
• Some decision tree algorithms produce only binary trees (where each
internal node branches to exactly two other nodes), whereas others
can produce non binary trees.
• “How are decision trees used for classification?”
• Given a tuple, X, for which the associated class label is unknown, the
attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be converted to classification rules.
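• For example, assuming the buys_computer tree shape referenced in the figure above, the tree and its rules can be sketched as nested tests:

```python
# The buys_computer tree expressed as nested tests; each root-to-leaf path is one
# IF-THEN classification rule. The tree shape follows the figure referenced above
# (assumed here, since the figure itself is not reproduced in the text).
def buys_computer(age, student, credit_rating):
    if age == 'youth':
        return 'yes' if student == 'yes' else 'no'          # IF age=youth AND student=yes THEN yes
    elif age == 'middle_aged':
        return 'yes'                                         # IF age=middle_aged THEN yes
    else:                                                    # senior
        return 'yes' if credit_rating == 'fair' else 'no'    # IF age=senior AND credit_rating=fair THEN yes

print(buys_computer('youth', 'yes', 'fair'))   # 'yes'
```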
• The construction of decision tree classifiers does not require any
domain knowledge or parameter setting, and therefore is appropriate
for exploratory knowledge discovery.
• Decision trees can handle multidimensional data.
• The learning and classification steps of decision tree induction are
simple and fast.
• In general, decision tree classifiers have good accuracy.
• During tree construction, attribute selection measures are used to
select the attribute that best partitions the tuples into distinct classes.
• When decision trees are built, many of the branches may reflect noise
or outliers in the training data.
• Tree pruning attempts to identify and remove such branches, with the
goal of improving classification accuracy on unseen data.
Decision Tree Algorithms
• During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in
machine learning, developed a decision tree algorithm known as ID3
(Iterative Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3).
• In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen,
and C. Stone) published the book Classification and Regression Trees
(CART), which described the generation of binary decision trees.
• ID3, C4.5, and CART adopt a greedy (i.e., non backtracking) approach
in which decision trees are constructed in a top-down recursive
divide-and-conquer manner.
• Most algorithms for decision tree induction also follow a top-down
approach, which starts with a training set of tuples and their
associated class labels.
• The training set is recursively partitioned into smaller subsets as the
tree is being built.
Decision tree algorithm
• The algorithm is called with three parameters:
 D,
 attribute list, and
 Attribute selection method.
• D - data partition: Initially, it is the complete set of training tuples and
their associated class labels.
• The parameter attribute list is a list of attributes describing the
tuples.
• Attribute selection method specifies a heuristic procedure for
selecting the attribute that “best” discriminates the given tuples
according to class.
• It typically uses an attribute selection measure such as information gain or the Gini index.
• The tree starts as a single node, N, representing the training tuples in
D(step 1).
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class.(steps 2 and 3).
• Otherwise, the algorithm calls Attribute selection method to determine
the splitting criterion.
• The splitting criterion tells us which attribute to test at node N by
determining the “best” way to separate or partition the tuples in D into
individual classes(step 6).
• The splitting criterion also tells us which branches to grow from node N
with respect to the outcomes of the chosen test.
• More specifically, the splitting criterion indicates the splitting attribute
and may also indicate either a split-point or a splitting subset.
• The resulting partitions at each branch are as “pure” as possible.
• A partition is pure if all the tuples in it belong to the same class. In other
words, if we split up the tuples in D according to the mutually exclusive
outcomes of the splitting criterion, we hope for the resulting partitions
to be as pure as possible.
• The node N is labeled with the splitting criterion, which serves as a test
at the node (step 7).
• A branch is grown from node N for each of the outcomes of the splitting
criterion.
• The tuples in D are partitioned accordingly (steps 10 to 11).
• There are three possible scenarios, as illustrated in Figure.
• Let A be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node
N correspond directly to the known values of A.
• A branch is created for each known value, aj , of A and labeled with
that value .
• Partition Dj is the subset of class-labeled tuples in D having value aj of
A.
• Because all the tuples in a given partition have the same value for A, A
need not be considered in any future partitioning of the tuples.
Therefore, it is removed from attribute list (steps 8 and 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split point and A > split point, respectively,
• where split point is the split-point returned by Attribute selection
method as part of the splitting criterion. (In practice, the split-point, a,
is often taken as the midpoint of two known adjacent values of A and
therefore may not actually be a pre-existing value of A from the
training data.)
• Two branches are grown from N and labeled according to the
previous outcomes.
• The tuples are partitioned such that D1 holds the subset of class-
labeled tuples in D for which A <= split point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced (as dictated by the
attribute selection measure or algorithm being used):
• The test at node N is of the form “A ∈ SA?”, where SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion.
• It is a subset of the known values of A.
• If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied.
• Two branches are grown from N.
• By convention, the left branch out of N is labeled yes so that D1 corresponds
to the subset of class-labeled tuples in D that satisfy the test.
• The right branch out of N is labeled no so that D2 corresponds to the subset of
class-labeled tuples from D that do not satisfy the test.
• The algorithm uses the same process recursively to form a decision
tree for the tuples at each resulting partition, Dj , of D (step 14).
• The recursive partitioning stops only when any one of the following
terminating conditions is true:
1. All the tuples in partition D (represented at node N) belong to the
same class(steps 2 and 3).
2. There are no remaining attributes on which the tuples may be
further partitioned (step 4). In this case, majority voting is employed
(step 5). This involves converting node N into a leaf and labeling it with
the most common class in D. Alternatively, the class distribution of the
node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty
(step 12). In this case, a leaf is created with the majority class in D (step
13).
• The resulting decision tree is returned (step 15).
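• A compact Python sketch of this recursive procedure is given below. It mirrors the steps described above, but the tuple representation and the attribute_selection_method placeholder are assumptions rather than the algorithm's exact pseudocode.

```python
# A compact sketch of top-down, recursive decision tree induction for
# discrete-valued attributes. D is a list of (tuple, class_label) pairs where
# each tuple is a dict; attribute_selection_method is a placeholder for
# information gain, gain ratio, the Gini index, etc.
from collections import Counter

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    tuples, labels = zip(*D)                                  # step 1: node N holds the tuples in D
    if len(set(labels)) == 1:                                 # steps 2-3: all tuples in one class -> leaf
        return labels[0]
    if not attribute_list:                                    # steps 4-5: no attributes left -> majority vote
        return majority_class(labels)

    A = attribute_selection_method(D, attribute_list)         # step 6: choose the splitting attribute
    tree = {A: {}}                                            # step 7: label node N with the criterion
    remaining = [a for a in attribute_list if a != A]         # steps 8-9: remove A (discrete-valued case)

    for aj in {t[A] for t, _ in D}:                           # steps 10-11: one branch / partition Dj per value
        Dj = [(t, c) for t, c in D if t[A] == aj]
        if not Dj:                                            # steps 12-13: empty partition -> majority class of D
            tree[A][aj] = majority_class(labels)
        else:
            tree[A][aj] = generate_decision_tree(Dj, remaining,
                                                 attribute_selection_method)  # step 14: recurse
    return tree                                               # step 15: return the tree
```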
• The computational complexity of the algorithm given training set D is O(n × |D| × log|D|),
• where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D.
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-
labeled training tuples into individual classes.
• If we were to split D into smaller partitions according to the outcomes
of the splitting criterion, ideally each partition would be pure (i.e., all
the tuples that fall into a given partition would belong to the same
class).
• Attribute selection measures are also known as splitting rules
because they determine how the tuples at a given node are to be
split.
• Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, …, m).
• Let Ci,D be the set of tuples of class Ci in D.
• Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.
Information Gain
• ID3 uses information gain as its attribute selection measure.
• Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
• The expected information needed to classify a tuple in D is given by
Info(D) = − Σ (i = 1 to m) pi log2(pi)
• where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D| / |D|.
• A log function to the base 2 is used, because the information is
encoded in bits.
• Info(D) is just the average amount of information needed to identify
the class label of a tuple in D.
• Info(D) is also known as the entropy of D.
• How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by
InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)
• The term |Dj| / |D| acts as the weight of the jth partition.
• InfoA (D) is the expected information required to classify a tuple
from D based on the partitioning by A.
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A). That
is
Gain(A) = Info(D) – InfoA(D).
• Table presents a training set, D, of class-labeled tuples randomly
selected from the AllElectronics customer database.
• The class label attribute, buys computer, has two distinct values
(namely, {yes, no} );
• therefore, there are two distinct classes (i.e., m = 2).
• Let class C1 correspond to yes and class C2 correspond to no.
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D. To find the splitting
criterion for these tuples, we must compute the information gain of
each attribute.
• Next, we need to compute the expected information
requirement for each attribute.
• Let’s start with the attribute age.
• We need to look at the distribution of yes and no tuples for each category of
age.
• For the age category “youth,” there are two yes tuples and three no tuples.
• For the category “middle aged,” there are four yes tuples and zero no
tuples.
• For the category “senior,” there are three yes tuples and two no tuples.
• Similarly, we can compute Gain( income)= 0.029 bits,
• Gain( student)= 0.151 bits,
• and Gain( credit rating)=0.048 bits.
• Because age has the highest information gain among the attributes, it
is selected as the splitting attribute.
• Node N is labeled with age, and branches are grown for each of the
attribute’s values.
• The tuples are then partitioned accordingly,
• Notice that the tuples falling into the partition for age = middle aged
all belong to the same class.
• Because they all belong to class “yes,” a leaf should therefore be
created at the end of this branch and labeled “yes.”
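• The class counts listed above are enough to reproduce Gain(age) numerically; a short sketch (values are approximate):

```python
# Reproducing the information gain of age from the class counts stated above:
# overall 9 yes / 5 no; youth 2 yes / 3 no; middle_aged 4 / 0; senior 3 / 2.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

overall = [9, 5]
age_partitions = [[2, 3], [4, 0], [3, 2]]          # youth, middle_aged, senior

info_D = info(overall)                                                     # ~0.940 bits
info_age = sum(sum(p) / sum(overall) * info(p) for p in age_partitions)    # ~0.694 bits
print(f"Gain(age) = {info_D - info_age:.3f} bits")                         # ~0.246 bits
```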
Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio.
• It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as
SplitInfoA(D) = − Σ (j = 1 to v) (|Dj| / |D|) × log2(|Dj| / |D|)
• This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A.
• The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute.
• Computation of gain ratio for the attribute income. A test on income
splits the data of Table into three partitions, namely low, medium, and
high, containing four, six, and four tuples, respectively.
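• Using those partition sizes together with the Gain(income) value of 0.029 bits quoted earlier, the gain ratio can be computed as follows:

```python
# Gain ratio for income, using the partition sizes above (4, 6, 4 of 14 tuples)
# and Gain(income) = 0.029 bits as quoted earlier in the text.
from math import log2

sizes, total = [4, 6, 4], 14
split_info_income = -sum(s / total * log2(s / total) for s in sizes)   # ~1.557
gain_income = 0.029
print(f"GainRatio(income) = {gain_income / split_info_income:.3f}")    # ~0.019
```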
Gini Index
• The Gini index is used in CART. the Gini index measures the impurity
of D, a data partition or set of training tuples, as
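• For example, with the same class distribution as before (9 yes and 5 no tuples), the impurity of D works out to roughly 0.459:

```python
# Gini index of D for the same class distribution as before: 9 'yes' and 5 'no' tuples.
counts, total = [9, 5], 14
gini_D = 1 - sum((c / total) ** 2 for c in counts)
print(f"Gini(D) = {gini_D:.3f}")   # ~0.459
```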
Tree Pruning
• When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
• Tree pruning methods address this problem of Overfitting the data.
• Such methods typically use statistical measures to remove the least-
reliable branches.
• Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend.
• They are usually faster and better at correctly classifying independent
test data than unpruned trees.
• There are two common approaches to tree pruning:
 prepruning and
postpruning.
• In the prepruning approach, a tree is “pruned” by halting its
construction early (e.g., by deciding not to further split or partition
the subset of training tuples at a given node).
• Upon halting, the node becomes a leaf. The leaf may hold the most
frequent class among the subset tuples or the probability distribution
of those tuples.
• When constructing a tree, measures such as statistical significance,
information gain, Gini index, and so on, can be used to assess the
goodness of a split.
• If partitioning the tuples at a node would result in a split that falls
below a prespecified threshold, then further partitioning of the given
subset is halted.
• There are difficulties, however, in choosing an appropriate threshold.
• High thresholds could result in oversimplified trees, whereas low
thresholds could result in very little simplification.
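• In practice, libraries approximate prepruning with such thresholds; a minimal scikit-learn sketch (parameter values chosen arbitrarily):

```python
# Prepruning approximated with scikit-learn's DecisionTreeClassifier thresholds.
# The parameter values are illustrative; choosing them well is the difficulty noted above.
from sklearn.tree import DecisionTreeClassifier

prepruned = DecisionTreeClassifier(
    max_depth=4,                  # halt construction early by capping tree depth
    min_impurity_decrease=0.01,   # do not split a node if the impurity reduction falls below this threshold
)
# prepruned.fit(X_train, y_train) would then grow a tree subject to these limits.
```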
• The second and more common approach is postpruning, which
removes subtrees from a “fully grown” tree.
• A subtree at a given node is pruned by removing its branches and
replacing it with a leaf.
• The leaf is labeled with the most frequent class among the subtree
being replaced.
• For example, notice the subtree at node “A3?” in the unpruned tree
of Figure .
• Suppose that the most common class within this subtree is “class B.”
• In the pruned version of the tree, the subtree in question is pruned by
replacing it with the leaf “class B.”
• Although pruned trees tend to be more compact than their unpruned
counterparts, they may still be rather large and complex.
• Decision trees can suffer from repetition and replication.
• Repetition occurs when an attribute is repeatedly tested along a given
branch of the tree (e.g., “age < 60?,” followed by “age < 45?,” and so
on).
• In replication, duplicate subtrees exist within the tree.
subtree repetition, where an attribute is repeatedly tested
along a given branch of the tree (e.g., age)
subtree replication, where duplicate subtrees exist
within a tree (e.g., the subtree headed by the node “credit
rating?”).
• These situations can impede the accuracy and comprehensibility of a
decision tree.
• The use of multivariate splits (splits based on a combination of
attributes) can prevent these problems.
Bayes Classification Methods
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities such as the
probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem
• Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence.
Bayes’ Theorem
• Let X be a data tuple.
• Let H be some hypothesis such as that the data tuple X belongs to a
specified class C.
• For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X.
• In other words, we are looking for the probability that tuple X belongs
to class C.
• P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
• For example, Suppose we have the attributes age and income
• And X is a 35-year-old customer with an income of $40,000.
• Suppose that H is the hypothesis that our customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.
• In contrast, P(H) is the prior probability, or a priori probability, of H.
• For our example, this is the probability that any given customer will buy a
computer, regardless of age, income, or any other information, for that
matter.
• The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
• Similarly, P(X|H) is the posterior probability of X conditioned on H.
• That is, it is the probability that a customer, X, is 35 years old and
earns $40,000, given that we know the customer will buy a computer.
• P(X) is the prior probability of X. Using our example, it is the
probability that a person from our set of customers is 35 years old
and earns $40,000.
• Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X)
Naïve Bayesian Classification
• The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Predicting a class label using naïve Bayesian classification: We wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as in the example for decision tree induction.
• The data tuples are described by the attributes age, income, student,
and credit rating. The class label attribute, buys computer, has two
distinct values (namely,{yes, no}).
• Let C1 correspond to the class buys computer = yes and C2 correspond
to buys computer = no.
• The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
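• To finish the example in code, the sketch below multiplies each class prior by the product of the class-conditional probabilities of X's attribute values and selects the class with the larger result; the individual probabilities are assumed inputs in the spirit of the AllElectronics training data, since the table itself is not reproduced here.

```python
# Naive Bayesian classification of X = (age=youth, income=medium, student=yes, credit=fair).
# The prior and class-conditional probabilities are assumed inputs of the kind
# derived from the AllElectronics training table referenced above.
prior = {'yes': 9 / 14, 'no': 5 / 14}
cond = {
    'yes': {'age=youth': 2 / 9, 'income=medium': 4 / 9, 'student=yes': 6 / 9, 'credit=fair': 6 / 9},
    'no':  {'age=youth': 3 / 5, 'income=medium': 2 / 5, 'student=yes': 1 / 5, 'credit=fair': 2 / 5},
}

scores = {}
for c in prior:
    p_x_given_c = 1.0
    for value_prob in cond[c].values():     # class-conditional independence: multiply attribute probabilities
        p_x_given_c *= value_prob
    scores[c] = p_x_given_c * prior[c]      # P(X|Ci) * P(Ci)

print(scores)
print("predicted class:", max(scores, key=scores.get))   # 'yes' gives the larger product
```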
