Module 3

The document discusses classification and prediction tasks in machine learning. It provides examples of classification, where models predict categorical labels, and prediction, where models predict continuous values. The document outlines the two-step process of classification - the learning step to build a classifier from training data, and the classification step to apply the model to new data. It also discusses decision tree algorithms for classification, including how decision trees are constructed in a top-down recursive manner and used to classify data.

Classification

Eg1:- A bank loan officer needs to analyze data to learn which
loan applicants are “safe” and which are “risky” for the bank.
Eg2:- A marketing manager needs to analyze data to determine whether a
customer with a given profile will buy a new computer.
Eg3:- A medical researcher wants to analyze breast cancer data
in order to predict which one of three specific treatments a
patient should receive.
• In each of these examples, the data analysis task is
classification, where a model or classifier is constructed to
predict categorical labels.
– Safe or risky
– Yes or No
– Treatment A, Treatment B or Treatment C
• These categories can be represented by discrete values,
where the ordering among values has no meaning
Prediction
• Eg:- A marketing manager would like to predict how much a
given customer will spend during a sale at AllElectronics.
• This data analysis task is an example of numeric prediction.
• the model constructed predicts a continuous-valued function,
or ordered value, as opposed to a categorical label.
• This model is a predictor.
• Regression analysis is a statistical methodology used for
numeric prediction
Classification—A Two-Step Process
1. Learning Step – a classification model is constructed
2. Classification step – the model is used to predict class labels
for given data
• Learning: A classifier is built describing a set of predetermined
classes
– The classification algorithm builds a classifier by analyzing or learning
from a training set made up of database tuples and their associated
class labels
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
Classification—A Two-Step Process

• Classification step (model usage): for classifying future or
unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
• A tuple, X, is represented by an n-dimensional attribute
vector, X = (x1, x2, …….., xn), depicting n measurements made
on the tuple from n database attributes, respectively, A1,
A2,… , An.
• Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class
label attribute.
• The class label attribute is discrete-valued and unordered.
• It is categorical in that each value serves as a category or
class.
• individual tuples making up the training set are referred to as
training tuples
• Supervised Learning – the class label of each training tuple is
given
• Unsupervised learning (clustering) – the class label of each
training tuple is not known, and the number or set of classes
to be learned may not be known in advance.
• Classification step
– Test data are used to estimate the accuracy of the
classification rules.
– The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly classified by
the classifier.
– If the accuracy of the classifier is considered acceptable,
the classifier can be used to classify future data tuples for
which the class label is not known.
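As an illustration of the two-step process, here is a minimal sketch in Python, assuming scikit-learn and its bundled iris data purely for demonstration (any labeled dataset and classifier would do): the learning step builds a decision tree from the training set, and the classification step estimates the accuracy rate on an independent test set.

# Minimal sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set to avoid over-fitting bias.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 1: learning step - build the classifier from the training tuples.
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: classification step - predict labels for unseen tuples
# and estimate the accuracy rate on the test set.
y_pred = clf.predict(X_test)
print("Accuracy rate:", accuracy_score(y_test, y_pred))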
Issues regarding classification and prediction:
Preparing the Data for Classification and Prediction
• Data cleaning
– Preprocess data in order to reduce noise and handle missing
values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
– Correlation analysis
– Attribute subset selection
• Data transformation
– Normalization and generalization
– Concept hierarchy
Comparing Classification and Prediction Methods

• Accuracy
– The accuracy of a classifier refers to the ability of a given
classifier to correctly predict the class label of new or
previously unseen data.
– the accuracy of a predictor refers to how well a given
predictor can guess the value of the predicted attribute for
new or previously unseen data.
• Speed
– refers to the computational costs involved in generating
and using the given classifier or predictor.
(time to construct the model, time to use the model)
• Robustness
– Refers to the ability to make correct predictions given noisy
data and missing values
• Scalability
– Ability to construct classifier or predictor efficiently given
large amounts of data
• Interpretability:
– Level of understanding and insight provided by the
model(classifier or predictor)
Classification by Decision Tree Induction
• Decision Tree Induction - learning of decision trees from
class-labeled training tuples
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected
attributes
– Tree pruning
• Identify and remove branches that reflect noise or
outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision
tree
• How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against
the decision tree.
• A path is traced from the root to a leaf node, which holds the
class prediction for that tuple.
• Decision trees can easily be converted to classification rules.
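As a small sketch of this tracing process, consider a hypothetical tree (attribute names and outcomes are illustrative, loosely modeled on the buys_computer example) stored as nested Python dictionaries, with classification as a walk from the root to a leaf.

# Hypothetical decision tree stored as nested dicts.
# Internal nodes test an attribute; leaves hold a class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth":       {"attribute": "student",
                        "branches": {"yes": "buys_computer=yes", "no": "buys_computer=no"}},
        "middle_aged": "buys_computer=yes",
        "senior":      {"attribute": "credit_rating",
                        "branches": {"fair": "buys_computer=yes", "excellent": "buys_computer=no"}},
    },
}

def classify(node, tuple_x):
    """Trace a path from the root to a leaf and return the class prediction."""
    while isinstance(node, dict):                 # internal node: test an attribute
        value = tuple_x[node["attribute"]]        # outcome of the test on this tuple
        node = node["branches"][value]            # follow the matching branch
    return node                                   # leaf: class label

print(classify(tree, {"age": "senior", "student": "no", "credit_rating": "fair"}))
# -> buys_computer=yes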
Why are decision tree classifiers so popular?

• Decision Tree
– construction does not require any domain knowledge or
parameter setting
– It is appropriate for exploratory knowledge discovery.
– It can handle high dimensional data.
– The learning and classification steps of decision tree
induction are simple and fast.
– They generally have good accuracy.
– Application areas: medicine, manufacturing and
production, financial analysis, astronomy, molecular
biology, etc.
• In the early 1980s, J. Ross Quinlan, a researcher in machine
learning, developed a decision tree algorithm known as ID3
(Iterative Dichotomiser).
• Quinlan later presented C4.5 (a successor of ID3).
• In 1984, a group of statisticians (L. Breiman, J. Friedman, R.
Olshen, and C. Stone) published the book Classification and
Regression Trees (CART), which described the generation of
binary decision trees.
• ID3 and CART were invented independently of one another at
around the same time, yet follow a similar approach for learning
decision trees from training tuples.
• ID3, C4.5, and CART adopt a greedy (nonbacktracking) approach
in which decision trees are constructed in a top-down recursive
divide-and-conquer manner.
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
Basic algorithm for inducing a decision tree from training tuples
Algorithm: Generate decision tree. Generate a decision tree from the training tuples
of data partition D.
• Input: Data partition, D, which is a set of training tuples and their associated class
labels;
• attribute list, the set of candidate attributes;
• Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split point or splitting subset.
• Output: A decision tree.
• Method:
• (1) create a node N;
• (2) if tuples in D are all of the same class, C, then
• (3) return N as a leaf node labeled with the class C;
• (4) if attribute list is empty then
• (5) return N as a leaf node labeled with the majority class in D; // majority
voting
Basic algorithm for inducing a decision tree from training tuples(contd.)
• (6) apply Attribute selection method(D, attribute list) to find the “best” splitting
criterion;
• (7) label node N with splitting criterion;
• (8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
• (9) attribute list = attribute list - splitting attribute; // remove splitting attribute
• (10) for each outcome j of splitting criterion
• // partition the tuples and grow subtrees for each partition
• (11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
• (12) if Dj is empty then
• (13) attach a leaf labeled with the majority class in D to node N;
• (14) else attach the node returned by Generate decision tree(Dj, attribute list)
to node N;
• endfor
• (15) return N;
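The listing above can be sketched in Python roughly as follows. This is a simplified illustration only, assuming categorical attributes and information gain as the Attribute selection method, with no pruning; step (13) is approximated by not growing branches for outcomes that do not occur in D.

import math
from collections import Counter

def info(tuples):
    """Expected information (entropy) needed to classify the tuples."""
    counts = Counter(label for _, label in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def generate_decision_tree(D, attribute_list):
    """D is a list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # steps (2)-(3): all tuples in the same class
        return labels[0]
    if not attribute_list:               # steps (4)-(5): majority voting
        return majority

    # Step (6): attribute selection by highest information gain.
    def gain(A):
        partitions = {}
        for x, label in D:
            partitions.setdefault(x[A], []).append((x, label))
        info_A = sum(len(Dj) / len(D) * info(Dj) for Dj in partitions.values())
        return info(D) - info_A

    best = max(attribute_list, key=gain)
    node = {"attribute": best, "branches": {}, "majority": majority}
    remaining = [a for a in attribute_list if a != best]   # step (9)

    # Steps (10)-(14): partition the tuples and grow a subtree for each outcome.
    partitions = {}
    for x, label in D:
        partitions.setdefault(x[best], []).append((x, label))
    for outcome, Dj in partitions.items():
        node["branches"][outcome] = generate_decision_tree(Dj, remaining)
    # Outcomes not present in D get no branch here; the listing's step (13)
    # would instead attach a leaf labeled with the majority class (kept in
    # node["majority"]).
    return node            # step (15)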
Attribute Selection Measure
• A heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training
tuples into individual classes.
• Also known as splitting rules because they determine how the
tuples at a given node are to be split.
• Provides a ranking for each attribute describing the given training
tuples.
• The attribute having the best score for the measure is chosen as
the splitting attribute for the given tuples.
• If the splitting attribute is continuous-valued or if we are
restricted to binary trees then either a split point or a splitting
subset must also be determined as part of the splitting criterion.
• three popular attribute selection measures—
– Information gain, gain ratio, and gini index.
Information Gain
• ID3 uses Information Gain as attribute selection method.
• Node N holds (represents) the tuples of partition D.
• The attribute with the highest information gain is chosen as
the splitting attribute for node N. This attribute minimizes
the information needed to classify the tuples in the resulting
partitions
• To partition tuples in D on some attribute A having v distinct
values {a1,a2,……..av},
• If A is discrete valued, these values correspond to v outcomes
of a test on A.
• Attribute A can be used to split D into v partitions
{D1,D2,…….,Dv}, where Dj contains tuples in D having
outcome aj of A.
• These partitions correspond to the branches grown from
node N.
• If attribute A is continuous-valued, we must determine the
best split-point for A.
These partitions may be impure (i.e., a partition may contain
tuples from different classes rather than from a single class).
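For reference, the standard ID3 definitions behind these statements are:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), where p_i is the proportion of tuples in D belonging to class C_i (estimated as |C_{i,D}| / |D|);

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j), the expected information still needed to classify the tuples after partitioning D on A;

Gain(A) = Info(D) - Info_A(D), the information gained by branching on A.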
Information Gain-Continuous valued attribute
• determine the “best” split-point for A
• sort the values of A in increasing order.
• the midpoint between each pair of adjacent values is a
possible split-point.
• Given v values of A, there are v-1 possible split points.
• The midpoint between each pair of adjacent values ai and ai+1 of A is
(ai + ai+1) / 2.
• The point with the minimum expected information
requirement for A is selected as the split point for A.
• D1 is the set of tuples in D satisfying A ≤ split point, and
• D2 is the set of tuples in D satisfying A > split point.
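A short Python sketch of this split-point search (the helper names are my own; the score is the expected information requirement over the two resulting partitions, following Info_A(D) above):

import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the split point of a continuous attribute with the minimum
    expected information requirement (midpoints of adjacent sorted values)."""
    pairs = sorted(zip(values, labels))
    best, best_info = None, float("inf")
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical adjacent values give no new midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        d1 = [lab for v, lab in pairs if v <= split]
        d2 = [lab for v, lab in pairs if v > split]
        info_a = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if info_a < best_info:
            best, best_info = split, info_a
    return best

# Example with hypothetical ages and class labels.
print(best_split_point([23, 35, 41, 52], ["no", "yes", "yes", "no"]))   # -> 29.0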
Gain Ratio
• The information gain measure prefers attributes with a large
number of values.
• For example, consider an attribute that acts as a unique identifier,
such as product_ID. A split on product_ID would result in a large
number of partitions (as many as there are values), each one containing
just one tuple.
• Because each partition is pure, the information required to
classify data set D based on this partitioning would be
Info_product_ID(D) = 0.
• Therefore, the information gained by partitioning on this
attribute is maximal.
• such a partitioning is useless for classification.
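For reference, the gain ratio used by C4.5 normalizes the information gain as follows:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

The attribute with the maximum gain ratio is selected as the splitting attribute; the SplitInfo term penalizes attributes such as product_ID that split the data into very many small partitions.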
Gini Index
The Gini index, used in CART, measures the impurity of a data partition D
and considers a binary split for each attribute.
To determine the best binary split on A, we examine all of the possible
subsets that can be formed using known values of A.
Each subset, SA, can be considered as a binary
test for attribute A of the form “A ∈ SA?”.
Given a tuple, this test is satisfied if the value of A for the tuple is among
the values listed in SA.
For a continuous-valued attribute, the midpoint between each pair of (sorted)
adjacent values is taken as a possible split-point.
The point giving the minimum Gini index for a given (continuous-valued) attribute is
taken as the split-point of that attribute.
For a possible split-point of A, D1 is the set of tuples in D satisfying A ≤ split point, and
D2 is the set of tuples in D satisfying A > split point.
• buys_computer = yes – 9 tuples
• buys_computer = no – 5 tuples
• Gini index to compute the impurity of D (Gini(D) = 1 − Σ pi², summed over the m classes):
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• To find the splitting criterion for the tuples in D, we compute the
Gini index for each attribute.
• We start with the attribute income and consider each of the
possible splitting subsets.
• Consider the subset {low, medium}.
• This would result in 10 tuples in partition D1 satisfying the
condition “income ∈ {low, medium}”; the remaining 4 tuples form D2.
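These numbers can be checked with a few lines of Python (the class counts come from the example above; the function names are my own):

def gini(counts):
    """Gini index of a partition given its per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Impurity of D: 9 tuples with buys_computer = yes, 5 with buys_computer = no.
print(round(gini([9, 5]), 3))                     # -> 0.459

def gini_split(counts1, counts2):
    """Weighted Gini index of a binary split of D into D1 and D2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)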
Tree Pruning
• When a decision tree is built, many of the branches will reflect
anomalies in the training data due to noise or outliers.
• Tree pruning methods address this problem of overfitting the
data.
• Pruned trees tend to be smaller and less complex and, thus,
easier to comprehend.
• They are usually faster and better at correctly classifying
independent test data than unpruned trees.
• There are two common approaches to tree pruning:
• prepruning and postpruning.
- Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Upon halting, the node becomes a leaf. The leaf may hold the
most frequent class among the subset tuples
• measures such as statistical significance, information gain,
Gini index, and so on can be used to assess the goodness of a
split.
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches(sub trees) from a “fully
grown” tree—get a sequence of progressively pruned
trees.
• A subtree at a given node is pruned by removing its branches
and replacing it with a leaf. The leaf is labeled with the most
frequent class among the subtree being replaced.
• A set of data different from the training data, called a pruning
set, is used to decide which is the “best pruned tree.”
C4.5 uses a method called pessimistic pruning, which is
similar to the cost complexity method in that it also uses
error rate estimates to make decisions regarding subtree
pruning. Pessimistic pruning, however, does not require the
use of a prune set. Instead, it uses the training set to estimate
error rates.
Postpruning requires more computation than prepruning, yet
generally leads to a more reliable tree. No single pruning
method has been found to be superior to all others.
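As a rough sketch of postpruning with a separate pruning set (a reduced-error style simplification, not C4.5's pessimistic pruning), assuming the nested-dict tree format used in the induction sketch earlier, where internal nodes carry "attribute", "branches", and the "majority" class of their training tuples:

def classify(node, x):
    """Trace a tuple from the root of a (sub)tree to a leaf."""
    while isinstance(node, dict):
        node = node["branches"][x[node["attribute"]]]
    return node

def postprune(node, prune_data):
    """Bottom-up: replace a subtree with a leaf holding its majority class
    whenever that does not reduce accuracy on the pruning tuples reaching it.
    Assumes every attribute value in prune_data has a branch in the tree."""
    if not isinstance(node, dict) or not prune_data:
        return node
    # Prune the children first, each on the pruning tuples routed to it.
    for outcome in list(node["branches"]):
        reaching = [(x, y) for x, y in prune_data
                    if x[node["attribute"]] == outcome]
        node["branches"][outcome] = postprune(node["branches"][outcome], reaching)
    # Compare the subtree against a single leaf labeled with the majority class.
    leaf_acc = sum(y == node["majority"] for _, y in prune_data) / len(prune_data)
    tree_acc = sum(classify(node, x) == y for x, y in prune_data) / len(prune_data)
    return node["majority"] if leaf_acc >= tree_acc else node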
Bayesian Classification: Why?

• Probabilistic learning: can predict class membership probabilities,
such as the probability that a given tuple belongs to a particular
class.
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Classification
• A statistical classifier
• Predict class membership probabilities
• Naïve Bayesian classifier is comparable in
performance with decision tree and selected
neural network classifiers.
• Based on Bayes’ Theorem
Bayes’ Theorem
• Let X be a data tuple.
• In Bayesian terms, X is considered “evidence.”
• Let H be some hypothesis, such as that the data tuple X belongs
to a specified class C.
• For classification problems, we want to determine P(H/X), the
probability that the hypothesis H holds given the “evidence”
or tuple X.
• P(H/X) is the posterior probability.
• we are looking for the probability that tuple X belongs to class
C, given that we know the attribute description of X.
• P(H/X) is the posterior probability of H conditioned on X.
• Eg:- X is a 35-year-old customer with an income of $40,000.
• H is the hypothesis that our customer will buy a computer.
• P(H/X) reflects the probability that customer X will buy a computer
given that we know the customer’s age and income.
• In contrast, P(H) is the prior probability, or a priori probability, of H.
• This is the probability that any given customer will buy a computer,
regardless of age, income, or any other information, for that
matter.
• The posterior probability, P(H/X), is based on more information
(e.g., customer information) than the prior probability, P(H), which
is independent of X.
• P(X/H) is the posterior probability of X conditioned on H. It is the
probability that a customer, X, is 35 years old and earns $40,000,
given that we know the customer will buy a computer.
• P(X) is the prior probability of X. It is the probability that a person
from our set of customers is 35 years old and earns $40,000.
• P(H), P(X/H), and P(X) may be estimated from the given data.
• Bayes’ theorem provides a way of calculating the posterior
probability, P(H/X), from P(H), P(X/H), and P(X).
• Bayes’ theorem is
P(H/X) = P(X/H) P(H) / P(X)
Naïve Bayesian Classifier
• Let D be a training set of tuples and their associated class
labels.
• A tuple X is an n-dimensional attribute vector X = (x1, x2, …, xn)
with n attributes, A1, A2, …, An.
• There are ‘m’ classes, C1,C2,….,Cm.
• Given a tuple, X, the classifier will predict that X belongs to
the class having the highest posterior probability, conditioned
on X.
• The naïve Bayesian classifier predicts that tuple X belongs to
the class Ci if and only if
P(Ci/X) > P(Cj/X)   for 1 ≤ j ≤ m, j ≠ i.
• As P(X) is constant for all classes, only P(X/Ci)P(Ci) need be
maximized.
• If the class prior probabilities are not known, it is commonly
assumed that the classes are equally likely, that is, P(C1) =
P(C2) = …… = P(Cm), and we would therefore maximize
P(X/Ci). Otherwise, we maximize P(X/Ci)P(Ci).
• The class prior probabilities may be estimated by
P(Ci) = |Ci,D| / |D|,
where |Ci,D| is the number of training tuples of class Ci in D.
• Given data sets with many attributes, it would be extremely
computationally expensive to compute P(X/Ci).
• In order to reduce computational complexity, the naive
assumption of class conditional independence is made- the
values of the attributes are conditionally independent of one
another, given the class label of the tuple.

• Thus, P(X/Ci) = P(x1/Ci) × P(x2/Ci) × …… × P(xn/Ci).
• The probabilities P(x1/Ci), P(x2/Ci), ……, P(xn/Ci) can be
estimated from the training tuples.
• xk refers to the value of attribute Ak for tuple X.
• For each attribute, we look at whether the attribute is
categorical or continuous-valued.
• If Ak is categorical, then P(xk/Ci) is the number of tuples of class Ci
in D having the value xk for Ak, divided by |Ci,D|.
• If Ak is continuous-valued, it is typically assumed to have a Gaussian
distribution, and we need to compute μCi and σCi, which are the mean
(i.e., average) and standard deviation, respectively, of the values of
attribute Ak for training tuples of class Ci.
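A minimal sketch of these estimates for categorical attributes (the function and variable names are my own, and no smoothing is applied, mirroring the description above):

from collections import Counter, defaultdict

def train_naive_bayes(D):
    """D is a list of (attribute_dict, class_label) pairs.
    Returns class priors P(Ci) and conditional estimates P(xk/Ci)."""
    class_counts = Counter(label for _, label in D)
    priors = {c: n / len(D) for c, n in class_counts.items()}
    cond = defaultdict(Counter)                 # (class, attribute) -> value counts
    for x, label in D:
        for attr, value in x.items():
            cond[(label, attr)][value] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """Choose the class Ci maximizing P(X/Ci) * P(Ci) under the
    class-conditional independence assumption."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond[(c, attr)][value] / class_counts[c]   # P(xk/Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class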
Bayesian Belief Networks

A belief network is defined by two components—a directed
acyclic graph and a set of conditional probability tables. Each
node in the directed acyclic graph represents a random variable.
Each arc represents a probabilistic dependence.
If an arc is drawn from a node Y to a node Z, then Y is a parent or
immediate predecessor of Z, and Z is a descendant of Y.
Each variable is conditionally independent of its nondescendants in
the graph, given its parents.
• Having lung cancer is influenced by a person’s family history
of lung cancer, as well as whether or not the person is a
smoker.
• PositiveXRay is independent of whether the patient has a
family history of lung cancer or is a smoker, given that we
know the patient has lung cancer.
• In other words, once we know the outcome of the variable
LungCancer, then the variables FamilyHistory and Smoker do
not provide any additional information regarding
PositiveXRay.
• The arcs also show that the variable LungCancer is
conditionally independent of Emphysema, given its parents,
FamilyHistory and Smoker.
• A belief network has one conditional probability table (CPT)
for each variable.
• The CPT for a variable Y specifies the conditional distribution
P(Y | Parents(Y)), where Parents(Y) are the parents of Y.
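For illustration, the CPT for the LungCancer node given its parents FamilyHistory and Smoker could be stored as a simple lookup table; the probability values below are hypothetical placeholders, not taken from the slides.

# Hypothetical CPT: P(LungCancer = yes | FamilyHistory, Smoker).
# Keys are (family_history, smoker); the values are illustrative only.
cpt_lung_cancer = {
    (True,  True):  0.80,
    (True,  False): 0.50,
    (False, True):  0.70,
    (False, False): 0.10,
}

def p_lung_cancer(value, family_history, smoker):
    """Return P(LungCancer = value | parents) from the CPT."""
    p_yes = cpt_lung_cancer[(family_history, smoker)]
    return p_yes if value else 1.0 - p_yes

print(p_lung_cancer(True, family_history=False, smoker=True))   # -> 0.7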

A node within the network can be selected as an “output”
node, representing a class label attribute.
