
Classification and Prediction

1. Objectives
2. Classification vs. Prediction
   2.1. Definitions
   2.2. Supervised vs. Unsupervised Learning
   2.3. Classification and Prediction Related Issues
3. Common Test Corpora
4. Classification
5. Decision Tree Induction
   5.1. Decision Tree Induction Algorithm
   5.2. Other Attribute Selection Measures
   5.3. Extracting Classification Rules from Trees
   5.4. Avoid Overfitting in Classification
   5.5. Classification in Large Databases
6. Bayesian Classification
   6.1. Basics
   6.2. Naïve Bayesian Classifier
7. Bayesian Belief Networks
   7.1. Definition
8. Neural Networks: Classification by Backpropagation
   8.1. Neural Network Issues
   8.2. Backpropagation Algorithm
9. Prediction
   9.1. Regression Analysis and Log-Linear Models in Prediction
10. Classification Accuracy: Estimating Error Rates

1. Objectives

• Techniques to classify datasets and assign categorical
labels, e.g., sports, technology, kids, etc.

• Example: {credit history, salary} -> credit approval (Yes/No)

• Models to predict certain future behaviors, e.g., who is going
to buy PDAs?

2. Classification vs. Prediction

2.1. Definitions

• Classification:
Predicts categorical class labels (discrete or
nominal)
Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
• Prediction:
Models continuous-valued functions, i.e., predicts
unknown or missing values

• Typical Applications
Document categorization
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis

2.2. Supervised vs. Unsupervised Learning

• Supervised learning (classification)

Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations

New data is classified based on the training set

• Unsupervised learning (clustering)


The class labels of the training data are unknown

Given a set of measurements, observations, etc., with the
aim of establishing the existence of classes or clusters in
the data

2.3. Classification and Prediction Related Issues

• Data Preparation
Data cleaning
o Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection)
o Remove the irrelevant or redundant
attributes
Data transformation
o Generalize and/or normalize data

• Performance Analysis
Predictive accuracy:
o Ability to classify new or previously unseen
data.
Speed and scalability
o Time to construct the model
o Time to use the model
Robustness
o Ability of the model to make correct predictions
given noisy data or missing values
Scalability
o Efficiency in disk-resident databases
Interpretability:
o Understanding and insight provided by the
model
Goodness of rules
o Decision tree size
o Compactness of classification rules

3. Common Test Corpora

• Reuters - Collection of newswire stories from 1987 to
1991, labeled with categories.
• TREC-AP - Newswire stories from 1988 to 1990, labeled
with categories.
• OHSUMED - Medline articles from 1987 to 1991, MeSH
categories assigned.
• UseNet newsgroups.
• WebKB - Web pages gathered from university CS
departments.

4. Classification

• A classical problem extensively studied by statisticians
and machine learning researchers
• Predicts categorical class labels
• Example: Typical Applications
{credit history, salary} -> credit approval (Yes/No)
{Temp, Humidity} -> Rain (Yes/No)
A set of documents -> sports, technology, etc.

• Another Example:
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.

The corresponding decision tree tests x one threshold at a time:
x >= 90 -> A; otherwise x >= 80 -> B; otherwise x >= 70 -> C;
otherwise x >= 60 -> D; otherwise F.
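As an illustration (not part of the original notes), the grading rules above can be written directly as a small Python function; the thresholds are exactly the ones listed above.

def grade(x):
    # Map a numeric score to a letter grade using the thresholds above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B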

• Classification types:

o Distance based
o Partitioning based


• Classification as a two-step process:


o Model construction: Build a model for pre-determined
classes.
o Model usage: Classify unknown data samples
o If the accuracy is acceptable, use the model to classify
data objects whose class labels are not known

• Model construction: describing a set of predetermined
classes
o Each data sample is assumed to belong to a predefined
class, as determined by the class label attribute
o Use a training dataset for model construction.
o The model is represented as classification rules,
decision trees, or mathematical formula

Training Data -> Classification Algorithms -> Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Resulting model: IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

• Model usage: for classifying future or unknown objects
o Estimate accuracy of the model
The known label of test sample is compared with
the classified result from the model
Test accuracy rate is the percentage of test set
samples that are correctly classified by the model
Test set is independent of training set but from the
same probability distribution

Testing Data -> Classifier

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured? Yes

• Common Techniques

• K-Nearest Neighbor (kNN) - Use the k closest training data
samples to predict the category.
• Decision Trees (DTree) - Construct classification trees
based on training data.

• Neural Networks (NNet) - Learn non-linear mapping from
input data samples to categories.
• Support Vector Machines (SVMs).
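For illustration, a minimal k-nearest-neighbor sketch in Python (Euclidean distance, majority vote); the tiny training set and the choice k = 3 are made up for this example.

from collections import Counter
import math

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label); query: feature_vector.
    # Sort training samples by Euclidean distance to the query.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))
    # Majority vote among the k closest samples.
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
         ((5.0, 5.0), "no"), ((5.5, 4.5), "no")]
print(knn_predict(train, (1.1, 0.9)))  # "yes"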

5. Decision Tree Induction

• Example: Training Dataset [Quinlan’s ID3]

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

• A decision tree for “buys_computer”:

age?
  <=30   -> student?
              no  -> NO
              yes -> YES
  31..40 -> YES
  >40    -> credit rating?
              excellent -> NO
              fair      -> YES
5.1. Decision Tree Induction Algorithm

• Basic algorithm (a greedy algorithm)


o Tree is constructed in a top-down recursive divide-and-
conquer manner
o At start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they
are discretized in advance)
o Samples are partitioned recursively based on selected
attributes
o Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
o Conditions for stopping partitioning
All samples for a given node belong to the same
class
There are no remaining attributes for further
partitioning – majority voting is employed for
classifying the leaf
There are no samples left

• Attribute Selection Measure: Information gain

o Select the attribute with the highest information gain


o S contains si tuples of class Ci for i = {1, …, m}
o Entropy of the set of tuples:
It measures how informative a node is.

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

o Entropy after choosing attribute A with values
{a1,a2,…,av}

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \ldots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})

o Information gained by branching on attribute A

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

• Information Gain Computation

o Class P:
buys_computer = “yes”
p: number of samples

o Class N:
buys_computer = “no”
n: number of samples

o The expected information:

I(p, n) = I(9, 5) =0.940

o Compute the entropy for age:

age pi ni I(pi, ni)


<=30 2 3 0.971
30…40 4 0 0
>40 3 2 0.971

E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

The term (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2
yes’es and 3 no’s.

Hence
Gain(age) = I(p,n)-E(age) = 0.94- 0.694 = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

We say that age is more informative than income,
student, and credit_rating.

So age would be chosen as the root of the tree

Age?

<=30 branch:
  income   student   credit_rating   class
  high     no        fair            no
  high     no        excellent       no
  low      yes       fair            yes
  medium   no        fair            no
  medium   yes       excellent       yes

>40 branch:
  income   student   credit_rating   class
  medium   no        fair            yes
  low      yes       fair            yes
  low      yes       excellent       no
  medium   yes       fair            yes
  medium   no        excellent       no

31..40 branch:
  income   student   credit_rating   class
  high     no        fair            yes
  low      yes       excellent       yes
  medium   no        excellent       yes
  high     yes       fair            yes
  All yes: it is a leaf.
• Recursively apply the same process to each subset.
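The information-gain computation above can be reproduced with a short Python sketch; the data list below simply encodes the 14 training samples from the table in this section, and the small differences from the slide values (e.g., 0.247 vs. 0.246) come only from rounding.

import math
from collections import Counter

# The 14 training samples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    # I(s1, ..., sm) = -sum_i (si/s) * log2(si/s)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr_index):
    # Gain(A) = I(s1, ..., sm) - E(A), where E(A) is the weighted entropy
    # of the partitions induced by attribute A.
    labels = [s[-1] for s in samples]
    e_a = 0.0
    for value in {s[attr_index] for s in samples}:
        part = [s[-1] for s in samples if s[attr_index] == value]
        e_a += len(part) / len(samples) * entropy(part)
    return entropy(labels) - e_a

for i, name in enumerate(ATTRS):
    print(name, round(info_gain(data, i), 3))
# age ~0.247, income ~0.029, student ~0.152, credit_rating ~0.048
# -> age has the highest gain and is chosen as the root.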

• ID3 Algorithm:

5.2. Other Attribute Selection Measures

• Gini Index (CART, IBM IntelligentMiner)

o All attributes are assumed continuous-valued


o Assume there exist several possible split values for each
attribute
o May need other tools, such as clustering, to get the
possible split values
o Can be modified for categorical attributes

• Formal Definition
o If a data set T contains examples from n classes, gini
index, gini(T) is defined as
gini(T) = 1 - \sum_{j=1}^{n} p_j^2

Where p_j is the relative frequency of class j in T.

o If a data set T is split into two subsets T1 and T2 with
sizes N1 and N2 respectively, the gini index of the split
data is defined as:

gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)
o The attribute that provides the smallest gini_split(T) is
chosen to split the node (we need to enumerate all possible
splitting points for each attribute); a small sketch follows below.
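A minimal Python sketch of the two definitions above; the binary split evaluated at the end is hypothetical and only illustrates the computation.

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# The 14 class labels from the example above (5 "no", 9 "yes").
labels = ["no"] * 5 + ["yes"] * 9
print(gini(labels))                                             # ~0.459
# A hypothetical split into (4 no, 1 yes) and (1 no, 8 yes):
print(gini_split(["no"] * 4 + ["yes"], ["no"] + ["yes"] * 8))   # ~0.241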

5.3. Extracting Classification Rules from Trees:

• Represent the knowledge in the form of IF-THEN rules


• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example
o IF age = “<=30” AND student = “no” THEN
buys_computer = “no”
o IF age = “<=30” AND student = “yes” THEN
buys_computer = “yes”
o IF age = “31…40” THEN buys_computer = “yes”
o IF age = “>40” AND credit_rating = “excellent”
THEN buys_computer = “no”
o IF age = “>40” AND credit_rating = “fair” THEN
buys_computer = “yes”
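For illustration, a minimal sketch that walks a tree stored as a nested Python dictionary and emits one IF-THEN rule per root-to-leaf path; the dictionary encoding of the tree is an assumption made for this example.

def extract_rules(tree, conditions=()):
    # Leaves are class labels; internal nodes are {attribute: {value: subtree}}.
    if not isinstance(tree, dict):
        antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {antecedent} THEN buys_computer = "{tree}"']
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules

# The decision tree for "buys_computer" from Section 5.
tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31..40": "yes",
                ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}}}}
for rule in extract_rules(tree):
    print(rule)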

5.4. Avoid Overfitting in Classification

• Overfitting: An induced tree may overfit the training data


o Too many branches, some may reflect anomalies due
to noise or outliers
o Poor accuracy for unseen samples
• Two approaches to avoid overfitting
o Prepruning: Halt tree construction early—do not split
a node if this would result in the goodness measure
falling below a threshold
Difficult to choose an appropriate threshold
o Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees

Use a set of data different from the training data
to decide which is the “best pruned tree”

5.5. Classification in Large Databases

• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of
examples and hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
o Relatively faster learning speed (than other
classification methods)
o Convertible to simple and easy to understand
classification rules
o Can use SQL queries for accessing databases
o Comparable classification accuracy with other
methods

• Visualization of a Decision Tree in SGI/MineSet 3.0

6. Bayesian Classification

• Why Bayesian?
o Probabilistic learning: Calculate explicit probabilities
for hypotheses; among the most practical approaches to
certain types of learning problems

o Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.

o Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities

o Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other
methods can be measured

6.1. Basics

• Let X be a data sample whose class label is unknown

• Let H be a hypothesis that X belongs to class C
• Posterior Probability:
o For classification problems, determine P(H|X): the
probability that the hypothesis H holds given the
observed data sample X

• Prior Probability:
o P(H): prior probability of hypothesis H
o It is the initial probability before we observe any data

o It reflects the background knowledge

• P(X): probability that sample data is observed


o P(X|H) : probability of observing the sample X, given
that the hypothesis H holds

• Bayesian Theorem:

o Given training data X, the posterior probability of a
hypothesis H, P(H|X), follows Bayes’ theorem:

P(H|X) = \frac{P(X|H) P(H)}{P(X)}

• Informally, this can be written as:

posterior = \frac{likelihood \times prior}{evidence}

• Maximum A Posteriori (MAP) Hypothesis:

Since P(X) is constant for all hypotheses, we have the
following:

h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h) P(h)

Where D is the data sample and H is the set of all available
hypotheses (e.g., all available classes).

• Practical difficulty:
o Require initial knowledge of many probabilities
o Significant computational cost

6.2. Naïve Bayesian Classifier

• Algorithm:
o A simplified assumption: attributes are conditionally
independent:
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)

o The probability of observing, say, 2 elements x1 and x2
together, given that the current class is C, is the product
of the probabilities of each element taken separately,
given the same class: P([x1,x2]|C) = P(x1|C) * P(x2|C)

o No dependence relation between attributes

o Greatly reduces the computation cost: only the class
distribution needs to be counted.

o Once the probability P(X|Ci) is known, assign X to the


class with maximum P(X|Ci)*P(Ci)

• Example:

o Given the following data sample from our previous
example:

X=(age<=30,Income=medium,Student=yes,Credit_rating=Fair)

o Compute P(xk|Ci):
P(x_k|C_i) = \frac{s_{ik}}{s_i}

Where sik is the number of training samples of Class
Ci having the value xk for the attribute Ak and si is
the number of training samples belonging to Ci

P(age=“<30” | buys_computer=“yes”) = 2/9=0.222


P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4

X=(age<=30 ,income =medium, student=yes,credit_rating=fair)

o Compute P(X|Ci) :

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)

P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044

P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019

o P(X|Ci)*P(Ci) :

P(C_i) = the class prior probability = \frac{s_i}{s}

Where s_i is the number of training samples in C_i
and s is the total number of training samples.

P(buys_computer=“yes”) = 9/14 = 0.643
P(buys_computer=“no”) = 5/14 = 0.357

P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007

X belongs to class “buys_computer=yes”
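The whole computation can be reproduced with a short Python sketch; the tuple encoding of the 14 training samples is an assumption made for this example, and no smoothing is applied (exactly as in the hand computation above).

from collections import Counter

# The 14 training samples: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),       ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),       (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),      (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes_predict(samples, query):
    # Score each class Ci by P(Ci) * prod_k P(xk|Ci), with
    # P(Ci) = si/s and P(xk|Ci) = sik/si (relative frequencies, no smoothing).
    classes = Counter(s[-1] for s in samples)
    scores = {}
    for ci, si in classes.items():
        score = si / len(samples)
        for k, xk in enumerate(query):
            sik = sum(1 for s in samples if s[-1] == ci and s[k] == xk)
            score *= sik / si
        scores[ci] = score
    return max(scores, key=scores.get), scores

X = ("<=30", "medium", "yes", "fair")
print(naive_bayes_predict(data, X))
# ('yes', {'no': ~0.007, 'yes': ~0.028}) -> X is assigned to buys_computer = "yes"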

• Advantages:
o Easy to implement
o Good results obtained in most of the cases

• Disadvantages
o Assumption: class conditional independence, therefore
loss of accuracy
o Practically, dependencies exist among variables
o E.g., in hospitals, a patient’s profile (age, family history,
etc.), symptoms (fever, cough, etc.), and diseases (lung
cancer, diabetes, etc.) are related
o Dependencies among these cannot be modeled by the
Naïve Bayesian Classifier

• How to deal with these dependencies?


o Bayesian Belief Networks

7. Bayesian Belief Networks

• Objectives:
o The naïve Bayesian classifier assumes that attributes
are conditionally independent

o Belief Nets are PROVEN TECHNOLOGY


Medical Diagnosis
DSS for complex machines
Forecasting, Modeling, Information Retrieval, etc.

7.1. Definition

o A Bayesian network is a causal directed acyclic graph
(DAG), associated with an underlying probability
distribution.
o DAG structure
Each node is represented by a variable v
v depends (only) on its parents, through its
conditional probability:
P(vi | parenti = <0,1,…>)

v is INDEPENDENT of its non-descendants, given
assignments to its parents

o We must specify the conditional probability distribution


for each node.

o If the variables are discrete, each node is described by a


table (Conditional Probability Table (CPT)) which lists
the probability that the child node takes on each of its
different values for each combination of values of its
parents.
• Example:
DAG: D and I are parents of H; H is a parent of J and B.

Given H = 1:
- D has no influence on J
- J has no influence on B
- etc.

• Example: Simple Belief Net (H is the parent of B; H and B
are the parents of J)

P(H):
  P(H=1)   P(H=0)
  0.05     0.95

P(B|H):
  h   P(B=1 | H=h)   P(B=0 | H=h)
  1   0.95           0.05
  0   0.03           0.97

P(J|H,B):
  h   b   P(J=1 | h,b)   P(J=0 | h,b)
  1   1   0.8            0.2
  1   0   0.8            0.2
  0   1   0.3            0.7
  0   0   0.3            0.7
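For illustration, a minimal sketch that stores the three CPTs above as Python dictionaries and evaluates the network factorization P(H, B, J) = P(H) P(B|H) P(J|H,B); the marginal P(J=1) at the end is obtained by summing the joint over H and B.

# CPTs from the simple belief net above, stored as dictionaries.
P_H = {1: 0.05, 0: 0.95}                                          # P(H=h)
P_B_given_H = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.03, 0: 0.97}}      # P(B=b | H=h)
P_J_given_HB = {(1, 1): {1: 0.8, 0: 0.2}, (1, 0): {1: 0.8, 0: 0.2},
                (0, 1): {1: 0.3, 0: 0.7}, (0, 0): {1: 0.3, 0: 0.7}}  # P(J=j | H=h, B=b)

def joint(h, b, j):
    # Network factorization: P(H=h, B=b, J=j) = P(H=h) P(B=b|H=h) P(J=j|H=h,B=b)
    return P_H[h] * P_B_given_H[h][b] * P_J_given_HB[(h, b)][j]

print(joint(1, 1, 1))  # 0.05 * 0.95 * 0.8 = 0.038
# Marginal P(J=1) by summing the joint over H and B:
print(sum(joint(h, b, 1) for h in (0, 1) for b in (0, 1)))  # ~0.325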

• Challenges
o Efficient ways to use BNs
o Ways to create BNs
o Ways to maintain BNs
o Reason about time

8. Neural Networks: Classification by Backpropagation

• A neural network is a set of connected input/output units
where each connection has a weight associated with it.

• During the learning process, the network learns by
adjusting the weights to predict the correct class label of the
input samples.

• Typical NN structure for classification:


One output node per class
Output value is class membership function value

• Perceptron
It is one of the simplest NN
No hidden layers.

• Supervised learning

• For each tuple in training set, propagate it through NN.


Adjust weights on edges to improve future classification.

• Algorithms: Propagation, Backpropagation, Gradient
Descent


8.1. Neural Network Issues

• Number of source nodes


• Number of hidden layers
• Training data
• Number of sinks
• Interconnections
• Weights
• Activation Functions
• Learning Technique
• When to stop learning

8.2. Backpropagation Algorithm

• Backpropagation learns by iteratively processing a set of
training samples, comparing the network’s prediction for
each sample with the actual known class label.

• For each training sample, the weights are modified so as
to minimize the mean squared error between the network
prediction and the actual class; modifications are made in
the “backward” direction.

• A Neuron:

The inputs x0, x1, ..., xn (input vector x) are multiplied by the
weights w0, w1, ..., wn (weight vector w); the weighted sum plus
the bias θj is passed through the activation function f, which
produces the output y.

• The n-dimensional input vector x (A training sample) is


mapped into variable y by means of the scalar product and a
nonlinear function mapping
• For example:
y = \sum_{i=0}^{n} w_i x_i + \theta_k

• Algorithm:
o Initialize the weights in the NN and the bias associated
with each neuron. The values are generally small random
numbers, e.g., between –1 and +1 or –0.5 and +0.5.
o For each sample, X, do: Propagate the inputs forward
The net input and output of each neuron j in the
hidden and output layers are computed:
I_j = \sum_i w_{ij} O_i + \theta_j
Where
wij is the weight of the connection from unit
i in the previous layer to unit j
Oi is the output of unit i from the previous
layer.
θj is the bias of the unit: this is used to vary
the activity of the unit.

Apply a non-linear logistic or sigmoid function to


compute the output of neuron j:
O_j = \frac{1}{1 + e^{-I_j}}

This is called a squashing function: map Ij to


a range of values between 0 and 1.

o Backpropagate the error: Update the weights and bias to


reflect the network’s prediction:
Output layer:
Err j = O j (1 − O j )(T j − O j )

Where Tj is the true output and Oj is the actual


output of neuron j
Hidden layer: It uses the weighted sum of the
errors of the neurons connected to neuron j in the
next layer. The error of a neuron j is:

Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}

Where wjk is the weight of the connection of


neuron j to a neuron k in the next layer, and Errk is
the error of neuron k

Adjust the weights and Biases: To reflect the


propagated errors

w_{ij} = w_{ij} + (l) Err_j O_i          \theta_j = \theta_j + (l) Err_j

Where l is a constant learning rate between 0 and 1

o Terminating conditions: Generally after hundreds of
thousands of iterations (epochs), or when:

All ∆wij in the previous epoch (one iteration


through the training set) are so small or below a
specified threshold.
The percentage of samples misclassified in the
previous epoch is below some threshold
A prespecified number of epochs has expired.
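A minimal one-hidden-layer sketch of the update rules above (not the textbook’s code); the network size, learning rate, and toy training set are assumptions made for this example.

import math, random

def sigmoid(x):
    # Squashing function: maps the net input to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, b_hidden, w_out, b_out, lr=0.5):
    # Forward pass: Ij = sum_i wij*Oi + theta_j, Oj = sigmoid(Ij)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

    # Backward pass: Errj = Oj(1-Oj)(Tj-Oj) at the output,
    # Errj = Oj(1-Oj) * sum_k Errk*wjk at the hidden layer.
    err_out = out * (1 - out) * (target - out)
    err_hidden = [h * (1 - h) * err_out * w for h, w in zip(hidden, w_out)]

    # Weight and bias updates: wij += lr*Errj*Oi, thetaj += lr*Errj
    w_out = [w + lr * err_out * h for w, h in zip(w_out, hidden)]
    b_out += lr * err_out
    w_hidden = [[w + lr * ej * xi for w, xi in zip(ws, x)]
                for ws, ej in zip(w_hidden, err_hidden)]
    b_hidden = [b + lr * ej for b, ej in zip(b_hidden, err_hidden)]
    return w_hidden, b_hidden, w_out, b_out, out

# Small random initialization (here between -0.5 and +0.5), as in the algorithm above.
random.seed(0)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b_h = [random.uniform(-0.5, 0.5) for _ in range(2)]
w_o = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_o = random.uniform(-0.5, 0.5)

# One epoch over a toy (hypothetical) training set.
for x, t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    w_h, b_h, w_o, b_o, out = train_step(x, t, w_h, b_h, w_o, b_o)
    print(x, round(out, 3))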

• Do example in the textbook: Page 308

9. Prediction

• Prediction (often called regression) is similar to
classification
o First, construct a model
o Second, use model to predict unknown value
• Major method for prediction is regression
o Linear and multiple regression
o Non-linear regression
• Prediction is different from classification
o Classification refers to predicting a categorical class label
o Prediction models continuous-valued functions
• Many classification methods can also be used for
regression
o Decision trees can also produce probabilities
o Bayesian networks: probability
o Neural networks: continuous output

9.1. Regression Analysis and Log-Linear Models in Prediction

• Linear regression: Y = a + b X
o Two parameters, a and b specify the line and are to be
estimated by using the data at hand.
o Use the least squares criterion on the known values of
Y1, Y2, …, X1, X2, …. (See example 7.6 on page 320
of your textbook, and the sketch at the end of this
subsection.)

• Multiple regression: Y = b0 + b1 X1 + b2 X2.


o Many nonlinear functions can be transformed into the
above.

• Non-linear models:
o Many can be converted to a linear model by
transforming the variables (e.g., polynomial regression)
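A minimal least-squares sketch for simple linear regression Y = a + bX; the data values (years of experience vs. salary in $1000s) are made up for illustration.

def linear_regression(xs, ys):
    # Least-squares estimates for Y = a + bX:
    #   b = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2),  a = mean_y - b*mean_x
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical sample data: years of experience (x) vs. salary in $1000s (y).
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
a, b = linear_regression(xs, ys)
print(round(a, 2), round(b, 2))   # intercept ~23.2, slope ~3.5 for this sample data
print(a + b * 10)                 # predicted salary for 10 years of experience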

10. Classification Accuracy: Estimating Error Rates

• Partition: Training-and-testing
o Use two independent data sets, e.g., training set (2/3),
test set(1/3)
o Used for data set with large number of samples

• Cross-validation
o Divide the data set into k subsamples
o Use k-1 subsamples as training data and one sub-sample
as test data—k-fold cross-validation
o For data set with moderate size

• Bootstrapping (leave-one-out)
o For small size data

• Confusion Matrix:
o This matrix shows how well the classifier
predicts the different classes
o It describes the relationship between actual and
detected (predicted) classes:

                     Detected
                     Positive             Negative
Actual   Positive    A: True Positive     B: False Negative
         Negative    C: False Positive    D: True Negative

• The recall (or the true positive rate) and the precision (or
the positive predictive rate) can be derived from the
confusion matrix as follows:
o Recall = \frac{A}{A+B}

o Precision = \frac{A}{A+C}
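For illustration, a minimal sketch computing accuracy, recall, and precision from the confusion-matrix counts A, B, C, D above, together with a simple k-fold index splitter; the counts used at the end are hypothetical.

def confusion_metrics(tp, fn, fp, tn):
    # A = true positives, B = false negatives, C = false positives, D = true negatives
    accuracy  = (tp + tn) / (tp + fn + fp + tn)
    recall    = tp / (tp + fn)      # Recall    = A / (A + B)
    precision = tp / (tp + fp)      # Precision = A / (A + C)
    return accuracy, recall, precision

def k_fold_indices(n, k):
    # Partition sample indices 0..n-1 into k folds; each fold serves once as the test set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical counts for a test set of 100 samples:
print(confusion_metrics(tp=40, fn=10, fp=5, tn=45))   # (0.85, 0.8, ~0.889)
for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(test_idx))              # 8 training, 2 test indices per fold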

