Classification and Prediction
Supervised learning
The type of rule discussed can be represented by a tree.
Trees that represent classification rules are called classification trees or decision trees, and trees that represent regression rules are called regression trees.
Tree-structured rules are popular because they are easy to interpret and often achieve good accuracy.
Example
[Figure: a small decision tree. The root splits on Age (<=25 vs. >25); the Age <= 25 branch splits on Car Type into NO and YES leaves; the Age > 25 branch is a NO leaf.]
ID3
ID3 is used to build a decision tree based on information-theory concepts.
It chooses as the splitting attribute the one with the highest information gain (IG).
IG compares the information needed to make a correct classification before the split with the information needed after the split.

Outlook    Yes  No  Entropy
Sunny       2    3   0.970
Overcast    4    0   0
Rain        3    2   0.970

Split entropy: E(outlook) = [(2+3)/14](0.970) + [(4+0)/14](0) + [(3+2)/14](0.970) = 0.692
Gain(outlook) = I - E(outlook) = 0.940 - 0.692 = 0.248
(I = 0.940 is the entropy of the full training set of 9 "yes" and 5 "no" tuples.)
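A minimal sketch (assuming the standard play-tennis counts shown above) that reproduces these numbers; the slide's 0.692 and 0.248 come from rounding each branch entropy to 0.970 first, so the unrounded results differ slightly in the third decimal:

```python
from math import log2

def entropy(counts):
    """Entropy of a list of class counts, e.g. [2, 3] -> ~0.971."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts for each value of the outlook attribute.
outlook = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
n = sum(sum(c) for c in outlook.values())            # 14 tuples in total

info_before = entropy([9, 5])                        # ~0.940
e_outlook = sum(sum(c) / n * entropy(c) for c in outlook.values())
gain = info_before - e_outlook

print(f"E(outlook) = {e_outlook:.3f}, Gain(outlook) = {gain:.3f}")
```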
CART
• L = left subtree of the current node
• R = right subtree of the current node
• PL = probability that a tuple in the training set will be on the left side of the tree
• PR = probability that a tuple in the training set will be on the right side of the tree
  This is defined as [tuples in subtree] / [tuples in training set]
• P(Cj|tL) is the probability that a tuple is in class Cj and in the left subtree
• P(Cj|tR) is the probability that a tuple is in class Cj and in the right subtree
  This is defined as [tuples of class j in subtree] / [tuples at the target node]
• The goodness of a split s at node t is Φ(s|t) = 2 · PL · PR · Σj |P(Cj|tL) - P(Cj|tR)| (see the code sketch below)
• At each step, only one criterion is chosen as the best over all possible criteria
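A short Python sketch of this goodness-of-split measure, using the slide's convention of dividing class counts by the number of tuples at the node being split (the example counts are from the height split at 1.8 worked out below):

```python
def phi(left_counts, right_counts):
    """Goodness of a binary split: 2 * PL * PR * sum_j |P(Cj|tL) - P(Cj|tR)|."""
    n = sum(left_counts) + sum(right_counts)   # tuples at the target node
    p_left = sum(left_counts) / n
    p_right = sum(right_counts) / n
    diff = sum(abs(l / n - r / n) for l, r in zip(left_counts, right_counts))
    return 2 * p_left * p_right * diff

# Height split at 1.8 (see below): left has S=4, M=1, T=0; right has S=0, M=7, T=3.
print(round(phi([4, 1, 0], [0, 7, 3]), 3))     # 0.385
```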
Gender  Short  Medium  Tall  Total
F        3      6       0     9
M        1      2       3     6

Split on Gender: PL = P(F) = 9/15 and PR = P(M) = 6/15.
Height split   # less than   # greater than or equal   Total
1.6            0             15                         15
1.7            2             13                         15
1.8            5             10                         15
1.9            9             6                          15
2.0            12            3                          15
Evaluating the candidate Height splits:

Split at 1.6:
Height   S  M  T  Total
<1.6     0  0  0  0
>=1.6    4  8  3  15
Φ(1.6) = 2 · (0/15) · (15/15) · [...] = 0

Split at 1.7:
<1.7: S=2, M=0, T=0    >=1.7: S=2, M=8, T=3
Φ(1.7) = 2 · (2/15) · (13/15) · (0 + 8/15 + 3/15) = 0.169

Split at 1.8:
<1.8: S=4, M=1, T=0    >=1.8: S=0, M=7, T=3
CART Example
Using PL = p(height < h) and PR = p(height >= h):
Φ(1.6) = 2(0/15)(15/15)[...] = 0
Φ(1.7) = 2(2/15)(13/15)[|2/15-2/15| + |0-8/15| + |0-3/15|] = 0.169
Φ(1.8) = 2(5/15)(10/15)[|4/15-0| + |1/15-7/15| + |0-3/15|] = 0.385
Φ(1.9) = 2(9/15)(6/15)[|4/15-0| + |5/15-3/15| + |0-3/15|] = 0.288
Φ(2.0) = 2(12/15)(3/15)[|4/15-0| + |8/15-0| + |0-3/15|] = 0.320
Among these candidate splits on height, 1.8 gives the highest goodness value Φ.
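These values can be checked with a few lines of Python (a self-contained sketch; the per-class counts on each side of each split are taken from the tables above):

```python
# Per-class (S, M, T) counts to the left and right of each candidate height split.
splits = {
    1.6: ([0, 0, 0], [4, 8, 3]),
    1.7: ([2, 0, 0], [2, 8, 3]),
    1.8: ([4, 1, 0], [0, 7, 3]),
    1.9: ([4, 5, 0], [0, 3, 3]),
    2.0: ([4, 8, 0], [0, 0, 3]),
}

for h, (left, right) in splits.items():
    n = sum(left) + sum(right)                      # 15 tuples at the node being split
    pl, pr = sum(left) / n, sum(right) / n
    diff = sum(abs(l / n - r / n) for l, r in zip(left, right))
    print(f"phi({h}) = {2 * pl * pr * diff:.3f}")   # 0.000, 0.169, 0.385, 0.288, 0.320
```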
Bayesian Classification
Regression
For classification, the output is nominal.
In regression, the output is continuous.
Function approximation:
Many models could be used; the simplest is linear regression.
Fit the data with the best hyper-plane which "goes through" the points.
[Figure: scatter plot with the independent variable x (input) on the horizontal axis and the dependent variable y (output) on the vertical axis]
Simple Linear Regression
For now, assume just one (input) independent variable x and one (output) dependent variable y.
Multiple linear regression assumes an input vector x.
Multivariate linear regression assumes an output vector y.
We will "fit" the points with a line (i.e., a hyper-plane).
Which line should we use?
Choose an objective function.
For simple linear regression we choose the sum squared error (SSE):
SSE = Σ (predicted - actual)² = Σ residual², summed over the training points.
Thus, find the line which minimizes the sum of the squared residuals (i.e., least squares).
Y = β0 + β1x

x  1  2  3  4  5  6  7  8  9
y  9  8  10 12 11 13 14 16 15

σx² = (1/n)Σx² - x̄²
b = r·(σy/σx),  a = ȳ - b·x̄
Substitute the values of a and b into y = a + bx.
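A quick numeric check of this fit (a sketch using the standard least-squares formulas; the fitted coefficients below are computed from the data, not taken from the slide):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([9, 8, 10, 12, 11, 13, 14, 16, 15], dtype=float)

# Least-squares slope and intercept: b = cov(x, y) / var(x), a = y_bar - b * x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"y = {a:.2f} + {b:.2f} x")      # y = 7.25 + 0.95 x for this data
sse = np.sum((a + b * x - y) ** 2)     # the sum of squared residuals the fit minimizes
print(f"SSE = {sse:.2f}")
```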
Linear regression
Not applicable to most complex problems.
Does not work with non-numeric data.
Assumes a linear relationship.
The fitted straight-line values can be greater than 1 or less than 0, so they cannot be used as the probability of occurrence of the target class.
Regression Modeling
A simple regression model (one independent variable) fits a regression line in 2-dimensional space.
A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space.
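As a companion sketch, a regression plane with two explanatory variables can be fit by ordinary least squares; the data here is invented purely for illustration:

```python
import numpy as np

# Toy data (made up for illustration): two explanatory variables x1, x2 and a response y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([4.1, 4.9, 8.2, 9.0, 13.1, 13.8])

# Design matrix with an intercept column; least squares fits the plane y = b0 + b1*x1 + b2*x2.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs
print(f"fitted plane: y = {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
```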
Simple Regression Model
Regression coefficients are estimated by minimizing ∑residuals² (i.e., the sum of the squared residuals) to derive this model: ŷ = β0 + β1x.
Logistic Regression Example
Age (X axis, input variable) – Data is fictional
Heart Failure (Y axis, 1 or 0, output variable)
Could use the value of the regression line as a probability approximation, but it extrapolates outside 0-1 and is not as good empirically.
A sigmoidal curve gives an empirically good probability approximation and is bounded between 0 and 1.
Logistic Regression Approach
Learning
1. Transform initial input probabilities into log odds
(logit)
2. Do a standard linear regression on the logit values
This effectively fits a logistic curve to the data, while still
just doing a linear regression with the transformed input
(ala quadric machine, etc.)
Generalization
1. Find the value for the new input on the logit line
2. Transform that logit value back into a probability
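A minimal sketch of this two-step recipe in Python, using the medication-dosage cure probabilities from the table on the next slide (the helper name predict_prob is mine, for illustration):

```python
import numpy as np

# Observed cure probabilities at each medication dosage (from the table below).
dosage = np.array([20.0, 30.0, 40.0, 50.0])
prob = np.array([0.20, 0.33, 0.67, 0.86])

# Learning step 1: transform probabilities into log odds (the logit).
logit = np.log(prob / (1.0 - prob))

# Learning step 2: standard linear regression on the logit values.
slope, intercept = np.polyfit(dosage, logit, 1)

# Generalization: evaluate the logit line at a new input, then map back to a probability.
def predict_prob(x):
    z = intercept + slope * x              # value on the fitted logit line
    return 1.0 / (1.0 + np.exp(-z))        # inverse logit (sigmoid) -> probability in (0, 1)

print(predict_prob(35.0))                  # estimated cure probability at dosage 35
```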
Non-Linear Pre-Process to Logit (Log Odds)

Medication Dosage   # Cured   Total Patients   Probability: # Cured / Total Patients
20                  1         5                .20
30                  2         6                .33
40                  4         6                .67
50                  6         7                .86

[Figure: plots of cured (1) vs. not cured (0) and of cure probability, each against medication dosage from 0 to 60]
Logistic Regression Approach
Could use linear regression with the probability points, but that would not extrapolate well.
The logistic version is better, but how do we get it?
Similar to the quadric machine, we do a non-linear pre-process of the input and then do linear regression on the transformed values, i.e., a linear regression on the log odds (logit).
[Figure: two plots of cure probability vs. medication dosage (0 to 60)]
Non-Linear Pre-Process to Logit (Log Odds)

Medication Dosage   # Cured   Total Patients   Probability   Odds: p/(1-p)   Logit: ln(Odds)
20                  1         5                .20           .25             -1.39
30                  2         6                .33           .50             -0.69
40                  4         6                .67           2.0             0.69
50                  6         7                .86           6.0             1.79
(Probability = # Cured / Total Patients; Odds = # cured / # not cured)

[Figure: plots of cured (1) vs. not cured (0) and of cure probability against medication dosage (0 to 60)]
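The Odds and Logit columns can be reproduced directly from the counts (a small sketch; the printed values are computed, and match the table to two decimals):

```python
import math

# (dosage, # cured, total patients) from the table above.
rows = [(20, 1, 5), (30, 2, 6), (40, 4, 6), (50, 6, 7)]

for dosage, cured, total in rows:
    p = cured / total                      # probability of cure
    odds = cured / (total - cured)         # p / (1 - p), i.e. # cured / # not cured
    logit = math.log(odds)                 # log odds
    print(f"{dosage}: p={p:.2f}, odds={odds:.2f}, logit={logit:.2f}")
```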
Regression of Log Odds
(Table columns as above: Medication Dosage, # Cured, Total Patients, Probability, Odds, Log Odds)