Classification and Prediction

Outline

 Distance-based algorithms
   K-Nearest Neighbor
 Decision tree-based algorithms
   ID3
   C4.5
   CART
 Statistical algorithms
   Bayesian Classification
 Neural network-based algorithms
   Propagation
 Linear and non-linear regression

Tree-Structured Rules

 Supervised learning
 The type of rule discussed here can be represented by a tree.
 Trees that represent classification rules are called classification trees or decision trees, and trees that represent regression rules are called regression trees.
 Tree-structured rules are very popular since they are easy to interpret and are very accurate.
Example

 Insurance risk decision tree: the root splits on Age (<=25 vs. >25). Age > 25 leads to leaf NO; Age <= 25 splits on Car Type, where Sedan leads to leaf NO and Sports or Truck leads to leaf YES.
 Each path from the root node to a leaf node represents one classification rule.
Decision Trees
 Also called a classification tree
 Graphical representation of a set of classification rules
 Each internal node represents a predictor / splitting attribute
 Each arc / edge is labeled with a predicate or splitting criterion
 Each leaf node is labeled with a class Cj
Decision Trees
Basic step
 Build the tree
 Apply the tree to the database
Decision Trees
 A decision tree is usually constructed in two phases:
 The growth phase
 The pruning phase
 In the growth phase, an overly large tree is constructed. This tree represents the records in the input database very accurately.
 In the pruning phase, the final size of the tree is determined.
 The rules represented by the tree constructed in phase one are usually overspecialized.
 By reducing the size of the tree, we generate a smaller number of more general rules that are better than a very large number of very specialized rules.
Decision Trees
 The splitting criterion at a node is found through application of a split selection method.
 A split selection method is an algorithm that takes as input a relation and outputs the locally 'best' splitting criterion.

 Following is the decision tree induction schema:

Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

BuildTree(Node n, Partition D, split selection method S)
    Apply S to D to find the splitting criterion
    If (a good splitting criterion is found)
        Create two children nodes n1 & n2 of n
        Partition D into D1 & D2
        BuildTree(n1, D1, S)
        BuildTree(n2, D2, S)
    endif
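The schema above can be sketched in Python. This is only a minimal illustration of the recursion, not the slides' exact algorithm: it assumes a hypothetical find_split(D) split selection method that returns an (attribute index, value) criterion or None, stores each row's class label in its last position, and represents the tree as nested dicts.

# Minimal sketch of the BuildTree schema (hypothetical helpers).
def majority_class(D):
    labels = [row[-1] for row in D]
    return max(set(labels), key=labels.count)

def build_tree(D, find_split):
    criterion = find_split(D)                 # apply S to D
    if criterion is None:                     # no good split -> leaf node
        return {"class": majority_class(D)}
    attr, value = criterion
    D1 = [row for row in D if row[attr] <= value]   # partition D into D1 & D2
    D2 = [row for row in D if row[attr] > value]
    if not D1 or not D2:                      # degenerate split -> leaf node
        return {"class": majority_class(D)}
    return {"split": criterion,
            "left": build_tree(D1, find_split),     # BuildTree(n1, D1, S)
            "right": build_tree(D2, find_split)}    # BuildTree(n2, D2, S)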
ID3 Background
 “Iterative Dichotomizer 3”.
 Invented by Ross Quinlan in 1979.
 Generates Decision Trees using Entropy.
 Information Gain is used to select the most useful
attribute for classification.
 Builds the tree in top down fashion.
 Succeeded by Quinlan’s C4.5 and C5.0 algorithms.
Gain(A) = I(p,n) – E(A)
Entropy
 Introduced by Claude Shannon in 1948
 Quantifies "randomness"
 Lower value implies less uncertainty
 Higher value implies more uncertainty
 A completely homogeneous sample has entropy of 0
 An equally divided sample has entropy of 1
 Formula for I(p,n), the entropy of the starting set (parent table):
   I(p,n) = -[p/(p+n)] log2 [p/(p+n)] - [n/(p+n)] log2 [n/(p+n)]
 Entropy of an attribute A:
   E(A) = Σi [(pi+ni)/(p+n)] I(pi,ni)

Example (TEMP attribute):
TEMP   p   n   I(p,n)
Hot    0   2   0
Mild   1   1   1
cold   1   0   0
Information Gain (IG)
 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
 It identifies which attribute creates the most homogeneous branches.
 Formula: Gain(A) = I(p,n) – E(A)

Outlook    p   n   I(p,n)
Sunny      2   3   0.970
Overcast   4   0   0
Rain       3   2   0.970

E(outlook) = [(2+3)/14](0.970) + [(4+0)/14](0) + [(3+2)/14](0.970)
           = 0.692
Gain(outlook) = IG - E(outlook)
              = 0.940 – 0.692
              = 0.248
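As a quick check on the arithmetic above, the short Python sketch below recomputes I(p,n), E(outlook), and Gain(outlook) for the 14-row weather data (p = 9, n = 5); the function names are my own, not from the slides, and the printed values differ from the slide only by rounding.

import math

def info(p, n):
    """I(p,n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

# Outlook partitions of the weather data: value -> (p, n).
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
p, n = 9, 5

e_outlook = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in outlook.values())
print(round(info(p, n), 3))              # 0.940
print(round(e_outlook, 3))               # 0.694 (slides round to 0.692)
print(round(info(p, n) - e_outlook, 3))  # 0.247 (slides round to 0.248)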
ID3
 ID3 is used to build a DT based on the information theory concept.
 ID3 chooses the splitting attribute with the highest IG.
 IG compares the info needed to make a correct classification before the split vs. the info needed after the split.
 IG = entropy of the original dataset - entropy of the split dataset
 Entropy of the split dataset = weighted sum of the entropies of each subdivided dataset
 Weight of each dataset = fraction of the dataset placed in that division
 (For the Outlook attribute above: E(outlook) = 0.692, Gain(outlook) = 0.940 – 0.692 = 0.248.)
ID3
 A branch set with entropy of 0 is a leaf node.
 Otherwise, the branch needs further splitting to classify
its dataset.
 The ID3 algorithm is run recursively on the non-leaf
branches, until all data is classified.
Overfitting
 During the construction of a DT, the tree repeatedly splits the data into nodes to get successively purer subsets of data.
 If nodes are fitted to noise in the training data, the model will not generalize well.
 This occurs when the model is too complex.
 Complexity is determined by the number of nodes in the tree.
 To avoid overfitting:
 Post-pruning
 Grow the tree to max size, then prune based on a validation set.
 Computationally expensive method.
 Replace a subtree with a leaf node if the generalization error improves or does not change.
 Pre-pruning
 Stop growing the tree before it is fully grown to fit the training data.
 Stop splitting when the split is not statistically significant.
Advantages of using ID3
 Understandable prediction rules are created from the
training data.
 Builds the fastest tree.
 Only need to test enough attributes until all data is
classified.
 Finding leaf nodes enables test data to be pruned,
reducing number of tests.
 Whole dataset is searched to create tree.
Disadvantages of using ID3
 Data may be over-fitted or over-classified, if a small
sample is tested.
 Smaller decision trees should be preferred over larger
ones. This algorithm usually produces small trees, but it
does not always produce the smallest possible tree
 Only one attribute at a time is tested for making a
decision.
 Classifying continuous data may be computationally
expensive.
DT Advantages/Disadvantages
 Advantages:
 DTs are easy to use and efficient.
 Rules generated are easy to interpret and understand
 They scale well for large databases as tree size is
independent of database size.
 Trees can be constructed for data with many attributes.
 Disadvantages:
 Does not easily handle continuous data.
 May suffer from overfitting.
 Can be quite large – pruning is necessary.
 Correlations among attributes in the database are ignored in the DT process.
C4.5
 A successor of ID3
 Builds the DT using a divide-and-conquer, top-down, recursive approach
 Builds the DT based on the information theory concept
 It chooses the splitting attribute with the highest Gain Ratio
 The split entropy is the entropy of an attribute split itself (ignoring classes); the Gain Ratio is the ratio of the IG for a splitting attribute to this split entropy
 Formula: Gain Ratio(A) = Gain(A) / split entropy(A)
temp   p   n   I(p,n)
Hot    2   2   1       I(2,2) = 1
Mild   4   2   0.918   I(4,2) = -4/6 log2(4/6) - 2/6 log2(2/6) = 0.918
cool   3   1   0.811   I(3,1) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811

E(temp) = [(2+2)/14](1) + [(4+2)/14](0.918) + [(3+1)/14](0.811)
        = 0.911
Gain(temp) = IG - E(temp)
           = 0.940 - 0.911
           = 0.029
Split info(temp) = -4/14 log2(4/14) - 6/14 log2(6/14) - 4/14 log2(4/14) = 1.557
Gain Ratio(temp) = 0.029 / 1.557 = 0.019
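The sketch below recomputes the gain ratio for temp from these class counts; entropy() and info() are my own helper names, and the output confirms the gain, split info, and gain ratio figures above.

import math

def entropy(fracs):
    """Entropy of a list of fractions that sum to 1 (0 log 0 taken as 0)."""
    return -sum(f * math.log2(f) for f in fracs if f)

def info(p, n):
    return entropy([p / (p + n), n / (p + n)])

temp = {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)}   # value -> (p, n)
p, n = 9, 5
total = p + n

e_temp = sum((pi + ni) / total * info(pi, ni) for pi, ni in temp.values())
gain = info(p, n) - e_temp                                             # ~0.029
split_info = entropy([(pi + ni) / total for pi, ni in temp.values()])  # ~1.557
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))  # 0.029 1.557 0.019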


C4.5
1. Handles both continuous and discrete attributes
 The basic idea is to divide the data into ranges based
on the attribute values for that item that are found in
the training sample
2. Handling training data with missing attribute values
 Missing attribute values are simply not used in gain
ratio and entropy calculations
 To classify a record with a missing attribute value, the
value for that item can be predicted based on what is
known about the attribute values for the other records
C4.5
3. Pruning trees after creation
 C4.5 goes back through the tree once it's been created
and attempts to remove branches that do not help by
replacing them with leaf nodes
C4.5 EXAMPLE
 To calculate the GainRatio for the gender split, we first find the entropy associated with the gender split itself (ignoring classes):
   H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292
 This gives the GainRatio value for the gender attribute as 0.09688 / 0.292 = 0.332
 The entropy for the split on height (ignoring classes) is:
   H(4/15, 7/15, 2/15, 2/15)
CART
• Classification and regression tree
• If the target variable is nominal ( categorical) then the tree is
called Classification tree.
• If the target variable is numerical (continuous) then the tree is
called Regression tree.
• CART handles missing data by ignoring them in calculating the
goodness of split on the attribute.
• CART contains pruning strategy.
CART
• Classification and regression trees (CART) is a technique that
generates a binary decision tree
• Unlike ID3, however, where a child is created for each
subcategory, only two children are created
• The splitting is performed around what is determined to be
the best split point.
• At each step, an exhaustive search is used to determine the
best split, where "best" is defined by a measure φ(s/t)
CART
 Creates a binary tree
 Formula to choose split point, s, for node t:
   Φ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|
 This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s
CART
• L = left subtree of the current node
• R = Right subtree of the current node .
• PL= probability that a tuple in the training set will be on the Left side
of the tree
• PR= probability that a tuple in the training set will be on the Right side of the
tree
This is defined as [tuples in subtree]/ [tuples in training set]

• P(Cj|tL) is the probability that a tuple is in class, Cj, and in the left subtree
• P(Cj|tR) is the probability that a tuple is in class, Cj, and in the right subtree
• This is defined as the [tuples of class j in subtree]/ [tuples at the target node ]

• At each step, only one criterion is chosen as the best over all possible criteria

Gender   short   medium   Tall   Total
F        3       6        0      9
M        1       2        3      6

Class count differences between the two branches: 3-1=2, 6-2=4, 3-0=3

Splitting on gender: the F branch holds 9/15 of the tuples (S=3, M=6, T=0) and the M branch holds 6/15 (S=1, M=2, T=3).

Φ(gender) = 2 * 9/15 * 6/15 * (2/15 + 4/15 + 3/15) = 0.224
height   Less than   Greater than or equal   Total
1.6      0           15                      15
1.7      2           13                      15
1.8      5           10                      15
1.9      9           6                       15
2        12          3                       15
Split at height 1.6:

height   S   M   T   Total
<1.6     0   0   0   0
>=1.6    4   8   3   15

Φ(1.6) = 2 * 0/15 * 15/15 * (4/15 + 8/15 + 3/15) = 0

Split at height 1.7:

height   S   M   T   Total
<1.7     2   0   0   2
>=1.7    2   8   3   13

Φ(1.7) = 2 * 2/15 * 13/15 * (0/15 + 8/15 + 3/15) = 0.169

Split at height 1.8:

height   S   M   T   Total
<1.8     4   1   0   5
>=1.8    0   7   3   10

Φ(1.8) = 2 * 5/15 * 10/15 * (4/15 + 6/15 + 3/15) = 0.385


CART Example

• At the start, there are six choices for the split point (right branch on equality):
  – Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  – Φ(1.6) = 0
  – Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  – Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  – Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  – Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
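A short sketch that recomputes Φ for the three height splits worked through in detail above, directly from their left/right class counts; the phi() helper and the counts dictionary are my own wrapping of the formula Φ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|.

def phi(left, right, total=15):
    """CART goodness of a binary split, given (S, M, T) counts per branch."""
    p_l, p_r = sum(left) / total, sum(right) / total
    diff = sum(abs(l - r) / total for l, r in zip(left, right))
    return 2 * p_l * p_r * diff

# Candidate split -> ((S, M, T) counts on the left, counts on the right).
candidates = {
    "height<1.6": ((0, 0, 0), (4, 8, 3)),
    "height<1.7": ((2, 0, 0), (2, 8, 3)),
    "height<1.8": ((4, 1, 0), (0, 7, 3)),
}
for name, (left, right) in candidates.items():
    print(name, round(phi(left, right), 3))   # 0, 0.169, 0.385

best = max(candidates, key=lambda k: phi(*candidates[k]))
print("best split:", best)                    # height<1.8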
CART Example

 Φ(1.6) = Φ(height<1.6) = 0
 Φ(1.7) = Φ(height<1.7) = 2*(2/15)*(13/15)[|2/15-2/15| + |0-8/15| + |0-5/15|]
 Φ(1.8) = Φ(height<1.8) = 2*(5/15)*(10/15)[|4/15-0| + |1/15-7/15| + |0-3/15|]
 Φ(1.9) = Φ(height<1.9) = 2*(9/15)*(6/15)[|4/15-0| + |5/15-3/15| + |0-3/15|]
 Φ(2.0) = Φ(height<2.0) = 2*(12/15)*(3/15)[|4/15-0| + |8/15-0| + |0-3/15|]
Bayesian Classification

 Bayes Rule or Bayes Theorem is

   P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

 Suppose there are m different hypotheses; then

   P(xi) = Σj P(xi | hj) P(hj)

 Here P(h1|xi) is called the posterior probability, while P(h1) is the prior probability associated with hypothesis h1
 P(xi) is the probability of the occurrence of data value xi and P(xi|h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
Bayesian Classification

Car No.   Color    Type     Origin     Stolen
1         Red      Sports   Domestic   Y
2         Red      Sports   Domestic   N
3         Red      Sports   Domestic   Y
4         Yellow   Sports   Domestic   N
5         Yellow   Sports   Imported   Y
6         Yellow   SUV      Imported   N
7         Yellow   SUV      Imported   Y
8         Yellow   SUV      Domestic   N
9         Red      SUV      Imported   N
10        Red      Sports   Imported   Y

Priors: P(Y) = 5/10, P(N) = 5/10

Color:  P(red|Y) = 3/5      P(red|N) = 2/5
        P(yellow|Y) = 2/5   P(yellow|N) = 3/5
Type:   P(SUV|Y) = 1/5      P(SUV|N) = 3/5
        P(sports|Y) = 4/5   P(sports|N) = 2/5
Origin: P(dom|Y) = 2/5      P(dom|N) = 3/5
        P(imp|Y) = 3/5      P(imp|N) = 2/5

Unlabeled sample X = (red & SUV & domestic), decision = ?

P(X|Y) = P(red|Y) P(SUV|Y) P(dom|Y) = 3/5 * 1/5 * 2/5 = 0.048
P(X|N) = P(red|N) P(SUV|N) P(dom|N) = 2/5 * 3/5 * 3/5 = 0.144

P(X|N) > P(X|Y) (and the priors are equal), therefore sample X is class "N"
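The same computation can be written as a small naive Bayes sketch in Python; the table rows are taken from above, and likelihood() is my own helper (the priors are equal here, so comparing likelihoods is enough).

# Each row: (color, type, origin, stolen).
data = [
    ("red", "sports", "domestic", "Y"), ("red", "sports", "domestic", "N"),
    ("red", "sports", "domestic", "Y"), ("yellow", "sports", "domestic", "N"),
    ("yellow", "sports", "imported", "Y"), ("yellow", "suv", "imported", "N"),
    ("yellow", "suv", "imported", "Y"), ("yellow", "suv", "domestic", "N"),
    ("red", "suv", "imported", "N"), ("red", "sports", "imported", "Y"),
]

def likelihood(sample, label):
    """P(X|class) = product over attributes of P(attribute value | class)."""
    rows = [r for r in data if r[-1] == label]
    p = 1.0
    for i, value in enumerate(sample):
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

x = ("red", "suv", "domestic")
p_y, p_n = likelihood(x, "Y"), likelihood(x, "N")   # 0.048, 0.144
print("class:", "Y" if p_y > p_n else "N")          # N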


Exercise:

Car No.   A1   A2   A3   Class
1         A    C    A    C1
2         C    A    C    C1
3         A    A    C    C2
4         B    C    A    C2
5         C    C    B    C2

P(C1) = 2/5, P(C2) = 3/5

For each attribute A1, A2, A3, fill in P(a|C1), P(b|C1), P(c|C1) and P(a|C2), P(b|C2), P(c|C2).

Sample X = (A1=c, A2=c, A3=a): class?
Sample Y = (A1=a, A2=c, A3=b): class?

P(X|C1) P(C1) = P(A1=c|C1) P(A2=c|C1) P(A3=a|C1) P(C1),  with P(X|C1) = 1/2 * 1/2 * 1/2 = 0.125
P(X|C2) P(C2) = P(A1=c|C2) P(A2=c|C2) P(A3=a|C2) P(C2),  with P(X|C2) = 1/3 * 2/3 * 1/3 = 0.074

P(X|C1) > P(X|C2), therefore sample X is class "C1"


Bayesian Classification
 Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called naive Bayes classification has been proposed, based on Bayes rule of conditional probability
 By analyzing the contribution of each "independent" attribute, a conditional probability is determined
 A classification is made by combining the impact that the different attributes have on the prediction to be made
 The approach is called "naive" because it assumes independence between the various attribute values
Statistical-based algorithm (Bayesian
Classification)

• When classifying a target tuple, the conditional and prior


probabilities generated from the training set are used to make
the prediction
• This is done by combining the effects of the different attribute
values from the tuple
• Suppose that tuple ti has p independent attribute values {xi1, xi2, ..., xip}. From the descriptive phase, we know P(xik|Cj) for each class Cj and attribute xik
• We then estimate P(ti|Cj) by P(ti|Cj) = ∏P(xik|Cj)
• We then have the needed prior probabilities P(Cj) for each class
and the conditional probability P(ti|Cj)
• To calculate P(ti), we can estimate the likelihood that ti is in each
class. This can be done by finding the likelihood that this tuple is
in each class and then adding all these values
Statistical-based algorithm (Bayesian
Classification)
 The probability that ti is in a class is the product of the
conditional probabilities for each attribute value
 The posterior probability P(Cj|ti) is then found for each class
 The class with the highest probability is the one chosen for the
tuple.
Statistical-based algorithm (Bayesian
Classification)
• There are 4 tuples classified as short, 8 as medium, and 3 as tall.
• The Output1 classification uses the simple divisions shown below:
  2 m ≤ Height            Tall
  1.7 m < Height < 2 m    Medium
  Height ≤ 1.7 m          Short
• The Output2 results require a much more complicated set of divisions using both height and gender attributes.
• To facilitate classification, we divide the height attribute into six ranges:
  (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Statistical-based algorithm (Bayesian
Classification)
 With these training data, we estimate the prior probabilities:
   P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, and P(tall) = 3/15 = 0.2
 We use these values to classify a new tuple. For example, suppose we wish to classify t = (Adam, M, 1.95 m)
 By using these values and the associated probabilities of gender and height, we obtain the following estimates:
   P(t|short) = 1/4 x 0 = 0
   P(t|medium) = 2/8 x 1/8 = 0.031
   P(t|tall) = 3/3 x 1/3 = 0.333
 Weighting these by the priors, the tall class has the highest posterior, so t is classified as tall.

Prediction
 Dependent variable , y
 The variable whose values we want to explain or
forecast
 Independent variable , x
 Variable that explains the other
 Linear regression
 assumes a linear relationship between input
variable and output variable
 Logistic regression
 Used when the dependent variable is binary
 0/1, T/F, Y/N
Linear Regression
 Objective
 To establish if there is a relationship between two
variables.
 Income & spending
 Students’ weight & exam score
 Forecast new observation
 Sales in next quarter
 If the scatter diagram indicates some relationship between two variables x and y, then the dots of the scatter diagram will be concentrated round a curve.

 The curve is called the curve of regression and the relationship is said to be expressed by means of curvilinear regression.

 In the particular case when the curve is a straight line, it is called a line of regression and the regression is said to be linear.
Regression
 For classification the output(s) is nominal
 In regression the output is continuous
 Function approximation
 Many models could be used – the simplest is linear regression
 Fit the data with the best hyper-plane which "goes through" the points

[Figure: data points with the dependent variable y (output) plotted against the independent variable x (input).]
Regression
 For classification the output(s) is nominal
 In regression the output is continuous
 Function approximation
 Many models could be used – the simplest is linear regression
 Fit the data with the best hyper-plane which "goes through" the points
 For each point, the difference between the predicted point and the actual observation is the residue
Simple Linear Regression
 For now, assume just one (input) independent variable x, and one (output) dependent variable y
 Multiple linear regression assumes an input vector x
 Multivariate linear regression assumes an output vector y
 We will "fit" the points with a line (i.e. hyper-plane)
 Which line should we use?
 Choose an objective function
 For simple linear regression we choose sum squared error (SSE)
   Σ (predictedi – actuali)2 = Σ (residuei)2
 Thus, find the line which minimizes the sum of the squared residues (e.g. least squares)
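To make the objective concrete, the sketch below (toy data and candidate lines of my own choosing, not from the slides) computes the SSE for two candidate lines over the same points; the least-squares line is simply the line with the smallest such sum.

# Sum squared error of a candidate line y = a + b*x over a set of points.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]   # toy (x, y) data

def sse(a, b):
    return sum((a + b * x - y) ** 2 for x, y in points)

print(round(sse(0.0, 2.0), 3))   # candidate line y = 2x, SSE = 0.1
print(round(sse(0.5, 1.5), 3))   # a worse candidate line has a larger SSE (3.3)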
Y = β0 + β1x

 The equation of the line of regression is y = a + bx, where y is the dependent variable and x is the independent variable.

 The line of regression always passes through the point (x̄, ȳ), with a = ȳ - bx̄ and slope b = r σy / σx.

 The line gives the best estimate of y for a given value of x.

 Example: Y = 4 + 2x; for every unit increase in x, y increases by 2.
Regression line

 y = a + bx; substitute the values of a and b:
   a = ȳ - bx̄,   b = r σy / σx
 where r = cov(x, y) / (σx σy) and Cov(X,Y) = [1/n Σ xy] – x̄ ȳ
 'a' is the intercept and 'b' is the slope of the line

 Example: Consumption = 49.13 + (0.85) Income + error
   For every unit increase in income, consumption increases by 0.85.


Examples
 Calculate the regression line of y on x for the following data. Also obtain the prediction of y corresponding, on average, to x = 6.2.

x   1   2   3   4   5   6   7   8   9
y   9   8   10  12  11  13  14  16  15

 rxy = cov(x, y) / (σx σy),   Cov(X,Y) = [1/n Σ xy] – x̄ ȳ,   σx2 = (1/n Σ x2) – x̄2
 b = r σy / σx,   a = ȳ - bx̄,   then y = a + bx (substitute the values of a and b)
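A quick worked solution of this exercise using the formulas above (b = cov / σx2, a = ȳ - bx̄); the numbers printed come straight from this computation.

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 10, 12, 11, 13, 14, 16, 15]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y   # ~6.33
var_x = sum(xi * xi for xi in x) / n - mean_x ** 2                 # ~6.67

b = cov / var_x            # = r * sigma_y / sigma_x,  ~0.95
a = mean_y - b * mean_x    # ~7.25
print(f"y = {a:.2f} + {b:.2f}x")                        # y = 7.25 + 0.95x
print("prediction at x = 6.2:", round(a + b * 6.2, 2))  # 13.14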
 Linear regression
 is not applicable for most complex problems
 does not work with non-numeric data
 assumes a linear relationship
 The straight-line values can be greater than 1 and less than 0
 so they cannot be used as the probability of occurrence of the target class
Regression

 Simple regression considers the relation between a single explanatory variable and a response variable.

 Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.

 The intent is to look at the independent effect of each variable.
Regression Modeling
 A simple regression model (one independent variable) fits a regression line in 2-dimensional space

 A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Simple Regression Model
Regression coefficients are estimated by minimizing
∑residuals2 (i.e., sum of the squared residuals) to
derive this model:

The standard error of the regression (sY|x) is


based on the squared residuals:
Multiple Regression Model
Again, estimates for the multiple slope coefficients
are derived by minimizing ∑residuals2 to derive this
multiple regression model:

Again, the standard error of the regression is


based on the ∑residuals2:
Multiple Regression Model
 Intercept α predicts
where the regression
plane crosses the Y
axis
 Slope for variable X1
(β1) predicts the
change in Y per unit
X1 holding X2
constant
 The slope for
variable X2 (β2)
predicts the change
in Y per unit X2
holding X1 constant
Multiple Regression Model
A multiple regression
model with k independent
variables fits a regression
“surface” in k + 1
dimensional space (cannot
be visualized)
Multiple regression

 Method for analysing a linear relationship involving more than two variables: x1, x2, x3, ..., xn
 Y = a + b1x1 + b2x2 + b3x3 + ... + bnxn

Height of mother   Height of father   Height of daughter
63                 64                 58.6
67                 65                 64.7
64                 67                 66.3
…                  …                  …

Daughter Ht = 7.5 + 0.707(mother) + 0.614(father)
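The mechanics of fitting such a model with ordinary least squares can be sketched with numpy; only the three example rows above are used here, so the coefficients it prints will not match the quoted 7.5 / 0.707 / 0.614, which come from the full dataset.

import numpy as np

mother = np.array([63.0, 67.0, 64.0])
father = np.array([64.0, 65.0, 67.0])
daughter = np.array([58.6, 64.7, 66.3])

# Design matrix with an intercept column: Y = a + b1*mother + b2*father.
X = np.column_stack([np.ones_like(mother), mother, father])
(a, b1, b2), *_ = np.linalg.lstsq(X, daughter, rcond=None)
print(f"daughter ~ {a:.2f} + {b1:.3f}*mother + {b2:.3f}*father")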


Non linear regression

 The difference between linear and


nonlinear regression models isn’t
as straightforward as it sounds.
 You’d think that linear equations produce
straight lines and nonlinear equations model
curvature.
 Unfortunately, that’s not correct
 Both types of models can fit curves to your
data—so that’s not the defining characteristic
Non linear regression
 Linear model: y = a + bx
 Multiple linear regression: y = a + bx1 + cx2
 Polynomial: y = a + bx + cx²
 In each case, if you take the derivative with respect to any parameter, the result does not depend on the parameters:
   y = a + bx: dy/da = 1, dy/db = x
   y = a + bx + cx²: dy/da = 1, dy/db = x, dy/dc = x²
 A regression model is called non-linear if the derivative of the model depends on one or more parameters.
   y = a + b²x: dy/db = 2bx, i.e. the derivative depends on 'b'
 "Non-linear" refers to the parameters, not to the independent variable.
Logistic regression
 Used when the dependent variable is binary
 0/1, T/F, Y/N
 Y= a+ f1(x1)+ …….. +fn(xn)
 f1 is the function being used to transform the
predictor
Logistic regression
 Uses a logistic curve:
   p = e^(a+bx) / (1 + e^(a+bx))
 (dp/db depends on b, thus logistic regression is a non-linear regression)
 The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership
 loge(p/(1-p)) = a + bx
 The transformed dependent variable is Y = a + bx
 p is the probability of being in the class
 (1-p) is the probability that it is not
Logistic Regression
 One commonly used algorithm is Logistic Regression
 Assumes that the dependent (output) variable is
binary which is often the case in medical and other
studies. (Does person have disease or not, survive or
not, accepted or not, etc.)
 Like Quadric, Logistic Regression does a particular
non-linear transform on the data after which it just
does linear regression on the transformed data
 Logistic regression fits the data with a
sigmoidal/logistic curve rather than a line and outputs
an approximation of the probability of the output
given the input

Logistic Regression Example
 Age (X axis, input variable) – Data is fictional
 Heart Failure (Y axis, 1 or 0, output variable)
 Could use value of regression line as a probability approximation
 Extrapolates outside 0-1 and not as good empirically
 Sigmoidal curve to the right gives empirically good probability
approximation and is bounded between 0 and 1

Logistic Regression Approach
Learning
1. Transform initial input probabilities into log odds
(logit)
2. Do a standard linear regression on the logit values
 This effectively fits a logistic curve to the data, while still
just doing a linear regression with the transformed input
(ala quadric machine, etc.)
Generalization
1. Find the value for the new input on the logit line
2. Transform that logit value back into a probability
Non-Linear Pre-Process to Logit (Log Odds)

Medication Dosage   # Cured   Total Patients   Probability: # Cured / Total Patients
20                  1         5                .20
30                  2         6                .33
40                  4         6                .67
50                  6         7                .86

[Figures: Cured / Not Cured (0/1) vs. dosage on the left; cure probability vs. dosage on the right.]
Logistic Regression Approach
 Could use linear regression with the probability points, but that would not extrapolate well
 The logistic version is better, but how do we get it?
 Similar to Quadric, we do a non-linear pre-process of the input and then do linear regression on the transformed values – do a linear regression on the log odds (logit)

[Figures: cure probability (0 to 1) vs. dosage.]
Non-Linear Pre-Process to Logit (Log Odds)

Medication   # Cured   Total      Probability:      Odds: p/(1-p) =        Logit = Log Odds:
Dosage                 Patients   # Cured / Total   # cured / # not cured  ln(Odds)
20           1         5          .20               .25                    -1.39
30           2         6          .33               .50                    -0.69
40           4         6          .67               2.0                     0.69
50           6         7          .86               6.0                     1.79

[Figures: Cured / Not Cured vs. dosage on the left; cure probability vs. dosage on the right.]
Regression of Log Odds
(Dosage / cured / odds / logit table as above; the accompanying plots show the fitted logit line and the resulting probability curve.)

• y = .11x – 3.8  – the logit regression equation
• Now we have a regression line for log odds (logit)
• To generalize, we interpolate the log odds value for the new data point
• Then we transform that log odds point to a probability: p = e^logit(x) / (1 + e^logit(x))
• For example, assume we want p for dosage = 10:
  Logit(10) = .11(10) – 3.8 = -2.7
  p(10) = e^-2.7 / (1 + e^-2.7) = .06   [note that we just work backwards from logit to p]
• These p values make up the sigmoidal regression curve (which we never have to actually plot)
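The learning and generalization steps above fit in a few lines of Python; this sketch recomputes the logit line from the dosage table with a plain least-squares fit (so its slope and intercept differ slightly from the slides' rounded .11 and –3.8) and then back-transforms a prediction for dosage = 10.

import math

dosage = [20, 30, 40, 50]
cured = [1, 2, 4, 6]
total = [5, 6, 6, 7]

# 1) Non-linear pre-process: probability -> log odds (logit).
logit = [math.log(c / (t - c)) for c, t in zip(cured, total)]   # -1.39, -0.69, 0.69, 1.79

# 2) Ordinary linear regression on the logit values.
n = len(dosage)
mx, my = sum(dosage) / n, sum(logit) / n
b = sum((x - mx) * (y - my) for x, y in zip(dosage, logit)) / sum((x - mx) ** 2 for x in dosage)
a = my - b * mx
print(f"logit(x) = {b:.3f}x + {a:.2f}")              # roughly .11x - 3.7

# 3) Generalize: interpolate on the logit line, then transform back to a probability.
z = a + b * 10
print(round(math.exp(z) / (1 + math.exp(z)), 2))     # ~0.07 (the slides round to .06)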
Heart rate     50  50  50  50  70  70  90  90  90  90  90
Heart Attack   y   n   n   n   n   y   y   y   n   y   y
