Chapter 4

This chapter covers logistic regression for predicting probabilities in R. It introduces logistic regression and explains how it predicts probabilities rather than continuous values as linear regression does. Logistic regression uses the glm() function with the binomial family to fit classification models where the outcome is binary. The chapter demonstrates fitting a logistic regression model to predict the probability of having Duchenne Muscular Dystrophy from CK and H levels, explains how to interpret the model coefficients, and shows how to make predictions on new data that return probabilities rather than log-odds. Evaluation metrics such as pseudo-R² are discussed. The chapter also covers Poisson and quasipoisson regression for predicting counts, and generalized additive models (GAMs) for learning non-linear transformations.


DataCamp: Supervised Learning in R: Regression

Logistic regression to predict probabilities

Nina Zumel and John Mount


Win-Vector LLC

Predicting Probabilities
Predicting whether an event occurs (yes/no): classification
Predicting the probability that an event occurs: regression
Linear regression: predicts values in [−∞, ∞]
Probabilities: limited to [0,1] interval
So predicting probabilities is a non-linear problem

Example: Predicting Duchenne Muscular Dystrophy (DMD)

outcome: has_dmd

inputs: CK, H

A Linear Regression Model

> model <- lm(has_dmd ~ CK + H, data = train)
> test$pred <- predict(model, newdata = test)

outcome: has_dmd ∈ {0, 1}
0: FALSE
1: TRUE

The model predicts values outside the range [0, 1]

Logistic Regression
log(p / (1 − p)) = β0 + β1 x1 + β2 x2 + ...

glm(formula, data, family = binomial)


Generalized linear model
Assumes inputs additive, linear in log-odds: log(p/(1 − p))
family: describes error distribution of the model
logistic regression: family = binomial
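
For concreteness, here is a minimal, self-contained sketch of the pattern (the data frame df and variables y, x1, x2 are made up for illustration; they are not from the course data):

> # simulate a small data set with a binary 0/1 outcome
> df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
> df$y <- as.numeric(df$x1 + df$x2 + rnorm(100) > 0)
> # fit a logistic regression: binomial family, logit link by default
> model <- glm(y ~ x1 + x2, data = df, family = binomial)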

DMD model
> model <- glm(has_dmd ~ CK + H, data = train, family = binomial)
outcome: two classes, e.g. a and b
model returns Prob(b), the probability of the second class
Recommended: encode the outcome as 0/1 or FALSE/TRUE

Interpreting Logistic Regression Models


> model

## Call:  glm(formula = has_dmd ~ CK + H, family = binomial, data = train)
##
## Coefficients:
## (Intercept)           CK            H
##   -16.22046      0.07128      0.12552
##
## Degrees of Freedom: 86 Total (i.e. Null);  84 Residual
## Null Deviance:     110.8
## Residual Deviance: 45.16    AIC: 51.16

Predicting with a glm() model


predict(model, newdata, type = "response")

newdata: by default, the training data
By default, predict() returns the log-odds
To get probabilities, use type = "response"
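
As a sanity check (assuming the DMD model and test frame from these slides), the default log-odds output maps to probabilities via the logistic function plogis(), and should match type = "response":

> log_odds <- predict(model, newdata = test)                  # default: log-odds
> probs <- predict(model, newdata = test, type = "response")  # probabilities
> all.equal(plogis(log_odds), probs)                          # plogis(x) = 1/(1 + exp(-x))
## [1] TRUE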



DMD Model
> model <- glm(has_dmd ~ CK + H, data = train, family = binomial)
> test$pred <- predict(model, newdata = test, type = "response")

Evaluating a logistic regression model: pseudo-R²

R² = 1 − RSS / SS_Tot

pseudo-R² = 1 − deviance / null.deviance

Deviance: analogous to the residual sum of squares (RSS)
Null deviance: analogous to SS_Tot
pseudo-R²: the fraction of deviance explained
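
Because a glm object stores both quantities, the training pseudo-R² can also be read directly off the model components (a minimal sketch, equivalent to the broom::glance() approach on the next slide):

> # deviance and null.deviance are components of every fitted glm object
> 1 - model$deviance / model$null.deviance
## [1] 0.5922402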

Pseudo-R² on Training data

Using broom::glance()

> glance(model) %>%
+   summarize(pR2 = 1 - deviance/null.deviance)

##         pR2
## 1 0.5922402

Using sigr::wrapChiSqTest()

> wrapChiSqTest(model)

## "... pseudo-R2=0.59 ..."



Pseudo-R² on Test data
# Test data
> test %>%
+ mutate(pred = predict(model, newdata = test, type = "response")) %>%
+ wrapChiSqTest("pred", "has_dmd", TRUE)

Arguments:

data frame
prediction column name
outcome column name
target value (target event)
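
For intuition, the same test pseudo-R² can be computed by hand from the binomial deviance formula, deviance = −2 Σ [y log(p) + (1 − y) log(1 − p)] (a sketch assuming the pred column created above and a 0/1 has_dmd outcome; the null model predicts the base rate):

> dev <- with(test, -2 * sum(has_dmd * log(pred) + (1 - has_dmd) * log(1 - pred)))
> p0 <- mean(test$has_dmd)   # base rate of the target class
> null_dev <- with(test, -2 * sum(has_dmd * log(p0) + (1 - has_dmd) * log(1 - p0)))
> 1 - dev / null_dev         # test pseudo-R²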

The Gain Curve Plot


> GainCurvePlot(test, "pred", "has_dmd", "DMD model on test")

Let's practice!

Poisson and quasipoisson regression to predict counts

Predicting Counts
Linear regression: predicts values in [−∞, ∞]
Counts: integers in range [0, ∞]

Poisson/Quasipoisson Regression
glm(formula, data, family)
family: either poisson or quasipoisson

inputs additive and linear in log(count)


outcome: integer
counts: e.g. the number of traffic tickets a driver gets
rates: e.g. the number of website hits per day
prediction: expected rate or intensity (not necessarily an integer)
e.g. expected number of traffic tickets; expected hits per day

Poisson vs. Quasipoisson


Poisson regression assumes that mean(y) = var(y)
If var(y) is much different from mean(y), use quasipoisson
Generally requires a large sample size
If rates/counts >> 0, regular linear regression is fine

Example: Predicting Bike Rentals



Fit the model


> bikesJan %>%
+ summarize(mean = mean(cnt), var = var(cnt))

## mean var
## 1 130.5587 14351.25

Since var(cnt) >> mean(cnt) → use quasipoisson

> fmla <- cnt ~ hr + holiday + workingday +
+   weathersit + temp + atemp + hum + windspeed
> model <- glm(fmla, data = bikesJan, family = quasipoisson)



Check model fit

pseudo-R² = 1 − deviance / null.deviance

> glance(model) %>%
+   summarize(pseudoR2 = 1 - deviance/null.deviance)

## pseudoR2
## 1 0.7654358

Predicting from the model


> bikesFeb$pred <- predict(model, newdata = bikesFeb, type = "response")

Evaluate the model

You can evaluate count models by RMSE

> bikesFeb %>%
+   mutate(residual = pred - cnt) %>%
+   summarize(rmse = sqrt(mean(residual^2)))

## rmse
## 1 69.32869

> sd(bikesFeb$cnt)
[1] 134.2865

The RMSE (about 69) is roughly half the standard deviation of the outcome (about 134), so the model predicts substantially better than simply guessing the mean count.

Compare Predictions and Actual Outcomes
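
This comparison was shown as a scatterplot in the original slides; here is a minimal sketch to reproduce it (assuming ggplot2 and the pred column added above):

> library(ggplot2)
> ggplot(bikesFeb, aes(x = pred, y = cnt)) +
+   geom_point() +                    # predictions vs. actual counts
+   geom_abline(color = "darkblue")   # the line pred = cnt (perfect prediction)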



Let's practice!

GAM to learn non-linear transformations

Generalized Additive Models (GAMs)

y ~ b0 + s1(x1) + s2(x2) + ...



Learning Non-linear Relationships



gam() in the mgcv package


gam(formula, family, data)

family:
gaussian (default): "regular" regression
binomial: probabilities
poisson/quasipoisson: counts

Best for larger data sets



The s() function


> anx ~ s(hassles)
s() designates that the variable's effect should be modeled non-linearly

Use s() with continuous variables

More than about 10 unique values
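
Continuous and categorical inputs can be mixed in one formula, with only the continuous variable wrapped in s(). A sketch of the pattern (the variable gender is hypothetical, added only for illustration):

> # non-linear effect for hassles, ordinary linear term for gender
> fmla <- anx ~ s(hassles) + gender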



Revisit the hassles data

Model                  RMSE (cross-val)   R² (training)
Linear (hassles)       7.69               0.53
Quadratic (hassles²)   6.89               0.63
Cubic (hassles³)       6.70               0.65

GAM of the hassles data


> model <- gam(anx ~ s(hassles), data = hassleframe, family = gaussian)

> summary(model)

## ...
##
## R-sq.(adj) = 0.619 Deviance explained = 64.1%
## GCV = 49.132 Scale est. = 45.153 n = 40

Examining the Transformations


> plot(model)

y values: predict(model, type = "terms")
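
type = "terms" returns a matrix with one column per model term, i.e. the learned transformation s(hassles) evaluated at each training point (a minimal sketch, assuming the model above):

> sx <- predict(model, type = "terms")   # matrix: one column per model term
> head(sx)                               # values of s(hassles) for the first rows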



Predicting with the Model


> predict(model, newdata = hassleframe, type = "response")

Comparing out-of-sample performance

Knowing the correct transformation is best, but GAM is useful when the transformation isn't known

Model                  RMSE (cross-val)   R² (training)
Linear (hassles)       7.69               0.53
Quadratic (hassles²)   6.89               0.63
Cubic (hassles³)       6.70               0.65
GAM                    7.06               0.64

Small data set → noisier GAM



Let's practice!
