Chapter 4

This chapter covers logistic regression for predicting probabilities in R. It introduces logistic regression and explains how it predicts probabilities rather than continuous values as linear regression does. Logistic regression uses the glm() function with the binomial family to fit classification models where the outcome is binary. The chapter demonstrates fitting a logistic regression model to predict the probability of having Duchenne Muscular Dystrophy from CK and H levels, explains how to interpret the model coefficients, and shows how to make predictions on new data that return probabilities rather than log-odds. Evaluation metrics such as pseudo-R² are discussed. The chapter also covers Poisson and quasipoisson regression for predicting counts, and generalized additive models (GAMs) for learning non-linear transformations.


DataCamp: Supervised Learning in R: Regression

Logistic regression to predict probabilities

Nina Zumel and John Mount


Win-Vector LLC

Predicting Probabilities
Predicting whether an event occurs (yes/no): classification
Predicting the probability that an event occurs: regression
Linear regression: predicts values in [−∞, ∞]
Probabilities: limited to [0,1] interval
So predicting probabilities is a non-linear problem

Example: Predicting Duchenne Muscular Dystrophy (DMD)

outcome: has_dmd

inputs: CK, H

A Linear Regression Model

> model <- lm(has_dmd ~ CK + H, data = train)
> test$pred <- predict(model, newdata = test)

outcome: has_dmd ∈ {0, 1}
0: FALSE
1: TRUE

The model predicts values outside the range [0, 1]

Logistic Regression
log(p / (1 − p)) = β0 + β1 x1 + β2 x2 + ...

glm(formula, data, family = binomial)


Generalized linear model
Assumes inputs additive, linear in log-odds: log(p/(1 − p))
family: describes error distribution of the model
logistic regression: family = binomial
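
For concreteness, here is a minimal, self-contained sketch of the pattern (the data frame df and variables y, x1, x2 are made up for illustration; they are not from the course data):

> # simulate a small data set with a binary 0/1 outcome
> df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
> df$y <- as.numeric(df$x1 + df$x2 + rnorm(100) > 0)
> # fit a logistic regression: binomial family, logit link by default
> model <- glm(y ~ x1 + x2, data = df, family = binomial)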

DMD model
> model <- glm(has_dmd ~ CK + H, data = train, family = binomial)
outcome: two classes, e.g. a and b
model returns Prob(b), the probability of the second class
Recommended: encode the outcome as 0/1 or FALSE/TRUE

Interpreting Logistic Regression Models


> model

## Call:  glm(formula = has_dmd ~ CK + H, family = binomial, data = train)
##
## Coefficients:
## (Intercept)           CK            H
##   -16.22046      0.07128      0.12552
##
## Degrees of Freedom: 86 Total (i.e. Null);  84 Residual
## Null Deviance:     110.8
## Residual Deviance: 45.16    AIC: 51.16

Predicting with a glm() model


predict(model, newdata, type = "response")

newdata: by default, the training data
By default, predict() returns the log-odds
To get probabilities, use type = "response"
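
As a sanity check (assuming the DMD model and test frame from these slides), the default log-odds output maps to probabilities via the logistic function plogis(), and should match type = "response":

> log_odds <- predict(model, newdata = test)                  # default: log-odds
> probs <- predict(model, newdata = test, type = "response")  # probabilities
> all.equal(plogis(log_odds), probs)                          # plogis(x) = 1/(1 + exp(-x))
## [1] TRUE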



DMD Model
> model <- glm(has_dmd ~ CK + H, data = train, family = binomial)
> test$pred <- predict(model, newdata = test, type = "response")

Evaluating a logistic regression model: pseudo-R²

R² = 1 − RSS / SS_Tot

pseudo-R² = 1 − deviance / null.deviance

Deviance: analogous to the residual sum of squares (RSS)
Null deviance: analogous to SS_Tot
pseudo-R²: the fraction of deviance explained
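
Because a glm object stores both quantities, the training pseudo-R² can also be read directly off the model components (a minimal sketch, equivalent to the broom::glance() approach on the next slide):

> # deviance and null.deviance are components of every fitted glm object
> 1 - model$deviance / model$null.deviance
## [1] 0.5922402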

Pseudo-R² on Training data

Using broom::glance()

> glance(model) %>%
+   summarize(pR2 = 1 - deviance/null.deviance)

##         pR2
## 1 0.5922402

Using sigr::wrapChiSqTest()

> wrapChiSqTest(model)

## "... pseudo-R2=0.59 ..."



Pseudo-R² on Test data
# Test data
> test %>%
+ mutate(pred = predict(model, newdata = test, type = "response")) %>%
+ wrapChiSqTest("pred", "has_dmd", TRUE)

Arguments:

data frame
prediction column name
outcome column name
target value (target event)
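
For intuition, the same test pseudo-R² can be computed by hand from the binomial deviance formula, deviance = −2 Σ [y log(p) + (1 − y) log(1 − p)] (a sketch assuming the pred column created above and a 0/1 has_dmd outcome; the null model predicts the base rate):

> dev <- with(test, -2 * sum(has_dmd * log(pred) + (1 - has_dmd) * log(1 - pred)))
> p0 <- mean(test$has_dmd)   # base rate of the target class
> null_dev <- with(test, -2 * sum(has_dmd * log(p0) + (1 - has_dmd) * log(1 - p0)))
> 1 - dev / null_dev         # test pseudo-R²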

The Gain Curve Plot


> GainCurvePlot(test, "pred", "has_dmd", "DMD model on test")

Let's practice!

Poisson and quasipoisson regression to predict counts

Predicting Counts
Linear regression: predicts values in [−∞, ∞]
Counts: integers in range [0, ∞]

Poisson/Quasipoisson Regression
glm(formula, data, family)
family: either poisson or quasipoisson

inputs additive and linear in log(count)


outcome: integer
counts: e.g. the number of traffic tickets a driver gets
rates: e.g. the number of website hits per day
prediction: expected rate or intensity (not necessarily an integer)
e.g. expected number of traffic tickets; expected hits per day

Poisson vs. Quasipoisson


Poisson regression assumes that mean(y) = var(y)
If var(y) is much different from mean(y), use quasipoisson
Generally requires a large sample size
If rates/counts >> 0, regular linear regression is fine

Example: Predicting Bike Rentals



Fit the model


> bikesJan %>%
+ summarize(mean = mean(cnt), var = var(cnt))

## mean var
## 1 130.5587 14351.25

Since var(cnt) >> mean(cnt) → use quasipoisson

> fmla <- cnt ~ hr + holiday + workingday +
+   weathersit + temp + atemp + hum + windspeed
> model <- glm(fmla, data = bikesJan, family = quasipoisson)



Check model fit

pseudo-R² = 1 − deviance / null.deviance

> glance(model) %>%
+   summarize(pseudoR2 = 1 - deviance/null.deviance)

## pseudoR2
## 1 0.7654358

Predicting from the model


> bikesFeb$pred <- predict(model, newdata = bikesFeb, type = "response")

Evaluate the model

You can evaluate count models by RMSE

> bikesFeb %>%
+   mutate(residual = pred - cnt) %>%
+   summarize(rmse = sqrt(mean(residual^2)))

## rmse
## 1 69.32869

> sd(bikesFeb$cnt)
[1] 134.2865

The RMSE (about 69) is roughly half the standard deviation of the outcome (about 134), so the model predicts substantially better than simply guessing the mean count.

Compare Predictions and Actual Outcomes
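
This comparison was shown as a scatterplot in the original slides; here is a minimal sketch to reproduce it (assuming ggplot2 and the pred column added above):

> library(ggplot2)
> ggplot(bikesFeb, aes(x = pred, y = cnt)) +
+   geom_point() +                    # predictions vs. actual counts
+   geom_abline(color = "darkblue")   # the line pred = cnt (perfect prediction)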



Let's practice!

GAM to learn non-linear transformations

Generalized Additive Models (GAMs)

y ~ b0 + s1(x1) + s2(x2) + ...



Learning Non-linear Relationships



gam() in the mgcv package


gam(formula, family, data)

family:
gaussian (default): "regular" regression
binomial: probabilities
poisson/quasipoisson: counts

Best for larger data sets



The s() function


> anx ~ s(hassles)
s() designates that the variable's effect should be modeled non-linearly

Use s() with continuous variables

More than about 10 unique values
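
Continuous and categorical inputs can be mixed in one formula, with only the continuous variable wrapped in s(). A sketch of the pattern (the variable gender is hypothetical, added only for illustration):

> # non-linear effect for hassles, ordinary linear term for gender
> fmla <- anx ~ s(hassles) + gender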



Revisit the hassles data

Model                  RMSE (cross-val)   R² (training)
Linear (hassles)       7.69               0.53
Quadratic (hassles²)   6.89               0.63
Cubic (hassles³)       6.70               0.65

GAM of the hassles data


> model <- gam(anx ~ s(hassles), data = hassleframe, family = gaussian)

> summary(model)

## ...
##
## R-sq.(adj) = 0.619 Deviance explained = 64.1%
## GCV = 49.132 Scale est. = 45.153 n = 40

Examining the Transformations


> plot(model)

y values: predict(model, type = "terms")
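
type = "terms" returns a matrix with one column per model term, i.e. the learned transformation s(hassles) evaluated at each training point (a minimal sketch, assuming the model above):

> sx <- predict(model, type = "terms")   # matrix: one column per model term
> head(sx)                               # values of s(hassles) for the first rows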



Predicting with the Model


> predict(model, newdata = hassleframe, type = "response")

Comparing out-of-sample performance

Knowing the correct transformation is best, but GAM is useful when the transformation isn't known

Model                  RMSE (cross-val)   R² (training)
Linear (hassles)       7.69               0.53
Quadratic (hassles²)   6.89               0.63
Cubic (hassles³)       6.70               0.65
GAM                    7.06               0.64

Small data set → noisier GAM



Let's practice!
