Supervised Learning With R

This document provides an introduction to linear regression. It explains that regression is used to predict a numerical outcome from inputs. Linear regression fits a linear model to predict the outcome based on the inputs. It demonstrates how to fit a linear regression model in R using the lm() function and formula syntax. The document also discusses how to make predictions on new data from a fitted model and examine the coefficients and performance of the model. It notes some pros and cons of linear regression, including that it can only model linear relationships and may be sensitive to collinearity between inputs.


Welcome and Introduction
SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount


Data Scientists, Win Vector LLC
What is Regression?
Regression: Predict a numerical outcome ("dependent variable")
from a set of inputs ("independent variables").

Statistical Sense: Predicting the expected value of the outcome.

Casual Sense: Predicting a numerical outcome, rather than a discrete one.



What is Regression?
How many units will we sell? (Regression)

Will this customer buy our product (yes/no)? (Classification)

What price will the customer pay for our product? (Regression)



Example: Predict Temperature from Chirp Rate

(Slides show scatterplots of cricket chirps per second vs. temperature.)


Regression from a Machine Learning Perspective
Scientific mindset: Modeling to understand the data generation process

Engineering mindset: Modeling to predict accurately

Machine Learning: Engineering mindset



Let's practice!
Linear regression - the fundamental method

Nina Zumel and John Mount


Win-Vector LLC
Linear Regression
y = β0 + β1 x1 + β2 x2 + ...

y is linearly related to each xi


Each xi contributes additively to y
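As a quick illustration (simulated data, with coefficients made up for the example), lm() recovers the βs from data generated by exactly this kind of additive model:

```r
# Simulate y = 2 + 3*x1 - 1.5*x2 + noise, then recover the
# coefficients with lm(). The true betas are chosen for illustration.
set.seed(42)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates close to the true values 2, 3, -1.5
```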



Linear Regression in R: lm()
cmodel <- lm(temperature ~ chirps_per_sec, data = cricket)

formula: temperature ~ chirps_per_sec

data frame: cricket



Formulas
fmla_1 <- temperature ~ chirps_per_sec
fmla_2 <- blood_pressure ~ age + weight

LHS: outcome

RHS: inputs
use + for multiple inputs

fmla_1 <- as.formula("temperature ~ chirps_per_sec")
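Because as.formula() accepts a string, formulas can also be assembled programmatically from column names (a small sketch, reusing the variable names from the slides):

```r
outcome <- "blood_pressure"
inputs  <- c("age", "weight")

# paste the pieces into a string, then convert it to a formula
fmla <- as.formula(paste(outcome, "~", paste(inputs, collapse = " + ")))
fmla  # blood_pressure ~ age + weight
```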



Looking at the Model
y = β0 + β1 x1 + β2 x2 + ...

cmodel

Call:
lm(formula = temperature ~ chirps_per_sec, data = cricket)

Coefficients:
(Intercept) chirps_per_sec
25.232 3.291
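coef() extracts those coefficients as a named numeric vector. A sketch using a stand-in cricket data frame (values from the classic Pierce cricket study; treat the exact numbers as illustrative):

```r
cricket <- data.frame(
  chirps_per_sec = c(20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
                     15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.4),
  temperature    = c(88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
                     69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3)
)

cmodel <- lm(temperature ~ chirps_per_sec, data = cricket)
coef(cmodel)                   # named vector: (Intercept), chirps_per_sec
coef(cmodel)["chirps_per_sec"] # pull out the slope alone
```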



More Information about the Model
summary(cmodel)

Call:
lm(formula = fmla, data = cricket)

Residuals:
Min 1Q Median 3Q Max
-6.515 -1.971 0.490 2.807 5.001

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.2323 10.0601 2.508 0.026183 *
chirps_per_sec 3.2911 0.6012 5.475 0.000107 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.829 on 13 degrees of freedom


Multiple R-squared: 0.6975, Adjusted R-squared: 0.6742
F-statistic: 29.97 on 1 and 13 DF, p-value: 0.0001067



More Information about the Model
broom::glance(cmodel)

sigr::wrapFTest(cmodel)
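For comparison, the same statistics can be pulled out of summary() in base R; broom::glance() simply collects them into a one-row data frame, which is easier to program with (the built-in cars dataset is used as a stand-in here):

```r
fit <- lm(dist ~ speed, data = cars)  # cars is a built-in R dataset
s   <- summary(fit)

s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$sigma          # residual standard error

# broom::glance(fit) returns these (plus the F statistic and its
# p-value) as a single one-row data frame.
```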



Let's practice!
Predicting once you fit a model

Nina Zumel and John Mount


Win-Vector LLC
Predicting From the Training Data
cricket$prediction <- predict(cmodel)

predict() by default returns training data predictions
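That default is worth verifying: calling predict() with no newdata is equivalent to predicting on the data the model was fit to (sketched here with the built-in cars dataset):

```r
fit <- lm(dist ~ speed, data = cars)

p_default  <- predict(fit)                 # no newdata: training predictions
p_explicit <- predict(fit, newdata = cars) # same thing, stated explicitly

all.equal(p_default, p_explicit)  # TRUE
```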



Looking at the Predictions
ggplot(cricket, aes(x = prediction, y = temperature)) +
  geom_point() +
  geom_abline(color = "darkblue") +
  ggtitle("temperature vs. linear model prediction")



Predicting on New Data
newchirps <- data.frame(chirps_per_sec = 16.5)
newchirps$prediction <- predict(cmodel, newdata = newchirps)
newchirps
  chirps_per_sec prediction
1           16.5   79.53537



Let's practice!
Wrapping up linear regression

Nina Zumel and John Mount


Win-Vector LLC
Pros and Cons of Linear Regression
Pros
Easy to fit and to apply

Concise

Less prone to overfitting

Interpretable

Call:
lm(formula = blood_pressure ~ age + weight, data = bloodpressure)

Coefficients:
(Intercept) age weight
30.9941 0.8614 0.3349

Cons
Can only express linear and additive relationships
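A small simulated sketch of that limitation: a straight line cannot capture a quadratic relationship, though transforming the input inside the formula (with I()) can work around it:

```r
set.seed(7)
x <- runif(100, min = -2, max = 2)
y <- x^2 + rnorm(100, sd = 0.1)   # quadratic, not linear, in x

linear_fit <- lm(y ~ x)           # straight line: misses the curve
quad_fit   <- lm(y ~ I(x^2))      # transformed input: captures it

summary(linear_fit)$r.squared     # near 0
summary(quad_fit)$r.squared       # near 1
```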



Collinearity
Collinearity -- when input variables are partially correlated.

Coefficients might change sign

High collinearity:
Coefficients (or standard errors) look too large

Model may be unstable

Call:
lm(formula = blood_pressure ~ age + weight, data = bloodpressure)

Coefficients:
(Intercept) age weight
30.9941 0.8614 0.3349
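A simulated sketch of that instability (made-up data): when two inputs are nearly identical, their individual coefficients become poorly determined, with inflated standard errors, even though the model's predictions can remain reasonable:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
summary(fit)$coefficients[, "Std. Error"]
# standard errors on x1 and x2 are enormous compared to those
# from a model using x1 alone: lm(y ~ x1)
```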



Coming Next
Evaluating a regression model

Properly training a model



Let's practice!
