Linear
Regression
This course content is being actively
developed by Delta Analytics, a 501(c)3
Bay Area nonprofit that aims to
empower communities to leverage their
data for good.
Please reach out with any questions or
feedback to inquiry@deltanalytics.org.
Find out more about our mission here.
Delta Analytics builds technical capacity
around the world.
Module 3:
Linear Regression
Let’s do a quick
review of module 2!
How does a model learn from raw data?
Task
What is the problem we want our
model to solve?
Performance
Measure
Quantitative measure we use to
evaluate the model’s performance.
Learning
Methodology
ML algorithms can be supervised or
unsupervised. This determines the
learning methodology.
All models have the following components:
Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
Classification task:
Exercise 1
Current Annual Income ($)   Outstanding Debt ($)   Approval for credit card (Y)   Predicted approval (Y*)
12,000                      200                    Yes                            No
60,000                      60,000                 No                             No
11,000                      0                      No                             Yes
200,000                     10,000                 Yes                            Yes
What are the explanatory features? What
is the outcome feature?
Task
Performance
Measure
Learning
Experience
Let’s fill in the blanks for our credit approval
example:
Task
Should this customer be approved
for a credit card?
Performance
Measure
Log Loss
Learning
Experience
Supervised
Let’s fill in the blanks for our credit approval
example:
Regression task:
Exercise 2
Time spent a week studying machine learning (X)   Accuracy of classification model built by student (Y)   Predicted accuracy of classification model (Y*)
10                                                30%                                                      90%
2                                                 60%                                                      30%
12                                                26%                                                      95%
0                                                 88%                                                      50%
What are the explanatory features?
What is the outcome feature?
Task
Performance
Measure
Learning
Experience
Let’s fill in the blanks for our studying example:
Task
How accurate is this student’s
classification model?
Performance
Measure
MSE
Learning
Experience
Supervised
Let’s fill in the blanks for our studying example:
Module 3:
Linear Regression
Course overview:
Now let’s turn to the data we will be using...
✓ Module 1: Introduction to Machine Learning
✓ Module 2: Machine Learning Deep Dive
✓ Module 3: Linear Regression
❏ Module 4: Decision Trees
❏ Module 5: Ensemble Algorithms
❏ Module 6: Unsupervised Learning Algorithms
❏ Module 7: Natural Language Processing Part 1
❏ Module 8: Natural Language Processing Part 2
❏ Linear regression
❏ Relationship between two variables (x and y)
❏ Formalizing f(x)
❏ Correlation between two variables
❏ Assumptions
❏ Feature engineering and selection
❏ Learning process: Loss function and Mean Squared Error
❏ Univariate regression, Multivariate regression
❏ Measures of performance (R2, Adjusted R2, MSE)
❏ Overfitting, Underfitting
Module Checklist
What is linear regression?
Linear regression is a model that explains the relationship
between explanatory features and an outcome feature as a
line in two dimensional space.
Linear regression has been in use since
the 19th century & is one of the most
important machine learning tools
available to researchers.
Many other models are built upon the
logic of linear models. For example, the
simplest form of deep learning model,
one with no hidden layers, is a linear model!
Why is linear regression important?
Source: History of Linear Regression - Introduction to Linear Regression
Modeling
Phase
Performance
Learning
Methodology
Task
Linear regression: model cheat sheet
● Linear relationship
between x and y
● Normal distribution of
variables
● No multicollinearity
(Independent variables)
● Homoscedasticity
● Rule of thumb: at least 20
observations per
independent variable in
the analysis
● Sensitive to outliers
● The world is not always
linear; we often want to
model more complex
relationships
● Does not allow us to
model interactions
between explanatory
features (we will be
able to do this using a
decision tree)
● Very popular model
with intuitive, easy to
understand results.
● Natural extension of
correlation analysis
Pros Cons Assumptions*
* We will explain each of these assumptions in this module!
Task
What is the problem we want our
model to solve?
Defining f(x) What is f(x) for a linear model?
Feature
engineering
& selection
What is x? How do we decide what
explanatory features to include in
our model?
Today we are looking closer at each component of the framework we
discussed in the last class
What assumptions does a linear
regression make about the data? Do
we have to transform the data?
Is our f(x)
correct for
this problem?
Learning
Methodology
Linear models are supervised; how
does that affect the learning
process?
What is our
loss function?
Every supervised model has a loss
function it wants to minimize.
Optimization
process
How does the model minimize the
loss function?
Learning methodology is how the linear regression model learns which line
best fits the raw data.
How does
our ML
model learn?
Overview of how the model
teaches itself.
Performance
Quantitative measure we use to
evaluate the model’s performance.
Measures of
performance
R2, Adjusted
R2, MSE
Feature
Performance
Statistical
significance, p
values
There are linear models for both regression and classification problems.
Overfitting,
underfitting,
bias, variance
Ability to
generalize to
unseen data
The task
Task
What is the problem we want our
model to solve?
Defining f(x) What is f(x) for a linear model?
Feature
engineering
& selection
What is x? How do we decide what
explanatory features to include in
our model?
What assumptions does a linear
regression make about the data? Do
we have to transform the data?
Is our f(x)
correct for
this problem?
Regression
Classification
Continuous variable
Categorical variable
Recall our discussion of two types of supervised tasks,
regression & classification. There are linear models for
both. Today we will only discuss regression.
Task
Ordinary Least Squares
(OLS) regression
Logistic regression
We will not cover this in the
lecture, but we will provide
resources for further study
OLS is a linear regression method
based on minimizing the sum of
squared residuals
We have plotted the
total $ loaned by
KIVA in Kenya each
year. What is the
trend? How would
you summarize this
trend in one
sentence?
OLS
Regression
Task
An OLS regression is a trend line that predicts how much
Y will change for a given change in x. Let’s try to
break that down a little more by looking at an example.
“Every year, the value
of loans in KE appears
to be increasing”
A linear regression formalizes our perceived
trend as a relationship between x and Y. It
can be understood intuitively as a trend line.
Every additional year
corresponds to a fixed
number of additional
dollars of Kiva loans in KE.
Human Intuition OLS Regression
A linear model allows us
to predict beyond our
set of observations
because for every point
on the x axis we can find
the corresponding Y*
A linear model expresses the relationship between our
explanatory features and our outcome features as a
straight line. The output of f(x) for a linear model will
always be a line.
x
Y*
For example, we can now say
for an x not in our
scatterplot (May, 2012) what
we predict the $ amount of
loans is.
Defining f(x)
Is linear regression
the right model for
our data?
A big part of being a machine learning researcher involves choosing the right
model for the task.
Each model makes certain assumptions about the underlying data.
Let’s take a closer look at how this applies to linear regression.
OLS
Regression
Task
Is our f(x)
correct for
this problem?
OLS
Regression
Task
Before we choose a linear model, we need
to make sure all the assumptions hold true
in our data.
Is our f(x)
correct for
this problem?
❏ Linear relationship between x and Y
❏ Normal distribution of variables
❏ No autocorrelation (independent observations)
❏ Homoscedasticity
❏ No multicollinearity
❏ Rule of thumb: at least 20 observations per independent variable in the analysis
OLS Linear Regression Assumptions
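To make these checks concrete, here is a minimal sketch of how you might eyeball the first two assumptions in Python. The dataframe and column names (year, loan_amount) are hypothetical placeholders, not the actual course dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one explanatory feature and one outcome feature
df = pd.DataFrame({"year": [2008, 2009, 2010, 2011, 2012],
                   "loan_amount": [1200, 1900, 2600, 3300, 4100]})

# Linearity: does a scatter of x vs. Y look roughly like a line?
df.plot.scatter(x="year", y="loan_amount")

# Normality: does each variable look roughly bell-shaped?
df["loan_amount"].plot.hist()
plt.show()
```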
OLS
Regression
Task
Is there a linear relationship
between x and Y?
Is our f(x)
correct for
this problem?
Source: Linear and non-linear relationships
Linear regression assumes that there is a linear
relationship between x and Y.
If this is not true, our trend line will do a poor
job of predicting Y.
Source: Statistics - Normal distribution
OLS
Regression
Task
Is our data normally distributed?
Is our f(x)
correct for
this problem?
Normal distribution of explanatory and
outcome features avoids distortion of
results due to outliers or skewed data.
OLS
Regression
Task
Multicollinearity occurs when
explanatory variables are highly
correlated. Do we have multicollinearity?
Is our f(x)
correct for
this problem?
Multicollinearity introduces redundancy to the model and reduces our certainty in the results.
We want no multicollinearity in our model.
Province
Country
Province and country are highly
correlated. We will want to
include only one of these
variables in our model.
Age
Loan
Amount
Age and loan amount appear to
have no multicollinearity. We can
include both in our model.
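As a rough sketch, one common way to screen for multicollinearity is a pairwise correlation matrix of the explanatory features. The feature names and values below are made up for illustration.

```python
import pandas as pd

features = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "loan_amount": [500, 800, 300, 900, 400],
})

# Pairwise correlations near +/-1 flag potentially redundant features
print(features.corr())
```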
OLS
Regression
Task
Do we have homoscedasticity?
Is our f(x)
correct for
this problem?
The error term, or “noise,” is the same across all
values of the outcome variable. If
homoscedasticity does not hold, cases
with a greater error term will have
outsized influence on the regression.
Data must be homoscedastic, meaning the
error variance is even across all
outcome variable values.
Good Not good Not good
Source: Homoscedasticity
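A quick visual check is a residual plot: plot the residuals (Y - Y*) against the fitted values and look for a band of roughly constant width; a funnel shape suggests heteroscedasticity. A minimal sketch, assuming you already have predictions y_star from a fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical true outcomes and model predictions
y = np.array([2.0, 3.1, 4.2, 4.9, 6.1])
y_star = np.array([2.1, 3.0, 4.0, 5.0, 6.0])

residuals = y - y_star
plt.scatter(y_star, residuals)
plt.axhline(0)  # residuals should scatter evenly around this line
plt.xlabel("fitted values (Y*)")
plt.ylabel("residuals (Y - Y*)")
plt.show()
```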
OLS
Regression
Task
Do we have autocorrelation?
Is our f(x)
correct for
this problem?
Autocorrelation is correlation between
values of a variable and its delayed copy.
For example, a stock’s price today is
correlated with yesterday’s price.
Autocorrelation commonly occurs when
you work with time series.
OLS
Regression
Task
In the coding lab, we will go over
code that will help you determine
whether a linear model provides the
best f(x).
Is our f(x)
correct for
this problem?
✓ Linear relationship
between x and Y
✓ Normal distribution of
variables
✓ No autocorrelation
(independent observations)
✓ Homoscedasticity
✓ Rule of thumb: at least 20
observations per
independent variable in
the analysis
OLS Assumptions
We have signed off on all of our
assumptions, which means we can
confidently choose a linear OLS
model for this task.
Let’s start building our model!
OLS
Regression
Task
Question! What happens if you don’t
have all the assumptions?
Is our f(x)
correct for
this problem?
✓ Linear relationship
between x and Y
✓ Normal distribution of
variables
✓ No autocorrelation
(independent observations)
✓ Homoscedasticity
✓ Rule of thumb: at least 20
observations per
independent variable in
the analysis
OLS Assumptions If these assumptions do not hold true, our trend line will
not be accurate. We have a few options:
1) Transform our data to fulfill the assumptions
2) Choose a different model to capture the
relationship between x and Y
Yes! Linear
regression is an
appropriate model
choice. Now what?
True Y
Y*=f(x)+e
OLS
Regression
Task
Defining f(x)
Remember that all models involve
a function f(x) that maps an input x
to a predicted Y (Y*). The goal of
the function is to have a model
that predicts Y* as close to true Y
as possible.
f(x) for a linear model is:
Y*=a + bx + e
What is f(x) for a linear regression
model?
Just like in algebra!
OLS
Regression
Task
Defining f(x)
Y*=a + bx + e
What does Y* = a + bx + e actually mean?
Y*   The predicted Y output of our function
a    The y-intercept of the regression line
b    The slope or gradient of the regression line. This determines how steep the line is and its directionality.
e    The irreducible error term: error our model cannot reduce
a and b are the parameters.
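In code, f(x) for a linear model is just a one-line function. The values chosen for a, b, and x below are arbitrary illustrations:

```python
def f(x, a, b):
    # Y* = a + bx (e is irreducible noise the model cannot predict)
    return a + b * x

print(f(x=3, a=1.0, b=2.0))  # Y* = 1 + 2*3 = 7
```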
How does
our OLS
model learn?
There are an infinite number of possible
lines in a two dimensional space. How
does our model choose the best one?
Learning
Methodology
Y
X
Linear regression is a learning
algorithm. That is what makes it a
machine learning model.
It learns to find the best trend line
from an infinite number of
possibilities. How?
First, we must understand what
levers we control.
Y*=a + bx + e
Parameters        Values that control the behavior of the model and are learnt through experience.
Hyperparameters   Higher level settings of a model that are fixed before training begins.
In each model there are some things we cannot control, like e (irreducible error).
A model may also have hyperparameters, which are set beforehand and not trained using data (no need to think too much about this now).
How does
our OLS
model learn?
Parameters of a model are values that we
control. The parameters in an OLS Model
are a (intercept) and b (slope).
Learning
Methodology
Our model can move a &
b but not e.
How does
our OLS
model learn?
a and b are the only two levers our
simple OLS model can change to get Y*
closer to Y.
Learning
Methodology
Y
X
Y*=a + bx + e
How do I decide in
what direction to
change a and b?
Changing a shifts our line
up or down the y-axis;
changing b alters the
steepness and direction
of our line.
How does
our OLS
model learn?
We can change a and b to move our line
in space. Below is some intuition about
how changing a and b affects f(x).
Learning
Methodology
Y*=a + bx + e
+ b means an
upwards sloping
line
a and b are our
two parameters.
Changing a shifts
our line up or down
the y-axis.
Negative a moves
the y-intercept
below 0.
- b means a
downward
sloping line
b=0 means there
is no relationship
between x and Y
Source: Intro to Stat - Introduction to Linear Regression
How does
our OLS
model learn?
The directionality of b is very
important. It tells us if Y gets smaller
or larger when we increase x.
Learning
Methodology
Y*=a + bx + e
+ b is a positive slope - b is a negative slope
Source: Beginning Algebra - Coordinate System and graphing lines
Test your intuition.
Exercise 1
Learning
Methodology
Y
X
Y*=a + bx + e
What happens if I
increase a by 2?
Exercise 1
Learning
Methodology
Y
X
Y*=a + bx + e
What happens if I
increase a by 2?
Exercise 2
Learning
Methodology
Y
X
Y*=a + bx + e
What happens if I
increase b by 3?
Exercise 2
Learning
Methodology
Y
X
Y*=a + bx + e
What happens if I
increase b by 3?
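If you want to check your answers numerically, here is a small sketch: start from an arbitrary line (a=1, b=2) and watch how Y* changes at a fixed x when we increase a by 2 and then b by 3.

```python
def f(x, a, b):
    return a + b * x

x = 4
print(f(x, a=1, b=2))      # baseline:  1 + 2*4 = 9
print(f(x, a=1 + 2, b=2))  # a + 2 shifts the whole line up by 2 -> 11
print(f(x, a=1, b=2 + 3))  # b + 3 makes the line steeper -> 1 + 5*4 = 21
```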
What parameters give us
the best trend line?
What is our
loss function?
We make a decision about how to change
our parameters based upon our loss
function.
Learning
Methodology
Y*=a + bx + e
Let me try
different values
of a and b to
minimize the
total loss
function.
Recall: Our model starts with a random a and
b, and our job is to change a and b in a way that
moves Y* closer to the true Y.
You are in fact trying to reduce the distance
between Y* and true Y. We measure the
distance using mean squared error. This is our
loss function.
All supervised models have a loss function
(sometimes also known as the cost function)
they must optimize by changing the model
parameters.
What is our
loss function?
For every a and b we try, we measure
the mean squared error. Remember
MSE?
Learning
Methodology
This job isn’t done
until I reduce MSE.
Source: Intro to Stat -Introduction to Linear Regression
Mean squared error is a measure of
how close Y* is to Y.
There are four steps to MSE:
Y-Y* For every point in our dataset,
measure the difference between
true Y and predicted Y.
^2 Square each Y-Y* to get a
positive distance, so positive
errors don’t cancel out negative
ones when we sum.
Sum Sum across all observations so we
get the total error.
mean Divide the sum by the number of
observations we have.
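Those four steps translate directly into a few lines of numpy; the y and y_star arrays here are illustrative placeholders.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # true Y
y_star = np.array([2.5, 5.5, 6.0, 9.5])   # predicted Y*

diff = y - y_star          # step 1: Y - Y* for every point
squared = diff ** 2        # step 2: square so errors don't cancel out
total = squared.sum()      # step 3: sum across all observations
mse = total / len(y)       # step 4: divide by the number of observations

print(mse)  # same as np.mean((y - y_star) ** 2)
```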
What is our
loss function?
The green boxes in the chart below are
the squared errors for each data point. We sum
them and take the mean across all data
points to get the MSE.
Learning
Methodology
Source: Seeing Theory - Regression
Optimization
Process
The process of changing a and b to
reduce MSE is called learning. It is what
makes OLS regression a machine learning
algorithm.
Learning
Methodology
Source: Seeing Theory - Regression
For every combination of a
and b we choose there is an
associated MSE. The learning
process involves updating a
and b in order to reach the
global minimum.
The learning process for OLS
is technically called learning
by gradient descent.
Optimization of
MSE:
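Here is a minimal sketch of that learning loop: start from an arbitrary a and b and repeatedly nudge them in the direction that reduces MSE. The learning rate, step count, and data are arbitrary choices for illustration.

```python
import numpy as np

def fit_by_gradient_descent(x, y, lr=0.01, steps=5000):
    a, b = 0.0, 0.0                       # start from an arbitrary line
    for _ in range(steps):
        error = (a + b * x) - y           # Y* - Y for every point
        a -= lr * 2 * error.mean()        # gradient of MSE with respect to a
        b -= lr * 2 * (error * x).mean()  # gradient of MSE with respect to b
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])        # roughly y = 1 + 2x
print(fit_by_gradient_descent(x, y))      # should approach a ~ 1, b ~ 2
```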
Next, let’s look at model validation.
What is our
loss function?
Optimization
Process
How does
our OLS
model learn?
Learning
Methodology
Is our f(x)
correct for
this problem?
OLS
Regression
Task
Defining f(x)
Model Validation
Validation is a process of evaluating your model performance.
We have to evaluate a model on two critical aspects:
1. How close the model’s Y* is to true Y
2. How well the model performs in the real world (i.e., on unseen data).
Common validation metrics:
1. R2
2. Adjusted R2
3. MSE (a loss function and a measure of performance)
Thoroughly validating your model is crucial.
Let’s take a look at these metrics…
Mean Squared Error
We have already introduced MSE. This serves as
both a loss function and an evaluation of
performance.
R2
Explained Variation / Total Variation.
Increases as the number of x’s increases.
Adjusted R2
Explained Variation / Total Variation, adjusted for
the number of features present in the model.
Measures of
performance
Performance
For an OLS regression, we will introduce
three measures of model performance:
R2, Adjusted R2 and MSE.
How did I do?
R2 and Adjusted R2 answer the
question, how much of y’s variation can
be explained by our model?
This reminds us of MSE, but R2 metrics
are scaled to be between 0 and 1.
Note: Adjusted R2 is preferable to R2,
because R2 increases as you include
more features in your model. This may
artificially inflate R2.
Measures of
performance
Performance
True Y
Y*=f(x)+e
If Y* is explained variation, what’s left
between Y* and Y is unexplained variation
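Both metrics are easy to compute by hand. A minimal sketch using the explained-variation idea above; n (number of observations) and p (number of explanatory features) follow the standard adjustment formula.

```python
import numpy as np

def r_squared(y, y_star):
    unexplained = ((y - y_star) ** 2).sum()  # variation the model misses
    total = ((y - y.mean()) ** 2).sum()      # total variation in y
    return 1 - unexplained / total

def adjusted_r_squared(y, y_star, p):
    n = len(y)
    r2 = r_squared(y, y_star)
    # Penalize R2 for each extra feature so it can't inflate for free
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```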
Source: This is a snippet of the output from Notebook 2.
In a regression output, you can find R-squared metrics here:
R-squared reminds us of
correlation, but there is an
important difference.
Correlation measures the
association between x and y.
R-squared measures how much of y’s
variation is explained by x.
Note that neither one gives us causation!
Ability to
generalize to
unseen data
Performance
Now that we know how well the model
performs on the data we have, how do
we predict how it will do in the real
world?
We need a way to quantify the way our model performs on unseen data. In an ideal
world, we would go out and find new data to test our model on.
However, this is often not realistic as we are constrained by time and resources.
Instead, we can split our data into two portions: training and
test data.
We will use a portion of our data to train our model, and the
rest of our data (which is “unseen” by the model, like real world
data) to test our model.
● Randomly split data into “training” and “test” sets
● Use regression results from the “training” set to predict the “test” set (sketched in code below)
● Compare predicted Y to actual Y
Source: Fortmann-Roe, Accurately Measuring Model Prediction Error,
http://scott.fortmann-roe.com/docs/MeasuringError.html
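Those three bullets map onto a few lines of scikit-learn; the synthetic X and y below are placeholders for your real features and outcome.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Randomly hold out ~30% of the data as the "unseen" test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)          # learn a and b on training data
print(mean_squared_error(y_test, model.predict(X_test)))  # compare predicted Y to actual Y
```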
Ability to
generalize
to unseen
data
Performance
We split our labelled data into training
and test. The test represents our
unseen future data.
Predicted Y - Actual Y
The test data is “unseen” in that the algorithm doesn’t use it to
train the model!
Ability to
generalize
to unseen
data
Performance
We split our labelled data into training
and test. The test represents our
unseen future data.
Predicted Y
Using ~70%
data
Actual Y
Using ~30%
data
It is important to clarify here that we are using train Y*- train Y to train the model,
which is different from the test Y*- test Y we use to evaluate how well the model can
generalize to unseen data.
Ability to
generalize
to unseen
data
Performance
Don’t confuse our use of loss functions
with our use of test data!
Training data
Yields
Our model
(the red line)
We apply the
model to
Test data
We use train Y*- train Y in our
loss functions that train the
model.
We use test Y* - test Y to see how well
our model can be generalized to unseen
data, or data that wasn’t used to train
our model.
Evaluating whether or not a model can be generalized to unseen data is very
important - in fact, being able to predict unseen data is often the whole point of
creating a model in the first place.
If we do not evaluate how well a model can generalize to outside data, there is a
danger that we are creating a model that is too specific to our dataset - that is, a
model that is GREAT at predicting this particular dataset, but is USELESS at
predicting the real world.
What does this look like?
Ability to
generalize
to unseen
data
Performance
Splitting train and test data is
extremely important!
Here, we are using wealth to predict happiness. The dotted lines are the models generated by ML algorithms.
● On the left, the model is not useful because it is too general and does not capture the relationship
accurately.
● On the right, the model is not useful because it is not general enough: it captures every single
idiosyncrasy in the dataset. This means the model is too specific to this dataset, and cannot be
generalized to different datasets.
We want to have a model that is just right - it is accurate enough within the dataset, and is general
enough to be applied outside of the dataset!
Ability to
generalize
to unseen
data
Performance
Splitting train and test data is
extremely important!
“underfitting” “overfitting” “just right”
Bias is how accurate the model is at
predicting the dataset.
Variance is how sensitive the model
is to small fluctuations in the
training set.
Ideally, we would have both low bias
and low variance.
Source: Fortmann-Roe, Accurately Measuring Model Prediction Error.
http://scott.fortmann-roe.com/docs/MeasuringError.html
Ability to
generalize
to unseen
data
Performance
This concept of a model being “just
right” is also called the bias-variance
trade-off.
“underfitting”
“overfitting” “just right”
terrible!
Don’t worry about the specifics of the bias-variance
trade-off for now - we will return to this concept
regularly throughout the course and in-depth in the next
module.
Let’s turn to another aspect of performance that is very
important: feature performance.
Performance
The performance of model components is
just as important as the performance
of the model itself.
Feature
Importance
Performance
Understanding feature importance allows us to:
1) Quantify which feature is driving explanatory power in the model
2) Compare one feature to another
3) Guide final feature selection
Evaluating the performance of features becomes important when we start
using more sophisticated linear models where f(x) includes more than one
explanatory variable. Let’s introduce that concept and then return here.
Univariate
Multivariate
One explanatory
variable
Multiple explanatory
variables.
Deciding whether a univariate or
multivariate regression is the best
model for your problem is part of the
task.
OLS Task
E.g. Trying to predict
malaria using patient’s
temperature, travel history,
whether or not they have
chills, nausea, headache.
E.g. Trying to predict
malaria using only patient’s
temperature
Univariate vs.
Multivariate
as f(x)
Source: James, Witten, Hastie and Tibshirani. Introduction to Statistical Learning with Applications in R.
Income = a + b1(Seniority) + b2(Years of Education) + e
This is an example of linear
regression with 2 explanatory
variables in 3 dimensions. Extend
this to n variables in n+1
dimensions.
OLS Task Defining f(x)
Many of the relationships we try to
model are more complicated than a
univariate regression. Instead, we use a
multivariate regression.
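A multivariate fit looks the same in code as a univariate one — the feature matrix just gains columns. A sketch mirroring the income example above, with made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
seniority = rng.uniform(0, 20, size=200)
education = rng.uniform(8, 20, size=200)
income = 10 + 1.5 * seniority + 2.0 * education + rng.normal(size=200)

X = np.column_stack([seniority, education])  # one column per explanatory variable
model = LinearRegression().fit(X, income)
print(model.intercept_, model.coef_)         # a, then b1 and b2
```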
Regressions can also be non-linear!
Source: [link]
Miles walked per day = a + b1(Age)² + b2(Age) + e
In this example, we see that mobility
increases with age and then
decreases after some point. This is
an example of a non-linear
regression, which is best explained
by a quadratic equation.
OLS Task Defining f(x)
[Plot: # miles walked per day vs. age]
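Even this non-linear relationship can be fit with the same linear machinery, by treating Age² as just another column. A sketch with illustrative coefficients and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
age = rng.uniform(5, 90, size=300)
miles = -0.004 * age**2 + 0.35 * age + 1 + rng.normal(scale=0.3, size=300)

X = np.column_stack([age**2, age])    # engineered feature: Age squared
model = LinearRegression().fit(X, miles)
print(model.coef_, model.intercept_)  # b1 (Age^2) and b2 (Age), then a
```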
When we model using a multivariate
regression, feature selection becomes
an important step. What explanatory
variables should we include?
OLS Task
Feature
engineering
& selection
Feature selection is often the difference between a project that fails and one that succeeds. Some
common ways of doing feature selection:
Qualitative Research            Literature review of past work done
Exploratory Analysis            Sanity checks on data; human intuition of what would influence the outcome
Using Other Models              Quantify feature importance (we’ll see this later with decision trees)
Analyzing Linear Regression Output   Look at each feature’s coefficient and p-value
Feature
Importance
Performance
How do we assess feature importance in
a linear regression?
Each feature has a coefficient and a
p-value.
Each feature has a:
1. Coefficient
2. P-value
The output of a linear regression is a model:
The size of the coefficient is the amount of influence that the feature has on
y. Whether the coefficient is negative or positive gives the direction of the
relationship that the feature has with y.
Feature
Importance
Performance
How do we assess feature importance in
a linear regression?
Y* = intercept + coef*feature
Feature
Importance
Each feature has a:
1. Coefficient
2. P-value
A huge coefficient is great, but how much confidence we have
in that coefficient depends on its p-value.
In technical terms, the p-value is the probability of getting results
at least as extreme as the ones observed, if the coefficient were actually
zero. It answers the question, “Could I have gotten my result
by random chance?”
A small p-value (<= 0.05, or 5%) says that the result is
probably not random chance - great news for our model!
Performance
How do we assess feature importance in
a linear regression?
Expressed as a %
Heads or tails?
It’s random!
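A sketch of where these numbers come from in practice: statsmodels prints a coefficient column and a P>|t| (p-value) column for every feature. The data here is synthetic, with one strong feature and one weak one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # "coef" = size/direction, "P>|t|" = p-value per feature
```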
Feature
Importance
Performance
How would you assess this feature using
p-value and size of the coefficient?
Feature
Importance
Performance
How would you assess this feature using
p-value and size of the coefficient?
A person in the clothing sector will, on
average, get a higher loan amount, but
only by very little. I’m reasonably
confident in this conclusion.
Extrapolation is the act of
inferring unknown values
based on known data.
Even validated algorithms are
subject to irresponsible
extrapolation!
Source: https://xkcd.com/605/
Feature
Importance
Performance A few last thoughts...
Linear regression has almost
countless potential applications,
provided we interpret carefully.
“All models are wrong, some are useful.”
- George E.P. Box,
British statistician
✓ Linear regression
✓ Relationship between two variables (x and y)
✓ Formalizing f(x)
✓ Correlation between two variables
✓ Assumptions
✓ Feature engineering and selection
✓ Univariate regression, Multivariate regression
✓ Measures of performance (R2, Adjusted R2, MSE)
✓ Overfitting, Underfitting
✓ Learning process: Loss function and Mean Squared Error
Module Checklist
Advanced resources
● Textbooks
○ An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie and
Tibshirani): Chapters 2.1, 3, 4, 6
○ The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie,
Tibshirani, Friedman): Chapters 3, 4
● Online resources
○ Analytics Vidhya’s guide to understanding regression:
https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/
○ Brown University’s introduction to probability and statistics,
http://students.brown.edu/seeing-theory/
● If you are interested in more sophisticated regression models, search:
○ Logistic regression, Polynomial regression, Interactions
● If you are interested in additional ways to solve multicollinearity, search:
○ Eigenvectors, Principal Components Analysis
Want to take this further? Here are
some resources we recommend:
You are on fire! Go straight to the
next module here.
Need to slow down and digest? Take a
minute to write us an email about
what you thought about the course. All
feedback small or large welcome!
Email: sara@deltanalytics.org
Congrats! You finished
module 3!
Find out more about
Delta’s machine
learning for good
mission here.

More Related Content

PDF
Module 2: Machine Learning Deep Dive
PDF
Module 1 introduction to machine learning
PDF
Logistic regression : Use Case | Background | Advantages | Disadvantages
PPTX
Logistic Regression
PPTX
Logistic Regression.pptx
PDF
Module 4: Model Selection and Evaluation
ODP
Machine Learning With Logistic Regression
ODP
Introduction to Bayesian Statistics
Module 2: Machine Learning Deep Dive
Module 1 introduction to machine learning
Logistic regression : Use Case | Background | Advantages | Disadvantages
Logistic Regression
Logistic Regression.pptx
Module 4: Model Selection and Evaluation
Machine Learning With Logistic Regression
Introduction to Bayesian Statistics

What's hot (20)

PDF
Logistic regression in Machine Learning
PPTX
Introduction to predictive modeling v1
PDF
Explainability and bias in AI
PPTX
ML - Simple Linear Regression
PDF
Latent Dirichlet Allocation
PDF
Master LLMs with LangChain -the basics of LLM
PDF
Nasscom AI top 50 use cases
PDF
Gnn overview
PDF
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
PPTX
Evolution of AI in workplace.pptx
PDF
DC02. Interpretation of predictions
PDF
Sentiment analysis - Our approach and use cases
PPTX
Generative AI Risks & Concerns
PDF
𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬
PPTX
Machine learning
PPTX
PPTX
Linear and Logistics Regression
PDF
AI in Business - Impact and Possibility
PPTX
Semi-Supervised Learning
PPT
Machine learning Algorithm
Logistic regression in Machine Learning
Introduction to predictive modeling v1
Explainability and bias in AI
ML - Simple Linear Regression
Latent Dirichlet Allocation
Master LLMs with LangChain -the basics of LLM
Nasscom AI top 50 use cases
Gnn overview
AI Vs ML Vs DL PowerPoint Presentation Slide Templates Complete Deck
Evolution of AI in workplace.pptx
DC02. Interpretation of predictions
Sentiment analysis - Our approach and use cases
Generative AI Risks & Concerns
𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐀𝐈: 𝐂𝐡𝐚𝐧𝐠𝐢𝐧𝐠 𝐇𝐨𝐰 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐈𝐧𝐧𝐨𝐯𝐚𝐭𝐞𝐬 𝐚𝐧𝐝 𝐎𝐩𝐞𝐫𝐚𝐭𝐞𝐬
Machine learning
Linear and Logistics Regression
AI in Business - Impact and Possibility
Semi-Supervised Learning
Machine learning Algorithm
Ad

Similar to Module 3: Linear Regression (20)

PDF
HRUG - Linear regression with R
PPTX
Qt unit i
PDF
Linear models for data science
PPTX
Linear_Regression
PDF
Interpretability in ML & Sparse Linear Regression
PDF
Data Science as a Career and Intro to R
PDF
Machine learning in credit risk modeling : a James white paper
PDF
ML_Lec4 introduction to linear regression.pdf
PPTX
MachineLlearning introduction
PPTX
Introduction to ml
PDF
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
PDF
Unit2_Linear Regression_Performance Metrics.pdf
PPTX
Forecasting Using the Predictive Analytics
DOCX
Neural Network Model
PDF
ML_Lec3 introduction to regression problems.pdf
PDF
Top 50+ Data Science Interview Questions and Answers for 2025 (1).pdf
PDF
A tour of the top 10 algorithms for machine learning newbies
PPTX
Linear regression aims to find the "best-fit" linear line
PPTX
HRUG - Linear regression with R
Qt unit i
Linear models for data science
Linear_Regression
Interpretability in ML & Sparse Linear Regression
Data Science as a Career and Intro to R
Machine learning in credit risk modeling : a James white paper
ML_Lec4 introduction to linear regression.pdf
MachineLlearning introduction
Introduction to ml
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Unit2_Linear Regression_Performance Metrics.pdf
Forecasting Using the Predictive Analytics
Neural Network Model
ML_Lec3 introduction to regression problems.pdf
Top 50+ Data Science Interview Questions and Answers for 2025 (1).pdf
A tour of the top 10 algorithms for machine learning newbies
Linear regression aims to find the "best-fit" linear line
Ad

More from Sara Hooker (9)

PDF
Module 9: Natural Language Processing Part 2
PDF
Module 8: Natural language processing Pt 1
PDF
Module 7: Unsupervised Learning
PDF
Module 6: Ensemble Algorithms
PDF
Module 5: Decision Trees
PDF
Module 1.3 data exploratory
PDF
Module 1.2 data preparation
PDF
Storytelling with Data (Global Engagement Summit at Northwestern University 2...
PPTX
Delta Analytics Open Data Science Conference Presentation 2016
Module 9: Natural Language Processing Part 2
Module 8: Natural language processing Pt 1
Module 7: Unsupervised Learning
Module 6: Ensemble Algorithms
Module 5: Decision Trees
Module 1.3 data exploratory
Module 1.2 data preparation
Storytelling with Data (Global Engagement Summit at Northwestern University 2...
Delta Analytics Open Data Science Conference Presentation 2016

Recently uploaded (20)

PDF
Sensors and Actuators in IoT Systems using pdf
PDF
HCSP-Presales-Campus Network Planning and Design V1.0 Training Material-Witho...
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
AI And Its Effect On The Evolving IT Sector In Australia - Elevate
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Advanced Soft Computing BINUS July 2025.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Advanced IT Governance
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Modernizing your data center with Dell and AMD
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
cuic standard and advanced reporting.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Chapter 2 Digital Image Fundamentals.pdf
Sensors and Actuators in IoT Systems using pdf
HCSP-Presales-Campus Network Planning and Design V1.0 Training Material-Witho...
NewMind AI Monthly Chronicles - July 2025
AI And Its Effect On The Evolving IT Sector In Australia - Elevate
NewMind AI Weekly Chronicles - August'25 Week I
Advanced Soft Computing BINUS July 2025.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
solutions_manual_-_materials___processing_in_manufacturing__demargo_.pdf
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Advanced IT Governance
“AI and Expert System Decision Support & Business Intelligence Systems”
Modernizing your data center with Dell and AMD
Per capita expenditure prediction using model stacking based on satellite ima...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
cuic standard and advanced reporting.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
MYSQL Presentation for SQL database connectivity
Chapter 2 Digital Image Fundamentals.pdf

Module 3: Linear Regression

  • 2. This course content is being actively developed by Delta Analytics, a 501(c)3 Bay Area nonprofit that aims to empower communities to leverage their data for good. Please reach out with any questions or feedback to [email protected]. Find out more about our mission here. Delta Analytics builds technical capacity around the world.
  • 4. Let’s do a quick review of module 2!
  • 5. How does a model learn from raw data? Task What is the problem we want our model to solve? Performance Measure Quantitative measure we use to evaluate the model’s performance. Learning Methodology ML algorithms can be supervised or unsupervised. This determines the learning methodology. All models have the following components: Source: Deep Learning Book - Chapter 5: Introduction to Machine Learning
  • 6. Classification task: No No Yes Yes Exercise 1 Predicted Approval for credit card Y Y* 12,000 60,000 11,000 200,000 Current Annual Income ($) Approval for credit card Yes No No Yes Outstanding Debt 200 60,000 0 10,000 What are the explanatory features? What is the outcome feature?
  • 7. Task Performance Measure Learning Experience Let’s fill in the blanks for our credit approval example:
  • 8. Task Should this customer be approved for a credit card? Performance Measure Log Loss Learning Experience Supervised Let’s fill in the blanks for our credit approval example:
  • 9. Regression task: 90% 30% 95% 50% Exercise 2 X Y Y* 10 2 12 0 Time spent A week studying machine learning Accuracy of classification model built by student 30% 60% 26% 88% Predicted Accuracy of classification model What are the explanatory features? What is the outcome feature?
  • 10. Task Performance Measure Learning Experience Let’s fill in the blanks for our studying example:
  • 11. Task How accurate is this student’s classification model? Performance Measure MSE Learning Experience Supervised Let’s fill in the blanks for our studying example:
  • 13. Course overview: Now let’s turn to the data we will be using... ✓ Module 1: Introduction to Machine Learning ✓ Module 2: Machine Learning Deep Dive ✓ Module 3: Linear Regression ❏ Module 4: Decision Trees ❏ Module 5: Ensemble Algorithms ❏ Module 6: Unsupervised Learning Algorithms ❏ Module 7: Natural Language Processing Part 1 ❏ Module 8: Natural Language Processing Part 2
  • 14. ❏ Linear regression ❏ Relationship between two variables (x and y) ❏ Formalizing f(x) ❏ Correlation between two variables ❏ Assumptions ❏ Feature engineering and selection ❏ Learning process: Loss function and Mean Squared Error ❏ Univariate regression, Multivariate regression ❏ Measures of performance (R2, Adjusted R2, MSE) ❏ Overfitting, Underfitting Module Checklist
  • 15. What is linear regression? Linear regression is a model that explains the relationship between explanatory features and an outcome feature as a line in two dimensional space.
  • 16. Linear regression has been in use since the 19th century & is one of the most important machine learning tools available to researchers. Many other models are built upon the logic of linear models. For example, the most simple form of deep learning model, with no hidden layers, is a linear model! Why is linear regression important? Source: History of Linear Regression - Introduction to Linear Regression Modeling Phase Performance Learning Methodology Task
  • 17. Linear regression: model cheat sheet ● Linear relationship between x and y ● Normal distribution of variables ● No multicollinearity (Independent variables) ● Homoscedasticity ● Rule of thumb: at least 20 observations per independent variable in the analysis ● Sensitive to outliers ● The world is not always linear; we often want to model more complex relationships ● Does not allow us to model interactions between explanatory features (we will be able to do this using a decision tree) ● Very popular model with intuitive, easy to understand results. ● Natural extension of correlation analysis Pros Cons Assumptions* * We will explain each of these assumptions in this module!
  • 18. Task What is the problem we want our model to solve? Defining f(x) What is f(x) for a linear model? Feature engineering & selection What is x? How do we decide what explanatory features to include in our model? Today we are looking closer at each component of the framework we discussed in the last class What assumptions does a linear regression make about the data? Do we have to transform the data? Is our f(x) correct for this problem?
  • 19. Learning Methodology Linear models are supervised, how does that affect the learning processing? What is our loss function? Every supervised model has a loss function it wants to minimize. Optimization process How does the model minimize the loss function. Learning methodology is how the linear regression model learns which line best fits the raw data. How does our ML model learn? Overview of how the model teaches itself.
  • 20. Performance Quantitative measure we use to evaluate the model’s performance. Measures of performance R2, Adjusted R2, MSE Feature Performance Statistical significance, p values There are linear models for both regression and classification problems. Overfitting, underfitting, bias, variance Ability to generalize to unseen data
  • 22. Task What is the problem we want our model to solve? Defining f(x) What is f(x) for a linear model? Feature engineering & selection What is x? How do we decide what explanatory features to include in our model? What assumptions does a linear regression make about the data? Do we have to transform the data? Is our f(x) correct for this problem?
  • 23. Regression Classification Continuous variable Categorical variable Recall our discussion of two types of supervised tasks, regression & classification. There are linear models for both. Today we will only discuss regression. Task Ordinary Least Squares (OLS) regression Logistic regression We will not cover this in the lecture, but we will provide resources for further study OLS is a linear regression method based on minimizing the sum of squared residuals
  • 24. We have plotted the total $ loaned by KIVA in Kenya each year. What is the trend? How would you summarize this trend in one sentence? OLS Regression Task A OLS regression is a trend line that predicts how much Y will change for a given change in x. Let’s try and break that down a little more by looking at an example.
  • 25. “Every year, the value of loans in KE appears to be increasing” A linear regression formalizes our perceived trend as a relationship between x and Y. It can be understood intuitively as a trend line. Every additional year corresponds to x additional dollars Kiva loans in KE. Human Intuition OLS Regression
  • 26. A linear model allows us to predict beyond our set of observations because for every point on the x axis we can find the corresponding Y* A linear model expresses the relationship between our explanatory features and our outcome features as a straight line. The output of f(x) for a linear model will always be a line. x Y* For example, we can now say for an x not in our scatterplot (May, 2012) what we predict the $ amount of loans is. Defining f(x)
  • 27. Is linear regression the right model for our data?
  • 28. A big part of being a machine learning researcher involves choosing the right model for the task. Each model makes certain assumptions about the underlying data. Let’s take a closer look at how this relates in linear regression. OLS Regression Task Is our f(x) correct for this problem?
  • 29. OLS Regression Task Before we choose a linear model, we need to make sure all the assumptions hold true in our data. Is our f(x) correct for this problem? ❏ Linear relationship between x and Y ❏ Normal distribution of variables ❏ No autocorrelation (Independent variables) ❏ Homoscedasticity ❏ No multicollinearity ❏ Rule of thumb: at least 20 observations per independent variable in the analysis OLS Linear Regression Assumptions
  • 30. OLS Regression Task Is there a linear relationship between x and Y? Is our f(x) correct for this problem? Source: Linear and non-linear relationships Linear regression assumes that there is a linear relationship between x and Y. If this is not true, our trend line will do a poor job of predicting Y.
  • 31. Source: Statistics - Normal distribution OLS Regression Task Is our data normally distributed?Is our f(x) correct for this problem? Normal distribution of explanatory and outcome features avoids distortion of results due to outliers or skewed data.
  • 32. OLS Regression Task Multicollinearity occurs when explanatory variables are highly correlated. Do we have multicollinearity? Is our f(x) correct for this problem? Multicollinearity introduces redundancy to the model and reduces our certainty in the results. We want no multicollinearity in our model. Province Country Province and county are highly correlated. We will want to include only one of these variables in our model. Age Loan Amount Age and loan amount appear to have no multicollinearity. We can include both in our model.
  • 33. OLS Regression Task Do we have homoscedasticity?Is our f(x) correct for this problem? Error term or “noise” is the same across all values of the outcome variables. If homoscedasticity does not hold, cases with a greater error term will have outsized influence on the regression. Data must be homoscedastic, meaning the error rate is evenly distributed at all outcome variable values. Good Not good Not good Source: Homoscedasticity
  • 34. OLS Regression Task Do we have autocorrelation?Is our f(x) correct for this problem? Autocorrelation is correlation between values of a variable and its delayed copy. For example, a stock’s price today is correlated with yesterday’s price Autocorrelation commonly occurs when you work with time series.
  • 35. OLS Regression Task In the coding lab, we will go over code that will help you determine whether a linear model provides the best f(x) Is our f(x) correct for this problem? ✓ Linear relationship between x and Y ✓ Normal distribution of variables ✓ No autocorrelation (Independent variables) ✓ Homoscedasticity ✓ Rule of thumb: at least 20 observations per independent variable in the analysis OLS Assumptions We have signed off on all of our assumptions, which means we can confidently choose a linear OLS model for this task. Let’s start building our model!
  • 36. OLS Regression Task Question! What happens if you don’t have all the assumptions? Is our f(x) correct for this problem? ✓ Linear relationship between x and Y ✓ Normal distribution of variables ✓ No autocorrelation (Independent variables) ✓ Homoscedasticity ✓ Rule of thumb: at least 20 observations per independent variable in the analysis OLS Assumptions If these assumptions do not hold true our trend line will not be accurate. We have a few options: 1) Transform our data to fulfill the assumptions 2) Choose a different model to capture the relationship between x and Y
  • 37. Yes! Linear regression is an appropriate model choice. Now what?
  • 38. True Y Y*=f(x)+e OLS Regression Task Defining f(x) Remember that all models involve a function f(x) that map an input x to a predicted Y (Y*). The goal of the function is to have a model that predicts Y* as close to true Y as possible. f(x) for a linear model is: Y*=a + bx + e What is f(x) for a linear regression model? Just like in algebra!
  • 39. Y* a OLS Regression Task Defining f(x) Y*=a + bx + e What does Y*=a + bx + e actually mean? Y bx + e The y-intercept of the regression line b The slope or gradient of the regression line. This determines how steep the line is and it’s directionality e The irreducible error term, error our model cannot reduce Y* The predicted Y output of our function a parameters
  • 40. How does our OLS model learn? There are an infinite number of possible lines in a two dimensional space. How does our model choose the best one? Learning Methodology Y X Linear regression is a learning algorithm. That is what makes it a machine learning model. It learns to find the best trend line from an infinite number of possibilities. How? First, we must understand what levers we control.
  • 41. Values that control the behavior of the model and are learnt through experience. In each model there are some things we cannot control, like e (irreducible error). A model may also have hyperparameters which are set beforehand and not trained using data (no need to think too much about this now). Y*=a + bx + e Parameters Hyperparameters Higher level settings of a model that are fixed before training begins. How does our OLS model learn? Parameters of a model are values that we control. The parameters in an OLS Model are a (intercept) and b (slope) Learning Methodology Our model can move a & b but not e.
  • 42. How does our OLS model learn? a and b are the only two levers our simple OLS model can change to get Y* closer to Y. Learning Methodology Y X Y*=a + bx + e How do I decide in what direction to change a and b? changing a shifts our line up or down the y intercept, +/- b changes the steepness of our line and direction
  • 43. How does our OLS model learn? We can change a and b to move our line in space. Below is some intuition about how changing a and b affects f(x). Learning Methodology Y*=a + bx + e + b means an upwards sloping line a and b are our two parameters. changing a shifts our line up or down the y intercept. Negative a moves the y-intercept below 0. - b means a downward sloping line b=0 means there is no relationship between x and Y Source: Intro to Stat - Introduction to Linear Regression
  • 44. How does our OLS model learn? The directionality of b is very important. It tells us if Y gets smaller or larger when we increase x. Learning Methodology Y*=a + bx + e + b is a positive slope - b is a negative slope Source: Beginning Algebra - Coordinate System and graphing lines
  • 46. Exercise 1 Learning Methodology Y X Y*=a + bx + e What happens if I increase a by 2?
  • 47. Exercise 1 Learning Methodology Y X Y*=a + bx + e What happens if I increase a by 2?
  • 48. Exercise 2 Learning Methodology Y X Y*=a + bx + e What happens if I increase b by 3?
  • 49. Exercise 2 Learning Methodology Y X Y*=a + bx + e What happens if I increase b by 3?
  • 50. What parameters give us the best trend line?
  • 51. What is our loss function? We make a decision about how to change our parameters based upon our loss function. Learning Methodology Y*=a + bx + e Let me try different values of a and b to minimize the total loss function. Recall: Our model starts with a random a and b, and our job is to change a and b in a way that moves Y* closer to the true Y. You are in fact trying to reduce the distance between Y* and true Y. We measure the distance using mean squared error. This is our loss function. All supervised models have a loss function (sometimes also known as the cost function) they must optimize by changing the model parameters.
  • 52. What is our loss function? For every a and b we try, we measure the mean squared error. Remember MSE? Learning Methodology This job isn’t done until I reduce MSE. Source: Intro to Stat -Introduction to Linear Regression Mean squared error is a measure of how close Y* is to Y. There are four steps to MSE: Y-Y* For every point in our dataset, measure the difference between true Y and predicted Y. ^2 Square each Y-Y* to get the absolute distance, so positive values don’t cancel out negative ones when we sum. Sum Sum across all observations so we get the total error. mean Divide the sum by the number of observations we have.
  • 53. What is our loss function? The green boxes in the chart below is the MSE for each data point. We sum and then take the mean across all data points to get the MSE. Learning Methodology Source: Seeing Theory - Regression Y-Y* For every point in our dataset, measure the difference between true Y and predicted Y. ^2 Square each Y-Y* to get the absolute distance, so positive values don’t cancel out negative ones when we sum. Sum Sum across all observations so we get the total error. mean Divide the sum by the number of observations we have.
  • 54. Optimization Process The process of changing a and b to reduce MSE is called learning. It is what makes OLS regression a machine learning algorithm. Learning Methodology Source: Seeing Theory - Regression For every combination of a and b we choose there is an associated MSE. The learning process involves updating a and b in order to reach the global minimum. The learning process for OLS is technically called learning by gradient descent. Optimization of MSE:
  • 55. Next, let’s look at model validation. What is our loss function? Optimization Process How does our OLS model learn? Learning Methodology Is our f(x) correct for this problem? OLS Regression Task Defining f(x)
  • 57. Validation is a process of evaluating your model performance. We have to evaluate a model on two critical aspects: 1. How close the model’s Y* is to true Y 2. How well the model performs in the real world (i.e., on unseen data). Common validation metrics: 1. R2 2. Adjusted R2 3. MSE (a loss function and a measure of performance) Thoroughly validating your model is crucial. Let’s take a look at these metrics…
  • 58. Mean Squared Error We have already introduced MSE. This serves as both a loss function and an evaluation of performance. R2 Explained Variation / Total Variation Increases as number of x’s increase. Adjusted R2 Explained Variation / Total Variation, adjusted for the number of features present in the model Measures of performance Performance For an OLS regression, we will introduce three measures of model performance: R2, Adjusted R2 and MSE. How did I do?
  • 59. R2 and Adjusted R2 answer the question, how much of y’s variation can be explained by our model? This reminds us of MSE, but R2 metrics are scaled to be between 0 and 1. Note: Adjusted R2 is preferable to R2 , because R2 increases as you include more features in your model. This may artificially inflate R2 . Measures of performance Performance True Y Y*=f(x)+e If Y* is explained variation, what’s left between Y* and Y is unexplained variation
  • 60. Source: This is a snippet of the output from Notebook 2. In a regression output, you can find R-squared metrics here:
  • 61. R2 can give us causation where correlation couldn’t! R-squared reminds us of correlation, but there is an important difference. Correlation measures the association between x and y. R-squared measures how much y is explained by x.
  • 62. Ability to generalize to unseen data Performance Now we know how well the model performs on the data we have, how do we predict how it will do in the real world? We need a way to quantify the way our model performs on unseen data. In an ideal world, we would go out and find new data to test our model on. However, this is often not realistic as we are constrained by time and resources. Instead of doing this, we can instead split our data into two portions: training and test data. We will use a portion of our data to train our model, and the rest of our data (which is “unseen” by the model, like real world data) to test our model.
  • 63. ● Randomly split data into “training” and “test” sets ● Use regression results from “training” set to predict“test” set ● Compare Predicted Y to Actual Y Source: Fortmann-Roe, Accurately Measuring Model Prediction Error, http://scott.fortmann-roe.com/docs/MeasuringError.html Ability to generalize to unseen data Performance We split our labelled data into training and test. The test represents our unseen future data. Predicted Y - Actual Y
  • 64. The test data is “unseen” in that the algorithm doesn’t use it to train the model! Ability to generalize to unseen data Performance We split our labelled data into training and test. The test represents our unseen future data. Predicted Y Using ~70% data Actual Y Using ~30% data
  • 65. It is important to clarify here that we are using train Y*- train Y to train the model, which is different than the test Y*- test Y we use to evaluate how well the model can generalize to unseen data. Ability to generalize to unseen data Performance Don’t confuse our use of loss functions with our use of test data! Training data Yields Our model (the red line) We apply the model to Test data We use train Y*- train Y in our loss functions that train the model. We use test Y* - test Y to see how well our model can be generalized to unseen data, or data that wasn’t used to train our model..
  • 66. Evaluating whether or not a model can be generalized to unseen data is very important - in fact, being able to predict it is often the whole point of creating a model in the first place. If we do not evaluate how well a model can generalize to outside data, there is a danger that we are creating a model that is too specific to our dataset - that is, a model that is GREAT at predicting this particular dataset, but is USELESS at predicting the real world. What does this look like? Ability to generalize to unseen data Performance Splitting train and test data is extremely important!
  • 67. Here, we are using wealth to predict happiness. The dotted lines are the models generated by ML algorithms. ● On the left, the model is not useful because it is too general and does not capture the relationship accurately. ● On the right, the model is not useful because not general enough and captures every single idiosyncrasy in the dataset. This means the model is too specific to this dataset, and cannot be generalized to different datasets. We want to have a model that is just right - it is accurate enough within the dataset, and is general enough to be applied outside of the dataset! Ability to generalize to unseen data Performance Splitting train and test data is extremely important! “underfitting” “overfitting”“just right”
  • 68. Bias is error from overly simple assumptions: a high-bias model systematically misses the true relationship (“underfitting”). Variance is how sensitive the model is to small fluctuations in the training set: a high-variance model fits noise (“overfitting”). Ideally, we would have both low bias and low variance (“just right”) - high bias together with high variance is terrible! Source: Fortmann-Roe, Accurately Measuring Model Prediction Error, http://scott.fortmann-roe.com/docs/MeasuringError.html Ability to generalize to unseen data Performance This concept of a model being “just right” is also called the bias-variance trade-off.
  • 69. Don’t worry about the specifics of the bias-variance trade-off for now - we will return to this concept regularly throughout the course, and in depth in the next module. Let’s turn to another aspect of performance that is very important: feature performance. Performance
  • 70. The performance of model components is just as important as the performance of the model itself Feature Importance Performance Understanding feature importance allows us to: 1) Quantify which feature is driving explanatory power in the model 2) Compare one feature to another 3) Guide final feature selection Evaluating the performance of features becomes important when we start using more sophisticated linear models where f(x) includes more than one explanatory variable. Let’s introduce that concept and then return here.
  • 71. Univariate vs. Multivariate as f(x) OLS Task ● Univariate: one explanatory variable, e.g. trying to predict malaria using only the patient’s temperature. ● Multivariate: multiple explanatory variables, e.g. trying to predict malaria using the patient’s temperature, travel history, and whether or not they have chills, nausea, or a headache. Deciding whether a univariate or multivariate regression is the best model for your problem is part of the task.
  • 72. Defining f(x) OLS Task Many of the relationships we try to model are more complicated than a univariate regression can capture. Instead, we use a multivariate regression: Income = a + b1(Seniority) + b2(Years of Education) + e This is an example of linear regression with 2 explanatory variables in 3 dimensions; extend this to n variables in n+1 dimensions. Source: James, Witten, Hastie and Tibshirani. Introduction to Statistical Learning with Applications in R.
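As a minimal sketch of a multivariate fit (the income data and coefficients below are invented for illustration), scikit-learn estimates one coefficient per explanatory variable plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: income as a function of seniority and years of education
rng = np.random.default_rng(2)
seniority = rng.uniform(0, 20, 100)
education = rng.uniform(8, 20, 100)
income = 20 + 1.5 * seniority + 2.0 * education + rng.normal(scale=5.0, size=100)

# Each column of X is one explanatory variable
X = np.column_stack([seniority, education])
model = LinearRegression().fit(X, income)

# The fitted model is a plane in 3D: income* = a + b1*seniority + b2*education
print("intercept (a):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)
```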
  • 73. Defining f(x) OLS Task Regressions can also be non-linear! Miles walked per day = a + b1(Age)² + b2(Age) + e In this example, the number of miles walked per day increases with age up to some point, then decreases. This is an example of modeling a non-linear relationship, which is best captured by a quadratic equation. Source: [link]
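A quick sketch of fitting such a curve (the age/mobility numbers are invented): np.polyfit handles this because the model is still a weighted sum of terms, just with (Age)² as an extra term:

```python
import numpy as np

# Invented data: mobility rises with age, then falls (an inverted-U shape)
rng = np.random.default_rng(3)
age = rng.uniform(5, 90, 120)
miles = -0.004 * (age - 45) ** 2 + 8 + rng.normal(scale=0.5, size=120)

# Fit miles* = b1*(age)^2 + b2*age + a  (a quadratic in age)
b1, b2, a = np.polyfit(age, miles, deg=2)
print(f"miles* = {b1:.4f}*age^2 + {b2:.4f}*age + {a:.2f}")

# Predict mobility at a few ages using the fitted curve
for test_age in (20, 45, 80):
    print(test_age, "->", np.polyval([b1, b2, a], test_age))
```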
  • 74. Feature engineering & selection OLS Task When we model using a multivariate regression, feature selection becomes an important step: which explanatory variables should we include? Feature selection is often the difference between a project that fails and one that succeeds. Some common ways of doing feature selection: ● Qualitative research: literature review of past work done. ● Exploratory analysis: sanity checks on the data; human intuition about what would influence the outcome. ● Using other models: quantify feature importance (we’ll see this later with decision trees). ● Analyzing linear regression output: look at each feature’s coefficient and p-value.
  • 75. Feature Importance Performance How do we assess feature importance in a linear regression? Each feature has a coefficient and a p-value.
  • 76. How do we assess feature importance in a linear regression? Feature Importance Performance The output of a linear regression is a model: Y* = intercept + coef*feature Each feature has a: 1. Coefficient 2. P-value The size of the coefficient is the amount of influence the feature has on y, and its sign (positive or negative) is the direction of the relationship between the feature and y.
  • 77. Feature Importance Performance How do we assess feature importance in a linear regression? Each feature has a: 1. Coefficient 2. P-value A huge coefficient is great, but how much confidence we have in that coefficient depends on its p-value. In technical terms, the p-value is the probability of getting results as extreme as the ones observed if the coefficient were actually zero. It answers the question: “Could I have gotten my result by random chance?” A small p-value (<= 0.05, or 5%) says that the result is probably not random chance - great news for our model! The p-value is expressed as a percentage (think of a coin flip: heads or tails is pure random chance).
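As a minimal sketch (invented loan data; statsmodels is our own choice here because its output reports both numbers), this is how a coefficient and its p-value appear for each feature in practice:

```python
import numpy as np
import statsmodels.api as sm

# Invented data: loan amount driven by income, with a weak sector effect
rng = np.random.default_rng(4)
income = rng.uniform(10, 100, 200)
clothing_sector = rng.integers(0, 2, 200)   # 1 if borrower is in the clothing sector
loan = 5 + 0.8 * income + 0.3 * clothing_sector + rng.normal(scale=3.0, size=200)

# add_constant adds the intercept column; OLS fits the linear model
X = sm.add_constant(np.column_stack([income, clothing_sector]))
results = sm.OLS(loan, X).fit()

# results.params holds the coefficients; results.pvalues their p-values.
# A feature with a tiny coefficient and a p-value > 0.05 is weak evidence.
print(results.params)    # [intercept, b_income, b_clothing]
print(results.pvalues)
```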
  • 78. Feature Importance Performance How would you assess this feature using p-value and size of the coefficient?
  • 79. Feature Importance Performance How would you assess this feature using the p-value and the size of the coefficient? A person in the clothing sector will, on average, get a higher loan amount, but only by a little. I’m reasonably confident in this conclusion.
  • 80. Extrapolation is the act of inferring unknown values based on known data. Even validated algorithms are subject to irresponsible extrapolation! Source: https://xkcd.com/605/ Feature Importance Performance A few last thoughts... Linear regression has almost countless potential applications, provided we interpret carefully.
  • 81. “All models are wrong, some are useful.” - George E.P. Box, British statistician
  • 82. ✓ Linear regression ✓ Relationship between two variables (x and y) ✓ Formalizing f(x) ✓ Correlation between two variables ✓ Assumptions ✓ Feature engineering and selection ✓ Univariate regression, Multivariate regression ✓ Measures of performance (R2, Adjusted R2, MSE) ✓ Overfitting, Underfitting ✓ Learning process: Loss function and Mean Squared Error Module Checklist
  • 84. ● Textbooks ○ An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie and Tibshirani): Chapters 2.1, 3, 4, 6 ○ The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie, Tibshirani, Friedman): Chapters 3, 4 ● Online resources ○ Analytics Vidhya’s guide to understanding regression: https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/ ○ Brown University’s introduction to probability and statistics: https://students.brown.edu/seeing-theory/ ● If you are interested in more sophisticated regression models, search: ○ Logistic regression, Polynomial regression, Interactions ● If you are interested in additional ways to solve multicollinearity, search: ○ Eigenvectors, Principal Components Analysis Want to take this further? Here are some resources we recommend:
  • 85. You are on fire! Go straight to the next module here. Need to slow down and digest? Take a minute to write us an email about what you thought about the course. All feedback small or large welcome! Email: [email protected]
  • 86. Congrats! You finished module 3! Find out more about Delta’s machine learning for good mission here.