DS w13 Regression
Lecture 13: Regression
Regression
• Ex: Given the following values of X and Y as (X, Y): (1,1), (2,2), (4,4), (100,100), (20,20)
• What is the value of Y when X is 5?
• The answer is 5. Not very difficult, right?
• Now, let's take a look at a different example. Say you have the following pairs of X and Y: (1,1), (2,4), (4,16), (100,10000), (20,400). Can you calculate the value of Y when X is 5?
• The answer is 25. Was it difficult?
Ref: https://medium.com/@amarbudhiraja
Regression
• Your computer looks at some examples and then tries to identify "the most suitable" relationship between the sets X and Y. Using this identified relationship, it will try to predict Y for new examples for which you don't know Y.
Ref: https://medium.com/@amarbudhiraja
What is “Linear”?
You have a collection of (x, y) pairs, and you try to fit a line to them of the form y = mx + b.
Fitting a line to data.
Remember this: Y = mX + B?
Regression
Plot the points out, fit a line to them by eye, trace the line with a ruler, and use that to pull out m and b.
[Figure: Anscombe's quartet. Each of the four datasets has the same line of best fit and the same quality of fit.]
Estimate of the regression coefficients
For a given data set {(x₁, y₁), …, (xₙ, yₙ)}, we want to estimate the coefficients β₀ and β₁ of the line ŷ = β₀ + β₁x.
Estimate of the regression coefficients
Is this line good?
Estimate of the regression coefficients
Maybe this one?
Estimate of the regression coefficients
Or this one?
The line that best fits a set of data points is the one having the smallest possible sum of squared errors.
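Concretely, for data points (xᵢ, yᵢ) and a candidate line ŷᵢ = mxᵢ + b, least squares picks m and b to minimize

SSE(m, b) = Σᵢ (yᵢ − (mxᵢ + b))²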
Linear Regression Example
The standard way to fit a line is called least squares. In Python, it can be fit using the LinearRegression class, as in the example scripts, and the fitted coefficients can be found in the following way:
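The example scripts themselves are not reproduced in these notes; a minimal sketch of the idea, assuming scikit-learn and illustrative toy data:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data lying near the line y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

model = LinearRegression()
model.fit(x, y)

print(model.coef_)       # fitted slope m (one entry per feature)
print(model.intercept_)  # fitted intercept b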
Linear regression will bend the parameters so as to avoid large deviations of even a single
point, which makes it unsuitable in situations where a handful of large deviations are to be
expected.
An alternative approach that is better suited to data with outliers is to use the following loss function:

L(β) = Σᵢ |yᵢ − ŷᵢ|
Regression
• Here we just take the absolute values of the individual error terms and add them up.
• This is called "L1 regression," among other names.
• Outliers will still have an impact, but it is not as egregious as with least squares (L2).
• On the other hand, L1 regression penalizes small deviations from expectation more harshly than least squares, and it is significantly more complicated to optimize computationally (see the sketch below).
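One practical way to fit an L1 (least absolute deviations) line is median regression; a minimal sketch using scikit-learn's QuantileRegressor (available in scikit-learn 1.0 and later), with made-up data containing one outlier:

import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

# toy data on the line y = x, plus one large outlier
x = np.arange(10, dtype=float).reshape(-1, 1)
y = x.ravel().copy()
y[-1] = 100.0  # outlier

l2 = LinearRegression().fit(x, y)
# quantile=0.5 minimizes the sum of absolute errors (median regression)
l1 = QuantileRegressor(quantile=0.5, alpha=0.0).fit(x, y)

print("L2 slope:", l2.coef_[0])  # pulled up substantially by the outlier
print("L1 slope:", l1.coef_[0])  # stays close to 1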
Regression
Fitting Nonlinear Curves
• Fitting a curve to data is a ubiquitous problem not just in data science but in
engineering and the sciences in general.
• Often, there are good a priori reasons that we expect a certain functional form,
and extracting the best-fit parameters will tell us something very meaningful
about the system we are studying.
A few examples:
• Exponential decay to some baseline, e.g. y = a·e^(−bx) + c. This is useful for modeling many processes where a system starts in some kind of agitated state and decays to a baseline.
Regression
Fitting Nonlinear Curves
Exponential growth: y = a·e^(bx)
Logistic growth: y = L / (1 + e^(−k(x − x₀)))
Least squares is the typical approach in all of these cases: we pick the parameters θ so as to minimize the loss function

L(θ) = Σᵢ (yᵢ − f(xᵢ; θ))²
Regression
Fitting Nonlinear Curves
The following script creates some data of the form y = 2 + 3x², adds some noise to it, and then uses curve_fit to fit a curve of the form y = a + bx² to the data.
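The script itself did not survive in these notes; a minimal reconstruction of what it plausibly looked like, using scipy.optimize.curve_fit:

import numpy as np
from scipy.optimize import curve_fit

# the functional form we assume: y = a + b * x^2
def f(x, a, b):
    return a + b * x ** 2

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2 + 3 * x ** 2 + rng.normal(scale=2.0, size=x.size)  # noisy data

params, covariance = curve_fit(f, x, y)
print(params)  # best-fit (a, b), close to (2, 3)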
When assessing the quality of a fitted curve, there are two questions we want to answer:
• How accurately can we predict values?
• We assumed that the data followed some functional form. Was that even a good assumption?
• The standard way to answer the first question is called R², pronounced "R squared."
• R² is often described as the fraction of the variance that is accounted for by the model.
• A value of 1.0 means a perfect match, and a value of 0 means you didn't capture any of the variation.
• In some cases (there are a few different definitions of R² floating around), it can even take on negative values.
Regression
Goodness of Fit: R² and Correlation
These allow us to say, in a precise sense, that your fitted model accounts for a certain percentage of the variation in the data. The definition of R² is then

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

and you can see it as the fraction of all variation that is captured by the model.
Of course, taking the squares of the residuals isn’t necessarily the “right” way to quantify
variation, but it is the most standard option.
Regression
Goodness of Fit: R² and Correlation
Despite looking like a square, technically R² can be negative if your model is truly abysmal.
Having R² = 0 is what you would see if you just defined your fitted function to return the average of y as a constant value.
You can think of this as the crudest way to fit a function to data. Do any worse than this, and your R² score will go negative.
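A quick illustration of these boundary cases with scikit-learn's r2_score (the numbers here are made up):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(r2_score(y_true, y_true))                     # 1.0: perfect predictions
print(r2_score(y_true, np.full(4, y_true.mean())))  # 0.0: constant mean predictor
print(r2_score(y_true, y_true[::-1]))               # negative: worse than the mean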
• Another way to quantify your goodness-of-fit is to simply take the correlation between your predicted values and the known values in the test data.
• This has the advantage that you can use Pearson, Spearman, or Kendall correlation, depending on how you want to deal with outliers.
• On the other hand, correlation just measures whether your predictions and target values are related; it doesn't measure whether they actually match up (see the example below).
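For example, pandas exposes all three as methods of Series.corr; a small sketch with made-up numbers:

import pandas as pd

preds = pd.Series([1.0, 2.0, 3.0, 4.0])
actual = pd.Series([1.1, 1.9, 3.2, 8.0])  # monotone match, one exaggerated value

print(preds.corr(actual, method='pearson'))   # sensitive to the outlying value
print(preds.corr(actual, method='spearman'))  # rank-based: 1.0
print(preds.corr(actual, method='kendall'))   # rank-based: 1.0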
Regression
Goodness of Fit: R² and Correlation
Correlation of Residuals
The simplest way to assess the quality of our model form is to plot
the known data against a curve of the predicted values. Do they
match up?
[Figure: four scatterplots of y against x, illustrating negative linear correlation, positive linear correlation, no correlation, and nonlinear correlation.]
Multi-variate Linear Regression
Now let’s move on from fitting a curve and into topics that fit more firmly under the “machine learning”
umbrella.
Linear regression is the same process as fitting a line to data, except that we say

ŷ = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_d x_d

that is, the prediction is a linear combination of several input features.
Most of the previous sections carry over directly to this more general case: we fit the data using least squares, we quantify performance using R², and we can also use correlation between predicted and actual values.
Linear Regression
The first big difference is that it’s no longer practical to plot the predicted curve against the actual data
points.
What you can do instead is to make a scatterplot between the known test values and the values predicted
for those test data points.
This allows us to gauge whether our model performs better for larger or smaller values and whether it
suffers from major outliers.
Linear Regression
This script will generate this figure for the linear regression model:

plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Lin. Reg. Corr=%f Rsq=%f' % (corr, rsq))
plt.show()
Linear Regression
We can see that there is a correlation between the predicted and actual numbers, but it is fairly tenuous.
In the example script, we used the normalize() function to rescale all the features. (Strictly speaking, scikit-learn's normalize() scales each sample to unit norm; to give every feature mean 0 and standard deviation 1, you would use StandardScaler.) Putting the features on comparable scales means that, by looking at the relative size of their weights in the linear model, we can get a sense of how related they are to the progression of diabetes.
>>> print(linear.coef_)
[-28.12698694 -33.32069944  85.46294936  70.47966698 -37.66512686
  20.59488356 -14.6726611   33.10813747  43.68434357  -5.50529361]
This suggests that the third and fourth features are particularly interesting if we want to zero in on a few features and examine their relationship to diabetes more closely.
LASSO Regression and Feature Selection
Look at the coefficients in the linear regression again. We are able to identify several features as being more promising than the others as targets of further investigation, but the painful truth is that all of the coefficients except the last one are pretty big. There are two problems with this:
• It makes it harder to pinpoint exactly which features are the most interesting.
• There is a very good chance that the model is overfitted. Many of the moderate-sized coefficients could be set so that they balance each other out, yielding a slightly better fit on the training data itself but generalizing very poorly.
LASSO Regression and Feature Selection
• The idea of LASSO regression is that we still fit a linear model to the data, but we want to penalize nonzero weights (see the loss function below).
• A LASSO regression model takes in a parameter called alpha, which indicates how severely nonzero weights should be punished.
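Concretely, LASSO minimizes the usual sum of squared errors plus an L1 penalty on the weights (the standard formulation, up to a normalization constant):

L(β) = Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ |βⱼ|

With α = 0 this reduces to ordinary least squares; the larger α is, the more weights are driven to exactly zero.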
The difference between plain linear regression and LASSO jumps out when we look at the fitted coefficients:
>>> print(lasso.coef_)
[ -0.         -11.49747021  73.20707164  37.75257628   0.
   0.         -10.36895667   3.70576596  24.17976499   0.        ]
Four of the ten features have weights of precisely 0. Of the remaining features, it is clear that the third is the most relevant to diabetes progression, followed by the fourth and the ninth.
Regression
The following script uses a dataset describing physiological measurements taken from 442 diabetes
patients, with the target variable being an indicator of the progression of their disease.
import sklearn.datasets
import pandas as pd
from matplotlib import pyplot as plt
# train_test_split lives in sklearn.model_selection in modern
# scikit-learn (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import normalize
from sklearn.metrics import r2_score

diabetes = sklearn.datasets.load_diabetes()
X, Y = normalize(diabetes['data']), diabetes['target']
X_train, X_test, Y_train, Y_test = \
    train_test_split(X, Y, test_size=.8)
Regression Example
linear = LinearRegression()
linear.fit(X_train, Y_train)
preds_linear = linear.predict(X_test)
corr_linear = round(pd.Series(preds_linear).corr(
pd.Series(Y_test)), 3)
rsquared_linear = r2_score(Y_test, preds_linear)
print("Linear coefficients:")
print(linear.coef_)
plt.scatter(preds_linear, Y_test)
plt.title("Lin. Reg. Corr=%f Rsq=%f"
% (corr_linear, rsquared_linear))
plt.xlabel("Predicted")
plt.ylabel("Actual")
# add x=y line for comparison
plt.plot(Y_test, Y_test, 'k--')
plt.show()
Regression
lasso = Lasso()
lasso.fit(X_train, Y_train)
preds_lasso = lasso.predict(X_test)
corr_lasso = round(pd.Series(preds_lasso).corr(
pd.Series(Y_test)), 3)
rsquared_lasso = round(
r2_score(Y_test, preds_lasso), 3)
print("Lasso coefficients:")
print(lasso.coef_)
plt.scatter(preds_lasso, Y_test)
plt.title("Lasso. Reg. Corr=%f Rsq=%f"
% (corr_lasso, rsquared_lasso))
plt.xlabel("Predicted")
plt.ylabel("Actual")
# add x=y line for comparison
plt.plot(Y_test, Y_test, ‘k--’)
plt.show()
Optimization
How does one minimize a loss function?
The global minimum or maximum of L(β₀, β₁) must occur at a point where the gradient (slope) is zero:

∂L/∂β₀ = 0 and ∂L/∂β₁ = 0
Derivative definition
A derivative is the instantaneous rate of change of a single-valued function. Given a function f(x), the derivative can be defined as:

f′(x) = lim (h→0) [f(x + h) − f(x)] / h
Partial derivatives
For a loss function L = f(β₀, β₁), the partial derivatives are written as:

∂f/∂β₀ and ∂f/∂β₁

What is the rate of change of the function with respect to one variable, with the others held fixed?
Partial derivative example
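The worked example from the original slides did not survive extraction; a standard stand-in example:

For f(β₀, β₁) = β₀² + 3β₀β₁,
∂f/∂β₀ = 2β₀ + 3β₁ (treating β₁ as a constant),
∂f/∂β₁ = 3β₀ (treating β₀ as a constant).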
Optimization
Summary: Estimate of the regression coefficients
We use MSE as our loss function:

L(β₀, β₁) = (1/n) Σᵢ (yᵢ − (β₀ + β₁xᵢ))²

and choose β₀ and β₁ to minimize it.
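Setting both partial derivatives of the MSE to zero gives the standard closed-form estimates for simple linear regression:

β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄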
References
Introduction to Machine Learning with Python: A Guide for Data Scientists, by Andreas C. Müller & Sarah Guido.
Introduction to Regression, CS109A Introduction to Data Science, by Pavlos Protopapas & Natesh Pillai.