
CENG313 Introduction to Data Science
Lecture 13: Regression

Instructor: Assist. Prof. Ceren Güzel Turhan


Regression

• Regression is similar to classification: you have a number of input features, and you want to predict an output feature. In classification, this output feature is either binary or categorical. With regression, it is a real-valued number.

Typically, regression algorithms model the output as a linear combination of the inputs.
Regression

• Ex: Given the following values of X and Y as (X, Y):( 1,1), (2,2), (4,4), (100,100), (20,
20)
• What is the value of Y when X is 5?
• The answer is : 5. Not very difficult, right?

• Now, let’s take a look at different example. Say you have the following pairs of X
and Y: (1,1), (2,4), (4,16), (100,10000), (20, 400). Can you calculate the value of Y
when X is 5?
• The answer is : 25. Was it difficult?

Ref: https://medium.com/@amarbudhiraja
Regression

• Let's understand a bit as to what happened in the above examples.

• When we look at the first example, after looking at the given pairs, one can establish that the relationship between X and Y is Y = X.
• Similarly, in the second example, the relationship is Y = X*X.
• In these two examples, we can determine the relationship between the two given variables (X and Y) because we could easily identify the relationship between them.

• Your computer looks at some examples and then tries to identify "the most suitable" relationship between the sets X and Y. Using this identified relationship, it will try to predict Y for new examples for which you don't know Y.

Ref: https://medium.com/@amarbudhiraja
What is “Linear”?
You have a collection of (x, y) pairs, and you try to fit a line to them of the form Y = mX + B.
Fitting a line to data.

Remember this: Y = mX + B?

Regression
Plot the points out, fit a line to them by eye, trace the line with a ruler, and use that to pull out m and b.

Each of the four datasets (Anscombe's quartet) has the same line of best fit and the same quality of fit.
Estimate of the regression coefficients
For a given data set: is this line good? Maybe this one? Or this one? (Each slide shows a different candidate line drawn through the same scatter of points.)

Estimate of the regression coefficients
Question: Which line is the best?

As before, for each observation $(x_i, y_i)$ the absolute residual, $r_i = |y_i - \hat{y}_i|$, quantifies the error at that observation.
Estimate of the regression coefficients

The standard way to fit a line is called least squares.

Least squares works by picking the values of m and b that minimize the "loss function", which adds up an error term across all of the points.

The line that best fits a set of data points is the one having the smallest possible sum of squared errors.
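The loss function itself appears on the slide as an image; written out (a standard reconstruction, using the slide's m and b), it is

$$L(m, b) = \sum_{i=1}^{n} \big( y_i - (m x_i + b) \big)^2$$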
Linear Regression Example
Which of the 3 lines captures the pattern in the data in the best possible way?
Need to compute the sum of the squares

For the line Y= 4X+10:
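The actual data points and the worked table are in the slide figure; as an illustration only (with made-up points, not the slide's), the sum of squared errors for the candidate line Y = 4X + 10 could be computed like this:

import numpy as np

# hypothetical data points, NOT the ones from the slide
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([15.0, 17.0, 23.0, 25.0])

y_hat = 4.0 * x + 10.0           # predictions from the candidate line
sse = np.sum((y - y_hat) ** 2)   # sum of squared errors for this line
print(sse)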


Linear Regression Example (Cont)
Regression

The standard way to fit a line is called least squares. In Python, a line can be fit using the LinearRegression class, and the fitted coefficients can be read off in the following way:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> x = np.array([[0.0], [1.0], [2.0]])
>>> y = np.array([1.0, 2.0, 2.9])
>>> lm = LinearRegression().fit(x, y)
>>> lm.coef_  # m
array([ 0.95])
>>> lm.intercept_  # b
1.0166666666666671
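As a quick follow-up (my addition, not from the slide), the fitted model can then predict y for new values of x; with m = 0.95 and b ≈ 1.017 the prediction at x = 3 is about 3.87:

>>> lm.predict(np.array([[3.0]]))
array([ 3.86666667])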
Linear Regression to predict Diabetes
Regression
The key thing to understand here is that this loss function makes least squares regression
extremely sensitive to outliers in the data:
• Three deviations of size 5 will give a loss of 75, but just a single larger deviation of
size 10 will give the larger penalty of size 100.

Linear regression will bend the parameters so as to avoid large deviations of even a single
point, which makes it unsuitable in situations where a handful of large deviations are to be
expected.

An alternative approach that is better suited to data with outliers is to use the following loss function, which sums the absolute values of the errors rather than their squares:

$$L(m, b) = \sum_{i=1}^{n} \left| y_i - (m x_i + b) \right|$$
Regression

• Here we just take the absolute values of the different error terms and add them
• This is called “L1 regression,” among other names.

• Outliers will still have an impact, but it is not as egregious as with least squares (L2).

• On the other hand, L1 regression penalizes small deviations from expectation more
harshly compared to least squares, and it is significantly more complicated to
implement computationally.
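A minimal sketch of L1 (least absolute deviations) regression, using scipy's general-purpose optimizer; this is my own illustration under assumed toy data, not the lecture's code:

import numpy as np
from scipy.optimize import minimize

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 25.0])   # note the outlier at x = 4

def l1_loss(params):
    m, b = params
    return np.sum(np.abs(y - (m * x + b)))  # sum of absolute residuals

# Nelder-Mead is used because the L1 loss is not differentiable everywhere
result = minimize(l1_loss, x0=[0.0, 0.0], method="Nelder-Mead")
m_l1, b_l1 = result.x
print(m_l1, b_l1)  # the slope stays close to ~1 despite the outlier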
Regression
Fitting Nonlinear Curves

• Fitting a curve to data is a ubiquitous problem, not just in data science but in engineering and the sciences in general.
• Often, there are good a priori reasons to expect a certain functional form, and extracting the best-fit parameters will tell us something very meaningful about the system we are studying.

A few examples:
• Exponential decay to some baseline. This is useful for modeling many processes where a system starts in some kind of agitated state and decays to a baseline.
Regression
Fitting Nonlinear Curves

Exponential growth:

Logistic growth:

Polynomials of various degrees:
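The formulas for these curves appear on the slide as images; as a reconstruction (with parameter names of my own choosing), the standard forms are

$$\text{exponential decay to a baseline: } y = C + A e^{-x/\tau}, \qquad \text{exponential growth: } y = A e^{k x}$$

$$\text{logistic growth: } y = \frac{L}{1 + e^{-k(x - x_0)}}, \qquad \text{polynomial of degree } n: \; y = a_0 + a_1 x + \dots + a_n x^n$$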

Least squares is the typical approach in all of these cases, where we pick the parameters so as to minimize the loss function.
Regression
Fitting Nonlinear Curves

The following script creates some data of the form y = 2 + 3x², adds some noise to it, and then uses curve_fit to fit a curve of the form y = a + bx² to the data.

import numpy as np
from scipy.optimize import curve_fit

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 + 3.0 * xs * xs + 0.2 * np.random.uniform(size=xs.shape)  # add noise to each point

def calc(x, a, b):
    return a + b * x * x

cf = curve_fit(calc, xs, ys)
best_fit_params = cf[0]

Running it, one run found a = 2.33677376 and b ≈ 3.0; the exact values will vary with the random noise.
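As a brief follow-up (my addition), the fitted parameters can be unpacked and used to evaluate the curve at new points:

a, b = best_fit_params
print(calc(5.0, a, b))  # predicted y at x = 5, roughly 2.3 + 3.0 * 25 ≈ 77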


Regression
Goodness of Fit: R2 and Correlation

When assessing the quality of a fitted curve, there are two questions we want to answer:
• How accurately can we predict values?
• We assumed that the data followed some functional form. Was that even a good assumption?

• The standard way to answer the first of the questions is called R2, pronounced “R squared.”
• R2 is often described as the fraction of the variance that is accounted for by the model.
• A value of 1.0 means a perfect match, and a value of 0 means you didn’t capture any of the
variation.
• In some cases (there are a few different definitions of R2 floating around), it can even take on
negative values.
Regression
Goodness of Fit: R2 and Correlation

The calculation of R2 is based on two concepts.

The total variation:

$$V_{\text{total}} = \sum_{i} (y_i - \bar{y})^2$$

where ȳ is the average of all the y values in your data.

The residual variation:

$$V_{\text{residual}} = \sum_{i} (y_i - \hat{y}_i)^2$$

where ŷᵢ is the model's prediction for the i-th data point.

Regression
Goodness of Fit: R2 and Correlation

These allow us to say, in a precise sense, that your fitted model accounts for a certain percentage of the variation in the data. The definition of R2 is then

$$R^2 = 1 - \frac{V_{\text{residual}}}{V_{\text{total}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

and you can see it as the fraction of all variation that is captured by the model.

Of course, taking the squares of the residuals isn’t necessarily the “right” way to quantify
variation, but it is the most standard option.
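A minimal sketch (my own, not from the slides) of computing R2 directly from these definitions and checking it against sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

total_variation = np.sum((y_true - y_true.mean()) ** 2)   # sum of squares around the mean
residual_variation = np.sum((y_true - y_pred) ** 2)       # sum of squared residuals
rsq_manual = 1.0 - residual_variation / total_variation

print(rsq_manual, r2_score(y_true, y_pred))  # the two values agree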
Regression
Goodness of Fit: R2 and Correlation

Despite looking like a square, technically R2 can be negative if your model is truly abysmal.
Having R2 = 0 is what you would see if you just defined your fitted function to return the
average of y as a constant value.

You can think of this as the crudest way to fit a function to data. Do any worse than this, and your R2
score will go negative.

from sklearn.metrics import r2_score

rsquared_linear = r2_score(Y_test, preds_linear)
Regression
Goodness of Fit: R2 and Correlation

• Another way to quantify your goodness-of-fit is to simply take the correlation between your
predicted values and the known values in the test data.
• This has the advantage that you can use Pearson, Spearman, or Kendall correlation, depending
on how you want to deal with outliers.
• On the other hand though, correlation just measures whether your predictions and target
values are related; it doesn’t measure whether they actually match up.
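As a hedged illustration (not the lecture's code), scipy.stats provides all three correlation measures, and the rank-based ones are less affected by outliers:

import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

y_true = np.array([3.0, 5.0, 7.0, 9.0, 40.0])   # the last point is an outlier
y_pred = np.array([2.8, 5.3, 6.9, 9.2, 15.0])

print(pearsonr(y_true, y_pred)[0])    # sensitive to the outlier
print(spearmanr(y_true, y_pred)[0])   # rank-based, more robust
print(kendalltau(y_true, y_pred)[0])  # also rank-based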
Regression
Goodness of Fit: R2 and Correlation

Correlation of Residuals

The main ways to measure goodness-of-fit in regression situations are R2 and correlation between predictions and targets.

The simplest way to assess the quality of our model form is to plot the known data against a curve of the predicted values. Do they match up?

In Anscombe's quartet, for example, it is visually clear that a linear model is the correct way to approach the first dataset, but the wrong way to approach the second one.
Remember: Types of Correlation

(Four example scatterplots.) Negative linear correlation: as x increases, y tends to decrease. Positive linear correlation: as x increases, y tends to increase. The remaining panels show no correlation and nonlinear correlation.
Multi-variate Linear Regression
Now let’s move on from fitting a curve and into topics that fit more firmly under the “machine learning”
umbrella.

Linear regression is the same process as fitting a line to data, except that we say

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d$$

where d is the number of input features we have.

Most of the previous sections carry over directly to this more general case: we fit the data using least
squares, we quantify performance using R2, and we can also use correlation between predicted and actual
values.
Linear Regression
The first big difference is that it’s no longer practical to plot the predicted curve against the actual data
points.

What you can do instead is to make a scatterplot between the known test values and the values predicted
for those test data points.

This allows us to gauge whether our model performs better for larger or smaller values and whether it
suffers from major outliers.
Linear Regression
This script will generate the scatterplot figure for the linear regression model (it assumes y_test and y_pred from a fitted model, and matplotlib.pyplot imported as plt):

from sklearn.metrics import r2_score
from sklearn.feature_selection import r_regression

rsq = r2_score(y_test, y_pred)
# r_regression expects a 2D feature matrix, so reshape the test targets
corr = r_regression(y_test.reshape(-1, 1), y_pred)[0]

# Create scatter plot with actual and predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Lin. Reg. Corr=%f Rsq=%f' % (corr, rsq))
plt.show()
Linear Regression
We can see that there is a clear correlation between the predicted and actual numbers, but it is fairly tenuous.

In particular, we can see that there are a number of data points where the actual value was substantially below our predictions.

In fact, the fitted line as a whole looks slightly shallower than the data itself.

Together, these suggest that there are a number of anomalously low data points, which are pulling our overall predictions lower than perhaps they should be.
Linear Regression
The other thing that we can do with linear regression is use it to identify features in the data that are particularly
interesting.

In the example script, we rescaled the features with the normalize() function before fitting. (Strictly speaking, sklearn's normalize() scales each sample to unit norm; scaling each feature to mean 0 and standard deviation 1 would instead use StandardScaler.) With the features on comparable scales, the relative size of the weights in the linear model gives a sense of how related each feature is to the progression of diabetes.

>>> print(linear.coef_)
[-28.12698694 -33.32069944  85.46294936  70.47966698 -37.66512686
  20.59488356 -14.6726611   33.10813747  43.68434357  -5.50529361]

This suggests that the third and fourth features are particularly interesting, if we want to zero in on and examine
their relationship to diabetes more closely.
LASSO Regression and Feature Selection
Look at the coefficients in the linear regression again. We are able to identify several features as being more promising than the others as targets of further investigation, but the painful truth is that all of the coefficients except the last one are pretty big. There are two problems with this:

• It makes it harder to pinpoint exactly which features are the most interesting.
• There is a very good chance that the model is overfitted. Many of the moderate-sized coefficients could be set so that they balance each other out, yielding a slightly better fit on the training data itself but generalizing very poorly.
LASSO Regression and Feature Selection

• The idea of LASSO regression is that we still fit a linear model to the data, but we
want to penalize nonzero weights.

• A LASSO regression model takes in a parameter called alpha, which indicates how severely nonzero weights should be punished.

• Setting alpha to 0 reduces LASSO to ordinary linear regression.

• The default value, which was used in the example script, is 1.0 (see the brief sketch below for how alpha affects the coefficients).
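A minimal sketch (my own, assuming the X_train and Y_train diabetes split from the example script later in this lecture) of how alpha controls sparsity:

import numpy as np
from sklearn.linear_model import Lasso

# assumes X_train, Y_train from the diabetes example later in the lecture
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_train, Y_train)
    n_zero = np.sum(model.coef_ == 0.0)
    print("alpha=%s: %d of %d coefficients are exactly zero"
          % (alpha, n_zero, len(model.coef_)))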
LASSO Regression and Feature Selection

The sample script produces the same scatterplot and performance metrics that were created for linear regression. We can see that the predicted/actual scatterplot hugs the middle line a little more closely, suggesting a better fit.
LASSO Regression and Feature Selection

The difference between linear and LASSO regression jumps out when we look at the fitted coefficients:

>>> print(lasso.coef_)
[ -0.         -11.49747021  73.20707164  37.75257628   0.           0.
 -10.36895667   3.70576596  24.17976499   0.        ]

Four of the ten features have weights of precisely 0. Of the remaining features, it is clear that the third is the most relevant to diabetes progression, followed by the fourth and the ninth.
Regression

Example: Predicting Diabetes Progression

The following script uses a dataset describing physiological measurements taken from 442 diabetes
patients, with the target variable being an indicator of the progression of their disease.

import sklearn.datasets
import pandas as pd
Regression Example
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import normalize
from sklearn.metrics import r2_score
diabetes = sklearn.datasets.load_diabetes()
X, Y = normalize(diabetes['data']), diabetes['target']
X_train, X_test, Y_train, Y_test = \
train_test_split(X, Y, test_size=.8)
Regression Example
linear = LinearRegression()
linear.fit(X_train, Y_train)
preds_linear = linear.predict(X_test)
corr_linear = round(pd.Series(preds_linear).corr(
pd.Series(Y_test)), 3)
rsquared_linear = r2_score(Y_test, preds_linear)
print("Linear coefficients:")
print(linear.coef_)
plt.scatter(preds_linear, Y_test)
plt.title("Lin. Reg. Corr=%f Rsq=%f"
% (corr_linear, rsquared_linear))
plt.xlabel("Predicted")
plt.ylabel("Actual")
# add x=y line for comparison
plt.plot(Y_test, Y_test, 'k--')
plt.show()
Regression
lasso = Lasso()
lasso.fit(X_train, Y_train)
preds_lasso = lasso.predict(X_test)
corr_lasso = round(pd.Series(preds_lasso).corr(
pd.Series(Y_test)), 3)
rsquared_lasso = round(
r2_score(Y_test, preds_lasso), 3)
print("Lasso coefficients:")
print(lasso.coef_)
plt.scatter(preds_lasso, Y_test)
plt.title("Lasso. Reg. Corr=%f Rsq=%f"
% (corr_lasso, rsquared_lasso))
plt.xlabel("Predicted")
plt.ylabel("Actual")
# add x=y line for comparison
plt.plot(Y_test, Y_test, 'k--')
plt.show()
Regression

>>> reg.fit(X, y)

How do we solve for the coefficients in $y = \beta_1 x + \beta_0$? What values should $\beta_1$ and $\beta_0$ take?
Optimization
How does one minimize a loss function?
The global minima or maxima of $L(\beta_0, \beta_1)$ must occur at a point where the gradient (slope) is zero:

$$\frac{\partial L}{\partial \beta_0} = 0, \qquad \frac{\partial L}{\partial \beta_1} = 0$$

• Brute Force: try every combination
• Exact: solve the above equations analytically
• Greedy Algorithm: gradient descent
Derivative definition
A derivative is the instantaneous rate of change of a single-valued function. Given a function f(x), the derivative can be defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Partial derivatives
For a loss function $L = f(\beta_0, \beta_1)$, the partial derivatives are written as:

$$\frac{\partial f}{\partial \beta_0}, \qquad \frac{\partial f}{\partial \beta_1}$$

What is the rate of change of the function with respect to one variable, with the others held fixed?
Partial derivative example

Looks like we're going to need the chain rule, but what is it? I forget!

(The slides work through the partial derivatives $\partial f / \partial \beta_0$ and $\partial f / \partial \beta_1$ of the loss step by step.)
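The worked derivation itself is an image on the slides; as a reconstruction, assuming the loss is the MSE $L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$, the chain rule gives

$$\frac{\partial L}{\partial \beta_0} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right), \qquad \frac{\partial L}{\partial \beta_1} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right)$$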


Optimization

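The gradient-descent illustration on this slide is an image; here is a minimal sketch (my own, not the lecture's code) of gradient descent on the MSE loss for $y = \beta_1 x + \beta_0$:

import numpy as np

# toy data roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

b0, b1 = 0.0, 0.0          # initial guesses for intercept and slope
learning_rate = 0.01

for _ in range(5000):
    residuals = y - (b0 + b1 * x)
    grad_b0 = -2.0 * residuals.mean()          # dL/d(beta_0) for the MSE loss
    grad_b1 = -2.0 * (x * residuals).mean()    # dL/d(beta_1) for the MSE loss
    b0 -= learning_rate * grad_b0              # step downhill
    b1 -= learning_rate * grad_b1

print(b0, b1)  # should end up close to 1 and 2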
Optimization
Summary: Estimate of the regression coefficients

We use MSE as our loss function:

$$L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$

We choose β₀ and β₁ in order to minimize the predictive errors made by our model, i.e. to minimize our loss function.

Then the optimal values for β₀ and β₁ are:

$$\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$

Finding the exact solution only works in rare cases; linear regression is one of those rare cases. We call this FITTING or TRAINING the model.
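A small sketch (my own illustration) verifying this closed-form solution against sklearn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# closed-form (exact) solution for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

lm = LinearRegression().fit(x.reshape(-1, 1), y)
print(beta1, lm.coef_[0])    # the slopes agree
print(beta0, lm.intercept_)  # the intercepts agree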
References
Introduction to Machine Learning with Python: A Guide for Data Scientists, by Andreas C. Müller & Sarah Guido.

Introduction to Regression, CS109A Introduction to Data Science, by Pavlos Protopapas & Natesh Pillai.