
Regression

This document provides an overview of linear regression models. It defines linear regression as a mathematical model that describes the relationship between an input variable and an output variable. Simple linear regression uses one input variable, while multiple linear regression uses two or more input variables. The document discusses estimating regression coefficients, the coefficient of determination, assumptions of linear regression like linearity and homoskedasticity, and transformations that can be applied to better meet the assumptions. Examples are provided to illustrate key concepts.


Linear Regression

2022/2023

Luís Paquete
University of Coimbra
Contents

● Linear regression model
● Multiple linear regression model
● Coefficient of determination
● Assumptions of linear regression
● Transformations
Regression model

● A mathematical model that describes the behavior of a system over a range of input values.
● A regression model allows one to predict how the system will perform for an input value that was not measured.
● A linear regression model assumes a linear relationship between the input variable and the output variable.
Regression model

● A simple linear regression model has the form

y = a + bx

where x is the input variable, y is the predicted output variable, and a and b are the regression parameters.

● If yi is the value measured for the input value xi, then (xi, yi) can be written as

yi = a + bxi + ei

where ei is the residual for the i-th measurement, that is, the difference between the measured value yi and the value that would have been predicted from the model.
Regression model

● To find the a and b that form the line most closely fitting the n measured data points, minimize the sum of squares of the residuals, SSE:

SSE = Σ ei² = Σ (yi − a − b·xi)²

Setting the partial derivatives ∂SSE/∂a and ∂SSE/∂b to zero yields the closed-form estimates

b = (Σ xi·yi − n·x̄·ȳ) / (Σ xi² − n·x̄²),  a = ȳ − b·x̄

where x̄ and ȳ are the means of the xi and yi.
A side note: Why the sum of squares?

● Why not the sum of absolute differences? That function is not differentiable at 0, so its minimizers cannot be found as easily.

● The sum-of-squares function is differentiable everywhere and convex, that is, a local minimum is also a global minimum. Moreover, a and b can be calculated by a closed formula.
Example

Develop a regression model to relate the time required to perform a file-read operation to the number of bytes read.

File size in bytes   Time in ms
10        3.8
50        8.1
100      11.9
500      55.6
1000     99.6
5000    500.2
10000  1006.1

Fitted model: y = 2.24 + 0.1002 x
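The closed formula for a and b can be checked directly against this table. The course uses R's lm(); the following is a minimal plain-Python sketch of the same computation (the helper name fit_line is ours, not the course's code):

```python
# Least-squares fit of y = a + b*x using the closed-form estimates.
# Data: file sizes (bytes) and measured read times (ms) from the slide.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n          # mean of x
    my = sum(ys) / n          # mean of y
    # b = (sum x_i*y_i - n*mx*my) / (sum x_i^2 - n*mx^2)
    b = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / \
        (sum(x * x for x in xs) - n * mx * mx)
    a = my - b * mx           # the fitted line passes through (mx, my)
    return a, b

sizes = [10, 50, 100, 500, 1000, 5000, 10000]
times = [3.8, 8.1, 11.9, 55.6, 99.6, 500.2, 1006.1]

a, b = fit_line(sizes, times)
print(round(a, 2), round(b, 4))   # matches the slide: 2.24 and 0.1002
```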
Example in R

[R screenshot in the original slides]
Multiple linear regression

● Multiple linear regression extends linear regression to k > 1 independent input variables:

y = b0 + b1 x1 + b2 x2 + ... + bk xk

● Each data point (x1i, x2i, ..., xki, yi) can be expressed as

yi = b0 + b1 x1i + b2 x2i + ... + bk xki + ei

where ei is the residual.
Multiple linear regression

● The sum of squared errors (SSE) is

SSE = Σ ei² = (Y − Xb)ᵀ(Y − Xb)

● Using matrix notation, the multiple linear regression model is

Y = Xb + e

where X is the matrix whose i-th row is (1, x1i, ..., xki), and

b = (XᵀX)⁻¹XᵀY

minimizes SSE.
Example

Develop a regression model to relate the time required to perform a certain number of input-output and memory operations.

IO operations   Mem. operations   Time in ms
10      10      2.8
10     100      3.1
100     10     10.9
100    100     12.6
1000    10    106.2
1000   100    119.1
Example in R

> D <- read.table("regr5.in", header=TRUE)
> lr.out <- lm(D$time ~ D$IO + D$Mem)
> summary(lr.out)

Call:
lm(formula = D$time ~ D$IO + D$Mem)

Residuals:
      1       2       3       4       5       6
 2.9144 -1.7523  0.9941 -2.2725 -3.9086  4.0248

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.779630   2.947538  -0.604    0.589
D$IO         0.111336   0.003698  30.104 8.05e-05 ***
D$Mem        0.055185   0.036737   1.502    0.230
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.049 on 3 degrees of freedom
Multiple R-squared: 0.9967, Adjusted R-squared: 0.9945
F-statistic: 454.2 on 2 and 3 DF, p-value: 0.0001888

Fitted model: y = -1.780 + 0.111 x1 + 0.055 x2
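As a sanity check on b = (XᵀX)⁻¹XᵀY, the sketch below builds the normal equations (XᵀX)b = XᵀY for this data set and solves them with a small Gaussian-elimination helper. This is plain Python rather than the slides' R, and the helper name solve is ours:

```python
# Solve the normal equations (X^T X) b = X^T Y for the IO/memory data.

def solve(A, v):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]      # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

io  = [10, 10, 100, 100, 1000, 1000]
mem = [10, 100, 10, 100, 10, 100]
y   = [2.8, 3.1, 10.9, 12.6, 106.2, 119.1]

X = [[1.0, a, b] for a, b in zip(io, mem)]   # design matrix with intercept column
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
XtY = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

b0, b1, b2 = solve(XtX, XtY)
print(round(b0, 3), round(b1, 3), round(b2, 3))  # ≈ -1.78, 0.111, 0.055, as in summary(lr.out)
```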
Multivariate linear regression

● Multivariate linear regression extends linear regression to m > 1 dependent variables:

Y = B0 + B1 x1 + B2 x2 + ... + Bk xk

● For each data point (x1i, x2i, ..., xki) and each output j = 1, ..., m, the measured value yij can be expressed as

yij = b0j + b1j x1i + b2j x2i + ... + bkj xki + eij

where eij is the residual.


Coefficient of determination

● Determines how much of the total variation is "explained" by the linear model.
● SST is the total variation of the measured system output:

SST = Σ (yi − ȳ)²

It is partitioned into two components, SST = SSR + SSE:

SSR: the portion of SST that is explained by the regression model
SSE: the portion of SST that is due to measurement error
Coefficient of determination

● The coefficient of determination r² is the fraction of SST "explained" by the model:

r² = SSR / SST = 1 − SSE / SST

● If r² = 0, then SSE is as large as SST
● If r² = 1, then SSE is 0
Coefficient of correlation

● The coefficient of determination is the squared value of the coefficient of correlation of x and y.

● It allows one to investigate whether the correlation between input and output is positive (0 < r ≤ 1) or negative (−1 ≤ r < 0), and it indicates the strength of the linear relation.
Coefficient of correlation

● A side note: correlation does not imply causation.
Example

Develop a regression model to relate the time required to perform a file-read operation to the number of bytes read.

File size in bytes   Time in ms
10        3.8
50        8.1
100      11.9
500      55.6
1000     99.6
5000    500.2
10000  1006.1

Fitted model: y = 2.24 + 0.1002 x,  r² = 0.9996
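The coefficient of determination can be recomputed from SSE and SST for this fit. A minimal plain-Python sketch (using the closed-form slope and intercept, not the course's R code):

```python
# Compute r^2 = 1 - SSE/SST for the simple linear fit of the file-read data.

sizes = [10, 50, 100, 500, 1000, 5000, 10000]
times = [3.8, 8.1, 11.9, 55.6, 99.6, 500.2, 1006.1]

n = len(sizes)
mx = sum(sizes) / n
my = sum(times) / n
b = (sum(x * y for x, y in zip(sizes, times)) - n * mx * my) / \
    (sum(x * x for x in sizes) - n * mx * mx)
a = my - b * mx

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(sizes, times))  # unexplained variation
sst = sum((y - my) ** 2 for y in times)                          # total variation
r2 = 1 - sse / sst
print(r2)   # very close to 1: virtually all variation is explained by the line
```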
Example in R

[R screenshot in the original slides]
Assumptions of linear regression

● A more complete examination of the underlying assumptions of linear regression may indicate whether the model can be used for prediction (inference).
● In R, the linear regression assumptions can be checked with

plot(<linear model>)
Assumptions of linear regression

The residuals-vs-fitted plot allows one to verify:

● Linearity: the mean residual value for every fitted-value region (red line) should be close to 0.
● Homoskedasticity (constant variance): the spread of the residuals should be approximately the same across the x-axis.
● Outliers: identify extreme residuals.

The normal Q-Q plot is used to verify the normality of the residuals.
Example:

[Diagnostic plots in the original slides]
Transformations

● A way of overcoming problems with the assumptions is to transform the data.

Rule of Thumb 1: Transforming y may correct problems with the error terms.
Rule of Thumb 2: Transforming x may correct non-linearity.

● However, a transformed model may be harder to interpret.
Transformations

Example (D. Bruce and F. X. Schumacher, 1935):

● Predict the volume of a tree (y) from its diameter (x)

y = −41.57 + 6.93 x,  r² = 0.89
Transformations

Example (D. Bruce and F. X. Schumacher, 1935):

● Predict the log of the volume of a tree (ln y) from the log of its diameter (ln x)

ln y = −2.87 + 2.56 ln x,  r² = 0.97
Transformations

● It is also possible to deduce a suitable transformation by plotting the data or by making some assumption about the process that generates the y values.
● For instance, if exponential behavior is expected, such as

y = a·b^x

then taking the logarithm of both sides,

ln y = ln a + (ln b)·x

gives an expression in linear form:

y' = a' + b'·x
Example

Develop a regression model for the number of transistors in each of the following years.

Year   Transistors
1        9500
2       16000
3       23000
4       38000
5       62000
6      105000
Example in R

[R screenshot in the original slides]
Example

Develop a regression model for the number of transistors in each of the following years.

Year   ln(Transistors)
1       9.1590
2       9.6803
3      10.0432
4      10.5453
5      11.0349
6      11.5617

Fitting the transformed data gives a' = 8.679 and b' = 0.474, that is,

y' = 8.679 + 0.474 x
Example in R

After the ln(transistors) transformation, the r² value is much closer to 1.
Example

Develop a regression model for the number of transistors in each of the following years.

Year   Transistors
1        9500
2       16000
3       23000
4       38000
5       62000
6      105000

Back-transforming the parameters:

b' = 0.474  →  b = e^b' = 1.61
a' = 8.679  →  a = e^a' = 5878

y = 5878 · 1.61^x
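The two-step procedure (fit the logs, then exponentiate the coefficients) can be sketched as follows, in plain Python rather than the slides' R:

```python
import math

# Fit ln(y) = a' + b'*x, then back-transform to y = a * b**x.

years = [1, 2, 3, 4, 5, 6]
transistors = [9500, 16000, 23000, 38000, 62000, 105000]
logs = [math.log(t) for t in transistors]   # natural log of the outputs

n = len(years)
mx = sum(years) / n
my = sum(logs) / n
b_log = (sum(x * y for x, y in zip(years, logs)) - n * mx * my) / \
        (sum(x * x for x in years) - n * mx * mx)
a_log = my - b_log * mx

a, b = math.exp(a_log), math.exp(b_log)     # back-transform the parameters
print(round(a_log, 2), round(b_log, 3))     # ≈ 8.68 and 0.474 (slide: a' = 8.679, b' = 0.474)
print(round(b, 2))                          # growth factor ≈ 1.61 per year
```

Back-transforming gives y ≈ 5878 · 1.61^x as on the slide (small differences come from rounding a' before exponentiating).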
Example

Develop a regression model for the relation between CPU time and number of processors.

Processors   CPU-time
1   100
2    54
3    25
4    18
5    15
6    12
7    10
8    12
9     8
Example in R

[R screenshot in the original slides]
Example

Reciprocal transformation: y⁻¹ = a + b x

Processors   CPU-time⁻¹
1   0.01
2   0.02
3   0.04
4   0.06
5   0.07
6   0.08
7   0.10
8   0.08
9   0.13
Example in R

[R screenshot in the original slides]
Example

Develop a regression model for the relation between CPU time and number of processors.

Processors   CPU-time
1   100
2    54
3    25
4    18
5    15
6    12
7    10
8    12
9     8

Fitted model: y = (−0.002 + 0.013 x)⁻¹
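The reciprocal fit can be reproduced in the same way as the earlier examples; a minimal plain-Python sketch (not the course's R code):

```python
# Fit 1/y = a + b*x for the CPU-time vs. processors data.

procs = list(range(1, 10))
cpu = [100, 54, 25, 18, 15, 12, 10, 12, 8]
recip = [1 / t for t in cpu]                 # reciprocal transformation of y

n = len(procs)
mx = sum(procs) / n
my = sum(recip) / n
b = (sum(x * y for x, y in zip(procs, recip)) - n * mx * my) / \
    (sum(x * x for x in procs) - n * mx * mx)
a = my - b * mx
print(round(a, 3), round(b, 3))              # ≈ -0.002 and 0.013, as on the slide
```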
Example

Develop a regression model for the CPU time of binary search given a list size.

Size   CPU-time
1   6.91
2   7.60
3   8.00
4   8.29
5   8.52
6   8.70
7   8.85
8   8.99
9   9.01
Example in R

[R screenshot in the original slides]
Example

Logarithmic transformation: y = a + b log x (natural logarithm)

log Size   CPU-time
0.00   6.91
0.69   7.60
1.10   8.00
1.39   8.29
1.61   8.52
1.79   8.70
1.95   8.85
2.08   8.99
2.20   9.01
Example in R

[R screenshot in the original slides]
Example

Develop a regression model for the CPU time of binary search given a list size.

Size   CPU-time
1   6.91
2   7.60
3   8.00
4   8.29
5   8.52
6   8.70
7   8.85
8   8.99
9   9.01

Fitted model: y = 6.92 + 0.98 log x
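A minimal plain-Python sketch of the logarithmic fit (natural log, matching the transformed table above; not the course's R code):

```python
import math

# Fit y = a + b*ln(x) for the binary-search timing data.

sizes = list(range(1, 10))
cpu = [6.91, 7.60, 8.00, 8.29, 8.52, 8.70, 8.85, 8.99, 9.01]
logs = [math.log(s) for s in sizes]          # logarithmic transformation of x

n = len(sizes)
mx = sum(logs) / n
my = sum(cpu) / n
b = (sum(x * y for x, y in zip(logs, cpu)) - n * mx * my) / \
    (sum(x * x for x in logs) - n * mx * mx)
a = my - b * mx
print(round(a, 2), round(b, 2))              # ≈ 6.92 and 0.98, as on the slide
```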


Example

Develop a regression model for the CPU time of insertion sort.

Size   CPU-time
1    2
2    1
3    6
4   14
5   15
6   30
7   40
8   74
9   75
Example in R

[R screenshot in the original slides]
Example

Square root transformation: y^(1/2) = a + b x

Size   CPU-time^(1/2)
1   1.41
2   1.00
3   2.45
4   3.74
5   3.87
6   5.48
7   6.32
8   8.60
9   8.66
Example in R

[R screenshot in the original slides]
Example

Develop a regression model for the CPU time of insertion sort.

Size   CPU-time
1    2
2    1
3    6
4   14
5   15
6   30
7   40
8   74
9   75

Fitted model: y = (−0.49 + 1.02 x)²
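A minimal plain-Python sketch of the square-root fit on this data (not the course's R code):

```python
import math

# Fit sqrt(y) = a + b*x for the insertion-sort timing data.

sizes = list(range(1, 10))
cpu = [2, 1, 6, 14, 15, 30, 40, 74, 75]
roots = [math.sqrt(t) for t in cpu]          # square-root transformation of y

n = len(sizes)
mx = sum(sizes) / n
my = sum(roots) / n
b = (sum(x * y for x, y in zip(sizes, roots)) - n * mx * my) / \
    (sum(x * x for x in sizes) - n * mx * mx)
a = my - b * mx
print(round(a, 2), round(b, 2))              # ≈ -0.49 and 1.02, as on the slide
```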


Recap

● The linear regression model assumes a linear relationship between the input variable and the output variable.
● The multiple linear regression model deals with more than one input variable.
● The coefficient of determination is the fraction of the total variation that is explained by the linear model.
● The assumptions of linear regression need to be met to ensure that the model can be used for inference (e.g., prediction).
● Transformations can be applied to model polynomial, exponential, or inverse relationships, but some care must be taken in interpreting the resulting model.
References

● D. J. Lilja, Measuring Computer Performance, Cambridge University Press, 2002 (see chapter 8)
● C. C. McGeoch, A Guide to Experimental Algorithmics, Cambridge University Press, 2012 (see chapter 7)
● J. Faraway, Practical Regression and Anova using R, chapter 8
● D. Bruce and F. X. Schumacher, Forest Mensuration, Botanical Gazette, 1935
● J. W. Tukey, Exploratory Data Analysis, Addison-Wesley, 1977
● F. Mosteller and J. W. Tukey, Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, 1977
