
COURSE NOTES: REGRESSION ANALYSIS
What is linear regression?

Regression analysis is one of the most widely used methods for prediction. Linear regression is probably the most
fundamental machine learning method out there and a starting point for the advanced analytical learning path of
every aspiring data scientist.

A linear regression is a linear approximation of a causal relationship between two or more variables.
Regression models are highly valuable, as they are one of the most common ways to make inferences and
predictions. Apart from this, regression analysis is also employed to determine and assess factors that affect a
certain outcome in a meaningful way.

Like many other statistical techniques, regression models help us make predictions about the population based on
sample data.

Get sample data → Design a model → Make predictions about the whole population
Linear regression model and equation:

ŷ = b0 + b1*x1

where:
ŷ: the estimated (predicted) value of the dependent variable
b0: the constant (the estimate of the intercept)
b1: the coefficient of the slope
x1: the sample data for the independent variable

Note: When we refer to population models, we use Greek letters (β0, β1) instead of b0, b1.
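To connect the equation to code, here is a minimal sketch that estimates b0 and b1 by least squares with NumPy and then predicts a new value; the numbers are made up purely for illustration:

# A minimal sketch of the sample regression equation y_hat = b0 + b1*x1,
# fitted with NumPy least squares; the data are made up for illustration.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # sample data, independent variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # observed dependent variable

X = np.column_stack([np.ones_like(x1), x1])     # constant column plus x1
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]   # coefficient estimates

y_hat = b0 + b1 * 6.0   # predicted value for a new, unseen x1 = 6.0
print(f"b0={b0:.3f}, b1={b1:.3f}, prediction at x1=6: {y_hat:.3f}")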

Geometrical representation of linear regression

ŷi = b0 + b1*xi

Graphically, the regression line crosses the y-axis at the intercept (b0) and rises or falls according to the slope (b1). For each observed value xi, the vertical distance between the observation and the fitted line is the estimator of the error (êi).

*On average the expected value of the error is 0, which is why it is not included in the regression equation.
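To make the picture concrete, here is a sketch of the closed-form OLS estimates behind the figure, with made-up data; it also confirms that the residuals (the error estimators) average out to zero, as the note above says:

# Closed-form OLS estimates for a simple regression; toy data only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

e_hat = y - (b0 + b1 * x)   # residuals: vertical distances to the fitted line
print(e_hat.mean())         # ~0, matching the note above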
Correlation vs Regression

Correlation:
- Represents the relationship between two variables
- Shows that two variables move together (no matter in which direction)
- Symmetrical with respect to the two variables: ρ(x,y) = ρ(y,x)
- A single point (a number)

Regression:
- Represents the relationship between two or more variables
- Shows cause and effect (one variable is affected by the other)
- One way: there is always only one variable that is causally dependent
- A line (in 2D space)
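A short sketch of this contrast, with made-up data: the correlation coefficient is symmetric in x and y, while the regression slope depends on which variable we treat as dependent.

# Correlation is symmetric; regression is directional. Toy data only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Correlation: rho(x, y) equals rho(y, x).
print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])

# Regression: the slope from regressing y on x is not simply the
# inverse of the slope from regressing x on y.
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]
print(slope_y_on_x, 1 / slope_x_on_y)   # differ unless correlation is perfect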


Summary table and important regression metrics

R-squared: the variability of the data explained by the regression model. Range: [0; 1].

Adjusted R-squared: the variability of the data explained by the regression model, considering the number of independent variables. Range: at most 1; it can be negative, but a negative value is interpreted as 0.

Dependent variable (y): the variable we are trying to predict.

P-value of the F-statistic: the F-statistic evaluates the overall significance of the model (if at least one predictor is significant, the F-statistic is also significant).

Coefficient of the intercept (b0): sometimes referred to as the constant or bias, as it 'corrects' the regression equation with a constant value.

P-value of the t-statistic: the t-statistic of a coefficient shows whether the corresponding independent variable is significant or not.

Coefficient of the independent variable i (bi): usually the most important metric; it shows us the relative/absolute contribution of each independent variable to our model.
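As one way to see where these metrics live in a typical regression summary, here is a minimal sketch; the use of statsmodels and the data are illustrative assumptions, not part of the original notes:

# Fitting an OLS model and reading off the summary metrics; toy data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
y = 2.0 + 0.5 * x1 + rng.normal(scale=0.3, size=100)

X = sm.add_constant(x1)          # adds the b0 (constant) column
results = sm.OLS(y, X).fit()

print(results.rsquared)          # R-squared
print(results.rsquared_adj)      # adjusted R-squared
print(results.f_pvalue)          # p-value of the F-statistic
print(results.params)            # [b0, b1]
print(results.pvalues)           # p-values of the t-statistics
print(results.summary())         # the full summary table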
OLS assumptions
OLS (ordinary least squares) is one of the most common methods for estimating the linear regression equation. However,
its simplicity means it cannot always be used. Therefore, all OLS regression assumptions should be met before we can
rely on this method of estimation.

1. Linearity: Y = β0 + β1X1 + … + βkXk + ε. The specified model must represent a linear relationship.

2. No endogeneity: σ(X, ε) = 0 for all x and ε. The independent variables shouldn't be correlated with the error term.

3. Normality and homoscedasticity: ε ~ N(0, σ²). The errors should be normally distributed, and their variance should be consistent across observations.

4. No autocorrelation: σ(εi, εj) = 0 for all i ≠ j. No identifiable relationship should exist between the values of the error term.

5. No multicollinearity: ρ(xi, xj) ≉ 1 for all i, j with i ≠ j. No predictor variable should be perfectly (or almost perfectly) explained by the other predictors.
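Two of these assumptions lend themselves to quick numeric checks. Below is a sketch, again assuming statsmodels and made-up data: the Durbin-Watson statistic for autocorrelation and variance inflation factors (VIF) for multicollinearity.

# Checking two OLS assumptions on a toy two-predictor model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 0.5 * x1 - 0.8 * x2 + rng.normal(scale=0.4, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# No autocorrelation: a Durbin-Watson statistic near 2 is consistent
# with uncorrelated errors.
print("Durbin-Watson:", durbin_watson(results.resid))

# No multicollinearity: VIF per predictor (skipping the constant);
# values far above ~5-10 suggest a predictor is largely explained
# by the others.
for i in range(1, X.shape[1]):
    print(f"VIF x{i}:", variance_inflation_factor(X, i))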
Other methods for finding the regression line
OLS (ordinary least squares) is just the beginning. It is the simplest, and often a sufficient, method to estimate the
regression line. However, there are more complex methods that are more appropriate for certain datasets and problems.

Generalized least squares (GLS)
Maximum likelihood estimation (MLE)
Bayesian regression
Kernel regression
Gaussian process regression
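As a taste of one alternative from the list above, here is a quick GLS sketch in statsmodels; the data and the identity error covariance are illustrative assumptions, and with an identity covariance GLS reduces to OLS:

# Generalized least squares with an assumed error covariance; toy data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=50)

sigma = np.eye(50)                            # assumed error covariance matrix
gls_results = sm.GLS(y, X, sigma=sigma).fit()
print(gls_results.params)                     # [b0, b1]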
