Linear Regression
Introduction
This paper explores the theoretical foundations of linear regression, its practical applications,
and the methods used to evaluate and enhance the model’s accuracy. We will examine both
simple linear regression (involving one predictor) and multiple linear regression (involving
multiple predictors), along with the assumptions and limitations of these models.
1. The Linear Regression Model
Linear regression is a statistical method used to model the relationship between a dependent variable $Y$ and one or more independent variables $X$. The goal is to fit a linear equation to the observed data, thereby allowing us to predict the dependent variable's values from the independent variables. The linear equation for simple linear regression can be expressed as:

$Y = \beta_0 + \beta_1 X + \varepsilon$

where:
- $\beta_0$ is the intercept (the expected value of $Y$ when $X = 0$),
- $\beta_1$ is the slope (the expected change in $Y$ per unit change in $X$), and
- $\varepsilon$ is the random error term capturing variation not explained by $X$.

For example, with $\beta_0 = 2$ and $\beta_1 = 0.5$, an observation with $X = 10$ has a predicted value of $\hat{Y} = 2 + 0.5 \times 10 = 7$.
2. Assumptions
To ensure that linear regression provides valid results, the following assumptions must hold (a sketch for checking two of them follows the list):
1. Linearity: The relationship between the independent and dependent variables is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: The variance of residuals (differences between observed and
predicted values) is constant across all levels of the independent variable(s).
4. Normality: Residuals should be approximately normally distributed.
5. No Multicollinearity: In multiple linear regression, the independent variables should not
be highly correlated.
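As a rough illustration of how these assumptions can be checked in practice, the sketch below fits an OLS model to synthetic stand-in data (the data and settings are placeholders, assuming statsmodels and SciPy are available) and runs a Breusch-Pagan test for homoscedasticity and a Shapiro-Wilk test for residual normality.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic stand-in data: one predictor with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * X + rng.normal(0, 1, size=200)

# Fit OLS with an intercept column added to the design matrix.
exog = sm.add_constant(X)
resid = sm.OLS(y, exog).fit().resid

# Homoscedasticity (assumption 3): Breusch-Pagan test.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")  # small p suggests heteroscedasticity

# Normality of residuals (assumption 4): Shapiro-Wilk test.
shapiro_stat, shapiro_pvalue = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value: {shapiro_pvalue:.3f}")  # small p suggests non-normal residuals
```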
3. Estimating Parameters
The most common method for estimating the coefficients is ordinary least squares (OLS), which chooses the parameters that minimize the residual sum of squares:

$\text{RSS} = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2$

where $Y_i$ is the observed value and $X_i$ is the independent variable for each data point $i$. For simple linear regression, minimizing RSS yields the closed-form estimates $\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$.
In matrix form, for multiple regression, the OLS estimator is calculated as:

$\hat{\beta} = (X^\top X)^{-1} X^\top Y$

where $X$ is the matrix of input features and $Y$ is the vector of output values.
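As a minimal sketch of this estimator (plain NumPy on synthetic data; in practice a routine such as numpy.linalg.lstsq is preferred for ill-conditioned problems), the coefficients can be recovered directly from the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Design matrix: intercept column plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0, -0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# OLS estimator, beta_hat = (X^T X)^{-1} X^T Y, computed via a linear
# solve rather than an explicit matrix inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # should be close to [1.0, 2.0, -0.5]
```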
4. Evaluating Model Performance
Several metrics can be used to evaluate the performance of a linear regression model:
1. Mean Squared Error (MSE): Measures the average of the squared differences between observed and predicted values: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$
2. R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, with values closer to 1 indicating a better fit: $R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$
3. Adjusted R-squared: Adjusts R-squared for the number of predictors $p$ in the model, preventing overfitting by penalizing the addition of irrelevant variables: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$
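For concreteness, here is a plain-NumPy sketch of these three metrics (the function names are my own; y is the observed vector, y_hat the predictions, and p the number of predictors excluding the intercept):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average squared residual."""
    return float(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """Proportion of variance explained: 1 - RSS/TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return float(1 - rss / tss)

def adjusted_r_squared(y, y_hat, p):
    """R-squared penalized for the number of predictors p."""
    n = len(y)
    return float(1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1))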
5. Extensions and Variations
To address some of linear regression's limitations, several extensions and variations have been developed, including ridge and lasso regression (which add L2 and L1 regularization penalties, respectively, to reduce overfitting and cope with multicollinearity) and polynomial regression (which captures nonlinear relationships while remaining linear in the parameters).
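As one concrete example, ridge regression replaces the OLS normal equations with a penalized version, $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top Y$; a minimal sketch follows (the function name and the choice of $\lambda$ are illustrative):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Ridge estimator: solves (X^T X + lam * I) beta = X^T Y.
    Production code typically leaves the intercept unpenalized;
    this sketch penalizes all coefficients for brevity."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Usage: beta = ridge_fit(X, Y, lam=1.0)
```

Larger values of $\lambda$ shrink the coefficients more strongly toward zero, trading a little bias for a reduction in variance.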
Conclusion
Linear regression remains a fundamental tool in statistical analysis and machine learning,
valued for its interpretability, efficiency, and broad applicability. Understanding its assumptions,
limitations, and evaluation techniques is crucial for proper model application and interpretation.
Despite its simplicity, linear regression provides a robust foundation for more complex predictive
models and continues to be an essential technique in data analysis.