Multiple Linear Regression
Dependent variable (Y): The outcome, also called the target or response variable, that the model predicts or explains.
Independent variables (X): The predictors used to explain the dependent variable's variation.
Coefficients (β): The parameters that determine the relationship between the dependent
variable and the independent variables.
Intercept (β₀): The point at which the regression line or hyperplane intersects the Y-axis
when all independent variables are equal to zero.
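Putting these elements together, the multiple linear regression model with p predictors is written as:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

where ε is the error term capturing variation not explained by the predictors.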
1. Linearity: The relationship between the dependent variable and each independent variable should be linear. In other words, holding the other predictors fixed, a one-unit change in an independent variable should produce a constant change in the dependent variable.
2. Independence of Errors: The error terms should be independent of each other, meaning that the error associated with one observation should not influence the error of any other observation. Violations of this assumption, such as autocorrelated errors in time-series data, lead to misleading standard errors and unreliable hypothesis tests.
3. Multivariate Normality: The error terms should follow a multivariate normal distribution,
meaning the errors are normally distributed around the regression line or hyperplane. This
assumption allows for the generation of accurate confidence intervals and hypothesis tests.
4. Homoscedasticity: The error terms should have constant variance across all levels of the
independent variables. This means that the spread of the errors should be consistent
regardless of the values of the predictors. If this assumption is not met, it could lead to
unreliable confidence intervals and hypothesis tests.
5. No Multicollinearity: The independent variables should not be highly correlated with one
another. High correlation among independent variables can make it difficult to determine
the individual effects of each predictor on the dependent variable, leading to unreliable
coefficient estimates and reduced model interpretability.
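As a rough illustration of how some of these assumptions can be checked in practice, here is a minimal Python sketch using statsmodels on synthetic data; all column names and effect sizes are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic data for illustration; column names and effects are invented.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 3 + 2 * df["x1"] - 1.5 * df["x2"] + rng.normal(scale=0.5, size=200)

X = sm.add_constant(df[["x1", "x2"]])  # adds the intercept column (beta_0)
model = sm.OLS(df["y"], X).fit()

# Independence of errors: a Durbin-Watson statistic near 2 suggests
# uncorrelated residuals.
print("Durbin-Watson:", durbin_watson(model.resid))

# Homoscedasticity and normality: inspect the residuals directly, e.g. by
# plotting model.resid against model.fittedvalues (no funnel shape expected)
# and with a Q-Q plot such as sm.qqplot(model.resid, line="45").
```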
Collecting and preparing the data is crucial to building a robust multiple linear regression
model. In this section, we'll walk you through the process of identifying variables, collecting
data, and cleaning and preprocessing the data to ensure that it's ready for analysis.
The dependent variable, also known as the target or response variable, is the outcome you
want to predict or explain using the independent variables. You'll need to select a single
dependent variable in multiple linear regression.
Examples include house prices and sales revenue.
The independent variables, also called predictors or features, are used to explain the
variations in the dependent variable. In multiple linear regression, you can use two or more
independent variables.
When selecting independent variables, consider factors that are likely to influence the
dependent variable and have a theoretical basis for inclusion in the model.
Collecting data for multiple linear regression can be done in various ways, depending on your research question and the domain you're working in. One common approach is to use existing databases and datasets: pre-existing data from sources such as government agencies, research institutions, or online repositories.
Once the data are collected, they must be cleaned and preprocessed before analysis. Common steps include:
1. Missing Values: Missing values occur when data points are not recorded or are incomplete. Depending on the nature and extent of the missing data, you can impute the missing values using methods such as mean, median, or mode imputation, or remove the affected observations altogether.
2. Outliers: Outliers are data points that differ markedly from the rest of the data. They can considerably distort the multiple linear regression model, so it's essential to identify and handle them appropriately. You can use visualization techniques, such as box plots or scatter plots, and statistical methods, such as the Z-score or IQR method, to detect outliers. Depending on the context, you can either remove the outliers or transform the data to reduce their impact.
3. Encoding Categorical Variables: Multiple linear regression requires that all independent variables be numerical. If your dataset includes categorical variables (e.g., gender, color, or region), you must convert them into numerical values. One common method for encoding categorical variables is one-hot encoding, which creates a binary (0 or 1) feature for each category of the variable (see the sketch after this list).
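The sketch below illustrates these three steps with pandas on a small hypothetical dataset; the column names and values are placeholders:

```python
import pandas as pd

# Hypothetical raw dataset; column names and values are invented.
df = pd.DataFrame({
    "sqft": [1400, 1600, None, 1100, 9800],
    "region": ["north", "south", "south", "north", "east"],
    "price": [240000, 310000, 265000, 199000, 275000],
})

# 1. Missing values: impute the numeric gap with the column median.
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# 2. Outliers: keep only values within 1.5 * IQR of the quartiles.
q1, q3 = df["sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["sqft"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 3. Categorical encoding: one-hot encode the "region" column.
df = pd.get_dummies(df, columns=["region"], drop_first=True)
print(df)
```

Passing drop_first=True drops one dummy per categorical variable, which avoids the perfect multicollinearity (the "dummy variable trap") that would otherwise conflict with assumption 5 above.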
Choosing the most relevant and significant predictors is essential for building an accurate and interpretable multiple linear regression model. Here are three popular techniques for predictor selection (a scikit-learn sketch follows the list):
1. Forward Selection: This method starts with an empty model and iteratively adds predictors
one at a time based on their contribution to the model's performance. The process continues
until no significant improvement in model performance is observed.
2. Backward Elimination: This method starts with a model that includes all potential predictors
and iteratively removes the least significant predictor one at a time. The process continues
until removing any more predictors results in a significant decrease in model performance.
3. Stepwise Regression: This method combines forward selection and backward elimination. It starts with an empty model, adds predictors one at a time, and re-evaluates the model at each step; a previously added predictor may be removed if its inclusion no longer improves the model. The process continues until no predictor can be added or removed without significantly affecting model performance.
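These classical procedures are usually driven by p-values or information criteria. As one practical approximation, scikit-learn's SequentialFeatureSelector implements greedy forward and backward selection scored by cross-validation rather than p-values; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 8 candidate predictors, 4 of them informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)

# Forward selection: greedily add predictors while the cross-validated
# score keeps improving. direction="backward" gives backward elimination.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5)
selector.fit(X, y)
print("Selected predictor indices:", selector.get_support(indices=True))
```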
Cross-validation Techniques
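A common approach is k-fold cross-validation: the data are split into k folds, the model is fit on k - 1 folds and scored on the held-out fold, and the procedure rotates so each fold serves once as the test set. A minimal scikit-learn sketch on synthetic data (sizes and parameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only.
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

# 5-fold cross-validation: train on 4 folds, score R^2 on the held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Per-fold R^2:", scores.round(3), "Mean:", scores.mean().round(3))
```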
Performance Metrics
Several metrics can be used to evaluate the performance of a multiple linear regression
model. These metrics quantify the difference between the predicted values and the actual
values of the dependent variable. Common performance metrics include:
1. Mean Squared Error (MSE): The MSE is the average of the squared differences between the
predicted and actual values. It emphasizes larger errors and is sensitive to outliers.
2. Mean Absolute Error (MAE): The MAE is the average of the absolute differences between
the predicted and actual values. It is less sensitive to outliers than the MSE.
3. Root Mean Squared Error (RMSE): The RMSE is the square root of the MSE. It is expressed in
the same units as the dependent variable, making it easier to interpret.
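As a quick illustration, all three metrics can be computed with scikit-learn and NumPy; the actual and predicted values below are invented:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
mae = mean_absolute_error(y_true, y_pred)  # average absolute error
rmse = np.sqrt(mse)                        # same units as the target
print(f"MSE={mse:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```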
A standard diagnostic for multicollinearity is the variance inflation factor (VIF). The VIF measures how much a coefficient's variance is inflated due to multicollinearity; a VIF value greater than 10 is often considered indicative of multicollinearity. To calculate the VIF for each predictor, use statistical software or Python libraries such as Statsmodels or Scikit-Learn.
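A minimal sketch computing VIFs with statsmodels, using the formula VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors (the data here are synthetic, with x3 deliberately built to be nearly collinear with x1 and x2):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors; x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = X["x1"] + X["x2"] + rng.normal(scale=0.05, size=200)

# Include the constant before computing VIFs, then skip it in the output.
# Expect large VIFs: x1, x2, and x3 form a near-exact linear relationship.
Xc = sm.add_constant(X)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))
```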
If multicollinearity is detected, one common remedy is to remove one of the correlated predictors: if two or more predictors are highly correlated, dropping one of them reduces multicollinearity.
By validating and optimizing your multiple linear regression model, you can ensure that it
generalizes well to new data and provides accurate and reliable predictions.
Multiple linear regression is widely used across various industries due to its versatility and ability to model relationships between multiple variables. Key applications include:
1. Finance: In the finance industry, multiple linear regression is used to predict stock prices,
assess investment risks, and estimate the impact of various factors, such as interest rates,
inflation, and economic indicators, on financial assets.
2. Healthcare: Multiple linear regression is employed in healthcare to identify risk factors for
diseases, predict patient outcomes, and evaluate the effectiveness of treatments. For
example, it can be used to model the relationship between a patient's age, weight, blood
pressure, and the likelihood of developing a specific medical condition.
3. Sports: Multiple linear regression is also used in sports analytics to predict player performance, evaluate team strategies, and determine game outcome factors. For example, it can be employed to predict a basketball player's points scored based on their shooting percentage, minutes played, and other relevant statistics.
Some concrete real-world examples:
1. Housing Price Prediction: A real estate company might use multiple linear regression to predict housing prices based on features such as square footage, the number of bedrooms and bathrooms, the age of the house, and location. This information can help buyers and sellers make informed decisions and assist the company in setting competitive prices for their listings.
2. Demand Forecasting: A retail company can use multiple linear regression to forecast product demand based on factors like seasonality, economic conditions, and promotional activities. Accurate demand forecasting helps businesses manage inventory levels, optimize supply chain operations, and plan marketing campaigns effectively.
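As a closing illustration of the housing example above, here is a minimal end-to-end fit with statsmodels on synthetic data; all feature names, coefficients, and noise levels are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic housing data; features and effect sizes are invented.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "sqft": rng.uniform(800, 3500, n),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 50, n),
})
df["price"] = (50_000 + 120 * df["sqft"] + 8_000 * df["bedrooms"]
               - 1_000 * df["age"] + rng.normal(0, 20_000, n))

# Fit price ~ sqft + bedrooms + age and inspect the results.
X = sm.add_constant(df[["sqft", "bedrooms", "age"]])
model = sm.OLS(df["price"], X).fit()
print(model.summary())  # coefficients, R^2, p-values, confidence intervals
```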