Prediction & Forecasting: Regression Analysis

Regression analysis is a statistical technique used to analyze relationships between variables and predict unknown variables. Simple linear regression models a linear relationship between one dependent and one independent variable, while multiple linear regression extends this to model relationships between one dependent and multiple independent variables. Key steps in regression analysis include defining the business problem, collecting and preprocessing data, building and evaluating models, and deploying the optimal model.


Regression analysis is a statistical technique used to analyse the relationship between different variables and to predict a continuous unknown variable using this relationship.


Regression analysis can be used for the following two types of problems: Prediction & Forecasting. The fundamental difference between prediction and forecasting is the introduction of the time dimension as an essential component in forecasting use-cases.
PROCESS FLOW: Define Business Problem -> Get Data -> Data Pre-Processing & Analysis -> Build Model -> Evaluate the Model -> Deploy
Simple Linear Regression: There is only one dependent variable and one independent variable for which the predictions are to be made, with a linear relationship between them. Independent variable == Predictor variable & Dependent variable == Output variable.
Difference w/ Correlation
- Correlation tells the extent of the relationship between 2 variables (-1 to 1), whereas regression makes predictions
- Correlation makes no distinction between dependent & independent variables, whereas regression makes a clear distinction
SLR Equation: y = β₀ + β₁x, where β₀ and β₁ are the coefficients (intercept & slope)
A best-fit line is fitted on the data to explain the relationship between the variables; it is found by minimising the residual sum of squares (RSS):

RSS = Σᵢ₌₁ⁿ (Y(i) − Y(i)pred)²   RSS is minimised to find the optimal values of β₀ and β₁
 Output of regression is always a continuous numeric variable
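As a minimal sketch (with made-up numbers), numpy's least-squares polynomial fit can recover β₀ and β₁ by minimising RSS:

import numpy as np

# Illustrative data: one predictor x and one continuous output y (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# polyfit with deg=1 fits y = beta0 + beta1*x by minimising the sum of squared residuals
beta1, beta0 = np.polyfit(x, y, deg=1)

y_pred = beta0 + beta1 * x
rss = np.sum((y - y_pred) ** 2)
print(f"beta0={beta0:.3f}, beta1={beta1:.3f}, RSS={rss:.3f}")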
Multiple Linear Regression: Represents the relationship between multiple independent variables of any type and a single dependent variable, which is continuous.

MLR is an extension of SLR: y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ϵ (ϵ models the error term)


 Find the best fit Hyperplane instead of best fit straight line
 Coefficients obtained by minimizing sum of squared errors (least squares) as in SLR
 There should be a linear relationship between the predictor variable/s & the response variable
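A short sketch of the hyperplane fit on illustrative data, assuming scikit-learn is available:

import numpy as np
from sklearn.linear_model import LinearRegression

# Two predictors (x1, x2) and one continuous response y -- illustrative values
X = np.array([[1, 4], [2, 5], [3, 7], [4, 8], [5, 11]], dtype=float)
y = np.array([10.2, 13.1, 17.8, 20.3, 26.0])

# Least-squares fit, as in SLR, but over a hyperplane instead of a line
model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)
print("coefficients (beta1, beta2):", model.coef_)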
Overfitting: The more variables, the greater the tendency to memorise outcomes in training
- makes the model complex and unable to generalise: high accuracy on training data but lower accuracy on test data
Multi-Collinearity: There should be no or minimal multi-collinearity between predictor variables
Feature Selection: Keep only the most relevant variables & avoid overfitting, which leads to lower test accuracy
 Not all potential predictors are significant
 Too many features lead to over-fitting & an increase in training time
 Helps in determining the adequate number of features for the highest accuracy
Methods of feature selection (see the sketch after this list):
 Brute force: Try all combinations. Not efficient
 Manual feature selection: Good when the number of features is small
 Forward Selection: Automated bottom-up method, starting with one feature and adding features until there is no additional benefit
 Backward Elimination: Top-down method, starting with all features and removing the least significant one at a time
 RFE (Recursive Feature Elimination): Automated backward elimination that refits the model after each removal
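A sketch of automated feature selection via scikit-learn's RFE on synthetic data (the feature counts here are arbitrary):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate features, only 4 truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# RFE: fit, drop the weakest feature, refit -- repeated until 4 features remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_)
print("feature ranking (1 = kept):", rfe.ranking_)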
Multi-Collinearity: Having related predictor variables in the input data set. While the model is assumed to be built using independent variables, some may be inter-related.
- One may not know which variables are actually contributing; explainability vanishes
- Does NOT impact the precision or accuracy of the prediction, or the value of the response variable
- P-values of the coefficients may not be reliable
Detection: Visual way: pair-wise correlation plot using a scatter plot or heat-map matrix
Objective way: VIF (Variance Inflation Factor)

VIF measures how well one predictor variable can be explained (predicted) by all the other predictor variables combined.

VIF = 1 / (1 − R²)   { > 10: Eliminate | 5–10: Worth inspecting | < 5: GOOD }
 Calculated by running a regression for every X considered as the dependent variable, with the rest as independent variables
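A sketch of VIF computation using statsmodels' variance_inflation_factor on illustrative data (the near-copy column x3 is contrived for the demo):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
# Illustrative predictors: x3 is nearly a copy of x1, so it is highly collinear
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 0.95 * df["x1"] + 0.05 * rng.normal(size=200)

# One VIF per column: regress that column on all the others, VIF = 1 / (1 - R^2)
vif = pd.Series(
    [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x3 should show values well above 5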
Solution: Drop highly correlated variables (only 1 at a time) or those with less business value; create a new variable by combining the related variables; transform the variables (e.g., PCA)
k-NN for Regression: Supervised learning algorithm that analyses a certain number of neighbours and assigns the dependent variable an aggregated value.
 The k-NN algorithm does not assume any kind of linear relationship between the independent and dependent variables. It simply finds the unknown value by assigning the most suitable value using the nearest-neighbour criterion.
 Find the nearest points to an unknown point and assign the new value by calculating the mean of these nearest points' dependent variables:
o Select K
o Calculate Euclidean distance
o Average of the points
o Assign value
Euclidean distance between two points is: √((x − x₁)² + (y − y₁)²)
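A minimal k-NN regression sketch with scikit-learn on made-up 1-D data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative training data: one feature, continuous target
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# k=3: find the 3 nearest points (Euclidean distance) and average their y values
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[2.5]]))  # mean of the 3 nearest neighbours' targets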
SIMPLE EVALUATION METRICS - would not be helpful unless you compare them to other models' equivalent values
Mean Squared Error (MSE): MSE = (1/n) Σᵢ₌₁ⁿ (Y(i) − Y(i)pred)²
Root Mean Squared Error (RMSE): RMSE = √MSE. The deviation of the values predicted by a model from the actual observed values. Lower the better.
Mean Absolute Error (MAE): MAE = (1/n) Σᵢ₌₁ⁿ |Y(i) − Y(i)pred|. Captures only the magnitude of the error and not the direction.
R SQUARED: R-Squared, also known as the Coefficient of Determination, is the percentage of the variance of the output variable that can be explained by the independent variables.
R² = 1 − RSS/TSS, where RSS (Residual Sum of Squares) is the unexplained variation and TSS (Total Sum of Squares) is the total variation.
 Lies between [0,1]
 Higher the better. A high value implies a strong linear relation, i.e., the data fits the regression line more accurately
 1 implies all the variance in the data is explained by the model
 0 implies none of the variance is being explained by the model
 0.1 R-square: the model explains 10% of the variation within the data
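These metrics can be computed with scikit-learn; the y_true/y_pred arrays below are illustrative:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])  # observed values (made-up)
y_pred = np.array([2.8, 5.3, 7.1, 9.4])  # model predictions (made-up)

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))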
Significance of the derived beta coefficient – P-Value [If a coefficient's p-value < significance level (usually 0.05), that coefficient is statistically significant]
A fitted line in the case of randomly scattered data does not serve the purpose, hence it is necessary to check whether the fit is statistically significant.
Finding whether β₁ is significant: Hypothesis testing
 Start by assuming β₁ is not significant, i.e. no relationship between X & Y
 NULL HYPOTHESIS H₀: β₁ = 0   ALTERNATE HYPOTHESIS Hₐ: β₁ ≠ 0
 If the p-value < 0.05, reject the NULL hypothesis and state that β₁ is indeed significant
 If you fail to reject the null hypothesis, the independent variable is insignificant in the prediction of the dependent variable
How can you find out the variables which are contributing the least to the model?
 The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect).
 Variables with high p-values (> 0.05) are not relevant and should be dropped: they do not help in prediction
 A low p-value (< 0.05) indicates that you can reject the null hypothesis, i.e., a predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable
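A sketch of inspecting per-coefficient p-values with statsmodels OLS on synthetic data (x2 is deliberately pure noise):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)           # unrelated to y by construction
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
results = sm.OLS(y, X).fit()
# p-values for [const, x1, x2]: x1's is near 0 (keep), x2's is large (drop)
print(results.pvalues)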
