Prediction & Forecasting: Regression Analysis
$RSS = \sum_{i=1}^{n}(Y_i - Y_i^{pred})^2$ is minimized to find the optimal values of β₀ and β₁
Output of regression is always a continuous numeric variable
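A minimal NumPy sketch of minimizing RSS in closed form (the standard least-squares solution; the x and y arrays are made-up example data):

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form OLS estimates that minimize RSS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x
rss = np.sum((y - y_pred) ** 2)  # the quantity being minimized
print(b0, b1, rss)
```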
Multiple Linear Regression: Represents the relationship between multiple independent variables of any type and a single continuous dependent variable.
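A minimal sketch of fitting a multiple linear regression with scikit-learn; the data and coefficients here are synthetic, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: three independent variables, one continuous dependent variable
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of β0 and (β1, β2, β3)
```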
VIF measures how well one independent variable can be predicted (explained) by all the other independent variables combined.
$VIF = \frac{1}{1 - R^2}$   { > 10: Eliminate; > 5: Worth inspecting; < 5: Good }
Calculated by running a regression with each X in turn treated as the dependent variable and the rest as independent variables; the resulting R² feeds the formula above.
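A minimal sketch of computing VIF per predictor with statsmodels (the column names and data below are made up, with x3 deliberately collinear with x1):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors; x3 is deliberately collinear with x1
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.9 * X["x1"] + rng.normal(scale=0.1, size=200)

# VIF for each predictor: regress it on all the others
exog = sm.add_constant(X)
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns) if col != "const"}
print(vifs)  # x1 and x3 should show VIF well above 5
```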
Solution: Drop highly correlated variables (only one at a time) or those with less business value, create a new variable by combining related variables, or transform the variables (e.g., PCA).
k-NN for Regression: A supervised learning algorithm that predicts the dependent variable by aggregating the values of a chosen number of nearest neighbours.
The k-NN algorithm does not assume any kind of linear relationship between the independent and dependent variables. It simply estimates the unknown value using the nearest-neighbour criterion.
Find the nearest points to an unknown point and assign the new value by calculating the mean of those nearest points' dependent variables (sketched in code after the distance formula below):
o Select K
o Calculate Euclidean distance
o Average of the points
o Assign value
Euclidean distance between two points is: $\sqrt{(x - x_1)^2 + (y - y_1)^2}$
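A minimal from-scratch NumPy sketch of the four steps above (the function name and data are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict y for x_new as the mean of its k nearest neighbours."""
    # Calculate Euclidean distance from x_new to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Take the k closest training points
    nearest = np.argsort(dists)[:k]
    # Assign the average of their dependent-variable values
    return y_train[nearest].mean()

# Made-up example data
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [6.0, 5.0]])
y_train = np.array([10.0, 12.0, 14.0, 30.0])
print(knn_regress(X_train, y_train, np.array([2.0, 2.0]), k=3))
```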
SIMPLE EVALUATION METRICS - not meaningful in isolation; they must be compared to the equivalent values of other models
Mean Squared Error (MSE): $MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - Y_i^{pred})^2$
Root Mean Squared Error (RMSE): $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - Y_i^{pred})^2}$
The deviation of the values predicted by a model from the actual observed values. Lower the better.
Mean Absolute Error (MAE): $MAE = \frac{1}{n}\sum_{i=1}^{n}|Y_i - Y_i^{pred}|$
Captures only the magnitude of the error, not its direction.
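A minimal NumPy sketch computing all three metrics (the y arrays are made-up examples):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0])  # made-up observed values
y_pred = np.array([2.8, 5.4, 7.0, 9.5])  # made-up predictions

err = y_true - y_pred
mse = np.mean(err ** 2)        # Mean Squared Error
rmse = np.sqrt(mse)            # Root Mean Squared Error
mae = np.mean(np.abs(err))     # Mean Absolute Error (magnitude only)
print(mse, rmse, mae)
```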
R SQUARED
R-Squared, also known as the Coefficient of Determination, is the percentage of the variance of the output variable that can be explained by the independent variables. Lies between [0,1].
$R^2 = 1 - \frac{RSS}{TSS}$ [1 − Unexplained/Total variation], where RSS is the Residual Sum of Squares and TSS is the Total Sum of Squares.
Higher the better: a high value implies a strong linear relation, i.e., the data fits the regression line more accurately.
1 implies all the variance in the data is explained by the model; 0 implies none of the variance is being explained by the model.
0.1 R-square: the model explains 10% of the variation within the data.
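A minimal sketch of R² computed directly from RSS and TSS (same kind of made-up arrays as above):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0])  # made-up observed values
y_pred = np.array([2.8, 5.4, 7.0, 9.5])  # made-up predictions

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss
print(r2)  # e.g. 0.9 would mean 90% of the variation is explained
```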
Significance of the derived beta coefficient – P-Value [if the p-value < significance level (usually 0.05), the coefficient is statistically significant]
A line fitted through randomly scattered data serves no purpose, hence it is necessary to check whether the fit is statistically significant.
Testing whether β1 is significant: hypothesis testing
Start by assuming β1 is not significant, i.e., there is no relationship between X & Y
NULL HYPOTHESIS H0 : β1 = 0 ALTERNATE HYPOTHESIS HA : β1 ≠ 0
if the p-value < 0.05, reject the NULL hypothesis and conclude that β1 is indeed significant
if you fail to reject the null hypothesis, the independent variable is insignificant for predicting the dependent variable
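A minimal sketch of reading per-coefficient p-values off a statsmodels OLS fit; the data are synthetic, with x2 deliberately unrelated to y:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: y depends on x1 but not on x2
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 2.0 + 1.5 * X["x1"] + rng.normal(scale=0.5, size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.pvalues)  # one p-value per coefficient, testing H0: beta = 0
```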
How can you find out the variables which are contributing the least to the model?
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect).
Variables with high p-values (> 0.05) are not significant in the model and should be dropped, as they do not help in prediction.
A low p-value (< 0.05) indicates that you can reject the null hypothesis, i.e., a predictor with a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable.
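Continuing the statsmodels sketch above, the least-contributing variables can be read straight off the fitted model's p-values (`results` is the fitted object from that sketch):

```python
# Predictors whose coefficient p-value exceeds 0.05 contribute least
weak = results.pvalues[results.pvalues > 0.05]
print(weak)  # in the sketch above, x2 should appear here
```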