DS Module 05
We calculate a and b using the Least Squares Method, which minimizes the sum of squared errors between
actual and predicted Y values:
b = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
a = (ΣY − bΣX) / n
Where:
• n: number of observations
• ΣX, ΣY: sums of the X and Y values
• ΣXY: sum of the products of each paired X and Y
• ΣX²: sum of the squared X values
Real-Life Applications
• Predicting house prices based on area.
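To make the formulas concrete, here is a minimal Python sketch of the least-squares calculation; the area and price numbers are made up for illustration:

def least_squares(x, y):
    """Return intercept a and slope b for the line Ŷ = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                   # intercept
    return a, b

# Made-up example: house area (hundreds of sq ft) vs price (lakhs)
area = [1, 2, 3, 4, 5]
price = [7, 9, 11, 13, 15]
a, b = least_squares(area, price)
print(f"Y = {a:.2f} + {b:.2f}X")   # prints: Y = 5.00 + 2.00X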
2.] How is the regression equation used for prediction?
Ans: The regression equation is used for prediction by substituting the value of the independent variable (X)
into the equation to estimate the corresponding value of the dependent variable (Y). The equation represents
a straight-line relationship between the two variables and is written as:
Ŷ = a + bX
Here, Ŷ is the predicted value of Y, a is the intercept, and b is the slope of the line. Once the values of a
and b are calculated from the given data, you can substitute any new value of X into the equation to predict
what Y would be.
For example, if the regression equation is Ŷ = 5 + 2X, and you want to predict Y when X = 10, you substitute
it into the equation:
Ŷ = 5 + 2(10) = 25
So, the predicted value of Y is 25. This process is useful in making informed guesses or forecasts based on
past data when a linear relationship exists.
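The same substitution, as a quick Python sketch:

# Prediction with the example equation Ŷ = 5 + 2X
a, b = 5, 2
x_new = 10
y_hat = a + b * x_new
print(y_hat)   # 25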
3.] What is residual analysis? Why is it important?
Ans: Residual analysis is the process of examining the differences between the actual values and the
predicted values in a regression model. These differences are called residuals, and they are calculated as:
Residual = Y − Ŷ
Where Y is the actual observed value and Ŷ is the predicted value from the regression equation.
Residual analysis is important because it helps to check the validity of the regression model. By analyzing
the residuals, we can test whether the assumptions of linear regression are satisfied, such as:
• Linearity: the relationship between X and Y really is a straight line.
• Constant variance: the residuals have roughly the same spread across all values of X.
• Normality: the residuals are approximately normally distributed.
• Independence: the residuals are not correlated with one another.
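A minimal Python sketch of computing residuals, assuming the fitted line Ŷ = 5 + 2X and made-up observed values:

# Made-up observations scattered around the fitted line Ŷ = 5 + 2X
x = [1, 2, 3, 4, 5]
y_actual = [6.8, 9.3, 10.9, 13.2, 14.8]

a, b = 5, 2
y_pred = [a + b * xi for xi in x]
residuals = [ya - yp for ya, yp in zip(y_actual, y_pred)]
print(residuals)   # small values scattered around zero — a good sign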
4.] What are outliers? How do they affect regression analysis?
Ans: Outliers are data points that lie far away from the general pattern of the other observations in a dataset.
In the context of regression, they are points where the actual value of the dependent variable (Y) is
significantly different from the value predicted by the regression model.
Outliers can have a strong impact on regression analysis. Since regression lines are calculated using the least
squares method, which minimizes the sum of squared residuals, even a single outlier with a large residual
can greatly influence the slope and intercept of the regression line. This can lead to a model that does not
accurately represent the relationship between the variables for most of the data.
Their main effects are:
• They can distort the regression coefficients, making predictions less accurate.
• They may increase the error variance, affecting the model's overall fit.
• They can violate model assumptions, especially those related to normality and constant variance of
residuals.
Because of their influence, it’s important to detect outliers through residual plots or statistical tests and
decide whether they should be kept, investigated, or removed based on the context of the data.
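A short numpy sketch (the data is made up) showing how one extreme value changes the fitted line:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([7, 9, 11, 13, 15, 17])        # exactly Y = 5 + 2X

b, a = np.polyfit(x, y, 1)                   # returns slope, then intercept
print(f"without outlier: Y = {a:.2f} + {b:.2f}X")

y_out = y.copy()
y_out[-1] = 40                               # one extreme Y-value
b, a = np.polyfit(x, y_out, 1)
print(f"with outlier:    Y = {a:.2f} + {b:.2f}X")   # slope and intercept shift noticeably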
5.] What are influential observations? How are they identified?
Ans: Influential observations are data points that have a strong impact on the estimated regression line or
regression coefficients. Unlike regular outliers, which are just far from the predicted value, influential
observations can significantly change the slope, intercept, or direction of the regression line if they are
included or removed.
These points typically lie far from the rest of the data in terms of their X-values (independent variable), Y-
values, or both. Even if an influential point fits the general trend (has a small residual), its position can still
pull the regression line toward itself, affecting the model’s accuracy.
Common measures used to identify influential observations include:
1. Leverage – Measures how far an X-value is from the mean of X-values. A high-leverage point has an
extreme X-value.
2. Cook’s Distance – Combines both the leverage and residual of a data point to assess its influence.
Points with a Cook’s Distance significantly greater than others are considered influential.
3. DFBETAS and DFFITS – These are statistical measures used to assess how much a single
observation affects the regression coefficients or fitted values.
Why It Matters
Influential observations can distort your regression model, leading to misleading conclusions or poor
predictions. Once identified, analysts need to decide whether the point is a data error, a rare but valid case,
or an indication that the model is missing something important.
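A sketch of these diagnostics using the statsmodels library; the data is made up, with one deliberately extreme point:

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 20])    # 20 is an extreme X-value (high leverage)
y = np.array([7, 9, 11, 13, 15, 70]) # 70 is also far off the Y = 5 + 2X trend

X = sm.add_constant(x)               # adds the intercept column
model = sm.OLS(y, X).fit()
influence = model.get_influence()

print(influence.hat_matrix_diag)     # leverage of each observation
print(influence.cooks_distance[0])   # Cook's Distance values
print(influence.dffits[0])           # DFFITS values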
6.] Define Multiple Linear Regression. How is it different from Simple Linear Regression?
Ans: Multiple Linear Regression models the relationship between one dependent variable (Y) and two or
more independent variables (X1, X2, …, Xn). The equation takes the form:
Ŷ = a + b1X1 + b2X2 + … + bnXn
where a is the intercept and each bi is the slope for the corresponding Xi.
The difference from Simple Linear Regression is the number of independent variables: Simple Linear
Regression uses only one predictor (Ŷ = a + bX), while Multiple Linear Regression uses several, allowing
the model to account for more than one factor affecting Y at the same time.
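A minimal sketch of fitting such a model with statsmodels; the predictors (area, age) and their values are hypothetical:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: price predicted from area and age together
area = np.array([10, 12, 15, 18, 20])
age = np.array([5, 3, 8, 2, 10])
price = np.array([50, 60, 62, 85, 70])

X = sm.add_constant(np.column_stack([area, age]))
model = sm.OLS(price, X).fit()
print(model.params)    # intercept a, then slopes b1 (area) and b2 (age)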
7.] What is Multicollinearity?
Ans: Multicollinearity happens in multiple linear regression when two or more independent variables are
highly related to each other. This means they carry the same or similar information, which makes it hard for
the model to figure out which variable is actually affecting the result (dependent variable).
When multicollinearity is present, the values of the coefficients (slopes) in the regression equation become
unstable and confusing. They might change a lot if you add or remove a variable, and it becomes difficult to
trust which variable is really important.
In short, multicollinearity makes it hard to tell which factor really affects the result, even if your overall
model gives good predictions.
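One standard way to detect multicollinearity (not covered above, but common practice) is the Variance Inflation Factor (VIF). A minimal statsmodels sketch, where x2 is deliberately constructed to be almost identical to x1:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # nearly a copy of x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # VIFs for x1 and x2 are huge; x3 stays near 1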