Linear Regression Subjective Questions
The year 2019 saw a higher number of bookings compared to 2018, indicating
positive business growth.
The fall season experienced a significant rise in bookings, and overall, all
seasons showed a substantial increase from 2018 to 2019.
Booking counts were lower on holidays, likely because people preferred
spending time at home with family during holidays rather than renting bikes.
Clear weather conditions (labelled as "Good" in the dataset) had a notable
impact on attracting more bookings.
Bookings were evenly distributed between working and non-working days.
Higher booking rates were observed on Thursday, Friday, Saturday, and Sunday
compared to the earlier days of the week.
Most bookings occurred between May and October, with a steady increase from
the start of the year, peaking mid-year, and then declining towards the end.
ANSWER)
The variable 'temp' shows the strongest correlation with the target variable, as
observed in the graph.
Since 'temp' and 'atemp' are redundant variables, only one of them is chosen while
determining the best-fit line.
This selection helps prevent multicollinearity and ensures a more accurate model.
cnt = 4491.30 + 1001 × yr + 1117.21 × temp − 426.72 × hum − 358.38 × windspeed
+ 322.14 × Summer + 530.93 × Winter + 225.16 × September − 39.38 × December
− 92.53 × January − 71.38 × November
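Purely as an illustration, the equation can be evaluated for a single hypothetical day; the feature values below are made up and assume the same scaling that was applied while building the model:

```python
# Hypothetical (already scaled) feature values -- for illustration only
features = {
    "yr": 1, "temp": 0.65, "hum": 0.55, "windspeed": 0.20,
    "Summer": 1, "Winter": 0, "September": 0,
    "December": 0, "January": 0, "November": 0,
}

# Coefficients of the final best-fit equation above
coefs = {
    "yr": 1001.00, "temp": 1117.21, "hum": -426.72, "windspeed": -358.38,
    "Summer": 322.14, "Winter": 530.93, "September": 225.16,
    "December": -39.38, "January": -92.53, "November": -71.38,
}

# cnt = intercept + sum(coefficient * feature value)
cnt = 4491.30 + sum(coefs[name] * value for name, value in features.items())
print(round(cnt, 2))  # predicted bike-demand count for this hypothetical day
```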
4. How did you validate the assumptions of Linear Regression after building the
model on the training set? (3 marks)
ANSWER)
Validating the assumptions of linear regression is a crucial step to ensure the reliability of the
model. After building the model on the training set, here are the steps I followed to validate the
assumptions:
1. Residual Analysis
Residual analysis helps verify whether the model's assumptions hold and whether any
adjustments are needed for better predictions. A combined code sketch illustrating the
diagnostic checks below is given after this list.
2. Homoscedasticity
Process: Ensure residuals have a constant variance across all levels of the
independent variables.
Check:
o Residuals vs. predicted values should show no clear pattern—they
should be randomly scattered.
o If the spread of residuals increases or decreases systematically, it
indicates heteroscedasticity (non-constant variance).
o The Breusch-Pagan or Goldfeld-Quandt test can formally detect
heteroscedasticity.
3. Linearity
Process: Verify that the relationship between the independent variables and the
target variable is linear.
Check:
o Scatter plots of residuals vs. predicted values should show no clear
pattern—randomly dispersed points indicate linearity.
o If residuals form a curve or systematic pattern, it suggests a non-linear
relationship, meaning a transformation or a non-linear model may be
needed.
o Checking correlation coefficients between independent and dependent
variables can also help assess linearity.
4. No Autocorrelation of Residuals
Process: Ensure that residuals are not correlated with each other.
Check:
o Use the Durbin-Watson test (values close to 2 indicate no
autocorrelation, while values near 0 or 4 suggest positive or negative
autocorrelation, respectively).
o Residual plots over time should show random patterns rather than
trends or cycles.
5. Multicollinearity
Process: Ensure that the independent variables are not strongly correlated with
one another.
Check:
o Compute the Variance Inflation Factor (VIF) for each predictor; values
above roughly 5-10 indicate problematic multicollinearity.
o Inspect the correlation matrix (heatmap) of the predictors and drop or
combine highly correlated features.
6. Cross-Validation
Process: Identify if the model is too complex and fits the noise instead of the
actual pattern.
Check:
o Compare training vs. test performance (e.g., the R² score):
A much higher training score than test score suggests
overfitting.
o Use Regularization techniques (Lasso, Ridge) to control complexity.
o Learning curves can reveal how well the model generalizes across
different dataset sizes.
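The checks described above can be run in Python. The following is a minimal sketch (not the assignment's actual notebook code) using statsmodels and scikit-learn; the data are synthetic stand-ins, so every variable name here is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit an OLS model on the training set
X_train_c = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_c).fit()
residuals = model.resid
fitted = model.fittedvalues

# 1-3. Residuals vs fitted values: a random, patternless scatter supports
# both linearity and homoscedasticity
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# 2. Breusch-Pagan test: a small p-value suggests heteroscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(residuals, X_train_c)
print("Breusch-Pagan p-value:", bp_pvalue)

# 4. Durbin-Watson statistic: values near 2 indicate no autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))

# 5. Multicollinearity: VIF per predictor (values above ~5-10 are a concern)
for i in range(1, X_train_c.shape[1]):  # skip the constant column
    print(f"VIF for predictor {i}:", variance_inflation_factor(X_train_c, i))

# 6. Generalisation: compare R^2 on the training set vs held-out data
test_pred = model.predict(sm.add_constant(X_test))
print("Train R^2:", model.rsquared, "Test R^2:", r2_score(y_test, test_pred))
```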
5. Based on the final model, which are the top 3 features contributing significantly
towards explaining the demand of the shared bikes? (2 marks)
ANSWER)
The following three features significantly contribute to explaining the demand for
shared bikes:
Temperature (temp)
Winter season (winter)
Calendar year (year)
These three features play a crucial role in predicting bike demand, influencing user
behavior and seasonal trends.
General Subjective Questions
1. Explain the linear regression algorithm in detail. (4 marks)
ANSWER)
1. Model Representation
Simple linear regression models the relationship between one independent variable and the target:
y = b0 + b1⋅x + ϵ
Multiple linear regression extends this to several predictors:
y = b0 + b1⋅x1 + b2⋅x2 + ... + bn⋅xn + ϵ
where:
o y is the dependent (target) variable
o x1, x2, ..., xn are the independent variables
o b0 is the intercept and b1, b2, ..., bn are the regression coefficients
o ϵ is the random error term
2. Cost Function
The fit of a candidate line is measured with a cost function, typically the Mean Squared Error (MSE):
MSE = (1/n) Σ (yi − ŷi)²
where yi is the actual value and ŷi is the value predicted for observation i.
3. Finding the Best-Fit Line
To find the best-fit line, we need to minimize the cost function. This is done using
Ordinary Least Squares (a closed-form solution) or Gradient Descent (an iterative method).
4. Training the Model
The algorithm is trained on a dataset where it learns the optimal coefficient values by
minimizing the error between actual and predicted values.
5. Prediction
Once trained, the model is used to predict y for new values of x by plugging them into
the regression equation.
6. Model Evaluation
The fitted model is evaluated with metrics such as R-squared (coefficient of determination),
Adjusted R-squared, and error measures like MSE, RMSE, and MAE, ideally on held-out test data.
Conclusion
Linear regression is a simple yet powerful algorithm for predictive modelling. However,
it is important to check whether its assumptions hold for a given dataset. If
assumptions are violated, alternative models like polynomial regression or
regularization techniques (Ridge/Lasso) may be more suitable.
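To make the steps above concrete, here is a minimal, self-contained sketch using scikit-learn on synthetic data; the data, true coefficients, and variable names are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data generated from y = 3 + 2*x1 - 1.5*x2 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Training: scikit-learn solves the least-squares problem internally
model = LinearRegression()
model.fit(X, y)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)

# Prediction and evaluation
y_pred = model.predict(X)
print("R^2:", r2_score(y, y_pred))
print("MSE:", mean_squared_error(y, y_pred))
```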
2. What is Anscombe's quartet? Explain it in detail.
ANSWER)
Anscombe’s Quartet is a set of four datasets that have nearly identical descriptive
statistics but display significantly different patterns when visualized. It was introduced
by statistician Francis Anscombe in 1973 to emphasize the importance of graphing
data rather than relying solely on numerical summaries.
Key Observations
All four datasets share similar mean, variance, correlation coefficient, and
regression line, yet their scatter plots reveal distinct relationships.
The quartet demonstrates how outliers, nonlinear relationships, and
influential points can distort statistical analysis.
Graphical Interpretation
Dataset I (Top-left): Shows a linear relationship, making it well-suited for simple
regression.
Dataset II (Top-right): Indicates a nonlinear pattern, meaning linear regression
is inappropriate.
Dataset III (Bottom-left): Contains an outlier that skews the regression line and
correlation.
Dataset IV (Bottom-right): Has a single high-leverage point, misleadingly
inflating the correlation.
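A short sketch, assuming seaborn's bundled "anscombe" dataset, confirms the near-identical summary statistics and makes the four different patterns visible:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load the four Anscombe datasets (columns: dataset, x, y)
df = sns.load_dataset("anscombe")

# Each dataset has almost the same mean, variance, and correlation
for name, group in df.groupby("dataset"):
    print(
        f"Dataset {name}: mean_x={group['x'].mean():.2f}, mean_y={group['y'].mean():.2f}, "
        f"var_x={group['x'].var():.2f}, var_y={group['y'].var():.2f}, "
        f"corr={group['x'].corr(group['y']):.3f}"
    )

# Plotting reveals the differences that the numbers hide
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, height=3)
plt.show()
```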
Example
Consider a dataset where we analyse students' study hours vs. exam scores:
The linear regression model would predict that exam scores increase
proportionally with study hours.
However, if one student (outlier) has 10 hours but scores only 40, it may
distort the regression results.
This example aligns with Dataset III of Anscombe’s Quartet, where an outlier
misleads the regression analysis.
Conclusion
Anscombe's Quartet shows that identical summary statistics can conceal very different
data patterns, so data should always be visualized before fitting or interpreting a
regression model.
3. What is Pearson's R?
ANSWER)
Pearson's R (the Pearson correlation coefficient, r) measures the strength and direction
of the linear relationship between two continuous variables and ranges from -1 to +1.
Interpretation of Pearson's R
Value of r    Interpretation
r = 1         Perfect positive correlation (as X increases, Y increases)
0 < r < 1     Positive correlation (stronger as it nears 1)
r = 0         No correlation (no linear relationship)
-1 < r < 0    Negative correlation (stronger as it nears -1)
r = -1        Perfect negative correlation (as X increases, Y decreases)
Example
For a dataset of students' study hours vs. exam scores in which scores rise steadily with
the hours studied, Pearson's r would be close to 1, indicating a strong positive
correlation between study hours and exam scores.
If the scores fluctuated randomly despite increasing hours, r would be closer to
0.
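As a quick illustration (the numbers below are made up), Pearson's r can be computed with scipy:

```python
from scipy.stats import pearsonr

# Hypothetical data: hours studied vs exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [35, 42, 50, 58, 61, 70, 75, 83]

r, p_value = pearsonr(hours, scores)
print(f"Pearson's r = {r:.3f} (p = {p_value:.4f})")  # r close to +1: strong positive correlation
```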
4. What is scaling? Why is it performed? What is the difference between normalized
scaling and standardized scaling?
ANSWER)
Scaling is the process of transforming numeric features so that they lie on a comparable
scale. It is performed because features with very different ranges can dominate distance-
and gradient-based algorithms and distort comparisons between coefficients.
Advantages of Scaling:
1. Equal Weightage: Ensures that all variables contribute equally to the analysis,
preventing variables with larger magnitudes from disproportionately influencing
the results.
2. Convergence: Many machine learning algorithms (e.g., KNN, SVM, Gradient
Descent-based models) perform better when features are on a similar scale.
Scaling helps in faster convergence during optimization.
3. Interpretability: Improves the interpretability of coefficients in linear models, as
the coefficients represent the change in the dependent variable for a one-unit
change in the predictor variable.
Normalized Scaling (Min-Max Scaling)
Range: Scales the values of a variable to a specific range, usually [0, 1] or [-1, 1].
Advantages: Useful when the distribution of the variable is unknown or not
Gaussian.
Disadvantages: Sensitive to outliers.
Standardized Scaling (Z-score Scaling)
Mean and Standard Deviation: Scales the values to have a mean of 0 and a
standard deviation of 1.
Advantages: Less sensitive to outliers than min-max scaling and well suited to
approximately Gaussian features.
Disadvantages: The resulting values are not bounded to a fixed range.
Example: Features such as income (tens of thousands) and age (tens) sit on very
different scales; after scaling, both contribute comparably to the model, as in the
sketch below.
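A minimal sketch with scikit-learn's scalers; the income and age values are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"income": [25000, 48000, 61000, 90000], "age": [22, 35, 41, 58]})

# Normalized (min-max) scaling: every value is mapped into [0, 1]
print(MinMaxScaler().fit_transform(df))

# Standardized scaling: each column gets mean 0 and standard deviation 1
print(StandardScaler().fit_transform(df))
```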
Conclusion:
Scaling is essential for improving model efficiency and ensuring fair feature
contribution.
5. You might have observed that sometimes the value of VIF is infinite. Why does this
happen?
ANSWER)
The Variance Inflation Factor of the i-th predictor is VIF_i = 1 / (1 − R_i²),
where R_i² is the coefficient of determination for that predictor regressed on all other
predictors.
VIF becomes infinite when R² = 1, meaning that one independent variable is perfectly
correlated (linearly dependent) with one or more other independent variables. A perfect
R² makes the denominator (1 − R²) equal to zero, so the VIF cannot be computed as a
finite number and the affected coefficients cannot be estimated uniquely.
Example:
X1 X2 X3 (Duplicate of X1)
10 5 10
20 10 20
30 15 30
40 20 40
An infinite VIF occurs due to perfect correlation among features. This should be
resolved to avoid instability in regression models.
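The table above can be reproduced with statsmodels. Because every column here is an exact linear function of X1 (X3 duplicates it and X2 is exactly half of it), the computed VIFs come out as infinite or, depending on floating-point behaviour, extremely large:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The example table: X3 is an exact copy of X1
df = pd.DataFrame({
    "X1": [10, 20, 30, 40],
    "X2": [5, 10, 15, 20],
    "X3": [10, 20, 30, 40],
})

X = df.values.astype(float)
for i, col in enumerate(df.columns):
    # Perfect correlation gives R^2 = 1, so 1 / (1 - R^2) blows up
    print(col, variance_inflation_factor(X, i))
```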
6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear
regression. (3 marks)
ANSWER)
A Q-Q (quantile-quantile) plot compares the quantiles of a sample (for example, the
model residuals) against the quantiles of a theoretical distribution, most commonly the
normal distribution.
If the points in the Q-Q plot fall along a straight line, it suggests that the data
follows the chosen theoretical distribution.
If the points deviate from the line, it indicates departures from normality, such
as skewness or heavy tails.
1. Normality Assessment:
o Use: Linear regression assumes that residuals (differences between
observed and predicted values) are normally distributed. The Q-Q plot
helps check this assumption.
o Importance: If residuals deviate significantly from normality, it may
affect the validity of hypothesis tests and confidence intervals in
regression.
2. Identifying Outliers:
o Use: Points that deviate significantly from the straight line in a Q-Q plot
indicate outliers in the dataset.
o Importance: Outliers can skew the model results and impact the
estimation of regression coefficients.
3. Model Fit Assessment:
o Use: A Q-Q plot visually assesses how well residuals conform to a normal
distribution.
o Importance: If the plot shows systematic deviations, it may indicate
poor model fit or the need for data transformation.
4. Validity of Statistical Tests:
o Use: Many statistical tests in regression, such as t-tests and F-tests,
assume normality of residuals.
o Importance: If this assumption is violated, the p-values and confidence
intervals derived from these tests may be unreliable.
If the points align closely with the diagonal line: The residuals are
approximately normally distributed.
If the points show curvature or heavy tails: The residuals deviate from
normality, suggesting skewness or outliers.
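A minimal sketch of a Q-Q plot of regression residuals, using synthetic data purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic regression and its residuals -- a stand-in for a real fitted model
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 + 3 * x + rng.normal(scale=1.5, size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot of residuals against the normal distribution; points hugging
# the reference line indicate approximately normal residuals
sm.qqplot(model.resid, line="s")
plt.show()
```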
Conclusion
Q-Q plots are powerful diagnostic tools in linear regression for assessing the
normality of residuals, detecting outliers, and validating statistical assumptions.
They provide a simple yet effective visual check for model correctness.