Subjective Questions Answers
1. Growth in Bookings (2019 vs. 2018): The year 2019 saw a notable increase in
bookings compared to 2018, indicating a positive growth trajectory for the business.
2. Seasonal Booking Patterns: Fall experienced a significant surge in bookings, while
all seasons showed considerable growth from 2018 to 2019, demonstrating a broad
increase in demand throughout the year.
3. Bookings on Non-Holiday Days: Booking numbers tended to be lower on non-
holiday days, possibly reflecting a preference for spending time at home with family
during these periods, rather than engaging in travel or leisure bookings.
4. Impact of Weather on Bookings: Clear weather conditions (classified as "Good" in
the dataset) had a significant positive effect on booking numbers, suggesting that
favourable weather encouraged more people to make bookings.
5. Even Distribution Across Working and Non-Working Days: Bookings were fairly
evenly spread across both working and non-working days, which could point to flexible
consumer behaviour, perhaps influenced by remote work arrangements or changing work
schedules.
6. Higher Bookings Toward the Weekend: Bookings increased noticeably on Thursday,
Friday, Saturday, and Sunday compared to earlier in the week, suggesting a preference for
booking toward and over the weekend, likely for leisure or travel.
7. Monthly Trends: The majority of bookings occurred between May and October,
with a consistent rise in bookings during the first half of the year, peaking mid-year,
and then declining towards the end, which is typical of seasonal travel patterns.
3. Looking at the pair-plot among the numerical variables, which one has the
highest correlation with the target variable? (1 mark)
ANSWER)
• The variable 'temp' shows the strongest correlation with the target variable, as observed in
the graph.
• Since 'temp' and 'atemp' are redundant variables, only one of them is chosen while
determining the best-fit line.
• This selection helps prevent multicollinearity and ensures a more accurate model.
cnt = 4491.30 + 1001 x yr + 1117.21 x temp - 426.72 x hum - 358.38 x windspeed
+ 322.14 x Summer + 530.93 x Winter + 225.16 x September - 39.38 x December
- 92.53 x January - 71.38 x November
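As a rough sketch, the correlation behind this choice can be checked directly, assuming the data is already loaded into a pandas DataFrame named df containing the numeric columns and the target cnt (the name df and the exact column list are assumptions here):

```python
import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["temp", "atemp", "hum", "windspeed", "cnt"]  # numeric columns plus target

# Pair-plot among the numeric variables and the target
sns.pairplot(df[num_cols])
plt.show()

# Correlation of each numeric variable with the target, strongest first
print(df[num_cols].corr()["cnt"].sort_values(ascending=False))
```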
4. How did you validate the assumptions of Linear Regression after building the
model on the training set? (3 marks)
ANSWER)
Validating the assumptions of linear regression is a crucial step after building the model
to ensure that the results are reliable and the model fits the data appropriately. Here’s a
detailed breakdown of how you can validate the assumptions of linear regression after
fitting the model on the training set:
1. Linearity
Assumption: The relationship between the independent variables and the dependent
variable should be linear.
• Validation Method:
o Residual Plot: Plot the residuals (the difference between the predicted and
actual values) against the fitted values. If the plot shows a random scatter
with no clear pattern, the linearity assumption is satisfied. If there’s a distinct
pattern (like a curve), it suggests that the relationship between the variables
may not be linear.
o Partial Regression Plots: For each predictor, you can plot the residuals of
the response variable against the residuals of that predictor to check for
linearity between each independent variable and the dependent variable.
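A minimal sketch of the residual plot described above, assuming the model was fitted with statsmodels and the fitted results object is named results (both the library choice and the name are assumptions):

```python
import matplotlib.pyplot as plt

fitted = results.fittedvalues   # predicted values on the training set
residuals = results.resid       # actual minus predicted

# A random scatter around the zero line supports the linearity assumption
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```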
2. Independence of Errors
Assumption: The residuals (errors) should be independent of each other, meaning no
autocorrelation between errors.
• Validation Method:
o Durbin-Watson Test: This statistical test helps to detect the presence of
autocorrelation in the residuals. A value close to 2 indicates no
autocorrelation; values below 1 or above 3 suggest the presence of
autocorrelation.
o Residual Plot over Time: If the data is time-series, you can plot the
residuals over time to check for any patterns. If the residuals show a trend
over time, there might be autocorrelation.
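A one-line check with statsmodels, again assuming a fitted results object named results:

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 indicate little or no autocorrelation in the residuals
print("Durbin-Watson:", durbin_watson(results.resid))
```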
3. Homoscedasticity (Constant Variance of Errors)
Assumption: The variance of the residuals should be constant across all levels of the
independent variables.
• Validation Method:
o Residual vs. Fitted Plot: After fitting the model, plot the residuals against
the fitted values. If the residuals fan out or contract as the fitted values
increase (forming a pattern like a cone or a bow-tie), it indicates
heteroscedasticity (non-constant variance). A random scatter of residuals
with no clear pattern suggests homoscedasticity.
o Breusch-Pagan Test: This test can be used to formally check for
heteroscedasticity. A significant p-value indicates heteroscedasticity.
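A short sketch of the Breusch-Pagan test, assuming results is the fitted statsmodels model and X_train is the DataFrame of predictors used to fit it (both names are assumptions):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# A small p-value (e.g. < 0.05) points to heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    results.resid, sm.add_constant(X_train)
)
print("Breusch-Pagan p-value:", lm_pvalue)
```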
4. Normality of Errors
Assumption: The residuals should follow a normal distribution. This assumption is
especially important for making valid inferences (e.g., confidence intervals, hypothesis
testing).
• Validation Method:
o Q-Q (Quantile-Quantile) Plot: Plot the quantiles of the residuals against
the quantiles of a normal distribution. If the points roughly follow a straight
line, the residuals are approximately normally distributed.
o Histogram of Residuals: Plot a histogram of the residuals. If the histogram
approximates a bell curve, the normality assumption is likely satisfied.
o Shapiro-Wilk Test / Kolmogorov-Smirnov Test: These are statistical tests
that can formally test the normality of residuals. A non-significant p-value
indicates that the residuals are approximately normal.
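A minimal sketch of the histogram and Shapiro-Wilk checks, assuming a fitted statsmodels results object named results (the Q-Q plot itself is sketched under question 6 below):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Histogram of residuals: a roughly bell-shaped curve supports normality
plt.hist(results.resid, bins=30, edgecolor="black")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()

# Shapiro-Wilk test: a non-significant p-value (> 0.05) is consistent with normality
stat, p_value = stats.shapiro(results.resid)
print("Shapiro-Wilk p-value:", p_value)
```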
5. No Multicollinearity
Assumption: The independent variables should not be highly correlated with each other.
High multicollinearity can inflate the standard errors of the coefficients and make the
model unstable.
• Validation Method:
o Variance Inflation Factor (VIF): Compute the VIF for each predictor
variable. A VIF greater than 10 (or some might use a threshold of 5)
indicates high multicollinearity, meaning the predictor variable is highly
correlated with the other predictors.
o Correlation Matrix: Examine the correlation matrix of the predictors to
ensure that no two variables are highly correlated (e.g., correlation greater
than 0.8).
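A sketch of the VIF and correlation checks, assuming X_train is a pandas DataFrame of the numeric predictors (the name is an assumption):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor: values above roughly 5-10 flag multicollinearity
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [variance_inflation_factor(X_train.values, i)
            for i in range(X_train.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))

# Pairwise correlations among predictors: watch for values above ~0.8
print(X_train.corr().round(2))
```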
6. No Outliers or Influential Data Points
Assumption: The model should not be unduly influenced by outliers or influential data
points, since such points can disproportionately affect the model's estimates and lead to
misleading conclusions.
• Validation Method:
o Leverage vs. Residuals Plot: This plot helps to identify influential points
that have a large effect on the model. Points with high leverage and large
residuals are the most influential.
o Cook’s Distance: Calculate Cook’s Distance for each data point. Points with
a Cook’s Distance greater than 1 might be influential and worth further
investigation.
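A small sketch of the influence diagnostics, again assuming a fitted statsmodels results object named results:

```python
import matplotlib.pyplot as plt

influence = results.get_influence()
cooks_d, _ = influence.cooks_distance   # Cook's distance for each observation

# Observations with Cook's distance above 1 deserve a closer look
plt.stem(cooks_d)
plt.axhline(1, color="red", linestyle="--")
plt.xlabel("Observation index")
plt.ylabel("Cook's distance")
plt.show()
```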
5. Based on the final model, which are the top 3 features contributing significantly towards
explaining the demand of the shared bikes? (2 marks)
ANSWER)
cnt = 4491.30 + 1001 x yr + 1117.21 x temp - 426.72 x hum - 358.38 x windspeed
+ 322.14 x Summer + 530.93 x Winter + 225.16 x September - 39.38 x December
- 92.53 x January - 71.38 x November
The following three features significantly contribute to explaining the demand for shared
bikes:
• Temperature (temp)
• Winter season (winter)
• Calendar year (year)
These three features play a crucial role in predicting bike demand, influencing user
behaviour and seasonal trends.
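As a rough sketch, the dominant features can be read off the fitted model by ranking the coefficients by absolute size, assuming a statsmodels results object named results fitted on scaled predictors (otherwise the raw magnitudes are not directly comparable; both assumptions are hypothetical here):

```python
# Drop the intercept and rank the remaining coefficients by magnitude
coefs = results.params.drop("const", errors="ignore")
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(3))
```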
ANSWER)
Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. It is widely used for predicting the
dependent variable based on given input values. The key idea is to find the best-fitting line
(or hyperplane in the case of multiple independent variables) that minimizes the difference
between actual and predicted values.
1. Model Representation
For simple linear regression (one independent variable):
y = b0 + b1⋅x + ϵ
For multiple linear regression (several independent variables):
y = b0 + b1⋅x1 + b2⋅x2 + ... + bn⋅xn + ϵ
where:
o y is the dependent variable
o x1, x2, ..., xn are the independent variables
o b0 is the intercept and b1, b2, ..., bn are the regression coefficients
o ϵ is the error term
2. Cost Function
The goal is to find values of b0, b1, ..., bn that minimize the error between actual and
predicted values. The Mean Squared Error (MSE) is commonly used as the cost function:
MSE = (1/n) Σ (yi - ŷi)^2
where yi is the actual value, ŷi is the predicted value, and n is the number of observations.
3. Optimization
To find the best-fit line, we need to minimize the cost function. This is typically done using
Ordinary Least Squares (a closed-form solution) or Gradient Descent (an iterative
optimization method).
4. Training
The algorithm is trained on a dataset where it learns the optimal coefficient values by
minimizing the error between actual and predicted values.
5. Prediction
Once trained, the model is used to predict y for new values of x by plugging them into the
regression equation.
6. Model Evaluation
• R^2 Score (Coefficient of Determination): Measures how well the model explains
variance in the dependent variable. Higher values indicate better fit.
• MSE (Mean Squared Error): Measures the average squared difference between
actual and predicted values. Lower MSE indicates better performance.
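An illustrative end-to-end sketch of these steps on synthetic data (not the bike-sharing dataset), using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))                             # two predictors
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, 200)   # known coefficients plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # learns b0, b1, b2 by least squares
y_pred = model.predict(X_test)                     # prediction on unseen data

print("Intercept:", model.intercept_, "Coefficients:", model.coef_)
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```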
7. Assumptions
For linear regression to provide accurate results, the following assumptions must hold: a
linear relationship between predictors and the target, independence of errors,
homoscedasticity (constant variance of errors), normality of residuals, and no strong
multicollinearity among the predictors.
Conclusion
Linear regression is a simple yet powerful algorithm for predictive modelling. However, it
is important to check whether its assumptions hold for a given dataset. If assumptions are
violated, alternative models like polynomial regression or regularization techniques
(Ridge/Lasso) may be more suitable.
ANSWER)
Anscombe’s Quartet is a set of four datasets that have nearly identical descriptive statistics
but display significantly different patterns when visualized. It was introduced by statistician
Francis Anscombe in 1973 to emphasize the importance of graphing data rather than
relying solely on numerical summaries.
Key Observations
• All four datasets share similar mean, variance, correlation coefficient, and regression
line, yet their scatter plots reveal distinct relationships.
• The quartet demonstrates how outliers, nonlinear relationships, and influential points
can distort statistical analysis.
Graphical Interpretation
When the four datasets are plotted, Dataset I shows a simple linear relationship, Dataset II a
clearly nonlinear (curved) pattern, Dataset III a linear trend distorted by a single outlier, and
Dataset IV a vertical cluster of points where one influential observation drives the fitted line.
Example
Consider a dataset where we analyse students' study hours vs. exam scores:
Student Study Hours (x) Exam Score (y)
A 2 50
B 4 60
C 6 70
D 8 80
E 10 90
• The linear regression model would predict that exam scores increase proportionally
with study hours.
• However, if one student (outlier) has 10 hours but scores only 40, it may distort the
regression results.
This example aligns with Dataset III of Anscombe’s Quartet, where an outlier misleads the
regression analysis.
Conclusion
Anscombe’s Quartet highlights the limitations of relying only on summary statistics and the
necessity of visualizing data before drawing conclusions in statistical modelling. It serves
as a reminder that data distributions can differ drastically despite having identical statistical
properties.
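A small sketch of this point, using the copy of the quartet bundled with seaborn (availability of the "anscombe" sample dataset is an assumption about the installed version):

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")

# Nearly identical summary statistics across the four datasets...
print(df.groupby("dataset")[["x", "y"]].mean())
print(df.groupby("dataset")[["x", "y"]].var())

# ...but clearly different relationships once plotted
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
plt.show()
```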
ANSWER)
Pearson's R (the Pearson correlation coefficient) measures the strength and direction of the
linear relationship between two variables, and always lies between -1 and +1.
Interpretation of Pearson’s R
Value of r Interpretation
r = 1 Perfect positive correlation (as X increases, Y increases)
0 < r < 1 Positive correlation (stronger as it nears 1)
r = 0 No correlation (no linear relationship)
-1 < r < 0 Negative correlation (stronger as it nears -1)
r = -1 Perfect negative correlation (as X increases, Y decreases)
Pearson’s R is a powerful tool for measuring linear relationships between variables, but it
does not imply causation. It should be used alongside data visualization and other statistical
analyses for accurate interpretation.
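A minimal sketch of computing Pearson's r, reusing the study-hours example from the previous answer:

```python
import numpy as np
from scipy import stats

hours = np.array([2, 4, 6, 8, 10])
scores = np.array([50, 60, 70, 80, 90])

r, p_value = stats.pearsonr(hours, scores)
print(f"r = {r:.2f}")   # r = 1.00 because the relationship is perfectly linear
```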
ANSWER)
Scaling in data pre-processing refers to the process of transforming the values of variables
to a specific range or distribution. The goal is to bring all variables to a similar scale,
making them comparable and preventing one variable from dominating others.
Advantages of Scaling:
1. Equal Weightage: Ensures that all variables contribute equally to the analysis,
preventing variables with larger magnitudes from disproportionately influencing the
results.
2. Convergence: Many machine learning algorithms (e.g., KNN, SVM, Gradient
Descent-based models) perform better when features are on a similar scale. Scaling
helps in faster convergence during optimization.
3. Interpretability: Improves the interpretability of coefficients in linear models, as the
coefficients represent the change in the dependent variable for a one-unit change in
the predictor variable.
Normalized Scaling (Min-Max Scaling):
• Range: Scales the values of a variable to a specific range, usually [0, 1] or [-1, 1].
• Advantages: Useful when the distribution of the variable is unknown or not
Gaussian.
• Disadvantages: Sensitive to outliers.
Standardized Scaling (Z-score Scaling):
• Mean and Standard Deviation: Scales the values to have a mean of 0 and a standard
deviation of 1.
• Advantages: Less affected by outliers than min-max scaling and useful when the
variable is approximately Gaussian.
Example:
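A small sketch contrasting the two approaches with scikit-learn on made-up values (the data below is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])   # note the outlier

# Normalized (min-max) scaling: squeezes every value into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardized (z-score) scaling: mean 0, standard deviation 1
print(StandardScaler().fit_transform(X).ravel())
```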
Conclusion:
Scaling is essential for improving model efficiency and ensuring fair feature contribution.
ANSWER)
The Variance Inflation Factor for a predictor is calculated as:
VIF = 1 / (1 - R^2)
where R^2 is the coefficient of determination for that predictor regressed on all other
predictors.
VIF becomes infinite when R^2 = 1, meaning that one independent variable is perfectly
correlated (linearly dependent) with one or more other independent variables. This leads to
division by zero in the VIF formula, so the computed VIF is infinite and the regression
coefficients cannot be estimated reliably.
Example: If a dataset contains a temperature column in Celsius and the same temperature
converted to Fahrenheit, one column is an exact linear function of the other, so R^2 = 1 for
that regression and the VIF of either column is infinite.
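A small sketch demonstrating this on hypothetical data (the column names are made up for illustration):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({"temp_c": [10.0, 15.0, 20.0, 25.0, 30.0],
                   "humidity": [40.0, 55.0, 60.0, 70.0, 65.0]})
df["temp_f"] = df["temp_c"] * 9 / 5 + 32   # perfectly collinear with temp_c

vifs = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
print(dict(zip(df.columns, vifs)))   # temp_c and temp_f show infinite (or enormous) VIF
```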
An infinite VIF occurs due to perfect correlation among features. This should be resolved to
avoid instability in regression models.
6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.
(3 marks)
ANSWER)
A Q-Q (quantile-quantile) plot is a graphical tool that compares the quantiles of a dataset
against the quantiles of a theoretical distribution (usually the normal distribution) to check
whether the data follows that distribution.
1. Checking Normality of Residuals: In linear regression, we use a Q-Q plot to check
the normality of residuals (the differences between the predicted and actual values).
• Straight Line: If the points follow a straight line, the residuals are normally
distributed.
• Curved or Deviating Points: If the points stray from the line, the residuals might
not be normal. This could indicate skewness, outliers, or that the model isn't a good
fit.
2. Identifying Skewness or Outliers: The plot helps to identify if the residuals are
skewed or if there are outliers that might be affecting the model.
3. Improving the Model: If the Q-Q plot shows issues with normality, you may need
to adjust your model by transforming the data or removing outliers.
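A minimal sketch of drawing the Q-Q plot with statsmodels, assuming a fitted results object named results (the name is an assumption):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Points hugging the 45-degree line indicate approximately normal residuals
sm.qqplot(results.resid, line="45", fit=True)
plt.title("Q-Q plot of residuals")
plt.show()
```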