
Bike Sharing Assignment

Assignment-based Subjective Questions


1. From your analysis of the categorical variables from the dataset, what could you infer
about their effect on the dependent variable? (3 marks)
ANSWER)

1. Growth in Bookings (2019 vs. 2018): The year 2019 saw a notable increase in
bookings compared to 2018, indicating a positive growth trajectory for the business.
2. Seasonal Booking Patterns: Fall experienced a significant surge in bookings, while
all seasons showed considerable growth from 2018 to 2019, demonstrating a broad
increase in demand throughout the year.
3. Bookings on Holidays: Booking numbers tended to be lower on holidays than on regular days, possibly reflecting a preference for spending holidays at home with family rather than engaging in travel or leisure bookings.
4. Impact of Weather on Bookings: Clear weather conditions (classified as "Good" in
the dataset) had a significant positive effect on booking numbers, suggesting that
favourable weather encouraged more people to make bookings.
5. Even Distribution Across Weekdays: Bookings were fairly evenly spread across
both working and non-working days, which could point to flexible consumer
behaviour, perhaps influenced by remote work arrangements or changing work
schedules.
6. Higher Bookings Toward the Weekend: A notable increase in bookings was observed on Thursday, Friday, Saturday, and Sunday compared to earlier in the week, suggesting a preference for making bookings toward the weekend, likely for leisure or travel.
7. Monthly Trends: The majority of bookings occurred between May and October,
with a consistent rise in bookings during the first half of the year, peaking mid-year,
and then declining towards the end, which is typical of seasonal travel patterns.

2. Why is it important to use drop_first=True during dummy variable creation?

ANSWER)
When creating dummy variables (also known as one-hot encoding) from categorical data,
it's important to set the drop_first=True argument in certain cases to avoid multicollinearity
and improve the stability of the model.
Here’s why it's important:
1. Avoiding Multicollinearity
• Multicollinearity occurs when one or more independent variables are highly
correlated with each other, which can make it difficult to interpret the coefficients of
a regression model and lead to overfitting.
• When we create dummy variables for a categorical feature, we create a separate
column for each category. However, if we include all of these dummy variables in
the model, they will be perfectly correlated (because if one is 1, the others must be
0, and vice versa).
• By setting drop_first=True, we drop one of the dummy variable columns (usually
the first category). This creates a baseline category, and the remaining categories are
compared against this baseline. This avoids perfect multicollinearity and ensures
that the model doesn't become overly complex.
2. Model Interpretation
• The dropped category (the one not included as a dummy variable) serves as the
reference or baseline category. The coefficients for the remaining dummy variables
show how the other categories differ from this baseline.
• Without dropping the first category, it can be harder to interpret the model because
all categories would be treated independently without any baseline for comparison.
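
For illustration, a minimal pandas sketch (the 'season' column and its category names are hypothetical, not taken from the assignment notebook):

import pandas as pd

df = pd.DataFrame({"season": ["spring", "summer", "fall", "winter"]})

# Without drop_first: one column per category; the columns always sum to 1,
# so any one of them is a perfect linear combination of the others.
print(pd.get_dummies(df["season"]).sum(axis=1).unique())            # [1]

# With drop_first=True: the first category (alphabetically 'fall') is dropped
# and becomes the baseline; only k - 1 = 3 dummy columns remain.
print(pd.get_dummies(df["season"], drop_first=True).columns.tolist())
# ['spring', 'summer', 'winter']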

3. Looking at the pair-plot among the numerical variables, which one has the
highest correlation with the target variable? (1 mark)

ANSWER)
• The variable 'temp' shows the strongest correlation with the target variable, as observed in
the graph.
• Since 'temp' and 'atemp' are redundant variables, only one of them is chosen while
determining the best-fit line.
• This selection helps prevent multicollinearity and ensures a more accurate model.
cnt = 4491.30 + 1001 × yr + 1117.21 × temp − 426.72 × hum − 358.38 × windspeed + 322.14 × Summer + 530.93 × Winter + 225.16 × September − 39.38 × December − 92.53 × January − 71.38 × November
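
For reference, a short sketch of how such a pair-plot and correlation check might be produced (bike_df is an assumed name for the DataFrame of numeric columns):

import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["temp", "atemp", "hum", "windspeed", "cnt"]

# Pair-plot of the numeric variables against each other and the target 'cnt'
sns.pairplot(bike_df[num_cols])
plt.show()

# Correlation of each numeric variable with the target
print(bike_df[num_cols].corr()["cnt"].sort_values(ascending=False))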

4. How did you validate the assumptions of Linear Regression after building the
model on the training set? (3 marks)
ANSWER)
Validating the assumptions of linear regression is a crucial step after building the model, ensuring that the results are reliable and that the model fits the data appropriately. The checks below cover each assumption in turn; a combined code sketch illustrating them follows the list.
1. Linearity
Assumption: The relationship between the independent variables and the dependent
variable should be linear.
• Validation Method:
o Residual Plot: Plot the residuals (the difference between the predicted and
actual values) against the fitted values. If the plot shows a random scatter
with no clear pattern, the linearity assumption is satisfied. If there’s a distinct
pattern (like a curve), it suggests that the relationship between the variables
may not be linear.
o Partial Regression Plots: For each predictor, you can plot the residuals of
the response variable against the residuals of that predictor to check for
linearity between each independent variable and the dependent variable.
2. Independence of Errors
Assumption: The residuals (errors) should be independent of each other, meaning no
autocorrelation between errors.
• Validation Method:
o Durbin-Watson Test: This statistical test helps to detect the presence of
autocorrelation in the residuals. A value close to 2 indicates no
autocorrelation; values below 1 or above 3 suggest the presence of
autocorrelation.
o Residual Plot over Time: If the data is time-series, you can plot the
residuals over time to check for any patterns. If the residuals show a trend
over time, there might be autocorrelation.
3. Homoscedasticity (Constant Variance of Errors)
Assumption: The variance of the residuals should be constant across all levels of the
independent variables.
• Validation Method:
o Residual vs. Fitted Plot: After fitting the model, plot the residuals against
the fitted values. If the residuals fan out or contract as the fitted values
increase (forming a pattern like a cone or a bow-tie), it indicates
heteroscedasticity (non-constant variance). A random scatter of residuals
with no clear pattern suggests homoscedasticity.
o Breusch-Pagan Test: This test can be used to formally check for
heteroscedasticity. A significant p-value indicates heteroscedasticity.
4. Normality of Errors
Assumption: The residuals should follow a normal distribution. This assumption is
especially important for making valid inferences (e.g., confidence intervals, hypothesis
testing).
• Validation Method:
o Q-Q (Quantile-Quantile) Plot: Plot the quantiles of the residuals against
the quantiles of a normal distribution. If the points roughly follow a straight
line, the residuals are approximately normally distributed.
o Histogram of Residuals: Plot a histogram of the residuals. If the histogram
approximates a bell curve, the normality assumption is likely satisfied.
o Shapiro-Wilk Test / Kolmogorov-Smirnov Test: These are statistical tests
that can formally test the normality of residuals. A non-significant p-value
indicates that the residuals are approximately normal.
5. No Multicollinearity
Assumption: The independent variables should not be highly correlated with each other.
High multicollinearity can inflate the standard errors of the coefficients and make the
model unstable.
• Validation Method:
o Variance Inflation Factor (VIF): Compute the VIF for each predictor
variable. A VIF greater than 10 (or some might use a threshold of 5)
indicates high multicollinearity, meaning the predictor variable is highly
correlated with the other predictors.
o Correlation Matrix: Examine the correlation matrix of the predictors to
ensure that no two variables are highly correlated (e.g., correlation greater
than 0.8).
6. No Outliers or Influential Data Points
Assumption: Outliers or influential data points can disproportionately affect the model's
estimates, leading to misleading conclusions.
• Validation Method:
o Leverage vs. Residuals Plot: This plot helps to identify influential points
that have a large effect on the model. Points with high leverage and large
residuals are the most influential.
o Cook’s Distance: Calculate Cook’s Distance for each data point. Points with
a Cook’s Distance greater than 1 might be influential and worth further
investigation.
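
A combined sketch of these checks, assuming the model was built with statsmodels on a pandas training set X_train / y_train (variable names are illustrative):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)        # add the intercept column
model = sm.OLS(y_train, X_const).fit()
residuals = model.resid
fitted = model.fittedvalues

# 1 & 3. Linearity / homoscedasticity: residuals vs. fitted values
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

# 2. Independence of errors: Durbin-Watson (a value near 2 means no autocorrelation)
print("Durbin-Watson:", sm.stats.durbin_watson(residuals))

# 3. Homoscedasticity: Breusch-Pagan test (a small p-value indicates heteroscedasticity)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, X_const)
print("Breusch-Pagan p-value:", bp_pvalue)

# 4. Normality of errors: Q-Q plot and Shapiro-Wilk test
sm.qqplot(residuals, line="45", fit=True)
plt.show()
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)

# 5. Multicollinearity: VIF for each predictor
vif = [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
print(dict(zip(X_const.columns, np.round(vif, 2))))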

5. Based on the final model, which are the top 3 features contributing significantly towards
explaining the demand of the shared bikes? (2 marks)

ANSWER)

From the equation of the best fit line:

cnt = 4491.30 + 1001 × yr + 1117.21 × temp − 426.72 × hum − 358.38 × windspeed + 322.14 × Summer + 530.93 × Winter + 225.16 × September − 39.38 × December − 92.53 × January − 71.38 × November

The following three features significantly contribute to explaining the demand for shared
bikes:
• Temperature (temp)
• Winter season (winter)
• Calendar year (year)

• Temperature (temp): A higher temperature is associated with an increase in bike demand (coefficient = 1117.21), indicating that warmer weather encourages more people to use shared bikes.
• Winter season (Winter): The positive coefficient (530.93) suggests that, despite the cold, winter still has a significant positive impact on demand, likely due to factors such as commuting needs or improved biking infrastructure.
• Calendar year (yr): The coefficient (1001) indicates a consistent increase in bike-sharing demand over the years, reflecting a growing preference for shared bikes.

These three features play a crucial role in predicting bike demand, influencing user
behaviour and seasonal trends.

General Subjective Questions


1. Explain the linear regression algorithm in detail. (4 marks)

ANSWER)

Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables. It is widely used for predicting the
dependent variable based on given input values. The key idea is to find the best-fitting line
(or hyperplane in the case of multiple independent variables) that minimizes the difference
between actual and predicted values.

Steps in Linear Regression Algorithm

1. Model Representation

• Simple Linear Regression (One Independent Variable)

The equation is:

y = b0 + b1⋅x + ϵ

where:

o y = dependent variable (target variable)
o x = independent variable (feature)
o b0 = intercept (constant term)
o b1 = slope of the line
o ϵ = error term
• Multiple Linear Regression (Multiple Independent Variables)

The equation extends to:

y = b0 + b1⋅x1 + b2⋅x2 + ... + bn⋅xn + ϵ

where:

o x1, x2, ..., xn are the independent variables
o b1, b2, ..., bn are the regression coefficients

2. Objective Function (Cost Function)

The goal is to find values of b0, b1, ..., bn that minimize the error between actual and predicted values. The Mean Squared Error (MSE) is commonly used as the cost function:

MSE = (1/n) Σ (yi − ŷi)^2

where:

• n = number of data points
• yi = actual value
• ŷi = predicted value

3. Optimization (Minimization of Error)

To find the best-fit line, we need to minimize the cost function. This is done using:

• Gradient Descent: Iteratively updates the coefficients in the direction of decreasing error. The update rule for each coefficient bj is:

bj := bj − α ⋅ ∂MSE/∂bj

where α is the learning rate.

• Normal Equation: Directly calculates the optimal values of the coefficients without iteration, using b = (X^T X)^(-1) X^T y (practical when the dataset is small).

4. Training the Model

The algorithm is trained on a dataset where it learns the optimal coefficient values by
minimizing the error between actual and predicted values.
5. Prediction

Once trained, the model is used to predict y for new values of x by plugging them into the
regression equation.

6. Model Evaluation

The model’s performance is measured using:

• R^2 Score (Coefficient of Determination): Measures how well the model explains
variance in the dependent variable. Higher values indicate better fit.
• MSE (Mean Squared Error): Measures the average squared difference between
actual and predicted values. Lower MSE indicates better performance.
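
Putting steps 4–6 together, a minimal scikit-learn sketch (the feature matrix X and target y are assumed to already exist):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)        # step 4: learn b0 (intercept_) and b1..bn (coef_)

y_pred = model.predict(X_test)     # step 5: plug new x values into the fitted equation

# step 6: evaluate the model
print("R^2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))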

7. Assumptions of Linear Regression

For linear regression to provide accurate results, the following assumptions must hold:

1. Linearity: The relationship between independent and dependent variables is linear.


2. Independence: Residuals (errors) should not be correlated.
3. Homoscedasticity: The variance of residuals should remain constant across all
values of the independent variable.
4. Normality: Residuals should be normally distributed.
5. No Multicollinearity: Independent variables should not be highly correlated with
each other.

Conclusion

Linear regression is a simple yet powerful algorithm for predictive modelling. However, it
is important to check whether its assumptions hold for a given dataset. If assumptions are
violated, alternative models like polynomial regression or regularization techniques
(Ridge/Lasso) may be more suitable.

2. Explain the Anscombe’s quartet in detail. (3 marks)

ANSWER)

Anscombe’s Quartet is a set of four datasets that have nearly identical descriptive statistics
but display significantly different patterns when visualized. It was introduced by statistician
Francis Anscombe in 1973 to emphasize the importance of graphing data rather than
relying solely on numerical summaries.

Key Observations
• All four datasets share similar mean, variance, correlation coefficient, and regression
line, yet their scatter plots reveal distinct relationships.
• The quartet demonstrates how outliers, nonlinear relationships, and influential points
can distort statistical analysis.

Graphical Interpretation

• Dataset I (Top-left): Shows a linear relationship, making it well-suited for simple regression.
• Dataset II (Top-right): Indicates a nonlinear pattern, meaning linear regression is
inappropriate.
• Dataset III (Bottom-left): Contains an outlier that skews the regression line and
correlation.
• Dataset IV (Bottom-right): Has a single high-leverage point, misleadingly inflating
the correlation.
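
These patterns can be reproduced with seaborn, which ships a copy of the quartet as an example dataset (a brief sketch):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")     # columns: dataset, x, y

# Nearly identical summary statistics for all four datasets
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))   # r ≈ 0.816 for each

# Plotting the four datasets side by side reveals the differences the statistics hide
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()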

Example

Consider a dataset where we analyse students' study hours vs. exam scores:
Student Study Hours (x) Exam Score (y)
A 2 50
B 4 60
C 6 70
D 8 80
E 10 90
• The linear regression model would predict that exam scores increase proportionally
with study hours.
• However, if one student (outlier) has 10 hours but scores only 40, it may distort the
regression results.

This example aligns with Dataset III of Anscombe’s Quartet, where an outlier misleads the
regression analysis.
Conclusion

Anscombe’s Quartet highlights the limitations of relying only on summary statistics and the
necessity of visualizing data before drawing conclusions in statistical modelling. It serves
as a reminder that data distributions can differ drastically despite having identical statistical
properties.

3. What is Pearson’s R? (3 marks)

ANSWER)

Pearson’s correlation coefficient (Pearson’s R) is a statistical measure that quantifies the strength and direction of a linear relationship between two variables. It is denoted by r, ranges from -1 to 1, and is computed as

r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)^2 ⋅ Σ (yi − ȳ)^2 )

Interpretation of Pearson’s R
Value of r Interpretation
r = 1 Perfect positive correlation (as X increases, Y increases)
0 < r < 1 Positive correlation (stronger as it nears 1)
r = 0 No correlation (no linear relationship)
-1 < r < 0 Negative correlation (stronger as it nears -1)
r = -1 Perfect negative correlation (as X increases, Y decreases)
Example

Consider the relationship between study hours and exam scores:


Student Study Hours (X) Exam Score (Y)
A 2 50
B 4 60
C 6 70
D 8 80
E 10 90
• The Pearson’s r for this dataset is exactly 1, since the points lie on a perfectly straight line, indicating a strong positive correlation between study hours and exam scores.
• If the scores fluctuated randomly despite increasing hours, r would be closer to 0.
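
A small sketch computing r for the table above with scipy (numpy's corrcoef gives the same value):

import numpy as np
from scipy import stats

hours = np.array([2, 4, 6, 8, 10])
scores = np.array([50, 60, 70, 80, 90])

r, p_value = stats.pearsonr(hours, scores)
print(round(r, 3))                         # 1.0 -- the points lie exactly on a line

# Adding a hypothetical outlier (10 hours, score 40) pulls r well below 1
r_out, _ = stats.pearsonr(np.append(hours, 10), np.append(scores, 40))
print(round(r_out, 3))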

Pearson’s R is a powerful tool for measuring linear relationships between variables, but it
does not imply causation. It should be used alongside data visualization and other statistical
analyses for accurate interpretation.

4. What is scaling? Why is scaling performed? What is the difference between normalized scaling and standardized scaling? (3 marks)

ANSWER)

Scaling in data pre-processing refers to the process of transforming the values of variables to a specific range or distribution. The goal is to bring all variables to a similar scale, making them comparable and preventing one variable from dominating others.

Advantages of Scaling:

1. Equal Weightage: Ensures that all variables contribute equally to the analysis,
preventing variables with larger magnitudes from disproportionately influencing the
results.
2. Convergence: Many machine learning algorithms (e.g., KNN, SVM, Gradient
Descent-based models) perform better when features are on a similar scale. Scaling
helps in faster convergence during optimization.
3. Interpretability: Improves the interpretability of coefficients in linear models, as the
coefficients represent the change in the dependent variable for a one-unit change in
the predictor variable.

Differences between Normalized Scaling and Standardized Scaling:

1. Normalized Scaling (Min-Max Scaling):

• Formula: x_scaled = (x − x_min) / (x_max − x_min)
• Range: Scales the values of a variable to a specific range, usually [0, 1] or [-1, 1].
• Advantages: Useful when the distribution of the variable is unknown or not Gaussian.
• Disadvantages: Sensitive to outliers.

2. Standardized Scaling (Z-score Normalization):

• Formula: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation.
• Mean and Standard Deviation: Scales the values to have a mean of 0 and a standard deviation of 1.
• Advantages: Less sensitive to outliers than min-max scaling; preserves the shape of the distribution.
• Disadvantages: Assumes that the variable follows a Gaussian distribution.

Example:

Consider a dataset with house prices (in $1000s):


House Price ($1000s)
A 200
B 400
C 600
D 800
E 1000
• Normalized Scaling (Min-Max):
o Converts the values to the range [0, 1], ensuring no value dominates.
• Standardized Scaling (Z-score):
o Centres the data around a mean of 0 with unit variance, making it comparable to a standard normal distribution.
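
A brief scikit-learn sketch applying both scalers to the house-price example above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

prices = np.array([[200.0], [400.0], [600.0], [800.0], [1000.0]])   # in $1000s

print(MinMaxScaler().fit_transform(prices).ravel())
# [0.   0.25 0.5  0.75 1.  ]  -> mapped onto the [0, 1] range

print(StandardScaler().fit_transform(prices).ravel())
# approx. [-1.41 -0.71  0.    0.71  1.41]  -> mean 0, unit variance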

Conclusion:

Scaling is essential for improving model efficiency and ensuring fair feature contribution.

• Normalization is preferred for bounded data.


• Standardization is useful for normally distributed data.
5. You might have observed that sometimes the value of VIF is infinite. Why does
this happen? (3 marks)

ANSWER)

Variance Inflation Factor (VIF) is a measure used to detect multicollinearity in a dataset. It quantifies how much the variance of a regression coefficient is inflated due to correlation among the independent variables. For predictor i it is defined as:

VIF_i = 1 / (1 − R_i^2)

where R_i^2 is the coefficient of determination obtained when predictor i is regressed on all the other predictors.

VIF becomes infinite when R_i^2 = 1, meaning that one independent variable is perfectly correlated (linearly dependent) with one or more of the other independent variables. This leads to:

• Perfect Multicollinearity: If a predictor is an exact linear combination of other predictors, its VIF becomes infinite, indicating severe multicollinearity.
• Singular Matrix Issue: When perfect correlation exists, the design matrix becomes non-invertible, causing computational issues in regression models.
• Dummy Variable Trap: If categorical variables are one-hot encoded incorrectly (e.g., including all k categories instead of k − 1), VIF can become infinite due to redundancy.

Example:

Consider a dataset with three independent variables:


X1 X2 X3 (Duplicate of X1)
10 5 10
20 10 20
30 15 30
40 20 40
• Here, X1 and X3 are identical (X3 = X1), leading to perfect multicollinearity.
• This results in R^2 = 1 when X3 is regressed on the other predictors, making its VIF infinite.
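
This can be reproduced with statsmodels (a small sketch; note that in this toy table X2 also happens to be an exact multiple of X1, so every predictor's VIF comes out infinite, or astronomically large after floating-point round-off):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({"X1": [10, 20, 30, 40],
                   "X2": [5, 10, 15, 20],
                   "X3": [10, 20, 30, 40]})    # X3 duplicates X1

for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(df.values, i))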

Solution to Avoid Infinite VIF:


1. Remove Highly Correlated Features – Drop one of the redundant variables.
2. Use Principal Component Analysis (PCA) – Transform correlated features into
uncorrelated components.
3. Regularization Techniques – Ridge regression can help mitigate multicollinearity.
4. Check Dummy Variables – Ensure correct one-hot encoding to avoid redundancy.

An infinite VIF occurs due to perfect correlation among features. This should be resolved to
avoid instability in regression models.

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.
(3 marks)

ANSWER)

A Q-Q (Quantile-Quantile) plot is a graphical tool used to check whether a dataset follows a specific distribution, usually the normal distribution. It plots the observed data's quantiles against the expected quantiles of a theoretical distribution (such as the normal distribution). If the points lie along a straight line, the data is likely normally distributed.

Using a Q-Q Plot in Linear Regression:

In linear regression, we use a Q-Q plot to check the normality of residuals (the differences
between the predicted and actual values).

• X-axis: Quantiles of the normal distribution.

• Y-axis: Quantiles of the residuals.

• Straight line: Ideal case if residuals follow a normal distribution.

Interpreting a Q-Q Plot:

• Straight Line: If the points follow a straight line, the residuals are normally
distributed.

• Curved or Deviating Points: If the points stray from the line, the residuals might
not be normal. This could indicate skewness, outliers, or that the model isn't a good
fit.
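
A minimal sketch with scipy, assuming residuals = model.resid comes from an already-fitted regression:

import matplotlib.pyplot as plt
from scipy import stats

# Quantiles of the residuals plotted against quantiles of a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()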

Q-Q Plot Importance in Linear Regression

1. Normality of Residuals: One assumption in linear regression is that the residuals should be normally distributed. A Q-Q plot helps check whether this assumption holds, which is important for making valid predictions and confidence intervals.

2. Identifying Skewness or Outliers: The plot helps to identify if the residuals are
skewed or if there are outliers that might be affecting the model.
3. Improving the Model: If the Q-Q plot shows issues with normality, you may need
to adjust your model by transforming the data or removing outliers.
