Linear Regression Subjective Questions


Bike Sharing Assignment

Assignment-based Subjective Questions


1. From your analysis of the categorical variables from the dataset, what could you
infer about their effect on the dependent variable? (3 marks)
ANSWER)

• The year 2019 saw a higher number of bookings compared to 2018, indicating positive business growth.
• The fall season experienced a significant rise in bookings, and overall, all seasons showed a substantial increase from 2018 to 2019.
• Booking counts were lower on holidays, likely because people preferred spending time at home with family.
• Clear weather conditions (labelled as "Good" in the dataset) had a notable impact on attracting more bookings.
• Bookings were evenly distributed between working and non-working days.
• Higher booking rates were observed on Thursday, Friday, Saturday, and Sunday compared to the earlier days of the week.
• Most bookings occurred between May and October, with a steady increase from the start of the year, peaking mid-year, and then declining towards the end.

2. Why is it important to use drop_first=True during dummy variable creation?
(2 marks)
ANSWER)

• Dummy variable creation is a technique in statistical modelling and machine learning used to represent categorical variables with binary values (0 or 1).
• It involves generating binary (dummy) variables for each category of the original categorical variable to indicate the presence or absence of a specific category.
• The number of dummy variables created depends on the number of categories. For a categorical variable with n categories, n - 1 dummy variables are typically used.
Rationale for the n - 1 Dummy Variables Rule:
1. Avoiding Multicollinearity: Creating n dummy variables introduces multicollinearity, as one category can be predicted from the others. Using n - 1 dummy variables prevents this issue.
2. Avoiding Redundancy: The omitted category is implicitly captured by the constant term in the model, preventing redundant information.
3. Enhancing Interpretability: The coefficients of dummy variables indicate the change in the response variable compared to the omitted category, making the model easier to interpret.
Example:
If a categorical variable "Color" has three categories: Red, Blue, and Green, we create
two dummy variables:

• Is_Blue (1 if Blue, 0 otherwise)
• Is_Green (1 if Green, 0 otherwise)
• If both are 0, the color is Red (omitted category).
Implementation in Python:
• Using pandas, we can create dummy variables with drop_first=True to automatically drop one category and follow the n - 1 rule:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
df_dummies = pd.get_dummies(df, drop_first=True)
print(df_dummies)
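Note: pd.get_dummies drops the first category in sorted order, so for this particular data it is the Blue column that gets dropped, leaving Color_Green and Color_Red (encoded as 0/1 or True/False depending on the pandas version); a row with both columns equal to 0 then represents Blue, the omitted category.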

• This ensures a proper dummy variable representation while avoiding multicollinearity.
3. Looking at the pair-plot among the numerical variables, which one has the
highest correlation with the target variable? (1 mark)

ANSWER)
• The variable 'temp' shows the strongest correlation with the target variable, as observed in the pair-plot.
• Since 'temp' and 'atemp' are redundant variables, only one of them is chosen while determining the best-fit line.
• This selection helps prevent multicollinearity and ensures a more accurate model.
The equation of the best-fit line obtained from the final model is:

cnt = 4491.30 + 1001 x yr + 1117.21 x temp - 426.72 x hum - 358.38 x windspeed + 322.14 x Summer + 530.93 x Winter + 225.16 x September - 39.38 x December - 92.53 x January - 71.38 x November
4. How did you validate the assumptions of Linear Regression after building the
model on the training set? (3 marks)
ANSWER)
Validating the assumptions of linear regression is a crucial step to ensure the reliability of the
model. After building the model on the training set, here are the steps I followed to validate the
assumptions:

1. Residual Analysis

• Process: Examine the residuals (differences between observed and predicted values).
• Check:
o Residuals should be approximately normally distributed (checked via a histogram or Q-Q plot).
o Residual plots should show no discernible pattern (randomly scattered points).
o The mean of residuals should be close to zero, ensuring no systematic bias.

Residual analysis helps verify if the model's assumptions hold and if any adjustments
are needed for better predictions.
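As a quick sketch of these checks in Python (assuming lm is the fitted model and X_train, y_train are the training data used in the assignment; the names are illustrative):

import matplotlib.pyplot as plt

y_train_pred = lm.predict(X_train)
residuals = y_train - y_train_pred   # observed minus predicted

# Histogram of residuals: should look roughly normal and centred near zero
plt.hist(residuals, bins=25)
plt.xlabel('Residuals')
plt.show()

# Residuals vs. predicted values: points should be randomly scattered around zero
plt.scatter(y_train_pred, residuals)
plt.axhline(0, color='red')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()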
2. Homoscedasticity

• Process: Ensure residuals have a constant variance across all levels of the independent variables.
• Check:
o Residuals vs. predicted values should show no clear pattern; they should be randomly scattered.
o If the spread of residuals increases or decreases systematically, it indicates heteroscedasticity (non-constant variance).
o The Breusch-Pagan test or the Goldfeld-Quandt test can formally detect heteroscedasticity.

Homoscedasticity is crucial for maintaining the reliability of confidence intervals and avoiding biased estimates.
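A minimal sketch of the formal check with statsmodels' Breusch-Pagan test (assuming residuals from the fitted model and X_train_sm, the training design matrix including the constant column; the names are illustrative):

from statsmodels.stats.diagnostic import het_breuschpagan

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X_train_sm)
print('Breusch-Pagan p-value:', lm_pvalue)   # a small p-value (< 0.05) suggests heteroscedasticity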
3. Linearity

• Process: Verify that the relationship between the independent variables and the target variable is linear.
• Check:
o Scatter plots of residuals vs. predicted values should show no clear pattern; randomly dispersed points indicate linearity.
o If residuals form a curve or systematic pattern, it suggests a non-linear relationship, meaning a transformation or a non-linear model may be needed.
o Checking correlation coefficients between independent and dependent variables can also help assess linearity.

Ensuring linearity is essential for an accurate linear regression model, as non-linearity can lead to poor predictions.
4. Independence of Residuals

• Process: Ensure that residuals are not correlated with each other.
• Check:
o Use the Durbin-Watson test (values close to 2 indicate no autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation, respectively).
o Residual plots over time should show random patterns rather than trends or cycles.

Independence of residuals is crucial to avoid biased standard errors and misleading significance tests.
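A one-line sketch using statsmodels (again assuming residuals from the fitted model):

from statsmodels.stats.stattools import durbin_watson

# Close to 2 = no autocorrelation; near 0 or 4 = positive or negative autocorrelation
print('Durbin-Watson statistic:', durbin_watson(residuals))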

5. Multicollinearity

• Process: Detect high correlations among independent variables.
• Check:
o Compute the Variance Inflation Factor (VIF): VIF > 5 or 10 indicates multicollinearity.
o Check the correlation matrix; high correlations (> 0.8) between independent variables suggest multicollinearity.

Multicollinearity can distort regression coefficients, making interpretation difficult. Removing redundant variables or using Principal Component Analysis (PCA) can help.
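A short sketch of the VIF computation with statsmodels (assuming X_train is a DataFrame of the numeric predictors, typically with a constant column added beforehand; the names are illustrative):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of each predictor regressed against all the others
vif = pd.DataFrame({
    'Feature': X_train.columns,
    'VIF': [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
})
print(vif.sort_values('VIF', ascending=False))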

6. Cross-Validation

• Process: Assess model performance on different data subsets.
• Check:
o Use k-fold cross-validation (commonly 5-fold or 10-fold) to ensure model stability.
o Compare training and validation errors to check for consistency.

Cross-validation helps generalize the model, reducing overfitting risks.
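A minimal sketch with scikit-learn's k-fold cross-validation (assuming X and y hold the features and target; the names are illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation, scoring each fold with R^2
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print('Fold R^2 scores:', scores)
print('Mean R^2:', scores.mean())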

7. Check for Overfitting

• Process: Identify whether the model is too complex and fits the noise instead of the actual pattern.
• Check:
o Compare training vs. test accuracy: a much higher training accuracy than test accuracy suggests overfitting.
o Use regularization techniques (Lasso, Ridge) to control complexity.
o Learning curves can reveal how well the model generalizes across different dataset sizes.

Preventing overfitting ensures better real-world performance and avoids misleading predictions.

5. Based on the final model, which are the top 3 features contributing significantly
towards explaining the demand of the shared bikes? (2 marks)

ANSWER)

From the equation of the best-fit line:

cnt = 4491.30 + 1001 x yr + 1117.21 x temp - 426.72 x hum - 358.38 x windspeed + 322.14 x Summer + 530.93 x Winter + 225.16 x September - 39.38 x December - 92.53 x January - 71.38 x November

The following three features significantly contribute to explaining the demand for
shared bikes:

• Temperature (temp)
• Winter season (winter)
• Calendar year (year)

• Temperature (temp): A higher temperature is associated with an increase in bike demand (coefficient = 1117.21), indicating that warmer weather encourages more people to use shared bikes.
• Winter season (Winter): The positive coefficient (530.93) suggests that, despite the cold, winter sees a significant impact on demand, likely due to factors such as increased commuting needs or improved biking infrastructure.
• Calendar year (Year): The coefficient (1001) indicates a consistent increase in bike-sharing demand over the years, reflecting a growing preference for shared bikes.

These three features play a crucial role in predicting bike demand, influencing user
behavior and seasonal trends.
General Subjective Questions
1. Explain the linear regression algorithm in detail. (4 marks)

ANSWER)

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used for predicting the dependent variable based on given input values. The key idea is to find the best-fitting line (or hyperplane in the case of multiple independent variables) that minimizes the difference between actual and predicted values.

Steps in Linear Regression Algorithm

1. Model Representation

• Simple Linear Regression (One Independent Variable)

The equation is:

y = b0 + b1·x + ϵ

where:

o y = dependent variable (target variable)
o x = independent variable (feature)
o b0 = intercept (constant term)
o b1 = slope of the line
o ϵ = error term

• Multiple Linear Regression (Multiple Independent Variables)

The equation extends to:

y = b0 + b1·x1 + b2·x2 + ... + bn·xn + ϵ

where:

o x1, x2, ..., xn are the independent variables
o b1, b2, ..., bn are the regression coefficients

2. Objective Function (Cost Function)

The goal is to find the values of b0, b1, ..., bn that minimize the error between actual and predicted values. The Mean Squared Error (MSE) is commonly used as the cost function:

MSE = (1/n) Σ (yi - ŷi)²

where:

• n = number of data points
• yi = actual value
• ŷi = predicted value

3. Optimization (Minimization of Error)

To find the best-fit line, we need to minimize the cost function. This is done using:

• Gradient Descent: Iteratively updates the coefficients in the direction of decreasing error. The update rule for each coefficient bj is:

bj := bj - α · ∂MSE/∂bj

where α is the learning rate.

• Normal Equation: Directly calculates the optimal values of the coefficients without iteration (used when the dataset is small).

4. Training the Model

The algorithm is trained on a dataset where it learns the optimal coefficient values by
minimizing the error between actual and predicted values.

5. Prediction

Once trained, the model is used to predict y for new values of x by plugging them into
the regression equation.

6. Model Evaluation

The model’s performance is measured using:

• R² Score (Coefficient of Determination): Measures how well the model explains the variance in the dependent variable. Higher values indicate a better fit.
• MSE (Mean Squared Error): Measures the average squared difference between actual and predicted values. A lower MSE indicates better performance.
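The steps above can be illustrated with a short scikit-learn sketch on synthetic data (the data here is made up purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3 + 2*x plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training: learn the intercept (b0) and slope (b1) by minimizing squared error
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction and evaluation on unseen data
y_pred = model.predict(X_test)
print('Intercept (b0):', model.intercept_)
print('Slope (b1):', model.coef_[0])
print('MSE:', mean_squared_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))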

7. Assumptions of Linear Regression


For linear regression to provide accurate results, the following assumptions must hold:

1. Linearity: The relationship between the independent and dependent variables is linear.
2. Independence: Residuals (errors) should not be correlated.
3. Homoscedasticity: The variance of residuals should remain constant across all values of the independent variables.
4. Normality: Residuals should be normally distributed.
5. No Multicollinearity: Independent variables should not be highly correlated with each other.

Conclusion

Linear regression is a simple yet powerful algorithm for predictive modelling. However,
it is important to check whether its assumptions hold for a given dataset. If
assumptions are violated, alternative models like polynomial regression or
regularization techniques (Ridge/Lasso) may be more suitable.

2. Explain the Anscombe’s quartet in detail. (3 marks)

ANSWER)

Anscombe’s Quartet is a set of four datasets that have nearly identical descriptive
statistics but display significantly different patterns when visualized. It was introduced
by statistician Francis Anscombe in 1973 to emphasize the importance of graphing
data rather than relying solely on numerical summaries.

Key Observations

• All four datasets share similar mean, variance, correlation coefficient, and regression line, yet their scatter plots reveal distinct relationships.
• The quartet demonstrates how outliers, nonlinear relationships, and influential points can distort statistical analysis.

Graphical Interpretation
• Dataset I (Top-left): Shows a linear relationship, making it well-suited for simple regression.
• Dataset II (Top-right): Indicates a nonlinear pattern, meaning linear regression is inappropriate.
• Dataset III (Bottom-left): Contains an outlier that skews the regression line and correlation.
• Dataset IV (Bottom-right): Has a single high-leverage point, misleadingly inflating the correlation.
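As a quick sketch, the quartet ships as a sample dataset with seaborn, so the near-identical summary statistics can be verified directly (assuming seaborn is installed and its sample datasets are available):

import seaborn as sns

df = sns.load_dataset('anscombe')   # columns: dataset, x, y

# Nearly identical means/variances and x-y correlations for all four datasets
print(df.groupby('dataset')[['x', 'y']].agg(['mean', 'var']))
print(df.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])))

# Plotting reveals the very different patterns
sns.lmplot(data=df, x='x', y='y', col='dataset', col_wrap=2, ci=None)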

Example

Consider a dataset where we analyse students' study hours vs. exam scores:

Student  Study Hours (x)  Exam Score (y)
A        2                50
B        4                60
C        6                70
D        8                80
E        10               90

• The linear regression model would predict that exam scores increase proportionally with study hours.
• However, if one student (an outlier) has 10 hours but scores only 40, it may distort the regression results.

This example aligns with Dataset III of Anscombe’s Quartet, where an outlier
misleads the regression analysis.
Conclusion

Anscombe’s Quartet highlights the limitations of relying only on summary statistics and the necessity of visualizing data before drawing conclusions in statistical modelling. It serves as a reminder that data distributions can differ drastically despite having identical statistical properties.

3. What is Pearson’s R? (3 marks)

ANSWER)

Pearson’s correlation coefficient (Pearson’s R) is a statistical measure that quantifies the strength and direction of a linear relationship between two variables. It is denoted by r and ranges from -1 to 1.

Interpretation of Pearson’s R

Value of r    Interpretation
r = 1         Perfect positive correlation (as X increases, Y increases)
0 < r < 1     Positive correlation (stronger as it nears 1)
r = 0         No correlation (no linear relationship)
-1 < r < 0    Negative correlation (stronger as it nears -1)
r = -1        Perfect negative correlation (as X increases, Y decreases)

Example

Consider the relationship between study hours and exam scores:

Student  Study Hours (X)  Exam Score (Y)
A        2                50
B        4                60
C        6                70
D        8                80
E        10               90

• The Pearson’s r for this dataset is exactly 1, indicating a perfect positive linear relationship between study hours and exam scores (each additional two hours of study adds ten marks).
• If the scores fluctuated randomly despite increasing hours, r would be closer to 0.
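A minimal sketch confirming this with NumPy/SciPy, using the study-hours data from the table above:

import numpy as np
from scipy import stats

hours = np.array([2, 4, 6, 8, 10])
scores = np.array([50, 60, 70, 80, 90])

r, p_value = stats.pearsonr(hours, scores)
print('Pearson r:', r)                                      # 1.0 here, since the points lie on a perfect line
print('numpy corrcoef:', np.corrcoef(hours, scores)[0, 1])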

Pearson’s R is a powerful tool for measuring linear relationships between variables, but it does not imply causation. It should be used alongside data visualization and other statistical analyses for accurate interpretation.

4. What is scaling? Why is scaling performed? What is the difference between normalized scaling and standardized scaling? (3 marks)

ANSWER)

Scaling in data pre-processing refers to the process of transforming the values of variables to a specific range or distribution. The goal is to bring all variables to a similar scale, making them comparable and preventing one variable from dominating others.

Advantages of Scaling:

1. Equal Weightage: Ensures that all variables contribute equally to the analysis,
preventing variables with larger magnitudes from disproportionately influencing
the results.
2. Convergence: Many machine learning algorithms (e.g., KNN, SVM, Gradient
Descent-based models) perform better when features are on a similar scale.
Scaling helps in faster convergence during optimization.
3. Interpretability: Improves the interpretability of coefficients in linear models, as
the coefficients represent the change in the dependent variable for a one-unit
change in the predictor variable.

Differences between Normalized Scaling and Standardized Scaling:

1. Normalized Scaling (Min-Max Scaling):

• Range: Scales the values of a variable to a specific range, usually [0, 1] or [-1, 1].
• Advantages: Useful when the distribution of the variable is unknown or not Gaussian.
• Disadvantages: Sensitive to outliers.

2. Standardized Scaling (Z-score Normalization):

• Mean and Standard Deviation: Scales the values to have a mean of 0 and a standard deviation of 1.
• Advantages: Less sensitive to outliers; preserves the shape of the distribution.
• Disadvantages: Assumes that the variable follows a Gaussian distribution.

Example:

Consider a dataset with house prices (in $1000s):

House  Price ($1000s)
A      200
B      400
C      600
D      800
E      1000

• Normalized Scaling (Min-Max): Converts values to the range [0, 1], ensuring no value dominates.
• Standardized Scaling (Z-score): Centers the data around mean 0 with unit variance, making it comparable to normal distributions.
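A small sketch with scikit-learn's scalers on the house-price column above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

prices = np.array([[200], [400], [600], [800], [1000]])   # prices in $1000s

# Min-Max scaling: maps the values onto [0, 1]
print(MinMaxScaler().fit_transform(prices).ravel())        # [0.   0.25 0.5  0.75 1.  ]

# Standardization: mean 0, standard deviation 1
print(StandardScaler().fit_transform(prices).ravel())      # approx. [-1.41 -0.71  0.    0.71  1.41]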

Conclusion:

Scaling is essential for improving model efficiency and ensuring fair feature
contribution.

• Normalization is preferred for bounded data.
• Standardization is useful for normally distributed data.
5. You might have observed that sometimes the value of VIF is infinite. Why does
this happen? (3 marks)

ANSWER)

The Variance Inflation Factor (VIF) is a measure used to detect multicollinearity in a dataset. It quantifies how much the variance of a regression coefficient is inflated due to correlation among the independent variables. For predictor i,

VIF_i = 1 / (1 - R_i²)

where R_i² is the coefficient of determination obtained by regressing predictor i on all the other predictors.

VIF becomes infinite when R² = 1, meaning that one independent variable is perfectly correlated (linearly dependent) with one or more other independent variables. This leads to:

• Perfect Multicollinearity: If a predictor is an exact linear combination of other predictors, its VIF becomes infinite, indicating severe multicollinearity.
• Singular Matrix Issue: When perfect correlation exists, the design matrix becomes non-invertible, causing computational issues in regression models.
• Dummy Variable Trap: If categorical variables are one-hot encoded incorrectly (e.g., including all k categories instead of k - 1), VIF can become infinite due to redundancy.

Example:

Consider a dataset with three independent variables:

X1 X2 X3 (Duplicate of X1)
10 5 10
20 10 20
30 15 30
40 20 40

• Here, X1 and X3 are identical (X3 = X1), leading to perfect multicollinearity.
• This results in R² = 1 for X3, making its VIF infinite.

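A small sketch reproducing this with statsmodels on the tiny table above (with perfect linear dependence, the VIF is reported as inf or an extremely large number, possibly with a divide-by-zero warning, depending on the library version):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({'X1': [10, 20, 30, 40],
                  'X2': [5, 10, 15, 20],
                  'X3': [10, 20, 30, 40]})   # X3 is an exact copy of X1 (and X2 = X1/2)

for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))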
Solution to Avoid Infinite VIF:

1. Remove Highly Correlated Features – Drop one of the redundant variables.
2. Use Principal Component Analysis (PCA) – Transform correlated features into uncorrelated components.
3. Regularization Techniques – Ridge regression can help mitigate multicollinearity.
4. Check Dummy Variables – Ensure correct one-hot encoding to avoid redundancy.

An infinite VIF occurs due to perfect correlation among features. This should be
resolved to avoid instability in regression models.

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear
regression. (3 marks)

ANSWER)

A Quantile-Quantile (Q-Q) plot is a graphical tool used to assess whether a dataset follows a particular theoretical distribution, such as the normal distribution. It compares the quantiles of the observed data against the quantiles of the expected distribution.

• If the points in the Q-Q plot fall along a straight line, it suggests that the data follows the chosen theoretical distribution.
• If the points deviate from the line, it indicates departures from normality, such as skewness or heavy tails.

Use and Importance of Q-Q Plot in Linear Regression

1. Normality Assessment:
o Use: Linear regression assumes that residuals (differences between
observed and predicted values) are normally distributed. The Q-Q plot
helps check this assumption.
o Importance: If residuals deviate significantly from normality, it may
affect the validity of hypothesis tests and confidence intervals in
regression.
2. Identifying Outliers:
o Use: Points that deviate significantly from the straight line in a Q-Q plot
indicate outliers in the dataset.
o Importance: Outliers can skew the model results and impact the
estimation of regression coefficients.
3. Model Fit Assessment:
o Use: A Q-Q plot visually assesses how well residuals conform to a normal
distribution.
o Importance: If the plot shows systematic deviations, it may indicate
poor model fit or the need for data transformation.
4. Validity of Statistical Tests:
o Use: Many statistical tests in regression, such as t-tests and F-tests,
assume normality of residuals.
o Importance: If this assumption is violated, the p-values and confidence
intervals derived from these tests may be unreliable.

Interpretation of Q-Q Plots

• If the points align closely with the diagonal line: The residuals are approximately normally distributed.
• If the points show curvature or heavy tails: The residuals deviate from normality, suggesting skewness or outliers.
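A minimal sketch of drawing the plot with statsmodels (assuming residuals from a fitted regression model are available; the name is illustrative):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Q-Q plot of residuals against a normal distribution; line='45' draws the reference diagonal
sm.qqplot(residuals, line='45', fit=True)
plt.title('Q-Q plot of residuals')
plt.show()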

Q-Q plots are powerful diagnostic tools in linear regression for assessing the
normality of residuals, detecting outliers, and validating statistical assumptions.
They provide a simple yet effective visual check for model correctness.
