ML Assignment Subjective Questions Answers

The document discusses various aspects of data analysis and linear regression, including the impact of categorical variables on bike rentals, the importance of using drop_first=True in dummy variable creation, and the validation of linear regression assumptions. It also explains linear regression, Anscombe's quartet, Pearson's R, scaling methods, VIF, and the significance of Q-Q plots in assessing data distribution. Key findings include the top features affecting bike demand and the necessity of data visualization for accurate modeling.


Assignment-based subjective questions:

1. From your analysis of the categorical variables from the dataset, what could you infer
about their effect on the dependent variable?

Season → Fall had the highest number of bike rentals, whereas spring had the least.

Weathersit → Rentals are high in clear and partly cloudy weather conditions and low in light snow.

Weekday → The count is almost the same throughout the week.

Mnth → September has the highest rentals while January has the least, as expected from the weather conditions.

2. Why is it important to use drop_first=True during dummy variable creation?

Dummy variables are created for every distinct value of a feature. For example, if a colour column has three values (black, white and beige), the dummy-variable command creates three columns, which is redundant: any one column is fully determined by the other two. drop_first=True removes the first dummy column, eliminating this redundancy (the dummy variable trap) and the multicollinearity it introduces. It also reduces the number of columns, which improves performance noticeably when the dataset has many categorical features.
For the season feature we would ideally get 4 columns for its 4 distinct values, but with drop_first=True the season_fall column is dropped; a row with '000' in the three remaining season columns is therefore identified as fall.
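A minimal sketch of this behaviour with pandas get_dummies, using a toy season column:

```python
import pandas as pd

df = pd.DataFrame({"season": ["spring", "summer", "fall", "winter"]})

# Without drop_first: one dummy column per distinct value (redundant)
full = pd.get_dummies(df["season"])
print(list(full.columns))     # ['fall', 'spring', 'summer', 'winter']

# With drop_first=True: the first category ('fall') is dropped;
# a row of all zeros in the remaining columns therefore means 'fall'
reduced = pd.get_dummies(df["season"], drop_first=True)
print(list(reduced.columns))  # ['spring', 'summer', 'winter']
```

Note that pandas sorts categories alphabetically before dropping, so here the dropped (baseline) category is fall.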

3. Looking at the pair-plot among the numerical variables, which one has the highest
correlation with the target variable? (1 mark)
From the pair-plot, the temp and atemp features show the highest correlation with the target variable cnt.

4. How did you validate the assumptions of Linear Regression after building the model on the
training set? (3 marks)

1. Error terms should follow a normal distribution, as linear regression assumes. The distribution plot of the residuals is approximately normal and centred at zero.
2. Linear regression assumes that the relationship between the dependent and independent
variables is linear. We visualised this using a pair-plot of the numeric variables.

3. There should be no multicollinearity among the predictors. We calculated the VIF at each
modelling step, eliminated the variables that were highly correlated, and thereby measured how much
collinearity each variable carries; the final model contains variables that are not strongly correlated
with each other.
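As an illustration of the residual check, the sketch below fits an ordinary least-squares model to synthetic data with NumPy (all names and numbers here are illustrative, not from the assignment notebook) and verifies that the residuals are centred at zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Fit by ordinary least squares (design matrix with an intercept column)
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Residuals of an OLS fit with an intercept should average to zero
print(round(residuals.mean(), 6))

# Rough normality check: roughly 68% of residuals within one standard deviation
within_1sd = np.mean(np.abs(residuals - residuals.mean()) < residuals.std())
print(round(within_1sd, 2))
```

In practice one would also plot a histogram or Q-Q plot of the residuals rather than rely on a single summary number.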

5. Based on the final model, which are the top 3 features contributing significantly towards
explaining the demand of the shared bikes? (2 marks)

The top 3 features contributing significantly are:

- yr, coefficient 0.2343
- temp, coefficient 0.4352
- weathersit_light_rain_and_thunderstorm, coefficient -0.2961

General Subjective questions:

1. Explain the linear regression algorithm in detail.

Linear Regression is a supervised learning algorithm that models the relationship between a
dependent variable and one or more independent variables, and uses that relationship to predict
the outcome of future events.

Ex: predicting sales based on factors such as product prices and marketing spend.

Linear Regression model is of two types: Simple linear regression and Multiple Linear Regression

Simple Linear Regression:


It involves one independent variable and one dependent variable and is the simplest form of linear
regression:

y = β0 + β1X

β0 = intercept
β1 = slope
y = dependent variable
X = independent variable

Multiple Linear Regression :


It involves more than one independent variable and one dependent variable, and its equation is:

y = β0 + β1X1 + β2X2 + β3X3 + ... + βnXn

β0 = intercept
β1, β2, ..., βn = slopes
y = dependent variable
X1, X2, ..., Xn = independent variables
The goal of linear regression algorithm is to find the best fit line that predicts the outcome based on
independent variables .
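As a small illustration, the best-fit line of a simple linear regression can be recovered with NumPy on synthetic data (the true intercept 1 and slope 2 are chosen for the example):

```python
import numpy as np

# Synthetic data following y = 1 + 2x, plus a little noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=x.size)

# Least-squares fit of y = b0 + b1 * x (polyfit returns highest degree first)
b1, b0 = np.polyfit(x, y, deg=1)
print(round(b0, 2), round(b1, 2))  # close to the true intercept 1 and slope 2
```

With noisier data the recovered coefficients drift further from the true values, which is exactly the estimation error the least-squares criterion minimises.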

2. Explain the Anscombe’s quartet in detail.

Anscombe's quartet was constructed by the statistician Francis Anscombe in 1973 and consists of four
datasets with nearly identical statistical properties, such as mean, variance and R², that nevertheless
show very different trends when plotted. It was designed to highlight the importance of exploratory
data analysis and of visualising data, because similar summary statistics alone can be misleading
when deriving insights.

Each of the four datasets consists of 11 x-y pairs. When plotted, each shows a distinct pattern of
variability between x and y, yet all four share the same summary statistics, such as the correlation
coefficient and the fitted linear regression line.

The quartet shows the importance of data visualisation and how easy it is to fool a regression
algorithm. Before applying any machine learning algorithm, we should first visualise the dataset in
order to build a well-fitting model.
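This can be checked numerically. The sketch below reproduces the x/y values of the first two published datasets and verifies their identical mean and correlation in plain Python:

```python
# Anscombe's first two datasets (values from the published quartet)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Pearson correlation coefficient, computed from its definition."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = sum((ai - ma) ** 2 for ai in a) ** 0.5
    sb = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return cov / (sa * sb)

# Same mean and same correlation with x, yet the scatter plots differ:
# y1 is roughly linear with noise, y2 is a smooth parabola.
print(round(sum(y1) / len(y1), 2), round(sum(y2) / len(y2), 2))  # 7.5 7.5
print(round(pearson(x, y1), 3), round(pearson(x, y2), 3))        # 0.816 0.816
```

Only a plot reveals that the first dataset is noisy-linear while the second is a clean curve, which is exactly Anscombe's point.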

3. What is Pearson’s R?

Pearson's R describes the strength and direction of the linear relationship between two variables. It
lies between -1 and 1: a value between 0 and 1 indicates positive correlation, zero indicates no linear
correlation, and a value between 0 and -1 indicates negative correlation. It is the most widely used
correlation coefficient and is a good choice when the following conditions hold:

- both variables are quantitative
- both variables are normally distributed
- the data have no outliers
- the relationship between the variables is linear

It is also an inferential statistic, meaning it can be used to test statistical hypotheses; specifically,
we can test whether there is a significant linear relationship between two variables.
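A minimal example of computing Pearson's R, assuming NumPy and made-up hours-studied/exam-score data:

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 83])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is R
r = np.corrcoef(hours, score)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear relationship
```

The same value can be obtained with scipy.stats.pearsonr, which additionally returns a p-value for the hypothesis test mentioned above.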

4. What is scaling? Why is scaling performed? What is the difference between normalized scaling
and standardized scaling?

Scaling is a pre-processing step performed on the independent variables to bring the data into a
particular range, which helps the algorithm's calculations.

It is performed because, most of the time, the features in a dataset have different magnitudes and
units. If scaling is not done, the fitted coefficients are hard to compare and interpret, since the
algorithm takes only magnitudes into account and not units, leading to a poorly conditioned model.
To solve this, scaling is applied so that all features lie in the same magnitude/range.

There are 2 types of scaling.

1. Normalised /Min-Max Scaling


2. Standardization scaling

Min-Max Scaling:

Scaled values lie in the range 0 to 1.

x' = (x - min(x)) / (max(x) - min(x))

Standardization scaling:

It brings all of the data onto a standard normal scale, with zero mean and unit standard
deviation.

x' = (x - mean(x)) / sd(x)

A disadvantage of normalisation is that it compresses outliers into the [0, 1] range, losing some
information about them that standardization preserves.
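Both formulas can be sketched directly with NumPy on toy data (sklearn's MinMaxScaler and StandardScaler implement the equivalent transformations):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max (normalised) scaling: maps values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # 0, 0.25, 0.5, 0.75, 1

# Standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()
print(round(x_std.mean(), 10), round(x_std.std(), 10))  # mean 0, std 1
```

Note that the min-max output depends directly on the extreme values, which is why outliers distort it more than standardization.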

5. You might have observed that sometimes the value of VIF is infinite. Why does this happen?

VIF is calculated using the formula:

VIF = 1 / (1 - R²)

where R² comes from regressing the given variable on all the other independent variables. If R²
equals 1, the denominator becomes zero and VIF is infinite. This denotes perfect correlation: the
variable is an exact linear combination of the other variables. A large VIF indicates strong
correlation among the variables, and the ranges commonly used when modelling are:

- VIF = 1: no multicollinearity
- VIF = 4-5: moderate multicollinearity
- VIF ≥ 10: severe multicollinearity

If a VIF is large, we need to act before proceeding with multiple regression: drop the feature with
the largest VIF, recompute VIF for the remaining features, and repeat until the values are acceptable.
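The computation can be sketched from the definition above (this is a from-scratch NumPy helper, not a library implementation; statsmodels provides an equivalent variance_inflation_factor function):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept + other features
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)                        # independent of a -> low VIF
c = 2 * a + rng.normal(scale=0.01, size=100)    # almost a copy of a -> huge VIF

X = np.column_stack([a, b, c])
print(round(vif(X, 1), 2))  # near 1: b is uncorrelated with a and c
print(vif(X, 2) > 100)      # c is (almost) perfectly explained by a
```

If c were exactly 2 * a, R² would be exactly 1 and the division would blow up, which is the infinite-VIF case described above.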

6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.

Q-Q (Quantile-Quantile) plots plot the quantiles of a sample distribution against the quantiles of a
theoretical distribution. A Q-Q plot is a graphical tool for assessing whether a set of data comes
from a theoretical distribution, such as a normal or uniform distribution, and for checking whether
two datasets come from populations with a common distribution.
In linear regression, a Q-Q plot of the residuals is commonly used to check the normality
assumption. It is also helpful when the training and test datasets are received separately, since it
lets us confirm that both come from populations with the same distribution.
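A rough numeric sketch of the Q-Q idea, using only the Python standard library (the sample and the probability points chosen are illustrative):

```python
import random
import statistics

random.seed(0)
# A sample we believe to be standard normal
sample = sorted(random.gauss(0, 1) for _ in range(1000))

nd = statistics.NormalDist(0, 1)
# Q-Q pairs: empirical quantile vs. theoretical quantile at the same probability
probs = (0.1, 0.25, 0.5, 0.75, 0.9)
qq = [(sample[int(p * len(sample))], nd.inv_cdf(p)) for p in probs]
for s_q, t_q in qq:
    print(round(s_q, 2), round(t_q, 2))
# For normally distributed data the pairs lie close to the line y = x
```

A Q-Q plot simply draws these pairs for all probabilities; systematic departures from the diagonal (curved tails, S-shapes) reveal skewness or heavy tails in the sample.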
