Assignment-Based Subjective Questions/Answers

The document contains questions and answers related to linear regression analysis. It includes questions about interpreting categorical variables, creating dummy variables, and identifying important features in a linear regression model. The answers explain the effects of categorical variables, why drop_first=True is important in dummy variable creation, and which three features contribute most to the shared bike demand model.

Uploaded by

rahul

Assignment-based Subjective Questions/Answers

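Q1. From your analysis of the categorical variables from the dataset, what could you infer about their effect on the
dependent variable?
A1. The categorical variables have a clear effect on the target variable (cnt), as the coefficients of the best-fit line show.
The equation of our best-fit line is:
cnt = 0.2464*yr - 0.0820*holiday - 0.2353*windspeed + 0.0459*summer + 0.0823*winter + 0.1026*Aug - 0.1653*Dec -
0.2074*Feb - 0.2759*Jan - 0.1111*Nov + 0.1212*Sep - 0.3090*Light Snow - 0.0922*Mist
Variables with positive coefficients (yr, summer, winter, Aug, Sep) are associated with higher demand, and those with
negative coefficients (holiday, Dec, Feb, Jan, Nov, Light Snow, Mist) with lower demand. windspeed is a numeric
predictor rather than a categorical one.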
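The size of each effect can be read off the equation by ranking the coefficients by absolute value. The sketch below copies the coefficients from the equation above; the helper name rank_by_magnitude is illustrative, and comparing magnitudes this way assumes the predictors were scaled to comparable ranges before fitting.

```python
# Coefficients copied from the best-fit equation in A1.
coefficients = {
    "yr": 0.2464,
    "holiday": -0.0820,
    "windspeed": -0.2353,
    "summer": 0.0459,
    "winter": 0.0823,
    "Aug": 0.1026,
    "Dec": -0.1653,
    "Feb": -0.2074,
    "Jan": -0.2759,
    "Nov": -0.1111,
    "Sep": 0.1212,
    "Light Snow": -0.3090,
    "Mist": -0.0922,
}


def rank_by_magnitude(coefs):
    """Return (name, coefficient) pairs sorted by absolute value, largest first."""
    return sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = rank_by_magnitude(coefficients)
for name, value in ranked[:3]:
    # The sign shows the direction of the effect, the magnitude its size.
    print(f"{name}: {value:+.4f}")
```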

Q2. Why is it important to use drop_first=True during dummy variable creation?

A2. 1. drop_first=True drops one of the dummy columns created for each categorical variable. The dropped column is
redundant, so removing it reduces the correlation created among the dummy variables.

2. Suppose a categorical column has 3 possible values (furnished, semi_furnished, unfurnished) and we create dummy
variables for it. If a row is neither furnished nor semi_furnished, it is obviously unfurnished, so we do not need a
3rd variable to identify unfurnished rows.

Hence, a categorical variable with n levels needs only n-1 columns to represent the dummy variables.
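The n-1 rule can be seen with a small pandas example. This is a minimal sketch on made-up data; the column name furnishing is an assumption. Note that pandas drops the first level in sorted order, which here is furnished rather than unfurnished, but the result carries the same information.

```python
import pandas as pd

# Hypothetical toy data with a 3-level categorical column.
df = pd.DataFrame(
    {"furnishing": ["furnished", "semi_furnished", "unfurnished", "furnished"]}
)

# Without drop_first: one dummy column per level (3 columns).
all_levels = pd.get_dummies(df["furnishing"])

# With drop_first=True: the first level is dropped (n-1 = 2 columns).
reduced = pd.get_dummies(df["furnishing"], drop_first=True)

print(list(all_levels.columns))
print(list(reduced.columns))
```

A row with zeros in both remaining columns is identified as the dropped level, so nothing is lost.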

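Q5. Based on the final model, which are the top 3 features contributing significantly towards explaining the
demand of the shared bikes?

A5. Correlation, p-value and VIF are the criteria used to select features, not features themselves. Judging by the
magnitude of the coefficients in the best-fit equation given in A1 (and assuming the predictors are on comparable
scales), the top 3 features contributing towards explaining the demand of the shared bikes are:

1. Light Snow (coefficient -0.3090)
2. Jan (coefficient -0.2759)
3. yr (coefficient 0.2464)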

General Subjective Questions

Q1. Explain the linear regression algorithm in detail.
A1. Linear regression is a supervised machine learning algorithm that performs a regression task: it models a numeric
target value based on one or more independent variables. It is mostly used to find the relationship between variables
and to forecast. Regression models differ in the kind of relationship they assume between the dependent and
independent variables and in the number of independent variables they use.
Simple linear regression predicts the value of a dependent variable (y) from a given independent variable (x) by
finding a linear relationship between x (input) and y (output). Hence the name linear regression. The equation is
given by: Y = b1 + b2*X, where b1 is the intercept and b2 is the slope.
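To make the equation concrete, here is a minimal sketch that estimates b1 and b2 from data. It assumes ordinary least squares as the fitting criterion, which the answer does not state explicitly; the function name fit_line and the sample data are illustrative.

```python
def fit_line(x, y):
    """Estimate intercept b1 and slope b2 of Y = b1 + b2*X by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b2 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b1 = mean_y - b2 * mean_x
    return b1, b2


# Made-up data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
b1, b2 = fit_line(x, y)
print(b1, b2)
```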
Q2. Explain the Anscombe’s quartet in detail.
A2. Anscombe's quartet is a group of four data sets that are nearly identical in simple descriptive statistics (such as
the mean and variance of the x and y values) but have very different distributions and look very different when plotted
on scatter plots. These peculiarities can fool a regression model built on the data. The quartet shows the importance
of visualising data before applying algorithms to build models: plotting the features shows the distribution of the
samples and helps identify anomalies such as outliers, the diversity of the data, and whether the data are linearly
separable. It also shows that linear regression is a good fit only for data with a linear relationship and handles
other kinds of data poorly.
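The effect can be checked numerically. The sketch below uses the commonly published values for the first two of Anscombe's four data sets (they share the same x values) and compares their means and least-squares lines; the helper name summarize is illustrative.

```python
# Commonly published values for Anscombe's data sets I and II (shared x).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def summarize(x, y):
    """Return (mean of y, least-squares slope, least-squares intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y, slope, mean_y - slope * mean_x


stats1 = summarize(x, y1)
stats2 = summarize(x, y2)
print(stats1)
print(stats2)
```

Both sets give nearly the same mean and fitted line, although a scatter plot shows set I as roughly linear and set II as a smooth curve.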
Q3. What is Pearson’s R?
A3. Pearson's r (Pearson's correlation coefficient) measures the strength and direction of the linear relationship
between two quantitative variables. It ranges from -1 (perfect negative linear relationship) through 0 (no linear
relationship) to +1 (perfect positive linear relationship), and it should be used only when the relationship between
the variables is linear.
It can be used to test an associative or a causal research hypothesis (RH), but not an attributive RH, because an
attributive RH is univariate while a correlation involves two variables. An associative RH can be tested with a
correlation whenever we want. A causal RH can be tested with a correlation only when a well-run true experiment has
been conducted. Correlation is used for testing in within-groups studies.
A possible research hypothesis for this statistical model is that there is a positive linear relationship between the
variables. Another is that there is a negative linear relationship. If the correlation is significant (p < .05), the
null hypothesis is rejected and the relationship, positive or negative, can be interpreted. If there is no significant
linear relationship between the variables, we retain the null hypothesis.
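For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that computation follows; the function name pearson_r and the sample data are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # y rises with x
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises
print(r_positive, r_negative)
```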
Q4. What is scaling? Why is scaling performed? What is the difference between normalized scaling and standardized
scaling?
A4. Scaling is a data pre-processing step applied to the independent variables to bring their values within a
particular range. It also helps speed up the calculations in an algorithm.

Most of the time, a collected data set contains features that vary widely in magnitude, unit and range. If scaling is
not done, the algorithm takes only the magnitude into account and not the units, which leads to incorrect modelling.
To solve this issue, we scale all the variables to the same level of magnitude. It is important to note that scaling
affects only the coefficients and none of the other parameters, such as the t-statistic, F-statistic, p-values and
R-squared.

Normalization/Min-Max Scaling: It brings all of the data into the range 0 to 1. sklearn.preprocessing.MinMaxScaler
implements normalization in Python.

MinMax Scaling : x' = (x - min(x)) / (max(x) - min(x))

Standardization Scaling: Standardization replaces the values with their Z-scores, so that the rescaled data have mean
(μ) zero and standard deviation (σ) one. It does not change the shape of the distribution.

Standardization : x' = (x - mean(x)) / S.D(x)
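The two formulas can be sketched by hand so that each step is visible. This is an illustration, not a replacement for the sklearn scalers; the function names are made up, and the population standard deviation is assumed since the text does not say which one is meant.

```python
import statistics


def min_max_scale(values):
    """Rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Replace values by their Z-scores (mean 0, standard deviation 1)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (an assumption)
    return [(v - mean) / sd for v in values]


data = [10.0, 20.0, 30.0]
normalized = min_max_scale(data)
z_scores = standardize(data)
print(normalized)
print(z_scores)
```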

Q5. You might have observed that sometimes the value of VIF is infinite. Why does this happen?

A5. The VIF of a variable is 1/(1 - R^2), where R^2 comes from regressing that variable on the other independent
variables. If there is perfect correlation between independent variables, R^2 = 1, so 1/(1 - R^2) becomes infinite.
An infinite VIF value therefore indicates that the corresponding variable can be expressed exactly as a linear
combination of other variables (which show an infinite VIF as well). To solve this problem we need to drop one of the
variables causing this perfect multicollinearity from the dataset.
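A small numeric sketch of the infinite case follows. It is simplified to two predictors, where R^2 is just the squared correlation between them; the function names, the tolerance tol and the data are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2, tol=1e-12):
    """VIF of x1 when x2 is the only other predictor: 1 / (1 - r^2)."""
    denominator = 1.0 - pearson_r(x1, x2) ** 2
    if denominator <= tol:
        # R^2 is 1 (up to rounding): perfect multicollinearity.
        return math.inf
    return 1.0 / denominator


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0 * v for v in x1]            # exact multiple of x1
x3 = [1.0, 0.0, 1.0, 0.0, 1.0]        # uncorrelated with x1

vif_collinear = vif_two_predictors(x1, x2)
vif_unrelated = vif_two_predictors(x1, x3)
print(vif_collinear, vif_unrelated)
```

Dropping x2 (or x1) removes the exact linear dependence, which is the fix described above.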

Q6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.

A6. A quantile-quantile (Q-Q) plot is a graphical tool that helps us assess whether a set of data plausibly came from
a theoretical distribution such as a normal, exponential or uniform distribution. It also helps determine whether two
data sets come from populations with a common distribution.

In linear regression this helps when the training and test data sets are received separately: a Q-Q plot can confirm
that both data sets come from populations with the same distribution.

A few advantages:

a) It can be used even when the two samples differ in size.

b) Many distributional aspects, such as shifts in location, shifts in scale, changes in symmetry and the presence of
outliers, can all be detected from this plot.

It is used to check whether two data sets:

i. come from populations with a common distribution

ii. have common location and scale

iii. have similar distributional shapes

iv. have similar tail behavior
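The points of a Q-Q plot against a theoretical distribution can be computed as follows. This is a minimal sketch for the normal case only; the function name, the sample values and the plotting positions (i + 0.5)/n are assumptions, and other conventions for the positions exist.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair each sorted sample value with a standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Theoretical quantiles at evenly spaced probabilities inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up sample, for example residuals from a fitted model.
points = normal_qq_points([2.1, -0.3, 0.8, 1.5, -1.2])
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:+.3f}  {sample_q:+.3f}")
```

If the sample is close to normal, the points fall near a straight line when plotted; curvature or stray points at the ends indicate skew, heavy tails or outliers.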
