Problem Statement: You are part of an investing firm and your work is to do research on these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest wisely. Also, provide the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Getting counts of the object variable: there is only one object variable, and the counts for each of its levels are shown below.
Sales are higher for the companies that are present in the SP500 index than for the firms that are not present in the SP500 index.
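A minimal sketch of these EDA checks, assuming the data sits in a CSV file (the file name Firm_data.csv is a placeholder) and that the columns follow the attribute names used in this report (sales, sp500):

```python
import pandas as pd

# Placeholder file name; replace with the actual dataset path.
df = pd.read_csv("Firm_data.csv")

# Basic structure checks: shape, data types, and null counts
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Counts of the only object variable (SP500 membership)
print(df["sp500"].value_counts())

# Compare average sales for firms in vs. not in the SP500 index
print(df.groupby("sp500")["sales"].mean())
```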
Correlation Heatmap:
Observations: Sales has a very high degree of correlation with employment (0.91), capital (0.87) and randd (0.85). In the heatmap above, sales has the least correlation with tobinq, at 0.11. It is also worth noting that there is a moderate correlation between patents and capital; randd has a moderate correlation with capital and a high correlation with patents.
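The heatmap can be produced along the following lines (a sketch, assuming df is the DataFrame loaded earlier):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric attributes only
corr = df.select_dtypes(include="number").corr()

# Annotated heatmap to read off pairwise correlations (e.g. sales vs. employment)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```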
• Outliers can affect the distribution of data, making it skewed. Treating outliers can help in normalizing the distribution (one such treatment is sketched after this list).
• Outliers can distort visualizations such as histograms, box plots, and scatter plots.
Removing or transforming outliers can improve the clarity and interpretability of
these visualizations.
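A minimal sketch of one possible outlier treatment, IQR-based capping of the numeric columns; the helper cap_outliers_iqr is introduced here only for illustration, since the report does not spell out which treatment was applied:

```python
# Cap values beyond the IQR whiskers back to the boundary (one common treatment)
def cap_outliers_iqr(data, column):
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    data[column] = data[column].clip(lower=lower, upper=upper)
    return data

for col in df.select_dtypes(include="number").columns:
    df = cap_outliers_iqr(df, col)
```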
SCALING:
• Scaling can be useful to reduce or check multicollinearity in the data; when scaling is not applied, the VIF (variance inflation factor) values come out very high, which indicates the presence of multicollinearity.
• These VIF values are calculated after building the linear regression model, in order to understand the multicollinearity in the model.
• Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
• In the given dataset, the attributes are not on comparable scales, so we should scale the data in this case. Accordingly, we scaled the dataset after treating the outliers and converting the categorical data into numeric form.
StandardScaler standardizes the data using the formula (x - mean) / standard deviation.
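A minimal sketch of this step in isolation, assuming df is the working DataFrame (in the report's pipeline, scaling is applied after outlier treatment and encoding):

```python
from sklearn.preprocessing import StandardScaler

# Standardize each numeric column: (x - mean) / standard deviation
num_cols = df.select_dtypes(include="number").columns
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```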
In this context, data encoding refers to converting categorical values into numeric form (zeros and ones) so that they can be used by the model.
There are three common approaches for converting ordinal and categorical variables to numerical values:
• Ordinal Encoding
• One-Hot Encoding
• Dummy Variable Encoding
A linear regression model cannot take categorical values directly, so we have encoded the categorical values as integers for better results.
Here we will use Dummy Variable Encoding to convert each category into a separate column containing only 0 and 1, where 1 indicates presence and 0 indicates absence.
In this case:
Here, we have used drop_first=True so that not all levels of the categorical variable are included as separate columns; keeping every level would introduce multicollinearity and land us in the dummy variable trap.
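A minimal sketch of this encoding step, assuming the categorical column is named sp500 in the DataFrame:

```python
import pandas as pd

# Dummy-variable encoding of the SP500 membership column;
# drop_first=True drops one level to avoid the dummy variable trap.
df = pd.get_dummies(df, columns=["sp500"], drop_first=True)
```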
Since we need to forecast sales, we take sales as the dependent variable.
We will divide the data into training and testing sets in a 70:30 proportion, with a fixed random state of 1 to ensure reproducible results across multiple runs.
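A sketch of the split described above, assuming the encoded and scaled DataFrame df with sales as the target:

```python
from sklearn.model_selection import train_test_split

# Sales is the dependent variable; all remaining attributes are predictors
X = df.drop(columns=["sales"])
y = df["sales"]

# 70:30 split with a fixed random state, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
```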
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between x
(input) and y(output).
The coefficient value represents the mean change in the dependent variable for a one-unit shift in an independent variable. On scaled data, the absolute sizes of the coefficients can be used to identify the most important variables.
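A minimal sketch of fitting the model and ranking attributes by the absolute size of their coefficients; the top-5 selection shown here is one way the report's "most important attributes" could be read off under this approach:

```python
from sklearn.linear_model import LinearRegression
import pandas as pd

# Fit the linear regression model on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Pair each coefficient with its attribute; on scaled data the absolute
# size is a rough measure of importance, so show the five largest
coef = pd.Series(lr.coef_, index=X_train.columns)
top5 = coef.reindex(coef.abs().sort_values(ascending=False).index).head(5)
print(top5)
print("Intercept:", lr.intercept_)
```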
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
In this case, both the train and test data show an R-squared of about 94%, which indicates a well-fitting model.
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure of
how spread out these residuals are. In other words, it tells you how concentrated the data is around
the line of best fit.
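Both metrics can be computed for the train and test sets along these lines (a sketch, assuming the fitted model lr from above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# R-squared on train and test data (around 0.94 in this report)
print("Train R2:", r2_score(y_train, lr.predict(X_train)))
print("Test  R2:", r2_score(y_test, lr.predict(X_test)))

# RMSE: the standard deviation of the residuals (prediction errors)
rmse_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
print("Train RMSE:", rmse_train, "Test RMSE:", rmse_test)
```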
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the
other independent variables in a multiple regression equation. Multicollinearity is a problem
because it undermines the statistical significance of an independent variable.
Variance inflation factors range from 1 upwards. The numerical value of the VIF tells you (in decimal form) by what percentage the variance (i.e. the standard error squared) is inflated for each coefficient. For example, a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there were no multicollinearity, that is, no correlation with the other predictors.
Capital, patents, randd, employment and value show high VIF values here.
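One way to obtain these VIF values, sketched with statsmodels' variance_inflation_factor:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF is computed per column of the design matrix (with an added constant);
# cast to float in case dummy columns are stored as booleans
X_const = sm.add_constant(X_train).astype(float)
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)
```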
Ordinary Least Squares (OLS) regression is a common technique for estimating the coefficients of linear regression equations, which describe the relationship between one or more independent quantitative variables and a dependent variable (simple or multiple linear regression). Least squares refers to minimizing the sum of squared errors (SSE).
The general form of the equation is:
y = β0 + β1·x1 + β2·x2 + … + βk·xk + ε
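A sketch of fitting OLS with statsmodels to obtain the summary (coefficients, R-squared, adjusted R-squared, p-values) interpreted below:

```python
import statsmodels.api as sm

# Fit OLS on the training data and print the full regression summary
X_train_const = sm.add_constant(X_train.astype(float))
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())
```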
• The R-squared value (0.934) represents the proportion of the variance in the dependent variable
(sales) that is explained by the independent variables in the model. In this case, approximately
93.4% of the variability in sales is explained by the model. The adjusted R-squared (0.933) takes
into account the number of predictors and provides a more accurate measure in the presence of
multiple variables.
• We would advise the firm to invest in companies where employment is very high. We can also do further classification and encourage firms with lower employment to hire more qualified candidates, thus increasing their turnover.
• Conversely, since tobinq, which is the ratio between a physical asset's market value and its replacement value, has a negative impact on sales, we would advise the investment firm not to look into firms that have a high tobinq ratio.