Predictive Modeling Business Report (Seetharaman)
PREDICTIVE MODELING
Problem 1
Problem 1.1
Problem 1.2
Project 2
Problem 2.1
Problem 2.2
Problem 1
Problem Statement:
Problem 1: Linear Regression
You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of the
759 firms. Predict the sales of these firms on the basis of the details given in the
dataset, so as to help your company invest wisely. Also, identify the 5 attributes
that are most important.
Data Dictionary:
1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market
index that measures the stock performance of 500 large companies listed on stock
exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.
Project 1
Problem 1.1:
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
NULL Values:
Info:
The first step is to get to know our data and become familiar with it. What questions
are we trying to answer with this data? What variables are we using, and what do they
mean? How does the data look from a statistical perspective? Is it formatted correctly?
Do we have missing values? Duplicates? What about outliers? All these questions can
be answered step by step, as below:
Step 2: Describe the data after loading it. Check the data types, the number of rows
and columns, and the count of missing values; describe the min, max, and mean
values. Depending on the requirement, drop the missing values or replace them.
Data without Outliers after the Treatment:
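The report does not state which outlier-treatment method was applied; a common choice, sketched below, is IQR-based capping, where values beyond 1.5 times the interquartile range are clipped to the nearest whisker.

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Example: 200 lies far above the whisker and gets capped.
s = pd.Series([10, 12, 11, 13, 200])
capped = cap_outliers(s)
```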
Problem 1.2:
1.2) Impute null values if present? Do you think scaling is necessary in this case?
SCALING:
The scaled data ranges between -2 and +3, and most of the variables are ordinal,
so there is no need for scaling.
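For completeness, z-score scaling (the transform used in the pre/post-scaling scatter plots later in this section) subtracts each column's mean and divides by its standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Z-score scaling: (x - mean) / std, applied column-wise.
X = np.array([[100.0], [200.0], [300.0]])
scaled = StandardScaler().fit_transform(X)
# Each scaled column now has mean 0 and standard deviation 1.
```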
Observation:
As can be seen from the info output in point 1.1, the data has 759 rows in total, while
the independent variable tobinq shows 738 entries, meaning 21 null values, which are
replaced with the median value of tobinq.
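The median imputation described above can be sketched as follows (synthetic values stand in for the real tobinq column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tobinq column with missing entries.
df = pd.DataFrame({"tobinq": [0.5, 1.2, np.nan, 2.0, np.nan]})

# The median is robust to outliers, unlike the mean, which is why it
# is the usual choice for a skewed ratio like Tobin's q.
median = df["tobinq"].median()
df["tobinq"] = df["tobinq"].fillna(median)
```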
1.3) Encode the data (having string values) for modelling. Data split: split the data
into train and test (70:30). Apply linear regression. Performance metrics: check
the performance of predictions on the train and test sets using R-square and RMSE.
Sales is the target variable, with the train and test data columns as predictors;
sp500 is encoded as a categorical Y/N variable.
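The encoding, 70:30 split, and linear-regression fit with R-square and RMSE can be sketched as below. The data here is synthetic (generated so that capital and employment drive sales), since the report's actual CSV is not included.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the firm data; sp500 is the one string column
# and is label-encoded to 0/1 before fitting.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "capital": rng.uniform(1, 100, 200),
    "employment": rng.uniform(1, 50, 200),
    "sp500": rng.choice(["yes", "no"], 200),
})
df["sp500"] = (df["sp500"] == "yes").astype(int)
df["sales"] = 0.4 * df["capital"] + 80 * df["employment"] + rng.normal(0, 5, 200)

X, y = df.drop(columns="sales"), df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)     # 70:30 split

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```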
MSE:
statsmodels - Apply Linear Regression:
Scatter plot pre- and post-scaling with Z-score
Pre-scaling:
Post-scaling:
1.4) Inference: Based on these predictions, what are the business insights and
recommendations?
The investment criteria for a new investor are based mainly on the capital invested in the
company by the promoters, and investors favour firms where the capital investment is
good, as is also reflected in the scatter plot.
To generate capital, the company should have a combination of attributes such as
value, employment, sales and patents.
When employment increases by 1 unit, sales increase by 80.33 units, keeping all other
predictors constant.
When capital increases by 1 unit, sales increase by 0.42 units, keeping all other
predictors constant.
Project 2:
Problem 2: Logistic Regression and Linear Discriminant Analysis
You are hired by the Government to do an analysis of car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have
to help the government in predicting whether a person will survive or not on the basis
of the information given in the data set so as to provide insights that will help the
government to make stronger laws for car manufacturers to ensure safety measures.
Also, find out the important factors on the basis of which you made your predictions.
Data Dictionary:
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for
varying sampling probabilities. (The inverse probability weighting estimator can be used
to demonstrate causality when the researcher cannot conduct a controlled experiment
but has observed data to model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels
deploy, nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or
more bags deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4:
killed; 5: unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case
number, and the vehicle number. Within each year, use this to uniquely identify the
vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
List of Categorical Columns:
Univariate and Bivariate Analysis:
Correlation Chart of the Data:
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
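The encoding, 70:30 split, and model fitting described above can be sketched as follows. The crash data itself is not included in the report, so a small synthetic frame with the same kinds of columns (string factors, a binary target) stands in:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crash dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "seatbelt": rng.choice(["none", "belted"], 500),
    "frontal": rng.integers(0, 2, 500),
    "ageOFocc": rng.integers(16, 90, 500),
})
# Target defined from the factors (purely illustrative, not the real relationship).
df["Survived"] = ((df["seatbelt"] == "belted") | (df["frontal"] == 0)).astype(int)

# Encode the string factor to integer codes for modelling.
df["seatbelt"] = df["seatbelt"].astype("category").cat.codes

X, y = df.drop(columns="Survived"), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)     # 70:30 split

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
```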
Before Encoding:
After Encoding:
Data Split: Split the data into train and test (70:30)
Train-Test Split
Value counts of the train and test data for the Y values of the Survived column:
Confusion Matrix:
Plotting confusion matrix for the different models for the Training Data:
Metrics Classification:
Model Classification:
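The confusion matrix and classification report shown above are produced as sketched below, using hypothetical true labels and predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions, just to show the mechanics.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted
print(cm)
print(classification_report(y_true, y_pred)) # precision, recall, F1 per class
```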
AUC – Training Data:
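The AUC figures above come from the predicted class probabilities: the ROC curve plots the true-positive rate against the false-positive rate across thresholds, and AUC is the area under it. A minimal sketch with hypothetical scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```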
Confusion Matrix of the Training Data:
2.4) Inference: Based on these predictions, what are the insights and
recommendations?
Inference:
The scores of both the train and test data are close.
The Linear Discriminant Analysis model gives better recall and precision in comparison
to logistic regression.
Hence, the LDA model can be considered, further upgrading it using SMOTE,
whereby its predictive ability is further enhanced.
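SMOTE itself lives in the third-party imbalanced-learn package; as a minimal, runnable stand-in, plain random oversampling of the minority class below illustrates the rebalancing idea that SMOTE refines by interpolating synthetic samples instead of duplicating rows:

```python
import pandas as pd

# Imbalanced toy data: 8 survivors vs 2 non-survivors.
df = pd.DataFrame({"x": range(10),
                   "Survived": [1] * 8 + [0] * 2})

# Randomly resample the minority class (with replacement) until
# both classes are the same size.
minority = df[df["Survived"] == 0]
oversampled = pd.concat(
    [df, minority.sample(6, replace=True, random_state=0)],
    ignore_index=True)
counts = oversampled["Survived"].value_counts()
```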
Conclusion:
The accuracy of logistic regression on both the training data and the testing data
is almost the same, i.e. about 97%.
Similarly, the AUC of logistic regression for the training and testing data is also similar.
The other confusion-matrix parameters of logistic regression are likewise similar;
since the train and test results match so closely, the model generalizes well rather
than being overfitted.
We have also applied GridSearchCV to hyper-tune the model, with which the
F1 score on both the training and test data was 97%.
In the case of LDA, the AUC for the training and testing data is also the same, at 97%;
besides this, the other confusion-matrix parameters of the LDA model are also similar,
which again indicates good generalization rather than overfitting.
Overall, we can conclude that the logistic regression model is best suited for this
dataset, given its level of accuracy compared with Linear Discriminant Analysis.
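A GridSearchCV sketch over the regularisation strength C is shown below, on a synthetic dataset; the report does not show which hyperparameters or grid values were actually searched, so these are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data.
X, y = make_classification(n_samples=200, random_state=0)

# Cross-validated grid search over C, scored by F1 as in the report.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1", cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```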