
Project – Predictive Modeling

Contents

Problem 1: Linear Regression
    Q 1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
    Q 1.2 Impute null values if present? Do you think scaling is necessary in this case?
    Q 1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear Regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.
    Q 1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Problem 2: Logistic Regression and Linear Discriminant Analysis
    Q 2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
    Q 2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
    Q 2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models and write inferences, which model is best/optimized.
    Q 2.4 Inference: Based on these predictions, what are the insights and recommendations?

Problem 1: Linear Regression


You are part of an investment firm, and your work is to research these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help your company invest consciously. Also, provide the 5 attributes that are most important.

Data Dictionary for Firm_level_data:


1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's
market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.

Q 1.1) Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform Univariate
and Bivariate Analysis.

Read and check head and tail of the data

Checking shape, information and summary statistics
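The checks described above can be sketched as follows; this is a minimal illustration using a tiny made-up frame in place of the real Firm_level_data file (the column names are taken from the data dictionary, the values are invented):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the firm-level dataset (real file not reproduced here).
df = pd.DataFrame({
    "Unnamed: 0": range(5),
    "sales":   [826.1, 407.8, 8407.8, 451.0, 174.9],
    "capital": [161.6, 122.1, 6221.1, 266.9, 140.1],
    "patents": [10, 2, 138, 1, 2],
    "sp500":   ["no", "no", "yes", "no", "no"],
    "tobinq":  [11.05, 0.84, np.nan, 8.15, 1.50],
})

print(df.shape)            # observations x variables
print(df.dtypes)           # data types per column
print(df.isnull().sum())   # null counts ('tobinq' has missing values)

df = df.drop(columns=["Unnamed: 0"])      # index column adds no information
print(df.describe().T[["mean", "50%"]])   # mean vs median hints at skewness
```

Comparing the mean with the median (the "50%" row of `describe()`) is the quick skewness check referred to in the observations below.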


There are 759 observations for 10 variables.

We can remove the column 'Unnamed: 0', as it only duplicates the index.

This data set contains continuous, discrete and categorical data.

'sp500' has categorical data, 'patents' has discrete data, and all other variables have continuous data.

For most of the variables except 'institutions', the mean is far greater than the median, indicating positive skewness in the data.

Standard deviation is also high for all numerical variables.

Check for Nulls values:


There are null values for the variable 'tobinq', the ratio between a physical asset's market value and its replacement value.
The variable 'tobinq' has 2.77% null values.

Check for duplicates:


There are no duplicates in the data set

Checking unique values for categorical variable 'sp500' :

Conduct Univariate Analysis

As per the analysis, all the continuous variables (sales, capital, patents, randd, employment, tobinq, value) except 'institutions' have skewed data.

The pair plot and correlation matrix show correlation between capital and randd, and between capital and employment.

They also show strong correlation of the dependent variable sales with capital, randd, employment and value.

Q 1.2 Impute null values if present? Do you think scaling is necessary in this
case?

We have null values in 'tobinq'; since 'tobinq' is a continuous variable and a ratio, the null values are imputed using the median.

After imputing:
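A minimal sketch of the median imputation, on an invented 'tobinq' column (not the report's actual values):

```python
import pandas as pd
import numpy as np

# Toy 'tobinq' column with two missing entries.
df = pd.DataFrame({"tobinq": [11.05, 0.84, np.nan, 8.15, np.nan, 1.50]})

median_q = df["tobinq"].median()                # median is robust to the skew noted earlier
df["tobinq"] = df["tobinq"].fillna(median_q)    # impute nulls with the median
print(df["tobinq"].isnull().sum())              # 0 after imputation
```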

There are outliers; the boxplots below show them.

After outlier removal:


The variables are on different scales or magnitudes: patents are whole numbers, tobinq is a ratio expressed as a decimal, and similarly all other variables are on different scales; hence scaling is required.
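One common way to do both steps, sketched on a toy series (the report does not show its exact method, so this assumes IQR capping for outliers and z-score standardization for scaling):

```python
import pandas as pd

# Toy series with one obvious outlier (50.0).
s = pd.Series([1.0, 2.0, 2.5, 3.0, 50.0])

# Cap outliers at the IQR whiskers (assumed treatment; the report only shows boxplots).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)

# Z-score scaling so all variables share the same magnitude.
scaled = (capped - capped.mean()) / capped.std()
print(capped.tolist(), scaled.round(3).tolist())
```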

Before scaling
After scaling

Q 1.3 Encode the data (having string values) for Modelling. Data Split: Split
the data into test and train (70:30). Apply Linear regression. Performance
Metrics: Check the performance of Predictions on Train and Test sets using R-
square, RMSE.

One feature 'sp500' with string value is encoded to create dummies


Data set is split into train and test data in 70:30 ratio

Linear Regression model is applied
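These steps (dummy-encode, 70:30 split, fit) can be sketched as below, on synthetic data standing in for the scaled firm dataset (only a subset of the columns is mimicked; the toy coefficients are not the report's):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the scaled firm data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "capital":    rng.normal(size=200),
    "employment": rng.normal(size=200),
    "sp500":      rng.choice(["yes", "no"], size=200),
})
df["sales"] = 0.25 * df["capital"] + 0.42 * df["employment"] + rng.normal(scale=0.1, size=200)

# Encode the string feature: creates 'sp500_yes', as in the report.
X = pd.get_dummies(df.drop(columns="sales"), drop_first=True)
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
print(lr.intercept_, dict(zip(X.columns, lr.coef_.round(2))))
```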

The intercept for our model is -0.02834809299281312

R-square on the train data: 0.9359702538559448 — 93% of the variation in the sales is explained by the predictors in the model for the train set.

R-square on the test data: 0.9240311293641786

RMSE train: 0.2581275829531501

RMSE test: 0.2618357790172932
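The two metrics reported above can be computed directly from their definitions; the numbers below are illustrative only, not the report's:

```python
import numpy as np

def r_square(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    # Root mean squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_square(y_true, y_pred), rmse(y_true, y_pred))
```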

Linear Regression using statsmodels


Inferential Statistics

Sum of squared error, train: 0.06663

Sum of squared error, test: 0.068558

RMSE train: 0.2581275829531501

RMSE test: 0.2618357790172932

Model score (R-square), train: 0.9359702538559448

Model score (R-square), test: 0.9240311293641786

Prediction Scatter plot

Regression equation:

sales = (-0.03) * Intercept + (0.25) * capital + (-0.03) * patents + (0.05) * randd + (0.42) * employment + (-0.05) * tobinq + (0.28) * value + (0.0) * institutions + (0.11) * sp500_yes

Q 1.4 Inference: Based on these predictions, what are the business insights
and recommendations.

As per the model output, the features below are significant.

Capital has a positive impact on sales.

The number of patents and tobinq have a slight negative impact on sales, whereas randd has a positive impact on sales, and employment has the highest impact on sales performance.

With these beta coefficients, the above attributes should be able to predict sales performance reliably in the future.
Problem 2: Logistic Regression and
Linear Discriminant Analysis
You are hired by the Government to do an analysis of car crashes. You are provided details of car
crashes, among which some people survived and some didn't. You have to help the government in
predicting whether a person will survive or not on the basis of the information given in the data set so
as to provide insights that will help the government to make stronger laws for car manufacturers to
ensure safety measures. Also, find out the important factors on the basis of which you made your
predictions.

Data Dictionary for Car_Crash

1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed;
5: unknown, 6: prior death
15. caseid: character, created by pasting together the populations sampling unit, the case number,
and the vehicle number. Within each year, use this to uniquely identify the vehicle.

Q 2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate and
Bivariate Analysis. Do exploratory data analysis.

Read the data and check head and tail of the data.
Checking shape, information and summary statistics of the data set.

There are 15 columns and 11217 rows in the dataset


There are numerical and categorical variables. 'dvcat', 'Survived', 'airbag', 'seatbelt', 'sex', 'abcat', 'occRole' and 'caseid' are of object datatype and categorical in nature. 'frontal' and 'deploy' are of integer datatype but are categorical in nature, as is 'injSeverity'. 'weight', 'ageOFocc', 'yearacc' and 'yearVeh' are clearly numerical variables.

'dvcat' has a unique factor '24-Oct' which appears as a date; it should instead be the string for the impact-speed level '10-24'. These values should be replaced with the correct string.
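The fix can be sketched as a simple replace; the '24-Oct' value is presumably an Excel-style auto-conversion of '10-24' (toy column below, not the full dataset):

```python
import pandas as pd

# Toy 'dvcat' column showing the date-mangled level.
df = pd.DataFrame({"dvcat": ["1-9km/h", "24-Oct", "25-39", "40-54", "55+"]})

# Restore the intended impact-speed label.
df["dvcat"] = df["dvcat"].replace("24-Oct", "10-24")
print(df["dvcat"].unique())
```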

There are null values in the column 'injSeverity'


The null values in 'injSeverity' are less than 1%, and since this is a categorical variable, we can drop the null rows.

No duplicate rows in the data
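The null-drop and duplicate check can be sketched together on toy data (column names from the data dictionary):

```python
import numpy as np
import pandas as pd

# Toy frame: one null in 'injSeverity', no duplicate rows.
df = pd.DataFrame({"injSeverity": [0, 4, np.nan, 3, 1], "frontal": [1, 0, 1, 1, 0]})

df = df.dropna(subset=["injSeverity"])    # <1% nulls, so dropping rows is cheap
print(df.shape, df.duplicated().sum())    # rows remaining, duplicate count
```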

Checking unique value for categorical variables


Getting unique count of nominal categorical variables

From the above analysis of categorical variables, it seems:


 Most of the vehicles were driven at a speed of 10-24 km/h
 Almost 4000+ (~40%) of the vehicles did not have airbags
 Similarly, 3000+ (~30%) of passengers or drivers were not belted
 In more than 60% of vehicles the airbags either did not deploy or were not available
 Most car crashes involved drivers
 More than 7000 (~65%) crashes involved frontal impact
 In terms of injury severity, 10% of the accidents were fatal
Getting the percentage values of target variable 'Survived'

Univariate and Bivariate Analysis


Bivariate Analysis and multivariate analysis
Pair plot

Correlation check
There does not seem to be multicollinearity in the data.

From the bivariate analysis and the pair plot with the dependent variable 'Survived' as hue, the observations are:

As the impact speed increases, the chances of survival in a crash decrease, with the highest not-survived percentage for speeds above 55 km/h, followed by 40-54 km/h. There are lower chances of survival when no airbag is available or the airbag did not deploy. Frontal impact is also an important factor; however, the survival percentage is higher for frontal than for non-frontal impact.

injSeverity 4 is directly linked to non-survival.

Q 2.2 Encode the data (having string values) for Modelling. Data Split: Split
the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).

The column 'caseid' does not seem to add any value beyond identifying a specific vehicle, so we can drop it.
We will encode the variables having string values: 'Survived', 'airbag', 'seatbelt', 'sex', 'abcat', 'occRole'.

Columns in the encoded data

Splitting the Train/Test data

Applied Logistic Regression
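The encode / split / fit pipeline can be sketched as below on toy data (the real frame has all the Car_Crash columns; the toy target here is invented purely so the model has something to learn):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the crash data (three of the real columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "seatbelt": rng.choice(["none", "belted"], size=300),
    "frontal":  rng.integers(0, 2, size=300),
    "ageOFocc": rng.integers(16, 90, size=300),
})
# Invented target: belted occupants survive more often.
df["Survived"] = np.where(df["seatbelt"] == "belted", 1, rng.integers(0, 2, size=300))

X = pd.get_dummies(df.drop(columns="Survived"), drop_first=True)  # e.g. 'seatbelt_none'
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(logit.score(X_test, y_test))   # test accuracy
```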

Important features identified using RFE are below

Frontal
injSeverity
dvcat
seatbelt
abcat
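Feature ranking with RFE can be sketched as below; synthetic features stand in for the encoded crash data (the report's actual top-5 are the ones listed above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem with 8 candidate features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

# Recursively eliminate features until 5 remain, as in the report.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(rfe.support_)   # boolean mask of the 5 retained features
print(rfe.ranking_)   # 1 = retained; higher = eliminated earlier
```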

training data

Confusion matrix

Accuracy - 0.9796101564503719
test data

Confusion Matrix

Accuracy - 0.9820466786355476
Applying LDA (linear discriminant analysis).

Train data

Model score - 0.9563990766863298

Confusion matrix

Classification report
Test data

Model Score - 0.9554159186116098

Confusion matrix

Classification report
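The LDA fit and the reports above (score, confusion matrix, classification report) can be sketched as below on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the encoded crash data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
pred = lda.predict(X_test)

print(lda.score(X_test, y_test))            # model score (accuracy)
print(confusion_matrix(y_test, pred))       # confusion matrix
print(classification_report(y_test, pred))  # precision / recall / f1 per class
```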

Changing the cut-off value for maximum accuracy:


The 0.4 cut-off gives us the best f1-score, so we take the cut-off as 0.4 to get the optimum f1-score.
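The cut-off sweep can be sketched as below; the toy probabilities are invented (and chosen so that 0.4 happens to win, mirroring the report's conclusion):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predicted probabilities (not the report's data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.2, 0.35, 0.45, 0.8, 0.6, 0.25, 0.55, 0.1])

# Evaluate f1 at each candidate cut-off and keep the best.
cuts = [0.3, 0.4, 0.5, 0.6]
best = max(
    ((cut, f1_score(y_true, (proba >= cut).astype(int))) for cut in cuts),
    key=lambda t: t[1],
)
print(best)   # (best cut-off, f1 at that cut-off)
```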
Q 2.3 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Compare both the models and write
inferences, which model is best/optimized.

As it is important to predict survival accurately, both accuracy and recall matter, and both models look good with very little difference between them. However, Logistic Regression scores slightly higher on recall, hence the Logistic Regression model is preferred. As all performance parameters are quite high, we can use this data and these features for recommendations based on the model performance.
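The ROC_AUC comparison of the two models can be sketched as below, on synthetic data standing in for the encoded crash data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem; the same split is used for both models.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

aucs = {}
for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("LDA", LinearDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    # AUC is computed on predicted probabilities, not hard labels.
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(aucs)
```

Plotting the ROC curves themselves would use `sklearn.metrics.roc_curve` on the same probabilities.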
Q 2.4 Inference: Based on these predictions, what are the insights and
recommendations.

As mentioned earlier, frontal impact, severity of the injury, impact-speed level, seatbelt use and airbag deployment are important features.

Weight and the year (age) of the car do not greatly impact survival during a crash. The probability of survival at low speed in accident-prone locations is high.

Crashes where the occupants had deployed airbags and wore seatbelts show higher survival.

It is recommended to make stronger laws for vehicle manufacturers to ensure safety measures. Below are a few recommendations:

It should be mandatory to have airbags for both front and back seats to ensure safety during frontal or non-frontal impact accidents.

Seatbelts should be available for all seats.

Manufacturers should deploy warning signals in case of non-deployment of airbags or seatbelts.

Manufacturers can deploy intelligent systems to advise speed reduction at accident-prone locations or crowded places.

The government can implement reward vs penalty clauses for manufacturers based on the record of safety measures deployed in the vehicles.
