Predictive Modeling Business Report (Seetharaman)
PREDICTIVE MODELING
Problem 1
Problem 1.1
Problem 1.2
Project 2
Problem 2.1
Problem 2.2
Problem 1
Problem Statement:
Problem 1: Linear Regression
You are part of an investment firm, and your task is to research these 759 firms.
You are provided with a dataset containing the sales and other attributes of the
759 firms. Predict the sales of these firms on the basis of the details given in the
dataset, so as to help your company invest wisely. Also, identify the 5 attributes
that are most important.
Data Dictionary:
1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market
index that measures the stock performance of 500 large companies listed on stock
exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.
Project 1
Problem 1.1:
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the
null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
NULL Values:
Info:
The first step is to get to know our data and become familiar with it. What questions
are we trying to answer with this data? What variables are we using, and what do they
mean? How does the data look from a statistical perspective? Is it formatted correctly?
Do we have missing values? Duplicates? What about outliers? All these questions can
be answered step by step, as below:
Step 2: Describe the data after loading it. Check the data types, the number of rows
and columns, and the count of missing values; describe the min, max, and mean
values. Depending on the requirement, drop the missing values or replace them.
Data without Outliers after the Treatment:
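The report does not state which outlier-treatment method was applied; a common choice, sketched below, is IQR-based capping, where values beyond 1.5 times the interquartile range are clipped to the nearest whisker.

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to the whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Example: 200 lies far above the whisker and gets capped.
s = pd.Series([10, 12, 11, 13, 200])
capped = cap_outliers(s)
```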
Problem 1.2:
1.2) Impute null values if present? Do you think scaling is necessary in this case?
SCALING:
The scaled data ranges between -2 and +3, and most of the variables are ordinal,
so there is no need for scaling.
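For completeness, z-score scaling (the transform used in the pre/post-scaling scatter plots later in this section) subtracts each column's mean and divides by its standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Z-score scaling: (x - mean) / std, applied column-wise.
X = np.array([[100.0], [200.0], [300.0]])
scaled = StandardScaler().fit_transform(X)
# Each scaled column now has mean 0 and standard deviation 1.
```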
Observation:
As can be seen from the info output in point 1.1, the data has 759 rows in total, while
the independent variable tobinq shows 738 entries, meaning 21 null values, which are
replaced with the median value of tobinq.
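The median imputation described above can be sketched as follows (synthetic values stand in for the real tobinq column):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tobinq column with missing entries.
df = pd.DataFrame({"tobinq": [0.5, 1.2, np.nan, 2.0, np.nan]})

# The median is robust to outliers, unlike the mean, which is why it
# is the usual choice for a skewed ratio like Tobin's q.
median = df["tobinq"].median()
df["tobinq"] = df["tobinq"].fillna(median)
```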
1.3) Encode the data (having string values) for modelling. Data split: split the data
into train and test (70:30). Apply linear regression. Performance metrics: check
the performance of predictions on the train and test sets using R-square and RMSE.
Sales is the target variable, with the train and test data columns as predictors;
sp500 is encoded as a categorical Y/N variable.
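The encoding, 70:30 split, and linear-regression fit with R-square and RMSE can be sketched as below. The data here is synthetic (generated so that capital and employment drive sales), since the report's actual CSV is not included.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the firm data; sp500 is the one string column
# and is label-encoded to 0/1 before fitting.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "capital": rng.uniform(1, 100, 200),
    "employment": rng.uniform(1, 50, 200),
    "sp500": rng.choice(["yes", "no"], 200),
})
df["sp500"] = (df["sp500"] == "yes").astype(int)
df["sales"] = 0.4 * df["capital"] + 80 * df["employment"] + rng.normal(0, 5, 200)

X, y = df.drop(columns="sales"), df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)     # 70:30 split

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```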
MSE:
statsmodels - Apply Linear Regression:
Scatter plot pre- and post-scaling with Z-score
Pre-scaling:
Post-scaling:
1.4) Inference: Based on these predictions, what are the business insights and
recommendations?
The investment criteria for a new investor are based mainly on the capital invested in the
company by the promoters, and investors favour firms where the capital investment is
good, as is also reflected in the scatter plot.
To generate capital, the company should have a combination of attributes such as
value, employment, sales and patents.
When employment increases by 1 unit, sales increase by 80.33 units, keeping all other
predictors constant.
When capital increases by 1 unit, sales increase by 0.42 units, keeping all other
predictors constant.
Project 2:
Problem 2: Logistic Regression and Linear Discriminant Analysis
You are hired by the Government to do an analysis of car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have
to help the government in predicting whether a person will survive or not on the basis
of the information given in the data set so as to provide insights that will help the
government to make stronger laws for car manufacturers to ensure safety measures.
Also, find out the important factors on the basis of which you made your predictions.
Data Dictionary:
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for
varying sampling probabilities. (The inverse probability weighting estimator can be used
to demonstrate causality when the researcher cannot conduct a controlled experiment
but has observed data to model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels
deploy, nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or
more bags deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4:
killed; 5: unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case
number, and the vehicle number. Within each year, use this to uniquely identify the
vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
List of Categorical Columns:
Univariate and Bivariate Analysis:
Correlation Chart of the Data:
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
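The encoding, 70:30 split, and model fitting described above can be sketched as follows. The crash data itself is not included in the report, so a small synthetic frame with the same kinds of columns (string factors, a binary target) stands in:

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crash dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "seatbelt": rng.choice(["none", "belted"], 500),
    "frontal": rng.integers(0, 2, 500),
    "ageOFocc": rng.integers(16, 90, 500),
})
# Target defined from the factors (purely illustrative, not the real relationship).
df["Survived"] = ((df["seatbelt"] == "belted") | (df["frontal"] == 0)).astype(int)

# Encode the string factor to integer codes for modelling.
df["seatbelt"] = df["seatbelt"].astype("category").cat.codes

X, y = df.drop(columns="Survived"), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)     # 70:30 split

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
```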
Before Encoding:
After Encoding:
Data Split: Split the data into train and test (70:30)
Train-Test Split
Value counts of the train and test data for the Y values of the Survived column:
Confusion Matrix:
Plotting confusion matrix for the different models for the Training Data:
Metrics Classification:
Model Classification:
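The confusion matrix and classification report shown above are produced as sketched below, using hypothetical true labels and predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions, just to show the mechanics.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted
print(cm)
print(classification_report(y_true, y_pred)) # precision, recall, F1 per class
```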
AUC – Training Data:
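The AUC figures above come from the predicted class probabilities: the ROC curve plots the true-positive rate against the false-positive rate across thresholds, and AUC is the area under it. A minimal sketch with hypothetical scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```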
Confusion Matrix of the Training Data:
2.4) Inference: Based on these predictions, what are the insights and
recommendations?
Inference:
The scores of both the train and test data are close.
The Linear Discriminant Analysis model gives better recall and precision in comparison
to logistic regression.
Hence, the LDA model can be considered, further upgrading it using SMOTE,
whereby its predictive ability is further enhanced.
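SMOTE itself lives in the third-party imbalanced-learn package; as a minimal, runnable stand-in, plain random oversampling of the minority class below illustrates the rebalancing idea that SMOTE refines by interpolating synthetic samples instead of duplicating rows:

```python
import pandas as pd

# Imbalanced toy data: 8 survivors vs 2 non-survivors.
df = pd.DataFrame({"x": range(10),
                   "Survived": [1] * 8 + [0] * 2})

# Randomly resample the minority class (with replacement) until
# both classes are the same size.
minority = df[df["Survived"] == 0]
oversampled = pd.concat(
    [df, minority.sample(6, replace=True, random_state=0)],
    ignore_index=True)
counts = oversampled["Survived"].value_counts()
```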
Conclusion:
The accuracy of logistic regression on both the training data and the testing data
is almost the same, i.e. about 97%.
Similarly, the AUC of logistic regression for the training and testing data is also similar.
The other confusion-matrix parameters of logistic regression are likewise similar;
since the train and test results match so closely, the model generalizes well rather
than being overfitted.
We have also applied GridSearchCV to hyper-tune the model, with which the
F1 score on both the training and test data was 97%.
In the case of LDA, the AUC for the training and testing data is also the same, at 97%;
besides this, the other confusion-matrix parameters of the LDA model are also similar,
which again indicates good generalization rather than overfitting.
Overall, we can conclude that the logistic regression model is best suited for this
dataset, given its level of accuracy compared with Linear Discriminant Analysis.
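A GridSearchCV sketch over the regularisation strength C is shown below, on a synthetic dataset; the report does not show which hyperparameters or grid values were actually searched, so these are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data.
X, y = make_classification(n_samples=200, random_state=0)

# Cross-validated grid search over C, scored by F1 as in the report.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1", cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```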