Project Predictive Modeling
Contents
Problem 1: Linear Regression
Q 1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Q 1.2 Impute null values if present? Do you think scaling is necessary in this case?
Q 1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear Regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE.
Q 1.4 Inference: Based on these predictions, what are the business insights and recommendations?
Problem 2: Logistic Regression and Linear Discriminant Analysis
Q 2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and null value condition check, and write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Q 2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
Q 2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix. Plot the ROC curve and get the ROC_AUC score for each model. Compare both models and write inferences on which model is best/optimized.
Q 2.4 Inference: Based on these predictions, what are the insights and recommendations?
Problem 1: Linear Regression
You are a part of an investment firm and your work is to do research about these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest wisely. Also, provide them with the 5 attributes that are most important.
Q 1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
'sp500' is categorical and 'patents' is discrete, while all other variables are continuous.
For most of the variables except 'institutions', the mean is far greater than the median, indicating positive skewness in the data.
The correlation check also shows a strong correlation of the dependent variable 'sales' with 'capital', 'randd', 'employment' and 'value'.
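These observations could be reproduced with a short EDA sketch along the following lines; the file name Firm_level_data.csv and the target column name 'sales' are assumptions of this sketch, not confirmed by the report:

```python
import pandas as pd

# Assumed file name; replace with the actual dataset path
df = pd.read_csv("Firm_level_data.csv")

# Shape, data types and null-value check
print(df.shape)
df.info()
print(df.isnull().sum())

# Descriptive statistics: mean far above median hints at positive skew
print(df.describe().T)
print(df.skew(numeric_only=True))

# Correlation of the numeric variables with the target 'sales'
print(df.corr(numeric_only=True)["sales"].sort_values(ascending=False))
```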
Q 1.2 Impute null values if present? Do you think scaling is necessary in this
case?
There are null values in 'tobinq'. Since 'tobinq' is a continuous variable and a ratio, the null values are imputed with the median.
The null-value counts were checked again after imputing, and the data was compared before and after scaling.
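A minimal sketch of the imputation and scaling step, continuing from the DataFrame df above. Median imputation of 'tobinq' follows the note above; z-score scaling ((x − mean) / std) of all numeric columns via StandardScaler is an assumption of this sketch, consistent with the small RMSE values reported later:

```python
from sklearn.preprocessing import StandardScaler

# Median imputation for the continuous ratio variable 'tobinq'
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())
print(df.isnull().sum())  # verify that no nulls remain

# z-score scaling of the numeric columns; the variables sit on very
# different ranges, so scaling puts them on a comparable footing
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df[num_cols].describe().T[["mean", "std"]])
```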
Q 1.3 Encode the data (having string values) for Modelling. Data Split: Split
the data into test and train (70:30). Apply Linear regression. Performance
Metrics: Check the performance of Predictions on Train and Test sets using R-
square, RMSE.
93% of the variation in sales is explained by the predictors in the model for the train set.
R-square on the test data – 0.9240311293641786
RMSE on the train data – 0.2581275829531501
RMSE on the test data – 0.2618357790172932
Regression equation –
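A sketch of the encoding, the 70:30 split and the linear-regression fit behind these metrics, continuing from the scaled DataFrame above; the use of pd.get_dummies on 'sp500' and the random_state value are assumptions of this sketch. The printed coefficients and intercept give the regression equation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# One-hot encode the string-valued column; 'sp500' is the categorical variable here
df_enc = pd.get_dummies(df, columns=["sp500"], drop_first=True)

X = df_enc.drop("sales", axis=1)
y = df_enc["sales"]

# 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)

# R-square and RMSE on the train and test sets
for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(X_)
    print(name, "R2:", r2_score(y_, pred), "RMSE:", np.sqrt(mean_squared_error(y_, pred)))

# Coefficients and intercept of the regression equation
print(pd.Series(lr.coef_, index=X.columns))
print("intercept:", lr.intercept_)
```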
Q 1.4 Inference: Based on these predictions, what are the business insights
and recommendations.
The number of patents and 'tobinq' have a slight negative impact on sales, whereas 'randd' has a positive impact on sales and 'employment' has the highest impact on sales performance.
With low beta coefficients overall, the attributes above should be able to predict sales performance reliably in the future.
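To surface the five most important attributes asked for in the brief, the fitted coefficients from the sketch above could be ranked by absolute value (comparable here because the predictors were scaled); this is an illustrative continuation, not the report's own ranking:

```python
import pandas as pd

# Rank predictors by the magnitude of their scaled coefficients
coef = pd.Series(lr.coef_, index=X.columns)
print(coef.abs().sort_values(ascending=False).head(5))
```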
Problem 2: Logistic Regression and
Linear Discriminant Analysis
You are hired by the Government to do an analysis of car crashes. You are provided details of car
crashes, among which some people survived and some didn't. You have to help the government in
predicting whether a person will survive or not on the basis of the information given in the data set so
as to provide insights that will help the government to make stronger laws for car manufacturers to
ensure safety measures. Also, find out the important factors on the basis of which you made your
predictions.
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observed data to
model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags
deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity, 3: incapacity, 4: killed;
5: unknown, 6: prior death
15. caseid: character, created by pasting together the population sampling unit, the case number, and the vehicle number. Within each year, use this to uniquely identify the vehicle.
Q 2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate and
Bivariate Analysis. Do exploratory data analysis.
Read the data and check head and tail of the data.
Checking the shape, information and statistics of the data set.
Correlation check
There does not appear to be multicollinearity in the data.
From the bivariate analysis and the pair plot with the dependent variable 'Survived' as hue, the observations are:
As the impact speed increases, the chances of surviving a crash decrease, with the highest not-survived percentage for speeds above 55 km/h, followed by 40-54 km/h.
The chances of survival are lower when no airbag is available or the airbag does not deploy.
Frontal impact is also an important factor; however, the survival percentage is higher for frontal impacts than for non-frontal ones.
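These checks and plots could be produced with a sketch like the following; the file name Car_Crash.csv is an assumption, while the column name 'Survived' comes from the data dictionary above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name; replace with the actual dataset path
crash = pd.read_csv("Car_Crash.csv")

# Head/tail, shape, information, descriptive statistics and null-value check
print(crash.head())
print(crash.tail())
print(crash.shape)
crash.info()
print(crash.describe(include="all").T)
print(crash.isnull().sum())

# Correlation check among the numeric variables (multicollinearity screen)
sns.heatmap(crash.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

# Bivariate view of the predictors with the dependent variable as hue
sns.pairplot(crash, hue="Survived")
plt.show()
```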
Q 2.2 Encode the data (having string values) for Modelling. Data Split: Split
the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
The column 'caseid' does not add any value beyond identifying a specific vehicle, so we can drop it.
The string-valued variables 'Survived', 'airbag', 'seatbelt', 'sex', 'abcat' and 'occRole' are encoded for modelling.
The distributions of 'frontal', 'injSeverity', 'dvcat', 'seatbelt' and 'abcat' were also reviewed.
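A sketch of the dropping, encoding and 70:30 split described above, continuing from the crash DataFrame; coding 'dvcat' as ordinal integer codes and the stratify and random_state choices are assumptions of this sketch:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Drop the identifier column: it only identifies a specific vehicle
crash = crash.drop("caseid", axis=1)

# Integer-code the string-valued columns; 'dvcat' (speed band) is included
# here as ordinal codes, which is an assumption of this sketch
for col in ["Survived", "airbag", "seatbelt", "sex", "abcat", "occRole", "dvcat"]:
    crash[col] = pd.Categorical(crash[col]).codes

X = crash.drop("Survived", axis=1)
y = crash["Survived"]

# 70:30 train-test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```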
Logistic Regression – training data
Confusion matrix
Accuracy – 0.9796101564503719
Logistic Regression – test data
Confusion matrix
Accuracy – 0.9820466786355476
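The confusion matrices and accuracy figures above could be produced with a logistic-regression sketch along these lines, continuing from the split above; max_iter=1000 is an assumption added to ensure convergence:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Fit logistic regression on the training data
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)

# Confusion matrix and accuracy on the train and test sets
for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = logit.predict(X_)
    print(name, "confusion matrix:\n", confusion_matrix(y_, pred))
    print(name, "accuracy:", accuracy_score(y_, pred))
```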
Applying LDA (linear discriminant analysis).
Train data
Confusion matrix
Classification report
Test data
Confusion matrix
Classification report
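A corresponding sketch for LDA on the same split; LinearDiscriminantAnalysis with default settings is assumed here:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report

# Fit LDA on the training data
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Confusion matrix and classification report on the train and test sets
for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lda.predict(X_)
    print(name, "confusion matrix:\n", confusion_matrix(y_, pred))
    print(name, "classification report:\n", classification_report(y_, pred))
```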
Q 2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix. Plot the ROC curve and get the ROC_AUC score for each model. Compare both models and write inferences on which model is best/optimized.
As it is important to predict survival accurately, both accuracy and recall are important, and both models look good, with very little difference between them. However, Logistic Regression scores slightly higher on recall, so the Logistic Regression model is preferred. As all performance parameters are quite high, we can use this data and these features for recommendations based on the model performance.
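For the ROC curve and ROC_AUC part of this question, a sketch over the test set, assuming the two fitted models from the sketches above, could look like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ROC curve and ROC_AUC score for both models on the test set
for name, model in [("Logistic Regression", logit), ("LDA", lda)]:
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```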
Q 2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The weight and the year (age) of the car do not greatly impact survival during a crash.
The probability of survival at low impact speeds, as at accident-prone locations, is high.
Crashes in which the occupants had airbags deployed and seatbelts fastened show higher survival.
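As a rough way to surface the important factors behind these observations, the logistic-regression coefficients could be ranked by absolute value; this ranking is scale-sensitive because the predictors were not standardised here, so treat it as illustrative only:

```python
import pandas as pd

# Rank predictors by the magnitude of their logistic-regression coefficients
factor_importance = pd.Series(logit.coef_[0], index=X.columns)
print(factor_importance.abs().sort_values(ascending=False))
```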
It should be mandatory to have airbags for both front and back seats to ensure safety during frontal and non-frontal impact accidents.
The government can implement reward-versus-penalty clauses for manufacturers based on the record of safety measures deployed in their vehicles.