PREDICTIVE MODELLING
BUSINESS REPORT
THANUSRI A
14-01-2024
Problem 1
Define the problem and perform exploratory Data Analysis
Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables
Data Pre-processing
Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split
Model Building - Linear regression
Apply Linear Regression using sklearn - Using statsmodels, perform checks for significant
variables using the appropriate method - Create multiple models and check the performance
of predictions on train and test sets using R-square, RMSE & Adjusted R-square.
Business Insights & Recommendations
Comment on the Linear Regression equation from the final model and the impact of relevant
variables (at least 2) as per the equation - Conclude with the key takeaways (actionable
insights and recommendations) for the business
The top 5 rows of the dataset:
Note that there are many zeroes in a few of the features
The last 5 rows of the dataset:
The number of rows and columns in the dataset:
The shape of the data is (8192, 22)
Dataset summary:
Basic info about the dataset:
There are a total of 8192 rows and 22 columns in the dataset. Out of the 22 columns,
13 are float type, 8 are integer type and 1 is an object type variable.
Dataset null value check:
There are missing values present in ‘rchar’ and ‘wchar’.
Let us treat them using the median value.
There are no duplicate rows present:
There are many 0 values present in a few of the variables.
Let us check the number of zeroes in each feature and drop a feature if more
than 50 percent of its values are 0s.
The following features have more than 50 percent 0s:
'pgout','ppgout','pgfree','pgscan','atch'
So we drop these 5 columns from the dataset.
For the rest of the features, let us replace all the zeroes with median values.
The dataset after removing the required features and computing median values for the rest
of the features looks like:
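The zero-handling and imputation steps above can be sketched as follows. This is a minimal illustration on a made-up frame; the 50% threshold follows the report, but the helper name `clean_zeros_and_nulls` and the example columns are hypothetical, not the report's actual code.

```python
import numpy as np
import pandas as pd

def clean_zeros_and_nulls(df, zero_threshold=0.5):
    """Drop numeric columns where more than `zero_threshold` of values are 0,
    then replace remaining zeros and NaNs with the column median
    (computed over the non-zero values)."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    zero_frac = (df[num_cols] == 0).mean()            # fraction of zeros per column
    to_drop = zero_frac[zero_frac > zero_threshold].index.tolist()
    df = df.drop(columns=to_drop)
    for col in df.select_dtypes(include="number").columns:
        median = df.loc[df[col] != 0, col].median()   # median of non-zero values
        df[col] = df[col].replace(0, np.nan).fillna(median)
    return df, to_drop
```

The same idea applies to the report's features: 'pgout', 'ppgout', 'pgfree', 'pgscan' and 'atch' would exceed the threshold and be dropped, while the remaining zeros are imputed.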
Now we create a dataframe that contains only the integer and float type variables
and plot boxplots for these features.
There are outliers present in the data, and these need to be treated.
There are many methods by which outliers can be treated; we choose the IQR method.
In this method, any observation that is less than Q1 − 1.5 × IQR or more than
Q3 + 1.5 × IQR is considered an outlier.
After outlier treatment :
Outliers have been successfully treated from the dataset now.
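A minimal sketch of the IQR treatment is below. The report does not say whether outliers were capped or dropped; this version assumes capping at the IQR bounds, which is a common choice and keeps the row count unchanged.

```python
import pandas as pd

def treat_outliers_iqr(df):
    """Cap every numeric value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    at the corresponding bound."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        df[col] = df[col].clip(lower, upper)   # values outside are pulled to the bounds
    return df
```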
UNIVARIATE ANALYSIS:
We plot histograms for the different features present in the dataset.
Bivariate analysis:
The scatterplots between the dependent variable and the different independent
variables are as follows:
A scatter plot between CPU run in usr mode vs freeswap
A scatter plot between CPU run in usr mode vs freemem
Multivariate analysis:
Scatterplot of ‘lread’ and ‘lwrite’ separated by ‘runqsz’
Scatterplot of ‘sread’ and ‘swrite’ separated by ‘runqsz’
Scatterplot between ‘exec’ and ‘fork’ separated by ‘runqsz’
Scatterplot between ‘vflt’ and ‘pflt’ separated by ‘runqsz’
Checking for correlation:
Correlation between variables:
Pairplot is as follows:
The pair plot shows the relationships between the variables as scatterplots and
the distribution of each variable as a histogram.
As the given dataset contains a large number of columns, the pair plot looks a
little cluttered.
In some plots we can see positive correlation, in some negative correlation,
and in some no correlation.
Now we convert the categorical variable ‘runqsz’ into numerical form by
encoding it with dummy variables:
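The dummy encoding can be sketched with `pandas.get_dummies`. The small frame below is a hypothetical stand-in for the dataset; the category labels are assumed, not quoted from the report's output.

```python
import pandas as pd

# Hypothetical mini-frame; 'runqsz' is the only categorical column, as in the report
df = pd.DataFrame({"runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
                   "usr": [90, 85, 88]})

# drop_first=True keeps a single 0/1 indicator column and avoids the dummy trap
encoded = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
```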
The first 5 rows of the dataset now :
Now we make a copy of the dataset as it stands at this point, so that the
original remains available for later steps.
Then we separate the dataset into X (the independent variables) and y (the
dependent variable).
SPLITTING THE DATASET INTO TRAINING AND TESTING DATASET
Let us create the X and y data with the ‘usr’ column as the target variable:
X holds every column except the target variable, and y holds only the target
variable.
We split the independent variables X into two parts, one for training (X_train)
and one for testing (X_test), and likewise split the dependent variable y into
y_train for training and y_test for testing.
We use the statsmodels API (imported as sm) to add an intercept to X, and
sklearn to split the data into train and test sets.
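The split can be sketched as below. The frame, its columns and the 70/30 ratio are assumptions for illustration; the report does not state the exact ratio or random seed used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame; 'usr' is the target, as in the report
df = pd.DataFrame({"freemem": range(100),
                   "freeswap": range(100, 200),
                   "usr": range(50, 150)})
X = df.drop(columns=["usr"])   # independent variables
y = df["usr"]                  # dependent (target) variable

# 70/30 split with a fixed seed for reproducibility (ratio assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
```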
X data frame :
Y data frame :
The coefficients are:
The intercept of the model is
R square on training data:
R square on testing data:
RMSE on training data:
RMSE on testing data:
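The sklearn model fit and the R-square/RMSE checks reported above can be sketched as follows, here on synthetic data rather than the report's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: y is a noisy linear function of two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)   # coefficients in model.coef_,
                                                   # intercept in model.intercept_
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
```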
LINEAR REGRESSION USING STATS MODELS :
With the train and test data split, we can proceed to create the linear model.
To create the OLS model, we use the OLS function from the statsmodels API
package and fit it with X_train and y_train.
The model summary is :
The R-square value tells us that the model can explain 76.6% of the variance in
the training set.
The Adjusted R-square is also close to the R-square, at 76.6%.
RMSE on train data:
RMSE on test data:
Scatterplot between the actual y value and predicted y value:
The following table shows the comparison between the actual and predicted
values and their difference, i.e. the residual.
Graph between predicted and residual values
The residual density graph:
The final linear model equation of the data is :
From the above linear equation it can be seen that
there are many negative coefficients present in the equation.
Except for ‘fork’ and ‘freemem’, the coefficients are negative, so an increase
in those variables implies a decrease in ‘usr’.
When ‘fork’ (number of system fork calls per second) increases by one unit, the
‘usr’ value increases by 33%, and when ‘exec’ (number of system exec calls per
second) increases by one unit, ‘usr’ decreases by 38.9%.
Problem 2
Define the problem and perform exploratory Data Analysis
Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables
Data Pre-processing
Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier
Detection(treat, if needed) - Feature Engineering (if needed) - Encode the data - Train-test
split
Model Building and Compare the Performance of the Models
Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a
CART model - Prune the CART model by finding the best hyperparameters using
GridSearch - Check the performance of the models across train and test set using different
metrics - Compare the performance of all the models built and choose the best one with
proper rationale
Business Insights & Recommendations
Comment on the importance of features based on the best model - Conclude with the key
takeaways (actionable insights and recommendations) for the business.
Top five rows of the dataset:
Last five rows of the dataset:
Shape of the data:
Dataset summary:
Basic info about the dataset:
There are 2 features with float datatype, 1 feature with integer datatype, and
7 features with object datatype.
Null value check:
There are null values present in the ‘Wife_age’ and ‘No_of_children_born’
features.
Let us treat the null values by imputing each with the median value of that
particular feature.
There are 80 duplicate rows present in the dataset.
Let us drop all the duplicate rows present in the dataset.
Now the shape of the data is
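These two cleaning steps can be sketched together; the values below are made up for illustration and only the column names come from the report:

```python
import pandas as pd

# Hypothetical rows with the two columns the report says contain nulls
df = pd.DataFrame({"Wife_age": [24.0, None, 24.0, 30.0],
                   "No_of_children_born": [1.0, 2.0, 1.0, None]})

# Median imputation for the columns with missing values
for col in ["Wife_age", "No_of_children_born"]:
    df[col] = df[col].fillna(df[col].median())

# Remove exact duplicate rows, as the report does for its 80 duplicates
df = df.drop_duplicates().reset_index(drop=True)
```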
UNIVARIATE ANALYSIS:
BIVARIATE ANALYSIS:
The above plot shows the relation between different age group of women and contraceptive
method used.
From the below plot we may note that tertiary-educated women use contraceptive
methods the most.
The below plot shows that wives whose husbands have the highest education level
have used contraceptive methods the most.
The above plot shows that the non working women use the most contraceptive methods.
From the above plot, Women with very high standard living index use the most contraceptive
measures.
Correlation between variables:
We have noticed that only three features are in integer form and can be plotted
directly, so we now convert all the object columns to categorical codes.
We encoded the categorical variables Wife_education, Husband_education,
Wife_religion, Standard_of_living_index, Media_exposure and
Contraceptive_method_used in ascending order from worst to best, since LDA does
not take string variables as parameters for model building.
Below is the encoding for the ordinal values:
Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0
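One way to apply such an ordinal encoding is an explicit mapping per column, shown here for one of the columns (the mapping values are exactly those listed above; the mini-frame is hypothetical):

```python
import pandas as pd

# Ordinal codes as listed in the report for Wife_education
wife_edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}

df = pd.DataFrame({"Wife_education": ["Primary", "Tertiary", "Uneducated"]})
df["Wife_education"] = df["Wife_education"].map(wife_edu_map)
```

An explicit mapping like this preserves the intended worst-to-best order, which generic label encoding (alphabetical) would not.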
The first rows of the dataset:
Info :
Outliers are as follows:
There are outliers present in the data
We treat them using IQR method
Pairplot:
Correlation between variables :
LOGISTIC REGRESSION
Train and Test Split: Let us create the X and y data with the
'Contraceptive_method_used' column as the target variable: X holds every column
except the target variable, and y holds only the target variable.
Before we proceed, we need to import the required libraries. In this encoding,
'Contraceptive_method_used' is 1 for Yes and 0 for No.
We use LabelEncoder from the sklearn library to encode the data if it has not
been encoded previously; the encoding is for creating the dummy variables.
Now the train set and the test set have been split using sklearn, and we use
the logistic regression model to fit the data and create a logistic model.
The proportion of 1s and 0s, i.e. customers with Contraceptive_method_used
Yes/No, is as follows:
Now we fit the logistic regression model using newton-cg as the solver and 1000
as the maximum number of iterations, and obtain the predicted data frame.
In the above data frame, we can see that label 1 has the highest share, 69.43%.
The model accuracy is 67.3%
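The fit described above can be sketched as follows; the newton-cg solver and 1000 iterations come from the report, while the data here is a synthetic binary-classification stand-in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded contraceptive dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Solver and iteration budget as described in the report
clf = LogisticRegression(solver="newton-cg", max_iter=1000)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)   # test-set accuracy
```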
AUC and ROC curve
Now we can plot the ROC curves of the model and obtain separate curves and AUC
scores for the train dataset and the test dataset.
AUC curve for the train data:
In this curve, if the plot falls below the dotted line, the model is considered
a very poor one. Even though the curve is not perfect, it is acceptable; the
AUC (Area Under the Curve) for the train data model is 71.8%.
AUC curve for the test data:
This curve is similar to the train data curve but varies slightly at the start.
The curve is acceptable, as it is plotted above the dotted line.
The area under the curve is the same as for the train data, 71.8%.
Comparing the train data AUC with the test data AUC, both curves are mostly
similar with only minor variation, as the AUC of both is 71.8%. Let us move on
to the confusion matrix.
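The AUC computation and the points for the ROC plot can be sketched like this (synthetic data again; the positive-class probabilities are what the curve is built from):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.8, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = LogisticRegression(solver="newton-cg", max_iter=1000).fit(X_train, y_train)

# AUC uses the predicted probability of the positive class, not the hard labels
probs_test = clf.predict_proba(X_test)[:, 1]
auc_test = roc_auc_score(y_test, probs_test)
fpr, tpr, thresholds = roc_curve(y_test, probs_test)  # points for the ROC plot
```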
Confusion matrix for train data:
This plot shows the relationship between the true labels and predicted labels as 0s and 1s.
Classification report is as follows
For Contraceptive_method_used (Label 0 ):
Precision (66%) – of all the married women predicted as not using a
contraceptive method, 66% are actually not using one.
Recall (53%) – of all the married women actually not using a contraceptive
method, 53% have been predicted correctly.
For Contraceptive_method_used (Label 1 ):
Precision (68%) – of all the married women predicted as using a contraceptive
method, 68% are actually using one.
Recall (79%) – of all the married women actually using a contraceptive method,
79% have been predicted correctly.
And the accuracy is 67%, which is more than 50%, so the model is good.
Confusion matrix for test data:
This plot shows the relationship between the true labels and predicted labels as 0’s and 1’s
And the classification report is as follows
For Contraceptive_method_used (Label 0 ):
Precision (64%) – of all the married women predicted as not using a
contraceptive method, 64% are actually not using one.
Recall (46%) – of all the married women actually not using a contraceptive
method, 46% have been predicted correctly.
For Contraceptive_method_used (Label 1 ):
Precision (65%) – of all the married women predicted as using a contraceptive
method, 65% are actually using one.
Recall (79%) – of all the married women actually using a contraceptive method,
79% have been predicted correctly.
And the accuracy is 65%, which is more than 50%, so the model is also good, in
line with the training data.
Grid search :
We use GridSearchCV from sklearn to find the best model. The process is the
same as above, and we get:
For train data
Using this method we get similar values here as well, and the accuracy is still 67%.
For test data
Using this method we get similar values here as well, and the accuracy is still 65%.
Overall accuracy of the model – 67% of total predictions are correct.
Accuracy, AUC, precision and recall for the test data are almost in line with
the training data. This indicates no overfitting or underfitting has occurred,
and overall the model is a good model for classification.
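A sketch of the grid search is below. The report does not list the parameters it searched, so the `param_grid` here is hypothetical; only the estimator settings (newton-cg, 1000 iterations) come from the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Hypothetical grid over the regularisation strength C
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
grid = GridSearchCV(LogisticRegression(solver="newton-cg", max_iter=1000),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_acc = grid.best_estimator_.score(X_test, y_test)
```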
LINEAR DISCRIMINANT ANALYSIS
Train and Test Split:
The procedure for splitting the train and test data is the same as for the
logistic regression above.
We need to import LDA (Linear Discriminant Analysis) from the sklearn library,
and the results are as follows:
There is a slight difference between the training and test data reports, but it
is acceptable: the accuracy on the train data is 67% and on the test data 65%.
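The LDA fit follows the same pattern; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
train_acc = lda.score(X_train, y_train)
test_acc = lda.score(X_test, y_test)
```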
CART
In CART we can use the dataset with the outliers retained, as CART is not sensitive to outliers.
Train and Test Split:
The same procedure as for the logistic regression and LDA above: the train and
test data need to be split, and before that the necessary libraries need to be
imported.
In CART, the decision tree is the central object.
Decision tree:
We fit the training data to the decision tree, export the tree description to a
new Word document and save it in the project folder.
Now we can copy and paste the exported code into http://webgraphviz.com/ to view
the decision tree, deleting any existing code there first.
The tree will be a little messy because the data contains a large amount of
information and many classifications, so we reduce the maximum leaves, maximum
depth and minimum sample size of the tree.
Here ‘Gini’, the decision tree splitting criterion, plays an important role. We
create a new Word document with the tree reduced to 30 branches, a leaf size of
10 and a depth of 7, and save the document in the project folder.
Now decision tree is looking better than before
Now let us check the feature importance, where feature importance refers to
techniques that assign a score to input features based on how useful they are
at predicting a target variable.
As we see, ‘Wife_age’ has the highest importance, so we can infer that the use
of a contraceptive method depends largely on the woman's age.
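A sketch of the pruned tree and the feature-importance check is below. The mapping of the report's "branches 30, leaf 10, depth 7" onto `min_samples_split`, `min_samples_leaf` and `max_depth` is an assumption, as is the synthetic data; `export_graphviz` produces the DOT text that can be pasted into http://webgraphviz.com/.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Pruning constraints assumed to mirror the report's "30 / 10 / 7" settings;
# Gini impurity is the default splitting criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=7,
                              min_samples_split=30, min_samples_leaf=10,
                              random_state=1)
tree.fit(X_train, y_train)

# DOT text for webgraphviz, and the importance score per input feature
dot = export_graphviz(tree, out_file=None, feature_names=["f0", "f1", "f2"])
importances = tree.feature_importances_
```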
AUC plot:
As the AUC curve bends high, the model is good; its AUC value for the train
data is 82.4%.
Here the plot is not quite smooth, but over the whole area it keeps the bend
formation, and its AUC value for the test data is 70.0%.
Confusion matrix for train data:
By examining the confusion matrix of the train data, we get a True Positive
count of 260 and a True Negative count of 474.
For Contraceptive_method_used (Label 0 ):
Precision (77%) – of all the married women predicted as not using a
contraceptive method, 77% are actually not using one.
Recall (62%) – of all the married women actually not using a contraceptive
method, 62% have been predicted correctly.
For Contraceptive_method_used (Label 1 ):
Precision (75%) – of all the married women predicted as using a contraceptive
method, 75% are actually using one.
Recall (86%) – of all the married women actually using a contraceptive method,
86% have been predicted correctly.
And the accuracy is 75%, which is more than 50%, so the model performs well on
the training data.
Confusion matrix for test data:
By examining the confusion matrix of the test data, we get a True Positive
count of 91 and a True Negative count of 182.
For Contraceptive_method_used (Label 0 ):
Precision (67%) – of all the married women predicted as not using a
contraceptive method, 67% are actually not using one.
Recall (47%) – of all the married women actually not using a contraceptive
method, 47% have been predicted correctly.
For Contraceptive_method_used (Label 1 ):
Precision (64%) – of all the married women predicted as using a contraceptive
method, 64% are actually using one.
Recall (81%) – of all the married women actually using a contraceptive method,
81% have been predicted correctly.
And the accuracy is 65%, which is more than 50%, so the model also performs
reasonably, though below the training accuracy.
CONCLUSION
From the above models, in every model the encoded label ‘1’ (contraceptive
method used) is predicted most often, and the accuracy and F1 score of the
models also favour label ‘1’.
We cannot conclude with certainty whether a contraceptive method is used or
not, but the models consistently predict that married women use a contraceptive
method, and the final predictions show the same.