Predictive Modelling

PM PROJECT REPORT

Rohit Nagarahalli

Problem 1
    Executive Summary
    Introduction
        Data shape
        Data types
        Statistical Summary
        Univariate Analysis
        Multivariate Analysis
    Key Meaningful Observations
    Data Preprocessing
    Model Building
        Approach-1
        Assumptions
        Linear Regression Equation
        Observations for Approach-1
        Approach-2 Regularization Methods
        Approach-3
        Linear Regression Equation and comment on variables
    Key Observations
Problem-2
    Executive Summary
    Introduction
        Data shape
        Univariate Analysis
        Multivariate Analysis
    Data Preprocessing
        Missing Values
        Logistic Regression
        Observations
        Linear Discriminant Analysis
        Observations
        CART Model
    Business Insights and Recommendations
        Importance of feature based on best model
        Actionable Insights and recommendations

Figure 1 lread
Figure 2 lwrite
Figure 3 scall
Figure 4 sread
Figure 5 swrite
Figure 6 fork
Figure 7 exec
Figure 8 rchar
Figure 9 wchar
Figure 10 atch
Figure 11 pgin
Figure 12 ppgin
Figure 13 pflt
Figure 14 pgout
Figure 15 ppgout
Figure 16 pgfree
Figure 17 pgscan
Figure 18 vflt
Figure 19 freemem
Figure 20 freeswap
Figure 21 usr
Figure 22 runqsz count
Figure 23 runqsz distribution w.r.t target
Figure 24 Raw Data Correlations
Figure 25 Count of Zeros
Figure 26 Correlations post data cleaning
Figure 27 Linearity and Independence test
Figure 28 Normality of residuals
Figure 29 QQ Plot for normality
Figure 30 Actual vs Predicted
Figure 31 Ridge Actual vs Predicted
Figure 32 Lasso Actual vs Predicted
Figure 33 Normality Test
Figure 34 QQ Plot (Approach-3)
Figure 35 Actual vs Predicted Approach-3
Figure 36 Age of Wife
Figure 37 Wife and Husband Education
Figure 38 No. of Children
Figure 39 Wife's Religion and Working status
Figure 40 Husband Occupation and Standard of living
Figure 41 Media Exposure and Contraceptive used
Figure 42 Distribution of children vs Wife's Education
Figure 43 Distribution of children with Wife's Education
Figure 44 Children per Husband's Education
Figure 45 Outlier Detection
Figure 46 Confusion Matrix
Figure 47 ROC AUC
Figure 48 Training and Testing Confusion Matrix
Figure 49 AUC Curve LDA Model
Figure 50 AUC ROC CART (Pruned)

Table 1 Data Types
Table 2 Statistical Summary
Table 3 Percentage of 0
Table 4 Initial OLS model
Table 5 OLS model post VIF treatment
Table 6 Final OLS model
Table 7 Coefficient
Table 8 Ridge model Coefficient
Table 9 Approach-3 Initial OLS Model
Table 10 VIF values
Table 11 Final OLS Approach-3
Table 12 Shape
Table 13 Summary
Table 14 Missing Values
Table 15 Classification report
Table 16 LDA Classification report
Table 17 Important CART features
Table 18 CART Classification matrix

Equation 1 Linear Regression
Equation 2 Linear Regression Equation (Approach-3)
Equation 3 Linear Discriminant Function

Problem 1

Executive Summary:
The comp-activ database comprises activity measures of computer systems. Data was
gathered from a Sun Sparcstation 20/712 with 128 Mbytes of memory, operating in a multi-
user university department. Users engaged in diverse tasks, such as internet access, file
editing, and CPU-intensive programs.

Introduction:
The aim is to establish a linear equation for predicting 'usr' (the percentage of time CPUs
operate in user mode). Also, to analyse various system attributes to understand their
influence on the system's 'usr' mode.

Data shape:
The dataset has 8192 records and 22 variables. However, the variables ‘rchar’ and ‘wchar’ have only 8088 and 8177 non-null records respectively, with the rest being missing values/records, which are addressed later.
Data types:
 #  Column    Non-Null Count  Dtype
 0  lread     8192 non-null   int64
 1  lwrite    8192 non-null   int64
 2  scall     8192 non-null   int64
 3  sread     8192 non-null   int64
 4  swrite    8192 non-null   int64
 5  fork      8192 non-null   float64
 6  exec      8192 non-null   float64
 7  rchar     8088 non-null   float64
 8  wchar     8177 non-null   float64
 9  pgout     8192 non-null   float64
10  ppgout    8192 non-null   float64
11  pgfree    8192 non-null   float64
12  pgscan    8192 non-null   float64
13  atch      8192 non-null   float64
14  pgin      8192 non-null   float64
15  ppgin     8192 non-null   float64
16  pflt      8192 non-null   float64
17  vflt      8192 non-null   float64
18  runqsz    8192 non-null   object
19  freemem   8192 non-null   int64
20  freeswap  8192 non-null   int64
21  usr       8192 non-null   int64
Table 1 Data Types

From Table 1 it can be noted that 21 of the variables are numerical (int64 or float64). The variable ‘runqsz’ is an object, actually a category with levels CPU_Bound and Not_CPU_Bound; it will later be converted to a numerical category (0 or 1).

Statistical Summary:

Table 2 Statistical Summary

It can be inferred from the statistical summary that most of the variables’ statistics look abnormal, largely because of the presence of zeros in most of the columns. Whether the zeros are meaningful or need to be imputed is evaluated approach by approach. Since there is no domain information on the zeros, models need to be built considering both scenarios.
Univariate Analysis:
There are a total of 21 numerical, continuous columns. A violin plot suits this analysis, since it describes not only the distribution but also the outlying points, skewness and normality.

Figure 1 lread
Figure 2 lwrite
Figure 3 scall
Figure 4 sread
Figure 5 swrite
Figure 6 fork
Figure 7 exec
Figure 8 rchar
Figure 9 wchar
Figure 10 atch
Figure 11 pgin
Figure 12 ppgin
Figure 13 pflt
Figure 14 pgout
Figure 15 ppgout
Figure 16 pgfree
Figure 17 pgscan
Figure 18 vflt
Figure 19 freemem
Figure 20 freeswap
Figure 21 usr
Figures 1 to 21 present the univariate analysis of the continuous variables. From Figure 1 to Figure 19, all the variables follow a common pattern: they are skewed to the right. Figure 20, representing freeswap, appears to have a trimodal distribution, with three peaks in the plot. The target variable "usr" is left-skewed, with a very small peak on the left where the smaller values lie. For all the right-skewed variables, a log transformation could be an ideal solution; models with and without transformation are built ahead to compare their performance.

Figure 22 runqsz count

The above plot for the variable "runqsz" indicates a balance between the two values (CPU_Bound and Not_CPU_Bound), further indicating the absence of class imbalance for this particular variable.
Multivariate Analysis:

Figure 23 runqsz distribution w.r.t target

Figure 23 represents the distribution of the variable runqsz over the target. Though there is a slight difference in the means, the distributions do not vary hugely; CPU_Bound has a few outliers at the very low end. This implies that the only categorical variable in the data provides little discriminating insight, at least with respect to the target.

Figure 24 Raw Data Correlations

It can be inferred from Figure 24 that many independent variables are correlated with each other. For example, taking the absolute correlation between variables with a threshold of 0.5, a total of 15 variable pairs have correlations above 0.5 (50%). Another interesting observation with respect to the target is that all the variables except runqsz (categorical), freemem and freeswap are negatively correlated with it, and freeswap has a strong positive correlation with the target (usr).

Key Meaningful Observations:

• From the raw-data univariate analysis, all variables except freeswap and usr are right-skewed. Also, freeswap is observed to have a trimodal distribution.

• The statistical summary came out uneven, a direct impact of the large number of zeros in the variables. This also indicated the presence of outliers.

• The target is left-tailed, although it appears to have a small secondary mode at the left end of the tail. The correlations of the independent variables with the target indicate that all are negatively correlated except freemem and freeswap; of the two, freeswap has the stronger positive correlation with the target.

• Given the large number of zeros in the data, zeros can turn out to be a problem, especially in linear regression. Since the domain has not confirmed whether the zeros are legitimate, multiple models are built: one without and another with zeros.

Data Preprocessing:

It was noted that there were 119 missing values (NaN) in the dataset, of which 104 were in rchar and 15 in wchar. These missing values were handled using a KNN imputer whenever a model was built. Since model building was bifurcated into with-zeros and without-zeros approaches, the missing values were treated together with the zeros when a model treated the zeros as placeholder values of little importance; otherwise, only the initial 119 missing values were handled, treating the zeros as legitimate.

There were also outliers in the data. On investigation, the outlying values turned out to be legitimate, and building the model with these legitimate outliers did not harm the model. Although outliers can introduce slightly more error than a model without them would, it is still best to keep them when the outlying values are legitimate.

Figure 25 Count of Zeros

The above plot represents the count of zeros in the raw data. The data is heavily affected by the presence of a large number of zeros, and zero-inflated datasets can ruin a regression model. The skewness seen in the univariate analysis is also clearly visible here. Multiple model-building approaches could therefore provide a solution, since a single ad-hoc approach might not be right.
The variable "pgscan" consists of 79% zeros, so it should cause no harm to drop it from the dataset; such a large share of zeros makes a variable unsuitable for model building.
Multiple approaches for building the model were implemented: with zeros, without zeros, with a log transformation in the presence of zeros, and with the coefficients penalised using regularization methods.
The dependent variable "usr" represents the percentage of time CPUs operate in user mode. Zero values in this variable suggest that either the CPU was not operating or the value was miscalculated. Since this could affect model building, and the counts involved are small, the records where usr is zero, one or two are dropped, as even these are very few and look implausible given the problem statement.

Model Building

Approach-1
In this approach we drop the variables that have more than 50% zeros in their column, convert the remaining zeros into NaN, and impute them using a KNN imputer.

lread     8.53
lwrite    33.59
scall     0.00
sread     0.00
swrite    0.00
fork      0.25
exec      0.25
rchar     0.00
wchar     0.00
pgout     60.32
ppgout    60.32
pgfree    60.21
atch      56.65
pgin      15.38
ppgin     15.38
pflt      0.04
vflt      0.00
runqsz    0.00
freemem   0.00
freeswap  0.00
usr       0.00

Table 3 Percentage of 0

From Table 3, we drop the variables pgout, ppgout, pgfree and atch from the dataset and proceed further with model building.
NOTE: The variables dropped here because of the presence of zeros are considered as-is in the other approaches. Since there is no domain information on the importance of these features and their zeros, we also build models under the assumption that the dropped features might contain crucial information and are not suitable to drop.
Further, in the features having less than 50% zeros, the zeros are converted to NaN and imputed using KNN imputation.
Why KNN Imputation?

• KNN imputation considers relationships between variables. It takes into account the
similarity between data points and imputes missing values based on the values of the
nearest neighbours.
• KNN imputation is not limited by assumptions of linearity. It can capture non-linear
relationships between variables, making it suitable for datasets where the
relationships are more complex and cannot be adequately represented by a linear
model.
• In datasets where variables are interrelated, KNN imputation can be more effective.
It considers the overall patterns in the data and can provide better estimates for
missing values when variables interact with each other.
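
Below is a minimal sketch of how this imputation step could look with scikit-learn, assuming the raw data sits in a pandas DataFrame named df (the column lists follow Table 3; n_neighbors=5 is an illustrative choice, not stated in the report):

import numpy as np
from sklearn.impute import KNNImputer

# Approach-1: drop the columns whose share of zeros exceeds 50%.
df = df.drop(columns=["pgscan", "pgout", "ppgout", "pgfree", "atch"])

# Convert the remaining zeros in the affected columns to NaN so the
# imputer treats them like the original missing values in rchar/wchar.
zero_cols = ["lread", "lwrite", "fork", "exec", "pgin", "ppgin", "pflt"]
df[zero_cols] = df[zero_cols].replace(0, np.nan)

# KNN imputation: each NaN is filled based on the 5 most similar rows.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])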

Figure 26 Correlations post data cleaning

Table 4 Initial OLS model

The data was split into X and y for the independent and dependent variables respectively, and further split into train and test sets with a test size of 30% and a random state of 7 to obtain the model shown in Table 4. Random state 7 is used for all the models.
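
A minimal sketch of this split and fit, assuming X (with runqsz one-hot encoded) and y are the prepared predictors and target; statsmodels is assumed, since the report shows OLS summary tables:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

# statsmodels OLS needs an explicit intercept column.
X_train_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_sm).fit()
print(ols_model.summary())  # R-squared, coefficients, p-values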

From Table 4, the initial model (assumptions unchecked) offers a few key observations. The R-squared and adjusted R-squared are 0.912 and 0.91 respectively, a good result since the model explains about 91% of the variance. However, the model's assumptions are yet to be evaluated, and strong multicollinearity is present.
Since there is strong multicollinearity, it was addressed using the VIF (Variance Inflation Factor). The VIF was calculated for the initial model; variables with VIF larger than 10 were noted and eliminated one by one, checking the R-squared and adjusted R-squared after each elimination for any changes.
Tracking the R-squared and adjusted R-squared showed that dropping the variable "fork", which was contributing to multicollinearity, caused little or no decrease in model efficiency: both came out to 0.911 after eliminating "fork".
Since strong multicollinearity remained after eliminating "fork", the VIF was checked again and variables were dropped one by one. Columns such as sread, pflt and pgin were dropped from the model in turn, and a model with little or no multicollinearity was built.
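
A minimal sketch of the VIF check, reusing X_train_sm from the previous snippet; the automated drop-the-worst loop below is an illustrative stand-in for the report's manual one-by-one elimination:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    # VIF per predictor; the intercept column "const" is skipped.
    cols = [c for c in X.columns if c != "const"]
    vals = [variance_inflation_factor(X.values, X.columns.get_loc(c))
            for c in cols]
    return pd.Series(vals, index=cols).sort_values(ascending=False)

X_cur = X_train_sm.copy()
vif = vif_table(X_cur)
while vif.iloc[0] > 10:                         # threshold for the first pass
    X_cur = X_cur.drop(columns=[vif.index[0]])  # drop the worst offender
    vif = vif_table(X_cur)
print(vif)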

Table 5 OLS model post VIF treatment

Multicollinearity was resolved using VIF. The threshold was set at 5, and all remaining variables now have VIF values below 5. Table 5 shows the model after treating multicollinearity, with R-squared and adjusted R-squared both at 0.906. However, this cannot be considered the final model, as the variable lwrite has a p-value > 0.05.

The p-value tests the importance of a feature, i.e. whether the feature's coefficient is different from zero. From the OLS table above, the variable "lwrite" has a p-value of 0.142, above the 0.05 threshold, so we fail to reject the null hypothesis that its coefficient is zero and treat it as the least important variable in the group.
A final model with little or no multicollinearity and only significant variables was built, as shown below.

Table 6 final OLS model

The R-squared and adjusted R-squared remain unaltered after dropping lwrite, as is clearly visible in Table 6 showing the final OLS model. In this OLS model the multicollinearity assumption has been satisfied and feature importance taken into consideration. The final R-squared and adjusted R-squared remain 0.906, explaining about 90.6% of the variance. A few more assumptions need to be satisfied before we can trust the model's predictions.

Assumptions:

Linearity and Independence test:

• Linearity describes a straight-line relationship between two variables; the predictor variables must have a linear relation with the dependent variable.

• In a plot of fitted values vs residuals, if the residuals follow no pattern (the curve is a straight line), the model is linear; otherwise the model shows signs of non-linearity.

Figure 27 Linearity and Independence test

The above plot (Figure 27) of fitted values vs residuals follows no pattern (the curve is roughly a straight line), so the model is close to linear. Achieving perfect linearity can be impossible in practice; however, a near-linear result like the above, with randomly distributed residuals, is achievable.

Test for Normality:

• Error terms/residuals should be normally distributed.

• If the error terms are not normally distributed, confidence intervals may become too
wide or narrow. Once confidence interval becomes unstable, it leads to difficulty in
estimating coefficients based on minimization of least squares.

Figure 28 Normality of residuals

The above visual shows that the errors are approximately normally distributed, with some skew to the left. Since the model is built without outlier treatment (the outliers being legitimate values), the slight skew could be a result of that. Other approaches are tried later for comparison.

The QQ plot of residuals can be used to visually check the normality assumption. The normal
probability plot of residuals should approximately follow a straight line.

Figure 29 QQ Plot for normality

Most of the points lie on the straight line in the QQ plot. A few exceptions are expected; as suggested earlier, a fully perfect model is highly challenging to obtain, especially without domain input. The above QQ plot satisfies the assumption.

The Shapiro-Wilk test can also be used for checking the normality. The null and alternate
hypotheses of the test are as follows:

• Null hypothesis - Data is normally distributed.


• Alternate hypothesis - Data is not normally distributed.

• Since p-value < 0.05, the residuals are not normal as per Shapiro test.
• Strictly speaking - the residuals are not normal. However, as an approximation, we
might be willing to accept this distribution as close to being normal

Test for Homoscedasticity:


The null and alternate hypotheses of the Goldfeld-Quandt test are as follows:

• Null hypothesis: Residuals are homoscedastic


• Alternate hypothesis: Residuals have heteroscedasticity.

• Since p-value > 0.05 we can say that the residuals are homoscedastic.
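
A minimal sketch of both tests, reusing ols_model and X_train_sm from earlier; scipy's shapiro and statsmodels' het_goldfeldquandt are assumed, as the report does not show its code:

from scipy import stats
import statsmodels.stats.api as sms

residuals = ols_model.resid

# Shapiro-Wilk: H0 = residuals are normally distributed.
_, p_shapiro = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_shapiro)

# Goldfeld-Quandt: H0 = residuals are homoscedastic.
_, p_gq, _ = sms.het_goldfeldquandt(residuals, X_train_sm)
print("Goldfeld-Quandt p-value:", p_gq)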

Linear Regression Equation:


usr = 99.67456476140191
      - 0.011698042141953 * (lread)
      - 0.0014128167946733641 * (scall)
      - 0.004938179830042236 * (swrite)
      - 0.2686742094564625 * (exec)
      - 1.4794325809918855e-06 * (rchar)
      - 5.134623999262638e-06 * (wchar)
      - 0.04895505808532618 * (ppgin)
      - 0.024796525604816134 * (vflt)
      + 0.0002103487045647368 * (freemem)
      - 1.335542105397542e-06 * (freeswap)
      - 0.3178231678893986 * (runqsz_Not_CPU_Bound)

Equation 1 Linear Regression

RMSE Score:
Training Score: 2.7203264821013207
Test Score: 2.780337238168565

• We can see that RMSE on the train and test sets are comparable. So, our model is
not suffering from overfitting.
• Hence, we can conclude the model "OLS" is good for prediction as well as inference
purposes.

Figure 30 Actual vs Predicted

The plot shows Actual vs Predicted values for 100 randomly chosen records: blue for actual, red for predicted. The deviation between the two is very small, indicating the model predicts the values well.

Observations for Approach-1


• R-squared of the model is 0.906 and adjusted R-squared is 0.906, which shows that the model explains about 90% of the variance in the data, a strong result.

• A unit increase in exec results in a 0.2687 unit decrease in usr, all other variables remaining constant.

• More generally, each coefficient gives the constant change in usr per unit increase in its variable, all other variables remaining constant.

• The usr for a Not_CPU_Bound runqsz is 0.3178 units lower than for CPU_Bound, all other variables remaining constant.

• Moving forward, a few more models are built treating zeros as legitimate values, using linear regression and regularization.

Linear Regression using sklearn.

Table 7 Coefficient

The intercept for our model is 99.67456476141315


R-squared:
Training: 0.9063271600466553
Testing: 0.9030559276940191
About 90% of the variation in usr is explained by the predictors, for both the train and test sets.
RMSE:
Training: 2.7203264821013207
Testing: 2.780337238168478

Approach-2 Regularization Methods

Ridge and Lasso regularization add a penalty term to the linear regression cost function. The penalty is based on the magnitude of the coefficients and is sensitive to the scale of the input features, so scaling is recommended to ensure that all features contribute equally to the regularization term.
In this approach, zeros are considered valid values. Since regularization performs a form of feature selection, variables dominated by zeros may be penalised through the lambda (alpha) term.
All features except pgscan are considered, so we proceed with 21 features for model building.
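
A minimal sketch of this approach, assuming X and y here include all features except "pgscan"; the alpha grid is illustrative, since the report only states that Grid Search CV found alpha = 10 for Ridge and 0.01 for Lasso:

from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

# Scale inside a pipeline so the penalty treats all features equally.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
grid = GridSearchCV(pipe, {"model__alpha": [0.01, 0.1, 1, 10, 100]},
                    scoring="r2", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))

# For Lasso, repeat the same search with Lasso() in place of Ridge().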

Ridge Regularization:

Best alpha found: 10


Method employed: Grid Search CV (To Identify best alpha).

Table 8 Ridge model Coefficient

Ridge Model Score


Training Score: 0.9152828363177401
Testing Score: 0.9093091434469945

The Root Mean Squared Error on Train Set: 2.587020614213069


The Root Mean Squared Error on Test Set: 2.6891721118726006

Figure 31 Ridge Actual vs Predicted

The above plot shows Actual vs Predicted values for 100 randomly chosen records for the ridge model: green for actual, magenta for predicted. The deviation between the two is very small, indicating the model predicts the values well.

Lasso Regularization:
Best alpha found: 0.01
Method employed: Grid Search CV (To Identify best alpha).

Lasso Model Score


Training Score: 0.9152708482960568
Testing Score: 0.9093835593687756

The Root Mean Squared Error on Train Set: 2.587203647737826


The Root Mean Squared Error on Test Set: 2.6880685921944107

Figure 32 Lasso Actual vs Predicted

The above plot shows Actual vs Predicted values for 100 randomly chosen records for the lasso model: green for actual, magenta for predicted. The deviation between the two is very small, indicating the model predicts the values well.

Approach-3

The EDA (univariate analysis) clearly shows skewness in the data, with almost all variables skewed to the right. Another approach is therefore to build the model with a log transformation, this time without excluding the columns that have a significant number of zeros (except "pgscan"). The reason for the log transformation is that most of the variables are right (positively) skewed.
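
Because many columns contain zeros, a plain log is undefined there; below is a minimal sketch using log1p, i.e. log(1 + x), assuming df holds the features with "pgscan" already dropped (log1p is an assumption, as the report does not state how zeros were handled inside the transformation):

import numpy as np

# log1p maps 0 -> 0 and compresses the long right tails,
# so zero-valued records stay usable after the transform.
feature_cols = df.select_dtypes(include="number").columns.drop("usr")
df[feature_cols] = np.log1p(df[feature_cols])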
Initial Model

Table 9 Approach-3 Initial OLS Model

The R-squared and adjusted R-squared both come out to 0.955. Since strong multicollinearity is present, it is again addressed using VIF.

Table 10 VIF values

From Table 10, many columns have high VIF values. The high-VIF features were eliminated in a phased manner: the highest-VIF feature was dropped, and the R-squared and adjusted R-squared were noted; a drastic drop in adjusted R-squared would indicate the feature should not have been dropped. This process continued until the high-VIF features were eliminated, while keeping an eye on feature importance and any dip in the variance explained by the model.

Table 11 Final OLS Approach-3

NOTE: The final model for Approach-3 (Table 11) went through the same series of steps: eliminating high-VIF features one by one, checking R-squared, and then dropping features with p-values > 0.05. Only the screenshot of the final model is included, to keep the report concise.

Test for Normality (Approach-3):

Figure 33 Normality Test

The above visual shows that the errors are approximately normally distributed, with some skew to the left. Since the model is built without outlier treatment (the outliers being legitimate values), the slight skew could be a result of that.
The QQ plot of residuals can be used to visually check the normality assumption. The normal
probability plot of residuals should approximately follow a straight line.

Figure 34 QQ Plot (Approach-3)

From Figure 34, most of the points lie on the straight line, with some points off the line, a few close to it and a few farther away. However, the QQ plot of Approach-1 looked better than this one.

The Shapiro-Wilk test can also be used for checking the normality. The null and alternate
hypotheses of the test are as follows:

• Null hypothesis - Data is normally distributed.


• Alternate hypothesis - Data is not normally distributed.

• Since p-value < 0.05, the residuals are not normal as per Shapiro test.
• Strictly speaking - the residuals are not normal. However, as an approximation, we
might be willing to accept this distribution as close to being normal

Test for Homoscedasticity:


The null and alternate hypotheses of the Goldfeld-Quandt test are as follows:

• Null hypothesis: Residuals are homoscedastic


• Alternate hypothesis: Residuals have heteroscedasticity.

• Since p-value > 0.05 we can say that the residuals are homoscedastic.

Linear Regression Equation and comment on variables:

Equation 2 Linear Regression Equation (Approach-3)

Comments:

• From Equation 2, a unit increase in exec results in a 1.1214 unit decrease in usr, all other variables remaining constant.

• A unit increase in freemem results in a 7.4904 unit increase in usr, all other variables remaining constant.

• The usr for a Not_CPU_Bound runqsz is 0.3178 units lower than for CPU_Bound, all other variables remaining constant.

RMSE Score
Training: 3.965124845143364
Testing: 3.9354229764644773
MAE
Training: 3.0475993791394114
Testing: 3.0323486562705573

• We can see that RMSE on the train and test sets is comparable, so the model is not suffering from overfitting.
• MAE indicates that the current model predicts usr within a mean error of about 3.0 units on the test data.

Figure 35 Actual vs Predicted Approach-3

The above plot shows Actual vs Predicted values for 100 randomly chosen records: blue for actual, red for predicted. The deviation between the two is very small, indicating the model predicts the values well.

Key Observations

• In the univariate analysis of the raw data, all continuous variables except the target and freeswap displayed a right (positive) skew. The variable freeswap showed a trimodal distribution with three peaks; a detailed data check by the domain could explain such a pattern. The target "usr" came out left-skewed: the higher values form the peak, while the fewer extreme low values drag the tail to the left.

• In the multivariate analysis, a few independent variables were observed to have very high positive correlation among themselves. Such high correlation can lead to multicollinearity, and it was handled later while building the model.

• With respect to the target, all the variables except freeswap and freemem had a negative correlation. The two positive correlations with the target came out to 0.27 and 0.68, freeswap showing the stronger of the two.

• Since zeros dominated a few variables, the model building was done both with and without zeros, as there was no domain information on their validity. In all approaches the variable "pgscan" was dropped, since around 79% of that column alone is zeros.

• The first approach used only the variables with less than 50% zeros; NaN values were imputed using a KNN imputer. The initial model showed multicollinearity and unimportant features, which were eliminated step by step; the final model explained around 90% of the variance, and most assumptions were satisfied.

• Ridge and Lasso regularization models were built to check whether they could further improve the model. Performance and predictions changed little, indicating the first approach built the right model.

• The last approach treated the zeros as legitimate and built the model on all variables except pgscan. The same methods of initial model building, multicollinearity checks and p-value checks were employed. This explained 95% of the variance, better than the first approach; however, it has its own drawback in handling zeros in production, and the residual normality looked better in Approach-1. Both approaches provide good results, and choosing between them depends on domain input.

Problem-2

Executive Summary:
The Republic of Indonesia Ministry of Health has entrusted us with a dataset from a Contraceptive Prevalence Survey. It covers 1473 married women who were either not pregnant or unsure of their pregnancy status at the time of the survey.

Introduction:
The task is to predict whether these women opt for a contraceptive method, based on a comprehensive analysis of their demographic and socio-economic attributes.

Data shape:
The dataset has 1473 records and 10 variables: 4 categorical, 2 numerical, 3 binary, and a binary target.

Table 12 Shape

Statistical Summary:

Table 13 Summary

Univariate Analysis

Figure 36 Age of Wife

The box plot shows no outliers in the distribution, and the violin plot describes a slightly right-skewed distribution that does not require any transformation, as its appearance is not abnormal.

Figure 37 Wife and Husband Education

For the education status of husband and wife, both have their highest counts at the tertiary level and their lowest in the uneducated section. Wife and husband education follow a similar pattern, with the counts at each level appearing more or less the same.

Figure 38 No. of Children

Though the number of children is recorded as a numerical variable, it can also be visualized as a category. The left plot of Figure 38 is a categorical view, where counts for more than 5 children are comparatively low; the right violin plot indicates the presence of outliers, with the distribution skewed to the right.

Figure 39 Wife's Religion and Working status

Figure 39 shows pie charts of the wife's religion and working status. Scientology dominates the religion variable, accounting for about 85% of records, while non-Scientology covers only around 15%. Also, around 80% of the wives are not working, despite educated wives dominating over uneducated ones.

Figure 40 Husband Occupation and Standard of living

The left plot of Figure 40 shows the husband's occupation: category 3 (high) is the most frequent and category 4 the least. The right plot shows that 46.5% of families have a very high standard of living and only 8.8% a very low one.

Figure 41 Media Exposure and Contraceptive used

Figure 41 shows media exposure on the left and contraceptive use on the right. The Exposed rate is very high and dominant, to the extent that this variable is heavily imbalanced towards Exposed. The target variable, contraceptive used, is near-balanced across its classes, so no resampling of the target is needed.

Multivariate Analysis

Figure 42 Distribution of children vs Wife's Education

The wife’s education category appears to have same median across all category except
Uneducated indicating higher among the category. The tertiary education has outliers at
extreme ends, such outliers can be miscalculation or could be legit. A domain consultation
could help identify such cases if not it can be considered as miscalculated outlier and then
be capped.

Figure 43 Distribution of children with Wife's Education

The mean for the media Not-Exposed category is slightly higher than for Exposed, even though the Exposed category has extreme outliers. The Not-Exposed distribution also shows no outliers.

Figure 44 Children per Husband's Education

From Figure 44 it can be noted that, despite the secondary and tertiary categories having outliers, the median for uneducated husbands is higher than for the other categories. Uneducated husbands tend to have a greater number of children, and proper education on contraceptives could reduce this significantly.

Data Preprocessing
Missing Values

Table 14 Missing Values

It was noted that there were missing values in the Wife_age and No_of_children_born variables, 71 and 21 respectively.
Wife_age follows a roughly normal distribution, with mean and median about the same; either can therefore be used to impute its NaN values.
No_of_children_born was imputed with the mode, the most frequent value in that column.
Figure 45 Outlier Detection

There are outliers in the number-of-children variable. Logistic regression can be sensitive to outliers, and their presence can influence the estimation of coefficients and the performance of the model. A two-model approach was therefore considered, one with and another without outliers. The model without outliers gave better accuracy, precision and recall.
The outliers were treated by capping: any value above 5 children was capped at 5. The data was then split into train and test sets with a random state of 7 and a test size of 30%.
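
A minimal sketch of the capping and split, continuing with the assumed df; the cap at 5 and the 30% test size with random state 7 follow the report, while the target column name is a hypothetical placeholder:

from sklearn.model_selection import train_test_split

# Cap the outliers: any value above 5 children is set to 5.
df["No_of_children_born"] = df["No_of_children_born"].clip(upper=5)

X = df.drop(columns=["Contraceptive_method_used"])  # assumed target name
y = df["Contraceptive_method_used"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)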
Logistic Regression:

Model Score: 72%

Table 15 Classification report
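
A minimal sketch of the fit behind Table 15, reusing the split above; scikit-learn's LogisticRegression is assumed, with categorical predictors already numerically encoded:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_reg = LogisticRegression(max_iter=1000, random_state=7)
log_reg.fit(X_train, y_train)

print("Accuracy:", log_reg.score(X_test, y_test))  # ~0.72 in the report
print(classification_report(y_test, log_reg.predict(X_test)))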

Figure 46 Confusion Matrix

Observations

Precision is the ratio of true positive predictions to the total number of positive predictions
made by the model.
For class 0: Precision = 0.73
For class 1: Precision = 0.72
Interpretation: Out of all instances predicted as positive, 73% (class 0) and 72% (class 1)
were actually positive.
Recall is the ratio of true positive predictions to the total number of actual positive instances
in the dataset.
For class 0: Recall = 0.56
For class 1: Recall = 0.85
Interpretation: Out of all actual positive instances, the model captured 56% (class 0) and
85% (class 1).
F1-score is the harmonic mean of precision and recall, providing a balance between the two
metrics.
For class 0: F1-Score = 0.63
For class 1: F1-Score = 0.78
Interpretation: The F1-score considers both precision and recall, providing a single metric
that balances the trade-off between false positives and false negatives.

Figure 47 ROC AUC

The AUC score for the logistic regression is 76%, which is reasonably good given that we have only 1473 records. A dataset with more records would eventually improve the model's accuracy and predictions.
Linear Discriminant Analysis

Figure 48 Training and Testing Confusion Matrix

Figure 48 shows the training and testing confusion matrices. The class of interest, 1 (yes), is predicted in similar proportions on both sets, which suggests the model is stable. However, the model accuracy may not be satisfying given the number of records available.
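
A minimal sketch of the LDA fit behind Figure 48, reusing the same split; scikit-learn's LinearDiscriminantAnalysis is assumed:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Compare train vs test confusion matrices to judge stability.
print(confusion_matrix(y_train, lda.predict(X_train)))
print(confusion_matrix(y_test, lda.predict(X_test)))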

Table 16 LDA Classification report

Observations

Precision is the ratio of true positive predictions to the total number of positive predictions
made by the model.
For class 0: Precision = 0.72
For class 1: Precision = 0.70
Interpretation: Out of all instances predicted as positive, 72% (class 0) and 70% (class 1)
were actually positive.
Recall is the ratio of true positive predictions to the total number of actual positive instances
in the dataset.
For class 0: Recall = 0.52
For class 1: Recall = 0.85
Interpretation: Out of all actual positive instances, the model captured 52% (class 0) and
85% (class 1).
F1-score is the harmonic mean of precision and recall, providing a balance between the two
metrics.
For class 0: F1-Score = 0.59
For class 1: F1-Score = 0.75
Interpretation: The F1-score considers both precision and recall, providing a single metric
that balances the trade-off between false positives and false negatives.
The Model Accuracy comes out to be 69%

Figure 49 AUC Curve LDA Model

The curves for training and testing look similar, so the model is quite stable, though accuracy decreased compared to logistic regression. More emphasis on the data could bring even more stability, a higher AUC-ROC curve and better model accuracy.

LDF = 0.36725944 + X1 * (-0.81) + X2 * (0.56) + X3 * (-0.01) + X4 * (0.99)
      + X5 * (-0.1) + X6 * (-0.02) + X7 * (0.02) + X8 * (0.29) + X9 * (0.13)

Equation 3 Linear Discriminant Function

From the above equation and coefficients it is clear that

• The predictor 'Number of children born' has the largest magnitude and thus helps most in classifying.
• The predictor 'Wife age' has the smallest magnitude and thus helps least in classifying.

CART Model

Independent and dependent variables were split into X and y respectively. The model was initially fit with a Decision Tree Classifier using gini as the criterion for splitting. The important features extracted are shown below.

Table 17 Important CART features

For this basic model, Wife age came out as the most important feature and Media exposure as the least important.
Decision tree model scores:
Training: 98%
Testing: 62%
The training and testing scores show that the model is overfitting, a result of high variance. To overcome this, pruning was employed by hyper-tuning the parameters using Grid Search CV.
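
A minimal sketch of this pruning search; the parameter grid is illustrative, since the report only shows a screenshot of the best estimators:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [5, 10, 20],
    "min_samples_split": [10, 30, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=7),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
pruned_tree = search.best_estimator_  # refitted on the full training set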

These best estimators were then used to rebuild the model.
Decision tree model scores (Hyper tuning the parameters):
Training: 80%
Testing: 65%
Though the training score is reduced, the testing score increased only slightly (from 62% to 65%). The model is still overfitting, which shows that decision trees are prone to overfitting.

Table 18 CART Classification matrix

The classification matrix for the testing data of the pruned CART model still appears unsatisfactory, as the results are not up to the mark. A different algorithm such as Random Forest could be a better alternative.

Figure 50 AUC ROC CART (Pruned)

The training AUC is better than the testing AUC, once again indicating overfitting: the AUC score is 86% for training and 70% for testing. More data and more advanced algorithms such as Random Forest or XGBoost could make the model better.

Business Insights and Recommendations


Importance of feature based on best model
• In the Linear Discriminant Analysis, the variable Number of children born has the largest coefficient magnitude and thus contributes most to the classification.

• LDA ranks features by coefficient magnitude: the higher the magnitude, the more important the feature; the lower the magnitude, the less important.

• In LDA, Wife age had the smallest magnitude, indicating the least importance in the model building.

• The CART model's feature importances put Wife age at the top, surprisingly the opposite of what the LDA model suggested.

• The least important CART feature came out to be Media exposure. Given how a decision tree splits, a heavily imbalanced variable like this contributes little unless it carries crucial information.

Actionable Insights and recommendations

• Univariate analysis was performed to study the patterns in each variable. The wife's age is almost normally distributed with no outliers. Education counts for husband and wife follow a similar pattern, tertiary being the highest and uneducated the least.

• Multivariate analysis was performed to check the distributions of the continuous variables across the categories. The missing values were identified and treated with appropriate imputation techniques.

• A two-fold approach was considered: the first treated the outliers in Number of children born as legitimate, the second treated them as exaggerated values and capped them to the nearest legitimate value. When models were built for both approaches, the second provided better results and was chosen.

• Logistic regression was the first algorithm considered. Its results were satisfying given the small number of records; model accuracy, precision, recall and F1-score were calculated.

• Similarly, Linear Discriminant Analysis and CART (decision tree) models were built to compare their feature importances and accuracy. LDA separated the target as intended, and its accuracy was considerable.

• The decision tree model was first built without tuning any parameters; the result was overfitting, caused by high variance. Another model was built by pruning the tree and hyper-tuning the parameters, using Grid Search CV as the multi-fold approach. This reduced the overfitting slightly but not significantly, further suggesting that a decision tree may not be the right option for the data provided.

• Logistic regression and LDA provided better results than the decision tree; either could provide the desired results. The results could improve substantially with more data.
