Predictive Modelling
Rohit Nagarahalli
Problem 1................................................................................................................................... 5
Executive Summary: ............................................................................................................... 5
Introduction: .......................................................................................................................... 5
Data shape: ........................................................................................................................ 5
Data types: ......................................................................................................................... 5
Statistical Summary:........................................................................................................... 6
Univariate Analysis: ............................................................................................................ 6
Multivariate Analysis: ....................................................................................................... 10
Key Meaningful Observations: ............................................................................................. 12
Data Preprocessing: ............................................................................................................. 12
Model Building ..................................................................................................................... 14
Approach-1....................................................................................................................... 14
Assumptions:.................................................................................................................... 18
Linear Regression Equation: ............................................................................................. 20
Observations for Approach-1 ............................................................................................... 21
Approach-2 Regularization Methods ............................................................................... 22
Approach-3....................................................................................................................... 25
Linear Regression Equation and comment on variables: ................................................. 28
Key Observations ................................................................................................................. 30
Problem-2................................................................................................................................. 31
Executive Summary: ............................................................................................................. 31
Introduction: ........................................................................................................................ 31
Data shape: ...................................................................................................................... 31
Univariate Analysis ........................................................................................................... 32
Multivariate Analysis ........................................................................................................ 35
Data Preprocessing .............................................................................................................. 36
Missing Values .................................................................................................................. 36
Logistic Regression: ......................................................................................................... 37
Observations .................................................................................................................... 38
Linear Discriminant Analysis ........................................................................................... 39
Observations .................................................................................................................... 40
CART Model...................................................................................................................... 42
Business Insights and Recommendations ............................................................................ 43
Importance of feature based on best model ................................................................... 43
Actionable Insights and recommendations ..................................................................... 44
Figure 39 Wife's Religion and Working status ----------------------------------------------------------- 33
Figure 40 Husband Occupation and Standard of living ------------------------------------------------ 34
Figure 41 Media Exposure and Contraceptive used ---------------------------------------------------- 34
Figure 42 Distribution of children vs Wife's Education ------------------------------------------------ 35
Figure 43 Distribution of children with Wife's Education --------------------------------------------- 35
Figure 44 Children per Husband's Education ------------------------------------------------------------ 36
Figure 45 Outlier Detection ---------------------------------------------------------------------------------- 37
Figure 46 Confusion Matrix ---------------------------------------------------------------------------------- 38
Figure 47 ROC AUC -------------------------------------------------------------------------------------------- 39
Figure 48 Training and Testing Confusion Matrix ------------------------------------------------------- 39
Figure 49 AUC Curve LDA Model --------------------------------------------------------------------------- 41
Figure 50 AUC ROC CART (Pruned)------------------------------------------------------------------------- 43
Problem 1
Executive Summary:
The comp-activ database comprises activity measures of computer systems. Data was
gathered from a Sun Sparcstation 20/712 with 128 Mbytes of memory, operating in a multi-
user university department. Users engaged in diverse tasks, such as internet access, file
editing, and CPU-intensive programs.
Introduction:
The aim is to establish a linear equation for predicting 'usr' (the percentage of time CPUs
operate in user mode). Also, to analyse various system attributes to understand their
influence on the system's 'usr' mode.
Data shape:
The dataset has 8192 records and 22 variables. However, the variables 'rchar' and 'wchar' have only 8088 and 8177 non-missing records respectively; the remaining records are missing values, which are addressed later.
Data types:
#   Column    Non-Null Count    Dtype
0   lread     8192 non-null     int64
1   lwrite    8192 non-null     int64
2   scall     8192 non-null     int64
3   sread     8192 non-null     int64
4   swrite    8192 non-null     int64
5   fork      8192 non-null     float64
6   exec      8192 non-null     float64
7   rchar     8088 non-null     float64
8   wchar     8177 non-null     float64
9   pgout     8192 non-null     float64
10  ppgout    8192 non-null     float64
11  pgfree    8192 non-null     float64
12  pgscan    8192 non-null     float64
13  atch      8192 non-null     float64
14  pgin      8192 non-null     float64
15  ppgin     8192 non-null     float64
16  pflt      8192 non-null     float64
17  vflt      8192 non-null     float64
18  runqsz    8192 non-null     object
19  freemem   8192 non-null     int64
20  freeswap  8192 non-null     int64
21  usr       8192 non-null     int64
Table 1 Data Types
From Table 1 it can be noted that 21 of the variables are numerical (int64 or float64). The variable 'runqsz' is an object that actually holds a category with two levels, CPU_Bound and Not_CPU_Bound; it will later be converted to a numerical category (0 or 1).
Statistical Summary:
It can be inferred from the statistical summary that the statistics of most variables look abnormal, and this is because of the presence of zeros in most of the columns. The importance of the zeros, and whether or not they need to be imputed, will be evaluated approach by approach. Since there is no domain information on the zeros, models need to be built considering both scenarios.
Univariate Analysis:
There are a total of 21 numerical, continuous columns. A violin plot does justice to the analysis since it describes not only the distribution but also the outlying points, the skewness and the normality.
Figures 1 to 21: univariate violin plots of the 21 continuous variables (lread, lwrite, scall, sread, swrite, fork, exec, rchar, wchar, pgout, ppgout, pgfree, pgscan, atch, pgin, ppgin, pflt, vflt, freemem, freeswap, usr).
The plots in Figures 1 to 21 represent the univariate analysis of the continuous variables. It can be noted that the variables in Figures 1 to 19 all follow a pattern: they are skewed to the right. Figure 20, representing freeswap, appears to have a trimodal distribution, as there are three peaks in the plot. The target variable 'usr' is left-skewed, with a very small peak also observable on its left where the smaller values lie. For all the right-skewed variables, a log transformation could be an ideal solution, as sketched below; however, models with and without the transformation will be built ahead to compare their performance.
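Below is a minimal sketch of this transformation step, assuming the data has been loaded into a pandas DataFrame named df from a hypothetical file; the column list shown is illustrative, not exhaustive. np.log1p is used because it is defined at zero, unlike a plain log.

import numpy as np
import pandas as pd

df = pd.read_csv("compactiv.csv")          # hypothetical file name
# log1p(x) = log(1 + x): defined at x = 0 and compresses long right tails
right_skewed = ["lread", "lwrite", "scall", "sread", "swrite", "fork",
                "exec", "rchar", "wchar", "pgin", "ppgin", "pflt", "vflt"]
df_log = df.copy()
df_log[right_skewed] = np.log1p(df_log[right_skewed])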
The plot for the variable "runqsz" indicates a balance between the two values (CPU_Bound and Not_CPU_Bound), further indicating an absence of imbalance for this particular variable.
Multivariate Analysis:
Figure 22 represents the distribution of the variable runqsz over the target. It can be noted that, although there is a slight difference in the mean, the distributions do not vary hugely. CPU_Bound has a few outliers at the very low end. This also implies that the only categorical variable present in the data does not provide a decisive insight, at least with respect to the target.
It can be inferred from Figure 23 that ample independent variables are correlated with each other. For example, taking the absolute value of the correlations and a threshold of at least 0.5, a total of 15 pairs of variables have correlations above 0.5 (50%), as sketched below. Another interesting point with respect to the target variable is that all the variables except runqsz (categorical), freemem and freeswap are negatively correlated with it, and freeswap is said to have a strong positive correlation with the target (usr).
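A minimal sketch of how such pairs can be listed is given below, reusing the DataFrame df assumed in the earlier sketch and the 0.5 threshold mentioned above.

import numpy as np

corr = df.select_dtypes(include="number").corr()
# keep the upper triangle only, so each pair of variables is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()                      # MultiIndex (var1, var2) -> r
print(strong_pairs[strong_pairs.abs() >= 0.5].sort_values())
print(corr["usr"].sort_values())                  # each variable's correlation with the target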
Key Meaningful Observations:
• From the univariate analysis of the raw data, it can be noted that, except for the variables freeswap and usr, all variables are skewed to the right. Also, freeswap is observed to have a trimodal distribution.
• The statistical summary came out uneven; this is a direct result of the large number of zeros in the variables. It also indicated the presence of outliers.
• The target, although it appears to have a small second mode at the left end of its tail, is a left-tailed distribution. The correlations of the independent variables with the target indicate that all variables are negatively correlated with the target except freemem and freeswap. Of the two positive correlations, freeswap has the strong positive correlation with the target.
• Since there are a large number of zeros in the data, they can turn out to be a problem, especially in linear regression. Since the domain has not confirmed whether the presence of zeros is legitimate, we will build multiple models, one without and another with zeros.
Data Preprocessing:
It was noted that there were 119 missing values (NaN) in the dataset, of which 104 were in rchar and 15 in wchar. These missing values were handled using a KNN imputer whenever a model was built. Since the model-building approaches were bifurcated into with-zeros and without-zeros models, the missing values were treated along with the zeros whenever the model treated zeros as spurious values of little importance; otherwise, only the initial 119 missing values were handled and the zeros were treated as legitimate values.
Similar to the presence of zeros, there were also outliers in the data. Although the values looked like outliers, on investigation they turned out to be legitimate, and building the model with these legitimate outliers did not harm the model.
Although outliers can introduce slightly more error than a non-outlier model would, it is still best to proceed with the outliers if the outlying values turn out to be legitimate.
Figure 25 Count of Zeros
The above plot represents the count of zeros in the raw data. We need to keep in mind that the data is hugely affected by the presence of a large number of zeros; zero-inflated datasets ruin a regression model. It is also evident that the data is skewed, as seen in the univariate analysis. Multiple approaches to building the model could provide a solution, since a single ad-hoc approach might not be the right one.
It was observed that the variable "pgscan" consists of 79% zeros, so it should cause no harm if it is dropped from the dataset; such a large number of zeros makes the variable unsuitable for model building.
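A minimal sketch of how the zero percentages can be computed is shown below, again assuming the raw data is held in the DataFrame df.

# share of zeros per column, expressed as a percentage
zero_pct = df.eq(0).mean().mul(100).sort_values(ascending=False)
print(zero_pct.round(1))                          # pgscan is expected to show roughly 79% zeros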
Multiple approaches to building the model have been implemented: building the model with zeros, without zeros, with a log transformation in the presence of zeros, and penalising the variable coefficients using regularization methods.
The dependent variable "usr" represents the percentage of time the CPUs operate in user mode. A value of zero in this variable suggests that either the CPU is not operating or the value was miscalculated. Since this could affect model building, and the quantity is small, we will drop the records where usr is zero, one or two, since these are very few in count and look implausible given the problem statement.
Model Building
Approach-1
In this approach we will drop the variables that have more than 50% zeros in their column, convert the remaining zeros into NaN and impute them using a KNN imputer.
Table 3 Percentage of 0
From Table 3, we will be dropping the variables pgout, ppgout, pgfree and atch from the dataset and proceeding further to model building.
NOTE: The variables being dropped here because of the presence of zeros will be kept as they are in the other approaches. Since there is no domain information on the importance of these features or of the zeros associated with them, we will also build models under the assumption that the features dropped here might contain crucial information and are not suitable to drop.
Further, the zeros in the features having less than 50% zeros are converted to NaN and imputed using KNN imputation; a short sketch of this step follows the points below.
Why KNN Imputation?
• KNN imputation considers relationships between variables. It takes into account the
similarity between data points and imputes missing values based on the values of the
nearest neighbours.
• KNN imputation is not limited by assumptions of linearity. It can capture non-linear
relationships between variables, making it suitable for datasets where the
relationships are more complex and cannot be adequately represented by a linear
model.
• In datasets where variables are interrelated, KNN imputation can be more effective.
It considers the overall patterns in the data and can provide better estimates for
missing values when variables interact with each other.
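A minimal sketch of this imputation step is given below, assuming df_a1 holds the retained Approach-1 columns with their zeros already converted to NaN; the frame name and the neighbour count are assumptions.

from sklearn.impute import KNNImputer

num_cols = df_a1.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=5)               # impute from the 5 most similar rows
df_a1[num_cols] = imputer.fit_transform(df_a1[num_cols])
assert not df_a1[num_cols].isna().any().any()     # sanity check: no missing values remain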
Figure 26 Correlations post data cleaning
The data was split into X and y for the independent and dependent variables respectively, and further split into train and test sets with a test size of 30% and a random state of 7 to obtain the model shown in Table 4. The random state 7 is used for all the models; a sketch of this step is shown below.
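A minimal sketch of the split and the initial statsmodels OLS fit under these settings is given below; the frame name df_a1 and the 0/1 encoding of runqsz are assumptions carried over from the preprocessing step.

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df_a1.drop(columns="usr")                     # runqsz assumed already encoded as 0/1
y = df_a1["usr"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())                              # R-squared, coefficients, p-values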
From Table 4, the initial model (assumptions unchecked) yields a few key observations. The R-squared and adjusted R-squared are 0.912 and 0.910 respectively. This can be termed a good result, as the model explains about 91% of the variance. However, the model is yet to be evaluated against its assumptions, and we can already see the presence of multicollinearity.
Since there is strong multicollinearity, it was addressed using the VIF (Variance Inflation Factor). The VIF was calculated for the initial model, the variables with a VIF larger than 10 were noted and eliminated one by one, and the R-squared and adjusted R-squared were checked after each elimination to note any changes; a sketch of the VIF computation follows.
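A minimal sketch of the VIF computation used in this elimination loop is shown below; the helper name is an assumption, and X_train is the training frame from the sketch above.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """VIF for every predictor; values above the chosen threshold flag multicollinearity."""
    Xc = sm.add_constant(X)
    vifs = [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

print(vif_table(X_train))          # drop the worst offender, refit the OLS model, repeat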
Noting the R-squared and adjusted R-squared, dropping the variable "fork", which was contributing to multicollinearity, would bring little or no decrease in model efficiency. The R-squared and adjusted R-squared came out to be 0.911 and 0.911 respectively after eliminating "fork".
Since strong multicollinearity remained after eliminating "fork", the VIF was checked again and variables were dropped one by one to eliminate it. Columns such as sread, pflt and pgin were dropped from the model one by one, and a model with little or no multicollinearity was built.
Multicollinearity was thus resolved using VIF. The threshold was set to 5, and all the remaining variables now have a VIF below 5. Table 5 represents the model after the treatment of multicollinearity, with an R-squared and adjusted R-squared of 0.906 each. However, this cannot be considered the final model, as the variable lwrite has a p-value > 0.05.
The p-value tests the importance of a feature, i.e. whether or not the feature's coefficient is zero.
From the OLS table above it can be noted that the p-value of "lwrite" is about 0.142; since this is above the 0.05 threshold, we cannot reject the hypothesis that its coefficient is zero, making "lwrite" the least important variable in the group.
Another, final model with little or no multicollinearity and with all variables significant was built, as shown below.
The R-squared and adjusted R-squared remain unaltered after dropping lwrite, as is clearly visible in Table 6, which contains the final OLS model. In this OLS model the multicollinearity has been eliminated and feature importance has been taken into account. The final R-squared and adjusted R-squared remain 0.906, explaining about 90.6% of the variance. A few more assumptions need to be satisfied before we can rely on the results the model will predict.
Assumptions:
• In a plot of fitted values vs residuals, if the residuals don't follow any pattern (the curve is a straight line), we say the model is linear; otherwise the model shows signs of non-linearity.
The above plot (Figure 27) of fitted values vs residuals doesn't follow any pattern, so we can say the model is almost linear. Achieving this can be challenging, as perfect linearity can sometimes be impossible; however, a near-perfect result like the above plot is achievable, as the residuals appear randomly distributed.
• If the error terms are not normally distributed, confidence intervals may become too wide or too narrow. Once the confidence intervals become unstable, it becomes difficult to estimate coefficients based on the minimization of least squares.
Figure 28 Normality of residuals
The above visual representation shows that the errors are approximately normally distributed, although it also suggests a slight skew to the left. Since the model was built without outlier treatment, the outliers being considered legitimate values, the slight skewness could be a result of that. However, we will also try solving this with the other approaches.
The QQ plot of residuals can be used to visually check the normality assumption: the normal probability plot of the residuals should approximately follow a straight line.
Most of the points lie on the straight line in the QQ plot. There can be a few exceptions; as suggested earlier, getting a fully perfect model is highly challenging, especially without domain intervention. However, the above QQ plot satisfies the need.
The Shapiro-Wilk test can also be used to check normality. The null and alternate hypotheses of the test are as follows:
• Since the p-value < 0.05, the residuals are not normal as per the Shapiro test.
• Strictly speaking, the residuals are not normal. However, as an approximation, we might be willing to accept this distribution as close to normal.
• For the separate homoscedasticity test, since the p-value > 0.05 we can say that the residuals are homoscedastic. A sketch of both checks follows.
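A minimal sketch of both checks is given below. The Shapiro-Wilk call is standard; the Goldfeld-Quandt test is shown as one common choice for the homoscedasticity check, since the exact test used is not named in the report.

from scipy import stats
import statsmodels.api as sm
import statsmodels.stats.api as sms

resid = ols.resid                                           # residuals of the fitted OLS model
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)

# H0 of Goldfeld-Quandt: the residual variance is constant (homoscedastic)
_, gq_pvalue, _ = sms.het_goldfeldquandt(resid, sm.add_constant(X_train))
print("Goldfeld-Quandt p-value:", gq_pvalue)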
RMSE Score:
Training Score: 2.7203264821013207
Test Score: 2.780337238168565
• We can see that RMSE on the train and test sets are comparable. So, our model is
not suffering from overfitting.
• Hence, we can conclude the model "OLS" is good for prediction as well as inference
purposes.
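A minimal sketch of how these train and test RMSE values can be computed is shown below, reusing the (assumed) names from the earlier sketches; X_train and X_test must contain exactly the columns the final model was fit on.

import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

pred_train = ols.predict(sm.add_constant(X_train))
pred_test = ols.predict(sm.add_constant(X_test))
print("Train RMSE:", np.sqrt(mean_squared_error(y_train, pred_train)))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred_test)))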
Figure 30 Actual vs Predicted
The plot represents the Actual vs Predicted values for 100 randomly chosen records. Blue represents the actual records and red the predicted records. We can see very minimal deviation between the two, indicating that the model is doing a good job of predicting the values.
• A unit increase in exec will result in a 0.2687 unit decrease in usr, all other variables remaining constant.
• The usr of a Not_CPU_Bound runqsz will be 0.3178 units lower than that of a CPU_Bound runqsz, all other variables remaining constant.
• As we move forward, we will build a few more models on the data considering zeros as legitimate values, using linear regression and regularization.
Approach-2 Regularization Methods
Linear Regression using sklearn.
Table 7 Coefficient
Ridge and Lasso regularization involve adding a penalty term to the linear regression cost
function. These penalty terms are based on the magnitude of the coefficients, and they can
be sensitive to the scale of the input features. Therefore, scaling is recommended to ensure
that all features contribute equally to the regularization term.
In this approach let us consider the zeros to be valid in our data. Since regularization methods perform a degree of feature selection, the variables dominated by zeros may be penalised via the lambda (alpha) parameter.
All the features except pgscan shall be considered for this approach. Hence, 21 features are what we will proceed with for model building; a sketch of the scaled, cross-validated setup follows.
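A minimal sketch of the scaled, cross-validated setup under these choices is given below; the alpha grid and variable names are assumptions, and the ridge variant is obtained by swapping the estimator in the pipeline.

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scale first so the penalty treats every feature equally
pipe = Pipeline([("scale", StandardScaler()),
                 ("lasso", Lasso(max_iter=10000))])
grid = GridSearchCV(pipe,
                    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X_train, y_train)                        # X_train here is the 21-feature version
print("Best alpha:", grid.best_params_["lasso__alpha"])   # 0.01 was reported for lasso
print("Test RMSE:", -grid.score(X_test, y_test))
# Ridge() with a "ridge__alpha" grid gives the ridge model in the same way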
Ridge Regularization:
The above plot represents the Actual vs Predicted values for 100 randomly chosen records for the ridge model. Green represents the actual records and magenta the predicted records. We can see very minimal deviation between the two, indicating that the model is doing a good job of predicting the values.
Lasso Regularization:
Best alpha found: 0.01
Method employed: Grid Search CV (To Identify best alpha).
The above plot represents the Actual vs Predicted values for 100 randomly chosen records for the lasso model. Green represents the actual records and magenta the predicted records. We can see very minimal deviation between the two, indicating that the model is doing a good job of predicting the values.
Approach-3
The EDA (univariate analysis) clearly shows skewness in the data, with almost all variables skewed to the right. Another approach to building the model is therefore a log transformation, this time not excluding the columns with a significant number of zeros, except "pgscan".
The reason for including the log transformation is that most of the variables were right (positively) skewed.
Initial Model
The R-squared and adjusted R-squared come out to be 0.955 each. Since there is strong multicollinearity, we again resolve it using VIF.
Table 10 VIF values
From Table 10, it can be noted that many columns have high VIF values. The elimination of high-VIF features was done in a phased manner: the feature with the highest VIF was dropped and the R-squared and adjusted R-squared were noted. A drastic change in adjusted R-squared would indicate that the feature should not have been dropped. This process continued until the high-VIF features were eliminated, while keeping an eye on feature importance and on any dip in the variance explained by the model.
NOTE: The final model obtained for Approach-3 (Table 11) went through a series of steps of eliminating high-VIF features one by one, checking the R-squared, and further dropping features with p-values > 0.05. Only the screenshot of the final model is included, to reduce the complexity of the report.
Test for Normality (Approach-3):
The above visual representation shows that the errors are approximately normally distributed, although it also suggests a slight skew to the left. Since the model was built without outlier treatment, the outliers being considered legitimate values, the slight skewness could be a result of that.
The QQ plot of residuals can be used to visually check the normality assumption: the normal probability plot of the residuals should approximately follow a straight line.
From Figure 34, it can be inferred that most points lie on the straight line, with roughly equal numbers of points off the line, some close to it and some a little farther away. However, the QQ plot for Approach-1 looked better than the Approach-3 QQ plot.
The Shapiro-Wilk test can also be used to check normality. The null and alternate hypotheses of the test are as follows:
• Since the p-value < 0.05, the residuals are not normal as per the Shapiro test.
• Strictly speaking, the residuals are not normal. However, as an approximation, we might be willing to accept this distribution as close to normal.
• For the separate homoscedasticity test, since the p-value > 0.05 we can say that the residuals are homoscedastic.
Comments:
• From Equation 2, a unit increase in exec will result in a 1.1214 unit decrease in usr, all other variables remaining constant.
• A unit increase in freemem will result in a 7.4904 unit increase in usr, all other variables remaining constant.
• The usr of a Not_CPU_Bound runqsz will be 0.3178 units lower than that of a CPU_Bound runqsz, all other variables remaining constant.
RMSE Score
Training: 3.965124845143364
Testing: 3.9354229764644773
MAE
Training: 3.0475993791394114
Testing: 3.0323486562705573
• We can see that the RMSE values on the train and test sets are comparable, so our model is not suffering from overfitting.
• The MAE indicates that our current model is able to predict usr within a mean error of about 3.0 units on the test data.
The above plot represents the Actual vs Predicted values for 100 randomly chosen records. Blue represents the actual records and red the predicted records. We can see very minimal deviation between the two, indicating that the model is doing a good job of predicting the values.
Key Observations
• With respect to the target, all the variables except freeswap and freemem had a
negative correlation. However, freeswap exhibited a correlation value of 0.27 and
freemem exhibited a correlation value of 0.68 with the target.
• Since the data was observed to have zeros, which dominated a few variables, model building was done both with and without zeros. This method was employed since there was no domain information on the matter. However, for all approaches the variable "pgscan" was dropped (not considered) because around 79% of that column alone consists of zeros.
• The first approach to model building used the variables that had less than 50% zeros. The NaN values were imputed using the KNN imputer. The initial model was observed to have multicollinearity and less important features, which were eliminated step by step, and the final model was estimated to explain around 90% of the variance. Most of the assumptions were satisfied.
• The last approach was to consider the zeros as legitimate and build a model on that basis. All the variables except pgscan were considered, and the same methods, initial model building, multicollinearity check and p-value check, were employed. This model explained 95% of the variance, better than the first approach. However, it has its own drawback when it comes to handling zeros in production, and the normality for Approach-1 looked better. Both approaches provided good results, and the choice between them depends on the domain.
Problem-2
Executive Summary:
The Republic of Indonesia Ministry of Health has entrusted us with a dataset containing information from a Contraceptive Prevalence Survey. This dataset encompasses data from 1473 married women who were either not pregnant or were uncertain of their pregnancy status at the time of the survey.
Introduction:
The expectation is to predict whether these women opt for a contraceptive method. This prediction will be based on a comprehensive analysis of their demographic and socio-economic attributes.
Data shape:
The dataset has 1473 records and 10 variables: 4 categorical variables, 2 numerical, 3 binary, and a binary target.
Table 12 Shape
Statistical Summary:
Table 13 Summary
Univariate Analysis
The box plot shows no outliers in the distribution, and the violin plot describes a slightly right-skewed distribution that does not require any transformation, as its shape is not abnormal.
For the education status of husband and wife, both tend to have the highest count at the tertiary education level, with the uneducated section being the smallest. The wife's and husband's education follow a similar pattern, as the counts at each level appear more or less the same.
Figure 38 No. of Children
Though the number of children is treated as a numerical variable, it can also be visualized as a category. The left plot of Figure 38 is a categorical view of the variable, where counts of more than 5 children are comparatively low; the right violin plot indicates the presence of outliers, as the distribution is skewed to the right.
Figure 39 shows pie charts of two variables, the wife's religion and her working status. Scientology dominates the wife's religion, making up about 85% of the records, whereas non-Scientology accounts for only around 15%. Also, around 80% of the wives are not working, despite their education levels dominating over the uneducated category.
Figure 40 Husband Occupation and Standard of living
The left plot of Figure 40 represents the husband's occupation. Occupation category 3, which represents a high occupation level, is the most frequent, whereas category 4 is the least frequent. The right plot of Figure 40 shows that 46.5% of the families have a very high standard of living and only 8.8% have a very low one.
The left side of Figure 41 shows media exposure and the right side shows contraceptive use. It can be noted that the "Exposed" rate is very high and dominates to an extent that the variable can be considered heavily imbalanced towards "Exposed". The target (dependent) variable, contraceptive used, does not share this concern, as its classes are, if not perfectly, at least nearly evenly balanced.
Multivariate Analysis
The wife's education categories appear to have the same median across all levels except "uneducated", which is higher than the rest. The tertiary education level has outliers at the extreme ends; such outliers could be miscalculations or could be legitimate. A domain consultation could help identify such cases; otherwise they can be treated as miscalculated outliers and capped.
The mean for the media not-exposed category is slightly higher than that of the exposed category, despite the exposed category having extreme outliers. The not-exposed category also shows a distribution without any outliers.
Figure 44 Children per Husband's Education
From Figure 44 it can be noted that, despite the secondary and tertiary categories having outliers, the median for the uneducated category is higher than the others. Uneducated husbands tend to have a greater number of children, and proper education on contraceptives could reduce this significantly.
Data Preprocessing
Missing Values
It was noted that there were missing values in the Wife_age and No_of_children_born variables, with 71 and 21 missing respectively.
The variable Wife_age follows a roughly normal distribution, and its mean and median appear to be the same; this indicates that either the mean or the median can be used to impute the NaN values.
The number of children born was imputed with the mode, the most frequent value in that column; a sketch of this step follows.
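A minimal sketch of this imputation is given below, assuming the survey data sits in a DataFrame named cmc and that the column names match those used in the report; the file name is hypothetical.

import pandas as pd

cmc = pd.read_csv("contraceptive.csv")            # hypothetical file name
cmc["Wife_age"] = cmc["Wife_age"].fillna(cmc["Wife_age"].median())
cmc["No_of_children_born"] = cmc["No_of_children_born"].fillna(
    cmc["No_of_children_born"].mode()[0])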
Figure 45 Outlier Detection
There are outliers in the number-of-children variable. Logistic regression can be sensitive to outliers, and their presence can influence the estimation of coefficients and impact the performance of the model. A two-model approach was therefore considered, one with and another without outliers; the model without outliers proved to have better accuracy, precision and recall.
The outliers were treated by capping: any value above 5 children was capped at 5. The data was then split into train and test sets with a random state of 7, the test sample being 30%; a sketch of these steps follows.
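A minimal sketch of the capping and the split under these settings is given below; the target column name is an assumption, and the categorical features are assumed to be numerically encoded before modelling.

from sklearn.model_selection import train_test_split

# cap the number of children at 5, as described above
cmc["No_of_children_born"] = cmc["No_of_children_born"].clip(upper=5)

X = cmc.drop(columns="Contraceptive_method_used")  # assumed target column name
y = cmc["Contraceptive_method_used"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)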
Logistic Regression:
Figure 46 Confusion Matrix
Observations
Precision is the ratio of true positive predictions to the total number of positive predictions
made by the model.
For class 0: Precision = 0.73
For class 1: Precision = 0.72
Interpretation: Out of all instances predicted as positive, 73% (class 0) and 72% (class 1)
were actually positive.
Recall is the ratio of true positive predictions to the total number of actual positive instances
in the dataset.
For class 0: Recall = 0.56
For class 1: Recall = 0.85
Interpretation: Out of all actual positive instances, the model captured 56% (class 0) and
85% (class 1).
F1-score is the harmonic mean of precision and recall, providing a balance between the two
metrics.
For class 0: F1-Score = 0.63
For class 1: F1-Score = 0.78
Interpretation: The F1-score considers both precision and recall, providing a single metric
that balances the trade-off between false positives and false negatives.
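A minimal sketch of the fit and of how the precision, recall, F1 and AUC figures above can be produced is given below; the solver defaults and the encoded feature set are assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print(classification_report(y_test, logreg.predict(X_test)))   # precision, recall, F1 per class
print("Test AUC:", roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))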
Figure 47 ROC AUC
The AUC score for the logistic regression model is 76%, which appears reasonably good given that we have only 1473 records, which is a small sample. A dataset with more records would eventually be reflected in better model accuracy and predictions.
Linear Discriminant Analysis
Figure 48 shows the training and testing confusion matrices. The prediction of 1 (yes), which is the class of interest, is predicted by the model in the same proportion in both, which could indicate the stability of the model. However, the model accuracy might not be satisfying considering the number of records we are dealing with; a sketch of the LDA fit follows.
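A minimal sketch of the LDA fit and of the coefficient magnitudes the report uses to rank features is shown below, reusing the (assumed) names from the earlier sketches.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Train accuracy:", lda.score(X_train, y_train))
print("Test accuracy:", lda.score(X_test, y_test))              # about 69% reported above
# larger absolute coefficient -> larger contribution to the discriminant
print(pd.Series(lda.coef_[0], index=X_train.columns).abs().sort_values(ascending=False))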
Table 16 LDA Classification report
Observations
Precision is the ratio of true positive predictions to the total number of positive predictions
made by the model.
For class 0: Precision = 0.72
For class 1: Precision = 0.70
Interpretation: Out of all instances predicted as positive, 72% (class 0) and 70% (class 1)
were actually positive.
Recall is the ratio of true positive predictions to the total number of actual positive instances
in the dataset.
For class 0: Recall = 0.52
For class 1: Recall = 0.85
Interpretation: Out of all actual positive instances, the model captured 52% (class 0) and
85% (class 1).
F1-score is the harmonic mean of precision and recall, providing a balance between the two
metrics.
For class 0: F1-Score = 0.59
For class 1: F1-Score = 0.75
Interpretation: The F1-score considers both precision and recall, providing a single metric
that balances the trade-off between false positives and false negatives.
The Model Accuracy comes out to be 69%
Figure 49 AUC Curve LDA Model
The curves for training and testing look similar. The model is quite stable so far, but with a decrease in accuracy compared to the logistic regression. More emphasis on the data could bring even more stability, a higher AUC/ROC curve and better model accuracy.
• The predictor 'Number of children born' has the largest coefficient magnitude and thus helps the most in classification.
• The predictor 'Wife age' has the smallest coefficient magnitude and thus helps the least in classification.
CART Model
The independent and dependent variables were split into X and y respectively. Initially the model was fit with a Decision Tree Classifier using gini as the criterion for splitting nodes. The important features were extracted as shown below.
For this basic model, the most important feature came out to be Wife age and the least important Media exposure.
Decision tree model scores:
Training: 98%
Testing: 62%
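A minimal sketch of the basic tree and of extracting its feature importances is given below; the gini criterion matches the text, while the random state is an assumption.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", random_state=7)
tree.fit(X_train, y_train)
print(pd.Series(tree.feature_importances_, index=X_train.columns)
        .sort_values(ascending=False))                         # Wife age ranks highest
print("Train:", tree.score(X_train, y_train), "Test:", tree.score(X_test, y_test))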
By looking at the training and testing scores, we can say that the model is overfitting, a result of high variance. To overcome this, pruning was employed by tuning the hyperparameters using Grid Search CV.
These were used as the best estimators and again the model was built.
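A minimal sketch of the pruning step via Grid Search CV is given below; the parameter grid is an assumption, since the report shows only the resulting best estimators.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [3, 5, 7, 10],
              "min_samples_leaf": [5, 10, 20, 50],
              "min_samples_split": [10, 30, 50]}
grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=7),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
pruned = grid.best_estimator_
print(grid.best_params_)
print("Train:", pruned.score(X_train, y_train), "Test:", pruned.score(X_test, y_test))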
Decision tree model scores (Hyper tuning the parameters):
Training: 80%
Testing: 65%
Though the training score is reduced, the testing score increased only slightly (from 62% to 65%). The model is still overfitting, which shows that decision trees are prone to overfitting.
The classification matrix for the testing data of the pruned CART model still appears unsatisfactory, as the results are not up to the mark. A different algorithm such as Random Forest could be a better alternative.
The AUC for the training data is better than the AUC for the testing data, again indicating overfitting: the AUC score is 86% for training and 70% for testing. Emphasising the quantity of data and advanced algorithms such as Random Forest, XGBoost, etc. could make the model better.
Business Insights and Recommendations
Importance of feature based on best model
• In LDA, Wife age had the smallest coefficient magnitude, indicating the least importance in the model building.
• The CART model provides information on feature importance, where Wife age is given the highest importance, surprisingly the opposite of what the LDA model suggested.
• The least important feature came out to be Media exposure. Given the way a decision tree splits, such a heavily imbalanced variable will rarely be chosen unless its levels carry crucial information.
Actionable Insights and recommendations
• Univariate analysis was performed to examine the patterns displayed by the variables. It can be noted that the wife's age has an almost normal distribution with no outliers. The education counts for both husband and wife follow a similar pattern, with tertiary having the highest count and uneducated the least.
• A two-fold approach was considered. The first was to treat the outliers in the number of children born as legitimate; the other was to treat them as exaggerated values and cap them to the nearest legitimate value. When models were built for both approaches, the second provided better results and was chosen.
• Logistic regression was the first algorithm considered. The results it provided were fairly satisfying considering the (small) number of records. The model accuracy, precision, recall and F1-score were calculated.
• Similarly, Linear Discriminant Analysis and CART (decision tree) models were built to check their approach (feature importance) and accuracy. The LDA separated the target classes as intended and its accuracy was reasonable.
• The decision tree model was initially built without tuning any parameters. The result was overfitting, caused by high variance. Another model was built by pruning the tree and tuning the hyperparameters, with Grid Search CV used as the multi-fold cross-validation approach. This reduced the overfitting slightly but not significantly, further suggesting that a decision tree might not be the right option for the data provided.
• The logistic regression and LDA models provided better results than the decision tree. Either logistic regression or LDA could provide the desired results. It is also important to note that the results could improve considerably if more data were available.