Predictive Modelling
- RAHUL SHARMA
A Problem 1
B Problem 2
Problem definition:-
Check shape:- 8192 rows x 22 columns
Data types:-
Statistical Summary:-
Uni-variate analysis:-
Transfers per second for both reads and writes are high, with the majority occurring at
a rapid pace.
Most transactions are processed quickly by the system; the read-write rate is typically
under 5%.
This suggests that the system is relatively free of ongoing activity.
The CPU runs in user mode between 80% and 99% of the time, which is ideal.
Multivariate analysis:-
A correlation can be observed between 'vflt,' 'pflt,' and 'fork,' suggesting that an increase in
fork calls is associated with a rise in page faults.
Likewise, there is a strong correlation between the number of page out requests per second
and the number of pages paged out per second.
Use appropriate visualizations to identify the patterns and insights:-
The read system call is the most frequently used call, with an average of 53 calls per
second. This is likely because it is used to read data from files and devices.
The write system call is the second most frequently used call, with an average of 39 calls per
second. This is likely because it is used to write data to files and devices.
The fork system call is the third most frequently used call, with an average of 24 calls per
second. This is likely because it is used to create new processes.
The sread system call is the fourth most frequently used call, with an average of 21 calls per
second. This is likely because it is used to read data from sockets.
The swrite system call is the fifth most frequently used call, with an average of 15 calls per
second. This is likely because it is used to write data to sockets.
Memory metrics: The amount of available memory (freemem) and the related memory metrics are
closely connected, and they move together in a structured way when the system has to fall
back on swap space (a backup memory area).
I/O: Input and output operations, represented by sread and swrite, follow their own pattern
and are less connected to the rest of the system metrics.
Page faults: The page-fault rate (pflt) moves back and forth with the other metrics. When it
is low on its own, other processes have more freedom to operate smoothly.
CPU: Process execution (exec) and forking (fork) are comparatively independent and depend
less on the other parts of the system.
System as a whole: Many intricate connections exist across the system. When one metric
changes, the others respond, so each part influences overall performance.
AFTER TREATMENT:-
Feature Engineering:-
New features, a page rate and a page-request rate, have been created from the
variables pgin, pgout, ppgin & ppgout.
However, these new features have not produced any significant result, as the majority of
their values are either 0 or inf.
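The report does not give the exact formulas for the new features, so the sketch below assumes a simple ratio (pages paged in per page-in request). It illustrates why such ratios tend to collapse to 0 or inf when many observations are zero, and shows one possible guard. The sample rows and the `fill` default are hypothetical.

```python
import math

def raw_ratio(numerator, denominator):
    """Plain ratio that mimics array-style division: x/0 gives inf, 0/0 gives NaN."""
    if denominator == 0:
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator

def safe_ratio(numerator, denominator, fill=0.0):
    """Guarded ratio: return `fill` whenever the denominator is zero."""
    return numerator / denominator if denominator != 0 else fill

# Hypothetical sample rows: (pgin, ppgin) = page-in requests/s, pages paged in/s.
rows = [(0.0, 0.0), (2.0, 6.0), (0.0, 1.5), (4.0, 0.0)]

# Assumed feature definition: ppgin / pgin.
raw = [raw_ratio(ppgin, pgin) for pgin, ppgin in rows]
safe = [safe_ratio(ppgin, pgin) for pgin, ppgin in rows]
```

With the guard in place the feature stays finite, although a column that is mostly zero may still carry little predictive signal.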
After encoding, the data-set has been split into training and testing sets in a 70:30
ratio.
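A 70:30 split can be sketched in plain Python as follows. The function name, the seed, and the stand-in rows are assumptions for illustration; the report does not state which tool or random seed was used.

```python
import random

def train_test_split_70_30(rows, test_fraction=0.30, seed=1):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(8192))  # stand-in for the 8192 observations
train, test = train_test_split_70_30(rows)
```

Fixing the seed keeps the split reproducible, so later model comparisons are made on the same training and testing rows.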
X_TRAIN 1st 5 rows:-
a) Standard errors assume that the covariance matrix of the errors is correctly specified.
b) The condition number is large, 6.9e+06. This might indicate that there is strong
multicollinearity or other numerical problems.
Interpretation of R-squared
The R-squared value shows that the model explains 60.1% of the variance in the training set.
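For reference, R-squared and adjusted R-squared can be computed as in the minimal sketch below. The toy values are hypothetical and are not taken from the report's data.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical toy values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)
```

Because adjusted R-squared only rises when a predictor adds more than chance-level explanatory power, it is the natural figure to watch while dropping columns.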
Dropping the multicollinear columns one by one, we observe that the values remain almost the
same: the difference is a decrease of only 0.001 to 0.002.
And so on.
There is no effect on adj. R-squared after dropping the 'ppgout' column, and it has the highest
variance inflation factor (VIF), so we remove it from the training set.
There is also no effect on adj. R-squared after dropping the 'pgin' column, and it then has the
highest variance inflation factor, so we remove it from the training set as well.
Dropping the 'fork' column has a small effect on adj. R-squared.
Dropping the 'vflt' column also has a small effect on adj. R-squared.
There is no effect on adj. R-squared after dropping the 'sread', 'lread', and 'pgfree' columns.
Dropping the 'pflt' column has a small effect on adj. R-squared.
After dropping the features causing strong multicollinearity and the statistically insignificant ones, our
model performance hasn't dropped sharply. This shows that these variables did not have much
predictive power.
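The variance inflation factor used in this step is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below is a NumPy illustration on synthetic data, not the library routine used for the report; the helper name and the generated columns are assumptions, and an intercept is included in each auxiliary regression.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of the 2-D array X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return vifs

# Synthetic predictors: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
```

In this toy case the two near-duplicate columns receive very large VIF values while the independent column stays close to 1, which mirrors the rule of dropping the column with the highest VIF first and re-checking adj. R-squared.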
We observe that the pattern is slightly less pronounced and that the data points appear to be
randomly distributed.
The QQ plot of the residuals can be used to visually check the normality assumption.
The normal probability plot of the residuals should approximately follow a straight line.
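The points behind a normal QQ plot can be computed with the standard library alone, as in this minimal sketch. The residual values are hypothetical, and the plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    # Standardise and sort the residuals to get the sample quantiles.
    sample = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Hypothetical residuals from a fitted model.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 0.7, 0.0, -0.1, 0.1]
points = qq_points(residuals)
```

If the residuals are roughly normal, the pairs lie close to the line y = x; systematic curvature at the ends points to heavy or light tails.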
1.4 Business Insights & Recommendations
Comment on the Linear Regression equation from the final model and impact of relevant variables
(at least 2) as per the equation - Conclude with the key takeaways (actionable insights and
recommendations) for the business
The RMSE values on the train and test sets are comparable, so our model does not appear to
suffer from overfitting.
The MAE indicates how closely our current model is able to predict the target (the share of
time the CPU runs in user mode): on average, within the mean absolute error on the test data.
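The two error metrics compared here can be sketched as follows. The train and test values are hypothetical placeholders, not the report's results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values for the train and test sets.
y_train, pred_train = [90.0, 85.0, 95.0, 80.0], [88.0, 86.0, 93.0, 83.0]
y_test, pred_test = [92.0, 78.0, 88.0], [90.0, 81.0, 87.0]

rmse_train, rmse_test = rmse(y_train, pred_train), rmse(y_test, pred_test)
mae_train, mae_test = mae(y_train, pred_train), mae(y_test, pred_test)
```

A test RMSE that is much larger than the train RMSE would be a sign of overfitting; similar values on both sets support the conclusion drawn here.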
Therefore, we can assume the model "fitres-42" is good for prediction as well as inference
purposes.
Problem definition:-
Check shape:- 1473 rows x 10 columns
Data types:-
Statistical Summary:-
Uni-variate analysis:-
The wives' ages range from 17 to 49 years; most are around 28, within a main cluster that
starts in the mid-20s.
The majority of people have 1 or 2 children, but a few have more than 15.
Wives who have completed secondary or tertiary education have used contraceptive
methods more than the others.
Wives who are uneducated or have completed only primary education tend not to use any
contraceptive method.
The same pattern is found for the husband's education.
Fewer husbands than wives are uneducated.
Most people belong to areas where the standard of living is Very High or High.
Fewer than about 250 people fall under the Low and Very Low standard of living index.
As already seen, most wives have used a contraceptive method; however, a sizeable
proportion have not used any.
Multivariate analysis:-
This plot does not identify any major trend or correlation between the variables.
Only a few of the variables appear in the pair-plot, and their classes are not well
separated, so they will not be good predictors.
A strong positive correlation appears between the wife's age and the husband's occupation.
A strong negative correlation appears between the number of children born and the wife's age.
The heat-map above shows that couples where the wife was younger tended to have more
children than couples where the wife was older. A few couples also have a much higher
number of children born.
AFTER TREATMENT:
The data has string and categorical variables; these must be encoded so that the machine
learning model can use them.
In the target variable, "No" is mapped to 0 and "Yes" is mapped to 1.
Likewise, numeric codes are assigned to the values of the variables Wife_ education,
Husband_education & Standard_of_living_index.
After this, dummy encoding is used to encode the rest of the columns.
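The encoding steps can be illustrated in plain Python as below. The category labels, the numeric codes, and the column names in `record` are assumptions for illustration, since the report does not list the exact mapping.

```python
# Assumed labels and codes; the report does not list the exact mapping.
TARGET_CODES = {"No": 0, "Yes": 1}
EDUCATION_CODES = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}

def one_hot(column, value, categories, drop_first=True):
    """Dummy-encode `value`; the first category is the baseline when drop_first is True."""
    kept = categories[1:] if drop_first else categories
    return {f"{column}_{c}": int(value == c) for c in kept}

# Hypothetical record with illustrative column names.
record = {"wife_education": "Secondary", "media_exposure": "Exposed", "target": "Yes"}

encoded = {
    "wife_education": EDUCATION_CODES[record["wife_education"]],      # ordinal code
    **one_hot("media_exposure", record["media_exposure"], ["Exposed", "Not-Exposed"]),
    "target": TARGET_CODES[record["target"]],                         # binary target
}
```

Ordinal codes keep the natural order of the education levels, while dummy encoding avoids imposing an order on categories that have none.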
After encoding, the data-set has been split into training and testing sets in a 70:30
ratio.
Accuracy = 0.7152
2.3 Model Building and Compare the Performance of the Models:-
Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART model -
Prune the CART model by finding the best hyper parameters using Grid Search - Check the performance
of the models across train and test set using different metrics - Compare the performance of all the
models built and choose the best one with proper rationale
Build a Linear Discriminant Analysis model:-
Build a CART model:-
Prune the CART model by finding the best hyper parameters using Grid Search:-
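Pruning a CART model through a grid search could look like the scikit-learn sketch below. The data is a synthetic stand-in with the same shape as the data-set (1473 rows, 9 predictors), and the grid values, seed, and scoring choice are hypothetical; the report does not state the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the report's data-set.
X, y = make_classification(n_samples=1473, n_features=9, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Hypothetical search grid; these values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 25, 50],
    "min_samples_split": [30, 75, 150],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
best_tree = search.best_estimator_
train_accuracy = best_tree.score(X_train, y_train)
test_accuracy = best_tree.score(X_test, y_test)
```

Limiting depth and leaf size restricts how far the tree can grow, which usually narrows the gap between train and test accuracy.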
Check the performance of the models across train and test set using different metrics:-
Compare the performance of all the models built and choose the best one with proper rationale:-
The accuracy scores of all the models are above 65% for both the train and the test data.
Accuracy: Logistic Regression and Linear Discriminant Analysis have similar test accuracy, but
Logistic Regression has a slightly higher accuracy.
Precision and Recall: Linear Discriminant Analysis has a higher test recall, indicating its ability
to correctly identify positive cases. However, Logistic Regression also performs well.
F1 Score: Linear Discriminant Analysis has a higher F1 score on the test set.
AUC-ROC: Logistic Regression and Linear Discriminant Analysis have the same AUC-ROC on
the test set.
Considering the overall performance across these metrics, Linear Discriminant Analysis seems to
be a good choice. It strikes a balance between precision and recall, making it suitable for cases
where both false positives and false negatives are important.
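The accuracy, precision, recall, and F1 figures compared above all derive from the confusion-matrix counts. A minimal sketch with hypothetical labels and predictions:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels (1 = uses a contraceptive method) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Computing the same dictionary for every model on both the train and test sets gives a like-for-like table for the comparison.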
Performance Superiority of CART Model: The CART model outperforms all the other models considered in the
evaluation. Judged on accuracy, it reaches 68%, which indicates that it predicts both classes of interest
effectively.
Accuracy and Recall Metrics: The CART model also performs strongly on recall, which measures the ability to
correctly identify true positives. The CART model and the LDA model both show high recall values, but the
slightly higher accuracy of the CART model favors it for prediction.
Area Under the Curve (AUC) Analysis: The AUC values of 82% for the train data and 72% for the test data are not
the best possible, but they still surpass those of the other models considered. This indicates that the CART
model has good discriminative ability.
Recommendation for Prediction: Based on the observed performance metrics, the CART model is suitable for
making predictions on unseen data. Its combination of high accuracy, high recall, and competitive AUC values
supports using it for practical predictions.
Consideration for Unseen Data: The CART model is expected to generalize beyond the training and evaluation
data, so it can be used for predictions on unseen data fed to the model.
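The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition, using hypothetical labels and scores.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]
auc = roc_auc(y_true, scores)
```

The pairwise loop is quadratic, which is fine for illustration; library implementations use a sorted-rank approach for larger data.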
Wife's Education and Number of Children Born: Both the Logistic Regression and CART
models highlight the importance of the wife's education and the number of children born as
key features. These features are identified as significant factors in determining whether
women will use contraceptive methods. The emphasis on these variables suggests that they
play a crucial role in influencing the decision-making process.
Husband's Education: Both models also indicate that the husband's education is important.
In real-life scenarios, the husband's education level can affect the wife's decision to use
contraceptive methods, which points to a social or contextual influence on the
decision-making process.
Importance of Features: The education levels of both the wife and the husband, together with
the number of children born, stand out in predicting contraceptive usage. These features are
likely strong predictors in the models and contribute significantly to their predictive
performance.
Real-World Relevance: The importance of the husband's education makes sense in practice.
The models align with common societal patterns in which the education levels of both the
wife and the husband influence decisions related to family planning and contraceptive use.
Standard of Living Influence: Women from areas with high and very high standards of living
are more likely to use contraceptive methods. This may indicate that socio-economic factors
play a role in family planning decisions.
Age and Education Level: Women between the ages of 25 and 35 with a good education level
are more likely to use contraceptives. This aligns with the understanding that
education and age can impact family planning decisions.
Husband's Education: The education level of the husband is highlighted as a significant factor
influencing whether the wife will use contraceptive methods. This reinforces the notion that
spousal education levels can be interconnected with family planning decisions.
Understanding Non-Parental Contraceptive Users: It is important to understand the viewpoint
of women who do not have any children but still use contraceptives, and to explore the
motivations and circumstances of this group.
Role of Media Exposure: Media exposure plays a key role in family planning decisions. It
shapes perceptions and awareness of contraceptive methods.
Health Ministry Outreach: The Republic of Indonesia Ministry of Health can reach out to
women who do not use contraceptives with education and awareness programmes, a proactive
way to address potential gaps in knowledge or accessibility.
Analysis of Education Levels 8, 10, 11, & 12: Wives with education levels 8, 10, 11, and 12
do not use contraceptives. Further investigation into the reasons behind this pattern could
provide valuable insights into the cultural, social, or individual factors influencing
contraceptive decisions.