Predictive Modelling

Uploaded by

anandagarwalscribe

- RAHUL SHARMA

Sr. No.   CONTENT

A         Problem 1
1.1       Define the problem and perform exploratory Data Analysis
1.2       Data Pre-processing
1.3       Model Building - Linear regression
1.4       Business Insights & Recommendations

B         Problem 2
2.1       Define the problem and perform exploratory Data Analysis
2.2       Data Pre-processing
2.3       Model Building and Compare the Performance of the Models
2.4       Business Insights & Recommendations

A Problem 1: Comp-activ database

The comp-activ database comprises activity measures of computer systems. Data was gathered
from a Sun Sparcstation 20/712 with 128 Mbytes of memory, operating in a multi-user university
department. Users engaged in diverse tasks, such as internet access, file editing, and
CPU-intensive programs.
As aspiring data scientists, our aim is to establish a linear equation for predicting 'usr' (the
percentage of time CPUs operate in user mode). The goal is to analyze various system attributes
to understand their influence on the system's 'usr' mode.

1.1 - Define the problem and perform exploratory Data Analysis

Problem definition - Check shape, Data types, statistical summary - Univariate analysis - Multivariate
analysis - Use appropriate visualizations to identify the patterns and insights - Key meaningful
observations on individual variables and the relationship between variables.

Problem definition:-
Check shape:- 8192 rows x 22 columns

Data types:-

Statistical Summary:-

Uni-variate analysis:-

• Transfers per second for both reading and writing are brisk, with the majority occurring at
a rapid pace.
• Most transactions are processed swiftly by the system; the read-write rate is generally
low, typically under 5%.
• The current situation suggests a relative absence of ongoing activity.

The CPU runs in user mode between 80% and 99% of the time, which is ideal.

Multivariate analysis:-

• A correlation can be observed between 'vflt', 'pflt', and 'fork', suggesting that an increase in
fork calls is associated with a rise in page faults.
• Likewise, there is a strong correlation between the number of page out requests per second
and the number of pages paged out per second.

Use appropriate visualizations to identify the patterns and insights:-

• The read system call is the most frequently used call, with an average of 53 calls per
second. This is likely because it is used to read data from files and devices.
• The write system call is the second most frequently used call, with an average of 39 calls per
second. This is likely because it is used to write data to files and devices.
• The fork system call is the third most frequently used call, with an average of 24 calls per
second. This is likely because it is used to create new processes.
• The sread variable is fourth, with an average of 21 calls per second. In this dataset it
records the number of system read calls per second.
• The swrite variable is fifth, with an average of 15 calls per second. In this dataset it
records the number of system write calls per second.

Key meaningful observations on individual variables and the relationship between variables:-

• Memory Metrics Tango: The amount of available memory (freemem) and its companions are
closely connected. When the system needs to use the swap space (a backup memory area),
they move together in a structured way.
• I/O, the Lone Wolf: Input and output operations (I/O), represented by sread and swrite,
follow their own rhythm. They are less connected to the rest of the system metrics.
• PFLT Playing Ping-Pong: The page-fault metric (pflt) moves back and forth on its own. When it
registers fewer faults, other processes have more freedom to operate smoothly.
• CPU, the Independent Actor: The Central Processing Unit (CPU) acts independently. When it
executes (exec) or forks (fork), it is less dependent on other parts of the system.
• System, a Grand Ensemble: The entire system is like a grand ensemble. Many intricate
connections exist among the metrics, and when one metric changes, it affects the others.
Everything is interconnected, and each part influences the whole performance.

1.2 Data Pre-processing

Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection (treat, if
needed) - Feature Engineering - Encode the data - Train-test split

Missing Value Treatment (if needed)

There are 104 missing values in rchar and 15 in wchar.

AFTER TREATMENT:-

Outlier Detection (treat, if needed):-

• There are a total of 31775 outliers present.

• All the outliers are treated by capping them at the lower and upper bounds calculated from the
IQR.
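The capping described above can be sketched as follows. This is only a minimal illustration, assuming the conventional 1.5 x IQR rule (the report does not state the multiplier it used); the function name cap_outliers_iqr and the toy values are hypothetical and are not taken from the report.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # clip() replaces values below/above the bounds with the bounds themselves
    return series.clip(lower=lower, upper=upper)


# Hypothetical toy column: the value 100 lies far above the upper bound.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="toy_metric")
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Capping keeps every row in the data set, which matters here because the outlier count is larger than the number of rows (outliers are counted per column).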

Feature Engineering:-

• New features, page rate and page request rate, have been created from the variables pgin,
pgout, ppgin and ppgout.
• However, these new features have not produced any significant output, as the majority of
their values are 0 or inf.
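One way such 0 and inf values can arise is when a rate is built as a ratio of two columns that are often zero. The sketch below assumes a ratio definition (pgin divided by pgout), which the report does not confirm; the column name page_request_rate and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; many real rows have zero paging activity.
toy = pd.DataFrame({
    "pgin": [0.0, 2.0, 0.0, 4.0],
    "pgout": [0.0, 0.0, 1.0, 2.0],
})

# Assumed definition: a plain ratio. Division by zero yields inf (or NaN for 0/0).
toy["page_request_rate"] = toy["pgin"] / toy["pgout"]

# Replace infinite values with NaN so they can be handled like missing values.
clean_rate = toy["page_request_rate"].replace([np.inf, -np.inf], np.nan)
print(toy["page_request_rate"].tolist())
print(clean_rate.isna().sum())
```

A feature dominated by 0, inf or NaN carries little information for a linear model, which is consistent with the observation that these features did not help.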

Encode the data - Train-test split:-

• After encoding, the data-set is split into training and testing sets in a 70:30 ratio.
• X_TRAIN first 5 rows:-

• X_TEST first 5 rows:-
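A 70:30 split of this kind can be sketched with scikit-learn as below. The toy frame, its feature names x1 and x2, and the random_state value are assumptions for illustration; only the target name 'usr' and the 70:30 ratio come from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the encoded data set.
df = pd.DataFrame({
    "x1": range(10),
    "x2": range(10, 20),
    "usr": range(80, 90),
})

X = df.drop(columns="usr")  # predictors
y = df["usr"]               # target

# test_size=0.30 gives the 70:30 ratio; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.head())
print(X_test.head())
```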

1.3 Model Building - Linear Regression

Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant variables using
the appropriate method - Create multiple models and check the performance of Predictions on Train and
Test sets using Rsquare, RMSE & Adj Rsquare.

a) Standard errors assume that the covariance matrix of the errors is correctly specified.
b) The condition number is large, 6.9e+06. This might indicate that there is strong
multicollinearity or other numerical problems.

• Interpretation of R-squared
• The R-squared value shows that the model explains 60.1% of the variance in the training set.

Dropping the multicollinear columns one by one, we observe that adj. R-squared remains almost the
same, decreasing by only 0.001 to 0.002.

And so on for the remaining columns.

• There is no effect on adj. R-squared after dropping the 'ppgout' column, and it has the highest
variance inflation factor (VIF), so we remove it from the training set.
• There is also no effect on adj. R-squared after dropping the 'pgin' column, and it then has the
highest VIF, so we remove it from the training set as well.

• There is a small effect on adj. R-squared after dropping the 'fork' column.

• There is also a small effect on adj. R-squared after dropping the 'vflt' column.

• There is no effect on adj. R-squared after dropping the 'sread', 'lread' and 'pgfree' columns.

• There is a small effect on adj. R-squared after dropping the 'pflt' column.

After dropping the features causing strong multicollinearity and the statistically insignificant ones, our
model performance hasn't dropped sharply. This shows that these variables did not have much
predictive power.

• VIF for all features is < 3.

• The VIF method can be used to identify variables with high multicollinearity so that they can be
removed; statistical significance is checked separately through the p-values.

• We observe that the pattern has weakened slightly and that the data points appear to be randomly
distributed.

• The QQ plot of residuals can be used to visually check the normality assumption.
• The normal probability plot of residuals should approximately follow a straight line.

• The points lie only partially on the straight line in the QQ plot.
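The numbers behind a QQ plot can be computed without drawing it, as in the sketch below. It uses synthetic, normally distributed values as stand-in residuals (an assumption; the report's residuals are not available), and the correlation r returned by scipy.stats.probplot indicates how closely the points follow a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=1.0, size=300)  # stand-in residuals

# probplot returns the plotted points and a least-squares line through them.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"line fit: slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```

An r close to 1 corresponds to points lying on the line; points that only partially follow the line, as reported here, would show up as curvature in the tails.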

1.4 Business Insights & Recommendations
Comment on the Linear Regression equation from the final model and impact of relevant variables
(at least 2) as per the equation - Conclude with the key takeaways (actionable insights and
recommendations) for the business

• RMSE on the train data = 11.5289

• MAE on the train data = 8.1244

• RMSE on the train and test sets is comparable, so our model does not appear to suffer from
over-fitting.
• MAE indicates the mean absolute error with which our current model is able to predict 'usr' on
the test data.
• Therefore, we can assume the model "fitres-42" is good for prediction as well as inference
purposes.
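The metrics quoted above can be computed as in the sketch below. The helper names rmse, mae and adjusted_r2 and the toy numbers are hypothetical; the adjusted R-squared formula is the standard one for a model with p predictors fitted on n rows.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical actual and predicted 'usr' values.
y_true = [90, 85, 95, 80]
y_pred = [88, 85, 99, 80]
print(rmse(y_true, y_pred), mae(y_true, y_pred))
print(adjusted_r2(0.601, n=100, p=5))
```

Computing the same functions on the train and test predictions side by side is what supports the over-fitting comparison.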

Key Influence of Process Run Queue Size:

The CPU run-time in user mode shows a significant dependency on the Process run queue size.
Understanding and managing the size of the queue for running processes are crucial for
optimizing CPU performance.

Sensitivity to CPU Bound Queue Size:

A noteworthy finding is that increasing the CPU bound queue size by just 1 unit is associated with
a substantial increase of about 33.5 percentage points in the time the CPU runs in user mode (a
coefficient in a linear equation is additive, not a multiplier). This suggests that proper
management of CPU-bound tasks in the queue is vital for improving user mode run-time.

Impact of Non-CPU Bound Queue Size:

Similarly, the non-CPU bound queue size has a significant impact, with an increase of about 32.7
percentage points in CPU run-time in user mode for every 1-unit increase. Balancing and optimizing
I/O-bound tasks in the queue are important considerations for overall system performance.

Cumulative Effect of Process Run Queue Size:

When considering both CPU and non-CPU bound queues, the overall contribution to the percentage of
time the CPU runs in user mode is substantial, approximately 132 units including the Intercept.
This underscores the holistic influence of the process run queue size on CPU behavior.

Constant Impact of Other Features:

The analysis suggests that, while the process run queue size has a substantial impact, the other
features considered in the model do not significantly affect CPU run-time. This could guide
resource allocation efforts, focusing primarily on optimizing the process run queue size.
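For reference, the way a coefficient of a fitted linear equation is read can be written out as below. This is a generic sketch, assuming the queue-size figures quoted in this section (33.5 and 32.7) are coefficients of the final equation; the symbols are illustrative and the full equation is not reproduced here.

```latex
\widehat{\mathrm{usr}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p

% Raising one predictor x_j by one unit, with all other predictors held fixed:
\Delta \widehat{\mathrm{usr}} = \beta_j

% Example under the stated assumption: \beta_j = 33.5 means the predicted usr
% rises by 33.5 percentage points, an additive change rather than a 33.5-fold one.
```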


B Problem 2: Contraceptive Method Data-set

In your role as a statistician at the Republic of Indonesia Ministry of Health, you have been
entrusted with a dataset containing information from a Contraceptive Prevalence Survey. This
dataset encompasses data from 1473 married females who were either not pregnant or were
uncertain of their pregnancy status during the survey.
We now predict whether these women opt for a contraceptive method of choice. This
prediction will be based on a comprehensive analysis of their demographic and socio-economic
attributes.

2.1 Define the problem and perform exploratory Data Analysis

Problem definition:-
Check shape:- 1473 rows x 10 columns

Data types:-

Statistical Summary:-

Uni-variate analysis:-

• The wives' ages range from 17 to 49 years, with most of them around 28 (mid to late 20s).

• The majority of the women have 1 or 2 children, but a few have more than 15 children as
well.

• Wives who have completed Secondary or Tertiary education have used contraceptive
methods more than the others.
• Wives who are not educated or have completed only Primary education tend not to use any
contraceptive method.
• A similar pattern is found for the husband's education.
• Fewer husbands than wives are uneducated.

• Scientology is the more common category in the wife's religion variable.

• Most wives are not working.

• Most people belong to areas where the standard of living is Very High or High.
• Fewer than about 250 people have a Low or Very Low standard of living index.

• Media exposure is widespread: more than 1000 respondents are exposed to media.

• As already noted, most wives have used a contraceptive method; however, a good proportion
have not used any.

Multivariate analysis:-

• This plot does not identify any major trend or correlation between the variables.
• Only a few of the variables appear in the pair-plot, and their classes are not well
separated, so they will not be good predictors.

Use appropriate visualizations to identify the patterns and insights:-

• A strong positive correlation is shown between wife's age and husband's occupation.
• A strong negative correlation is shown between number of children born and wife's age.
• Based on the above heat-map, couples where the wife was younger tended to have more children
than couples where the wife was older. There are also a few couples with a much higher number
of children born.

2.2 Data Pre-processing

Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier Detection (treat, if needed)
- Feature Engineering (if needed) - Encode the data - Train-test split

Missing value Treatment (if needed)

• There are 71 missing values in "wife_age" and 21 in "no_of_children_born", so we now treat
the missing values.

AFTER TREATMENT:

Outlier Detection (treat, if needed)

• 97 outliers are present, so the outliers have to be treated.

• 'Husband_Occupation' has also been changed to the object data type, as it is a categorical
variable.
• There are 85 duplicate rows, which can be dropped from the dataset.

Encode the data - Train-test split

• The data has string and categorical variables; these must be encoded so that the machine
learning model can work with the data.
• In the target variable, "No" is mapped to 0 and "Yes" is mapped to 1.
• Likewise, numeric codes are assigned to the values of the variables Wife_ education,
Husband_education and Standard_of_living_index.
• After this, dummy encoding is used to encode the rest of the columns.
• After encoding, the data-set is split into training and testing sets in a 70:30 ratio.
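The encoding steps above can be sketched as below. The integer codes chosen for the education levels, the target column name Contraceptive_method_used, the Media_exposure category labels and the toy rows are all assumptions for illustration; the report only states that "No"/"Yes" become 0/1, that numeric codes are assigned to the ordered variables, and that the remaining columns are dummy encoded.

```python
import pandas as pd

# Hypothetical toy rows standing in for the survey data.
toy = pd.DataFrame({
    "Wife_ education": ["Primary", "Tertiary", "Uneducated", "Secondary"],
    "Media_exposure": ["Exposed", "Not-Exposed", "Exposed", "Exposed"],
    "Contraceptive_method_used": ["No", "Yes", "No", "Yes"],
})

# Target: "No" -> 0, "Yes" -> 1.
toy["Contraceptive_method_used"] = toy["Contraceptive_method_used"].map(
    {"No": 0, "Yes": 1}
)

# Ordered category: assumed codes that preserve the education order.
education_order = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
toy["Wife_ education"] = toy["Wife_ education"].map(education_order)

# Remaining categorical columns: dummy encoding, dropping one level per column.
encoded = pd.get_dummies(toy, columns=["Media_exposure"], drop_first=True)
print(encoded)
```

Mapping ordered categories to integers keeps their ranking, while dummy encoding avoids imposing an order on categories that have none.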

Accuracy = 0.7152

2.3 Model Building and Compare the Performance of the Models:-
Build a Logistic Regression model - Build a Linear Discriminant Analysis model - Build a CART model -
Prune the CART model by finding the best hyper parameters using Grid Search - Check the performance
of the models across train and test set using different metrics - Compare the performance of all the
models built and choose the best one with proper rationale

Build a Logistic Regression model:-
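A logistic regression fit of this kind can be sketched as below. The data is synthetic (generated with make_classification) because the survey data is not reproduced here, and the max_iter and random_state values are arbitrary assumptions rather than the settings used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the encoded survey data.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

train_acc = log_reg.score(X_train, y_train)  # accuracy on the training set
test_acc = log_reg.score(X_test, y_test)     # accuracy on the held-out set
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```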

Build a Linear Discriminant Analysis model:-
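A Linear Discriminant Analysis model can be fitted in the same way, as in the sketch below. Again the data is synthetic and the split settings are assumptions; predict_proba gives the class probabilities that are later needed for the AUC-ROC comparison.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the encoded survey data.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

proba = lda.predict_proba(X_test)      # one column of probabilities per class
lda_test_acc = lda.score(X_test, y_test)
print(f"test accuracy={lda_test_acc:.3f}")
```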

Build a CART model:-

Prune the CART model by finding the best hyper parameters using Grid Search:-
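Building a CART model and pruning it through a grid search can be sketched as below. The parameter grid, the 5-fold cross-validation and the synthetic data are assumptions; the report does not list the hyper parameters it searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded survey data.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hypothetical grid: limiting depth and leaf size is a form of pre-pruning.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_  # refitted on the full training set
print(grid.best_params_, best_tree.score(X_test, y_test))
```

An unrestricted tree tends to fit the training data almost perfectly, so comparing train and test scores of best_tree is the usual check that pruning helped.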

Check the performance of the models across train and test set using different metrics:-

Compare the performance of all the models built and choose the best one with proper rationale:-

The accuracy scores of all the models are above 65% for both test and train data.

• Accuracy: Logistic Regression and Linear Discriminant Analysis have similar test accuracy, but
Logistic Regression has a slightly higher accuracy.
• Precision and Recall: Linear Discriminant Analysis has a higher test recall, indicating its ability
to correctly identify positive cases. However, Logistic Regression also performs well.
• F1 Score: Linear Discriminant Analysis has a higher F1 score on the test set.
• AUC-ROC: Logistic Regression and Linear Discriminant Analysis have the same AUC-ROC on
the test set.

Considering the overall performance across these metrics, Linear Discriminant Analysis seems to
be a good choice. It strikes a balance between precision and recall, making it suitable for cases
where both false positives and false negatives are important.
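The metrics used in this comparison can be computed with scikit-learn as sketched below. The labels, predictions and scores are a made-up toy example, not results from the report; in practice they would come from each fitted model's predict and predict_proba output on the train and test sets.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels, predicted labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc_roc": roc_auc_score(y_true, y_score),  # needs scores, not labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Collecting these values per model and per data split in one table makes the comparison between Logistic Regression, LDA and CART easier to audit.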

• Performance of the CART model: On accuracy, the CART model outperforms the other models
considered in the evaluation. It achieves an accuracy of 68%, indicating its effectiveness in
predicting both classes of interest.
• Accuracy and recall: The CART model also performs strongly on recall, which measures the ability
to correctly identify true positives. The CART and LDA models both show high recall values, but the
slightly higher accuracy of the CART model favors it for prediction.
• Area Under the Curve (AUC): The AUC values of 82% for the train data and 72% for the test data are
not the best possible, but they still surpass those of the other models considered. This indicates
that the CART model has good discriminative ability.
• Recommendation for prediction: Based on the observed performance metrics, the CART model is
suitable for making predictions on unseen data. Its combination of high accuracy, recall, and
competitive AUC values supports this recommendation.
• Consideration for unseen data: The CART model is expected to generalize beyond the training and
evaluation data, so it can be used for making predictions on unseen data fed to the model.

2.4 Business Insights & Recommendations:-

• Wife's Education and Number of Children Born: Both the Logistic Regression and CART
models highlight the wife's education and the number of children born as key features. They
are significant factors in determining whether women will use contraceptive methods and
play a crucial role in the decision-making process.
• Husband's Education: Both models also indicate the importance of the husband's education.
In real-life scenarios, the husband's education level can have an impact on the wife's
decision to use contraceptive methods. This implies a social or contextual influence where
the husband's education is a relevant factor in the decision-making process.
• Importance of Features: The education levels of both the wife and the husband, as well as the
number of children born, are significant in predicting contraceptive usage. These features are
likely strong predictors in the models and contribute significantly to their predictive
performance.
• Real-World Relevance: The importance of the husband's education makes sense in the real
world. The models align with common societal patterns in which the education levels of both
the wife and the husband can influence decisions related to family planning and
contraceptive use.
• Standard of Living Influence: Women from areas with high and very high standards of living
are more likely to use contraceptive methods. This could indicate that socio-economic factors
play a role in family planning decisions.
• Age and Education Level: Women between the ages of 25 and 35 with a good education level
are more likely to use contraceptives. This aligns with the understanding that
education and age can impact family planning decisions.
• Husband's Education: The education level of the husband is a significant factor
influencing whether the wife will use contraceptive methods. This reinforces the notion that
spousal education levels can be interconnected with family planning decisions.
• Understanding Non-Parental Contraceptive Users: It is important to understand the
viewpoint of women who do not have any children but are still using contraceptives, and to
explore the motivations and circumstances surrounding this demographic.
• Role of Media Exposure: Media exposure plays a key role in family planning decisions,
shaping perceptions and awareness regarding contraceptive methods.
• Health Ministry Outreach: The Republic of Indonesia Ministry of Health can reach out to
women who do not use contraceptives with education and awareness programmes, as a
proactive approach to address potential gaps in knowledge or accessibility.
• Analysis of Education Levels 8, 10, 11, & 12: Wives with education levels 8, 10, 11,
and 12 do not use contraceptives. Further investigation into the reasons behind this
pattern could provide valuable insights into cultural, social, or individual factors
influencing contraceptive decisions.
