DATA 51000 Classification Assignment

This report details a classification assignment using the Maternal Health Risk dataset from the UCI Machine Learning Repository, focusing on predicting maternal mortality risk levels from a set of health attributes. The analysis involved preprocessing, outlier detection, feature selection, and the application of multiple classification models, with Random Forest and Gradient Boosting yielding the highest accuracies. The key features identified, blood sugar, systolic blood pressure, and body temperature, contribute most to predicting risk levels in pregnant women.


DATA-51000 Data Mining and Analytics

Classification Assignment – Report

INFO
Name: Vorusu Pavan Kalyan
Email: [email protected]
Software/Programming Languages and Libraries Used: Orange Data Mining Tool

DATA INFORMATION
Source(s): UCI Machine Learning Repository.
Dataset citation: Ahmed, Marzia. (2023). Maternal Health Risk. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D
Source citation: M. Ahmed, M. A. Kashem, M. Rahman, and S. Khatun, "Review and Analysis of Risk Factor of Maternal Health in Remote Area Using the Internet of Things (IoT)," 2020. https://www.semanticscholar.org/paper/Review-and-Analysis-of-Risk-Factor-of-Maternal-in-Ahmed-Kashem/f175092a3b2217c9abca5bf5d91bab3c245c6b10
Date: 08/14/2023
Description (how was it collected and what it contains): The dataset was downloaded from the UCI Machine Learning Repository. The data were collected from hospitals, community clinics, and maternal health care centers in rural areas of Bangladesh through an IoT-based risk monitoring system, as part of the 2020 study by Marzia Ahmed, M. A. Kashem, Mostafijur Rahman, and S. Khatun cited above.
Original usage: The data were used to help determine the risk levels associated with maternal health problems in order to improve healthcare decisions, especially in rural areas. The dataset's main objective was most likely to create a predictive model that classifies maternal health scenarios according to risk factors; in the long run, this could help reduce the incidence of maternal deaths in rural regions by facilitating early diagnosis and care.
Size (number of instances and features): 1014 instances, 7 features.
Target variable (which variable will be predicted by the model?): RiskLevel. This target variable describes a pregnant woman's level of risk of maternal mortality; its classes are labeled "low risk", "mid risk", and "high risk".

DATA ATTRIBUTES (list all – start with the target column)


Column Name Data Type Example Value
RiskLevel Categorical (target) low risk
Age Numeric 25
SystolicBP Numeric 130
DiastolicBP Numeric 80
BS Numeric 15
BodyTemp Numeric 98
HeartRate Numeric 86

EXPLORATORY ANALYSIS
Missing values (for which features and how many? what should be done about this?)

No missing data.
Outliers (what was used to detect them? how many are there? what will you do about them and why? provide examples)
To detect outliers we used the Outliers widget [1] in the Orange data mining tool, applying the Local Outlier Factor method to the Maternal Health Risk dataset; the resulting outlier report is shown in the figure below.

Fig. 1 Outlier detection report.

From the above figure we observed that 79 outliers were detected by this method. We dropped the outliers and used only the inlier data for classification, because outliers may distort model results.
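The outlier step above can be sketched outside Orange as well. The following is an illustrative scikit-learn Local Outlier Factor run on small synthetic placeholder data; the array shapes, the planted outliers, and the n_neighbors value are assumptions for the sketch, not the report's actual Orange configuration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=10, size=(200, 3))  # synthetic stand-in rows
X[:5] += 80                                       # plant 5 obvious outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
inliers = X[labels == 1]                          # keep only inliers, as in the report
print(len(X) - len(inliers), "outliers flagged")
```

The inlier subset would then be passed on to the modeling stage, mirroring the drop-outliers choice described above.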
Distributions (are any skewed? provide examples with figures)

Fig. 2 Frequency distribution of feature Age.

From the above figure we observed that the feature Age is right-skewed.

Fig. 3 Frequency distribution of feature DiastolicBP.

From the above figure we observed that the feature DiastolicBP is left-skewed.

Fig. 4 Frequency distribution of feature BS (Blood Sugar).

From the above figure we observed that the feature BS (Blood Sugar) is right-skewed.
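The visual skewness judgments above can also be confirmed numerically. A minimal sketch with scipy's sample skewness on a synthetic right-skewed stand-in for Age (not the real column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
age = rng.exponential(scale=8, size=1000) + 18  # right-skewed placeholder "Age"
age_skew = skew(age)                            # positive => right-skewed
print(round(age_skew, 2))
```

A positive skewness statistic agrees with a right-skewed histogram; a negative one, as suggested for DiastolicBP, would indicate left skew.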
Correlations between features (any multicollinearity? show correlations for highly correlated feature pairs; are there any "leakage" features, i.e., values that would not normally be known and are giving away the target value?)

Fig. 5 Correlation values between the features.

From the above figure we observed that the features SystolicBP and DiastolicBP have the highest correlation, +0.787, which indicates a multicollinearity problem between these features. To avoid multicollinearity we dropped one of the pair, DiastolicBP, and kept the remaining features for the classification models.
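The correlation check and drop step can be sketched with pandas; the columns below are synthetic and correlated by construction (an assumption for illustration, not the dataset's actual values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sys_bp = rng.normal(120, 15, 500)
dia_bp = 0.6 * sys_bp + rng.normal(0, 8, 500)   # correlated by construction
bs = rng.normal(8, 2, 500)

df = pd.DataFrame({"SystolicBP": sys_bp, "DiastolicBP": dia_bp, "BS": bs})
corr = df.corr()                                 # pairwise Pearson correlations
print(corr.loc["SystolicBP", "DiastolicBP"].round(3))

# A pair correlated above roughly 0.7 signals multicollinearity; drop one column:
df_reduced = df.drop(columns=["DiastolicBP"])
```

Dropping one member of a highly correlated pair, as the report does, keeps the information while avoiding redundant features.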
Important features (state the method for determining this; list these features)
To determine the important features we applied a Random Forest classifier to the dataset; the features and their importance percentages are listed below.
BS: 35.65%
SystolicBP: 17.86%
Age: 16.10%
DiastolicBP: 13.63%
HeartRate: 10.37%
BodyTemp: 6.39%
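The Random Forest importance ranking above can be reproduced in code along these lines. This is an illustrative scikit-learn sketch on synthetic data whose target is driven by the first column; the feature names and values are placeholders, not the report's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))             # fake stand-ins for BS, SystolicBP, Age
y = (X[:, 0] > 0).astype(int)             # target driven mostly by column 0
names = ["BS", "SystolicBP", "Age"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")           # importances sum to 100%
```

Here the column that actually determines the target comes out on top, which is the same logic behind BS ranking first in the report.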
Additional notes (comments, visualizations)
The figure below shows the distribution of the target feature in the Maternal Health Risk dataset.

Fig. 6 Distribution of target feature RiskLevel.

The figure shows that the majority of the dataset consists of low-risk patients.
The box plot below shows the distribution of Age by risk level.

Fig. 7 Box plot showing the distribution of Age split by RiskLevel.

From the above figure we observed that women older than 36 years are more likely to be at high risk of maternal mortality.

CLASSIFICATION METHODOLOGY
Preprocessing steps (describe, e.g. missing value imputation, outlier removal, variable transformations, feature selection, normalization, etc.):
There are no missing values in the data. I removed the outliers and used the inlier data for modeling.
Feature selection:
I selected the features Age, SystolicBP, BS, BodyTemp, and HeartRate, and ignored DiastolicBP to avoid multicollinearity. The selected and ignored features are shown in the figure below.

Fig. 8 Selected features using the Select Columns widget.

Normalization:
I normalized the features to the interval [0, 1] with the Preprocess widget [2], so that all features have a consistent and comparable range for modeling.
Dataset splitting:
I split the dataset 80:20 with the Data Sampler widget [3] and labeled the two samples as training and testing datasets, as shown in the figure below.

Fig. 9 Dataset splitting using the Data Sampler widget.


Classification methods used (at least 3) along with hyperparameters that were adjusted:
Six classification models were applied to the dataset:
a. Decision Tree
b. Random Forest
c. Logistic Regression
d. Naïve Bayes
e. Gradient Boosting
f. kNN
The parameters adjusted in each model are shown in the table below.

Model                 Parameters Adjusted
Decision Tree         Minimum number of instances in leaves; do not split subsets smaller than; limit the maximal tree depth to
Random Forest         Number of trees; number of attributes considered at each split
Logistic Regression   Regularization type; strength (C)
Naïve Bayes           Default parameters
Gradient Boosting     Method; number of trees; learning rate
kNN                   Number of neighbors; metric; weight
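For illustration, the models and tuned parameters in the table above roughly correspond to the following scikit-learn estimators. This is a sketch under the assumption that Orange's options map onto these parameter names; it is not the report's actual Orange configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Example settings drawn from the report's tables; other combinations were tried.
models = {
    "Decision Tree": DecisionTreeClassifier(min_samples_leaf=5, max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_features=5),
    "Logistic Regression": LogisticRegression(penalty="l1", C=1.0,
                                              solver="liblinear"),
    "Naive Bayes": GaussianNB(),                       # default parameters
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100,
                                                    learning_rate=0.5),
    "kNN": KNeighborsClassifier(n_neighbors=10, metric="manhattan",
                                weights="distance"),
}
print(len(models), "models configured")
```

Each estimator would then be fit on the normalized training split and scored as described in the evaluation section below.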
Classification evaluation (how will you know that the classifier has done well?):
Accuracy scores: the model with the highest accuracy is considered the best performer. Other metrics such as precision, recall, and F1-score are also used to evaluate the classification models.
Confusion matrix: the counts of true positives, true negatives, false positives, and false negatives help assess each model's performance.
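The evaluation just described, 10-fold cross-validated accuracy plus a confusion matrix, can be sketched as follows; this uses scikit-learn on synthetic data (the report computed these figures with Orange's Test and Score workflow, and the data here is a learnable placeholder, not the real dataset).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # learnable synthetic target

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)     # 10-fold accuracy estimates
preds = cross_val_predict(clf, X, y, cv=10)    # out-of-fold predictions
cm = confusion_matrix(y, preds)                # rows: true class, cols: predicted

print(f"accuracy: {scores.mean():.3f}")
print(cm)
```

The diagonal of the confusion matrix holds the correctly classified counts, which is how the report later judges which models handle high-risk patients best.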
RESULTS
(for each classification method, the table below gives 10-fold cross-validation results for the tested combinations of hyperparameter values; confusion matrices for each method with its best hyperparameter settings follow)

Method               Hyperparameter settings                                          Accuracy (10-fold c.v.)
Random Forest        Trees = 100, attributes considered at each split = 5             94.2%
Random Forest        Trees = 20, attributes considered at each split = 10             93.4%
Random Forest        Trees = 50, attributes considered at each split = 5              94.1%
Random Forest        Trees = 100, replicable training = yes                           94.3%
kNN                  Neighbors = 5, metric = Manhattan, weight = by distances         92.1%
kNN                  Neighbors = 5, metric = Euclidean, weight = uniform              85.6%
kNN                  Neighbors = 10, metric = Euclidean, weight = uniform             85.5%
kNN                  Neighbors = 10, metric = Manhattan, weight = by distances        93.4%
Decision Tree        Min. instances in leaves = 20, min. subset size = 5, max depth = 5, binary tree = yes   83.9%
Decision Tree        Min. instances in leaves = 10, min. subset size = 5, max depth = 5, binary tree = no    85.3%
Decision Tree        Min. instances in leaves = 5, min. subset size = 5, max depth = 5, binary tree = yes    86.0%
Decision Tree        Min. instances in leaves = 2, min. subset size = 5, max depth = 5, binary tree = yes    85.9%
Gradient Boosting    Trees = 50, learning rate = 0.300, replicable training = yes     92.8%
Gradient Boosting    Trees = 50, learning rate = 0.100, replicable training = yes     89.8%
Gradient Boosting    Trees = 100, learning rate = 0.500, replicable training = yes    93.8%
Gradient Boosting    Trees = 50, learning rate = 0.500, replicable training = yes     92.7%
Logistic Regression  Regularization = Lasso, C = 10                                   80.2%
Logistic Regression  Regularization = Ridge, C = 10                                   80.1%
Logistic Regression  Regularization = none, C = 10                                    79.9%
Logistic Regression  Regularization = Lasso, C = 1                                    80.4%
Logistic Regression  Regularization = Lasso, C = 0.001                                50.0%
Naïve Bayes          Default parameters                                               81.6%

The best accuracy achieved by each model was: Random Forest 94.3%, kNN 93.4%, Decision Tree 86.0%, Gradient Boosting 93.8%, Logistic Regression 80.4%, and Naïve Bayes 81.6%.
The figure below shows the 10-fold cross-validation results of the different models.

Fig. 10 Stratified 10-fold cross-validation results of different models.

From the above figure, the Random Forest model performed best, with an accuracy of 94.3%, followed by the Gradient Boosting model at 93.8%. The table below shows the confusion matrix and important features obtained for each model.
TABLE I. CONFUSION MATRICES AND TOP-3 IMPORTANT FEATURES FROM 10-FOLD CROSS-VALIDATION
Model                 Confusion matrix        Important features
Random Forest
kNN
Decision Tree
Gradient Boosting
Logistic Regression
Naïve Bayes

From the above table, we observed that the highest number of high-risk patients was predicted correctly by the Random Forest and Gradient Boosting models. These models excel at classifying high-risk patients, which helps in taking preventive measures against maternal mortality in pregnant women at an early stage. Gradient Boosting also excels at classifying mid-risk patients. The table also shows that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in determining patients' risk levels correctly.
The trained models were then evaluated on the test data; the performance of the different models is shown below.

Fig. 11 Model performances on test data.

From the above figure we observed that the Gradient Boosting and Random Forest models are the best performers on the test data, with accuracies of 94.1% and 93.9% respectively.
The confusion matrices of each model on the test data are shown below.

TABLE II. CONFUSION MATRICES OF THE DIFFERENT MODELS ON TEST DATA

Random Forest Model    kNN Model    Decision Tree Model
Logistic Regression Model    Gradient Boosting Model    Naïve Bayes Model

From the above table, we observed that the Random Forest and Gradient Boosting models were the best at classifying the risk levels of maternal mortality in pregnant women, because these models have the highest counts of true positives.

Conclusion:
Based on the above results, we conclude that the Gradient Boosting and Random Forest models are effective at predicting the risk levels of maternal mortality in pregnant women. We also found that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in predicting those risk levels.

References:
[1] "Outliers — Orange Visual Programming documentation." https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/data/outliers.html
[2] "Preprocess," Orange Widget Catalog, Orange Data Mining. https://orangedatamining.com/widget-catalog/transform/preprocess/
[3] "Data Sampler — Orange Visual Programming documentation." https://orange3.readthedocs.io/projects/orange-visual-programming/en/master/widgets/data/datasampler.html
