DATA 51000 Classification Assignment
INFO
Name Vorusu Pavan Kalyan
Email [email protected]
DATA INFORMATION
Source(s) Source - UCI Machine Learning Repository
Dataset citation - Ahmed, Marzia. (2023). Maternal Health Risk. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D.
Source citation - M. Ahmed, M. A. Kashem, M. Rahman, and S. Khatun, “Review and Analysis of Risk Factor of Maternal Health in Remote Area Using the Internet of Things (IoT),” 2020. https://www.semanticscholar.org/paper/Review-and-Analysis-of-Risk-Factor-of-Maternal-in-Ahmed-Kashem/f175092a3b2217c9abca5bf5d91bab3c245c6b10
Date 08/14/2023
Description - how was it collected and what it contains
The dataset cited above was downloaded from the UCI Machine Learning Repository. The data was collected from different hospitals, community clinics, and maternal health care centers in the rural areas of Bangladesh through an IoT-based risk monitoring system. It was acquired as part of a 2020 study by Marzia Ahmed, M. A. Kashem, Mostafijur Rahman, and S. Khatun titled "Review and Analysis of Risk Factor of Maternal Health in Remote Area Using the Internet of Things (IoT)."
Original usage
This information was used to help determine the risk levels associated with maternal health problems, in order to improve healthcare decisions, especially in rural areas. The dataset's main objective was most likely to build a prediction model that classifies maternal health cases according to risk factors. In the long run, this might help reduce the incidence of maternal deaths in rural regions by facilitating early diagnosis and care.
Size (number of instances and number of features)
Number of instances - 1014
Number of features - 7
Target variable (which variable will be predicted by the model?)
Target variable - RiskLevel.
In this dataset, the target variable “RiskLevel” describes the level of maternal mortality risk in pregnant women. Its classes are labeled “low risk”, “mid risk”, and “high risk”.
EXPLORATORY ANALYSIS
Missing values (for which features and how many? what should be done about this?)
No missing data.
Outliers (what was used to detect them? how many are there? what will you do about them and why?) - provide examples
To detect outliers in the Maternal Health Risk dataset, we used the Outliers widget [1] in the Orange data mining tool with the Local Outlier Factor method. The resulting outlier report is shown in the figure below.
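For readers who do not use Orange, the Local Outlier Factor step can be approximated in code. The block below is a minimal sketch, assuming scikit-learn's LocalOutlierFactor as a stand-in for the Orange widget. The data is synthetic (not the Maternal Health Risk dataset), and the n_neighbors and contamination values are illustrative assumptions, not the settings used in this report.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for three numeric features (NOT the real dataset).
rng = np.random.default_rng(0)
inlier_rows = rng.normal(loc=[30.0, 120.0, 80.0],
                         scale=[5.0, 10.0, 8.0],
                         size=(200, 3))
# A few hand-made extreme rows that act as obvious outliers.
extreme_rows = np.array([[70.0, 200.0, 150.0],
                         [10.0, 60.0, 30.0],
                         [65.0, 190.0, 20.0]])
X = np.vstack([inlier_rows, extreme_rows])

# n_neighbors and contamination are illustrative assumptions.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)  # +1 marks an inlier, -1 marks an outlier

inliers = X[labels == 1]    # rows that would be kept for modeling
outliers = X[labels == -1]  # rows that would be removed
print(f"inliers: {len(inliers)}, outliers: {len(outliers)}")
```

Only the rows labeled +1 would be passed on to the modeling step, which mirrors keeping the inliers from the outlier report.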
CLASSIFICATION METHODOLOGY
Preprocessing steps (describe, e.g. missing value imputation, outlier removal, variable transformations, feature selection, normalization, etc.)
There are no missing values in the data.
I removed the outliers and kept only the inliers for modeling.
Feature selection:
I selected the features Age, SystolicBP, BS, BodyTemp, and HeartRate and ignored DiastolicBP to avoid multicollinearity. The selected and ignored features are shown in the figure below.
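One common way to spot multicollinearity before dropping a feature is to look at pairwise correlations. The block below is a minimal sketch of that check, not the procedure used in this report. The columns are synthetic, with DiastolicBP deliberately generated to track SystolicBP, and the 0.8 threshold is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in columns (NOT the real dataset).
# DiastolicBP is deliberately built to track SystolicBP.
systolic = rng.normal(115.0, 18.0, n)
diastolic = 0.6 * systolic + rng.normal(0.0, 5.0, n)
age = rng.normal(30.0, 10.0, n)

names = ["Age", "SystolicBP", "DiastolicBP"]
# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([age, systolic, diastolic]))

threshold = 0.8  # illustrative cut-off, an assumption
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(high_pairs)  # pairs whose absolute correlation exceeds the threshold
```

When a pair exceeds the threshold, one of the two features is a candidate for removal, which is the reasoning behind ignoring DiastolicBP here.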
In the table above, the highlighted rows are the parameter settings that gave the higher accuracy levels.
The figure below shows the 10-fold cross-validation results of the different models.
[Figure: 10-fold cross-validation results for the Decision Tree, Gradient Boosting, Logistic Regression, and Naïve Bayes models]
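The 10-fold cross-validation comparison can also be sketched in code. The example below is an illustration under stated assumptions: it uses scikit-learn rather than Orange, a synthetic three-class dataset from make_classification rather than the Maternal Health Risk data, and default or illustrative hyperparameters rather than the tuned settings from this report. The accuracies it prints are therefore not comparable to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with five features (NOT the real dataset).
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# Hyperparameters are illustrative assumptions, not the report's settings.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(std {fold_scores.std():.3f})")
```

Reporting the mean together with the spread across folds makes it easier to judge whether the difference between two models is meaningful.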
From the table above, we observed that the Random Forest and Gradient Boosting models correctly predicted the highest number of high-risk patients. These models therefore excel at identifying high-risk patients, which supports taking preventive measures against maternal mortality in pregnant women at an early stage. Gradient Boosting also excels at classifying mid-risk patients. The table also suggests that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in determining patients' risk levels correctly.
The trained models were then evaluated on the test data; the performance of the different models is shown below.
Fig. 10 Model performances on test data.
From the figure above, we observed that the Gradient Boosting and Random Forest models performed best on the test data, with accuracies of 94.1% and 93.9%, respectively.
The confusion matrices obtained for each model on the test data are shown below.
TABLE II. CONFUSION MATRICES OF DIFFERENT MODELS AFTER APPLYING ON TEST DATA.
From the table above, we observed that the Random Forest and Gradient Boosting models were the best performers in classifying the risk levels of maternal mortality in pregnant women, because these models have the highest number of true positives, that is, correctly classified patients in each risk class.
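To make the reading of a confusion matrix concrete, the sketch below builds one for the three risk classes and counts the correctly classified cases per class (the diagonal entries). The labels are made up for illustration and are not results from this report; the helper function confusion_matrix is a hypothetical name defined here, not a library call.

```python
from collections import Counter

CLASSES = ["low risk", "mid risk", "high risk"]


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


# Made-up labels for illustration only (NOT results from the report).
y_true = ["low risk", "low risk", "mid risk", "mid risk",
          "high risk", "high risk", "high risk", "low risk"]
y_pred = ["low risk", "mid risk", "mid risk", "mid risk",
          "high risk", "high risk", "mid risk", "low risk"]

matrix = confusion_matrix(y_true, y_pred, CLASSES)

# Diagonal entries are the correctly classified cases for each class.
correct_per_class = {c: matrix[i][i] for i, c in enumerate(CLASSES)}
accuracy = sum(correct_per_class.values()) / len(y_true)

for cls, row in zip(CLASSES, matrix):
    print(f"{cls:>9}: {row}")
print("correct per class:", correct_per_class)
print("accuracy:", accuracy)
```

In this toy example one high-risk case is predicted as mid risk; in a screening setting that kind of error is the most costly, which is why the high-risk row deserves the closest attention.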
Conclusion:
Based on the results above, we conclude that the Gradient Boosting and Random Forest models help in predicting the risk levels of maternal mortality in pregnant women. We also found that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in predicting the risk levels.
References:
[1] “Outliers - Orange Visual Programming 3 documentation.” https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/data/outliers.html
[2] “Orange Data Mining - Preprocess,” Orange Data Mining. https://orangedatamining.com/widget-catalog/transform/preprocess/
[3] “Data Sampler - Orange Visual Programming 3 documentation.” https://orange3.readthedocs.io/projects/orange-visual-programming/en/master/widgets/data/datasampler.html