DATA 51000 Classification Assignment

This report details a classification assignment using the Maternal Health Risk dataset from the UCI Machine Learning Repository, focusing on predicting maternal mortality risk levels from a set of health attributes. The analysis involved preprocessing, outlier detection, feature selection, and the application of multiple classification models, with Random Forest and Gradient Boosting yielding the highest accuracies. The key features identified, blood sugar, systolic blood pressure, and body temperature, contribute most to predicting risk levels in pregnant women.


DATA-51000 Data Mining and Analytics

Classification Assignment – Report

INFO
Name: Vorusu Pavan Kalyan
Email: [email protected]
Software/Programming Languages and Libraries Used: Orange Data Mining Tool

DATA INFORMATION
Source(s): UCI Machine Learning Repository.
Dataset citation: Ahmed, Marzia. (2023). Maternal Health Risk. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D
Source citation: M. Ahmed, M. A. Kashem, M. Rahman, and S. Khatun, "Review and Analysis of Risk Factor of Maternal Health in Remote Area Using the Internet of Things (IoT)," 2020. https://www.semanticscholar.org/paper/Review-and-Analysis-of-Risk-Factor-of-Maternal-in-Ahmed-Kashem/f175092a3b2217c9abca5bf5d91bab3c245c6b10
Date: 08/14/2023
Description (how was it collected and what it contains): The dataset was downloaded from the UCI Machine Learning Repository. The data were collected from hospitals, community clinics, and maternal health care centers in rural areas of Bangladesh through an IoT-based risk monitoring system, as part of the 2020 study by Marzia Ahmed, M. A. Kashem, Mostafijur Rahman, and S. Khatun cited above.
Original usage: The data were used to help determine the risk levels associated with maternal health problems in order to improve healthcare decisions, especially in rural areas. The dataset's main objective was most likely to create a predictive model that classifies maternal health scenarios according to risk factors; in the long run, this could help reduce the incidence of maternal deaths in rural regions by facilitating early diagnosis and care.
Size (number of instances and features): 1014 instances, 7 features.
Target variable (which variable will be predicted by the model?): RiskLevel. This target variable describes a pregnant woman's level of risk of maternal mortality; its classes are labeled "low risk", "mid risk", and "high risk".

DATA ATTRIBUTES (list all – start with the target column)


Column Name Data Type Example Value
RiskLevel Categorical (target) low risk
Age Numeric 25
SystolicBP Numeric 130
DiastolicBP Numeric 80
BS Numeric 15
BodyTemp Numeric 98
HeartRate Numeric 86

EXPLORATORY ANALYSIS
Missing values (for which features and how many? what should be done about this?)

No missing data.
Outliers (what was used to detect them? how many are there? what will you do about them and why? provide examples)
To detect outliers we used the Outliers widget [1] in the Orange data mining tool, applying the Local Outlier Factor method to the Maternal Health Risk dataset; the resulting outlier report is shown in the figure below.

Fig. 1 Outlier detection report.

From the above figure we observed that 79 outliers were detected by this method. We dropped the outliers and used only the inlier data for classification, because outliers may distort model results.
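The outlier step above can be sketched outside Orange as well. The following is an illustrative scikit-learn Local Outlier Factor run on small synthetic placeholder data; the array shapes, the planted outliers, and the n_neighbors value are assumptions for the sketch, not the report's actual Orange configuration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(loc=100, scale=10, size=(200, 3))  # synthetic stand-in rows
X[:5] += 80                                       # plant 5 obvious outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
inliers = X[labels == 1]                          # keep only inliers, as in the report
print(len(X) - len(inliers), "outliers flagged")
```

The inlier subset would then be passed on to the modeling stage, mirroring the drop-outliers choice described above.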
Distributions (are any skewed? provide examples with figures)

Fig. 2 Frequency distribution of feature Age.

From the above figure we observed that the feature Age is right-skewed.

Fig. 3 Frequency distribution of feature DiastolicBP.

From the above figure we observed that the feature DiastolicBP is left-skewed.

Fig. 4 Frequency distribution of feature BS (Blood Sugar).

From the above figure we observed that the feature BS (Blood Sugar) is right-skewed.
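The visual skewness judgments above can also be confirmed numerically. A minimal sketch with scipy's sample skewness on a synthetic right-skewed stand-in for Age (not the real column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
age = rng.exponential(scale=8, size=1000) + 18  # right-skewed placeholder "Age"
age_skew = skew(age)                            # positive => right-skewed
print(round(age_skew, 2))
```

A positive skewness statistic agrees with a right-skewed histogram; a negative one, as suggested for DiastolicBP, would indicate left skew.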
Correlations between features (any multicollinearity? show correlations for highly correlated feature pairs; are there any "leakage" features, i.e., values that would not normally be known and are giving away the target value?)

Fig. 5 Correlation values between the features.

From the above figure we observed that the features SystolicBP and DiastolicBP have the highest correlation, +0.787, which indicates a multicollinearity problem between these features. To avoid multicollinearity we dropped one of the pair, DiastolicBP, and kept the remaining features for the classification models.
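The correlation check and drop step can be sketched with pandas; the columns below are synthetic and correlated by construction (an assumption for illustration, not the dataset's actual values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sys_bp = rng.normal(120, 15, 500)
dia_bp = 0.6 * sys_bp + rng.normal(0, 8, 500)   # correlated by construction
bs = rng.normal(8, 2, 500)

df = pd.DataFrame({"SystolicBP": sys_bp, "DiastolicBP": dia_bp, "BS": bs})
corr = df.corr()                                 # pairwise Pearson correlations
print(corr.loc["SystolicBP", "DiastolicBP"].round(3))

# A pair correlated above roughly 0.7 signals multicollinearity; drop one column:
df_reduced = df.drop(columns=["DiastolicBP"])
```

Dropping one member of a highly correlated pair, as the report does, keeps the information while avoiding redundant features.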
Important features (state the method for determining this; list these features)
To determine the important features we applied a Random Forest classifier to the dataset; the features and their importance percentages are listed below.
BS: 35.65%
SystolicBP: 17.86%
Age: 16.10%
DiastolicBP: 13.63%
HeartRate: 10.37%
BodyTemp: 6.39%
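The Random Forest importance ranking above can be reproduced in code along these lines. This is an illustrative scikit-learn sketch on synthetic data whose target is driven by the first column; the feature names and values are placeholders, not the report's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))             # fake stand-ins for BS, SystolicBP, Age
y = (X[:, 0] > 0).astype(int)             # target driven mostly by column 0
names = ["BS", "SystolicBP", "Age"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")           # importances sum to 100%
```

Here the column that actually determines the target comes out on top, which is the same logic behind BS ranking first in the report.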
Additional notes (comments, visualizations)
The figure below shows the distribution of the target feature in the Maternal Health Risk dataset.

Fig. 6 Distribution of target feature RiskLevel.

The figure shows that the majority of the dataset consists of low-risk patients.
The box plot below shows the distribution of Age by risk level.

Fig. 7 Box plot showing the distribution of Age split by RiskLevel.

From the above figure we observed that women older than 36 years are more likely to be at high risk of maternal mortality.

CLASSIFICATION METHODOLOGY
Preprocessing steps (describe, e.g. missing value imputation, outlier removal, variable transformations, feature selection, normalization, etc.):
There are no missing values in the data. I removed the outliers and used the inlier data for modeling.
Feature selection:
I selected the features Age, SystolicBP, BS, BodyTemp, and HeartRate, and ignored DiastolicBP to avoid multicollinearity. The selected and ignored features are shown in the figure below.

Fig. 8 Selected features using the Select Columns widget.

Normalization:
I normalized the features to the interval [0, 1] with the Preprocess widget [2], so that all features have a consistent and comparable range for modeling.
Dataset splitting:
I split the dataset 80:20 with the Data Sampler widget [3] and labeled the two samples as training and testing datasets, as shown in the figure below.

Fig. 9 Dataset splitting using the Data Sampler widget.


Classification methods used (at least 3) along with hyperparameters that were adjusted:
Six classification models were applied to the dataset:
a. Decision Tree
b. Random Forest
c. Logistic Regression
d. Naïve Bayes
e. Gradient Boosting
f. kNN
The parameters adjusted in each model are shown in the table below.

Model                 Parameters Adjusted
Decision Tree         Minimum number of instances in leaves; do not split subsets smaller than; limit the maximal tree depth to
Random Forest         Number of trees; number of attributes considered at each split
Logistic Regression   Regularization type; strength (C)
Naïve Bayes           Default parameters
Gradient Boosting     Method; number of trees; learning rate
kNN                   Number of neighbors; metric; weight
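For illustration, the models and tuned parameters in the table above roughly correspond to the following scikit-learn estimators. This is a sketch under the assumption that Orange's options map onto these parameter names; it is not the report's actual Orange configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Example settings drawn from the report's tables; other combinations were tried.
models = {
    "Decision Tree": DecisionTreeClassifier(min_samples_leaf=5, max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_features=5),
    "Logistic Regression": LogisticRegression(penalty="l1", C=1.0,
                                              solver="liblinear"),
    "Naive Bayes": GaussianNB(),                       # default parameters
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100,
                                                    learning_rate=0.5),
    "kNN": KNeighborsClassifier(n_neighbors=10, metric="manhattan",
                                weights="distance"),
}
print(len(models), "models configured")
```

Each estimator would then be fit on the normalized training split and scored as described in the evaluation section below.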
Classification evaluation (how will you know that the classifier has done well?):
Accuracy scores: the model with the highest accuracy is considered the best performer. Other metrics such as precision, recall, and F1-score are also used to evaluate the classification models.
Confusion matrix: the counts of true positives, true negatives, false positives, and false negatives help assess each model's performance.
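The evaluation just described, 10-fold cross-validated accuracy plus a confusion matrix, can be sketched as follows; this uses scikit-learn on synthetic data (the report computed these figures with Orange's Test and Score workflow, and the data here is a learnable placeholder, not the real dataset).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # learnable synthetic target

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)     # 10-fold accuracy estimates
preds = cross_val_predict(clf, X, y, cv=10)    # out-of-fold predictions
cm = confusion_matrix(y, preds)                # rows: true class, cols: predicted

print(f"accuracy: {scores.mean():.3f}")
print(cm)
```

The diagonal of the confusion matrix holds the correctly classified counts, which is how the report later judges which models handle high-risk patients best.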
RESULTS
(for each classification method, the table below gives 10-fold cross-validation results for the tested combinations of hyperparameter values; confusion matrices for each method with its best hyperparameter settings follow)

Method               Hyperparameter settings                                          Accuracy (10-fold c.v.)
Random Forest        Trees = 100, attributes considered at each split = 5             94.2%
Random Forest        Trees = 20, attributes considered at each split = 10             93.4%
Random Forest        Trees = 50, attributes considered at each split = 5              94.1%
Random Forest        Trees = 100, replicable training = yes                           94.3%
kNN                  Neighbors = 5, metric = Manhattan, weight = by distances         92.1%
kNN                  Neighbors = 5, metric = Euclidean, weight = uniform              85.6%
kNN                  Neighbors = 10, metric = Euclidean, weight = uniform             85.5%
kNN                  Neighbors = 10, metric = Manhattan, weight = by distances        93.4%
Decision Tree        Min. instances in leaves = 20, min. subset size = 5, max depth = 5, binary tree = yes   83.9%
Decision Tree        Min. instances in leaves = 10, min. subset size = 5, max depth = 5, binary tree = no    85.3%
Decision Tree        Min. instances in leaves = 5, min. subset size = 5, max depth = 5, binary tree = yes    86.0%
Decision Tree        Min. instances in leaves = 2, min. subset size = 5, max depth = 5, binary tree = yes    85.9%
Gradient Boosting    Trees = 50, learning rate = 0.300, replicable training = yes     92.8%
Gradient Boosting    Trees = 50, learning rate = 0.100, replicable training = yes     89.8%
Gradient Boosting    Trees = 100, learning rate = 0.500, replicable training = yes    93.8%
Gradient Boosting    Trees = 50, learning rate = 0.500, replicable training = yes     92.7%
Logistic Regression  Regularization = Lasso, C = 10                                   80.2%
Logistic Regression  Regularization = Ridge, C = 10                                   80.1%
Logistic Regression  Regularization = none, C = 10                                    79.9%
Logistic Regression  Regularization = Lasso, C = 1                                    80.4%
Logistic Regression  Regularization = Lasso, C = 0.001                                50.0%
Naïve Bayes          Default parameters                                               81.6%

The best accuracy achieved by each model was: Random Forest 94.3%, kNN 93.4%, Decision Tree 86.0%, Gradient Boosting 93.8%, Logistic Regression 80.4%, and Naïve Bayes 81.6%.
The figure below shows the 10-fold cross-validation results of the different models.

Fig. 10 Stratified 10-fold cross-validation results of different models.

From the above figure, the Random Forest model performed best, with an accuracy of 94.3%, followed by the Gradient Boosting model at 93.8%. The table below shows the confusion matrix and important features obtained for each model.
TABLE I. CONFUSION MATRICES AND TOP-3 IMPORTANT FEATURES FROM 10-FOLD CROSS-VALIDATION
Model                 Confusion matrix        Important features
Random Forest
kNN
Decision Tree
Gradient Boosting
Logistic Regression
Naïve Bayes

From the above table, we observed that the highest number of high-risk patients was predicted correctly by the Random Forest and Gradient Boosting models. These models excel at classifying high-risk patients, which helps in taking preventive measures against maternal mortality in pregnant women at an early stage. Gradient Boosting also excels at classifying mid-risk patients. The table also shows that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in determining patients' risk levels correctly.
The trained models were then evaluated on the test data; the performance of the different models is shown below.

Fig. 11 Model performances on test data.

From the above figure we observed that the Gradient Boosting and Random Forest models are the best performers on the test data, with accuracies of 94.1% and 93.9% respectively.
The confusion matrices of each model on the test data are shown below.

TABLE II. CONFUSION MATRICES OF THE DIFFERENT MODELS ON TEST DATA

Random Forest Model    kNN Model    Decision Tree Model
Logistic Regression Model    Gradient Boosting Model    Naïve Bayes Model

From the above table, we observed that the Random Forest and Gradient Boosting models were the best at classifying the risk levels of maternal mortality in pregnant women, because these models have the highest counts of true positives.

Conclusion:
Based on the above results, we conclude that the Gradient Boosting and Random Forest models are effective at predicting the risk levels of maternal mortality in pregnant women. We also found that the features BS (blood sugar), SystolicBP, and BodyTemp play a major role in predicting those risk levels.

References:
[1] "Outliers — Orange Visual Programming documentation." https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/data/outliers.html
[2] "Preprocess," Orange Widget Catalog, Orange Data Mining. https://orangedatamining.com/widget-catalog/transform/preprocess/
[3] "Data Sampler — Orange Visual Programming documentation." https://orange3.readthedocs.io/projects/orange-visual-programming/en/master/widgets/data/datasampler.html
