Slides
Machine Learning
Samin Poudel*
*Computational Data Science and Engineering, North Carolina A&T State University,
Greensboro, NC 27409
The presentation is organized as follows:
➢ Introduction
➢ Data, Algorithms and Methods
➢ Result and Discussion
➢ Conclusion and Future Work
Introduction
*https://data-flair.training/blogs/machine-learning-in-healthcare/
• Analysis of clinical data can lead to timely diagnosis of a disease, which
in turn allows treatment of the patient to begin in time
• The traditional approach to diagnosing disease is generally costly and
time-consuming
• ML techniques can diagnose not only common diseases but also rare
diseases
• In general, a dataset table used to build an ML model for diagnosing a
disease has one column for each attribute and one column for the
class variable
Problem Statement:
• The accuracy of ML in diagnosing diseases is still a concern
• Improving the performance of ML in diagnosing disease is a hot topic in the
healthcare domain
• Different ML approaches perform differently on different healthcare datasets
• A way is needed to apply many state-of-the-art algorithms to the same dataset
in reasonable time with minimal lines of code, so that the search for the best
ML method to diagnose a particular disease can be pursued efficiently
Probable Solution:
• Libraries like AutoGluon can help find the best-performing ML approach, out
of many ML approaches, for diagnosing a disease from a given dataset with
few lines of code.
Data, Algorithms and Methods
Data:
• Dataset Used: Pima Indian Diabetes
• The dataset has 8 attributes and one class variable named Outcome.
• The Outcome variable takes the value 0 or 1; 1 means tested positive for diabetes
• The dataset has 768 instances, of which 268 tested positive for diabetes
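The class balance implied by these counts can be worked out directly. The short sketch below uses only the two numbers stated above (768 instances, 268 positive); the variable names are illustrative. It also computes the accuracy a trivial classifier would reach by always predicting the majority class, which is a useful floor when reading accuracy figures for this dataset.

```python
# Counts taken from the slide: 768 instances, 268 with Outcome = 1.
total_instances = 768
positive_instances = 268
negative_instances = total_instances - positive_instances

# Fraction of instances that tested positive for diabetes.
positive_rate = positive_instances / total_instances

# A classifier that always predicts the majority class (Outcome = 0)
# reaches this accuracy without learning anything from the attributes.
majority_baseline_accuracy = (
    max(positive_instances, negative_instances) / total_instances
)

print(f"negative instances: {negative_instances}")
print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Because roughly one third of the instances are positive, accuracy alone can look acceptable for a model that rarely predicts the positive class, which is one reason to read it alongside precision, recall and F1-score.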
Table 1. Statistical description of the data by attribute
      Pregnancies  Glucose  Blood Pressure  Skin Thickness  Insulin  BMI  Diabetes Pedigree Function  Age
min   0            0        0               0               0        0    0.078                       21
Library       ML Algorithm                                                      Number of ML approaches
Scikit-Learn  Random Forest Classifier, Decision Tree Classifier, Naïve Bayes   6
              Classifier, Perceptron, Multilayer Perceptron, Voting Classifier
• Overview of Methodology:
• Data loaded into an Amazon SageMaker Jupyter instance
• Data split into training and test sets
• Machine learning algorithms trained and tested using the scikit-learn and
AutoGluon libraries
• The training and test sets should be the same for every ML algorithm so
that the comparison among them is fair. This was achieved by fixing the
random seed when splitting the data into training and test sets
• The ML algorithms are evaluated on diabetes diagnosis using the
classification metrics accuracy, precision, recall and F1-score
• The detailed implementation of the ML algorithms is on the authors' GitHub page
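The role of the random seed in the split can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function name `train_test_split_seeded`, the 20% test fraction and the seed value 42 are assumptions made for the example, and the rows are placeholders rather than the real dataset. In scikit-learn, the `random_state` argument of `train_test_split` plays the same role.

```python
import random


def train_test_split_seeded(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then split into train and test.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    indices = list(range(len(rows)))
    # A local Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Placeholder rows standing in for the 768 dataset instances.
rows = list(range(768))

train_a, test_a = train_test_split_seeded(rows)
train_b, test_b = train_test_split_seeded(rows)

# With the same seed, every algorithm sees exactly the same split.
same_split = (train_a == train_b) and (test_a == test_b)
print("identical splits:", same_split)
print("train size:", len(train_a), "test size:", len(test_a))
```

Calling the function once and reusing the resulting sets for every model, or calling it repeatedly with the same seed, gives each algorithm identical training and test data.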
Result and Discussion
Evaluation of ML Algorithms:
• Although it is a classical ML algorithm, Naive Bayes performed best among
the ML algorithms, based on a combined analysis of all the evaluation metrics
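For reference, the four evaluation metrics used in this comparison (accuracy, precision, recall and F1-score) can all be derived from the confusion counts of a binary classifier. The sketch below is a plain-Python illustration; the function name `classification_metrics` and the label lists are made up for the example and are not results from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only (1 = tested positive for diabetes).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision and recall look only at the positive class, so two models with similar accuracy can differ noticeably on them; this is why a combined reading of all four metrics is more informative than accuracy alone.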
Figure 3. (a) Evaluation of AutoGluon ML algorithms when trained with accuracy as validation metric
(b) Evaluation of AutoGluon ML algorithms when trained with F1-score as validation metric
Conclusion and Future Work
Conclusion:
• Libraries like AutoGluon help compare the performance of many ML approaches in diagnosing a
disease for a given dataset with few lines of code.
• This helps in finding the best-performing ML algorithm for a particular dataset or a particular
type of disease. It also decreases the probability of an inaccurate diagnosis, which is an
important consideration when dealing with people's health.
• The performance of 20 ML approaches in diagnosing diabetes was tested on the Pima Indian
Diabetes Dataset
• For the dataset considered, the Naïve Bayes algorithm performed best among the algorithms
tested. This shows that using complex and computationally costly algorithms does not
necessarily improve the accuracy of diagnosing a disease.
Future Work:
• Future improvement in the performance of the ML models can start by computing the correlation
between each pair of attributes and dropping highly correlated attributes, because highly
correlated attributes can confuse a model in the learning phase.
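The proposed correlation-based pruning could look roughly like the sketch below. It is one possible reading of the idea, not a tested part of this work: the Pearson coefficient, the 0.9 threshold, the greedy keep-first strategy, the function names and the toy attribute values are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_highly_correlated(columns, threshold=0.9):
    """Greedily keep a column only if |r| with every kept column is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy attribute columns with made-up values (not from the diabetes dataset).
columns = {
    "attr_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "attr_b": [2.1, 3.9, 6.2, 8.0, 9.8],  # almost exactly 2 * attr_a
    "attr_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

kept = drop_highly_correlated(columns)
print("kept attributes:", kept)
```

The greedy strategy keeps the first attribute of each highly correlated group, so the result depends on column order; which attribute of a pair to drop is a design choice that would need to be settled on the real data.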
Thank you