International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 05 Issue: 04 | Apr-2018 www.irjet.net p-ISSN: 2395-0072

Diagnosis of Liver Disease Using Machine Learning Techniques


Joel Jacob1, Joseph Chakkalakal Mathew2, Johns Mathew3, Elizabeth Issac4
1,2,3 Dept. of Computer Science and Engineering, MACE, Kerala, India
4 Assistant Professor, Dept. of Computer Science and Engineering, MACE, Kerala, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Diagnosis of liver disease at a preliminary stage is important for better treatment. Predicting the disease in its early stages is a very challenging task for medical researchers because the symptoms are subtle, and often they become apparent only when it is too late. To overcome this issue, this project aims to improve liver disease diagnosis using machine learning approaches. The main objective of this research is to use classification algorithms to distinguish liver patients from healthy individuals. The project also aims to compare the classification algorithms based on their performance. To serve the medical community in the diagnosis of liver disease, a graphical user interface is developed using Python. The GUI can be readily used by doctors and medical practitioners as a screening tool for liver disease.

Key Words: Machine Learning, Liver Patients, Classification algorithms

1. INTRODUCTION

Liver problems are not easily discovered at an early stage, because the liver continues to function normally even when it is partially damaged. Early diagnosis of liver problems increases a patient's survival rate. The risk of liver failure is high among Indians. It is expected that by 2025 India may become the world capital for liver diseases. The widespread occurrence of liver infection in India is attributed to a sedentary lifestyle, increased alcohol consumption and smoking. There are about 100 types of liver infections. Therefore, developing a system that aids in the diagnosis of the disease would be of great advantage in the medical field. Such systems will help physicians make accurate decisions about patients, and automatic classification tools for liver diseases (probably mobile enabled or web enabled) can reduce the patient queue at liver experts such as endocrinologists.

Classification techniques are very popular in medical diagnosis and disease prediction. Michael J Sorich [1] reported that the SVM classifier produces the best predictive performance for chemical datasets. Lung-Cheng Huang reported that the Naïve Bayesian classifier performs better than SVM and C4.5 on the CDC chronic fatigue syndrome dataset. Paul R Harper [2] reported that there is not necessarily a single best classification tool; instead, the best performing algorithm depends on the features of the dataset to be analyzed.

The main objective of this research is to use classification algorithms to distinguish liver patients from healthy individuals. In this study, four classification algorithms, Logistic Regression, Support Vector Machines (SVM), K Nearest Neighbor (KNN) and artificial neural networks (ANN), are compared based on their performance on the liver patient data. Further, the model with the highest accuracy is implemented as a user-friendly Graphical User Interface (GUI) using the Tkinter package in Python. The GUI can be readily used by doctors and medical practitioners as a screening tool for liver disease.

The dataset used is the Indian Liver Patient Dataset (ILPD), selected from the UCI Machine Learning Repository. It is a sample of the Indian population collected from the Andhra Pradesh region and comprises 583 patient records.

2. RELATED WORKS

In recent research, several neural network models have been developed to aid physicians in the diagnosis of liver diseases, such as diagnosis support systems [3], expert systems, intelligent diagnosis systems, and hybrid intelligent systems. In addition, Christopher N. [4] proposed a system to diagnose medical diseases considering six benchmarks: liver disorder, heart disease, diabetes, breast cancer, hepatitis and lymph. The authors developed two systems based on WSO and C4.5, obtaining on the liver disorder dataset an accuracy of 64.60% with 19 rules from WSO and 62.89% with 43 rules from C4.5. Ramana [5] also made a critical study of liver disease diagnosis by evaluating selected classification algorithms: the naïve Bayes classifier, C4.5, a backpropagation neural network (BPNN), K-NN and the support vector machine. The authors obtained 51.59% accuracy with the naïve Bayes classifier, 55.94% with C4.5, 66.66% with BPNN, 62.6% with KNN and 62.6% with the support vector machine.

The poor performance in training and testing on the liver disorder dataset resulted from insufficient data. Therefore, Sug [6] suggested a method based on oversampling of minority classes to compensate for the insufficiency of data. The author considered two decision tree algorithms, C4.5 and CART [7], and used the BUPA liver disorder dataset for the experiments.

© 2018, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 4011

These previously designed systems have been adequate, but more work has to be done on their recognition rate to achieve better accuracy in the diagnosis of liver disease. Developing a system that performs better than the previous works will make diagnosis more effective and efficient, help prevent misdiagnosis of liver disorders, and help in providing the best and required medication for the patient.

3. IMPLEMENTATION

i. DATASET

The Indian Liver Patient Dataset comprises 10 different attributes of 583 patients. Each patient is labelled as either 1 or 2 on the basis of liver disease. A detailed description of the dataset is shown in Table-1, which lists each attribute and its type. As is visible from the table, all the features except Sex are numeric. The feature Sex is converted to a numeric value (0 and 1) in the data pre-processing step.

Table-1 Dataset Description

No.   ATTRIBUTES                    ATTRIBUTE TYPE
1.    Age                           Numeric
2.    Sex                           Nominal
3.    Total Bilirubin               Numeric
4.    Direct Bilirubin              Numeric
5.    Alkaline Phosphatase          Numeric
6.    Alamine Phosphatase           Numeric
7.    Total Proteins                Numeric
8.    Albumin                       Numeric
9.    Albumin and Globulin Ratio    Numeric
10.   Result                        Numeric (1,2)

ii. DATA-PREPROCESSING

Data pre-processing is an important step in solving every machine learning problem. Most datasets used in machine learning problems need to be processed, cleaned or transformed before a machine learning algorithm can be trained on them. The most commonly used pre-processing techniques are few: missing value imputation, encoding of categorical variables, scaling, etc. These techniques are easy to understand, but when we actually deal with the data, things often get clunky. Every dataset is different and poses unique challenges. All features except Gender are numeric. The last column, Disease, is the label (with '1' representing presence of disease and '2' representing absence of disease). The total number of data points is 583, with 416 liver patient records and 167 non-liver patient records. In the description of this dataset, it is observed that some values are null in the Albumin and Globulin Ratio column. These null values are replaced with the mean value of the column.

iii. CLASSIFICATION TECHNIQUES

a) SVM

SVM aims to find an optimal hyperplane that separates the data into different classes. The scikit-learn package in Python is used for implementing SVM. The pre-processed data is split into a training set and a test set, comprising 75% and 25% of the total dataset respectively. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

b) LOGISTIC REGRESSION

Logistic regression is one of the simpler classification models. Because of its parametric nature it can, to some extent, be interpreted by looking at its parameters, making it useful when experimenters want to look at relationships between variables. A parametric model can be described entirely by a vector of parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$. An example of a parametric model would be a straight line y = kx + m, where the parameters are k and m. With known parameters the entire model can be recreated. Logistic regression is a parametric model where the parameters are coefficients of the predictor variables, written as $\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, where $\beta_0$ is called the intercept. For convenience we instead write the above sum of the parameterized predictor variables in vector form as $X\beta$. The name logistic regression is a bit unfortunate, since a regression model is usually used to predict a continuous response variable, whereas in classification the response variable is discrete. The term can be motivated by the fact that in logistic regression we find the probability of the response variable belonging to a certain class, and this probability is continuous.

c) K-NN

This section describes the implementation details of the KNN algorithm. The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches through the training dataset for the k most similar instances. The prediction attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.

The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data,
the Hamming distance can be used. The KNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms. Instance-based algorithms are those that model the problem using data instances (or rows) in order to make predictive decisions. The KNN algorithm is an extreme form of instance-based method because all training observations are retained as part of the model. It is a competitive learning algorithm because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to win, that is, to be most similar to a given unseen data instance and contribute to a prediction.

d) Artificial Neural Network

A backpropagation neural network was designed. In this network, 10 input neurons are present at the input layer. The number of inputs represents the total number of attributes in the dataset. The input layer uses the Rectified Linear Unit (ReLU) activation function. The output layer contains a single neuron, which uses the sigmoid activation function.

To obtain a recognition rate high enough to diagnose liver disorder in a patient, certain parameters of the neural network model need to be varied to produce the required optimum result. These parameters are the learning rate, the momentum rate and the number of hidden neurons, all of which are present in backpropagation neural networks. The learning rate controls how strongly the weights are adjusted at each update, and the momentum rate determines how much of the previous update is carried into the current one, which affects the learning speed of the system. The number of hidden neurons in the network has to be varied to produce the optimal result.

The number of neurons needed in the hidden layers was determined by experiment, varying the neuron count to find the configuration that best represents the features present in the input dataset. The sigmoid function was used in the output layer because of its soft switching ability and the simplicity of its derivative.

The neural network was implemented using the Keras package, which runs on the TensorFlow backend in Python.

A description of the backpropagation neural network is given in the table below.

Table-2 ANN Description

No. of Inputs                         10
No. of hidden Layers                  2
No. of neurons in 1st hidden layer    400
No. of neurons in 2nd hidden layer    400
No. of Output                         1
Learning Rate                         0.26
Epoch                                 100

4. RESULTS AND EVALUATION

Our main goal going into this project was to predict liver disease using various machine learning techniques. We predicted using Support Vector Machine (SVM), Logistic Regression, K-Nearest Neighbor (K-NN) and Neural Network. All of them predicted with good results. For each algorithm, we observed Accuracy, Precision, Sensitivity and Specificity, which are defined as follows. Here TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives on the test set, and all four measures are reported as percentages.

Accuracy: The accuracy of a classifier is the percentage of the test set tuples that are correctly classified by the classifier.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Sensitivity: Sensitivity is also referred to as the true positive rate, i.e. the proportion of positive tuples that are correctly identified.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Precision: Precision is defined as the proportion of true positives among all the positive results (both true positives and false positives).

$$\text{Precision} = \frac{TP}{TP + FP}$$

Specificity: Specificity is the true negative rate, that is, the proportion of negative tuples that are correctly identified.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

The results of each of the classification algorithms are summarized in the table below.

TABLE-3 Results of classification algorithms

Classification Algorithm    Accuracy (%)    Precision (%)    Sensitivity (%)    Specificity (%)
Logistic Regression         73.23           78.57            88                 30.62
K-NN                        72.05           80.98            83.78              44.04
SVM                         75.04           77.09            79                 71.11
ANN                         92.8            93.78            97.23              83

As summarized in the table, the Artificial Neural Network gave the best results.
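The four measures defined in this section can be computed directly from the counts of correct and incorrect predictions on the test set. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names and the example label lists are hypothetical, and label 1 (liver disease) is assumed to be the positive class, following the dataset's labelling convention.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem.

    `positive` is the label treated as the positive class (assumed here
    to be 1, i.e. liver disease present).
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def _percent(numerator, denominator):
    """Return a percentage, or NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity and specificity as percentages."""
    return {
        "accuracy": _percent(tp + tn, tp + tn + fp + fn),
        "precision": _percent(tp, tp + fp),
        "sensitivity": _percent(tp, tp + fn),
        "specificity": _percent(tn, tn + fp),
    }


# Hypothetical labels for ten test patients (1 = disease, 2 = no disease).
y_true = [1, 1, 1, 1, 2, 2, 2, 1, 2, 1]
y_pred = [1, 1, 2, 1, 2, 1, 2, 1, 2, 1]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

Because the dataset contains far more liver patients than non-patients, accuracy alone can look good for a model that rarely predicts the negative class, which is why specificity is worth reading alongside it.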


5. DEVELOPMENT OF GUI

The model that gave the maximum accuracy on the test data was the artificial neural network, so it is used for creating the GUI. The GUI is created using the Tkinter package in Python. Two GUIs are created, one for predicting and the other for training on new data. The GUI contains input fields for all attributes in the dataset. The system predicts whether or not the patient has liver disease based on the trained model. The GUI will be a useful tool for medical staff in the early diagnosis of liver disease in patients. A picture of the developed GUI is shown below.

Fig-1 GUI developed using python

6. CONCLUSION

In this project, we have proposed methods for diagnosing liver disease in patients using machine learning techniques. The four machine learning techniques used were SVM, Logistic Regression, KNN and Artificial Neural Network. The system was implemented using all the models and their performance was evaluated using accuracy, precision, sensitivity and specificity. ANN was the model that resulted in the highest accuracy, with an accuracy of 92.8% (Table-3). Comparing this work with the previous research works, it was found that ANN proved highly efficient. A GUI, which can be used as a medical tool by hospitals and medical staff, was implemented using ANN.

REFERENCES

[1] Michael J Sorich. An intelligent model for liver disease diagnosis. Artificial Intelligence in Medicine 2009;47:53–62.

[2] Paul R. Harper. A review and comparison of classification algorithms for medical decision making.

[3] BUPA Liver Disorder Dataset. UCI repository of machine learning databases.

[4] Prof Christopher N. New Automatic Diagnosis of Liver Status Using Bayesian Classification.

[5] Schiff's Diseases of the Liver, 10th Edition. Copyright ©2007 Ramana, Eugene R.; Sorrell, Michael Maddrey, Willis C.

[6] P. Sug. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (2–3) (1997) 103–130.

[7] HARRISON'S PRINCIPLES of Internal Medicine, 16th Edition.
