
Intern Report Progress

This document summarizes an internship report on predicting liver disease using machine learning. The report was submitted by four students in partial fulfillment of their bachelor's degree in computer science and engineering. It describes existing methods for liver disease prediction that have limitations regarding accuracy, data size, and real-time implementation. The proposed methodology uses machine learning algorithms to more effectively classify liver disease status from a dataset of 500 patient details. Evaluation of classification models indicates that the proposed approach can accurately predict results.

Uploaded by

Suhail

AN INTERNSHIP REPORT ON

LIVER DISEASE PREDICTION

Submitted by

AMEER BATCHA S (113219031009)


DEEPAK S (113219031033)
RAHUL PRAKASH S (113219031117)
RAJ KUMAR M (113219031118)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

VELAMMAL ENGINEERING COLLEGE, CHENNAI-66.


(An Autonomous Institution, Affiliated to Anna University, Chennai)

2021-2022

VELAMMAL ENGINEERING COLLEGE
CHENNAI -66

BONAFIDE CERTIFICATE

Certified that this internship report “Liver disease prediction using Machine
Learning” is the bonafide work of AMEER BATCHA S (113219031009),
DEEPAK S (113219031033), RAHUL PRAKASH S (113219031117) and RAJ
KUMAR M (113219031118), carried out at “PANTECH SOLUTIONS” during
07.12.2021 to 07.01.2022.

SIGNATURE SIGNATURE

Dr. B. MURUGESHWARI Ms. P. SARANYA


PROFESSOR AND HEAD SUPERVISOR
Dept. of Computer Science and Engineering Computer Science and Engineering
Velammal Engineering College Velammal Engineering College
Ambattur - Red Hills Road Ambattur - Red Hills Road
Chennai – 600 066. Chennai – 600 06

CERTIFICATE FROM INDUSTRY

CERTIFICATE OF EVALUATION

COLLEGE NAME : VELAMMAL ENGINEERING COLLEGE


BRANCH : COMPUTER SCIENCE ENGINEERING
SEMESTER : VI

Sl. No   Name of the student who      Title of the Internship     Name of Faculty Coordinator
         has done the Internship                                  with designation
1        AMEER BATCHA S               LIVER DISEASE PREDICTION    Ms. P. SARANYA
2        DEEPAK S
3        RAHUL PRAKASH S
4        RAJ KUMAR M

This report of internship work, submitted by the above students in partial fulfillment for
the award of the Bachelor of Engineering degree in Computer Science and Engineering at
Velammal Engineering College, was evaluated and confirmed to be a report of the work
done by the above students and then assessed.

Submitted for Internal Evaluation held on........................

Examiner 1 Examiner 2 Examiner 3

TABLE OF CONTENTS

CHAPTER TITLE PAGE NO

ABSTRACT X

ACKNOWLEDGEMENT XI

LIST OF FIGURES XII

1 INTRODUCTION
1.1 EXISTING METHODOLOGY 1
1.1.1 LIMITATION 1

1.2 PROPOSED METHODOLOGY 1


1.2.1 ADVANTAGES 2
1.2.2 SYSTEM ARCHITECTURE 3

2 MODULES
2.1 MODULES 4
2.1.1 DATA COLLECTION 4
2.1.2 DATA PRE-PROCESSING 4
2.1.2.1 FORMATTING 5
2.1.2.2 CLEANING 5
2.1.2.3 SAMPLING 5
2.1.3 FEATURE EXTRACTION 6
2.1.4 EVALUATION MODEL 6

3 DATA FLOW DIAGRAM


3.1 DATA FLOW DIAGRAM 7
3.2 WORK FLOW DIAGRAM 10
3.3 UML DIAGRAM 11
3.4 SEQUENCE DIAGRAM 12
3.5 ACTIVITY DIAGRAM 13

4 DOMAIN SPECIFICATION
4.1 MACHINE LEARNING 14
4.1.1 SUPERVISED LEARNING 15
4.1.1.1 ALGORITHM 16
4.1.2 UNSUPERVISED LEARNING 16
4.1.2.1 ALGORITHM 16
4.1.3 REINFORCEMENT LEARNING 19
4.1.3.1 DEFINITION 19

5 REQUIREMENTS
5.1 SYSTEM REQUIREMENTS
5.1.1 HARDWARE 20
5.1.2 SOFTWARE 20

5.2 PYTHON LIBRARIES

6 SOURCE CODE AND OUTPUT 21

7 CONCLUSION AND FUTURE WORK


7.1 CONCLUSION 46

7.2 FUTURE WORK 46

ABSTRACT
Liver disease is becoming one of the most fatal diseases in several countries. The number
of patients with liver disease has been increasing continuously because of excessive
consumption of alcohol, inhalation of harmful gases, and intake of contaminated food,
pickles and drugs. In this report, liver patient datasets are investigated to build
classification models that predict liver disease. The dataset is used to evaluate prediction
algorithms in an effort to reduce the burden on doctors. We propose screening patients
for liver disease using machine learning algorithms.

Chronic liver disease refers to disease of the liver that lasts for more than six
months. We treat each patient record as either positive (disease present) or negative
(disease absent). The classifiers are applied to these records and their results are
summarized as a confusion matrix. We propose a classification scheme built from several
classifiers that can effectively improve classification performance when a training
dataset is available. The dataset contains 583 patient records before cleaning.

All patient details are taken from this dataset, and each record is labelled as good or
bad by the machine learning classifiers. The outputs of the proposed classification
models indicate the accuracy with which the result is predicted.

ACKNOWLEDGEMENT

We wish to acknowledge with thanks the significant contribution of the management of
our college, our Chairman, Dr. M. V. Muthuramalingam, and our Chief Executive Officer,
Thiru. M.V.M. Velmurugan, for their extensive support.

We would like to thank Dr. S. SATHISHKUMAR, Principal of Velammal Engineering
College, for giving us this opportunity to do this project.

We wish to express our gratitude to our Head of the Department,
Dr. B. Murugeshwari, for her moral support and for her valuable innovative suggestions,
constructive interaction, constant encouragement and unending help that have enabled us to
complete the project.

We wish to express our humble thanks to the company PANTECH SOLUTIONS
and the External Guide, Mr. Praveen Kumar, Software Developer, for their invaluable
guidance in shaping this project.

We wish to express our sincere gratitude to our faculty coordinator, Ms. P. Saranya, Assistant
Professor, Department of Computer Science and Engineering, for her guidance, without
which this project would not have been possible.

We are grateful to all the staff members of the Department of Computer Science and
Engineering for providing the necessary facilities to carry out the project. We would
especially like to thank our parents for providing us with the unique opportunity to work,
and for their encouragement and support at all levels.

LIST OF FIGURES

FIGURE NO TITLE PAGE NO

1.1 System Architecture 3

3.1 Data Flow Level 0 Diagram 7

3.2 Data Flow Level 1 Diagram 8

3.3 Data Flow Level 2 Diagram 9

3.4 Work Flow Diagram 10

3.5 UML Diagram 11

3.6 Sequence Diagram 12

3.7 Activity Diagram 13

6.1 Body Chemical Level 25

6.2 Example Dataset 28

6.3 2-D Scatter plot (Direct Bilirubin vs Total Bilirubin) 31

6.4 LR vs SVM vs NC vs RFC 45

1. INTRODUCTION

1.1 EXISTING METHODOLOGY


Existing mechanisms typically assumed that adequate prediction accuracy had already
been achieved. This was not the case, so the classification accuracy must be improved
further. Other research works addressed this issue by introducing efficient
combinations of classifiers. Existing models based on feature selection and
classification also raised issues regarding the training dataset and the test dataset.

1.1.1 LIMITATION
● Certain approaches are applicable only to small datasets.
● Certain combinations of classifiers overfit the dataset while others
underfit.
● Some approaches are not adaptable to implementation with real-time
data collection.

1.2 PROPOSED METHODOLOGY


Machine learning has attracted a huge amount of research and has been applied
in various fields. In medicine, machine learning has proved its power and has been
employed to solve many urgent problems such as cancer treatment, heart disease
and dengue fever diagnosis. In the proposed system, we import the liver patient
dataset (.csv). The dataset is then pre-processed to remove anomalies and fill up
empty cells, so that liver disease prediction can be made more effective. The
models are then evaluated with two measures:

● Confusion matrix: gives better clarity on the number of correct and incorrect
predictions made by the model.
● ROC-AUC: considers the rank of the output probabilities and intuitively
measures the likelihood that the model can distinguish between a positive point
and a negative point. (Note: ROC-AUC is typically used for binary classification
only.)

We will use AUC to select the best model among the various machine learning
models.
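The two measures described above can be sketched as follows. This is a minimal illustration, not the report's own pipeline: the data is synthetic (generated with `make_classification` as a stand-in for the liver dataset), and the names `candidates`, `auc_by_model` and `best_model` are assumptions introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the liver data (assumption: 10 features, binary label).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical set of candidate models to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

auc_by_model = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Rows of the confusion matrix are true classes, columns are predicted classes.
    cm = confusion_matrix(y_test, model.predict(X_test))
    # ROC-AUC needs a score per sample; here the predicted probability of class 1.
    scores = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, scores)
    print(name, cm.tolist(), round(auc_by_model[name], 3))

# Select the model with the highest AUC on the test split.
best_model = max(auc_by_model, key=auc_by_model.get)
print("selected by AUC:", best_model)
```

Because AUC is computed from ranked probabilities rather than hard labels, it can separate two models that have the same accuracy.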

1.2.1 ADVANTAGES
● The classification performance for liver diseases is further
improved.
● Time complexity and accuracy can be measured for various machine
learning models, so that the different models can be compared.
● Different machine learning models give highly accurate results.
● Risk factors can be predicted early by machine learning models.

1.2.2 SYSTEM ARCHITECTURE

[Figure: block diagram. Blocks shown: Dataset, Data pre-processing, Feature
extraction, Machine learning model, ML Algorithms, Test data samples, Classifier,
Data classification, Result]

FIGURE 1.1 SYSTEM ARCHITECTURE

2. MODULES

2.1 MODULES

• DATA COLLECTION

• DATA PRE-PROCESSING

• FEATURE EXTRACTION

• EVALUATION MODEL

2.1.1 DATA COLLECTION

The data used in this report is a set of liver patient records
(indian_liver_patient.csv). This step is concerned with selecting the subset of all
available data that you will be working with. ML problems start with data,
preferably lots of data (examples or observations) for which you already know the
target answer. Data for which you already know the target answer is called
labelled data.

2.1.2 DATA PRE-PROCESSING

Organize your selected data by formatting, cleaning and sampling from it. Three
common data pre-processing steps are:

● Formatting
● Cleaning
● Sampling

2.1.2.1 Formatting:

The data you have selected may not be in a format that is suitable for you to work
with. The data may be in a relational database and you would like it in a flat file, or
the data may be in a proprietary file format and you would like it in a relational
database or a text file.

2.1.2.2 Cleaning:

Cleaning data is the removal or fixing of missing data. There may be data instances
that are incomplete and do not carry the data you believe you need to address the
problem. These instances may need to be removed. Additionally, there may be
sensitive information in some of the attributes and these attributes may need to be
anonymized or removed from the data entirely.
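The cleaning steps described above can be sketched with pandas. This is a minimal example on a toy table, not the report's data: the column `Patient_Name` is a hypothetical sensitive attribute added only to illustrate removal, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy records (assumption): one duplicated row and one missing value.
raw = pd.DataFrame({
    "Patient_Name": ["A", "B", "B", "C", "D"],   # hypothetical sensitive column
    "Age": [65, 62, 62, 58, 72],
    "Total_Bilirubin": [0.7, 10.9, 10.9, 1.0, 3.9],
    "Albumin_and_Globulin_Ratio": [0.90, 0.74, 0.74, np.nan, 0.40],
})

# Remove the sensitive attribute entirely.
anonymized = raw.drop(columns=["Patient_Name"])

# Remove exact duplicate rows, then rows that still contain missing values.
deduplicated = anonymized.drop_duplicates()
cleaned = deduplicated.dropna(how="any").reset_index(drop=True)

print(len(raw), len(deduplicated), len(cleaned))  # 5 4 3
```

Dropping incomplete rows is the simplest option; filling the missing cells (for example with a column median) is an alternative when the dataset is too small to lose rows.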

2.1.2.3 Sampling:

There may be far more selected data available than you need to work with. More
data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
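As a small sketch of this idea, pandas can draw a random subset of rows for prototyping. The table here is synthetic and the 10% fraction is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic table standing in for a larger dataset (assumption).
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "Total_Bilirubin": rng.uniform(0.4, 10.0, size=1000),
    "Dataset": rng.integers(0, 2, size=1000),
})

# Keep 10% of the rows; random_state makes the draw repeatable.
subset = full.sample(frac=0.1, random_state=42)
print(subset.shape)  # (100, 2)
```

When the classes are imbalanced, sampling within each class (for example with `groupby` on the label column) keeps the class proportions closer to those of the full data.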

2.1.3 FEATURE EXTRACTION

Feature extraction is an attribute reduction process. Unlike feature selection,
which ranks the existing attributes according to their predictive significance,
feature extraction actually transforms the attributes. The transformed attributes,
or features, are linear combinations of the original attributes. Finally, our
models are trained using classifier algorithms from the scikit-learn library in
Python. Part of the labelled dataset gathered is used for training, and the rest of
our labelled data is used to evaluate the models. Several machine learning
algorithms were used to classify the preprocessed data: logistic regression, a
support vector machine, a neural network and a random forest.
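Principal component analysis (PCA) is one common technique of this kind, shown here only as an illustration; the report's code does not apply it. The sketch uses synthetic data and checks that each extracted feature really is a linear combination of the (centered) original attributes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the second column is nearly a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Each row of components_ holds the weights of one linear combination
# of the original (centered) attributes.
manual = (X_scaled - pca.mean_) @ pca.components_.T

print(X_reduced.shape)                 # (200, 2)
print(np.allclose(manual, X_reduced))  # True
```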

2.1.4 EVALUATION MODEL

Model evaluation is an integral part of the model development process. It helps to
find the best model that represents our data and to estimate how well the chosen
model will work in the future. Evaluating model performance with the data used for
training is not acceptable in data science because it can easily generate
overoptimistic and overfitted models. There are two methods of evaluating models
in data science, hold-out and cross-validation. To avoid overfitting, both methods
use a test set (not seen by the model) to evaluate model performance. The
performance of each classification model is estimated based on its average score.
The results are presented in visual form, with the classified data represented as
graphs. Accuracy is defined as the percentage of correct predictions for the test
data. It can be calculated easily by dividing the number of correct predictions by
the number of total predictions.
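The two evaluation methods and the accuracy calculation can be sketched as below. The data is synthetic and logistic regression is used only as a convenient example classifier; the 80/20 split and 5 folds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labelled data (assumption) standing in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold-out: train on 80% of the data, evaluate once on the unseen 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy = number of correct predictions / number of total predictions.
correct = int((predictions == y_test).sum())
holdout_accuracy = correct / len(y_test)

# Cross-validation: five train/test splits, one accuracy per fold, then averaged.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
cv_accuracy = cv_scores.mean()

print(round(holdout_accuracy, 3), round(cv_accuracy, 3))
```

Cross-validation costs more computation but gives a less split-dependent estimate, which matters for a dataset of only a few hundred records.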

3. DATA FLOW DIAGRAM

3.1 DATA FLOW DIAGRAM

LEVEL 0

FIGURE 3.1 DATA FLOW LEVEL 0 DIAGRAM

LEVEL 1

FIGURE 3.2 DATA FLOW LEVEL 1 DIAGRAM

LEVEL 2

FIGURE 3.3 DATA FLOW LEVEL 2 DIAGRAM

3.2 WORK FLOW DIAGRAM

FIGURE 3.4 WORK FLOW DIAGRAM

3.3 UML DIAGRAM

FIGURE 3.5 USE CASE DIAGRAM

3.4 SEQUENCE DIAGRAM

FIGURE 3.6 SEQUENCE DIAGRAM

3.5 ACTIVITY DIAGRAM

FIGURE 3.7 ACTIVITY DIAGRAM

4. DOMAIN SPECIFICATION

4.1 MACHINE LEARNING

Machine learning is a system that can learn from examples through
self-improvement, without being explicitly coded by a programmer. The
breakthrough comes with the idea that a machine can learn on its own from the data
(i.e., examples) to produce accurate results.

Machine learning combines data with statistical tools to predict an output. This
output is then used by companies to derive actionable insights. Machine learning is
closely related to data mining and Bayesian predictive modeling. The machine
receives data as input and uses an algorithm to formulate answers.

A typical machine learning task is to provide a recommendation. For those who
have a Netflix account, all recommendations of movies or series are based on the
user's historical data. Tech companies are using unsupervised learning to improve
the user experience with personalized recommendations.

Machine learning is also used for a variety of tasks such as fraud detection,
predictive maintenance, portfolio optimization, task automation and so on.

4.1.1 SUPERVISED LEARNING

An algorithm uses training data and feedback from humans to learn the
relationship of given inputs to a given output. For instance, a practitioner can use
marketing expense and weather forecast as input data to predict the sales of cans.

You can use supervised learning when the output data is known. The algorithm
will then predict the output for new data.
There are two categories of supervised learning:

● Classification task
● Regression task

Classification:
Imagine you want to predict the gender of a customer for a commercial. You
start by gathering data on the height, weight, job, salary, purchasing basket, etc.
from your customer database. You know the gender of each of your customers; in
this example it is recorded as either male or female. The objective of the classifier
is to assign a probability of being male or female (i.e., the label) based on the
information (i.e., the features) you have collected. Once the model has learned how
to recognize male or female, you can use new data to make a prediction. For
instance, you have just received new information from an unknown customer, and
you want to know whether the customer is male or female. If the classifier predicts
male = 70%, it means the algorithm is 70% sure that this customer is male and 30%
sure that the customer is female.
The label can have two or more classes. The above example has only two classes,
but if a classifier needs to predict objects, it has dozens of classes (e.g., glass,
table, shoes, etc.; each object represents a class).
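The idea of a classifier returning a probability per class can be sketched with scikit-learn's `predict_proba`. The data below is synthetic and the two classes are simply 0 and 1; they stand in for any binary label and are not the customer data of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Train on the first 250 rows and treat row 250 as a new, unseen customer.
clf = LogisticRegression(max_iter=1000).fit(X[:250], y[:250])
new_sample = X[250:251]

proba = clf.predict_proba(new_sample)[0]   # one probability per class, summing to 1
label = clf.predict(new_sample)[0]         # the class with the higher probability

for cls, p in zip(clf.classes_, proba):
    print(f"class {cls}: {p:.0%}")
print("predicted label:", label)
```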

Regression:
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of
features such as equity, previous stock performance and macroeconomic indices.
The system will be trained to estimate the price of the stock with the lowest
possible error.

4.1.1.1 ALGORITHM

Algorithm: Linear regression
Type: Regression
Description: Finds a way to correlate each feature to the output to help predict
future values.

Algorithm: Logistic regression
Type: Classification
Description: Extension of linear regression that is used for classification tasks.
The output variable is binary (e.g., only black or white) rather than continuous.

Algorithm: Decision tree
Type: Regression, Classification
Description: Highly interpretable classification or regression model that splits
data-feature values into branches at decision nodes (e.g., if a feature is a color,
each possible color becomes a new branch) until a final decision output is made.

Algorithm: Naive Bayes
Type: Regression, Classification
Description: The Bayesian method is a classification method that makes use of the
Bayesian theorem. The theorem updates the prior knowledge of an event with the
independent probability of each feature.

Algorithm: Support vector machine
Type: Regression (not very common), Classification
Description: Support Vector Machine, or SVM, is typically used for the
classification task. The SVM algorithm finds a hyperplane that optimally divides
the classes.

Algorithm: Random forest
Type: Regression, Classification
Description: The algorithm is built upon decision trees to improve the accuracy
drastically. Random forest generates many simple decision trees and uses the
'majority vote' method to decide which label to return. For the classification
task, the final prediction is the one with the most votes, while for the regression
task, the average prediction of all the trees is the final prediction.

Algorithm: AdaBoost
Type: Regression, Classification
Description: Classification or regression technique that uses a multitude of models
to come up with a decision but weighs them based on their accuracy in predicting
the outcome.

Algorithm: Gradient-boosting trees
Type: Regression, Classification
Description: Gradient-boosting trees is a state-of-the-art classification/regression
technique. It focuses on the error committed by the previous trees and tries to
correct it.

4.1.2 UNSUPERVISED LEARNING

In unsupervised learning, an algorithm explores input data without being given an
explicit output variable (e.g., explores customer demographic data to identify
patterns). You can use it when you do not know how to classify the data and you
want the algorithm to find patterns and classify the data for you.

4.1.2.1 ALGORITHM

Algorithm: K-means clustering
Type: Clustering
Description: Puts data into some groups (k) that each contain data with similar
characteristics (as determined by the model, not in advance by humans).

Algorithm: Gaussian mixture model
Type: Clustering
Description: A generalization of k-means clustering that provides more flexibility
in the size and shape of groups (clusters).

Algorithm: Hierarchical clustering
Type: Clustering
Description: Splits clusters along a hierarchical tree to form a classification
system. Can be used to cluster loyalty-card customers.

Algorithm: Recommender system
Type: Clustering
Description: Helps to define the relevant data for making a recommendation.

Algorithm: PCA/T-SNE
Type: Dimension Reduction
Description: Mostly used to decrease the dimensionality of the data. The
algorithms reduce the number of features to 3 or 4 vectors with the highest
variances.
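As a brief sketch of clustering without labels, k-means can recover two groups from unlabelled points. The points are synthetic, and the choice of k = 2 is an assumption that matches how the toy data was generated.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 2-D points (assumption); no labels are given.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

# Ask for k = 2 groups; the model decides which point belongs to which group.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print(labels[:5], labels[-5:])
print(kmeans.cluster_centers_.round(1))
```

The cluster numbers themselves (0 or 1) are arbitrary; only the grouping is meaningful.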

4.1.3 REINFORCEMENT LEARNING

4.1.3.1 DEFINITIONS

Reinforcement learning is a subfield of machine learning in which systems are
trained by receiving virtual "rewards" or "punishments," essentially learning by
trial and error. Google's DeepMind has used reinforcement learning to beat a
human champion at the game of Go. Reinforcement learning is also used in video
games to improve the gaming experience by providing smarter bots.

Some of the most well-known algorithms are:

● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
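To make the reward-driven, trial-and-error idea concrete, here is a minimal tabular Q-learning sketch. The environment is a hypothetical five-state chain invented for this example (move left or right, reward 1 on reaching the last state); the learning rate, discount factor and exploration rate are illustrative values, not tuned settings.

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
goal = n_states - 1                 # reaching the last state ends an episode
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Move along the chain; reward 1.0 only when the goal is reached."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, goal)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward

for _ in range(500):
    state = 0
    while state != goal:
        if rng.random() < epsilon:
            # Explore: pick a random action.
            action = int(rng.integers(n_actions))
        else:
            # Exploit: pick the best known action, breaking ties at random.
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        target = reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

greedy_policy = Q[:goal].argmax(axis=1)
print(greedy_policy)  # [1 1 1 1]: always move right, toward the reward
```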

5. REQUIREMENTS

5.1 SYSTEM REQUIREMENT

5.1.1 HARDWARE

● Windows 7, 8 or 10 (64-bit)
● RAM 4 GB (minimum)

5.1.2 SOFTWARE

● Data set
● Python 2.7
● Anaconda Navigator

5.2 PYTHON LIBRARIES

● Pandas
● Numpy
● Sklearn
● seaborn
● matplotlib

6. SOURCE CODE

Liver Disease prediction


# for numerical computing
import numpy as np

# for dataframes
import pandas as pd

# for easier visualization


import seaborn as sns

# for visualization and to display plots


from matplotlib import pyplot as plt
%matplotlib inline

# import color maps


from matplotlib.colors import ListedColormap

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

from math import sqrt

# to split train and test set


from sklearn.model_selection import train_test_split

# Machine Learning Models


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc, roc_auc_score, confusion_matrix

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score

df=pd.read_csv('indian_liver_patient.csv')

df.shape

(583, 11)

df.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',


'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
'Albumin_and_Globulin_Ratio', 'Dataset'],
dtype='object')

df.head()

Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 Female 0.7 0.1 187
1 62 Male 10.9 5.5 699
2 62 Male 7.3 4.1 490
3 58 Male 1.0 0.4 182
4 72 Male 3.9 2.0 195

Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens \


0 16 18 6.8
1 64 100 7.5
2 60 68 7.0
3 14 20 6.8
4 27 59 7.3

Albumin Albumin_and_Globulin_Ratio Dataset


0 3.3 0.90 1
1 3.2 0.74 1
2 3.3 0.89 1
3 3.4 1.00 1
4 2.4 0.40 1

Exploratory Data Analysis

Filtering categorical data

df.dtypes[df.dtypes=='object']

Gender object
dtype: object

Distribution of Numerical Features


# Plot histogram grid
df.hist(figsize=(15,15), xrot=-45, bins=10)  ## Display the x labels rotated by 45 degrees

# Clear the text "residue"


plt.show()

Figure 6.1 Body chemical level

df.describe()

Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


count 583.000000 583.000000 583.000000 583.000000
mean 44.746141 3.298799 1.486106 290.576329
std 16.189833 6.209522 2.808498 242.937989
min 4.000000 0.400000 0.100000 63.000000
25% 33.000000 0.800000 0.200000 175.500000
50% 45.000000 1.000000 0.300000 208.000000

75% 58.000000 2.600000 1.300000 298.000000
max 90.000000 75.000000 19.700000 2110.000000

Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens \


count 583.000000 583.000000 583.000000
mean 80.713551 109.910806 6.483190
std 182.620356 288.918529 1.085451
min 10.000000 10.000000 2.700000
25% 23.000000 25.000000 5.800000
50% 35.000000 42.000000 6.600000
75% 60.500000 87.000000 7.200000
max 2000.000000 4929.000000 9.600000

Albumin Albumin_and_Globulin_Ratio Dataset


count 583.000000 579.000000 583.000000
mean 3.141852 0.947064 1.286449
std 0.795519 0.319592 0.452490
min 0.900000 0.300000 1.000000
25% 2.600000 0.700000 1.000000
50% 3.100000 0.930000 1.000000
75% 3.800000 1.100000 2.000000
max 5.500000 2.800000 2.000000

## map Dataset label 2 to 0; every other value (here 1) becomes 1
def partition(x):
    if x == 2:
        return 0
    return 1

df['Dataset'] = df['Dataset'].map(partition)

Distribution of categorical data


df.describe(include=['object'])

Gender
count 583
unique 2
top Male
freq 441

Bar plots for categorical Features


plt.figure(figsize=(5,5))
sns.countplot(y='Gender', data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x1f75511a208>

Figure 6.2 Example dataset

df[df['Gender'] == 'Male'][['Dataset', 'Gender']].head()

Dataset Gender
1 1 Male
2 1 Male
3 1 Male
4 1 Male
5 1 Male

sns.countplot (x="Gender", hue="Dataset", data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x1f764c8de80>

Age seems to be a factor for liver disease for both male and female genders
sns.countplot(data=df, x = 'Gender', label='Count')

M, F = df['Gender'].value_counts()
print('Number of patients that are male: ',M)
print('Number of patients that are female: ',F)

Number of patients that are male: 441


Number of patients that are female: 142

There are more male patients than female patients
Label Male as 0 and Female as 1
## if Gender is 'Male', mark 0; else 1
def partition(x):
    if x == 'Male':
        return 0
    return 1

df['Gender'] = df['Gender'].map(partition)

2-D Scatter Plot


sns.set_style('whitegrid') ## Background Grid
sns.FacetGrid(df, hue = 'Dataset', size = 5).map(plt.scatter, 'Total_Bilirubin',
'Direct_Bilirubin').add_legend()

<seaborn.axisgrid.FacetGrid at 0x1f764d4d358>

Figure 6.3 2-D scatter plot Direct_Bilirubin vs Total_Bilirubin

sns.set_style('whitegrid') ## Background Grid


sns.FacetGrid(df, hue = 'Dataset', size = 5).map(plt.scatter, 'Total_Bilirubin',
'Albumin').add_legend()

<seaborn.axisgrid.FacetGrid at 0x1f764dd8908>

sns.set_style('whitegrid') ## Background Grid
sns.FacetGrid(df, hue = 'Dataset', size = 5).map(plt.scatter, 'Total_Protiens',
'Albumin_and_Globulin_Ratio').add_legend()

<seaborn.axisgrid.FacetGrid at 0x1f764e535c0>

df.corr()

Age Gender Total_Bilirubin \


Age 1.000000 -0.056560 0.011763
Gender -0.056560 1.000000 -0.089291
Total_Bilirubin 0.011763 -0.089291 1.000000
Direct_Bilirubin 0.007529 -0.100436 0.874618
Alkaline_Phosphotase 0.080425 0.027496 0.206669
Alamine_Aminotransferase -0.086883 -0.082332 0.214065
Aspartate_Aminotransferase -0.019910 -0.080336 0.237831
Total_Protiens -0.187461 0.089121 -0.008099
Albumin -0.265924 0.093799 -0.222250

Albumin_and_Globulin_Ratio -0.216408 0.003424 -0.206267
Dataset 0.137351 -0.082416 0.220208

Direct_Bilirubin Alkaline_Phosphotase \
Age 0.007529 0.080425
Gender -0.100436 0.027496
Total_Bilirubin 0.874618 0.206669
Direct_Bilirubin 1.000000 0.234939
Alkaline_Phosphotase 0.234939 1.000000
Alamine_Aminotransferase 0.233894 0.125680
Aspartate_Aminotransferase 0.257544 0.167196
Total_Protiens -0.000139 -0.028514
Albumin -0.228531 -0.165453
Albumin_and_Globulin_Ratio -0.200125 -0.234166
Dataset 0.246046 0.184866

Alamine_Aminotransferase \
Age -0.086883
Gender -0.082332
Total_Bilirubin 0.214065
Direct_Bilirubin 0.233894
Alkaline_Phosphotase 0.125680
Alamine_Aminotransferase 1.000000
Aspartate_Aminotransferase 0.791966
Total_Protiens -0.042518
Albumin -0.029742
Albumin_and_Globulin_Ratio -0.002375

Dataset 0.163416

Aspartate_Aminotransferase Total_Protiens \
Age -0.019910 -0.187461
Gender -0.080336 0.089121
Total_Bilirubin 0.237831 -0.008099
Direct_Bilirubin 0.257544 -0.000139
Alkaline_Phosphotase 0.167196 -0.028514
Alamine_Aminotransferase 0.791966 -0.042518
Aspartate_Aminotransferase 1.000000 -0.025645
Total_Protiens -0.025645 1.000000
Albumin -0.085290 0.784053
Albumin_and_Globulin_Ratio -0.070040 0.234887
Dataset 0.151934 -0.035008

Albumin Albumin_and_Globulin_Ratio Dataset


Age -0.265924 -0.216408 0.137351
Gender 0.093799 0.003424 -0.082416
Total_Bilirubin -0.222250 -0.206267 0.220208
Direct_Bilirubin -0.228531 -0.200125 0.246046
Alkaline_Phosphotase -0.165453 -0.234166 0.184866
Alamine_Aminotransferase -0.029742 -0.002375 0.163416
Aspartate_Aminotransferase -0.085290 -0.070040 0.151934
Total_Protiens 0.784053 0.234887 -0.035008
Albumin 1.000000 0.689632 -0.161388
Albumin_and_Globulin_Ratio 0.689632 1.000000 -0.163131
Dataset -0.161388 -0.163131 1.000000

plt.figure(figsize=(10,10))
sns.heatmap(df.corr())

<matplotlib.axes._subplots.AxesSubplot at 0x1f764ece668>

Data Cleaning
df = df.drop_duplicates()
print( df.shape )

(570, 11)

There were 13 duplicates

Removing Outliers
sns.boxplot(df.Aspartate_Aminotransferase)

<matplotlib.axes._subplots.AxesSubplot at 0x1f7656f9780>

df.Aspartate_Aminotransferase.sort_values(ascending=False).head()

135 4929
117 2946
118 1600
207 1500
119 1050
Name: Aspartate_Aminotransferase, dtype: int64

df = df[df.Aspartate_Aminotransferase <=3000 ]
df.shape

(569, 11)

sns.boxplot(df.Aspartate_Aminotransferase)

<matplotlib.axes._subplots.AxesSubplot at 0x1f765514710>

df.Aspartate_Aminotransferase.sort_values(ascending=False).head()

117 2946
118 1600
207 1500
199 1050
119 1050
Name: Aspartate_Aminotransferase, dtype: int64

df = df[df.Aspartate_Aminotransferase <=2500 ]
df.shape

(568, 11)
df.isnull().values.any()

True

df=df.dropna(how='any')
#how : {‘any’, ‘all’}
#any : if any NA values are present, drop that label
#all : if all values are NA, drop that label

df.shape

(564, 11)

df.head()

Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 65 1 0.7 0.1 187
1 62 0 10.9 5.5 699
2 62 0 7.3 4.1 490
3 58 0 1.0 0.4 182
4 72 0 3.9 2.0 195

Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens \


0 16 18 6.8
1 64 100 7.5
2 60 68 7.0
3 14 20 6.8
4 27 59 7.3

Albumin Albumin_and_Globulin_Ratio Dataset

0 3.3 0.90 1
1 3.2 0.74 1
2 3.3 0.89 1
3 3.4 1.00 1
4 2.4 0.40 1

df = df.sample(frac=1).reset_index(drop=True)

df.head()

Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase \


0 27 0 1.2 0.4 179
1 66 0 0.6 0.2 100
2 32 0 12.7 8.4 190
3 60 0 1.9 0.8 614
4 48 0 3.2 1.6 257

Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens \


0 63 39 6.1
1 17 148 5.0
2 28 47 5.4
3 42 38 4.5
4 33 116 5.7

Albumin Albumin_and_Globulin_Ratio Dataset


0 3.3 1.10 0
1 3.3 1.90 0
2 2.6 0.90 1

3 1.8 0.60 1
4 2.2 0.62 1

Machine Learning Models


Data Preparation
# Create separate object for target variable
y = df.Dataset

# Create separate object for input features


X = df.drop('Dataset', axis=1)

# Split X and y into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=1234,
stratify=df.Dataset)

# Print number of observations in X_train, X_test, y_train, and y_test


print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(451, 10) (113, 10) (451,) (113,)

from sklearn.preprocessing import minmax_scale

X_train=minmax_scale(X_train)

X_test=minmax_scale(X_test)
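Note that calling minmax_scale separately on the training and test sets rescales each with its own minimum and maximum, so the same raw value can map to different scaled values in the two sets. A common alternative, sketched here on small made-up arrays (the toy `X_train` and `X_test` below are not the report's data), is to fit the scaler on the training set only and reuse it for the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices (assumption), two features each.
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test = np.array([[20.0, 2.0], [40.0, 5.0]])

# Learn the per-column minimum and maximum from the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training-set minimum and maximum to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled)
# [[0.5  0.25]
#  [1.5  1.  ]]
```

A test value outside the training range (40.0 here) falls outside [0, 1]; that is expected and keeps the two sets on the same scale.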

Logistic Regression
from sklearn.linear_model import LogisticRegression

lr=LogisticRegression()
lr.fit(X_train, y_train)
predict1=lr.predict(X_test)
predict1

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1], dtype=int64)

model1=accuracy_score(y_test,predict1)
print(model1)

0.7256637168141593

SVM
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

predict2=svclassifier.predict(X_test)
print(predict2)

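model2=accuracy_score(y_test,predict2)
# Print the SVM accuracy. The original printed model1 here, so the value shown
# below is the logistic regression accuracy.
print(model2)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]
0.7256637168141593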

NN
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2),
random_state=1)
nn.fit(X_train,y_train)

predict3=nn.predict(X_test)
#predict3
model3=accuracy_score(y_test,predict3)
print(model3)

0.7168141592920354

Random Forest
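from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier(random_state=150)
random.fit(X_train, y_train)
# Predict with the fitted random forest. The original called nn.predict here,
# so the accuracy printed below repeats the neural network result.
predict4=random.predict(X_test)
#predict4

model4=accuracy_score(y_test,predict4)
print(model4)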

0.7168141592920354
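The prediction arrays above are almost entirely class 1, so it is worth comparing each accuracy with a majority-class baseline. The sketch below uses made-up labels with a similar imbalance (the counts are assumptions, not the report's data) and scikit-learn's `DummyClassifier`, which always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy imbalanced labels (assumption): 72% of samples belong to class 1.
y_train = np.array([1] * 72 + [0] * 28)
y_test = np.array([1] * 18 + [0] * 7)
X_train = np.zeros((len(y_train), 1))   # features are ignored by the baseline
X_test = np.zeros((len(y_test), 1))

# Baseline that always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)   # 18 / 25 = 0.72
cm = confusion_matrix(y_test, baseline_pred)

print(baseline_accuracy)
print(cm)   # rows: true class 0, 1; columns: predicted class 0, 1
```

A model whose accuracy is close to this baseline has learned little beyond the class proportions, which is why the confusion matrix and ROC-AUC are more informative than accuracy alone on imbalanced data.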

import matplotlib.pyplot as plt


objects = ('LogisticRegression','SVM','Neural_Network','RandomForest')
y_pos = np.arange(len(objects))
performance = [model1,model2,model3,model4]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Accuracy')
plt.title('LogisticRegression vs SVM vs Neural_Network vs RandomForestClassifier')

plt.show()

Figure 6.4 LogisticRegression vs SVM vs Neural Network vs
RandomForestClassifier

7. CONCLUSION AND FUTURE WORK

7.1 CONCLUSION

In this project, we have proposed methods for diagnosing liver disease in patients
using machine learning techniques. The machine learning techniques used were
Logistic Regression, SVM, an Artificial Neural Network and Random Forest. The
system was implemented using all the models and their performance was
evaluated using accuracy on a held-out test set. In the run recorded in Chapter 6,
logistic regression reached a test accuracy of about 72.6% and the neural network
about 71.7%.

7.2 FUTURE WORK

Various studies have demonstrated the potential of data mining and ML tools in
the medical domain. These tools can discover hidden but significant predictive
parameters in medical datasets, enabling early prediction and diagnosis of
diseases. ML techniques are highly promising for diagnosing liver diseases, but
further data proving their validity and efficiency is required before physicians can
rely on them routinely.
