Internship Progress Report
Submitted by
BACHELOR OF ENGINEERING
IN
2021-2022
VELAMMAL ENGINEERING COLLEGE
CHENNAI -66
BONAFIDE CERTIFICATE
Certified that this internship report “Liver disease prediction using Machine
Learning” is the bonafide work of AMEER BATCHA S (113219031009),
DEEPAK S (113219031033), RAHUL PRAKASH S (113219031117) and
RAJ KUMAR M (113219031118), carried out at “PANTECH SOLUTIONS” during
07.12.2021 to 07.01.2022.
SIGNATURE SIGNATURE
CERTIFICATE FROM INDUSTRY
CERTIFICATE OF EVALUATION
Sl. No   Name of the student who has done the Internship   Title of the Internship    Name of Faculty Coordinator with designation
1        AMEER BATCHA S                                     LIVER DISEASE PREDICTION   Ms. P. SARANYA
2        DEEPAK S                                           LIVER DISEASE PREDICTION   Ms. P. SARANYA
3        RAHUL PRAKASH S                                    LIVER DISEASE PREDICTION   Ms. P. SARANYA
4        RAJ KUMAR M                                        LIVER DISEASE PREDICTION   Ms. P. SARANYA
This internship report, submitted by the above students in partial fulfillment of the
requirements for the award of the Bachelor of Engineering degree in Computer Science
and Engineering at Velammal Engineering College, was evaluated and confirmed to be a
report of the work done by the above students, and was then assessed.
TABLE OF CONTENTS
ABSTRACT X
ACKNOWLEDGEMENT XI
1 INTRODUCTION
1.1 EXISTING METHODOLOGY 1
1.1.1 LIMITATION 1
2 MODULES
2.1 MODULES 4
2.1.1 DATA COLLECTION 4
2.1.2 DATA PRE-PROCESSING 4
2.1.2.1 FORMATTING 5
2.1.2.2 CLEANING 5
2.1.2.3 SAMPLING 5
2.1.3 FEATURE EXTRACTION 6
2.1.4 EVALUATION MODEL 6
3 DATA FLOW DIAGRAM 7
3.2 WORK FLOW DIAGRAM 10
3.3 UML DIAGRAM 11
3.4 SEQUENCE DIAGRAM 12
3.5 ACTIVITY DIAGRAM 13
4 DOMAIN SPECIFICATION
4.1 MACHINE LEARNING 14
4.1.1 SUPERVISED LEARNING 15
4.1.1.1 ALGORITHM 16
4.1.2 UNSUPERVISED LEARNING 16
4.1.2.1 ALGORITHM 16
4.1.3 REINFORCEMENT LEARNING 19
4.1.3.1 DEFINITION 19
5 REQUIREMENTS
5.1 SYSTEM REQUIREMENTS
5.1.1 HARDWARE 20
5.1.2 SOFTWARE 20
6 SOURCE CODE 21
7 CONCLUSION AND FUTURE WORK 46
ABSTRACT
Liver disease is becoming one of the most fatal diseases in several countries. The number
of patients with liver disease has been rising continuously because of excessive consumption
of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs.
In this work, liver patient datasets are investigated in order to build classification models
that predict liver disease. The dataset was used to evaluate prediction algorithms in an
effort to reduce the burden on doctors. In this report, we propose to screen patients for
liver disease using machine learning algorithms.
Chronic liver disease refers to disease of the liver that lasts over a period of six
months. The classifier output is treated as positive or negative information about whether
a patient has the disease, and the results are reported as a confusion matrix together with
the prediction accuracy. We propose several classification schemes that can effectively
improve classification performance when a training dataset is available. The dataset used
here contains records of 583 patients.
From this dataset we extract the relevant attributes, separate the positive and negative
cases, and train machine learning classifiers on them. The outputs of the proposed
classification models indicate the accuracy achieved in predicting the result.
ACKNOWLEDGEMENT
Executive Officer, Thiru. M.V.M. Velmurugan, for their extensive support.
We are grateful to all the staff members of the Department of Computer Science and
Engineering for providing the necessary facilities to carry out the project. We would
especially like to thank our parents for providing us with the unique opportunity to work,
and for their encouragement and support at all levels.
LIST OF FIGURES
1.1 System Architecture 3
6.1 Body chemical level 25
6.2 Example dataset 28
6.3 2-D scatter plot Direct_Bilirubin vs Total_Bilirubin 31
6.4 Logistic Regression vs SVM vs Neural Network (MLP) vs Random Forest classifiers 45
1. INTRODUCTION
1.1.1 LIMITATION
● Certain approaches are applicable only to small datasets.
● Certain combinations of classifiers overfit the dataset while others underfit it.
● Some approaches are not adaptable to real-time collection and use of data.
predictions by the model. ROC-AUC considers the rank of the output probabilities and
intuitively measures the likelihood that the model can distinguish between a positive
point and a negative point. (Note: ROC-AUC is typically used for binary classification
only.) We will use AUC to select the best model among the various machine learning models.
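As an illustration of how AUC can drive model selection, the sketch below compares two scikit-learn models by ROC-AUC; it is not part of the original report and uses synthetic data in place of the liver dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# synthetic binary-classification data standing in for the liver dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# score each candidate model by ROC-AUC on the held-out set and keep the best one
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc, 3))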
1.2.1 ADVANTAGE
● The classification performance for liver-based diseases is further improved.
● Time complexity and accuracy can be measured for the various machine learning
models, so that different models can be compared.
● The different machine learning models achieve a high accuracy of results.
● Risk factors can be predicted early by the machine learning models.
1.2.2 SYSTEM ARCHITECTURE
(Figure 1.1, System Architecture: Dataset → Data pre-processing → Feature extraction →
Machine learning model (ML algorithms) → Classifier, applied to test data samples →
Data classification → Result.)
2. MODULES
2.1 MODULES
• DATA COLLECTION
• DATA PRE-PROCESSING
• FEATURE EXTRACTION
• EVALUATION MODEL
2.1.1 DATA COLLECTION
Data used in this work is the Indian liver patient dataset, a set of patient records with
laboratory test values. This step is
concerned with selecting the subset of all available data that you will be working
with. ML problems start with data preferably, lots of data (examples or
observations) for which you already know the target answer. Data for which you
already know the target answer is called labelled data.
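For example (a minimal sketch, not part of the original report, assuming the Indian liver patient CSV used in the source code chapter), labelled data is simply a table of feature columns plus a column holding the known target answer:

import pandas as pd

# each row is one observation; the 'Dataset' column holds the known target answer (label)
df = pd.read_csv('indian_liver_patient.csv')
features = df.drop('Dataset', axis=1)   # inputs
labels = df['Dataset']                  # known answers (1 = liver patient, 2 = otherwise)
print(features.shape, labels.value_counts().to_dict())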
2.1.2 DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling it. Three common data
pre-processing steps are:
• Formatting
• Cleaning
• Sampling
2.1.2.1 Formatting:
The data you have selected may not be in a format that is suitable for you to work
with. The data may be in a relational database and you would like it in a flat file, or
the data may be in a proprietary file format and you would like it in a relational
database or a text file.
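As a small, hypothetical example of such reformatting (the database file and table name below are made up for illustration), pandas can pull a relational table into a flat CSV file:

import sqlite3
import pandas as pd

# read a relational table and write it out as a flat file
conn = sqlite3.connect('hospital.db')            # hypothetical database
df = pd.read_sql_query('SELECT * FROM patients', conn)
conn.close()
df.to_csv('patients_flat.csv', index=False)      # flat-file format for the ML tools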
2.1.2.2 Cleaning:
Cleaning data is the removal or fixing of missing data. There may be data instances
that are incomplete and do not carry the data you believe you need to address the
problem. These instances may need to be removed. Additionally, there may be
sensitive information in some of the attributes and these attributes may need to be
anonymized or removed from the data entirely.
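A minimal cleaning sketch along these lines (the 'Patient_Name' column is hypothetical; the liver dataset itself has no such field) could be:

import pandas as pd

df = pd.read_csv('indian_liver_patient.csv')
# drop incomplete instances and exact duplicates
df = df.dropna(how='any').drop_duplicates()
# drop (or anonymize) sensitive attributes if any were present
df = df.drop(columns=['Patient_Name'], errors='ignore')
print(df.shape)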
2.1.2.3 Sampling:
There may be far more selected data available than you need to work with. More
data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
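For instance (a sketch, not from the original listing), a reproducible 10% random sample can be drawn with pandas for quick prototyping:

import pandas as pd

df = pd.read_csv('indian_liver_patient.csv')
# take a 10% sample of the rows for fast experimentation
sample_df = df.sample(frac=0.1, random_state=42)
print(len(df), '->', len(sample_df))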
2.1.3 FEATURE EXTRACTION
3. DATA FLOW DIAGRAM
LEVEL 0
LEVEL 1
LEVEL 1
3.2 WORK FLOW DIAGRAM
3.3 UML DIAGRAM
3.4 SEQUENCE DIAGRAM
3.5 ACTIVITY DIAGRAM
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING
Machine learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough came with the idea that
a machine can learn from data (i.e., examples) on its own to produce accurate results.
Machine learning combines data with statistical tools to predict an output. This output
is then used by businesses to derive actionable insights. Machine learning is closely
related to data mining and Bayesian predictive modeling. The machine receives data as
input and uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's
historical data. Tech companies use unsupervised learning to improve the user experience
with personalized recommendations.
Machine learning is also used for a variety of tasks such as fraud detection, predictive
maintenance, portfolio optimization, task automation, and so on.
4.1.1 SUPERVISED LEARNING
An algorithm uses training data and feedback from humans to learn the relationship
between given inputs and a given output. For instance, a practitioner can use marketing
expense and weather forecasts as input data to predict the sales of cans.
Supervised learning is used when the output is known for the training data; the trained
algorithm then predicts the output for new data.
There are two categories of supervised learning:
● Classification task
● Regression task
Classification:
Imagine you want to predict the gender of a customer for a commercial. You would start
by gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; it can only be male or
female. The objective of the classifier is to assign a probability of being male or
female (i.e., the label) based on the information (i.e., the features you have
collected). Once the model has learned to recognize male or female, you can use new data
to make a prediction. For instance, if you get new information about an unknown customer
and the classifier predicts male = 70%, it means the algorithm is 70% sure that this
customer is a male and 30% sure that it is a female.
The label can have two or more classes. The above example has only two classes, but if a
classifier needs to predict objects, it may have dozens of classes (e.g., glass, table,
shoes; each object type represents a class).
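A toy version of this customer-gender classifier (with entirely made-up data, not taken from the report) can be written with scikit-learn; predict_proba returns exactly the kind of 70%/30% probabilities described above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [height_cm, weight_kg]; labels: 1 = male, 0 = female (toy data)
X = np.array([[180, 80], [175, 77], [168, 60], [160, 52], [185, 90], [158, 50]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
new_customer = np.array([[172, 70]])
print(clf.predict_proba(new_customer))   # e.g. [[0.30, 0.70]] -> about 70% sure the customer is male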
Regression:
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of features
such as equity, previous stock performance and macroeconomic indices. The system is
trained to estimate the price of the stock with the lowest possible error.
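A corresponding regression sketch (again with made-up numbers, only to show that the output is a continuous value) looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# features: [equity, macro_index]; target: stock price (toy numbers)
X = np.array([[1.0, 100], [1.2, 102], [1.5, 105], [1.7, 103], [2.0, 110]])
y = np.array([10.0, 11.5, 13.0, 13.5, 16.0])

reg = LinearRegression().fit(X, y)
print(reg.predict([[1.8, 108]]))   # a continuous value, not a class label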
4.1.1.1 ALGORITHM
tries to correct it.
4.1.2.1 ALGORITHM
customer
4.1.3.1 DEFINITIONS
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
5. REQUIREMENTS
5.1 SYSTEM REQUIREMENTS
5.1.1 HARDWARE
5.1.2 SOFTWARE
• Data Set
• Python 2.7
• Anaconda Navigator
• Pandas
• NumPy
• Sklearn
• Seaborn
• Matplotlib
6. SOURCE CODE
# for dataframes
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# for scaling and evaluation
from sklearn.preprocessing import minmax_scale
from sklearn.metrics import accuracy_score
from math import sqrt
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv('indian_liver_patient.csv')
df.shape
(583, 11)
df.columns
df.head()
df.dtypes[df.dtypes=='object']
Gender object
dtype: object
Figure 6.1 Body chemical level
(Truncated output of a df.describe() summary; only the 75% and max rows of the first four
numeric columns (Age, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase) survived
extraction: 75% row 58.0, 2.6, 1.3, 298.0; max row 90.0, 75.0, 19.7, 2110.0.)
def partition(x):          # reconstructed: only "return 1" of this function survived extraction
    if x == 2:             # assumed mapping: original label 2 (no liver disease) -> 0, 1 -> 1
        return 0
    return 1
df['Dataset'] = df['Dataset'].map(partition)
Gender
count 583
unique 2
top Male
freq 441
Figure 6.2 Example dataset
Dataset Gender
1 1 Male
2 1 Male
3 1 Male
4 1 Male
5 1 Male
Age seems to be a factor for liver disease for both male and female genders
sns.countplot(data=df, x = 'Gender', label='Count')
M, F = df['Gender'].value_counts()
print('Number of patients that are male: ',M)
print('Number of patients that are female: ',F)
There are more male patients than female patients
Label Male as 0 and Female as 1
## if gender == 'Male', mark 0; else 1
def partition(x):
    if x == 'Male':
        return 0
    return 1
df['Gender'] = df['Gender'].map(partition)
Figure 6.3 2-D scatter plot Direct_Bilirubin vs Total_Bilirubin
sns.set_style('whitegrid') ## Background Grid
sns.FacetGrid(df, hue = 'Dataset', size = 5).map(plt.scatter, 'Total_Protiens',
'Albumin_and_Globulin_Ratio').add_legend()
df.corr()
(Output of df.corr(), reconstructed from the extracted pages. Only the columns for
Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase, Aspartate_Aminotransferase
and Total_Protiens survived in full; of the Age, Gender and Total_Bilirubin columns only two
rows survived (Albumin_and_Globulin_Ratio: -0.216408, 0.003424, -0.206267; Dataset: 0.137351,
-0.082416, 0.220208), and the remaining columns were lost.)

                             Direct_    Alkaline_    Alamine_      Aspartate_    Total_
                             Bilirubin  Phosphotase  Aminotransf.  Aminotransf.  Protiens
Age                           0.007529   0.080425    -0.086883     -0.019910     -0.187461
Gender                       -0.100436   0.027496    -0.082332     -0.080336      0.089121
Total_Bilirubin               0.874618   0.206669     0.214065      0.237831     -0.008099
Direct_Bilirubin              1.000000   0.234939     0.233894      0.257544     -0.000139
Alkaline_Phosphotase          0.234939   1.000000     0.125680      0.167196     -0.028514
Alamine_Aminotransferase      0.233894   0.125680     1.000000      0.791966     -0.042518
Aspartate_Aminotransferase    0.257544   0.167196     0.791966      1.000000     -0.025645
Total_Protiens               -0.000139  -0.028514    -0.042518     -0.025645      1.000000
Albumin                      -0.228531  -0.165453    -0.029742     -0.085290      0.784053
Albumin_and_Globulin_Ratio   -0.200125  -0.234166    -0.002375     -0.070040      0.234887
Dataset                       0.246046   0.184866     0.163416      0.151934     -0.035008
plt.figure(figsize=(10,10))
sns.heatmap(df.corr())
Data Cleaning
df = df.drop_duplicates()
print( df.shape )
(570, 11)
There were 13 duplicates
Removing Outliers
sns.boxplot(df.Aspartate_Aminotransferase)
df.Aspartate_Aminotransferase.sort_values(ascending=False).head()
135 4929
117 2946
118 1600
207 1500
119 1050
Name: Aspartate_Aminotransferase, dtype: int64
df = df[df.Aspartate_Aminotransferase <=3000 ]
df.shape
(569, 11)
sns.boxplot(df.Aspartate_Aminotransferase)
df.Aspartate_Aminotransferase.sort_values(ascending=False).head()
117 2946
118 1600
207 1500
199 1050
119 1050
Name: Aspartate_Aminotransferase, dtype: int64
df = df[df.Aspartate_Aminotransferase <=2500 ]
df.shape
(568, 11)
df.isnull().values.any()
True
df=df.dropna(how='any')
#how : {‘any’, ‘all’}
#any : if any NA values are present, drop that label
#all : if all values are NA, drop that label
df.shape
(564, 11)
df.head()
0 3.3 0.90 1
1 3.2 0.74 1
2 3.3 0.89 1
3 3.4 1.00 1
4 2.4 0.40 1
df = df.sample(frac=1).reset_index(drop=True)
df.head()
3 1.8 0.60 1
4 2.2 0.62 1
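The step that actually builds X_train, X_test, y_train and y_test does not appear in the extracted listing. A minimal reconstruction, assuming the 'Dataset' column as the label and an 80/20 split (which matches the 113-row test set seen in the outputs below), would be:

from sklearn.model_selection import train_test_split

# separate the features from the 'Dataset' label and hold out 20% of the rows for testing
X = df.drop('Dataset', axis=1)
y = df['Dataset']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)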
X_train=minmax_scale(X_train)
X_test=minmax_scale(X_test)
Logistic Regression
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train, y_train)
predict1=lr.predict(X_test)
predict1
array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1], dtype=int64)
model1=accuracy_score(y_test,predict1)
print(model1)
0.7256637168141593
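The abstract mentions reporting results as a confusion matrix; that code is not in the extracted listing, but a short sketch for the logistic regression predictions above would be:

from sklearn.metrics import confusion_matrix

# rows = actual class, columns = predicted class
print(confusion_matrix(y_test, predict1))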
SVM
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
predict2=svclassifier.predict(X_test)
print(predict2)
model2=accuracy_score(y_test,predict2)
print(model2)  # note: the original listing printed model1 here, so the value shown below is the logistic regression accuracy
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]
0.7256637168141593
NN
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2),
random_state=1)
nn.fit(X_train,y_train)
predict3=nn.predict(X_test)
#predict3
model3=accuracy_score(y_test,predict3)
print(model3)
0.7168141592920354
Random Forest
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier(random_state=150)
random.fit(X_train, y_train)
predict4=random.predict(X_test)  # the original listing reused nn.predict here by mistake, which is why the accuracy printed below equals the neural network's
#predict4
model4=accuracy_score(y_test,predict4)
print(model4)
0.7168141592920354
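The code that draws the model-comparison chart of Figure 6.4 is missing from the extracted listing apart from the final plt.show() call below. A minimal sketch, assuming a simple bar chart of the four accuracy scores computed above, might be:

# bar chart comparing the test accuracies of the four models
names = ['Logistic Regression', 'SVM', 'Neural Network', 'Random Forest']
scores = [model1, model2, model3, model4]
plt.figure(figsize=(8, 5))
plt.bar(range(len(names)), scores, color='steelblue')
plt.xticks(range(len(names)), names, rotation=20)
plt.ylabel('Accuracy')
plt.title('Accuracy comparison of the four classifiers')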
plt.show()
Figure 6.4 Logistic Regression vs SVM vs Neural Network (MLP) vs Random Forest classifiers
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
In this project, we have proposed methods for diagnosing liver disease in patients
using machine learning techniques. The four machine learning techniques used were
Logistic Regression, SVM, an Artificial Neural Network (MLP) and Random Forest. The
system was implemented with all of these models, and their performance was evaluated on
a held-out test set using accuracy as the performance metric.
Logistic Regression was the model that resulted in the highest reported test accuracy,
at about 72.6%, with the neural network close behind at about 71.7%. Comparing this work
with previous research works, the results confirm that standard machine learning
classifiers can give a useful first indication of liver disease.
Various studies have demonstrated the strong potential of data mining and ML tools in
the medical domain. These tools can discover hidden, significant predictive parameters
in medical datasets and thereby support early prediction and diagnosis of diseases.
Regarding future scope, ML techniques are highly promising for diagnosing liver
diseases, but further data proving their validity and efficiency is required before
physicians can rely on them routinely.