Internship Progress Report
Submitted by
BACHELOR OF ENGINEERING
IN
2021-2022
VELAMMAL ENGINEERING COLLEGE
CHENNAI -66
BONAFIDE CERTIFICATE
Certified that this internship report “Liver disease prediction using Machine
Learning” is the bonafide work of AMEER BATCHA S (113219031009),
DEEPAK S (113219031033), RAHUL PRAKASH S (113219031117) and
RAJ KUMAR M (113219031118), carried out at “PANTECH SOLUTIONS” during
07.12.2021 to 07.01.2022.
SIGNATURE SIGNATURE
CERTIFICATE FROM INDUSTRY
CERTIFICATE OF EVALUATION
Sl. No   Name of the student who has done the Internship   Title of the Internship    Name of Faculty Coordinator with designation
1        AMEER BATCHA S                                     LIVER DISEASE PREDICTION   Ms. P. SARANYA
2        DEEPAK S                                           LIVER DISEASE PREDICTION   Ms. P. SARANYA
3        RAHUL PRAKASH S                                    LIVER DISEASE PREDICTION   Ms. P. SARANYA
4        RAJ KUMAR M                                        LIVER DISEASE PREDICTION   Ms. P. SARANYA
This internship report, submitted by the above students in partial fulfillment of the
requirements for the award of the Bachelor of Engineering degree in Computer Science
and Engineering at Velammal Engineering College, was evaluated and confirmed to be a
report of the work done by the above students, and was then assessed.
TABLE OF CONTENTS
ABSTRACT X
ACKNOWLEDGEMENT XI
1 INTRODUCTION
1.1 EXISTING METHODOLOGY 1
1.1.1 LIMITATION 1
2 MODULES
2.1 MODULES 4
2.1.1 DATA COLLECTION 4
2.1.2 DATA PRE-PROCESSING 4
2.1.2.1 FORMATTING 5
2.1.2.2 CLEANING 5
2.1.2.3 SAMPLING 5
2.1.3 FEATURE EXTRACTION 6
2.1.4 EVALUATION MODEL 6
3 DATA FLOW DIAGRAM 7
3.2 WORK FLOW DIAGRAM 10
3.3 UML DIAGRAM 11
3.4 SEQUENCE DIAGRAM 12
3.5 ACTIVITY DIAGRAM 13
4 DOMAIN SPECIFICATION
4.1 MACHINE LEARNING 14
4.1.1 SUPERVISED LEARNING 15
4.1.1.1 ALGORITHM 16
4.1.2 UNSUPERVISED LEARNING 16
4.1.2.1 ALGORITHM 16
4.1.3 REINFORCEMENT LEARNING 19
4.1.3.1 DEFINITION 19
5 REQUIREMENTS
5.1 SYSTEM REQUIREMENTS
5.1.1 HARDWARE 20
5.1.2 SOFTWARE 20
6 SOURCE CODE 21
7 CONCLUSION AND FUTURE WORK 46
ABSTRACT
Liver disease is becoming one of the most fatal diseases in several countries. The number
of patients with liver disease has been rising continuously because of excessive consumption
of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs.
In this work, liver patient datasets are investigated in order to build classification models
that predict liver disease. The dataset was used to evaluate prediction algorithms in an
effort to reduce the burden on doctors. In this report, we propose to screen patients for
liver disease using machine learning algorithms.
Chronic liver disease refers to disease of the liver that lasts over a period of six
months. The classifier output is treated as positive or negative information about whether
a patient has the disease, and the results are reported as a confusion matrix together with
the prediction accuracy. We propose several classification schemes that can effectively
improve classification performance when a training dataset is available. The dataset used
here contains records of 583 patients.
From this dataset we extract the relevant attributes, separate the positive and negative
cases, and train machine learning classifiers on them. The outputs of the proposed
classification models indicate the accuracy achieved in predicting the result.
ACKNOWLEDGEMENT
Executive Officer, Thiru. M.V.M. Velmurugan, for their extensive support.
We are grateful to all the staff members of the Department of Computer Science and
Engineering for providing the necessary facilities to carry out the project. We would
especially like to thank our parents for providing us with the unique opportunity to work,
and for their encouragement and support at all levels.
LIST OF FIGURES
1.1 System Architecture 3
6.1 Body chemical level 25
6.2 Example dataset 28
6.3 2-D scatter plot Direct_Bilirubin vs Total_Bilirubin 31
6.4 Logistic Regression vs SVM vs Neural Network (MLP) vs Random Forest classifiers 45
1. INTRODUCTION
1.1.1 LIMITATION
● Certain approaches are applicable only to small datasets.
● Certain combinations of classifiers overfit the dataset while others underfit it.
● Some approaches are not adaptable to real-time collection and use of data.
predictions by the model. ROC-AUC considers the rank of the output probabilities and
intuitively measures the likelihood that the model can distinguish between a positive
point and a negative point. (Note: ROC-AUC is typically used for binary classification
only.) We will use AUC to select the best model among the various machine learning models.
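As an illustration of how AUC can drive model selection, the sketch below compares two scikit-learn models by ROC-AUC; it is not part of the original report and uses synthetic data in place of the liver dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# synthetic binary-classification data standing in for the liver dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# score each candidate model by ROC-AUC on the held-out set and keep the best one
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc, 3))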
1.2.1 ADVANTAGE
● The classification performance for liver-based diseases is further improved.
● Time complexity and accuracy can be measured for the various machine learning
models, so that different models can be compared.
● The different machine learning models achieve a high accuracy of results.
● Risk factors can be predicted early by the machine learning models.
1.2.2 SYSTEM ARCHITECTURE
(Figure 1.1, System Architecture: Dataset → Data pre-processing → Feature extraction →
Machine learning model (ML algorithms) → Classifier, applied to test data samples →
Data classification → Result.)
2. MODULES
2.1 MODULES
• DATA COLLECTION
• DATA PRE-PROCESSING
• FEATURE EXTRACTION
• EVALUATION MODEL
2.1.1 DATA COLLECTION
Data used in this work is the Indian liver patient dataset, a set of patient records with
laboratory test values. This step is
concerned with selecting the subset of all available data that you will be working
with. ML problems start with data preferably, lots of data (examples or
observations) for which you already know the target answer. Data for which you
already know the target answer is called labelled data.
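For example (a minimal sketch, not part of the original report, assuming the Indian liver patient CSV used in the source code chapter), labelled data is simply a table of feature columns plus a column holding the known target answer:

import pandas as pd

# each row is one observation; the 'Dataset' column holds the known target answer (label)
df = pd.read_csv('indian_liver_patient.csv')
features = df.drop('Dataset', axis=1)   # inputs
labels = df['Dataset']                  # known answers (1 = liver patient, 2 = otherwise)
print(features.shape, labels.value_counts().to_dict())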
2.1.2 DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling it. Three common data
pre-processing steps are:
• Formatting
• Cleaning
• Sampling
2.1.2.1 Formatting:
The data you have selected may not be in a format that is suitable for you to work
with. The data may be in a relational database and you would like it in a flat file, or
the data may be in a proprietary file format and you would like it in a relational
database or a text file.
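As a small, hypothetical example of such reformatting (the database file and table name below are made up for illustration), pandas can pull a relational table into a flat CSV file:

import sqlite3
import pandas as pd

# read a relational table and write it out as a flat file
conn = sqlite3.connect('hospital.db')            # hypothetical database
df = pd.read_sql_query('SELECT * FROM patients', conn)
conn.close()
df.to_csv('patients_flat.csv', index=False)      # flat-file format for the ML tools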
2.1.2.2 Cleaning:
Cleaning data is the removal or fixing of missing data. There may be data instances
that are incomplete and do not carry the data you believe you need to address the
problem. These instances may need to be removed. Additionally, there may be
sensitive information in some of the attributes and these attributes may need to be
anonymized or removed from the data entirely.
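A minimal cleaning sketch along these lines (the 'Patient_Name' column is hypothetical; the liver dataset itself has no such field) could be:

import pandas as pd

df = pd.read_csv('indian_liver_patient.csv')
# drop incomplete instances and exact duplicates
df = df.dropna(how='any').drop_duplicates()
# drop (or anonymize) sensitive attributes if any were present
df = df.drop(columns=['Patient_Name'], errors='ignore')
print(df.shape)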
2.1.2.3 Sampling:
There may be far more selected data available than you need to work with. More
data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
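For instance (a sketch, not from the original listing), a reproducible 10% random sample can be drawn with pandas for quick prototyping:

import pandas as pd

df = pd.read_csv('indian_liver_patient.csv')
# take a 10% sample of the rows for fast experimentation
sample_df = df.sample(frac=0.1, random_state=42)
print(len(df), '->', len(sample_df))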
2.1.3 FEATURE EXTRACTION
3. DATA FLOW DIAGRAM
LEVEL 0
LEVEL 1
LEVEL 1
3.2 WORK FLOW DIAGRAM
3.3 UML DIAGRAM
3.4 SEQUENCE DIAGRAM
3.5 ACTIVITY DIAGRAM
4. DOMAIN SPECIFICATION
4.1 MACHINE LEARNING
Machine learning is a system that can learn from examples through self-improvement,
without being explicitly coded by a programmer. The breakthrough came with the idea that
a machine can learn from data (i.e., examples) on its own to produce accurate results.
Machine learning combines data with statistical tools to predict an output. This output
is then used by businesses to derive actionable insights. Machine learning is closely
related to data mining and Bayesian predictive modeling. The machine receives data as
input and uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's
historical data. Tech companies use unsupervised learning to improve the user experience
with personalized recommendations.
Machine learning is also used for a variety of tasks such as fraud detection, predictive
maintenance, portfolio optimization, task automation, and so on.
4.1.1 SUPERVISED LEARNING
An algorithm uses training data and feedback from humans to learn the relationship
between given inputs and a given output. For instance, a practitioner can use marketing
expense and weather forecasts as input data to predict the sales of cans.
Supervised learning is used when the output is known for the training data; the trained
algorithm then predicts the output for new data.
There are two categories of supervised learning:
● Classification task
● Regression task
Classification:
Imagine you want to predict the gender of a customer for a commercial. You would start
by gathering data on the height, weight, job, salary, purchasing basket, etc. from your
customer database. You know the gender of each of your customers; it can only be male or
female. The objective of the classifier is to assign a probability of being male or
female (i.e., the label) based on the information (i.e., the features you have
collected). Once the model has learned to recognize male or female, you can use new data
to make a prediction. For instance, if you get new information about an unknown customer
and the classifier predicts male = 70%, it means the algorithm is 70% sure that this
customer is a male and 30% sure that it is a female.
The label can have two or more classes. The above example has only two classes, but if a
classifier needs to predict objects, it may have dozens of classes (e.g., glass, table,
shoes; each object type represents a class).
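A toy version of this customer-gender classifier (with entirely made-up data, not taken from the report) can be written with scikit-learn; predict_proba returns exactly the kind of 70%/30% probabilities described above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [height_cm, weight_kg]; labels: 1 = male, 0 = female (toy data)
X = np.array([[180, 80], [175, 77], [168, 60], [160, 52], [185, 90], [158, 50]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
new_customer = np.array([[172, 70]])
print(clf.predict_proba(new_customer))   # e.g. [[0.30, 0.70]] -> about 70% sure the customer is male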
Regression:
When the output is a continuous value, the task is a regression. For instance, a
financial analyst may need to forecast the value of a stock based on a range of features
such as equity, previous stock performance and macroeconomic indices. The system is
trained to estimate the price of the stock with the lowest possible error.
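A corresponding regression sketch (again with made-up numbers, only to show that the output is a continuous value) looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# features: [equity, macro_index]; target: stock price (toy numbers)
X = np.array([[1.0, 100], [1.2, 102], [1.5, 105], [1.7, 103], [2.0, 110]])
y = np.array([10.0, 11.5, 13.0, 13.5, 16.0])

reg = LinearRegression().fit(X, y)
print(reg.predict([[1.8, 108]]))   # a continuous value, not a class label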
4.1.1.1 ALGORITHM
tries to correct it.
4.1.2.1 ALGORITHM
customer
4.1.3.1 DEFINITIONS
● Q-learning
● Deep Q network
● State-Action-Reward-State-Action (SARSA)
● Deep Deterministic Policy Gradient (DDPG)
5. REQUIREMENTS
5.1 SYSTEM REQUIREMENTS
5.1.1 HARDWARE
5.1.2 SOFTWARE
• Data Set
• Python 2.7
• Anaconda Navigator
• Pandas
• NumPy
• Sklearn
• Seaborn
• Matplotlib
6. SOURCE CODE
# for dataframes
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# for scaling and evaluation
from sklearn.preprocessing import minmax_scale
from sklearn.metrics import accuracy_score
from math import sqrt
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv('indian_liver_patient.csv')
df.shape
(583, 11)
df.columns
df.head()
df.dtypes[df.dtypes=='object']
Gender object
dtype: object
Figure 6.1 Body chemical level
(Truncated output of a df.describe() summary; only the 75% and max rows of the first four
numeric columns (Age, Total_Bilirubin, Direct_Bilirubin, Alkaline_Phosphotase) survived
extraction: 75% row 58.0, 2.6, 1.3, 298.0; max row 90.0, 75.0, 19.7, 2110.0.)
def partition(x):          # reconstructed: only "return 1" of this function survived extraction
    if x == 2:             # assumed mapping: original label 2 (no liver disease) -> 0, 1 -> 1
        return 0
    return 1
df['Dataset'] = df['Dataset'].map(partition)
Gender
count 583
unique 2
top Male
freq 441
Figure 6.2 Example dataset
Dataset Gender
1 1 Male
2 1 Male
3 1 Male
4 1 Male
5 1 Male
Age seems to be a factor for liver disease for both male and female genders
sns.countplot(data=df, x = 'Gender', label='Count')
M, F = df['Gender'].value_counts()
print('Number of patients that are male: ',M)
print('Number of patients that are female: ',F)
There are more male patients than female patients
Label Male as 0 and Female as 1
## if gender == 'Male', mark 0; else 1
def partition(x):
    if x == 'Male':
        return 0
    return 1
df['Gender'] = df['Gender'].map(partition)
Figure 6.3 2-D scatter plot Direct_Bilirubin vs Total_Bilirubin
sns.set_style('whitegrid') ## Background Grid
sns.FacetGrid(df, hue = 'Dataset', size = 5).map(plt.scatter, 'Total_Protiens',
'Albumin_and_Globulin_Ratio').add_legend()
df.corr()
(Output of df.corr(), reconstructed from the extracted pages. Only the columns for
Direct_Bilirubin, Alkaline_Phosphotase, Alamine_Aminotransferase, Aspartate_Aminotransferase
and Total_Protiens survived in full; of the Age, Gender and Total_Bilirubin columns only two
rows survived (Albumin_and_Globulin_Ratio: -0.216408, 0.003424, -0.206267; Dataset: 0.137351,
-0.082416, 0.220208), and the remaining columns were lost.)

                             Direct_    Alkaline_    Alamine_      Aspartate_    Total_
                             Bilirubin  Phosphotase  Aminotransf.  Aminotransf.  Protiens
Age                           0.007529   0.080425    -0.086883     -0.019910     -0.187461
Gender                       -0.100436   0.027496    -0.082332     -0.080336      0.089121
Total_Bilirubin               0.874618   0.206669     0.214065      0.237831     -0.008099
Direct_Bilirubin              1.000000   0.234939     0.233894      0.257544     -0.000139
Alkaline_Phosphotase          0.234939   1.000000     0.125680      0.167196     -0.028514
Alamine_Aminotransferase      0.233894   0.125680     1.000000      0.791966     -0.042518
Aspartate_Aminotransferase    0.257544   0.167196     0.791966      1.000000     -0.025645
Total_Protiens               -0.000139  -0.028514    -0.042518     -0.025645      1.000000
Albumin                      -0.228531  -0.165453    -0.029742     -0.085290      0.784053
Albumin_and_Globulin_Ratio   -0.200125  -0.234166    -0.002375     -0.070040      0.234887
Dataset                       0.246046   0.184866     0.163416      0.151934     -0.035008
plt.figure(figsize=(10,10))
sns.heatmap(df.corr())
Data Cleaning
df = df.drop_duplicates()
print( df.shape )
(570, 11)
There were 13 duplicates
Removing Outliers
sns.boxplot(df.Aspartate_Aminotransferase)
df.Aspartate_Aminotransferase.sort_values(ascending=False).head()
135 4929
117 2946
118 1600
207 1500
119 1050
Name: Aspartate_Aminotransferase, dtype: int64
df = df[df.Aspartate_Aminotransferase <=3000 ]
df.shape
(569, 11)
sns.boxplot(df.Aspartate_Aminotransferase)
df.Aspartate_Aminotransferase.sort_values(ascending=False).head()
117 2946
118 1600
207 1500
199 1050
119 1050
Name: Aspartate_Aminotransferase, dtype: int64
df = df[df.Aspartate_Aminotransferase <=2500 ]
df.shape
(568, 11)
df.isnull().values.any()
True
df=df.dropna(how='any')
#how : {‘any’, ‘all’}
#any : if any NA values are present, drop that label
#all : if all values are NA, drop that label
df.shape
(564, 11)
df.head()
0 3.3 0.90 1
1 3.2 0.74 1
2 3.3 0.89 1
3 3.4 1.00 1
4 2.4 0.40 1
df = df.sample(frac=1).reset_index(drop=True)
df.head()
3 1.8 0.60 1
4 2.2 0.62 1
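The step that actually builds X_train, X_test, y_train and y_test does not appear in the extracted listing. A minimal reconstruction, assuming the 'Dataset' column as the label and an 80/20 split (which matches the 113-row test set seen in the outputs below), would be:

from sklearn.model_selection import train_test_split

# separate the features from the 'Dataset' label and hold out 20% of the rows for testing
X = df.drop('Dataset', axis=1)
y = df['Dataset']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)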
X_train=minmax_scale(X_train)
X_test=minmax_scale(X_test)
Logistic Regression
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(X_train, y_train)
predict1=lr.predict(X_test)
predict1
array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1], dtype=int64)
model1=accuracy_score(y_test,predict1)
print(model1)
0.7256637168141593
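The abstract mentions reporting results as a confusion matrix; that code is not in the extracted listing, but a short sketch for the logistic regression predictions above would be:

from sklearn.metrics import confusion_matrix

# rows = actual class, columns = predicted class
print(confusion_matrix(y_test, predict1))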
SVM
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
predict2=svclassifier.predict(X_test)
print(predict2)
model2=accuracy_score(y_test,predict2)
print(model2)  # note: the original listing printed model1 here, so the value shown below is the logistic regression accuracy
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1]
0.7256637168141593
NN
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2),
random_state=1)
nn.fit(X_train,y_train)
predict3=nn.predict(X_test)
#predict3
model3=accuracy_score(y_test,predict3)
print(model3)
0.7168141592920354
Random Forest
from sklearn.ensemble import RandomForestClassifier
random = RandomForestClassifier(random_state=150)
random.fit(X_train, y_train)
predict4=random.predict(X_test)  # the original listing reused nn.predict here by mistake, which is why the accuracy printed below equals the neural network's
#predict4
model4=accuracy_score(y_test,predict4)
print(model4)
0.7168141592920354
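The code that draws the model-comparison chart of Figure 6.4 is missing from the extracted listing apart from the final plt.show() call below. A minimal sketch, assuming a simple bar chart of the four accuracy scores computed above, might be:

# bar chart comparing the test accuracies of the four models
names = ['Logistic Regression', 'SVM', 'Neural Network', 'Random Forest']
scores = [model1, model2, model3, model4]
plt.figure(figsize=(8, 5))
plt.bar(range(len(names)), scores, color='steelblue')
plt.xticks(range(len(names)), names, rotation=20)
plt.ylabel('Accuracy')
plt.title('Accuracy comparison of the four classifiers')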
plt.show()
Figure 6.4 Logistic Regression vs SVM vs Neural Network (MLP) vs Random Forest classifiers
7. CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
In this project, we have proposed methods for diagnosing liver disease in patients
using machine learning techniques. The four machine learning techniques used were
Logistic Regression, SVM, an Artificial Neural Network (MLP) and Random Forest. The
system was implemented with all of these models, and their performance was evaluated on
a held-out test set using accuracy as the performance metric.
Logistic Regression was the model that resulted in the highest reported test accuracy,
at about 72.6%, with the neural network close behind at about 71.7%. Comparing this work
with previous research works, the results confirm that standard machine learning
classifiers can give a useful first indication of liver disease.
Various studies have demonstrated the strong potential of data mining and ML tools in
the medical domain. These tools can discover hidden, significant predictive parameters
in medical datasets and thereby support early prediction and diagnosis of diseases.
Regarding future scope, ML techniques are highly promising for diagnosing liver
diseases, but further data proving their validity and efficiency is required before
physicians can rely on them routinely.