Hospital prediction using data mining
Hospital prediction using data mining
Prof.S.R Jadhav
HOD Prof.V.D Badgujar
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - One of the most critical problems in healthcare is predicting the likelihood of hospital readmission
in case of chronic diseases such as diabetes to be able to allocate necessary resources such as beds, rooms,
specialists, and medical staff, for an acceptable quality of service. Unfortunately relatively few research
studies in the literature attempted to tackle this problem; the majority of the research studies are concerned
with predicting the likelihood of the diseases themselves. Numerous machine learning techniques are suitable
for prediction. Nevertheless, there is also shortage in adequate comparative studies that specify the most
suitable techniques for the prediction process. Towards this goal, this paper presents a comparative study
among five common techniques in the literature for predicting the likelihood of hospital readmission in case of
diabetic patients. Those techniques are logistic regression (LR) analysis, multi-layer perceptron (MLP), Naïve
Bayesian (NB) classifier, decision tree, and support vector machine (SVM). The comparative study is based on
realistic data gathered from a number of hospitals in the United States. The comparative study revealed that
SVM showed best performance, while the NB classifier and LR analysis were the worst.
Key Words: Decision tree; hospital readmission; logistic regression; machine learning; multi-layer
perceptron; Naïve Bayesian classifier; support vector machines
1.INTRODUCTION
Nowadays, numerous chronic diseases, such as diabetes, are widespread in the world; and the number of
patients is increasing continuously. The estimated number of diabetic adults in 2014 was 422 million versus
108 million in 1980 [1]. Such patients visit hospitals frequently, requiring continuous preparation for
ensuring the availability of required resources including hospital beds, rooms, and enough medical staff for
an acceptable quality of service. Accordingly, predicting the likelihood of readmission of a given patient is
of ultimate importance. In fact readmission during a one month period (30 days) of discharge indicates "a
high-priority healthcare quality measure" and the goal is to address this problem [2].
Machine learning, which is one of the most important branches of artificial intelligence, provides methods
and techniques for learning from experience [3]. Researchers often use it for complex statistical analysis
tasks [4]. It is a wide multidisciplinary domain which is based on numerous
disciplines including, but not limited to, data processing, statistics, algebra, knowledge analytics,
information theory, control theory, biology, statistics, cognitive science, philosophy, and complexity of
computations. This field plays an important role in term of discovering valuable knowledge from databases
which could contain records of supply maintenance, medical records, financial transactions, applications
of loans, etc. [5].
As indicated in Fig. 1, machine learning techniques can be broadly classified into three main categories
[3]. Supervised learning techniques involve learning from training data, guided by the data scientist. There
are two basic types of learning missions: classification and regression. Models of classification attempt to
predict distinguished classes, such as blood groups, while models of regression prognosticate numerical
values [3]. In unsupervised learning, on the other hand, the system could attempt to find hidden data
patterns, associations among features or variables, or data trends [3], [4]. The main objective of
sum up this work via a conclusion and discussion of possible future work.
This section discusses the five basic machine learning techniques employed in this research study.
A. Logistic Regression Analysis
Regression is a statistical notion that can be used to identify the relationship weight between one variable
called the dependent variable and a group of other changeable variables denoted as the independent
variables. Logistic regression (LR) is a non-linear regression model, used to estimate the likelihood that an
event will occur as a function of others [13].
of
METHODOLOGY:
Before starting the comparative study, it is important to understand the data, perform preprocessing if
necessary, and select features appropriate for the experiments as depicted in Fig. 2. Those tasks are explained
below. It is worth noting that all the experiments were conducted using Python.
A. Data Preparation
1) Understanding data: In this study, we exploited a sample of a diabetes patients‟ dataset, which has
been extracted from many hospitals in the United States [11], [12]. This dataset includes 3090 instances in
the age range of 30-50 and with 18 attributes. Table I depicts the variables of the dataset together with their
descriptions. The scientific meanings of those variables are beyond the scope of this paper. Fig. 3 through 8
depict the distribution of those features.
2) Data pre-processing: This is a very important stage which includes data transformation and cleaning.
In data transformation, some variables were transformed from categorical to binary (0/1) such as (Change,
DM, G, and A). Some other variables were transformed from integer to string such as AS, DI, and AS. In
data cleaning, some values of categorical data were missing and had to be accounted for. For this purpose,
we employed imputation (substitution) via the mode of the categorical data.
B. Feature Selection: In this step, we perform feature selection for dimensionality reduction. In other
words, we select the most relevant features. In this study, towards this goal, we assessed the impact of
variables on our target. This helped us eliminate variables with low importance. Features which have
Race R Categorical
Gender G Categorical
Age A Categorical
Naïve bayesian classifier: A NB model was created using Gaussian Naive Bayes, which assumes that the
attributes follow a natural distribution.
1) Multi-Layer perceptron: We built a MLP network using 18 inputs. The number of neurons in a
hidden layer was
5. The function of the neurons was stochastic gradient descent. The maximum number of iterations was
300, and the two outputs were (readmitted < 30 and readmitted > 30). Table V illustrates that result of MLP
weight matrix after training.
This work utilized various performance measures to compare the studied techniques [31]. Specifically,
we relied on accuracy, recall, precision, and F1 scores for this purpose. Those parameters are defined in terms
of the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) as indicated in
equations (1) through (4). TPs are cases in which we predicted yes (they will be readmitted in a month
period), and they were really readmitted. TNs are cases in which we predicted no, and they were not
readmitted. On the other hand, FPs are cases in which we predicted yes, but they were not actually
readmitted; Type I error. Finally, FNs are cases in which we predicted no, but they were actuallyreadmitted;
Type II error.
CONCLUSIONS:
This paper presented a comparative study among five machine learning techniques; namely LR, MLP,
NB classifier, decision trees, and SVMS; for predicting the likelihood of hospital readmission of diabetes
patients. The study relied on real data collected from hospitals in the United States. Based on the study, the
SVM provided the best performance. Nevertheless, the study will be extended to compare additional
techniques and larger datasets will be considered as well.
REFERENCES:
[1] 1 G. Roglic, "Global report on diabetes.," World Heal. Organ., vol. 58, no. 12, pp. 1-88, 2016.
[2] D. Rubin, K. Donnell-Jackson, R. Jhingan, S. Golden, and A. Paranjape, "Early readmission among
patients with diabetes: A qualitative assessment of contributing factors," J. Diabetes Complications,
vol. 28, no. 6, pp. 869-873, 2014.
[3] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, and
I. Chouvarda, "Machine Learning and Data Mining Methods in Diabetes Research," Comput. Struct.
Biotechnol. J., vol. 15, pp. 104-116, 2017.
[4] P. Chowriappa, S. Dua, and Y. Todorov, "Machine Learning in Healthcare Informatics," vol. 56, pp. 1-
23, 2014.
[5] T. Mitchell, "Machine learning (mcgraw-hill international editions computer science series)," 1997.
[6] E. Bose and K. Radhakrishnan, "Using Unsupervised Machine Learning to Identify Subgroups among
Home Health Patients with Heart Failure Using Telehealth," CIN - Comput. Informatics Nurs., vol. 36,
no. 5, pp. 242-248, 2018.
[7] L. Kaelbling, A. Littman, and A. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol.
4, pp. 237-285, 1996.
[8] K. Shailaja, B. Seetharamulu, and M. Jabbar, "Machine Learning in Healthcare: A Review," 2018
Second Int. Conf. Electron. Commun. Aerosp. Technol., no. Iceca, pp. 910-914, 2018.
[9] J. Davies and J. Gibbons, "Machine Learning and Software Engineering in Health Informatics," in
Proceedings of the First International
Workshop on Realizing AI Synergies in Software Engineering, 2012, pp. 37-41.
[10] R. Bhardwaj, A. Nambiar, and D. Dutta, "A Study of Machine Learning in Healthcare," Proc. - Int.
Comput. Softw. Appl. Conf., vol. 2, pp. 236- 241, 2017.
[11] A. Asuncion and D. Newman, "UCI Machine Learning Repository," 2007. [Online]. Available:
https://archive.ics.uci.edu/ml/index.php.
[12] S. B. et al., "Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical
database patient records," Biomed Res. Int., vol. 2014, 2014.
[13] A. H. Karp, "Using logistic regression to predict customer retention," Proc. Elev. Northeast SAS
Users Gr. Conf. http//www. lexjansen. com/nesug/nesug98/solu/p095. pdf, 1998.