Research Article
Using Machine Learning Algorithms to Predict Hepatitis
B Surface Antigen Seroclearance
Xiaolu Tian,1 Yutian Chong,2 Yutao Huang,3 Pi Guo,4 Mengjie Li,1 Wangjian Zhang,5 Zhicheng Du,1 Xiangyong Li,2 and Yuantao Hao1,6
1 Department of Medical Statistics and Epidemiology & Health Information Research Center & Guangdong Key Laboratory of Medicine, School of Public Health, Sun Yat-sen University, Guangzhou 510080, China
2 Department of Infectious Diseases, The Third Affiliated Hospital, Sun Yat-sen University, Guangzhou 510630, China
3 School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China
4 Department of Public Health, Medical College of Shantou University, Shantou 515063, China
5 Department of Environmental Health Sciences, School of Public Health, University at Albany, State University of New York, Rensselaer 12144, USA
6 Sun Yat-sen Global Health Institute, Sun Yat-sen University, Guangzhou 510080, China
Correspondence should be addressed to Xiangyong Li; [email protected] and Yuantao Hao; [email protected]
Copyright © 2019 Xiaolu Tian et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Hepatitis B surface antigen (HBsAg) seroclearance during treatment is associated with a better prognosis among patients with chronic hepatitis B (CHB). Significant gaps remain in our understanding of how to predict HBsAg seroclearance accurately and efficiently from obtainable clinical information. This study aimed to identify the optimal model to predict HBsAg seroclearance. We obtained laboratory and demographic information for 2,235 patients with CHB from the South China Hepatitis Monitoring and Administration (SCHEMA) cohort. HBsAg seroclearance occurred in 106 patients in total. We developed models based on four algorithms: extreme gradient boosting (XGBoost), random forest (RF), decision tree (DCT), and logistic regression (LR). The optimal model was identified by the area under the receiver operating characteristic curve (AUC). The AUCs for the XGBoost, RF, DCT, and LR models were 0.891, 0.829, 0.619, and 0.680, respectively, with XGBoost showing the best predictive performance. The variable importance plot of the XGBoost model indicated that the level of HBsAg was the most important predictor, followed by age and the level of hepatitis B virus (HBV) DNA. Machine learning algorithms, especially XGBoost, performed well in predicting HBsAg seroclearance. These results show the potential of machine learning algorithms for predicting HBsAg seroclearance from obtainable clinical data.
studies. Researchers have found that low serum HBsAg levels, alone or combined with a low serum HBV DNA load, are important determinants of HBsAg seroclearance [8, 14]. As for host characteristics, age is one of the most important factors in HBsAg seroconversion, followed by gender, fatty liver, cirrhosis at baseline or developed during follow-up, and the alanine aminotransferase (ALT) level at baseline [6]. However, previous studies developing prediction models were mainly based on long-term tracking of a limited set of factors and on traditional statistical methods, whose estimates may be biased owing to potential collinearity in high-dimensional medical data. To address this knowledge gap, in this study we used machine learning algorithms instead of traditional models to determine the association between obtainable clinical variables and HBsAg seroclearance. Machine learning algorithms have attracted considerable attention in the health domain in recent years. They have been successfully applied as powerful classification methods to extract effective information from high-dimensional, correlated, nonlinear, and imbalanced clinical datasets and to make accurate diagnoses and predictions [15, 16]. However, no existing model has been shown to achieve the best performance for HBsAg seroclearance prediction. In this study, we built machine learning models based on XGBoost, RF, DCT, and LR according to the characteristics of the dataset (high-dimensional and imbalanced) and aimed to identify the optimal one. The main purpose of this study is to identify the optimal machine learning model for predicting HBsAg seroclearance in a retrospective cohort of patients with CHB.

2. Materials and Methods

This study included chronic hepatitis B patients enrolled in the SCHEMA cohort (South China Hepatitis Monitoring and Administration cohort) between January 2006 and June 2015. Each patient was diagnosed following the “Guideline: prevention and treatment of viral hepatitis” revised in 2010 and was followed up by staff of the Infectious Diseases Department of the Third Affiliated Hospital, Sun Yat-sen University. For the current study, we excluded patients who met at least one of the following conditions: (1) lost to follow-up for over 6 months; (2) HBV DNA below the detection limit at baseline; (3) previous interferon treatment; (4) comorbidities such as hepatitis A/C/E virus infection, decompensated liver disease, autoimmune liver diseases, malignant tumors, and renal insufficiency; or (5) immunosuppressive (transplantation) therapy. In total, 2,235 CHB patients were included in this study.

The endpoint (HBsAg seroclearance) was defined as loss of HBsAg detectability during follow-up, determined qualitatively using ECL kits (Roche Laboratories, Germany; lower limit of detection (LLOD), 0.05 IU/ml). We collected the following information for each patient: age, gender, BMI (body mass index), drinking history, family history, diagnosis of the disease phase, treatment (including lamivudine (LAM), telbivudine (LDT), entecavir (ETV), adefovir (ADV), and tenofovir (TDF); the number of treatment changes was recorded as “lines”), virological response after treatments, routine pathology measurements, and other clinical measurements. Regular follow-ups were performed every 1–3 months. Thirty features, including laboratory tests, clinical manifestations, and drug treatment strategies, were recorded at baseline before the occurrence of HBsAg seroclearance. Verbal informed consent was obtained from all participants at their first and subsequent follow-up visits.

Results of routine liver biochemical function tests were also obtained, including serum levels of alanine aminotransferase (ALT), aspartate aminotransferase (AST), serum albumin (ALB), gamma-glutamyl transferase (GGT), total bilirubin (Tbil), and direct bilirubin (Dbil), as well as a range of erythrocyte and leucocyte markers (hemoglobin (Hb), platelet (PLT) count, and white blood cell (WBC) count). The measurements were performed on an automatic biochemical analyzer (7600-020; HITACHI, Tokyo, Japan) following a standard protocol.

Serum HBsAg and hepatitis B e antigen (HBeAg) were both measured quantitatively with Elecsys kits (Roche Laboratories, Germany). The serum HBV DNA level was measured using the Cobas TaqMan HBV RT-PCR test (CAP-CTM; Roche Molecular Systems).

Radiological indicators, including right liver oblique diameter, spleen portal width, spleen length, and spleen portal vein width, were measured to reflect the thickness and width of each patient’s liver and spleen.

A total of 30 variables were included in the dataset. Ten of them are categorical variables, including gender, drinking history, HBV family history, initial diagnosis, current diagnosis, lines (number of treatment replacements), initial treatment, current treatment, and virological response. The rest are continuous variables, including age, BMI, serum indicators, and radiological indicators. The dataset was split into a training dataset (70%) and a test dataset (30%) to train and test the machine learning models. The training set contains a known outcome, and the model learns from these data in order to generalize to new data. The test size was 0.30, indicating that 30% of the data were withheld for testing.

In this study, the predictive models were built with four machine learning classification algorithms: logistic regression, decision tree, random forest, and extreme gradient boosting, implemented in Python 3.6. Generating each model required tuning several key parameters. The value of each parameter was chosen by grid search with 5-fold (stratified K-fold) cross-validation, in which the training dataset was randomly split into 5 equal-size subsets for five rounds of cross-validation. Each round of cross-validation involved fitting the model on four subsets (the development set) and validating it on the remaining subset (the validation set). For evaluation purposes, metrics including the area under the receiver operating characteristic curve (AUC), the F score, the confusion matrix, precision, and recall were calculated by 5-fold (stratified K-fold) cross-validation. The F score is the harmonic mean of precision and recall. Precision is the proportion of cases labeled positive by the classifier that are actually positive. Sensitivity represents the true-positive recognition rate:
\[
\text{F-score} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad
\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
\text{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{1}
\]

where a true positive (TP) is a positive case correctly identified as positive, a false positive (FP) is a negative case incorrectly identified as positive, and a false negative (FN) is a positive case incorrectly identified as negative.
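As a concrete illustration, the three quantities in equation (1) can be computed directly from confusion-matrix counts. The short Python sketch below is illustrative only; the function name and the example counts are ours and are not taken from the study.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, sensitivity (recall), and F-score from confusion-matrix
    counts, following equation (1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f_score": f_score}

# Example with made-up counts: 80 true positives, 26 false positives, 20 false negatives
print(classification_metrics(tp=80, fp=26, fn=20))
```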
Logistic regression is a classic statistical classification method, developed in 1958 by David Cox, that is widely used for modelling a binary dependent variable and is now very common in scientific research, including biology, medicine, health, and clinical studies [17]. Logistic regression investigates the relationship between a binary dependent variable and independent variables by estimating probabilities with a logistic function, the cumulative logistic distribution.

Decision tree is a nonparametric supervised learning method for classification and regression that uses a tree-like graph or model of decisions to predict the value of a target variable by learning simple decision rules inferred from the data features. It can handle both numerical and categorical data, and nonlinear relationships between parameters do not affect tree performance.

Random forest is a powerful bagging and ensemble learning method for classification and regression tasks that can also provide the relative importance of the input variables; it comprises multiple decision trees built on the training set and combines the predictions of the individual classification and regression trees [18]. Random forest is one of the most accurate algorithms, averaging the votes of multiple deep decision trees grown on different random subsets of the training set to reduce variance. The main limitation of random forest is that a large number of trees can make the algorithm slow and impractical for real-time predictions.

Extreme gradient boosting originated as a command-line (terminal) application in a research project by Tianqi Chen and could be configured using a LibSVM configuration file [19]. Compared with other machine learning models, XGBoost was designed to be highly efficient and flexible and offers impressive predictive accuracy. XGBoost implements a scalable, parallel classification and regression tree (CART) boosting system under the gradient boosting framework, which can solve a wide range of data science problems quickly and accurately [20]. Gradient boosting is a boosting algorithm that combines the estimates of a set of simpler, weaker models. Because XGBoost provides built-in support for cross-validation, regularization, user-defined objective functions, tree parameters, a scikit-learn-compatible API, and so on, it usually fits a wide variety of datasets and distributions better than other models.
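To make the tuning procedure described above concrete, the sketch below wires together a 70%/30% train-test split, grid search, and stratified 5-fold cross-validation around an XGBoost classifier. It is a minimal illustration assuming the scikit-learn and xgboost packages; the placeholder data and the parameter grid are ours and are not the variables or grids actually used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Placeholder feature matrix (30 baseline variables) and binary outcome
X, y = np.random.rand(2235, 30), np.random.randint(0, 2, 2235)

# 70%/30% split, stratified to preserve the rare positive class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Illustrative hyperparameter grid (not the study's actual grid)
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out 30% test set
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

The same scaffold applies to the logistic regression, decision tree, and random forest models by swapping the estimator and its parameter grid.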
3. Results

Among the 2,235 CHB patients, a total of 106 were identified as having achieved HBsAg seroclearance. A summary of the participants' characteristics, including demographic characteristics and laboratory measurements, is shown in Table 1. The mean age of the patients was 40.58 ± 12.07 years, and 73.2% of the patients were male.

The whole dataset was randomly partitioned into 1,564 instances for the training set and 671 instances for the testing set using a 70%/30% split of the data. Table 2 shows the tuning parameters and values of the final models.

The performances of the four models are shown in Table 3, and the receiver operating characteristic (ROC) curves for each model are shown in Figures 1–4. The AUCs reflecting the overall discriminative abilities of XGBoost, RF, DCT, and LR were 0.891 (95% confidence interval (CI): 0.889, 0.895), 0.829 (95% CI: 0.824, 0.834), 0.619 (95% CI: 0.614, 0.624), and 0.680 (95% CI: 0.677, 0.683), respectively. The XGBoost model exhibited the best AUC, and its performance was significantly better than that of the other models. In terms of other measures, the F scores of XGBoost, RF, DCT, and LR were 0.97, 0.97, 0.95, and 0.97, respectively. Ranking the variables by their permutation importance for HBsAg seroclearance in the XGBoost model, the variable importance plot suggested that the level of HBsAg was the most important predictor of HBsAg seroclearance, followed by age and HBV DNA (Figure 5). The confusion matrix showed that logistic regression was severely affected by the high degree of imbalance of the dataset, as it classified the whole sample into the negative class.
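A ranking like the one summarised in Figure 5 can be derived from a fitted XGBoost classifier with permutation importance, as in the sketch below. It runs on placeholder data with made-up feature names, so it illustrates only the mechanics, not the study's results.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier

# Placeholder data and illustrative feature names (not the study's variables)
feature_names = [f"feature_{i}" for i in range(30)]
X, y = np.random.rand(2235, 30), np.random.randint(0, 2, 2235)

model = XGBClassifier(eval_metric="logloss").fit(X, y)

# Permutation importance: drop in AUC when each feature is shuffled
result = permutation_importance(model, X, y, scoring="roc_auc",
                                n_repeats=10, random_state=42)
ranking = pd.Series(result.importances_mean, index=feature_names)
print(ranking.sort_values(ascending=False).head(10))
```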
4. Discussion

HBsAg seroclearance has been widely considered one of the most important indicators of CHB prognosis. Using machine learning algorithms to predict disease status or outcomes from clinical datasets is attracting increasing attention in the medical and health field, as shown by many previous studies on related topics. In this retrospective cohort study, we evaluated the performance of four prediction models, generated by fitting machine learning algorithms to obtainable baseline clinical data, for accurately classifying individuals likely to achieve HBsAg seroclearance, with no need to acquire longitudinal data. Notably, in this study we investigated the optimisation of machine learning algorithms for routine clinical datasets. Our results indicated that the best-performing prediction model, XGBoost, obtained an AUC of 0.891, indicating good prediction accuracy. The finding that the serum HBsAg level was the most important variable in our study is consistent with previous studies [8, 14]. Factors such as age and the serum HBV DNA level were also shown to be highly related to HBsAg seroclearance. As there is not enough evidence in this domain, our findings will help achieve the goals of early prediction and detection using laboratory alternatives and assist hepatologists in choosing the optimal therapeutic regimen.
Figure 1: Receiver operating characteristic curves of logistic regression (fold 1–5 AUCs: 0.732, 0.721, 0.647, 0.652, 0.651; mean ROC AUC = 0.680).

Figure 2: Receiver operating characteristic curves of decision tree (fold 1–5 AUCs: 0.700, 0.563, 0.676, 0.598, 0.557; mean ROC AUC = 0.619).

Figure 3: Receiver operating characteristic curves of random forest (fold 1–5 AUCs: 0.875, 0.720, 0.898, 0.844, 0.807; mean ROC AUC = 0.828).

Figure 4: Receiver operating characteristic curves of extreme gradient boosting (fold 1–5 AUCs: 0.872, 0.831, 0.913, 0.931, 0.913; mean ROC AUC = 0.891).

(All panels plot sensitivity against 1 − specificity.)
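Per-fold ROC curves of the kind shown in Figures 1–4 can be produced along the following lines; this sketch assumes scikit-learn, xgboost, and matplotlib and again uses placeholder data rather than the SCHEMA cohort.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Placeholder data standing in for the 30 baseline variables and the outcome
X, y = np.random.rand(2235, 30), np.random.randint(0, 2, 2235)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs = []

# One ROC curve per cross-validation fold, as in Figures 1-4
for i, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
    model = XGBClassifier(eval_metric="logloss")
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[val_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[val_idx], proba)
    fold_auc = auc(fpr, tpr)
    fold_aucs.append(fold_auc)
    plt.plot(fpr, tpr, label=f"XGBoost fold {i} (AUC = {fold_auc:.3f})")

plt.xlabel("1 - specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.title(f"Mean ROC (AUC = {np.mean(fold_aucs):.3f})")
plt.show()
```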
obtained, unknown potentially relevant features may unfortunately have been missed. Second, the models in our study were developed on a dataset related to HBsAg seroclearance and may not be directly applicable to the prediction or diagnosis of other conditions or diseases. Third, this study included only four machine learning algorithms, and further exploration of better models is still needed to improve prediction accuracy. Finally, the external applicability of our findings might be limited because the dataset came from a single center within a specific geographic region, which limits the sample's representativeness of the whole study population, and the results might differ at other centers.
Figure 5: Variable importance plot of the XGBoost model for predicting HBsAg seroclearance (feature importances in descending order: sAg, age, DNA, TBIL, BMI, DBIL, SL, VR, GLB, ALB, PVW, lines, eAg, AST, ALT, GGT, WBC, RLOD, current treatment, SPVW, drinking history, PLT, initial treatment, HBV family history, HB, current diagnosis, initial diagnosis, gender).

5. Conclusions

In this study, we evaluated and compared four well-known machine learning algorithms by regressing the HBsAg seroclearance status of each patient against laboratory and demographic variables. The results show that machine learning algorithms, especially XGBoost, can predict HBsAg seroclearance with good accuracy. This study also showed the potential of machine learning algorithms for clinical outcome prediction.

Data Availability

The data used to support the findings of this study were supplied by Xiangyong Li under license and so cannot be made freely available. Requests for access to these data should be made to [Xiangyong Li, [email protected]].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors' Contributions

Yuantao Hao, Xiaolu Tian, and Xiangyong Li conceptualized the study. Yutian Chong and Xiangyong Li curated the data. Xiaolu Tian and Yutao Huang performed the formal analysis. Xiaolu Tian, Pi Guo, Wangjian Zhang, and Yuantao Hao contributed to the methodology. Xiangyong Li and Yuantao Hao were responsible for the resources. Yutao Huang was responsible for the software. Mengjie Li and Zhicheng Du validated the data. Xiaolu Tian wrote the original draft. Xiangyong Li, Pi Guo, Mengjie Li, Wangjian Zhang, Zhicheng Du, and Yuantao Hao reviewed and edited the article. All authors read and approved the final manuscript.

Acknowledgments

This research was funded by the National Science and Technology Major Project of the Ministry of Science and Technology of China (2018ZX10715004), the PhD Start-Up Fund of the Natural Science Foundation of Guangdong Province (A03299), and the 5010 Project of Clinical Research at Sun Yat-sen University (2016009).

References

[1] Y.-F. Liaw and C.-M. Chu, "Hepatitis B virus infection," The Lancet, vol. 373, no. 9663, pp. 582–592, 2009.
[2] T. C. Tseng and J. H. Kao, "HBsAg seroclearance: the more and earlier, the better," Gastroenterology, vol. 136, no. 5, pp. 1843–1844, 2009.
[3] J. Liu, H.-I. Yang, M.-H. Lee et al., "Spontaneous seroclearance of hepatitis B seromarkers and subsequent risk of hepatocellular carcinoma," Gut, vol. 63, no. 10, pp. 1648–1657, 2014.
[4] R. Idilman, K. Cinar, G. Seven et al., "Hepatitis B surface antigen seroconversion is associated with favourable long-term clinical outcomes during lamivudine treatment in HBeAg-negative chronic hepatitis B patients," Journal of Viral Hepatitis, vol. 19, no. 3, pp. 220–226, 2012.
[5] J.-F. Wu, H.-Y. Hsu, Y.-C. Chiu, H.-L. Chen, Y.-H. Ni, and M.-H. Chang, "The effects of cytokines on spontaneous hepatitis B surface antigen seroconversion in chronic hepatitis B virus infection," Journal of Immunology, vol. 194, no. 2, pp. 690–696, 2015.
[6] C.-M. Chu and Y.-F. Liaw, "Hepatitis B surface antigen seroclearance during chronic HBV infection," Antiviral Therapy, vol. 15, no. 2, pp. 133–143, 2010.
[7] H.-Y. Hsu, M.-H. Chang, C.-Y. Lee, J.-S. Chen, H.-C. Hsu, and D.-S. Chen, "Spontaneous loss of HBsAg in children with chronic hepatitis B virus infection," Hepatology, vol. 15, no. 3, pp. 382–386, 1992.
[8] T. C. Tseng, C. J. Liu, T. H. Su et al., "Serum hepatitis B surface antigen levels predict surface antigen loss in hepatitis B e antigen seroconverters," Gastroenterology, vol. 141, no. 2, pp. 517–525, 2011.
[9] G. Marx, S. R. Martin, J. F. Chicoine, and F. Alvarez, "Long-term follow-up of chronic hepatitis B virus infection in children of different ethnic origins," Journal of Infectious Diseases, vol. 186, no. 3, pp. 295–301, 2002.
[10] F. Bortolotti, M. Guido, S. Bartolacci et al., "Chronic hepatitis B in children after e antigen seroclearance: final report of a 29-year longitudinal study," Hepatology, vol. 43, no. 3, pp. 556–562, 2006.
[11] M. F. Yuen, D. K. H. Wong, J. Fung et al., "HBsAg seroclearance in chronic hepatitis B in Asian patients: replicative level and risk of hepatocellular carcinoma," Gastroenterology, vol. 135, no. 4, pp. 1192–1199, 2008.
[12] R. Moucari, A. Korevaar, O. Lada et al., "High rates of HBsAg seroconversion in HBeAg-positive chronic hepatitis B patients responding to interferon: a long-term follow-up study," Journal of Hepatology, vol. 50, no. 6, pp. 1084–1092, 2009.
[13] C.-M. Chu and Y.-F. Liaw, "HBsAg seroclearance in asymptomatic carriers of high endemic areas: appreciably