A Time Series Forecasting Approach For Queue Wait-Time Prediction
ANTON STAGGE
Abstract
Waiting in queues is an unavoidable part of life, and not knowing how long
the wait is going to be can be a big source of anxiety. In an attempt to mitigate
this, and to be able to manage their queues, companies often try to estimate
wait-times. This is especially important in healthcare, since the patients are
most likely already under some distress.
In this thesis, the performance of two different machine learning (ML) approaches and a simulation approach was compared on the wait-time prediction problem in a digital healthcare setting. Additionally, a combination approach was implemented, combining the best ML model with the simulation approach.
The ML approaches used historical data of the patient queue in order to
produce a model which could predict the wait-time for new patients joining the
queue. The simulation algorithm mimics the queue in a virtual environment
and simulates time moving forward until the new patient joining the queue is
assigned a clinician, thus producing a wait-time estimation. The combination
approach used the wait-time estimations produced by the simulation algorithm
as an additional input feature for the best ML model.
A Temporal Convolutional Network (TCN) model and a Long Short-Term
Memory network (LSTM) model were implemented and represented the se-
quence modeling ML approach. A Random Forest Regressor (RF) model and
a Support Vector Regressor (SVR) model were implemented and represented
the traditional ML approach. In order to introduce the temporal dimension to
the traditional ML approach, the exponential smoothing preprocessing tech-
nique was applied.
The results indicated that there was a statistically significant difference be-
tween all models. The TCN model and the simulation algorithm had the lowest
Mean Square Error (MSE) of all individual models. Both sequence modeling
models had lower MSE compared to both of the traditional ML models. The
combination model had the lowest MSE of all, adopting the best performance
traits from both the ML approach and the simulation approach. However, the
combination model is the most complex, and thus requires the most mainte-
nance.
Due to the limitations in the study, no single approach can be concluded as
optimal. However, the results suggest that the sequence modeling approach is a
viable option in wait-time prediction, and is recommended for future research
or applications.
Sammanfattning
Waiting in queues is an unavoidable part of life. Not knowing how long the wait will be can cause anxiety. In an attempt to alleviate this source of anxiety, and to be able to manage their queues, companies often try to estimate the waiting time. This is especially important in healthcare, since the patients are most likely already experiencing some kind of distress.
The aim of this thesis is to compare the performance of three different approaches to predicting the waiting time of a digital healthcare service. Two different machine learning (ML) approaches and a simulation approach were compared. In addition, a combination approach was evaluated, combining the best ML model with the simulation approach.
The ML approaches used historical data from the patient queue to create a model that could predict the waiting time for new patients joining the queue. The simulation algorithm imitates the queue in a virtual environment and simulates time moving forward in this environment until the new patient who joined the queue can be assigned an available clinician. In this way, a prediction of the waiting time can be given to the patient. The combination approach used the simulation predictions as additional input to the best ML model.
A Temporal Convolutional Network (TCN) model and a Long Short-Term Memory (LSTM) model were implemented and represented the sequence modeling approach. A Random Forest Regressor (RF) model and a Support Vector Regressor (SVR) model were implemented and represented the traditional ML approach. In order to give the traditional ML approach access to the temporal dimension, the exponential smoothing preprocessing technique was applied to its data.
The results showed that there was a statistically significant difference in the squared errors between all models. The TCN model and the simulation algorithm had the lowest mean squared error of the individual models. Both sequence modeling models had lower mean squared error than the traditional ML models. The combination model had the lowest mean squared error of all, as it retained the advantages of both the ML and the simulation approach. However, the combination approach is the one that requires the most maintenance.
Due to the limitations of the study, no single approach can be claimed to be optimal. The results suggest, however, that the sequence modeling approach can be used for wait-time prediction in a queuing system, and it is therefore recommended for future research or applications.
Contents
1 Introduction
  1.1 Problem Statement
  1.2 Purpose
  1.3 Scope
  1.4 Outline
2 Background
  2.1 Research Areas
    2.1.1 Sequence Modeling
    2.1.2 Queuing Systems
  2.2 Theory
    2.2.1 Exponential Smoothing
    2.2.2 Artificial Neural Networks
    2.2.3 Convolutional Neural Networks
    2.2.4 Temporal Convolutional Networks (TCN)
    2.2.5 Recurrent Neural Networks
    2.2.6 Long Short-Term Memory (LSTM)
    2.2.7 Regression Tree
    2.2.8 Random Forest Regressor (RF)
    2.2.9 Support Vector Regression (SVR)
  2.3 Related Work
    2.3.1 Wait-time Prediction
    2.3.2 Wait-time Prediction in Healthcare
3 Methodology
  3.1 The Queuing System and its Data
    3.1.1 The Queuing System
    3.1.2 Data
  3.2 Models
    3.2.1 Naive
    3.2.2 The Simulation Algorithm (Sim)
    3.2.3 TCN
    3.2.4 LSTM
    3.2.5 RF
    3.2.6 SVR
    3.2.7 Combination Model (Comb)
  3.3 Hyperparameter Optimization
  3.4 Evaluation
4 Results
  4.1 Hyperparameter Optimization
    4.1.1 TCN
    4.1.2 LSTM
    4.1.3 RF
    4.1.4 SVR
  4.2 Evaluation
  4.3 Forecasting
5 Discussion
  5.1 Model Comparison
    5.1.1 Forecasts
  5.2 Limitations
    5.2.1 Model Selection
    5.2.2 Temporal Dimension
    5.2.3 Data
    5.2.4 Hyperparameter Optimization
  5.3 Social Aspects and Usability
  5.4 Sustainability and Ethics
  5.5 Future Research
6 Conclusion
Bibliography
A Forecasts
Terminology
The following terms and abbreviations are used in this report:
ML - Machine Learning
TCN - Temporal Convolutional Network
LSTM - Long Short-Term Memory
RF - Random Forest
SVM - Support Vector Machine
SVR - Support Vector Regressor
Feature / predictor / parameter - Input variable
QL - Queue length, i.e. the number of consumers actively waiting in a queue
ED - Emergency Department
DLAMI - Deep Learning AMI (Amazon Machine Image)
Consultation / meeting - Used interchangeably to describe a consultation
meeting between a patient and a clinician.
Chapter 1
Introduction
for the wait-time can be made more easily. Patients seeking help consequently
do not have to waste their time waiting idly in a queue.
A substantial amount of research has been conducted with a focus on wait-time prediction [5], [7]–[13]. Some of this work will be covered in detail in
section 2.3. Several approaches to the problem have previously been inves-
tigated: Queuing Theory, Simulations and Machine Learning [6], [9], [10],
[13]. In the Machine Learning (ML) approach, historical data of the state of
the queue is used in order to obtain a model which can make predictions of the
future state of the queue. The state-of-the-art ML model used for this purpose
is the Random Forest (RF) [14] model, but other models have been examined
[7], [9], [13].
Another research area that has seen advances more recently is sequence modeling, which consists of mapping an input sequence to an output. This adds a new dimension to the input, namely the temporal dimension. Sequence modeling is traditionally applied to problems involving text, speech, DNA or other naturally occurring sequences [15]. The current
state-of-the-art sequence modeling models are the Long Short-Term Memory
(LSTM) [16] network and the (more recently presented) Temporal Convolu-
tional Network (TCN) [15]. The historical data of a queue can be considered
a sequence, since the current state of the queue is highly dependent on pre-
vious states of the queue. While simpler linear techniques have been applied
[9], [17], no studies applying the more advanced techniques from the sequence modeling research area to the wait-time prediction problem have been identified.
The aim of this thesis is to compare the performance of three different
approaches to the queue wait-time prediction problem in digital healthcare: a
sequence modeling approach, the traditional ML approach and a simulation
approach implemented by the company KRY [6]. In order to incorporate the
temporal dimension into the traditional non-sequence modeling ML approach,
the exponential smoothing [18] preprocessing technique is applied.
In order to answer the research question stated above the performance of two
sequence modeling models (TCN and LSTM), two non-sequence modeling
ML models utilizing exponential smoothing (RF and SVR), and the simula-
tion algorithm implemented by the company KRY [6] will be compared. Ad-
ditionally, the performance of a combination model comprising the best-performing ML model and the simulation algorithm will be examined.
1.2 Purpose
The purpose of this research is to investigate whether the more advanced techniques from the sequence modeling research area can be applied to the queue wait-time prediction problem in healthcare, and if so, to investigate how they perform compared to the traditional approaches. Recent studies have applied other ML
approaches with success, but the ML models used do not take the temporal
dimension into account. The TCN model has recently demonstrated its capa-
bility on a wide variety of tasks and datasets [15]; thus, there lies an interest
in exploring its usability on the wait-time prediction problem. If the sequence
modeling approach proves to be viable, it could be applied to other wait-time
prediction problems, not just wait-time prediction in a digital healthcare set-
ting, as long as there exists historical data of the state of the queue.
1.3 Scope
This paper compares four different ML models, where two are sequence mod-
eling models and the other two are not. Other models are not considered. The
scope does not include the application of queuing theory to the problem. Fur-
thermore, the data used in this thesis comes from a single digital healthcare
provider and from Swedish patients only. The consumer behaviour might dif-
fer in other markets.
1.4 Outline
This report is divided into six chapters. In Chapter 2 (Background) the research areas are introduced, the theory behind all models and techniques applied in this paper is explained, and finally related work is presented. In Chapter 3 (Methodology) the queuing system, the data, the implementation of all prediction models, the hyperparameter optimization and the evaluation metrics are described. Chapter 4 (Results) presents the results of the hyperparameter optimization and the evaluation. In Chapter 5 (Discussion) the results, the limitations of the study and suggestions for future research are discussed. Finally, Chapter 6 (Conclusion) concludes the thesis.
Chapter 2
Background
This chapter starts by introducing the research areas within which this thesis
is carried out. Secondly, the theoretical background of all methods used is
explained in detail. Lastly, related work and its similarities to and differences from this thesis are presented.
Queue Modeling
The research area known as queuing theory focuses on using mathematical
models to explain a queuing system. This is the traditional way of modeling a
queue [13]. The different mathematical formulas provided by queuing theory
each have different constraints which have to be met [12], [27]. Alternative, less restricted methods of modeling a queue include simulations and the use of ML.
Simulations can be used to model a queue. This is the algorithmic ap-
proach. A simulation is designed to mimic a queue as perfectly as possible,
and then time is moved forward in a virtual environment. By using assump-
tions or calculated estimations for the amount of time each consumer is going
to spend at the server, time can be moved forward until every consumer has
visited a server. This queuing model, therefore, heavily relies on the accu-
racy of the estimated service time. A detailed description of the simulation
algorithm investigated in this thesis is given in subsection 3.2.2.
When modeling a queue with ML, historical data of the state of the queue is
used. A set of features is selected to represent the state, and a model is trained
on the data of said features. The goal is for the model to portray the system as
accurately as possible [5], [10], [13], [28]. This approach is a viable alternative
to queuing theory [10], and all it requires is data regarding the history of the
queue. When a model of the queue has been acquired it can be used to make
predictions about future states of the queue, and certain variables in its state,
for instance the wait-time.
2.2 Theory
2.2.1 Exponential Smoothing
The technique and use of exponential smoothing and exponential windows
have existed for a long time. In the statistical literature it is often credited to
Brown [29], and referred to as Brown’s simple exponential smoothing. The
method was originally introduced as a forecasting method, but has in more recent years also been shown to be a good preprocessing method [30]. The forecasting method is described as:
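In its simplest form (a standard statement of Brown's method, with a smoothing factor $\alpha \in (0, 1)$ assumed here), the smoothed series $s_t$ of a series $x_t$ is computed recursively as:

$$s_0 = x_0, \qquad s_t = \alpha x_t + (1 - \alpha)\, s_{t-1}, \quad t > 0$$

When exponential smoothing is instead used as a preprocessing step, each input feature can either be replaced by its smoothed counterpart or have the smoothed series appended, so that every observation carries information about earlier values of the sequence.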
that connect multiple neurons into a large network, known as Artificial Neural
Networks (ANNs), have been invented. Rosenblatt introduced the first percep-
tron model in [34]. An example of a multi-layer perceptron (MLP) network,
Figure 2.5b: 10 feature maps stacked together to form a single output with depth 10.
A single convolutional layer can apply multiple filters to the same input.
The feature maps produced by each filter are stacked on top of each other in
order to produce a single output, as shown in Figure 2.5b. The number of
filters, the size of the filters and the size of the stride are hyperparameters to
the layer. [39]
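As a small illustration of these hyperparameters (a minimal sketch using the Keras API; the sizes chosen here are arbitrary and not taken from the thesis):

```python
import tensorflow as tf

# A 1D convolutional layer: 10 filters of width 3 slide over the input with
# stride 1, so the output has depth 10 (one feature map per filter).
layer = tf.keras.layers.Conv1D(filters=10, kernel_size=3, strides=1)

x = tf.random.normal((1, 110, 19))  # (batch, time-steps, input features)
y = layer(x)
print(y.shape)                      # (1, 108, 10) -- no padding by default
```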
Figure 2.7: An example of a basic TCN architecture with 3 stacked layers each
with k = 3.
In order to expand the number of inputs used to compute the output (known as the receptive field) beyond growing linearly with the filter size and network depth, a technique known as dilated convolutions [40] is used. A dilation is a fixed step d between two filter taps. The filter skips the inputs between the taps, as shown in Figure 2.8 [15].
Figure 2.8: An example of a dilation layer with d = 4 and k = 3. Not all lines
have been drawn to simplify the image. In reality, each output ŷt is connected
to k inputs xt .
Because of the residual blocks, the calculation of the receptive field changes. Without residual blocks, the formula for the receptive field R would be:

$$R(n) = R(n-1) + (k - 1)\,d(n)$$

where n is the layer, d(n) is the dilation for layer n and R(0) = 1. For example, looking back at Figure 2.7, which has k = 3 and dilations d = 1, 2, 4, the receptive field for the output layer is 15. However, since each residual block contains 2 causal convolutions, the formula becomes:

$$R(n) = R(n-1) + 2(k - 1)\,d(n)$$

If the dilation rate increases exponentially with a base of 2, i.e. $d(n) = 2^{n-1}$, the formula can be simplified to:

$$R(N) = 1 + 2(k - 1)\left(2^{N} - 1\right) \tag{2.6}$$
An illustration of how the residual blocks affect the receptive field is shown in Figure 2.11. As one can see, the number of inputs considered for a single output is 13. Using Equation 2.6 with k = 3 and N = 2 residual blocks:

$$R = 1 + 2(3 - 1)\left(2^{2} - 1\right) = 13$$
Figure 2.11: An example of the receptive field of two stacked residual blocks
with k = 3 and dilation increasing exponentially with a base of 2.
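A small helper function (a sketch that simply encodes Equation 2.6 under the base-2 dilation schedule described above) makes the relationship easy to check:

```python
def receptive_field(kernel_size: int, n_blocks: int) -> int:
    """Receptive field of a TCN whose residual blocks each contain two causal
    convolutions and whose dilation doubles per block (1, 2, 4, ...)."""
    return 1 + 2 * (kernel_size - 1) * (2 ** n_blocks - 1)

print(receptive_field(3, 2))  # 13, matching the two-block example in Figure 2.11
```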
Figure 2.12: The computational graph of a basic RNN. U, W and V are weight matrices, x is the input, h is the hidden state which is passed forward in time, o is the output, and ŷ is the output after the activation function. The black box represents a delay of one time-step.
Forward propagation begins with an initial state h(0), and then for each time-step t in the sequence the following equations are applied:
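Using the notation from Figure 2.12, the standard form of these updates is the following (a sketch; the bias vectors b and c and the choice of tanh for the hidden activation are the usual conventions and are assumptions here):

$$\begin{aligned} h^{(t)} &= \tanh\!\left(b + W h^{(t-1)} + U x^{(t)}\right) \\ o^{(t)} &= c + V h^{(t)} \\ \hat{y}^{(t)} &= \operatorname{activation}\!\left(o^{(t)}\right) \end{aligned}$$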
Similar to the RNN, the process of producing outputs ŷ(t), i.e. forward propagation, for an LSTM unit starts with an initial hidden state h(0), and then the equations in Equation 2.9 are applied for each time-step t of the sequence. Here i(t) denotes the input gate, f(t) the forget gate, o(t) the output gate, and s(t) the memory node.
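One standard way of writing these updates (a sketch assuming one pair of weight matrices W and U per gate and the bias vectors omitted, as stated below; the exact form of Equation 2.9 may use a different but equivalent notation) is:

$$\begin{aligned} i^{(t)} &= \sigma\!\left(W_i h^{(t-1)} + U_i x^{(t)}\right) \\ f^{(t)} &= \sigma\!\left(W_f h^{(t-1)} + U_f x^{(t)}\right) \\ o^{(t)} &= \sigma\!\left(W_o h^{(t-1)} + U_o x^{(t)}\right) \\ s^{(t)} &= f^{(t)} \odot s^{(t-1)} + i^{(t)} \odot \tanh\!\left(W_s h^{(t-1)} + U_s x^{(t)}\right) \\ h^{(t)} &= o^{(t)} \odot \tanh\!\left(s^{(t)}\right) \end{aligned}$$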
where σ(·) is the sigmoid activation function and ⊙ is element-wise multiplication [39]. All bias vectors have been excluded from the equations for simplicity. The size of all matrices W and U is a hyperparameter of the LSTM.
Much like the TCN, the modern LSTM utilizes dropout [44], both between
layers, and between the recurrent units.
Figure 2.15: The prediction space divided into regions based on the Regres-
sion Tree in Figure 2.14.
predictor and cutpoint is the one that minimizes the Residual Sum of Squares
(RSS) given by Equation 2.10, where J is the number of regions (R), and ŷRj
is the mean prediction value of the points within region Rj . [47]
$$\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2 \tag{2.10}$$
bootstrapped training data, each tree in the forest becomes less correlated with the others, and the variance of the model is reduced. The number of trees, the maximum depth of each tree and the number of predictors m are hyperparameters of the model. [47]
$$y_i \times \left(\vec{w} \cdot \vec{x}_i - b\right) \geq M\left(1 - \xi_i\right) \tag{2.11}$$
Every point should be on the correct side of the plane, with margin, but allowing for some ($\xi_i$) slack. The slack relaxes the problem so that a solution can be found in more cases. The size of the margin $M$ can be calculated to be $\frac{2}{\|\vec{w}\|}$. Maximizing the margin is therefore equivalent to minimizing the length of $\vec{w}$. The optimization problem for the SVC is
$$\begin{aligned} \text{Minimize} \quad & \|\vec{w}\|^2 + C \sum_i \xi_i \\ \text{under the constraints} \quad & y_i \times \left(\vec{w} \cdot \vec{x}_i - b\right) \geq 1 - \xi_i, \; \forall i \\ & \xi_i \geq 0, \; \forall i \end{aligned} \tag{2.12}$$
where C is a hyperparameter that tunes the amount of slack allowed [48]. Figure 2.16 shows an example of an SVC in 2D space.
An interesting property of the SVC is that only the points that lie on or within the margin affect the position of the hyperplane. These are known as the support vectors [47]. For example, if you were to move any of the top four blue points in Figure 2.16 to any other location on the correct side of the margin, it would not change the classifier at all.
In 2015 the authors of [49] proved that the linear regression model known as the Elastic Net [50] could be reduced to an SVC. For every Elastic Net instance, there exists a solution to a constructed binary classification problem in which the hyperplane is identical to the Elastic Net solution. The practical implication of this is that it enables the use of optimized SVC solvers for Elastic Net problems [49].
$$\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad \phi(\vec{x}) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \tag{2.13}$$
$$\begin{aligned} \text{Minimize} \quad & \tfrac{1}{2}\|\vec{w}\|^2 + C \sum_i \xi_i \\ \text{under the constraints} \quad & y_i \times \left(\vec{w} \cdot \phi(\vec{x}_i) - b\right) \geq 1 - \xi_i, \; \forall i \\ & \xi_i \geq 0, \; \forall i \end{aligned} \tag{2.17}$$
$$\begin{aligned} \text{Minimize} \quad & \tfrac{1}{2}\|\vec{w}\|^2 + C \sum_i \left(\zeta_i + \zeta_i^*\right) \\ \text{under the constraints} \quad & y_i - \vec{w} \cdot \phi(\vec{x}_i) - b \leq \varepsilon + \zeta_i, \\ & \vec{w} \cdot \phi(\vec{x}_i) + b - y_i \leq \varepsilon + \zeta_i^*, \\ & \zeta_i, \zeta_i^* \geq 0, \; \forall i \end{aligned} \tag{2.18}$$
where $\zeta_i^*$ denotes training errors above $\varepsilon$ and $\zeta_i$ denotes training errors below $\varepsilon$ [48], [54]. The $\zeta$ variables, analogous to the slack variables in the SVM optimization problem, allow some pairs $(x_i, y_i)$ to be approximated with worse precision than $\varepsilon$ and thus relax the constraints of the optimization problem [53]. Figure 2.17 illustrates a linear SVR model.
the line. Although [11] focuses on the queuing theory research area, the result
of their study can be applied in this thesis.
In a study from 2017 by Mourao, Carvalho, Carvalho, and Ramos [12]
several different models to predict (classify) wait-time overflows in banking
queues were compared. Banks in Brazil can be issued fines if their customers have to stand in line for longer than a certain amount of time. Such a case was considered an overflow. The models compared were 3 ML models (Deep Learning, Gradient Boosting Machine (GBM) and RF) as well as a model based on queuing theory.
The ML models were trained on historical data about the queue state for each customer. The features of this data included, but were not limited to: Arrival Rate, Service Rate, Arrival Time, Average Service Time, Queue Length, Head of Line (HOL). All ML models outperformed the queuing theory model, both in accuracy and F1-score. The GBM achieved the highest accuracy of 97%, and the RF model had an accuracy of 96%. The work done in [12] differs from this
paper because of the single-server queuing system used and the classification
approach.
The ML approach was shown to be viable for wait-time prediction by Kyritsis and Deriaz in their study from 2019 [10]. Kyritsis and Deriaz implemented an ANN model with 2 hidden layers and trained it with an Adam optimizer on historical data from banking queues. The dataset used is publicly available and consisted of, for each person that joined the queue, the 5 features: queue length, day of week, hour of day, minute of hour and the wait-time. The data had a mean wait-time of 13.18 minutes, a median of 12 minutes and a standard deviation of 5.95 minutes. The final model produced a Mean Absolute Error (MAE) of 3.35 minutes, compared to a naive model which always predicted the mean and had an MAE of 4.71 minutes. The work in [10] differs from this paper by the fact that the number of servers in the multi-server queue system was an unknown variable, and by the choice of ML model.
In a study by Zhang, Nguyen, and Zhang in 2013, several algorithms, in-
cluding some ML models, were implemented to predict the wait-time on a
time series dataset from the Department of Motor Vehicles (DMV) in San
Jose, California [17]. The data consisted of the wait-time for every 10 minutes
of the day. The simple linear regression model achieved the lowest average
error, followed closely by the SVM. The ML models were trained on the fea-
tures: latest wait-time, time of the day and day of the week. Much like the
work in this thesis, [17] compares the performance of multiple ML models on
wait-time prediction. However, it differs in that both the data used and the algorithms chosen were much simpler.
Chapter 3
Methodology
This chapter describes the methods used to carry out the experiments. A de-
scription of the queuing system in place at KRY (hereafter referred to as the company) and of the historical data used for training is given. A detailed de-
scription of the implementation of all the prediction models is given (including
the simulation algorithm implemented by the company), as well as the hyper-
parameter optimization executed for the ML models. Finally, the evaluation
metrics are defined.
All hyperparameter optimization, training and evaluation were executed on an Amazon EC2 instance of type g4dn.xlarge (4 vCPUs, 16 GB RAM, 1 NVIDIA T4 GPU) running the Deep Learning AMI (DLAMI) on Ubuntu 16.04.
clinicians carrying the same labels, indicating that they can treat those patients.
Another difference is the fact that there exist some special cases of patients,
which are given different priorities in the queue. Such cases include: infants,
prescription renewals, and internal referrals. In the case of prescription renewals, patients do not have to have a video consultation but can handle the matter via a text meeting. These meetings require much less time than video consultations. An internal referral is the case where a clinician at the company has referred a patient to another clinician at the company. The patients in the queue are sorted firstly by priority and secondly by arrival time.
The last difference in the queue system is that patients are unable to join
the queue if the wait-time estimation by the simulation model exceeds 4 hours.
This is a feature implemented to avoid overcrowding.
3.1.2 Data
Data from 767246 patients, covering the period 2018-04-22 to 2020-02-17, was collected. In the data cleaning process, patients with missing values and patients with a recorded wait-time exceeding 6 hours were considered anomalies and were therefore dropped. Next, patients whose wait-time differed from that of the previous patient by more than 3 times the standard deviation (σ = 9.3214 minutes) were considered outliers and were also dropped. This was done because admins can access the queue system and manually assign patients to clinicians, resulting in a very small wait-time. After the cleaning process, data from a total of 757086 patients remained.
The data collected from the queue is not uniformly sampled, while sequence modeling implicitly requires uniformly sampled data to be effective. The solution applied to this problem was to append the time intervals ∆t (the time difference between consecutive patients) to the input vector, as suggested in [57]. A description of each feature included in an observation is displayed in Table 3.1.
For the sequence modeling models, the input features were normalized by removing the mean and scaling to unit variance. This was done using Scikit-learn's StandardScaler [48]. The sequence length was set to 110; thus, each multivariate input sequence consisted of 110 inputs, x_1, ..., x_110, where each vector x_t contained information about all input features described in Table 3.1. The output target y_110 was the actual wait-time feature for patient x_110, as described in Equation 2.2.
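A minimal sketch of this normalization and windowing step is shown below (the array names and shapes are illustrative and not taken from the thesis code):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

SEQ_LEN = 110  # sequence length used for the sequence modeling models

def make_sequences(features: np.ndarray, wait_times: np.ndarray):
    """Build inputs of shape (n_windows, SEQ_LEN, n_features) where the target
    is the actual wait-time of the last patient in each window."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)  # zero mean, unit variance per feature
    X, y = [], []
    for end in range(SEQ_LEN, len(scaled) + 1):
        X.append(scaled[end - SEQ_LEN:end])
        y.append(wait_times[end - 1])
    return np.array(X), np.array(y)
```

In practice the scaler would be fitted on the training portion only and then reused for the validation and test splits, so that no information leaks from the evaluation data.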
The technique of exponential smoothing, as described in subsection 2.2.1, was applied in order to incorporate the temporal dimension in the data for the traditional, non-sequence modeling ML models (RF and SVR).
Feature - Description
Year - Year of the visit.
Month - Month of the year.
Day - Day of the month.
Hour - Hour of the day.
Minute - Minute of the hour.
∆t - Seconds elapsed since the latest patient joined the queue.
HoL - How long the patient currently at the front of the queue has waited.
Queue length - Number of patients currently in the queue.
Arrival rate - Number of patients arriving per second, measured over the latest 15 patients.
Gender - Gender of the patient: 0, 1 or 2, where 0 means not specified.
Age - Age of the patient.
Symptom - The symptom the patient is seeking help for. This is selected by the patient from a list of 61 symptoms and was represented by the index in said list.
Latest meeting length - The length of the latest meeting that ended.
Latest wait-time - The wait-time of the latest patient that received help.
Doctors working - Number of doctors currently working.
Receiving service - Number of patients in meetings.
Mean average meeting length - The mean of the currently working doctors' average meeting lengths.
Service rate - Mean number of patients in meetings, measured over the latest 15 patients.
Simulation wait-time - The wait-time estimation given by the simulation model.
Actual wait-time - The actual wait-time in minutes for the patient.

Table 3.1: A description of all the features in the data collected from the queue. The first group consists of all input features used for the ML models. The second group consists of the simulation estimation and the target feature.
3.2 Models
3.2.1 Naive
Two naive models were implemented. The first model, Naive (previous), was defined as:

$$\hat{y}^{(t+1)} = y^{(t)} \tag{3.1}$$

This model is not possible to apply in a real-world application, because the wait-time of the previous patient is most often not known when the next patient arrives. The previous patient is most likely still in the queue; thus, this prediction model uses information from the future. However, the model was used merely as a benchmark for the other models.
The second naive model, Naive (latest known), predicts the latest known wait-time. This is one of the input features described in Table 3.1. More formally, the model was defined as:

$$\hat{y}^{(t+1)} = x^{(t)}_{\text{latest wait-time}} \tag{3.2}$$
Initially, the events consist of all the start and end times of the clinicians' shifts, including their breaks, up to 7 days into the future. Additionally, if any shift has an ongoing meeting, the expected end time of said meeting is added as an event. This end time is calculated based on the meeting length estimations explained above. At the time of each event in the ordered event list, a check is made to see whether any ongoing meeting has ended, signifying that a clinician has become available for matching. Furthermore, all scheduled shifts are searched for shifts that have either ended or started at this point in time. Clinicians whose shifts just ended are no longer considered for matching, and vice versa for clinicians whose shifts just started. Next, the patients waiting in the queue are matched with the currently available clinicians based on their labeling information. An event is created for each matched patient, signifying the end time of their meeting. The matched patients are assigned wait-time estimations calculated by taking the difference between the simulated matching time and the actual current time. The algorithm is presented in pseudocode in Algorithm 1.
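An illustrative sketch of this event-driven loop is given below. It is a simplified reconstruction for exposition only, not the company's actual Algorithm 1: the shift and clinician objects and the estimate_meeting_length helper are assumptions, the queue is assumed to be pre-sorted by priority and arrival time, and details such as breaks and shifts ending mid-meeting are glossed over.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float                                   # events are processed in time order
    kind: str = field(compare=False)              # "shift_start", "shift_end" or "meeting_end"
    clinician: object = field(compare=False, default=None)

def simulate_wait_time(now, new_patient, queue, shifts, ongoing_meetings,
                       estimate_meeting_length):
    """Move simulated time forward until new_patient is matched to a clinician,
    then return its estimated wait-time (simulated matching time minus now)."""
    events = []
    for shift in shifts:                          # scheduled shifts up to 7 days ahead
        heapq.heappush(events, Event(shift.start, "shift_start", shift.clinician))
        heapq.heappush(events, Event(shift.end, "shift_end", shift.clinician))
    busy = set(ongoing_meetings)                  # clinicians currently in a meeting
    for clinician, end_time in ongoing_meetings.items():
        heapq.heappush(events, Event(end_time, "meeting_end", clinician))

    available, waiting = set(), list(queue) + [new_patient]
    while events:
        event = heapq.heappop(events)
        if event.kind == "shift_start" and event.clinician not in busy:
            available.add(event.clinician)
        elif event.kind == "shift_end":
            available.discard(event.clinician)
        elif event.kind == "meeting_end":
            busy.discard(event.clinician)
            available.add(event.clinician)

        # Match waiting patients to free clinicians whose labels allow them
        # to treat the patient.
        for patient in list(waiting):
            clinician = next((c for c in available if c.can_treat(patient)), None)
            if clinician is None:
                continue
            available.remove(clinician)
            busy.add(clinician)
            waiting.remove(patient)
            end_time = event.time + estimate_meeting_length(clinician, patient)
            heapq.heappush(events, Event(end_time, "meeting_end", clinician))
            if patient is new_patient:
                return event.time - now
    return None   # no clinician with matching labels becomes available
```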
3.2.3 TCN
The TCN model was implemented using TensorFlow2 [58] and the Keras li-
brary [59]. The implementation of the TCN layer in [60] was used, together
with a final dense layer with linear activation. The TCN layer in [60] was
made following the specification from [15]. The sole difference is that the
weight normalization in the residual block was swapped for an optional batch
normalization [61]. Whether to use batch normalization or not was left as a hy-
perparameter. The other hyperparameters were kernel size, number of filters,
number of residual blocks (with exponentially increasing dilation), dropout
rate, and kernel initializer. The Adam Optimizer [62] from Keras was used
during training, with the default learning rate of 0.001. The model was trained
for a maximum of 100 epochs with a batch size of 64 and a sequence length
of 110. Early stopping was used with a patience of 20, i.e. the training was
terminated if the validation MSE did not decrease for 20 consecutive epochs.
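A sketch of how such a model can be assembled with the keras-tcn package [60] is shown below. The hyperparameter values roughly mirror the best configuration reported later in Table 4.1, and the mapping of the number of residual blocks to the dilations list, as well as the exact constructor arguments, should be treated as assumptions.

```python
import tensorflow as tf
from tcn import TCN  # pip install keras-tcn

n_features, seq_len = 19, 110  # illustrative feature count; sequence length from the thesis

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_features)),
    TCN(nb_filters=64,
        kernel_size=3,
        dilations=[1, 2, 4],           # 3 residual blocks with doubling dilation
        dropout_rate=0.06,
        kernel_initializer="he_normal",
        use_batch_norm=False),
    tf.keras.layers.Dense(1, activation="linear"),  # predicted wait-time in minutes
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=64, callbacks=[early_stop])
```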
3.2.4 LSTM
The LSTM model was implemented using TensorFlow2 [58] and the Keras
library [59]. The LSTM layer from Keras was used together with a dense
layer with linear activation. All hyperparameters to the layer were set to the
default values except for the number of units, kernel initializer, dropout rate,
and recurrent dropout rate. Identically to the TCN, the LSTM was trained for
a maximum of 100 epochs using: the Adam Optimizer [62] with the default
learning rate of 0.001, early stopping with a patience of 20, a batch size of 64,
and a sequence length of 110.
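A corresponding sketch for the LSTM model (the unit count, initializer and dropout rates here are placeholders, not the optimized values from Table 4.2):

```python
import tensorflow as tf

n_features, seq_len = 19, 110  # illustrative feature count; sequence length from the thesis

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len, n_features)),
    tf.keras.layers.LSTM(units=128,
                         kernel_initializer="glorot_uniform",
                         dropout=0.1,
                         recurrent_dropout=0.1),
    tf.keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# Trained for at most 100 epochs with batch size 64 and early stopping
# (patience 20 on the validation MSE), mirroring the TCN setup.
```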
3.2.5 RF
The Scikit-learn implementation RandomForestRegressor [48] was used for
the RF model. All hyperparameters were left to their default values except
n_estimators (for the number of trees), max depth, and max features (maxi-
mum features considered at each split).
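A minimal usage sketch (the hyperparameter values are placeholders; the inputs are the flat, exponentially smoothed feature vectors rather than sequences):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=None, max_features="sqrt")
# rf.fit(X_train_smoothed, y_train)    # hypothetical training arrays
# y_pred = rf.predict(X_test_smoothed)
```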
3.2.6 SVR
The GPU-accelerated implementation of the SVR in the cuML library from the open source project Rapids [63] was used. The hyperparameters were left at their defaults except for the kernel type, the degree for polynomial kernels, the amount of slack C allowed, and epsilon.
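A minimal usage sketch, assuming cuML's SVR mirrors the scikit-learn interface as it is documented to do (the hyperparameter values are placeholders):

```python
from cuml.svm import SVR  # GPU-accelerated SVR from the RAPIDS cuML library

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # degree would be set for a polynomial kernel
# svr.fit(X_train_smoothed, y_train)          # hypothetical training arrays
# y_pred = svr.predict(X_test_smoothed)
```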
3.3 Hyperparameter Optimization
Table 3.2: TCN hyperparameter search space. The maximum number of trainable parameters is 2845633, and the maximum receptive field is 631.
Table 3.4: RF hyperparameter search space. If only smooth was set to false, the smoothed representations of all features were appended to the input vector instead of replacing it. Therefore, n = 18 or n = 31, depending on the value of only smooth.
Table 3.5: SVR hyperparameter search space, where N(0, 1) denotes the normal distribution with µ = 0 and σ = 1. The degree parameter was only used for the polynomial kernel. If only smooth was set to false, the smoothed representations of all features were appended to the input vector instead of replacing it.
3.4 Evaluation
In order to reduce the variance and effect of randomness for each ML model,
5 separate models were trained and combined into a final model by averaging
their separate predictions. This does not apply to the SVR model, since its
training process is deterministic, i.e. does not include randomness.
To determine whether there was a statistically significant difference in the errors of any model, a Friedman test [66] was performed. The Friedman test is the non-parametric version of an analysis of variance (ANOVA) test. The non-parametric version was used because the errors cannot be assumed to be normally distributed. The null hypothesis was that all errors had the same distribution. The hypothesis was tested by picking a random sample of N = 25000 patients from the test data, whose total size was 75600. The squared errors of the predictions from each final model were calculated for each patient in the sample. The Friedman test implementation from the SciPy library [67] was then used to test said error samples. The significance level was set to 0.01.
The Friedman test was followed by a post-hoc pairwise comparison using the non-parametric version of the dependent (paired) t-test, the Wilcoxon signed-rank test [68]. This was done to determine between which models there was a statistically significant difference in the errors. The null hypothesis tested was that the two paired samples come from the same distribution. Similarly to the Friedman test, this was done by sampling N = 25000 patients from the test data, calculating the squared errors of the predictions for each model, and then using the Wilcoxon signed-rank test implementation from the SciPy library [67] with a significance level of 0.01.
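A sketch of how these tests can be run with SciPy is shown below (the error arrays are dummy placeholders; in the thesis each array would hold one model's squared errors on the same 25000 sampled patients):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# One array of squared errors per model, rows paired by patient (dummy data here).
errors = {name: rng.gamma(2.0, 30.0, size=25000)
          for name in ["TCN", "LSTM", "RF", "SVR", "Sim", "Comb"]}

# Omnibus test: do all models share the same error distribution?
stat, p = friedmanchisquare(*errors.values())
print(f"Friedman p-value: {p:.3g}")

# Post-hoc pairwise comparison at significance level 0.01.
stat, p = wilcoxon(errors["TCN"], errors["LSTM"])
print(f"TCN vs LSTM Wilcoxon p-value: {p:.3g}")
```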
To measure the performance of each final model, the MSE and R2 were calculated on the test data. The MSE is the most commonly used loss function for regression, and it is calculated as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 \tag{3.3}$$
Chapter 4
Results
4.1 Hyperparameter Optimization
4.1.1 TCN
The hyperparameters with the lowest validation loss are displayed in Table 4.1. The lowest validation loss was 52.502 and the average time per trial was 295.363 s. This yields a receptive field of 33 according to Equation 2.6.
Table 4.1: The best TCN hyperparameters: kernel size = 3, filters = 64, blocks = 3, dropout rate = 0.0637, kernel initializer = he_normal, batch norm = False.
4.1.2 LSTM
Table 4.2 presents the hyperparameters for the LSTM model which resulted in the lowest validation loss of 50.470. The average time per trial was 2391.482 s.
4.1.3 RF
The hyperparameters that achieved the lowest validation loss of 51.227 are displayed in Table 4.3. The RF model took on average 140.498 s per trial.
4.1.4 SVR
The SVR model with the lowest validation loss of 59.021 had the parameters
displayed in Table 4.4. The SVR model took on average 216.340 s/trial.
4.2 Evaluation
The p-value produced by the Friedman test on a sample of the test data with a size of 25000 was 0.0, signifying that at least one model does not have the same error distribution as the rest at a significance level of 0.01. The resulting p-values of the post-hoc pairwise comparison using the Wilcoxon signed-rank test are displayed in Table 4.5. The table shows that there is a significant difference in the error distribution between all pairs of models at a significance level of 0.01.
Table 4.5: The p-value results from the post-hoc pairwise Wilcoxon signed-
rank test. Sample size N = 25000.
The MSE and R2 for each of the final models are displayed in Table 4.6,
and the training times are displayed in Table 4.7. The TCN model achieved
the lowest MSE out of all standard models. Therefore, this model was used
for the combination model (Comb).
Model MSE R2
Naive (previous) 28.8 0.935
Comb 38.3 0.914
TCN 56.2 0.873
Simulation 56.6 0.872
LSTM 61.1 0.862
RF 70.3 0.841
SVR 78.6 0.823
Naive (latest known) 102.7 0.768
Table 4.6: MSE and R2 for each model sorted on MSE. The Naive (previous)
model is not possible to implement in a real scenario, because the previous
patient has most likely not been assigned a clinician when the next patient
arrives.
the Naive (previous) model. All ML models and the Simulation model out-
performed the Naive (latest known) baseline model.
The models in the non-sequence modeling ML approach (RF and SVR)
achieved substantially higher MSEs and lower R2 scores compared to the mod-
els in the sequence modeling approach (LSTM and TCN), as can be seen in
Table 4.6. Since the statistical tests showed that there was a statistically sig-
nificant difference between all said models, this provides evidence that the se-
quence modeling approach is superior to the non-sequence modeling approach
at predicting wait-time in this specific digital healthcare setting.
The TCN is significantly faster at training compared to the LSTM. The
non-sequence modeling ML models were significantly faster at training com-
pared to the Neural Network models.
Table 4.7: The average training time and seconds per epoch for all ML models. The RF and SVR models are not trained in epochs.
4.3 Forecasting
Figure 4.1 and Figure 4.2 display the forecasts from each model for 2 randomly selected days in the test data (not including the Naive (previous) model). It should be noted that these days constitute only a small part of the test set, which contained a total of 55 days, and might not be representative of the whole data set. More forecast graphs can be found in Appendix A.
As one can see in the figures, the ML models tend to react more slowly to changes in the wait-time. The simulation algorithm, however, seems to overestimate the changes and predict a higher wait-time than necessary during the peak hours. In Figure 4.1, the combination model does not seem to suffer from either of these problems, while in Figure 4.2 it exhibits both problems, but on a much smaller scale.
The simulation seems to be able to handle special cases better than the ML models. An example of this can be seen at approximately 17:45 in Figure 4.1, where the wait-time has 4 distinct dips. These happen to be 3 child patients and one patient renewing his/her prescription. These are all considered special cases, as explained in subsection 3.1.1. The simulation is able to assign a much lower wait-time to these patients, but the ML models do not seem to have learned this feature of the queue system, even though the input features include both age and symptom (prescription renewal is considered a symptom).
The combination model seems to have the benefit of fast reactions and spe-
cial case detection from the simulation outputs and the benefit of not overes-
timating spikes in the wait-time from the ML model. This can be seen clearly
in both Figure 4.1 and Figure 4.2.
Figure 4.1: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2019-12-30).
Figure 4.2: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2020-01-20).
Chapter 5
Discussion
In this chapter, the results and their implications are discussed. The performance of the models is compared according to the measured metrics, and other aspects of the different models are brought to light. The forecast graphs are analyzed and evaluated in detail. The limitations of the data used and of the methodology are highlighted, the impact of the work is discussed and, finally, future research is suggested.
5.1.1 Forecasts
A pattern can be noticed in the forecast behavior of all ML models compared to the simulation. The ML models tend to have slower reactions to changes in the wait-time, which causes them to lag behind the curve a bit. This is believed to be due to the latest known wait-time feature, which essentially is the wait-time curve with a time delay. The ML models have to learn how the wait-time has changed since the latest known wait-time, and if they are unable to predict any change, the resulting forecasts will follow this time-delayed curve exactly, which is what the Naive (latest known) model does. The simulation, on the other hand, has faster reactions but tends to overestimate the spikes in the wait-time. The wait-time is affected by several different factors; for instance, at peak hours the working clinicians can potentially take the long queue into account and start working more efficiently as a countermeasure. This is something the simulation is not able to consider, and it might be the reason for the overestimations. The ML models can potentially learn the effect of such factors, if they are represented in the training data.
The combination model adopted the best traits from both of the approaches it combines. This is believed to be one of the reasons behind the superior performance of the combination model.
5.2 Limitations
5.2.1 Model Selection
It is important to note that the research conducted in this paper has its limitations. First and foremost, only two different models were compared for each
5.2.3 Data
In order to make the collected data useful for the ML models, some cleaning had to be done. This cleaning removed certain patients due to missing or unrealistic data values. Said patients might have been crucial in explaining the behavior of the wait-time, and their removal may thus have negatively affected the performance of all ML models.
The ML models require a lot of data: the more, the better. However, due to the pandemic caused by the virus Covid-19 in the beginning of 2020, some data was deemed unusable. The company experienced a disproportionate increase in patients per day compared to earlier years, which meant that no data from patients seeking help after 2020-02-17 could be used. Moreover, data from a long time ago might not reflect the more recent behavior of the queue and the wait-time. For that reason, and because of a change in the structure of the data, no data from earlier than 2018-04-22 could be used.
The chronological split of the data into training, validation and test sets could be a source of interference for the ML models as well. This split caused both Christmas (2019-12-24) and New Year's Day (2020-01-01), two significant holidays, to fall within the test set. The consumer behavior on said days differs a lot from other, regular days. This could have had an impact on the ML models' performance on the test set.
Another significant limitation in the data is that it is collected from a single digital healthcare source. The trends in the data, and therefore the results of this thesis, might differ for other digital healthcare providers, or in other markets.
The features used to represent the patients in the data were chosen based on previous research and availability, but also on reproducibility. This means that features that were too specific to the queuing system in place at the company, for instance clinician and patient labels, were not used. However, the work in this thesis does not contain any analysis of said features. It could be interesting to investigate how much each feature contributes to the predictions; some of the features used may only be unwanted noise. This is something that was not considered due to the time limitation of the study.
for the LSTM was 2391. As 100 trials were done, the total time required for
the hyperparameter optimization for the LSTM was around 66 hours.
Not only was the amount of data limited, but so were the hyperparameters considered and their search space. The batch size and, as mentioned previously, the sequence length were set to fixed numbers even though they might have a considerable impact on the training and performance of the models. The ranges for all hyperparameters in the search space were also chosen with the time limitations in consideration. This might have affected the result of the hyperparameter optimization: if a larger search space had been chosen, another minimum might have been found. The current search space might have caused the search to end up in a local minimum, when a better global minimum could exist. In an attempt to make the hyperparameter optimization as fair as possible between the TCN and LSTM models, the search space was constructed in such a way that the maximum possible size of the two models, measured in the number of trainable weights, was roughly the same.
The hyperparameters deemed to be the best for the TCN were also used for the TCN in the combination model. Since the combination model has a new input (the simulation output), the best hyperparameters for the regular TCN might not be the best in this case. Due to time limitations, however, the assumption had to be made that these hyperparameters would also be a good choice for the TCN in the combination model.
was applied in this thesis, appending the time differences between each patient ∆t to the input vector. However, research has shown that there are other, better solutions to this problem. More specifically, [72] introduced the Phased LSTM, which extends the regular LSTM by adding a new time gate, resulting in greatly improved performance in standard RNN applications. Very recently, similar research has been done for the TCN in [57], which introduces a Continuous Convolutional Neural Network for non-uniform time series, showing promising results. Future research is needed in order to investigate whether the models mentioned above could further improve the performance of the sequence modeling approach on the wait-time prediction problem.
The behavior and trends in the wait-time change over time. As mentioned in subsection 5.2.3, data from a long time ago might not contribute as much to the performance of the ML models as more recent data; it might just add unwanted noise. Future research is required to evaluate whether any attempt to mitigate this problem would provide an increase in performance for the ML models. For instance, one could use sample weights in order to weight the data from more recent patients more heavily, so that they contribute more during training. Furthermore, investigating other, more representative evaluation techniques for the ML models could be of interest. Splitting the data in chronological order will cause the ML models to be evaluated on the latest trend, which might not be included in the training data. In order to alleviate this effect one could, for instance, evaluate the ML models on one week at a time, allowing them to train on the data from the previous week before evaluating the next.
Chapter 6
Conclusion
Bibliography
[54] F.-K. Wang and T. Mamo, "A hybrid model based on support vector regression and differential evolution for remaining useful lifetime prediction of lithium-ion batteries", Journal of Power Sources, vol. 401, pp. 49–54, 2018, issn: 0378-7753.
[55] D. Worthington, "Queueing models for hospital waiting lists", Journal of the Operational Research Society, vol. 38, no. 5, pp. 413–422, 1987.
[56] R. A. Nosek and J. P. Wilson, "Queuing theory and customer satisfaction: A review of terminology, trends, and applications to pharmacy practice", Hospital Pharmacy, vol. 36, no. 3, pp. 275–279, 2001, issn: 0018-5787.
[57] H. Shi, Y. Zhang, H. Wu, S. Chang, K. Qian, M. Hasegawa-Johnson, and J. Zhao, "Continuous convolutional neural network for nonuniform time series", 2020. [Online]. Available: https://openreview.net/forum?id=r1e4MkSFDr.
[58] M. Abadi, A. Agarwal, P. Barham, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org, 2015. [Online]. Available: https://www.tensorflow.org/.
[59] F. Chollet et al., Keras, https://keras.io, 2015.
[60] P. Rémy, Keras TCN, https://github.com/philipperemy/keras-tcn, 2020.
[61] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", arXiv preprint arXiv:1502.03167, 2015.
[62] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
[63] Rapids, https://rapids.ai, 2020.
[64] J. Bergstra, D. Yamins, and D. D. Cox, "Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures", JMLR, 2013, issn: 1938-7228.
[65] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization", in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., Curran Associates, Inc., 2011, pp. 2546–2554. [Online]. Available: http://papers.nips.
Appendix A
Forecasts
This appendix presents the predicted and actual wait-time for all models for 4 randomly selected days in the test data.
Figure A.1: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2019-12-28).
Figure A.2: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2020-01-09).
Figure A.3: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2020-02-08).
Figure A.4: The actual wait-time and the predicted wait-time for each model for a single day in the test data (2020-02-10).