Rsif 2020 1006
royalsocietypublishing.org/journal/rsif
A dynamic, ensemble learning approach to forecast dengue fever epidemic years in Brazil using weather and population susceptibility cycles
Research
Sarah F. McGough1,2, Leonardo Clemente1,3, J. Nathan Kutz4
and Mauricio Santillana1,2,5
1 Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA 02115, USA
2 Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
3 Tecnológico de Monterrey, 64849 Monterrey, Nuevo León, Mexico
4 Department of Applied Mathematics, University of Washington, Seattle, WA 98195, USA
5 Department of Pediatrics, Harvard Medical School, Harvard University, Boston, MA 02115, USA
LC, 0000-0001-8939-8841

Cite this article: McGough SF, Clemente L, Kutz JN, Santillana M. 2021 A dynamic, ensemble learning approach to forecast dengue fever epidemic years in Brazil using weather and population susceptibility cycles. J. R. Soc. Interface 18: 20201006. https://doi.org/10.1098/rsif.2020.1006
Transmission of dengue fever depends on a complex interplay of human, climate and mosquito dynamics, which often change in time and space. It is well known that its disease dynamics are highly influenced by multiple factors including population susceptibility to infection as well as by microclimates: small-area climatic conditions which create environments favourable for the breeding and survival of mosquitoes. Here, we present a novel machine learning dengue forecasting approach, which, dynamically in time and space, identifies local patterns in weather and population susceptibility to make epidemic predictions at the city level in Brazil, months ahead of the occurrence of disease outbreaks. Weather-based predictions are improved when information on population susceptibility is incorporated, indicating that immunity is an important predictor neglected by most dengue forecast models. Given the generalizability of our methodology to any location or input data, it may prove valuable for public health decision-making aimed at mitigating the effects of seasonal dengue outbreaks in locations globally.

Received: 11 December 2020
Accepted: 19 May 2021
Subject Category: Life Sciences–Physics interface
Subject Areas: computational biology, biomathematics
Keywords: dengue, forecasting, ensemble
1. Introduction
Owing to emerging sensor technologies and computational advances, the last decade has seen significant strides in the way data are generated and collected, resulting in large volumes of complex information known as ‘big data’. The recent availability of these data has opened up the possibility of new and complementary avenues for epidemic monitoring that leverage diverse data modalities such as satellite imagery [1,2], Internet search engine activity [3,4], social media [5], mobile phones [6,7], genomics [8,9] and disease surveillance databases [10,11]. This has opened up opportunities to posit and explore more hypotheses for characterizing the causes and outcomes of disease transmission, population behaviour, environmental conditions and other potential indicators of population health. Exploiting these relationships to generate reliable prospective forecasts would benefit health systems by allowing early mobilization of resources for the prevention of morbidities and deaths in the face of public health threats. A major challenge in disease forecasting is developing algorithms that can autonomously and continuously learn from these complex and ever-changing dynamical systems, uncovering patterns and signals with little human effort. Machine learning algorithms are ideally suited for such tasks. Indeed, they are having a profound impact across a wide range of application fields because of their ability to aid in learning and discovery.

Authors for correspondence:
Sarah F. McGough, e-mail: [email protected]
Mauricio Santillana, e-mail: [email protected]

Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.5448568.
© 2021 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution
License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original
author and source are credited.
[Figure 1: workflow schematic, panels (a)–(e). Recoverable labels: daily temperature (K) and precipitation (mm) against date (2000–2005); start date t0 (June–Oct) against period length p (10–95 days), shaded by historical out-of-sample accuracy; temperature (K) against precipitation frequency for epidemic and non-epidemic years; train/test steps repeated for all (t0, p), leading to a final prediction.]
Figure 1. Ensemble forecast workflow. (a) To predict next year’s epidemic status, we extract features from a daily time series of temperature (K) and precipitation
(mm) over a defined (t0, p) time interval and for each year in the training period. (b) We produce an array of features corresponding to the mean value
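The feature-extraction step described in this caption can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name `window_features`, the dictionary layout of the daily series, and the definition of precipitation frequency as the share of days with precipitation above a 0 mm threshold are all assumptions, and the toy series is made up.

```python
from datetime import date, timedelta
from statistics import mean


def window_features(daily, year, start_month, start_day, p, wet_threshold_mm=0.0):
    """Return (mean temperature, precipitation frequency) for one (t0, p) window.

    daily maps a date to a (temperature_K, precipitation_mm) pair.
    The window starts at t0 = (year, start_month, start_day) and spans p days.
    Precipitation frequency is taken here as the share of wet days (assumption).
    """
    t0 = date(year, start_month, start_day)
    days = [t0 + timedelta(days=k) for k in range(p)]
    temps = [daily[d][0] for d in days]
    wet_days = sum(1 for d in days if daily[d][1] > wet_threshold_mm)
    return mean(temps), wet_days / p


# Toy daily series for two years (made-up numbers, not from the paper).
daily = {}
for year in (2000, 2001):
    first = date(year, 6, 1)
    for k in range(30):
        rain = 5.0 if k % 2 == 0 else 0.0
        daily[first + timedelta(days=k)] = (300.0 + (year - 2000), rain)

# One feature pair per training year for the window t0 = 1 June, p = 10 days.
features = {year: window_features(daily, year, 6, 1, 10) for year in (2000, 2001)}
```

Repeating this call over a grid of start dates and period lengths would give one small feature table per (t0, p) window, which is the kind of input each window-specific classifier in the workflow consumes.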
One such complex system is the interplay of human, climate and mosquito dynamics that give rise to the transmission of mosquito-borne diseases such as dengue. Dengue fever, a viral mosquito-borne disease transmitted predominately by the Aedes aegypti and Aedes albopictus mosquitoes, infects an estimated 390 million people per year, with nearly half the world’s population living at risk of infection [12]. The global burden of dengue has doubled every 10 years over the last three decades [13], and the disease is projected to expand its latitude range as global temperatures increase and create new suitable habitats for the Aedes mosquitoes among previously unexposed human populations [14]. Short-term climate conditions, particularly temperature and precipitation, can create favourable conditions for the breeding and survival of Aedes mosquitoes that may increase the transmission of the dengue fever virus in humans. Distinct ranges of temperature and precipitation have been observed to have an influence on the extrinsic incubation period [15,16], mosquito maturation rate [17], length of larval hatch time [18], survival rate [19] and biting rate [20]. However, the relationships that govern these parameters and give rise to dengue transmission are complex and dynamic, changing over time and across geographies. Moreover, multi-year cycles of dengue fever outbreaks, caused by one or more circulating dengue fever serotypes (DENV I, II, III, IV) and short-term immunity conferred after infection, add an important layer of complexity to prediction [21].

The dengue forecasting literature lacks a systematic, self-adaptive and generalizable framework capable of identifying weather and population susceptibility patterns that may be predictive of dengue fever outbreaks, particularly at the city level. Vector-borne diseases commonly exhibit spatial heterogeneity, a result of spatial variation in vector habitat, weather patterns and human control actions [22–25]. For developing forecast systems, this feature implies a trade-off between model consistency and spatial resolution. As a consequence, most studies to date focus on producing ad hoc predictions for a single location, ranging from the national to the city level [26–28], while others build and evaluate multiple modelling strategies per study site in efforts to manually identify relationships between weather patterns and dengue incidence over different geographies and temporal windows [29,30]. Both approaches highlight the difficulty in producing forecast models that are viable in diverse settings. By contrast, data-driven techniques demonstrate promise by learning from multi-scale, complex systems and automatically adapting to new information. A recent descriptive study showed the promise of a data-driven approach in identifying weather patterns with meaningful signals for dengue fever outbreaks [31]. Specifically, their data-driven strategy identified temperature and frequency of precipitation as key features in forecasting dengue outbreaks by extracting windowed time intervals for different cities that were highly predictive. Motivated by such learning algorithms, we build upon this data-driven strategy to build a richer, supervised forecasting algorithm.

2. Results

2.1. Exploiting weather signals to create a data-driven forecast system
We obtained data on both annual dengue fever cases (Brazilian Ministry of Health) for 2001–2017 and on daily temperature and precipitation (GMAO-NASA) for 2000–2016, for 20 dengue-endemic municipalities (figure 1; electronic supplementary material, table S1) in Brazil. Weather patterns were extracted and analysed across hundreds of partially overlapping time intervals collectively spanning the last seven months of a given year, a time period that typically precedes the onset of epidemic outbreaks in Brazil. Each of these patterns was then assessed for its ability to predict an outbreak year (defined as a year in which the number of cases exceeds 100 per 100 000 persons) for the subsequent year. Retrospective and fully out-of-sample forecasts, trained on a yearly expanding window, were produced for 10 years (2008–2017) and for each time interval using support vector machines (SVMs), a binary classifier. Every year, the time intervals with high historical predictive power were automatically selected and evaluated in the upcoming year to produce out-of-sample predictions for the subsequent dengue season (figure 1). An ensemble approach was then implemented to determine, in a completely out-of-sample fashion (using the first 4 years of out-of-sample predictions to inform ensemble model selection), the system’s final prediction: whether a year would be epidemic or not for the next 6 years (2012–2017).

This system, which autonomously identifies and exploits the predictions of multiple time windows during the calendar year, makes it possible to identify temporally similar regions of highly predictive periods of the year preceding dengue outbreaks, here referred to as ‘weather signatures’. Weather signatures represent time windows across years that show strong influence
(predictive power) on the incidence of dengue in a subsequent year. We observed that cities where our methodology led to higher prediction accuracy tended to have clear and robust weather signatures over the years, while cities where our approach was not strongly predictive did not exhibit consistent and robust weather patterns (figures 2 and 3a). Further, we observed that strong weather signatures in our sample of cities often corresponded with or preceded important alternating tropical seasons, such as rainy and dry seasons.

[Figure 2: heat-map panels by municipality (recoverable panel titles: São Gonçalo, Santa Cruz do Capibaribe, Juazeiro do Norte, Jí-Paraná, Rondonópolis, Belo Horizonte, Parnaíba, São Vicente, Barretos, Aracajú); x-axis: start date, t0 (June–Oct); y-axis: period length, p; colour scale: out-of-sample accuracy (0.7–1.0).]
Figure 2. The 10 year (2008–2017) out-of-sample forecast accuracy (%) for each time window of temperature and precipitation, by municipality. The x-axis (t0) indicates the start date of the time interval, and the y-axis (p) indicates the length of the time interval from which weather data were gathered (10–95 days). Models achieving at least 7/10 correct out-of-sample forecasts are shown in shades of yellow. Municipalities are ordered by decreasing ensemble prediction accuracy; that is, the proportion of years correctly forecast by the ensemble method over the years 2012–2017.

2.2. Weather-based forecasting performance
Using weather data (temperature and frequency of precipitation) alone to predict annual dengue outbreaks, our approach correctly forecast 81% of all epidemic years across 20 municipalities in Brazil between 2012 and 2017 (table 1, figure 3). For reference and as a baseline, the frequency of epidemic and non-epidemic years was 60% and 40%, thus a naive approach that predicts that all years are epidemic (the class majority) would achieve an overall accuracy of 60%. Our approach only identified 58% of non-epidemic years correctly. This resulted in an overall accuracy of approximately 72%. Our approach significantly exceeded (p = 0.005) the predictive power of a naive predictor.

2.3. Incorporating empirically observed dengue susceptibility cycles
The previously described weather-based ensemble approach ignores important factors that may influence the emergence of epidemic outbreaks from year to year, such as the population susceptibility to being infected with the virus. Specifically, endemic transmission of dengue fever is typically distinguished by periodic outbreak cycles of around 3–4 years. These outbreak cycles are thought to occur as a result of (i) an exhaustion of the susceptible population after an outbreak and (ii) short-term cross-immunity to other circulating DENV serotypes after infection [21], although the cycles can also be complicated by increased severity of a second infection [32]. Both factors result in a depletion of the population vulnerable to infection and act as barriers to subsequent outbreaks. Independent of climate variability over the years, we expect some preservation of these susceptibility cycles.

Inspired by this phenomenon, we implemented a data-driven hidden Markov model by empirically computing the frequency of transitioning between multiple sequences of epidemic and non-epidemic years (described in detail in the electronic supplementary material). Given the previously observed sequence of consecutive outbreak and non-outbreak years (dengue fever cycles), the Markov model computes the probability of the next year being an outbreak or a non-outbreak year. This acts as a proxy to dengue fever susceptibility in the population as it accounts for the cyclical nature of outbreaks that may be influenced by, for example, a depletion of the susceptible population following multiple years of high dengue activity. The approach is implemented as follows: if the weather-based approach makes a prediction with low probability, a decision rule is implemented to automatically override the weather-based prediction if the hidden Markov
model (based on the pattern of consecutive outbreaks and non-outbreaks in years prior) predicts a more likely scenario. In this way, the ‘cycles’ of dengue fever outbreak susceptibility are incorporated into our otherwise agnostic weather-based approach.

[Figure 3: (a) city-by-year grid of forecasts for 2012–2017, marked as epidemic or non-epidemic and shaded by mean posterior class probability (recoverable row labels: São Gonçalo, Santa Cruz do Capibaribe, Juazeiro do Norte, Jí-Paraná, Rondonópolis, Manaus, São Luís, Barra Mansa, Eunápolis, Sertãozinho, Belo Horizonte, Parnaíba, São Vicente); (b) count of correctly predicted epidemic and non-epidemic years, by year; (c) mean posterior class probability, by year and epidemic status.]
Figure 3. Weather-based prediction results for 120 municipality years. (a) Annual out-of-sample forecasts of outbreak status (epidemic/non-epidemic) for 20 Brazilian municipalities from 2012 to 2017, shaded by the mean posterior probability of the true outbreak status. Correct forecasts are indicated by a plus (+) sign, and cells with light shading indicate that the model predicted the class with low probability. Municipalities are ordered by decreasing ensemble prediction accuracy; that is, the proportion of years correctly forecast by the ensemble method over the years 2012–2017. (b) The number of total epidemic and non-epidemic years correctly forecast across 20 municipalities, by year. The dashed white line indicates the number correctly forecast after the incorporation of empirically observed dengue cycles. (c) The mean posterior class probability across municipalities, by year and epidemic status.

2.4. Combining dengue cycles with weather patterns improves forecasts
Compared with the exclusively weather-based approach, incorporating these empirically observed dengue cycles into our system improved our ability to predict non-epidemic years by approximately 20% (specificity = 69%) and increased overall accuracy to 74.2% (table 1). Specifically, the additional decision rule replaced seven epidemic forecasts with non-epidemic forecasts, of which five were correct (figure 3b). The majority of these cases belonged to cities which had experienced three consecutive epidemic years leading up to the prediction.

Overall, the combined approach (weather-based plus dengue cycles) was dominantly driven by weather patterns and informed by the decision rule only in a few cases when historical data showed a very strong likelihood of either an epidemic or not epidemic year happening. Thus, the decision rule to favour the Markov model acts as an ‘expert opinion’ for situations in which there is clear evidence that a given predicted outbreak scenario (even if suggested by the weather patterns) is unlikely. Our specific finding, that the dengue cycles were used exclusively to overturn epidemic forecasts, suggests that while the weather conditions in those locations and years were identified to be conducive to an outbreak, there was stronger evidence that the population may have had low susceptibility to infection (thus avoiding an outbreak), based on multiple consecutive preceding years of high disease incidence.

Table 1. Performance of weather-based out-of-sample forecasts across 120 municipality years in Brazil, with and without consideration for DENV susceptibility cycles.

evaluation metric                            weather      weather + DENV cycle
accuracy                                     71.70%       75%
hit rate (sensitivity)                       81%          78%
non-epidemic detection rate (specificity)    58%          71%
no-information rate                          60%          60%
P (accuracy > no-information rate)           p = 0.005    p = 0.0004

2.5. Model performance by year
The success of our combined epidemic forecasts varied by year, reflecting the difficulty of forecasting disease activity relying only on weather patterns and the empirically extracted susceptibility cycles. During the last three years of the time series (2015–2017), epidemics were predicted by the weather-only models with at least 80% accuracy, with 100% of the 13 outbreaks in 2016 correctly forecast (figure 3b,c). Conversely, non-epidemic years during 2013–2014 were particularly difficult to predict, with only one-third and one-half of cities correctly forecasting non-epidemics for these years, respectively. The most successful non-epidemic predictions occurred in 2012, for which six out of eight non-epidemics (75%) were predicted correctly. Overall, 2015 and 2016 were the most successfully classified years, with 80% and 85% of municipalities correctly classified
as epidemics or non-epidemics, respectively, while 2014 and 2017 were the most difficult years to predict, with 45% and 35% of municipalities misclassified, respectively.

Incorporating information on the dengue cycles helped detect an additional non-epidemic in 2012 and 2015, and an additional three non-epidemics in 2017 (figure 3b).

[Figure 4: heat-map panels by municipality (recoverable panel titles: São Gonçalo, Santa Cruz do Capibaribe, Juazeiro do Norte, Jí-Paraná, Rondonópolis); x-axis: start date, t0 (June–Oct); y-axis: period length, p.]
Figure 4. Periods of the year selected into the ensemble forecast model for 2012–2017, by municipality. The x-axis (t0) indicates the start date of the time interval, and the y-axis (p) indicates the length of the time interval from which weather data were gathered (10–95 days). Municipalities with smaller and brighter yellow centres are those which exhibit the highest consistency in the predictive performance of weather patterns. Municipalities are ordered by decreasing ensemble prediction accuracy; that is, the proportion of years correctly forecast by the ensemble method over the years 2012–2017.

2.6. Quantifying the strength of predictions
Because our forecast system produces deterministic binary predictions (epidemic/non-epidemic year) using local-in-time SVM classifiers, a natural question is how to quantify the conviction (or confidence) of each prediction. It is important to note that the number of observations per city is small (n = 17), and, thus, a rigorous probabilistic approach to quantifying conditional probabilities of success is not feasible. However, in the interest of better communicating to public health officials the reliability of our predictions in a given location and time period, as well as identifying the determinants of success of our prediction system if one were to extend our predictive approach to new locations, we explored simple ways to characterize the accuracy and conviction of predictions. We did this based on both the historical performance of the selected ensemble generating the prediction and the performance of the weather-based classifiers themselves.

Our prediction system combines the output of a collection of local-in-time binary classifiers that use different time periods (characterized by an initial point in time, t0, and a window length, p), prior to the typical date of the onset of dengue outbreaks, as predictors. For each city and each year, the combination of these outputs is calculated using a voting system that only considers time windows that have consistently exhibited the highest historical out-of-sample prediction performance among all other time windows of the calendar year. In our framework, time windows are automatically selected into the forecasting ensemble if (i) their own historical out-of-sample performance is high and (ii) the historical performance of their calendar neighbours, that is, models using temporally nearby time windows as predictors, is high as well.

Consequently, we computed metrics of ensemble accuracy and strength (or confidence) by quantifying both of these elements. We found that, in cities where the predictive performance of our approach is highest (electronic supplementary material, figure S2), the successful individual classifiers that contribute to our final prediction use as input temporal regions that are clustered around one another (as shown in figure 4), suggesting that the presence of temporally consistent weather patterns can be thought of as an indicator of the success of our methodology.

It is important to note that models with high historical prediction performance may still lead to poor outcomes if the weather data for the year of (out-of-sample) forecast do not clearly belong to an epidemic or non-epidemic class, as learned by the individual classifiers, and/or if its weather patterns happen to ‘look like’ those appearing historically in the opposite class.

In order to further assess the individual strength or conviction of each individual classifier, we estimated whether the separability or difference between the two classes
(epidemic versus non-epidemic) was well captured by the classifier by extracting calibrated posterior probabilities of each SVM model using Platt’s scaling [33]. The posterior probability reflects the distance to the separation boundary distinguishing epidemic and non-epidemic years on the basis of weather. Thus, a higher probability represents how strongly the weather patterns of the prediction year aligned with those experienced by prior outbreak or non-outbreak years. We observed that, in general, the probabilities were moderately calibrated, i.e. roughly 80% of predictions made with 0.8 probability were epidemics (electronic supplementary material, figure S3); however, the small sample size (i.e. six out-of-sample years for each of the 20 cities) limits the ability to interpret this feature appropriately. We found

immediately extended to other locations, requiring no location-specific manipulation or inputs aside from a globally available time series of daily temperature and precipitation as well as a complete yearly record of dengue incidence.

Using weather information only, our models seek to characterize and exploit the predictive ability of distinct weather patterns preceding outbreak years. Because our framework automatically identifies the time periods for which weather patterns produce strong signals, it was possible to identify temporal weather signatures in multiple locations with vastly different ecosystems and geographical locations. For this, we observed that cities with better overall prediction accuracy had stronger weather signatures, suggesting perhaps some biological consistency. For example,
studies may aid with this component, though this surveillance information is more challenging to routinely acquire [36]. Regardless, here we highlight the importance of incorporating mechanistic processes of disease transmission into data-driven approaches that may be otherwise blinded to them.

Our approach achieved an overall accuracy of 75%, which we believe is promising considering the difficulties in predicting the target. To put our results in context, we visited other benchmarks in the dengue prediction literature. While most dengue forecast models predict a continuous outcome such as total incidence (rendering comparisons of performance

out-of-sample ensemble predictions, but ultimately it is difficult to establish strong climatic distinctions between outbreak and non-outbreak years in the data with so few samples. Thus, we anticipate improvement in performance for settings that have multiple decades of data, which would allow for longer training periods, improved separability in the data and more stable identification of dengue susceptibility cycles, all improving the quality, robustness and accuracy of predictions. In addition, where epidemiological data are available at finer temporal resolutions (e.g. weekly, monthly), this prediction problem could leverage more classical time-series approaches (such as SARIMA models) that incorporate adjustments for seasonality
signal processing/spectral analysis, machine learning and ensemble modelling to achieve robust, data-driven epidemic forecasts that do not require any prior knowledge of the system (i.e. climatic influences on dengue transmission). Our research question is inherently one of time-series classification, to forecast epidemic versus non-epidemic years of dengue fever. The workflow begins with a time series of hourly and daily weather information, which serve as inputs to a collection of classifiers that contribute to ensemble-based epidemic predictions. Our approach can be described in five steps.

1. Signal preprocessing: for a time series of weather data, define

We applied our approach to 20 cities in Brazil spanning large geographical and population ranges (electronic supplementary material, figure S1 and table S1). We used as input a historical time series spanning 17 years and consisting of information on dengue case reports (number, annual) and two weather variables: 2 m air temperature (kelvin, daily) and precipitation (kg m−2, hourly). We describe data sources, acquisition and processing in the electronic supplementary material. After an initial training period of 7 years, we generated 10 years of out-of-sample epidemic predictions for each of the independent models using a 1 year expanding training window (step 2). We used the first 4 years of out-of-sample predictions to inform ensemble model selection (step 4) and produced ensemble-based predictions for the remaining 6 years (step 5).

4.2. Signal preprocessing
Using a daily time series of weather data to forecast dengue fever epidemic status requires identifying the most predictive period(s)

important weather signals in multiple different locations with vastly different ecosystems and weather patterns, we allow the data to inform the choice of time intervals. Our algorithm achieves this by scanning over multiple, partially overlapping time intervals across the calendar year, and building hundreds of models on these different intervals in order to select those with the strongest signals.

Each time interval is defined by a start date, t0, between early June and late September, and a period length, p, of between 10 and 95 days. The combination of each (t0, p) produces multiple, partially overlapping intervals spanning the last seven months of the calendar year.

4.4. Independent model training and prediction
The goal of our independent model-building step is to identify dynamically, through the continually updating performance of a collection of models, the periods of the year that are most predictive of annual dengue outbreaks, in order to exploit a small number of them to generate forecasts.

To forecast outbreak years, we trained a collection of SVM classifiers on an initial 7 year training period and produced annual forecasts incorporating the most recently available weather information using a dynamic, 1 year expanding training window. A unique SVM was trained for each of the (t0, p) time intervals, resulting in a total of 432 independent models trained per year. Each model generated out-of-sample predictions for the remaining 10 years of data. Predictions were made by classifying the 30 out-of-sample data points corresponding to the weather information preceding the target year, and taking a majority vote. In order to handle highly nonlinear relationships between weather variables, both radial basis function and sigmoid kernels were used and evaluated for performance, and we show results for the best respective kernel in each city. We tuned model parameters (gamma, soft margin cost function and coefficient) using 10-fold cross-validation.

SVMs, a supervised learning method for classification, were used because of their flexibility in the face of complex, nonlinear decision boundaries and their robustness to overfitting and outliers. The property that underpins these advantages is known as the ‘large-margin classifier’. SVMs are also known for their good performance in high-dimensional feature space, which is advantageous for the scale-up of the model to include dozens more predictors.

4.5. Model selection
From the resulting collection of 432 models, the best-performing models (n = 11) were selected each year based on (i) historical out-

(ii) short-term cross-immunity to other circulating DENV serotypes after infection [21]. Both factors result in a depletion of the population vulnerable to infection and act as barriers to subsequent outbreaks. Independent of climate variability over the years, we expect some preservation of these cycles.

Consequently, we implemented a ‘decision rule’ in the model based on the observed transitions between epidemic and non-epidemic years across 51 Brazilian municipalities meeting endemic inclusion criteria (electronic supplementary material). Across these municipalities, we computed the mean second- and third-order Markov transition probabilities, representing the probability of transition from one outbreak state (epidemic/non-epidemic) to the opposite outbreak state (non-epidemic/epidemic) after 2 and 3 consecutive years, respectively. Thus, we obtained the transition probabilities corresponding to
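The empirical transition probabilities and the decision rule described in this section can be illustrated with a small sketch. It is a simplified stand-in, not the authors' implementation: the names `switch_probability` and `apply_decision_rule` are hypothetical, counts are pooled over all municipalities rather than averaged per municipality, a run is taken to mean that the previous k years share the same state, and the override condition (the cycle model disagrees and is more confident than the weather model) is an assumed reading of the rule. The toy histories are made up.

```python
def switch_probability(sequences, k, state):
    """Empirical P(next year != state | previous k years all == state).

    sequences: iterable of lists of 0/1 annual labels (1 = epidemic year).
    Counts are pooled over all sequences (an assumption of this sketch).
    Returns None when no run of length k is observed.
    """
    runs = 0
    switches = 0
    for seq in sequences:
        for i in range(k, len(seq)):
            if all(s == state for s in seq[i - k:i]):
                runs += 1
                if seq[i] != state:
                    switches += 1
    return switches / runs if runs else None


def apply_decision_rule(weather_class, weather_prob, markov_class, markov_prob):
    """Keep the weather forecast unless the cycle model disagrees with more confidence."""
    if markov_class != weather_class and markov_prob > weather_prob:
        return markov_class
    return weather_class


# Toy annual histories for two municipalities (made-up labels).
histories = [
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0, 0, 1],
]

# Probability that a run of three epidemic years is followed by a non-epidemic year.
p_end_after_3 = switch_probability(histories, 3, state=1)

# A weak epidemic forecast from weather is overturned by the stronger cycle signal.
final = apply_decision_rule(weather_class=1, weather_prob=0.55,
                            markov_class=0, markov_prob=p_end_after_3)
```

In this toy case three of the four observed three-year epidemic runs end in a non-epidemic year, so the cycle-based probability exceeds the low-confidence weather forecast and the rule returns a non-epidemic forecast, mirroring the direction of the overrides reported in the results.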
References
1. Ford TE, Colwell RR, Rose JB, Morse SS, Rogers DJ, Yates TL. 2009 Using satellite images of environmental changes to predict infectious disease outbreaks. Emerg. Infect. Dis. 15, 1341–1346. (doi:10.3201/eid/1509.081334)
2. Sewe MO, Tozan Y, Ahlm C, Rocklöv J. 2017 Using remote sensing environmental data to forecast malaria incidence at a rural district hospital in Western Kenya. Sci. Rep. 7, 2589. (doi:10.1038/s41598-017-02560-z)
3. McGough SF, Brownstein JS, Hawkins JB, Santillana M. 2017 Forecasting Zika incidence in the 2016 Latin America outbreak combining traditional disease surveillance with search, social media, and news report data. PLoS Negl. Trop. Dis. 11, e0005295. (doi:10.1371/journal.pntd.0005295)
4. Yang S, Santillana M, Kou SC. 2015 Accurate estimation of influenza epidemics using Google search data via ARGO. Proc. Natl Acad. Sci. USA 112, 14 473–14 478. (doi:10.1073/pnas.1515373112)
5. Marques-Toledo CdA, Degener CM, Vinhal L, Coelho G, Meira W, Codeço CT, Teixeira MM. 2017 Dengue prediction by the web: tweets are a useful tool for estimating and forecasting Dengue at country and city level. PLoS Negl. Trop. Dis. 11, e0005729. (doi:10.1371/journal.pntd.0005729)
6. Bengtsson L, Gaudart J, Lu X, Moore S, Wetter E, Sallah K, Rebaudet S, Piarroux R. 2015 Using mobile
18. Byttebier B, De Majo MS, Fischer S. 2014 Hatching response of Aedes aegypti (Diptera: Culicidae) eggs at low temperatures: effects of hatching media and storage conditions. J. Med. Entomol. 51, 97–103. (doi:10.1603/ME13066)
19. Barry W, Alto DB. 2013 Temperature and dengue virus infection in mosquitoes: independent effects on the immature and adult stages. Am. J. Trop. Med. Hyg. 88, 497. (doi:10.4269/ajtmh.12-0056)
20. Scott TW, Morrison AC, Lorenz LH, Clark GG, Strickman D, Kittayapong P, Zhou H, Edman JD. 2000 Longitudinal studies of Aedes aegypti (Diptera: Culicidae) in Thailand and Puerto Rico: population dynamics. J. Med. Entomol. 37, 77–88.
climate conditions for different capitals. (https://arxiv.org/abs/1701.00166 [q-bio.QM])
32. Guzman MG, Alvarez M, Halstead SB. 2013 Secondary infection as a risk factor for dengue hemorrhagic fever/dengue shock syndrome: an historical perspective and role of antibody-dependent enhancement of infection. Arch. Virol. 158, 1445–1459. (doi:10.1007/s00705-013-1645-3)
33. Platt J. 1999 Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classifiers 10, 61–74.
34. van Panhuis WG, Hyun S, Blaney K, Marques Jr ETA, Coelho GE, Siqueira Jr JB, Tibshirani R, da Silva Jr JB, Rosenfeld R. 2014 Risk of dengue for tourists and teams