Bruno Soalheira - Dissertacao
Examination Committee
October 2019
Acknowledgments
I would like to acknowledge everyone who played an important role in my academic accomplishments.
First, I want to thank my supervisor, professor Andreas Wichert, for supporting
me since the beginning of my thesis. Every time I needed technical advice or even
a word of encouragement, he was there to help me.
Secondly, I want to thank my family, who were always there with unconditional
love and support.
Finally, I want to thank my closest friends, who were extremely important just by being there, and for all the small gestures and words of encouragement.
Thank you all for making this possible.
Abstract
Artificial neural networks are a powerful machine learning technique used for many purposes; time series forecasting is one of them. Several of these techniques are explored, with special attention to recurrent neural networks, which are among the methods most commonly used in time series forecasting. All these methods are presented with the goal of better understanding recurrent neural networks. There is no predefined best algorithm for this type of problem; in this field experimentation is required, and only then can we draw conclusions.
In this case, feedforward neural networks and recurrent neural networks will be used to obtain models capable of predicting time series. Two related data sets, with some important differences that will affect the results, are used in these experiments. The main objective is to compare the two types of networks on both data sets, presenting the limitations, drawbacks and advantages of each one of them.
In most cases, recurrent neural networks outperform feedforward neural
networks for time series forecasting problems.
Resumo
Artificial neural networks are a machine learning technique used for many purposes; time series forecasting is one of them. Several of these techniques are explored, with special attention to recurrent neural networks, which are among the methods most commonly used in this type of time series forecasting problem. There is no predefined method that is better than the others for solving this type of problem; in this area experimentation is necessary, and only then is it possible to draw conclusions.
In this case, feedforward neural networks and recurrent neural networks will be used to obtain models capable of forecasting time series. Two related data sets, with important differences that will affect the results, will be used in these experiments. The main objective is to compare the two types of neural networks on both data sets, presenting the limitations, advantages and disadvantages of each one of them.
In most cases, recurrent neural networks perform better than feedforward neural networks on time series forecasting problems.
Table of Contents
1 Introduction .............................................................................................. 8
4.3.2 Results of the sunspots data set using a feedforward neural network ......... 37
4.3.3 Results of the temperatures data set using Elman and Jordan neural networks ......... 39
4.3.4 Results of the sunspots data set using Elman and Jordan neural networks ......... 42
5 Conclusion ............................................................................................. 45
6 Bibliography ........................................................................................... 48
7 Appendix A ............................................................................................ 50
1 Introduction
1.1 Motivation
Time series forecasting is an area of machine learning that has fascinated data scientists for the last few decades. The field of machine learning has brought us numerous advantages, one of them being the ability to predict what is going to happen using nothing but information. No matter how difficult and random a data set may seem, these algorithms can find patterns that are not possible to identify using other techniques. Using past and present data we can infer what is going to happen in the future with precision; we must choose the right method, and experimentation is the key.
Neural networks are a remarkable method used for different purposes; time series forecasting is only one of them. They can also be used in many other areas, such as customer research, data validation and risk management. It is also important to note that they are not alone; there are other methods capable of accomplishing some of the things that neural networks can, and we will discuss these in the next chapter.
Neural network models have been around for more than half a century now. Since the publication of the first papers on the matter, scientists knew these models had enormous potential to achieve numerous goals. Just by thinking of the complexity of the human brain, we can only imagine how far neural networks can go. It is a huge area of study, and it is important to note that only a small fraction of neural networks' capabilities is taken advantage of nowadays. There is so much more to discover on this matter.
1.2 Objectives
1.3 Outline
As previously mentioned, this thesis will be composed of two main parts after
this introduction:
• Chapter 2, in which we discuss and present the theory behind this complex area of research. We briefly present some important concepts and the prediction methods other than neural networks that are used in this area; we explain neural networks in detail, including their history, definition and structure; then we delve into the specific type of neural networks we are interested in for this project, recurrent neural networks, describing the different architectures that exist and the advantages, disadvantages and applications of this type of neural network.
• In Chapters 3, 4 and 5 we will present the proposed architecture, the
evaluation method that will be used to analyze the results and the
experimentations and conclusions of the thesis, respectively.
2 Related Work
2.1.1 Forecasting
In order to have a better understanding on the matter, we will look briefly into
some techniques that are used in this type of analysis.
The moving average (MA) is an interesting, yet simple, model in which we infer predictions using the mean value of past data. Considering the characteristics of the data, we choose how far back we want to look. Let F_{t+1} be the forecast value for the next period, where t is the current period, n is the forecasting horizon (i.e. how far back we look) and A_i is the data value from period i:

F_{t+1} = (1/n) · Σ_{i=0}^{n-1} A_{t-i}
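As an illustration, a minimal R sketch of this forecast; the toy series and the window n = 3 are purely illustrative:

    # Simple moving average forecast: mean of the last n observations
    ma_forecast <- function(A, n) {
      t <- length(A)
      mean(A[(t - n + 1):t])
    }

    A <- c(10, 12, 11, 13, 12, 14)   # toy data values
    ma_forecast(A, n = 3)            # (13 + 12 + 14) / 3 = 13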
Even though MA algorithms are simplistic, and effective in some situations, they also present several limitations.
The fact that Moving Averages can spread out over any period can be problematic, because the general trend can change depending on the time period being analyzed. It is also important to note that these methods don't work for all companies, especially the ones that are heavily influenced by current events. They only focus on past events and completely ignore current and valuable information such as new competitors or higher or lower demand for products in the industry…
In conclusion, careful inspection must be done before applying this method, since the results might be misleading otherwise.
As in the previous method, some concerns arise when dealing with Exponential
Smoothing analysis.
It produces forecasts that lag the actual trends, the main reason being that the algorithm neglects the ups and downs associated with the random variations. Also, this method will not work properly if there is a trend in the series; therefore, it is more effective when used for short-term forecasts on series with no cyclical or seasonal variations.
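A minimal R sketch of simple exponential smoothing, using the base HoltWinters function with trend and seasonality disabled; the toy series and the smoothing parameter alpha = 0.3 are illustrative:

    # Simple exponential smoothing on a toy series
    x <- ts(c(10, 12, 11, 13, 12, 14, 13, 15))
    fit <- HoltWinters(x, alpha = 0.3, beta = FALSE, gamma = FALSE)
    predict(fit, n.ahead = 3)   # flat forecasts that lag recent changes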
2.2.4 Regression
y = β_0 + β_1·x + ε    (2.5)

and, once again, using the information we possess about the data, we choose the values of β_0, β_1, β_2, …, β_k according to the importance that we want to give to each predictor.
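A minimal R sketch of fitting such a linear model with lm(); the toy data and coefficients are illustrative:

    # Fit a simple linear regression y = b0 + b1*x on toy data
    set.seed(1)
    x <- 1:50
    y <- 2 + 0.5 * x + rnorm(50)               # linear trend plus noise
    fit <- lm(y ~ x)                           # estimates the intercept and slope
    coef(fit)
    predict(fit, newdata = data.frame(x = 51)) # prediction for a new point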
It may seem obvious, but it is important to note that linear models are limited to linear relationships and that the data must be independent.
Only in a few cases can we find a straight line that models our data well and, because the model is described by a straight line, it cannot capture cases in which a curve would be more appropriate.
This model is based on the relationship between the mean values of the dependent and independent variables, but in some cases it can be necessary to look at the extremes of the dependent variable.
Outliers can have huge effects on the predictions, because the mean values will be affected by them.
X_t = c + Σ_{i=1}^{p} φ_i·X_{t-i} + ε_t    (2.8)

where X_t is the value to predict, X_{t-1}, X_{t-2}, …, X_{t-p} are the p previous values used, φ_1, φ_2, …, φ_p are the parameters of the model, c is a constant and ε_t is the white noise error term.
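A minimal R sketch of fitting an AR(p) model with the base arima function; the simulated series and the choice p = 2 are illustrative:

    # Fit an AR(2) model to a simulated autoregressive series
    set.seed(1)
    x <- arima.sim(model = list(ar = c(0.6, -0.2)), n = 200)
    fit <- arima(x, order = c(2, 0, 0))   # order = c(p, d, q) with d = q = 0
    fit$coef                              # estimated AR parameters and intercept
    predict(fit, n.ahead = 5)$pred        # forecasts for the next 5 periods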
2.2.6 Autoregressive-moving-average model (ARMA)
This name should sound familiar to the reader, since this method involves two previously mentioned models working together. As a quick reminder, the AR part consists of predicting a value based on its own past values, and the MA part involves modeling the error term as a linear combination of error terms occurring contemporaneously and at successive times in the past. ARMA(p,q) is used to denote this model and refers to the order p of the AR part and the order q of the MA part [27]. Consider this different way of defining the MA model to better understand the ARMA model:
X_t = µ + ε_t + Σ_{i=1}^{q} θ_i·ε_{t-i}    (2.9)

where θ_1, θ_2, …, θ_q are the parameters of the model, µ is the expectation of X_t and ε_t, ε_{t-1}, ε_{t-2}, …, ε_{t-q} are white noise error terms.
Combining the two parts, a new equation defining the ARMA model is obtained:

X_t = c + ε_t + Σ_{i=1}^{p} φ_i·X_{t-i} + Σ_{i=1}^{q} θ_i·ε_{t-i}    (2.10)
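A minimal R sketch of an ARMA(p, q) fit, again using the base arima function with the differencing order set to zero; the simulated data and the orders p = 1, q = 1 are illustrative:

    # Fit an ARMA(1,1) model to a simulated series
    set.seed(1)
    x <- arima.sim(model = list(ar = 0.5, ma = 0.4), n = 200)
    fit <- arima(x, order = c(1, 0, 1))   # order = c(p, d, q); d = 0 gives ARMA(p, q)
    fit$coef                              # estimated AR and MA parameters
    predict(fit, n.ahead = 5)$pred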
The main goal of this section was to introduce some of the algorithms used in time series forecasting and to provide some intuition on how they work. This will be helpful in understanding the idea behind neural networks, the prediction technique used in this project, to which the whole next section is dedicated.
2.3.1 History
The study of the human brain has been around for centuries and it was only
natural to try to harness this thinking process. In the 40s and 50s some scientists
started taking this matter seriously and published several papers.
Warren McCulloch and Walter Pitts were among the first to write about neural
networks in 1943 [4].
As computers advanced in the following years, it became possible to model
some of these theories. The first few attempts to model neural networks failed but
later attempts were successful and from that point on different types of neural
network models were developed and used for various purposes.
In 1959, Bernard Widrow and Marcian Hoff developed models that they named
ADALINE and MADALINE. MADALINE was the first neural network model to be
applied to a real-world problem with success.
Scientists were impressed with the results of these models and kept pushing forward, which caused people to exaggerate the potential of neural networks. Neural networks were powerful methods, but the technological limitations of the mid-1950s were too big to overcome. The perceptron, which was proposed by Frank Rosenblatt in 1958, had a major drawback: it could only learn to separate linearly separable classes. Criticism arose regarding neural network research, which led to a period of stunted growth until the early 1980s.
In 1982 John Hopfield presented a paper in which he showed how neural
networks could work and what they could do.
By 1985 the American Institute of Physics had begun what has become an annual meeting, where numerous scientists get together and discuss new ideas about neural networks. This meeting was called Neural Networks for Computing.
Over the last few years, technology has been evolving exponentially, which led
to the development of systems capable of handling even more complex artificial
neural networks.
Nowadays, neural networks are a subject of constant study; they are discussed
everywhere.
In Figure 2.1 some important advances in the area of neural networks are
shown.
2.3.2 Biology
Figure 2.2 The Structure of a Neuron. Adapted from “Neural Network Design”
by Hagan, M.T., Demuth, H.B., Beale, M.H. and De Jesús (1996)
Neural networks have a learning phase, which is crucial for the quality of the results. There are three main learning strategies:
• Supervised learning. This is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. The example input-output pairs are usually called training data, and each of them consists of an input object (usually a vector) and an output value. The new inputs that will be tested are called test data. The algorithm analyses the training data and infers a function that will be used to calculate the output values for the new test data values [12][29].
• Unsupervised Learning. This method must be used when there is no training
data to learn from. The neural network analyses the data, and then a cost
function tells the neural network how far it was from the correct result.
Basically, the model learns from the test data and keeps adapting according
to this data.
• Reinforcement learning. This algorithm rewards the neural network in the presence of positive results and punishes it for negative results, forcing the neural network to learn over time.
f(x) = x    (2.11)

¹ What is an Activation Function? (deepai.org)
For example, there is the sigmoid activation function, which is used to obtain values between 0 and 1 (commonly used in probability problems), and the tanh activation function, which provides values ranging from -1 to 1 (commonly used for classification between two classes), among many others. The sigmoid function is defined the following way:
σ(z) = 1 / (1 + e^(−z))    (2.12)
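A minimal R sketch of these two activation functions; the example input values are arbitrary:

    # Sigmoid squashes values into (0, 1); tanh squashes values into (-1, 1)
    sigmoid <- function(z) 1 / (1 + exp(-z))
    sigmoid(c(-2, 0, 2))   # approx. 0.12, 0.50, 0.88
    tanh(c(-2, 0, 2))      # approx. -0.96, 0.00, 0.96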
2.3.6 Backpropagation
In the beginning, the weights of a neural network are initialized randomly, and
these values need to be adjusted until we obtain the values that fit our model the
best. One of the most common ways to accomplish this is using the backpropagation
technique. It is a method used to calculate the gradient that is needed to adjust the weights of a neural network. It is mostly used when there is more than one hidden layer in the network [34].²
Backpropagation requires three things to work properly:
• A dataset consisting of input-output pairs.
• A feedforward neural network, meaning that there are no connections
between nodes in the same layer and that layers are fully connected.
• An error function, which defines the error between the desired output and
the obtained output.
This method can be applied to any feedforward network to learn a training set of inputs and outputs.
The main idea behind backpropagation is to update the weights' values in such a way that the error becomes minimal. First, we figure out whether to increase or decrease the values of the weights. Once we know which way to go, we keep making smaller and smaller adjustments until the network provides the desired results.
Considering η to be the learning rate, C to be the loss function and ε(t) to be a stochastic term, we can update the weights w, using stochastic gradient descent, the following way:

w(t+1) = w(t) − η·∇C(w(t)) + ε(t)
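A minimal R sketch of this update rule for a single weight, assuming a toy quadratic loss C(w) = (w − 3)² whose gradient is 2·(w − 3); the learning rate and noise level are illustrative:

    # Stochastic gradient descent on a toy one-dimensional loss
    set.seed(1)
    eta <- 0.1                      # learning rate
    w <- 0                          # initial weight
    for (t in 1:50) {
      grad <- 2 * (w - 3)           # gradient of the loss at the current weight
      noise <- rnorm(1, sd = 0.01)  # stochastic term epsilon(t)
      w <- w - eta * grad + noise   # weight update
    }
    w                               # converges towards the minimiser w = 3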
² What is Backpropagation? (deepai.org)
2.3.7 Recurrent Neural Networks
Recurrent neural networks have been studied extensively since the 1990s. They are commonly used to learn sequential or time-varying patterns, which is exactly the kind of problem we are dealing with in this project [7].
There are two types of architectures of recurrent neural networks.
The first ones are named fully connected neural networks (Figure 2.4). Fully
connected networks do not have distinct input layers of nodes, and each node has
input from all the other nodes.
The second ones are named partially connected neural networks (Figure 2.5).
Although some nodes are part of a feedforward structure, other nodes provide the
sequential context and receive information from other nodes.
An Elman network contains three layers and a set of context units. The context
units are connected to the hidden layer with the fixed weight of one. At each time
step, the input is fed-forward, and a learning rule is applied. The context units save
a copy of the previous values of the hidden layer nodes with the goal of maintaining
a sort of a state, allowing this model to perform tasks like sequence-prediction
[30][31].
Let h_t be the hidden layer vector, y_t the output vector, x_t the input vector, W, U and b the parameter matrices and vector, and σ_h and σ_y the activation functions. The hidden layer and the output vector are calculated the following way:

h_t = σ_h(W_h·x_t + U_h·h_{t-1} + b_h)
y_t = σ_y(W_y·h_t + b_y)
Jordan networks are similar to Elman networks, with the small difference that the context units contain information about the output layer instead of the hidden layer. Considering the same variables as in the previous two formulas, we can obtain the hidden layer and the output vector of a Jordan network as follows:

h_t = σ_h(W_h·x_t + U_h·y_{t-1} + b_h)
y_t = σ_y(W_y·h_t + b_y)
It works by learning a number of binary patterns and then returning the one that
is the most similar to a given input.
It consists of only one layer of nodes, which are connected to each other but not to themselves. It is a feedback network, since the outputs are redirected to the inputs. Every node is, at the same time, an input and an output of the network; therefore the number of input and output nodes is the same. After a certain number of iterations, the values of the nodes tend to converge.
2.3.7.3 Bidirectional Associative Memory
The echo state network contains a sparsely connected random hidden layer.
The connectivity and weights of the hidden layer neurons are fixed and randomly
assigned. The weights of the output neurons can be learned so that the network can
produce specific temporal patterns. So, basically the only thing that changes in this
network is the weight of the synapses connecting the hidden layer nodes to the
output layer nodes.3
In this type of network, each neuron only receives information about its past
state and thus neurons are independent of each other’s states.
2.3.7.6 Conclusion
These mentioned models are just some of the ones that are used in time series
forecasting problems. Only a brief description of each of them was provided in order
to give a general idea of how each of them works.
Artificial neural networks bring some benefits that overcome some of the limitations of the algorithms mentioned in the previous chapter. It is apparent that a neural network derives its computing power from its massively parallel distributed structure and from its ability to learn and therefore generalize. Generalization consists of the neural network producing reasonable results for inputs that were not part of the training [11].
The following points are just some of the benefits and capabilities that neural
networks offer us:
• Nonlinearity. Artificial neurons can be linear or nonlinear. A neural network,
made up of an interconnection of nonlinear neurons, is itself nonlinear. This
Figure 2.7 An example of how neural networks can be used in the context of
Character Recognition problems. Adapted from “Application of Neural Networks to
Character Recognition” by Dong Xiao Ni (2007)
2.3.11 Conclusions
3 Thesis plan
3.1 Introduction and initial plan
For this project, I always had in mind working with data sets related to a big issue that is becoming more and more evident as time goes by: climate change.
I started researching and looking for data sets that seemed interesting to work with, and the initial plan was to use a multivariate data set whose goal was to predict whether a given day is going to be an ozone day (meaning the ozone level is above a certain predefined threshold) or a normal day. This data set included several
attributes including the temperatures at different times and in different atmospheric
pressures throughout the day, the wind speed at various times of the day, the sea
level pressure, the relative humidity, the total solar radiation for the day, among
others. The neural network would learn from these attributes and their respective
results (ozone day or not) and would be able to make predictions according to the
attributes.
I ended up deviating from the initial plan because I decided that I wanted to see how differently feedforward neural networks and recurrent neural networks deal with seasonal and non-seasonal data sets. I had in mind that feedforward neural networks would perform well on seasonal data sets, because there is no need for memory when the pattern is very clear.
Still with the climate change subject in mind, I found two interesting related
univariate data sets. The first one was the seasonal data set and it contained
information regarding the mean monthly global temperatures for a period of 65 years
over the twentieth century. The second one was the non-seasonal data set and it
contained information regarding the yearly number of sunspots over a period of 314
years; a factor that greatly affects the temperatures and one that we can’t control.
The Maunder Minimum was a period (1645-1715) in which sunspots became exceedingly rare, and this period roughly coincided with the middle part of the Little Ice Age, during which Europe and North America experienced colder than average temperatures. Volcanic activity is currently the best explanation for the cause of this little ice age period, but the Maunder Minimum phenomenon is also one of the factors that affected the temperature during this period.
Basically, I want to predict both data sets using feedforward neural networks and recurrent neural networks and then compare the performance of each of the algorithms. During the recurrent neural network experiments, I want to compare Elman and Jordan networks and check how much their small differences in structure affect the results.
I also had in mind performing another experiment, which involved predicting values for one data set while training the neural networks with the other data set. I ended up not doing this experiment because, as we know, sunspots are not the only factor that affects the temperature; there are many others, and therefore the results would be more accurate if we used additional data sets containing relevant information on those factors.
To start off with my experimentations, I decided to pick two related data sets; a
simpler one where the seasonality is very clear and there are not many deviations
from the regular values and a more complex one where there are clearly more
deviations from the pattern. Both data sets were obtained from datamarket.com and
the first one is called “mean-monthly-temperature-1907-1972” and it contains 792
observations of the mean monthly global temperatures in Fahrenheit starting on
January 1907 and ending on December 1972 (Figure 4.1). The second data set is
called “yearly-mean-total-sunspot-number” and it contains 315 observations of the
mean yearly number of sunspots, starting in 1700 and ending in 2014 (Figure 4.2).
I decided to choose these two data sets because there is a clear relation between
them as I explained in the previous chapter. The number of sunspots is a huge factor
on the temperatures around the planet. In Figures 4.3 and 4.4 I present two samples
of both temperatures and sunspots data sets, respectively.
It is important to mention this process since it can greatly affect the results. If we
don’t perform a correct pre-processing of the data, it will negatively affect the results.
This process is a data mining technique that involves transforming raw data into an
understandable format that can be used for several purposes, including analysis.
Real world data is often incomplete, inconsistent, and/or lacking in certain behaviors
or trends; therefore, it is likely to contain many errors. Some of the steps of data pre-processing include checking for missing values, normalizing or standardizing values, checking the consistency of the data types, and splitting the data into training and test data sets, among others.
Now I will describe the pre-processing method I followed before applying the
neural network models.
First, I extracted the data sets as csv files from the data source. In R, I set the working directory to the directory where I had downloaded the csv files and loaded their content into the program.
The pre-processing plan presented in the next two sub-sections was followed in
every experimentation.
In Figures 4.5 and 4.6 the time-series objects for both data sets I used for my
experimentations are shown. Equation 4.1 shows the formula followed to obtain the
normalized values.
Figure 4.5 Time series objects of the temperatures data set used to perform the
experimentations
Figure 4.6 Time series objects of the sunspots data set used to perform the
experimentations
X_n = (X − X_min) / (X_max − X_min)    (4.1)
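A minimal R sketch of this normalisation applied to a time series object; the function name and the toy values are illustrative:

    # Min-max normalisation of Equation 4.1: rescales all values into [0, 1]
    normalise <- function(x) (x - min(x)) / (max(x) - min(x))
    x <- ts(c(35.2, 40.1, 55.0, 61.3, 48.7), start = c(1907, 1), frequency = 12)
    normalise(x)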
The results I'm about to present are in line with what I expected in the beginning, with the exception of one or two points that I will mention later on.
Before I present the results and conclusions I obtained, it is important to mention an important fact about the experiments in this project: neural networks have hidden layers, and these layers will be the main concern during our experiments. There are no rules for choosing the number of hidden layers and hidden nodes, and the only way to figure them out is by experimenting with several values and analyzing whether the results improve or not. By default, the nnetar function has a predefined number of hidden layers and hidden nodes: there is one hidden layer, and the number of hidden nodes depends on the problem at hand. This predefined number of hidden nodes will be my starting point, and I will proceed by testing smaller and larger numbers of nodes in order to figure out which gives the best results. For the Elman and Jordan networks the process will be similar, but there I can also choose the number of hidden layers of the network.
4.3.1 Results of the temperatures data set using a feedforward
neural network
We trained our data set with a feedforward neural network. For this, we used the R function nnetar, which is used for forecasting univariate time series; the architecture of this network is composed of a single hidden layer.
As aforementioned, the data set was divided into the training and test data sets.
The training data set contained data from January 1907 to October 1959 while the
test data set contained the rest from November 1959 to December 1972.
I called the nnetar function to train the network with the training set and create
a model; then I forecasted the respective values for the time period corresponding
to the test set using the predict function and compared the obtained results to the
test set, graphically and numerically.
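A minimal sketch of this procedure in R, assuming the forecast package and a ts object named temps holding the monthly temperatures; the names are illustrative, and forecast() is used here in place of the predict call mentioned above:

    library(forecast)
    train <- window(temps, end = c(1959, 10))     # January 1907 - October 1959
    test  <- window(temps, start = c(1959, 11))   # November 1959 - December 1972
    fit <- nnetar(train)                          # feedforward network with one hidden layer
    fc  <- forecast(fit, h = length(test))        # predictions for the test period
    accuracy(fc, test)                            # RMSE and other error measures
    plot(fc); lines(test, col = "red")            # graphical comparison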
I tested the model with several numbers of hidden nodes. To compare the results, I used the accuracy function, which returns several measures of the forecast accuracy. I spoke with my professor and he recommended that I use the RMSE (root mean squared error), which is one of the most frequently used measures in time series forecasting problems. The RMSE value is obtained the following way:
RMSE = √( (1/N) · Σ_{i=1}^{N} (f_i − o_i)² )    (4.2)

where f_i are the forecasts, o_i are the observed values (known results) and N is the sample size.
Figure 4.7 Comparison between the temperatures test set and the prediction
of the feedforward neural network
For the number of sunspots data set the approach was very similar.
The data set was divided into two parts. The training data set contained data points from 1700 to 1951, while the test data set contained data from 1952 to 2014.
As mentioned, the procedure was similar; we called the nnetar function and it
produced a model with its default values. The predict function was used to obtain the
predictions of the number of sunspots for the years 1952-2014. After, we compared
the results to our test data set.
As in the previous experiment, we tested several numbers of hidden nodes in order to figure out which is the best for this model. The default number of hidden nodes for this model was 5, which provided an RMSE value of 51.668736. I proceeded by testing lower numbers of hidden nodes, and the best result was an RMSE of 41.01420 with 2 hidden nodes. Among the higher numbers of hidden nodes, the best result was an RMSE of 45.722987 with 7 hidden nodes. The graphical representations of the models with 2 hidden nodes and 7 hidden nodes are shown in Figures 4.9 and 4.10, respectively. Again, in blue we have the predictions from our model and in red the actual values.
As is noticeable, for the model with 2 hidden nodes the results are not even remotely close to the desired ones; we can observe that the prediction almost converges to a straight line. For the second model, with 7 hidden nodes, even though the results are not the best, it is visually clear that they are closer to our goal.
I decided to take into account two more measures that are clearly relevant to
the context of this problem. I used the MPE (Mean Percentage Error) and the MAPE
(Mean Absolute Percentage Error).
The MPE is the average of percentage errors by which forecasts of a model
differ from actual values of the quantity being forecast. The value of MPE is obtained
the following way:
MPE = (100% / n) · Σ_{t=1}^{n} (a_t − f_t) / a_t    (4.3)

where a_t is the actual value of the quantity being forecast, f_t is the forecast, and n is the number of times for which the variable is forecast.
The MAPE measures the accuracy as a percentage and is calculated as the average of the absolute percentage errors, i.e. the absolute difference between the actual values A_t and the forecast values F_t divided by the actual values. Considering A_t to be the actual values, F_t the forecast values and n the sample size, we can obtain the MAPE the following way:

MAPE = (100% / n) · Σ_{t=1}^{n} |A_t − F_t| / A_t    (4.4)
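For reference, a minimal R sketch of these three error measures as plain functions, where f is the vector of forecasts and a the vector of actual values; in the experiments themselves the accuracy function from the forecast package reports them directly:

    # Error measures used throughout this chapter
    rmse <- function(f, a) sqrt(mean((f - a)^2))
    mpe  <- function(f, a) 100 * mean((a - f) / a)
    mape <- function(f, a) 100 * mean(abs((a - f) / a))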
For the 2 hidden nodes model we have an MPE of -136.8492 while MAPE is
166.6492. For the 7 hidden nodes model we have an MPE of -54.22009 and an
MAPE of 94.26946.
For this specific case, we can conclude that the second model is overall better
than the first one.
The results were not surprising. The performance was poor, as we can see from the accuracy values we obtained. This is because our data set is non-seasonal, which requires the network to have memory in order to perform well.
Figure 4.9 Comparison between the sunspots test set and the prediction of the
feed-forward neural network with 2 hidden nodes
Figure 4.10 Comparison between the sunspots test set and the prediction of
the feed-forward neural network with 7 hidden nodes
4.3.3 Results of the temperatures data set using Elman and Jordan
neural networks
I will now experiment with the same data sets using recurrent neural networks, more specifically Elman and Jordan neural networks. These networks were designed to implement memory in order to perform better on other types of problems, such as time-varying sequences. For the first problem, I already obtained a fairly decent result with the feedforward neural network, mainly because the data is seasonal; for the second problem, where the data is non-seasonal, I want to improve the results with the following experiments.
To achieve better results, I trained my model with lagged versions of the original time series. In this case I used 12 lagged versions, corresponding to the 12 months of the year. This method is commonly used in time series because of a phenomenon called autocorrelation, which is the tendency of the values in a time series to be correlated with previous values of the same series. This is important for identifying patterns, which helps determine the seasonality of the series.
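A minimal R sketch of building these lagged inputs, assuming y is the normalised series stored as a numeric vector; the names inputs, outputs and train are illustrative and mirror the ones used below:

    # Build 12 lagged copies of the series as inputs; the target is the current value
    lags <- 12
    n <- length(y)
    inputs  <- sapply(1:lags, function(k) y[(lags + 1 - k):(n - k)])  # columns y_{t-1} ... y_{t-12}
    outputs <- y[(lags + 1):n]                                        # target values y_t
    train <- 1:round(0.8 * length(outputs))                           # first 80% used for training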
I started experimenting with a single hidden layer. I called the elman function with different numbers of hidden nodes and then the predict function; in the end, the respective RMSE values were calculated. The best RMSE value I obtained was 0.04949789, for 6 hidden nodes. After this, I started experimenting with 2 hidden layers. For 2 hidden layers, I decided to automate the process with a loop, since it would consume a lot of time otherwise. By the end of the loop, my program would return the best RMSE value and the corresponding number of hidden nodes in each hidden layer. The best result was obtained with 3 hidden nodes in both layers; the corresponding RMSE value was 0.04879495. It is important to mention the parameters used to call the elman and jordan functions: “inputs[train]” contains the values corresponding to the training component (80%) of all the lagged versions of the time series, “outputs[train]” contains the values corresponding to the training component (80%) of the original time series, “size=c(i,j)” corresponds to the number of hidden nodes in the corresponding hidden layers, “learnFuncParams=c(0.1)” is the parameter used for the learning function and “maxit=5000” corresponds to the maximum number of iterations used to train the model.
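A minimal sketch of this experiment with the RSNNS package, using the lagged inputs and outputs built above; the grid bounds 1 to 6 are illustrative, and the row indexing inputs[train, ] is written for a matrix of lagged inputs:

    library(RSNNS)
    best <- list(rmse = Inf, size = NULL)
    for (i in 1:6) {                    # hidden nodes in the first hidden layer
      for (j in 1:6) {                  # hidden nodes in the second hidden layer
        fit  <- elman(inputs[train, ], outputs[train],
                      size = c(i, j), learnFuncParams = c(0.1), maxit = 5000)
        pred <- predict(fit, inputs[-train, ])
        err  <- sqrt(mean((pred - outputs[-train])^2))   # RMSE on the held-out part
        if (err < best$rmse) best <- list(rmse = err, size = c(i, j))
      }
    }
    best
    # The Jordan experiments are analogous: replace elman() with jordan() and use a
    # single hidden layer, e.g. size = 4.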
The following Figures 4.11, 4.12 and 4.13 show the comparison between the prediction and the test set, the iterative error plot, and the instruction used to train the neural network, respectively. The first graph shows the normalized values of the temperatures, with the years of the prediction on the horizontal axis. The iterative error plot shows how the network error evolved along the training iterations.
The results here were as expected: the predictions of both recurrent neural networks were fine, with good accuracy values. The unexpected part is that our feedforward model outperformed the recurrent models.
Figure 4.11 Comparison between the temperature test set and the prediction of
the Elman neural network
Figure 4.12 Iterative error plot for the Elman network on the temperatures data
set
Now it is time to train our data with the Jordan network. The setup was the same as for the Elman network; the only difference is that I called the jordan function instead. The explanation of the parameters is analogous to the Elman network, and the parameters used in the function call were also the same. The best result achieved for this network was an RMSE value of 0.0495885 for a single hidden layer with 4 nodes. As it is not possible to have more than one hidden layer in Jordan networks, we compare the best RMSE values of the two networks with one hidden layer. In this case the Elman network performed slightly better, though the difference is minor. The respective prediction is shown in Figure 4.14 and the instruction in Figure 4.15.
Figure 4.14 Comparison between the temperature test set and the prediction of
the Jordan neural network
4.3.4 Results of the sunspots data set using Elman and Jordan
neural networks
The procedure for this experiment was the same as presented in the previous section 4.3.3.
For this non-seasonal data set, the best results I obtained for the Elman network
were an RMSE value of 0.06965678 with two hidden layers containing 5 and 2
hidden nodes. For a single hidden layer, the best RMSE value was 0.08226485.
For the Jordan network the best value for the RMSE was 0.08479622
corresponding to a single layer with 7 hidden nodes.
In this case, the Elman network outperformed the Jordan network again.
The following Figure 4.16 and 4.17 show the comparisons for the Elman and
Jordan networks, respectively.
The results obtained were as expected. The recurrent neural networks performed well on this non-seasonal data set, and the accuracy values for the predictions are very good. The existence of memory in our models is crucial for these positive results on non-seasonal time series forecasting problems.
Figure 4.16 Comparison between the sunspots test set and the prediction of the
Elman neural network
Figure 4.17 Comparison between the sunspots test set and the prediction of the
Jordan neural network
From these experimentations, I can conclude that for the seasonal dataset the
feedforward neural network performed sufficiently well. On the other hand, for the
non-seasonal dataset the results were not even close to satisfactory. Using the
recurrent neural networks, either Jordan network or Elman network, the results were
positive for both the seasonal and non-seasonal dataset.
The main conclusion we can take is that recurrent neural networks are a better
approach when we are dealing with time series.
5 Conclusion
5.1 Results
To sum up the obtained results, I present them in the tables of Figures 5.1 and 5.2. The first table contains the RMSE values corresponding to the networks with one hidden layer.
Figure 5.1 Comparison between all the networks with one hidden layer
Figure 5.2 Results from the Elman network with two hidden layers
For the temperatures data set, we can conclude that the feedforward neural network was the one that performed best. This was a surprising result for me; I was expecting it to perform well, but not to beat the performance of both recurrent neural networks. As mentioned before, the good result is explained by the seasonality of the first data set. Between the Jordan and Elman networks the results are very similar, even though the Elman network value is slightly better.
Regarding the sunspots data set and the feedforward neural network, the expected happened: the network performance was extremely poor on this non-seasonal data set, which can be explained by the fact that this type of network doesn't have memory the way recurrent neural networks do. Between the Jordan and Elman networks, the value obtained from the Elman network is again slightly better.
In the second table, where we have the values for two hidden layers, we
obtained better values. This happened because the network is deeper; an extra layer
will allow the network to boost its performance when dealing with more complex,
non-linear problems.
So, the main conclusions we can infer from these results are:
• Feedforward neural networks are worth considering when we are dealing with seasonal time series data sets; they might even outperform recurrent neural networks or other techniques.
• Elman neural networks slightly outperformed Jordan neural networks but the
difference is not significant, so it is worth taking both into account when
dealing with these problems.
• The deeper the neural network is, the better it will perform on complex problems. It is important to note that increasing the number of layers will also increase the cost and the time consumed, and sometimes this tradeoff is not worth it because the results might not be that much better or, in some cases, they might even get worse.
5.2 Improvements
Even though we chose to work with neural networks, other algorithms (some of them described in Chapter 2) would be able to solve this type of problem, perhaps even with better results. Regarding the neural networks, we could try different activation functions.
We used training and test data sets containing 80% and 20% of the data, respectively. We could improve the results by trying different sizes for these partitions. We could also use different parts of the data for these partitions; for example, the training data set could cover the last years of the data and the test data set the beginning, or the test data set could even be split so that part of the prediction is made at the beginning and part at the end.
5.3 Difficulties
Throughout this work, I learned a lot about neural networks, which caught my attention and had been intriguing me since the SAD course. I learned about different types of neural networks as well as other algorithms that can be used to accomplish the same goals. While carrying out the experiments, I gained a deeper view of how some of these machine learning techniques work.
Overall, it was an enriching experience that allowed me to learn more and
improve in one of the areas that has been captivating me.
6 Bibliography
1. Shumway, R.H. and Stoffer, D.S. (2000). Time Series Analysis and Its
Applications. Springer International Publishing.
2. Cowpertwait, P.S.P and Metcalfe A.V. (2009). Introductory Time Series with
R. Springer-Verlag New York.
3. Montgomery, D.C. , Jennings, C.L. and Kulahci, M. (2008). Introduction to
Time Series Analysis and Forecasting. Wiley
4. McCulloch, W.S. and Pitts, W.H. (1943). A Logical Calculus of the Ideas
Immanent in Nervous Activity.
5. Hagan, M.T. , Demuth, H.B. , Beale, M.H. and De Jesús, O. (1996). Neural
Network Design. Martin Hagan.
6. Haykin, S.S. (1993). Neural Networks and Learning Machines. Pearson.
7. Medsker, L.R. and Jain, L.C. (1999, Dec 20). Recurrent Neural Networks:
Design and Applications. CRC Press.
8. Skapura, D.M. (1996). Building neural networks. Addison-Wesley
Professional.
9. Casdagli, M. and Eubank, S. (1992, Jun 20). Nonlinear Modeling and
Forecasting. CRC Press.
10. Gerven, M. V. and Bohte, S. (2017, Dec 19). Artificial Neural Networks as
Models of Neural Information Processing. Frontiers in Computational
Neuroscience.
11. Shmueli, G. and Lichtendahl Jr., K.C. (2015, Jul 17). Practical Time Series
Forecasting with R: A Hands-On Guide. Axelrod Schnall Publishers.
12. Russell, S.J. and Norvig, P. (2010). Artificial Intelligence: A Modern
Approach. Prentice Hall.
13. Kleene, S.C. (1956). Representation of Events in Nerve Nets and Finite
Automata. Annals of Mathematics Studies.
14. Hebb, D.O. (1949). The Organization of Behavior. New York: Wiley.
15. Werbos, P.J. (1975). Beyond Regression: New Tools for Prediction and
Analysis in the Behavioral Sciences.
16. Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview.
17. Ivakhnenko, A.G. and Grigor’evich Lapa, V. (1967). Cybernetics and
forecasting techniques. American Elsevier Pub. Co.
18. Behnke, S. (2003). Hierarchical Neural Networks for Image Interpretation.
Lecture Notes in Computer Science. Springer.
19. Graves, A. and Schmidhuber, J. (2009). Offline Handwriting Recognition
with Multidimensional Recurrent Neural Networks.
20. Ciresan, D.; Giusti, A.; Gambardella, L.M. and Schmidhuber, J. (2012).
Deep Neural Networks Segment Neuronal Membranes in Electron
Microscopy Images. Curran Associates, Inc.
21. Graupe, D. (2013). Principles of Artificial Neural Networks. World Scientific.
22. Crick, F. (1989). The recent excitement about neural networks. Nature.
23. Yin-Wen, C.; Cho-Jui, H.; Kai-Wei, C.;Ringgaard, M.; Chih-Jen, L. (2010).
Training and testing low-degree polynomial data mapping via linear SVM.
Journal of Machine Learning Research.
24. Fan, J. (1996). Local Polynomial Modelling and Its Applications: From linear
regression to nonlinear regression. Chapman & Hall/CRC.
25. Yule, G.U. (1927). On a Method of Investigating Periodicities in Disturbed
Series with Special Reference to Wolfer’s Sunspot Numbers. Philosophical
Transactions of the Royal Society of London.
26. Theodoridis, S. (2015). Machine Learning: A Bayesian and Optimization
Perspective. Academic Press.
27. Box, G.; Jenkins, G.M.; Reinsel, G.C. (1994). Time Series Analysis:
Forecasting and Control. Prentice-Hall.
28. Notation for ARIMA Models. Time Series Forecasting System. SAS Institute.
29. Geman, S.; Bienenstock, E. and Doursat, R. (1992). Neural networks and
the bias/variance dilemma. Neural Computation.
30. Sak, H.; Senior, A. and Beaufays, F. (2014). Long Short-Term Memory
recurrent neural network architectures for large scale acoustic modeling.
31. Milos, M. (2012). Comparative analysis of Recurrent and Finite Impulse
Response Neural Networks in Time Series Prediction. Indian Journal of
Computer and Engineering.
32. Kosko, B. (1988). Bidirectional Associative Memories. IEEE Transactions on
Systems, Man, and Cybernetics.
33. Hinkelmann, K. Neural Networks. University of Applied Science
Northwestern Switzerland.
34. Nielsen, M.A. (2015). Chapter 6. Neural Networks and Deep Learning.
7 Appendix A
Here we can find the pre-processing code for both data sets used in our experiments (first the temperatures data set and then the sunspots data set).
Temperatures data set: