Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 140 (2018) 383–392
www.elsevier.com/locate/procedia

Complex Adaptive Systems Conference with Theme: Cyber Physical Systems and Deep Learning, CAS 2018, 5 November – 7 November 2018, Chicago, Illinois, USA
Predicting the Future with Artificial Neural Network

Anifat Olawoyin*, Yangjuin Chen

University of Winnipeg, Winnipeg R3B2E9, CANADA
Abstract

Accurate prediction of future values of time series data is crucial for strategic decision making such as inventory management, budget planning, customer relationship management, marketing promotion, and efficient allocation of resources. However, time series prediction can be very challenging, especially when there are elements of uncertainty such as natural disasters, changes in government policies, and weather conditions. In this research, four different multilayer perceptron (MLP) artificial neural networks are discussed and compared with the Autoregressive Integrated Moving Average (ARIMA) model for this task. The models are evaluated using two statistical performance measures, Root Mean Squared Error (RMSE) and coefficient of determination (R2). The experimental results show that a 4-layer MLP architecture using the tanh activation function in each hidden layer and a linear function in the output layer has the lowest prediction error and the highest coefficient of determination among the configured multilayer perceptron neural networks. In addition, a comparative analysis of the performance results reveals that the multilayer perceptron neural network has a lower prediction error than the ARIMA model.
© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the Complex Adaptive Systems Conference with Theme: Engineering Cyber Physical Systems.
Keywords: Artificial Neural Network, ARIMA, Multilayer Perceptron, Time Series, Data Preprocessing
1. Introduction

An Artificial Neural Network (ANN) is a computational model that mimics a biological nervous system. The ANN can detect patterns and trends that are too complex for humans or other statistical models to analyse, such as non-linearity in time series data. Real world applications of the ANN include pattern classification such as handwriting recognition, time series prediction, image compression, credit scoring for loan approval, and machine control, just to name a few.
This research designs a Multilayer Perceptron neural network for time series prediction and compares this
with one of the traditional statistical time series prediction techniques known as the Autoregressive Integrated
Moving Average, ARIMA. The study varies the number of hidden layers and investigates the best activation function
for a set of data. In addition, this study explores the significance of pre-processing in time series prediction through
data transformation by which a dataset having 5 attributes and 1,098,044 instances is converted to another dataset
having 2 attributes and 366 instances by using aggregation, equal frequency binning and feature selection
techniques.
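As an illustration, this transformation pipeline can be sketched in Pandas as follows. The column name issue_date and the CSV file name are illustrative assumptions, since the exact attribute names appear only in Table 1 and Table 2; this is a minimal sketch rather than the authors' exact preprocessing script.

# Sketch of the preprocessing described above, assuming a date column
# named 'issue_date' (the real attribute name may differ).
import pandas as pd

raw = pd.read_csv('Parking_Contravention_Citations.csv',
                  parse_dates=['issue_date'])
raw = raw[(raw['issue_date'] >= '2010-01-01') &
          (raw['issue_date'] <= '2016-12-31')]

# Aggregate individual ticket transactions to daily counts.
daily = raw.groupby(raw['issue_date'].dt.date).size()
daily.index = pd.to_datetime(daily.index)

# Bin the daily counts at an equal weekly frequency and take the mean,
# yielding a 2-attribute series of roughly 366 weekly instances.
weekly_mean = daily.resample('W').mean()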
The rest of this paper is organized as follows: Section 2 gives the background information and related works,
Section 3 presents the main theoretical framework, Section 4 describes the implementation details, Section 5 is
devoted to the experimental result and discussion. Finally, a short conclusion is set forth in Section 6.
2. Related Work
Artificial neural networks (ANNs) have been applied to time series forecasting problems by many researchers. The
study in [1] employed the Elman recurrent neural network (ERNN) with stochastic time effective functions for
predicting price indices of stock markets. The ERNN can keep memory of recent events in predicting the future. The
study in [2] used the Multilayer Feed Forward Neural Network (MLFFNN) and the Nonlinear Autoregressive
models with the Exogenous Input (NARX) Neural Network to forecast exchange rates in a multivariate framework.
Experimental findings indicated that the MLFFNN and NARX were more efficient when compared with the
Generalized Autoregressive Conditional Heteroskedastic (GARCH) and Exponential Generalized Autoregressive
Conditional Heteroskedastic (EGARCH).
Another advanced statistical technique for predicting future time series is the Autoregressive Integrated Moving
Average (ARIMA) model, which assumes that the time series data are stationary, that is, that their statistical
properties do not depend on time. Thus, using the ARIMA for time series prediction requires checking for
stationarity; a common approach is the augmented Dickey-Fuller (ADF) test for the presence of a unit root in a
sample. Specifically, if the p-value is greater than 0.05, the null hypothesis of a unit root cannot be rejected and the
series is treated as non-stationary.
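The ADF test is available in statsmodels; the following is a minimal sketch, assuming a Pandas series weekly_mean holding the weekly ticket means from the preprocessing stage.

# Stationarity check with the augmented Dickey-Fuller test; a sketch
# assuming `weekly_mean` is a pandas Series of the time series values.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(weekly_mean)[:2]
if p_value > 0.05:
    # A unit root cannot be rejected: difference (or otherwise
    # transform) the series before fitting an ARIMA model.
    weekly_mean = weekly_mean.diff().dropna()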
In addition, hybrid techniques combining ARIMA and ANN have been shown to be successful [3, 4, 5, 6].
However, [3] assumes that the linear and non-linear patterns can be modelled separately, that their relationship is
additive, and that the residual from the linear model contains only the non-linear pattern; these assumptions may
lead to performance degeneration, for instance, if the relationship is multiplicative. The empirical evidence from the
study in [7] showed that such integrated approaches may not necessarily outperform the individual forecasting
techniques. Although the authors in [4] proposed a hybrid model to overcome the limitations of the traditional
hybrid models and guarantee that the model will not be worse than the individual ARIMA and artificial neural
network models, this assurance cannot hold in all cases. Hence, in this study we focus on comparing the individual
models using the parking tickets dataset.
3. Theoretical Framework

A perceptron, the simplest neural network, consists of a single neuron with a Linear Threshold
Unit (LTU) activation function. The activation function commonly used in most artificial network configurations is
the sigmoid function because of its ability to combine linear, curvilinear and constant behaviors, as well as being
smoothly differentiable.
The single perceptron output is defined by
$$y = t + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \qquad (1)$$
where $t$ is the threshold and $w_1, w_2, \ldots, w_n$ are the weights associated with the input attributes $x_1, x_2, \ldots, x_n$.
The major drawbacks of a simple neural network include:
- Single neurons cannot solve complex tasks;
- It is restricted to linear calculations;
- Nonlinear features need to be generated by hand, an expensive operation.
The focus of this paper is the Multilayer Perceptron (MLP). A multilayer perceptron is a feedforward neural
network consisting of a set of inputs, one or more hidden layers and an output layer. The layers in an MLP are fully
connected, such that neurons between adjacent layers are fully pairwise connected while neurons within a layer
share no connection.
The input layer represents the raw data $(x_1, \ldots, x_n)$ fed into the network. The raw data and the weights are fed
into the hidden layer. The input to the hidden layer is thus given as
$$H_{in} = b + \sum_{i=1}^{n} w_i x_i \qquad (2)$$
where $b$ is a bias term.
The hidden layer is the processing unit where the learning occurs. The hidden layer transforms the values
received from the input layer using an activation function. A commonly used activation function is the sigmoid
function given as
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$
Other activation functions are:
i. tanh(x), a non-linear activation function that is a scaled sigmoid function, given as:
$$\tanh(x) = 2\sigma(2x) - 1 \qquad (4)$$
ii. the Rectified Linear Unit (ReLU), an activation function with a threshold of zero, given as:
$$f(x) = \max(0, x) \qquad (5)$$
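For illustration, equations (3)-(5) translate directly into NumPy; the short sketch below also verifies the scaled-sigmoid identity of equation (4).

# Equations (3)-(5) as NumPy functions (a sketch for illustration).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # equation (3)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # equation (4), a scaled sigmoid

def relu(x):
    return np.maximum(0.0, x)            # equation (5)

x = np.linspace(-3, 3, 7)
assert np.allclose(tanh(x), np.tanh(x))  # identity with the usual tanh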
The output of the hidden layer is given as:
$$H = A(H_{in}) = A\!\left(b + \sum_{i=1}^{n} w_i x_i\right) \qquad (6)$$
where $A$ is the activation function. Assuming the sigmoid, this gives
$$H = \frac{1}{1 + e^{-\left(b + \sum_{i=1}^{n} w_i x_i\right)}} \qquad (7)$$
The output layer receives the outputs and the associated weights of the hidden layer neurons as inputs. The
output $y$ of the output layer, assuming a sigmoid function, is given as
$$y = \sigma\!\left(\sum_{j=1}^{h} H_j w_j\right) \qquad (8)$$
where $H_j$ and $w_j$ are the output and weight of the individual neurons of the hidden layer.
The activation function of the output layer is commonly a linear function, and depending on the task, a tanh or a
sigmoid function may be applicable.
A multilayer perceptron architecture having 2 hidden layers, denoted as a 2-layer multilayer perceptron neural
network, is shown in Figure 2.
The main issue with a multilayer perceptron neural network is the weight adjustment in the hidden layers, which is
necessary to reduce the error at the output layer. The weight adjustment in the hidden layers is achieved using the
backpropagation algorithm. Backpropagation takes the sequence of training samples (time series data for this
study)
$$(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)$$
as the input and produces a sequence of weights $(w_0, w_1, w_2, \ldots, w_n)$ starting from some initial weight $w_0$,
usually chosen at random [4]. Generally, the backpropagation rule is given as:
$$w_{i+1} = w_i - \eta \frac{\partial E(w)}{\partial w} \qquad (9)$$
where $w$ represents the weights and $E(w)$ is the cost function that measures how far the current network's output
is from the desired one. $\partial E(w)/\partial w$ is the partial derivative of the cost function $E$ that specifies the direction of
the weight adjustment to reduce the error, and $\eta$ is the learning rate, which controls the step size of each iteration
of the weight update equation.
The weight change for the hidden layer is given as:
$$\Delta w_h = \eta\, \delta_h\, x_i \qquad (10)$$
where $\delta_h = H(1 - H) \sum_o \delta_o w_o$.
The weight change for the output layer is given as:
$$\Delta w_o = \eta\, \delta_o\, H \qquad (11)$$
where $\delta_o = y_o (1 - y_o)(T - y_o)$, $T$ is the target output and $y_o$ is the actual output.
The network is trained by adjusting the network weights as defined in equations (9)-(11) above to minimize the
output errors on a set of training data.
The training of a multilayer perceptron can be summarized as follows:
- A dataset D with inputs $(x_1, \ldots, x_n)$ and P patterns is given for the network to learn.
- The network with n input units is fully connected to h nonlinear hidden layers via connection weights $w_{ih}$
associated with each input unit.
- The hidden layer is fully connected to T output units via connection weights $w_{ho}$ associated with each neuron
in the hidden layer.
- The training is initiated with random initial weights for each neuron in the network.
- An appropriate error function $E(w)$ to be minimized by the network, for instance the Mean Square Error (MSE),
is predetermined.
- The learning rate $\eta$ is also predetermined.
- The weight associated with each neuron in the hidden layer and the output layer is updated using the equation
$\Delta w = -\eta \frac{\partial E(w)}{\partial w}$ until the error function is minimized.
A momentum $\alpha$ is an inertia term used to diminish fluctuations of weight changes over consecutive iterations.
Thus, the weight update equation becomes:
$$\Delta w_t = -\eta \frac{\partial E(w)}{\partial w} + \alpha\, \Delta w_{t-1} \qquad (12)$$
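A minimal NumPy sketch of the update rules in equations (9) and (12) follows; the gradient function passed in is a hypothetical placeholder for the backpropagated $\partial E(w)/\partial w$, and the toy quadratic cost is illustrative only.

# Sketch of the weight update with momentum, equations (9) and (12).
# `grad_E` stands in for the backpropagated gradient dE(w)/dw.
import numpy as np

def sgd_momentum_step(w, grad_E, velocity, eta=0.01, alpha=0.9):
    # delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)
    velocity = -eta * grad_E(w) + alpha * velocity
    return w + velocity, velocity

# Example on a toy quadratic cost E(w) = ||w||^2 with gradient 2w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, lambda w: 2.0 * w, v)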
Figure 1: Single Layer Perceptron Neural Network
Figure 2: Multi-Layer Perceptron Neural Network (MLP)
4. Implementation
4.1 Development Environment and Tools
All our experiments are performed on a 64-bit operating system on a laptop with a 2.4 GHz Intel(R) Core(TM) i5
processor and 8 GB of installed memory. The programming language is Python and the development environment
is Enthought Canopy. The machine learning tools used are Scikit-learn [9] and the Keras libraries [10], together
with Pandas, NumPy, statsmodels and Matplotlib.
4.2 Dataset
The dataset for this study is a set of parking contravention transactions updated monthly by the City of
Winnipeg under an open government data license, available at [11]. The dataset has five attributes and over a
million instances comprising parking tickets issued between January 1st, 2010 and March 31st, 2017. For this
paper, seven years of data (2010-2016) are used. The description and a preview of the dataset are presented in
Table 1 and Table 2, respectively.
Table 1: Dataset Description
Dataset Name                              Number of attributes    Number of Instances
Parking_Contravention_Citations.csv       5                       1.09M
Robert Nau, Lecture notes on forecasting, Fuqua School of Business, Duke University.
http://people.duke.edu/~rnau/Slides_on_ARIMA_models--Robert_Nau.pdf
4.3 Evaluation
The models are evaluated using the root mean square error (RMSE) and the coefficient of determination (R2).
The RMSE is the square root of the mean square error, a risk metric corresponding to the expected value of the
squared error loss function, defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2} \qquad (13)$$
The coefficient of determination is a measure of the goodness of fit of the model. It explains how well future
samples are likely to be predicted by the model [4]. The value of R2 can be negative or positive; a negative R2
indicates a model that performs arbitrarily worse than simply predicting the mean. It is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}, \quad \text{where } \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (14)$$
The summary statistics for the dataset presented in Table 4 show that the minimum weekly mean between the
years 2010 and 2016 is 178 tickets while the maximum is 1341 tickets. The graph of the dataset presented in
Figure 4 shows that there is a spike in ticket numbers around January-February each year, when the snow-related
violation tickets are issued.
Figure 3: Implementation Chart
Figure 4: Dataset Trends Graph
Figure 5: ACF and PACF Plot
Table 5: Augmented Dickey-Fuller (ADF) Test
Table 6: ARIMA (p, d, q) Results (RMSE and R2)

5. Experimental Results and Discussion
All the models are separately trained for up to 1000 epochs using the sigmoid activation function, and a
comparison is made using the tanh activation function. The relationship between the sigmoid and tanh activation
functions is stated in equation (4). The optimizer selected for the training is the Stochastic Gradient Descent (SGD)
optimizer with a default learning rate of 0.01. The dataset is standardized using the MinMaxScaler function in the
range (-1, 1). An attempt to use the sigmoid activation function in the output layer resulted in a negative R2
(-9.67); thus, a linear activation function is used for the output layer of all the architectures. The setup is presented
in Table 7. The loss function specified for all the models is the Mean Square Error (MSE); RMSE and R2 are
subsequently calculated for evaluation.
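For illustration, the best-performing configuration can be sketched in Keras as follows. This is a minimal sketch, not the authors' exact script: the hidden-layer sizes (4, 1 and 1 neurons) are inferred from the 4H411 naming, and the one-step-ahead supervised pairs built from weekly_mean are an assumption, since the exact setup appears only in Table 7.

# Sketch of a 4H411-style MLP: three tanh hidden layers (4, 1 and 1
# neurons, inferred from the name) and a linear output, trained with
# SGD at the default learning rate of 0.01 on MSE loss.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

scaler = MinMaxScaler(feature_range=(-1, 1))
series = scaler.fit_transform(weekly_mean.values.reshape(-1, 1))
X, y = series[:-1], series[1:]   # one-step-ahead supervised pairs

model = Sequential()
model.add(Dense(4, input_dim=1, activation='tanh'))
model.add(Dense(1, activation='tanh'))
model.add(Dense(1, activation='tanh'))
model.add(Dense(1, activation='linear'))  # linear output layer
model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01))
model.fit(X, y, epochs=1000, verbose=0)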
The results presented in Table 8 for the sigmoid activation function show that a 2-layer network with one neuron
in the hidden layer has the best goodness of fit, with a coefficient of determination R2 of 0.61 and an error of 0.103.
Adding more neurons to a layer does not improve performance, as seen in the results for 2H1 and 2H4. Similarly,
adding layers to the network does not improve the prediction capability of the network. The root mean square
error, RMSE, increases from 0.103 for the 2H1 network to 0.104 for 3H41, while the coefficient of determination,
R2, increases to 0.66 for the 3H41 network from 0.61 for the 2H1 network.
The results from Table 9 for the networks designed using the tanh activation function show performance
improvement when more layers are added to the network, up to 4H411 (depicted in Figure 7) where the best result
is recorded. Further addition of layers beyond 4H411 adds no value to the prediction capability and goodness of fit
of the network.
The comparative analysis of the results presented in Table 10 and Figure 6 shows that the 4H411 neural network
designed with the tanh activation function has the lowest error (RMSE = 0.099), with an average prediction error of
57 tickets per week. A 2-layer MLP with one neuron in the hidden layer also performs better than ARIMA (2,0,2),
with an average prediction error of 60 tickets per week.
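For comparison, the ARIMA(2,0,2) baseline can be fitted with statsmodels; the sketch below uses the arima_model API current at the time of this study and assumes the log-transformed weekly series from the preprocessing stage.

# Sketch of the ARIMA(2,0,2) baseline, assuming `weekly_mean` is the
# weekly mean series; the log transform follows the preprocessing
# described in this paper.
import numpy as np
from statsmodels.tsa.arima_model import ARIMA

log_weekly = np.log(weekly_mean)
fit = ARIMA(log_weekly, order=(2, 0, 2)).fit(disp=0)
forecast = np.exp(fit.forecast(steps=4)[0])   # next four weeks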
Table 7: Multilayer Perceptron Architecture
Table 8: Sigmoid Function Evaluation Results (RMSE and R2)
Figure 6: Comparison of evaluation results (RMSE and R2) for MLP tanh (4H411), MLP sigmoid (2H1) and ARIMA (2,0,2)
Figure 7: 4H411 MLP
6. Conclusion
The performance of the multilayer perceptron neural network and ARIMA models has been investigated in this
research. Observations from the performance evaluation of the models revealed that the four MLP architectures
designed using the tanh activation function outperform the ARIMA model. Specifically, the 4H411 model produces
the best goodness of fit (R2 = 0.77) and the lowest prediction error (RMSE = 0.099). The effect of adding more
layers on the performance of a multilayer perceptron neural network is also investigated. Using the sigmoid
activation function, a 2-layer MLP having one neuron in the hidden layer has the best performance in terms of both
the prediction error (RMSE = 0.103) and the coefficient of determination (R2 = 0.61) measures. The empirical
evidence from this study indicates that adding more layers to a network configured using the sigmoid function may
not necessarily improve the predictive power of the network and may result in performance degeneration.
Like the sigmoid activation function, the tanh activation function also has a saturation effect; however, unlike the
sigmoid, the output of the tanh activation function is zero-centered. Thus, adding layers to a network configured
using the tanh activation function can improve the performance of the network, as demonstrated in this study.
From the results in Table 9, it can be observed that adding more layers reduces the prediction error and improves
the goodness of fit of the network up to the 4-layer network (4H411).
In addition, pre-processing datasets is a necessity for models like the ARIMA and MLP investigated in this
study. The ARIMA model requires stationary time series data. This is achieved by first aggregating the ticket
transactions to daily counts, grouping the mean values at an equal weekly frequency, and then applying the
logarithm function to them. Standardization is a requirement for multilayer perceptron networks to remove the bias
that might be caused by wide variation in the range of values of the raw data during training. From the summary of
the pre-processing stage in Table 4, it can be observed that standardization is required, since the minimum average
number of tickets per week is 178 while the maximum is 1341. This study used the MinMaxScaler function of the
Scikit-learn library to transform the dataset to the range [-1, 1].
Our experiments suggest that choosing a good activation function can significantly improve the performance of a
multilayer perceptron neural network.
ACKNOWLEDGEMENTS
The first author would like to thank two anonymous referees for their helpful comments. Special thanks to
Dr. Sheela Ramanna and Dr. Sergio Camorlinga, University of Winnipeg, for their helpful comments at the initial
stage of this work.
REFERENCES
[1] Wang, J., Wang, J., Fang, W., & Niu, H. (2016). Financial time series prediction using Elman recurrent random neural networks. Computational Intelligence and Neuroscience, vol. 2016, Article ID 4742515, 14 pages.
[2] Chaudhuri, T. D., et al. (2016). Artificial neural network and time series modeling based approach to forecasting the exchange rate in a multivariate framework. Journal of Insurance and Financial Management, 1(5), 92-123.
[3] Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159-175.
[4] Khashei, M., & Bijari, M. (2011). A novel hybridization of artificial neural networks and ARIMA models for time series forecasting. Applied Soft Computing, 11(2), 2664-2675.
[5] Babu, C. N., & Reddy, B. E. (2014). A moving-average filter based hybrid ARIMA-ANN model for forecasting time series data. Applied Soft Computing, 23, 27-38.
[6] Khandelwal, I., Adhikari, R., & Verma, G. (2015). Time series forecasting using hybrid ARIMA and ANN models based on DWT decomposition. Procedia Computer Science, 48, 173-179.
[7] Taskaya-Temizel, T., & Casey, M. C. (2005). A comparative study of autoregressive neural network hybrids. Neural Networks, 18(5-6), 781-789.
[8] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
[9] Scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/index.html
[10] Keras Deep Learning Documentation. https://keras.io/
[11] City of Winnipeg parking contravention dataset. https://data.winnipeg.ca/Parking/Parking-Contravention-Citations-/bhrt-29rb/data