Solar Power Forecasting With Machine Learning Techniques: Emil Isaksson Mikael Karpe Conde
We would first like to thank our supervisor Pierre-Julien Trombe and Vattenfall AB.
Without Pierre-Julien’s coordination, enthusiasm, and patience, none of this would
have been possible. Secondly, we would like to show our gratitude to our supervisor
at KTH, Jimmy Olsson, for giving us guidance. Finally, we would both like to thank
our families and friends for the support during our five years at KTH.
Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
1.1 Background
1.2 Objective
1.3 Research Question
1.4 Limitations
1.5 Outline
2 Previous Research
2.1 General Research on Solar Power Forecasting
2.2 Time Series Techniques
2.3 Machine Learning Techniques
2.4 Conclusions and Insights
3 Mathematical Theory
3.1 Time Series
3.1.1 Stationary Processes
3.1.2 AR and MA Processes
3.1.3 ARMA and ARIMA Processes
3.1.4 SARIMA Processes
3.1.5 Main Assumptions of ARIMA
3.2 K-Nearest Neighbors
3.2.1 Feature Scaling
3.3 Tree Algorithms
3.3.1 Regression Trees
3.3.2 Boosting
3.3.3 Gradient Boosting Regression Trees
3.4 Lasso Regression
3.5 Artificial Neural Networks
3.5.1 The Network and its Functions
3.5.2 Fitting a Neural Network
3.5.3 Issues For Training a Neural Network
5 Results
5.1 ARIMA
5.2 Machine Learning Models
5.2.1 Hyperparameter Tuning
5.2.2 RMSE
5.2.3 Variable Importance
5.2.4 nRMSE
5.2.5 Skill Scores
5.2.6 Run-time
6 Discussion
6.1 Time Series Models
6.2 Machine Learning Models
6.3 Other Issues
6.3.1 Data
6.3.2 Error Metrics
6.3.3 Methodology
Bibliography
List of Figures
List of Tables
4.1 Sites
4.2 NWPs from Meteomatics
4.3 Lasso: hyperparameters and search grids
4.4 KNN: hyperparameters and search grids
4.5 GBT: hyperparameters and search grids
4.6 ANN: hyperparameters and search grid
List of Acronyms
AR Autoregressive
CV Cross-Validation
LR Linear Regression
MA Moving Average
ML Machine Learning
PV Photovoltaic
Chapter 1
Introduction
1.1 Background
The global shift towards renewable energy sources (RES) has driven the development
of photovoltaic (PV) panels. For example, the cost of producing electricity from
PV panels has dropped significantly, while their energy conversion efficiency has
increased. More specifically, the levelized cost of electricity of large-scale PV
panels decreased by 73% between 2010 and 2017 [1].
The decreased cost and increased efficiency have made PV panels a competitive
alternative as a RES in many countries [2].
However, since the energy output of PV panels depends on weather conditions such
as cloud cover and solar irradiance, it is unstable.
Understanding and managing this output variability is of interest to several actors
in the energy market. In the short term (0-5 hours), a transmission system operator
is interested in the energy output from PV panels to find an adequate balance for
the whole grid, since over- and under-producing electricity often results in penalty
fees. At the other end of the spectrum, electricity traders are interested in longer
time horizons, ordinarily day-ahead forecasts, since most electricity is traded on the
day-ahead market. Consequently, the profitability of these operations relies on the
ability to forecast the fluctuating solar PV panel energy output accurately.
It is likely, as more countries decide to invest more and more in RES, that the use
of solar PV panels will continue to increase. This will increase the need for suitable
means of forecasting solar PV energy output. While the demand for accurate and
efficient forecasts of PV panel energy output is evident, the solution is far from
trivial. Current research within the field is dealing with many complications.
One evident nuisance is the inherent variability of weather, which makes
accurate weather forecasting challenging.
Parallel to the increased demand for PV power forecasting solutions, forecasting
with machine learning (ML) techniques has in recent years gained in popularity
relative to traditional time series predictive models. Although ML techniques are
nothing new, improved computational capacity and the higher availability of quality
data have made the techniques useful for forecasting. This raises an interesting
research question for solar power forecasting: How do machine learning techniques
perform relative to traditional time series forecasting techniques?
CHAPTER 1. INTRODUCTION
1.2 Objective
The main objective is to benchmark different forecasting techniques of solar PV
panel energy output. Towards this end, machine learning and time series techniques
can be used to dynamically learn the relationship between different weather condi-
tions and the energy output of PV systems.
Four ML techniques are benchmarked against traditional time series methods on PV
system data from existing installations. This also requires an investigation of feature
engineering methodologies, which can be used to increase the overall prediction
accuracy.
1.4 Limitations
Some limitations were imposed to clarify the scope of the study.
Five established prediction models were chosen beforehand. The models that will
be implemented and compared are: Lasso, ARIMA, K-Nearest Neighbors (KNN),
Gradient Boosting Regression Trees (GBRT), and Artificial Neural Networks (ANN).
These models have been selected based on their tendency to perform well in previous
research of energy forecasting.
The focus is placed on benchmarking ML and time series techniques. Many
of the above models are generic, and most of them therefore have a wide range
of possible model set-ups. The aim is to give a general overview of the relative
performance of the methods rather than investigating a specific model in depth.
1.5 Outline
The remaining parts of this thesis are structured in the following way:
• Chapter 5 - Results
The empirical results of the study are presented.
• Chapter 6 - Discussion
Analysis and discussion of the results are presented. Strengths and weaknesses
of the study along with potential improvements are also outlined.
Chapter 2
Previous Research
CHAPTER 2. PREVIOUS RESEARCH
Other researchers, such as Andrade et al. [2], explored ML techniques and evalu-
ated them in combination with developing features that supposedly should improve
the performance. The main approaches used in the study were Principal Compo-
nent Analysis (PCA) and a feature engineering methodology in combination with a
Gradient Boosting Tree model. Furthermore, the authors used different smoothing
approaches to create features from their NWP data. Specifically, the authors used a
grid of NWP data around the location of the PV installation and computed spatial
averages and variances of weather parameters. Besides creating features based on a
local grid of points, the authors also computed variances for different predictors for
different lead times. When constructing variance features based on lead times, the
underlying idea was that the feature would indicate the variability of the weather.
The main conclusion is that both PCA and feature engineering improved the results.
According to the authors, there is a twofold knowledge gap for further
research. The first aspect concerns feature management (feature engineering and
feature selection), and more concretely how to create meaningful features
that improve the forecast. The second aspect concerns further exploration of
ML modeling techniques that can be implemented in combination with informative
features. Their final comment is that deep learning techniques will be an interesting
path to pursue in combination with proper feature management.
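The spatial smoothing features described above can be illustrated with a minimal Python sketch (the thesis itself worked in R). It computes the mean and variance of one weather parameter over a grid of NWP points around a site. The function name `spatial_features`, the 3x3 grid, and the cloud-cover values are hypothetical and not taken from Andrade et al.

```python
from statistics import mean, pvariance


def spatial_features(grid_values):
    """Return the spatial mean and variance of one weather parameter
    over a grid of NWP points surrounding the PV installation."""
    flat = [v for row in grid_values for v in row]
    return {"spatial_mean": mean(flat), "spatial_var": pvariance(flat)}


# Hypothetical 3x3 grid of cloud-cover forecasts (fractions) around a site.
cloud_cover = [
    [0.2, 0.3, 0.4],
    [0.1, 0.5, 0.6],
    [0.0, 0.2, 0.4],
]
features = spatial_features(cloud_cover)
```

A high spatial variance suggests a patchy cloud field near the site, which is the kind of signal such features are meant to expose to the model.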
Davò et al. [10] used PCA combined with the techniques ANN and Analog
Ensemble (AnEn) to predict solar irradiance. With the aim to reduce the dimen-
sionality of the dataset, PCA was used as a feature selection method. The dataset
consists of the aggregated daily energy output of the solar radiation, measured over
eight years. A comparison between using and not using PCA showed that using
PCA in combination with ANN and AnEn enhances the prediction accuracy.
Chen et al. present results on long-term (up to 100 hours) forecasting. The
authors employed an ANN as their forecasting method with NWP data as input.
The model was sensitive to prediction errors of the NWP input data and also showed
a deterioration when forecasting on rainy days in particular. During cloudy and
sunny days, the ANN model produced results with MAPEs of around 8 % [11].
Persson et al. used Gradient Boosted Regression Trees (GBRT) to forecast solar
energy generation 1-6 hours ahead. The data used was historical power output
as well as meteorological features for 42 PV installations in Japan. Concerning
RMSE, the GBRT model performed better than the adaptive recursive linear AR
time series model, persistence model and climatology model on all forecast horizons.
For shorter forecast horizons, it was shown that lagged power output values had a
larger predictive ability. Similarly, for longer forecast horizons, the weather forecasts
increased in importance. [12]
Shi et al. propose an algorithm for weather classification and SVM to forecast
PV power output on 15-minute intervals for the next-coming day. The weather is
classified into four classes: clear sky, cloudy day, foggy day, and rainy day. The
rationale behind the classification is correlation analysis on the local weather fore-
casts and the PV power output. The data is thereafter normalized to reduce error
and improve accuracy whilst still keeping correlation in the data. Four SVM models
with radial basis function kernel are thereafter fitted to the four different weather
classes. In conclusion, the result shows a way of using SVM models to train models
on specific climatic conditions. [13]
Chapter 3
Mathematical Theory
In this chapter, we will present the mathematical theory behind the employed tech-
niques. For more mathematical depth, we refer to the cited works in this section.
CHAPTER 3. MATHEMATICAL THEORY
\[ X_t = \sum_{i=0}^{p} \theta_i Z_{t-i}. \tag{3.3} \]
Here, $Z_i \sim WN(0, \sigma^2)$ and the $Z_i$ are independent for all $i$. Also, for $i = 0$ it holds
that $\theta_0 = 1$ [14].
\[ X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \tag{3.4} \]
where $\{Z_i\} \sim WN(0, \sigma^2)$ are independent for all $i$, and the polynomials
$(1 + \theta_1 B + \cdots + \theta_q B^q)$ and $(1 - \phi_1 B - \cdots - \phi_p B^p)$ have no common factors. For the above
\[ \phi_p(B)(1 - B)^d X_t = \theta_q(B) Z_t, \tag{3.6} \]
where $\{Z_t\} \sim WN(0, \sigma^2)$ and $\phi_p(B) \neq 0$ for $|B| \leq 1$. When $d = 0$, the ARIMA
model is an ARMA model. The differencing of $X_t$ is usually performed to attain
stationarity as it reduces the time dependence of the mean. [14]
$P$, $D$ and $Q$ have the same properties as $p$, $d$ and $q$, but with a seasonal step.
Thus $D = 1$ gives the seasonal difference for some seasonality $s$:
$(1 - B^s)^1 X_t = X_t - X_{t-s}$.
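As an illustration of the seasonal difference with $D = 1$, the following Python sketch (the thesis itself worked in R) applies $X_t - X_{t-s}$ to a toy series. The helper name `seasonal_difference` and the period-4 series are hypothetical; for the 15-minute data discussed later in the thesis, a daily season would correspond to $s = 96$.

```python
def seasonal_difference(x, s):
    """Apply (1 - B^s) to the series x: returns x[t] - x[t - s] for t >= s."""
    if s <= 0 or s >= len(x):
        raise ValueError("s must satisfy 0 < s < len(x)")
    return [x[t] - x[t - s] for t in range(s, len(x))]


# Toy series: a repeating pattern of period 4 plus a slow linear trend.
pattern = [0.0, 1.0, 3.0, 1.0]
series = [pattern[t % 4] + 0.1 * t for t in range(12)]
diffed = seasonal_difference(series, s=4)
```

The periodic pattern cancels in the differenced series, leaving only the constant contribution of the trend over one season.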
\[ X_i' = \frac{X_i - \min X}{\max X - \min X}, \quad i \in 1, \ldots, n, \tag{3.10} \]
which transforms all values into the range [0, 1] and ensures that all the predictors
have the same scale.
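A minimal Python sketch of the min-max scaling in (3.10) is given here for illustration (the thesis itself worked in R). The function name `min_max_scale` and the temperature values are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] as in (3.10)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


temperature = [-5.0, 0.0, 10.0, 25.0]  # hypothetical values in degrees Celsius
scaled = min_max_scale(temperature)
```

In practice the minimum and maximum would be taken from the training data and reused for new observations, so that training and prediction share the same scale.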
\[ \hat{Y} = F(X) = \sum_{m=1}^{M} c_m I(X \in R_m). \tag{3.11} \]
3.3.2 Boosting
Boosting is a stage-wise approach using so-called weak learners. A weak learner can
be any statistical model that predicts somewhat better than pure chance. In
statistical learning, the general goal is to find a predictive function $F^*(X)$ such that:
\[ F(X) = \sum_{m=0}^{M} \beta_m h(X; \bar{a}_m), \tag{3.15} \]
where $h(X; \bar{a}_m)$ is a weak learner with coefficients $\bar{a}_m = (a_1, \ldots)$ and $M$ is the total
number of weak learners. In the algorithm, one loops through $m = 1, 2, \ldots, M$ and
computes the following:
\[ (\beta_m, \bar{a}_m) = \arg\min_{\beta, \bar{a}} \sum_{i=1}^{N} L(Y_i, F_{m-1}(X_i) + \beta h(X_i; \bar{a})), \tag{3.16} \]
\[ F_m(X) = F_{m-1}(X) + \beta_m h(X; \bar{a}_m). \tag{3.17} \]
Above, $i$ denotes the $i$-th data point vector and the final prediction is given by
$\hat{Y} = F(X)$. From this equation, one can observe that the algorithm iteratively
improves the current model $F_{m-1}(X)$ by fitting a new weak learner against its residuals.
A clear example is when applying the squared-error loss function:
\[ \bar{a}_m = \arg\min_{\bar{a}, \rho} \sum_{i=1}^{N} \left[ -\left[ \frac{\partial L(Y_i, F(X_i))}{\partial F(X_i)} \right]_{F(X) = F_{m-1}(X)} - \rho h(X_i; \bar{a}) \right]^2, \tag{3.19} \]
where the first fraction, with its sign, is the pseudo-residual for each training point $i$.
It is built from the gradient of the loss function, which points in the direction in which the
loss function changes the most. With the minus sign, the pseudo-residual represents
the direction in which the loss function decreases most rapidly. Thus, the
algorithm seeks a weak learner that captures the direction in which the loss function
most rapidly decreases. Once $\bar{a}_m$ is determined, the optimal value of $\beta_m$ of the weak
learner is determined by solving:
\[ \beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L(Y_i, F_{m-1}(X_i) + \beta h(X_i; \bar{a}_m)). \tag{3.20} \]
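The stage-wise procedure in (3.16) to (3.20) can be sketched in Python for the squared-error loss, where the pseudo-residuals reduce to ordinary residuals (the thesis itself worked in R, so this is only an illustration). The weak learner here is a one-split regression stump on a single feature. The function names, the toy data, the 50 rounds, and the shrinkage of 0.1 are assumptions made for the example.

```python
def fit_stump(x, r):
    """Least-squares stump on a 1-D feature x fitted to targets r.
    Returns (threshold, left_value, right_value)."""
    best = None
    for thr in sorted(set(x))[:-1]:  # candidate split points
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - c_left) ** 2 for ri in left)
               + sum((ri - c_right) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, thr, c_left, c_right)
    if best is None:
        raise ValueError("x needs at least two distinct values")
    return best[1:]


def stump_predict(stump, xi):
    thr, c_left, c_right = stump
    return c_left if xi <= thr else c_right


def gradient_boost(x, y, n_rounds=50, shrinkage=0.1):
    """Boost stumps on the residuals; returns (F_0, stumps, shrinkage)."""
    f0 = sum(y) / len(y)  # F_0: best constant under squared error
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the pseudo-residual is proportional to y - F_{m-1}(x).
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # F_m = F_{m-1} + shrinkage * h; the least-squares stump already
        # carries its optimal leaf values, so no separate line search is run.
        pred = [pi + shrinkage * stump_predict(stump, xi)
                for xi, pi in zip(x, pred)]
    return f0, stumps, shrinkage


def boost_predict(model, xi):
    f0, stumps, shrinkage = model
    return f0 + shrinkage * sum(stump_predict(s, xi) for s in stumps)


# Toy data: a step function that a single constant cannot fit.
x = list(range(10))
y = [1.0] * 5 + [3.0] * 5
model = gradient_boost(x, y)
```

With a shrinkage of 0.1, each round removes about a tenth of the remaining residual, which illustrates why smaller shrinkage values need more rounds to converge.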
\[ \min_{\beta_0, \ldots, \beta_n} \sum_{j=1}^{p} \Big[ y^j - \beta_0 - \sum_{i=1}^{n} \beta_i X_i^j \Big]^2 + \lambda \sum_{j=1}^{n} |\beta_j| = RSS + \lambda \sum_{j=1}^{n} |\beta_j|. \tag{3.21} \]
Here, $j$ in the first sum indexes the $p$ observations and $i$ the $n$ predictors.
The added penalty $\lambda \sum_{j=1}^{n} |\beta_j|$ makes the coefficients shrink towards zero, which
helps the model avoid overfitting. The absolute value forces some of the $\beta$'s to take the
value zero, and thus the Lasso performs a variable selection. When employing the
Lasso with a linear regression model, the prediction model is given by:
\[ \hat{Y} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i. \tag{3.22} \]
The complicated part of calibrating the Lasso is the choice of $\lambda$. This parameter
decides how strongly to penalize high values of the $\beta$'s. One method for choosing
an appropriate value of $\lambda$ is cross-validation. [15]
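To make the shrinkage and variable selection concrete, the following Python sketch minimizes the objective in (3.21) by cyclic coordinate descent with soft-thresholding. Coordinate descent is a standard way to fit the Lasso, but it is an assumption here: the thesis worked in R and does not state which solver was used. The function names and the toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; values inside [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise RSS + lam * sum(|beta_i|) over an unpenalised intercept
    and n coefficients. X is a list of rows, one row per observation."""
    p, n = len(X), len(X[0])
    beta0, beta = sum(y) / p, [0.0] * n
    for _ in range(n_iter):
        # Intercept: mean of the residuals that exclude the intercept.
        beta0 = sum(y[j] - sum(beta[i] * X[j][i] for i in range(n))
                    for j in range(p)) / p
        for i in range(n):
            # Correlation of predictor i with the partial residual.
            rho = sum(
                X[j][i] * (y[j] - beta0
                           - sum(beta[k] * X[j][k] for k in range(n) if k != i))
                for j in range(p))
            z = sum(X[j][i] ** 2 for j in range(p))
            beta[i] = soft_threshold(rho, lam / 2.0) / z
    return beta0, beta


# Toy data: y = 1 + 2 * x1, while x2 is irrelevant (and orthogonal to x1).
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 2.0], [1.0, -2.0], [2.0, 1.0]]
y = [-3.0, -1.0, 1.0, 3.0, 5.0]
beta0, beta = lasso_coordinate_descent(X, y, lam=4.0)
```

On this toy data the coefficient of the relevant predictor is shrunk below its least-squares value of 2, while the coefficient of the irrelevant predictor is set exactly to zero.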
\[ Z_m^1 = \sigma(\alpha_{0m}^1 + [\alpha_m^1]^T X_i), \quad m = 1, \ldots, M, \tag{3.23} \]
\[ Z_m^j = \sigma(\alpha_{0m}^j + [\alpha_m^j]^T Z^{j-1}), \quad m = 1, \ldots, M \text{ and } j = 2, \ldots, J, \tag{3.24} \]
\[ T_k = \beta_{0k} + \beta_k^T Z^J, \quad k = 1, \ldots, K, \tag{3.25} \]
\[ Y_k = F_k(X) = g_k(T), \quad k = 1, \ldots, K. \tag{3.26} \]
The function $g(u)$ is referred to as the final transformation function of the outputs
$T$. For regression problems, $g(u) = u$ is a common choice, i.e., the output is simply the
weighted sum of the $Z_m^J$ values produced by the last hidden layer.
When referring to the network architecture, there is no limitation on how it
can be structured. The common terminology is input layer, hidden layer, and
output layer. The input layer consists of $n$ nodes, where each node represents
one parameter of the input data $X_i$, i.e., the $n$-th input node corresponds to $X_{in}$.
The hidden layer(s) consist of $M$ nodes and first transform the input $X_i$ via the
summation function $\alpha_{0m}^1 + [\alpha_m^1]^T X_i$ and then apply the activation function $\sigma(v)$ to this
sum to produce the $Z_m$ values. If the network consists of only one hidden layer, the
$Z$ values are passed to the summation function $T(Z)$, and the final output value is then
produced by the function $g(T)$. In the case
of multiple hidden layers, the $Z_m^j$ values of a hidden layer (layer $j$) are passed
on through the summation function $\alpha_{0m}^{j+1} + [\alpha_m^{j+1}]^T Z^j$ and then again transformed with a
sigmoid function (or any other activation function) to give the values of layer $j + 1$. This process
can be repeated many times until the $Z$ values of the last hidden layer have been
computed; then the $T$ function and the final transfer function $g(T)$ are applied
to produce the output layer. Using many hidden layers is commonly referred to as
Deep Learning since the network is considered to have a depth of layers.
Figure 3.1 demonstrates how a neuron works. The input, either original data
$X$ or another layer's values of $Z$, is passed to the summation function and then
transformed with the activation function. This gives an output, denoted as $Z_m^j$ in
the image. This output is later passed on to other neurons in other layers or to the
output layer via the final transfer function.
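The forward computation in (3.23) to (3.26) can be sketched in Python as follows (the thesis itself worked in R). The sketch assumes a sigmoid activation in the hidden layers and the identity for the final transformation, i.e., the regression case $g(u) = u$. The function names and the tiny example network are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(inputs, weights, biases, activation):
    """One layer: activation(bias_m + weights_m . inputs) for every node m."""
    return [activation(b + sum(w * x for w, x in zip(w_row, inputs)))
            for w_row, b in zip(weights, biases)]


def forward(x, hidden_layers, output_layer):
    """hidden_layers: list of (weights, biases), one pair per hidden layer.
    output_layer: (weights, biases) of the final summation T_k."""
    z = x
    for weights, biases in hidden_layers:  # (3.23) and (3.24)
        z = layer_forward(z, weights, biases, sigmoid)
    out_w, out_b = output_layer
    # (3.25) and (3.26) with the identity as final transformation g(u) = u.
    return layer_forward(z, out_w, out_b, lambda u: u)


# Hypothetical network: 3 inputs, one hidden layer with 2 nodes, 1 output.
hidden = [([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])]
output = ([[1.0, 2.0]], [0.25])
prediction = forward([1.0, 2.0, 3.0], hidden, output)
```

With all hidden weights at zero, each hidden node outputs sigmoid(0) = 0.5, so the output is 0.25 + 1.0 * 0.5 + 2.0 * 0.5.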
[Figure 3.1: diagram of a single neuron. The inputs X1, ..., Xn are multiplied by weights, combined by the summation function, and passed through the activation function to give the output Zmj.]
Figure 3.2: Example of a Deep ANN
Figure 3.2 shows an example of an ANN with three hidden layers, each with nine
neurons. The lines represent the weights used to pass the sum of the input
data to the next neurons/layers, and a circle represents one neuron at each layer. It
should be clarified that the hidden layers do not all have to have the same number
of nodes. Once a network has been set up, the aim is to find the weights within
the network that minimize some criterion, which usually translates to minimizing
the squared error.
\[ R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (Y_{i,k} - F_k(X_i))^2, \tag{3.27} \]
where $K$ is the number of classes and $N$ the number of training inputs. In this setting,
$X_i$ is one input data vector, $F_k$ is the output of the ANN and $Y_{i,k}$ is the true value
for a given class $k$ and input vector $X_i$.
The general way of minimizing the above expression is by gradient descent,
which in this setting is also referred to as back-propagation. With the above
sum-of-squared errors notation and the specified functions across the ANN, the network
has the following derivatives:
\[ \frac{\partial R_i}{\partial \beta_{k,m}} = -2(Y_{i,k} - F_k(X_i)) g_k'(\beta_k^T Z_i) Z_{m,i}, \tag{3.28} \]
\[ \frac{\partial R_i}{\partial \alpha_{n,m}^1} = -\sum_{k=1}^{K} 2(Y_{i,k} - F_k(X_i)) g_k'(\beta_k^T Z_i) \beta_{k,m} \sigma'([\alpha_m^1]^T X_i) X_{n,i}. \tag{3.29} \]
The above derivatives are valid for a single-layer network; the derivatives
for $\alpha_{n,m}^j$ follow analogously. Here, $m = 1, \ldots, M$ indexes the neurons, $i$
indexes the data inputs and $n$ indexes a specific input parameter
of $X_i$.
With these derivatives, the weights are continuously updated from an initial
guess of $\alpha$ and $\beta$ in the following way, where $r$ is a time step and $\gamma$ (a constant)
represents the learning rate:
\[ \beta_{k,m}^{(r+1)} = \beta_{k,m}^{(r)} - \gamma \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{k,m}^{(r)}}, \tag{3.30} \]
\[ \alpha_{n,m}^{j,(r+1)} = \alpha_{n,m}^{j,(r)} - \gamma \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{n,m}^{j,(r)}}. \tag{3.31} \]
Resilient Propagation
Resilient backpropagation is a type of batch learning for multi-layer ANNs. When
determining the weight step in each iteration, resilient backpropagation uses only the
sign of the partial derivative and not its size. The update of the
weights ($\Delta\alpha_{m,j}^{(r)}$) is then determined by:
\[ \Delta\alpha_{m,j}^{(r)} = \begin{cases} -\Delta_{m,j}^{(r)}, & \text{if } \frac{\partial E^{(r)}}{\partial \alpha_{m,j}} > 0 \\ +\Delta_{m,j}^{(r)}, & \text{if } \frac{\partial E^{(r)}}{\partial \alpha_{m,j}} < 0 \\ 0, & \text{else.} \end{cases} \tag{3.34} \]
where $0 < \eta^- < 1 < \eta^+$. In essence, the algorithm looks at the sign of the
partial derivative with respect to $\alpha_{m,j}$ and thereafter updates the weight by adding or
subtracting the update value $\Delta_{m,j}$. The update values are thereafter adjusted: they are made
larger if the partial derivative did not change sign, to speed up convergence, and made
smaller if the partial derivative changed sign, meaning that the update skipped over a local minimum.
The parameters $\eta^+$ and $\eta^-$ are normally set to 1.2 and 0.5. [19]
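The sign-based update described above can be sketched in Python on a simple quadratic function instead of a network error (the thesis itself worked in R). The sketch follows the verbal description: the step grows by a factor 1.2 when the sign of the partial derivative is unchanged and shrinks by a factor 0.5 when it flips. The initial step `delta0`, the bounds `delta_min` and `delta_max`, and the test function are assumptions added for the example.

```python
def rprop_minimize(grad, w0, n_iter=100, delta0=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   delta_min=1e-6, delta_max=50.0):
    """Minimise a function given its gradient, using only gradient signs."""
    w = list(w0)
    delta = [delta0] * len(w)   # one update value per weight
    prev_g = [0.0] * len(w)
    for _ in range(n_iter):
        g = grad(w)
        for j in range(len(w)):
            if prev_g[j] * g[j] > 0:      # same sign: enlarge the step
                delta[j] = min(delta[j] * eta_plus, delta_max)
            elif prev_g[j] * g[j] < 0:    # sign change: a minimum was skipped
                delta[j] = max(delta[j] * eta_minus, delta_min)
            # Move against the sign of the derivative, as in (3.34).
            if g[j] > 0:
                w[j] -= delta[j]
            elif g[j] < 0:
                w[j] += delta[j]
        prev_g = g
    return w


# Hypothetical error surface E(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2.
def grad_e(w):
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


w_opt = rprop_minimize(grad_e, [0.0, 0.0])
```

Because only the sign of each derivative is used, the very different curvature in the two directions does not affect the step sizes, which is the main appeal of the method.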
Overfitting
A common remedy for overfitting is to apply a penalizing term to the minimization
problem, similar to that described for the Lasso. Normally, $R(\theta) + \lambda J(\theta)$ is set to be the
function to minimize via back-propagation, where:
\[ J(\theta) = \sum_{k,m} \beta_{k,m}^2 + \sum_{j} \sum_{n,m} (\alpha_{n,m}^j)^2, \tag{3.36} \]
Input Scaling
The scaling of the inputs can affect the scaling of the weights in the output layer and thus
affect the final output results. Therefore, it is common to standardize all input data.
Normally, one standardizes the data to have zero mean and a standard deviation of one.
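A minimal Python sketch of this standardization follows (the thesis itself worked in R). Using the population standard deviation is an assumption made for the example, as are the function name and the input values.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sd for v in values]


irradiance = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
z_scores = standardize(irradiance)
```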
Network Architecture
According to Hastie et al. [16], it is in general better to have too many hidden units
or neurons per layer than too few. With too few neurons, the model may not reach
the desired flexibility and thus not capture non-linear structures of the data. By
using an appropriate penalizing term when minimizing RSS, many of the weights of
the excessive neurons will be shrunk towards zero and thus have little impact on the
model.
Regarding choosing an adequate structure of layers, the process is mainly based
on experience according to Hastie et al. As the back-propagation algorithm is
considered to be computationally expensive, more layers will make the computational
time longer but can, on the other hand, enhance the performance. Because of this, when
modeling with deep ANNs, back-propagation is not the conventional choice of
algorithm.
Multiple Minima
One of the main issues with minimizing RSS is that when $F_k$ is complex, it is
not certain that $R(\theta)$ is convex. It is non-convex in the above setting, and thus
it has several local minima. This means that the final solution of the model
will depend on the initial guess of weights. Therefore, it is common to start with
different random starting weights and then choose the solution that yields
the lowest penalized error. Another alternative is to use bagging, which averages
the predictions of different networks from bootstrapped datasets.
Chapter 4
Method and Data
In this chapter, we present the methodology of this thesis. We discuss the original
data, how it was obtained and how it was structured. The chapter then
discusses the data processing employed and describes explicitly what was
done for the different time series and ML techniques.
Throughout the work, all data handling and mathematical computation were
carried out in R version 3.4.3 [20] and RStudio [21]. The main R package used is caret
[22], which is a wrapper containing tools for streamlining the creation of predictive
models and pre-processing data.
CHAPTER 4. METHOD AND DATA
The NWP data accessed has some limitations. One is that the forecasts are
produced every 6 hours (6:00, 12:00, 18:00, 24:00) and span six hours forward in
time, which means that it is not possible to produce predictions further ahead
than 6 hours. Another consequence is that the training set is reduced for
longer time horizons, as one cannot use unavailable weather forecasts. For example,
a prediction made at 11:00 for 13:00 would require NWP data for 12:45 to 13:00, which
becomes available at 12:00 and thus is unavailable at 11:00 when the prediction is to
be made. Moreover, the data was obtained via the ZIP codes of the sites' locations.
The data therefore does not correspond exactly to the precise coordinates of
the solar PV panels.
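The availability constraint can be made explicit with a small Python sketch (the thesis itself worked in R). It assumes that a forecast run becomes available exactly at its nominal issue time and covers the following six hours. The function names and the dates are hypothetical.

```python
from datetime import datetime, timedelta

RUN_SPAN = timedelta(hours=6)  # each run covers the next 6 hours


def latest_run(issue_time):
    """Most recent NWP run at or before issue_time (runs at 00, 06, 12, 18)."""
    run_hour = (issue_time.hour // 6) * 6
    return issue_time.replace(hour=run_hour, minute=0, second=0, microsecond=0)


def nwp_available(issue_time, target_time):
    """True if the latest run known at issue_time covers target_time."""
    run = latest_run(issue_time)
    return run <= target_time <= run + RUN_SPAN


# The example from the text: a forecast for 13:00 made at 11:00 or at 12:00.
at_11 = nwp_available(datetime(2017, 6, 1, 11, 0), datetime(2017, 6, 1, 13, 0))
at_12 = nwp_available(datetime(2017, 6, 1, 12, 0), datetime(2017, 6, 1, 13, 0))
```

Training rows for which such a check fails would have to be dropped for that forecast horizon, which is why the training set shrinks for longer horizons.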
\[ Y_t' = \frac{Y_t}{Y_t^{cl}}, \tag{4.1} \]
where $Y_t$ represents the energy output for the time interval $t$ and $Y_t^{cl}$ represents the
energy output for the same time interval on a day with sunshine the whole day and
no clouds. The clear sky term $Y_t^{cl}$ in the above formula has been observed to vary
across the yearly cycle and therefore needs to be modeled.
While physical models have proved efficient, they require specific information
about the PV installation [24]. Lacking this, the approach used by Bacher et al.
[25], which requires only a time series of solar PV panel energy output, was employed.
The technique is based on fitting a Gaussian kernel in two dimensions (time of day
and time of year) in such a way that the fitted surface is located near the top values of
the observations. By aligning the Gaussian kernel with the top output values (the
values that correspond to clear sky energy output days), an estimated surface
representing the clear sky output values throughout the year is produced. Fitting
the kernel towards the top values is achieved with a quantile regression, which can be
described as a regression that minimizes an asymmetrically weighted sum of absolute
residuals, so that a fraction q of the points lies under the fitted surface. For more details,
consult [8] and [25].
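The idea can be illustrated with a simplified Python sketch (the thesis itself worked in R). Instead of a full kernel-weighted quantile regression, it estimates the clear sky output at one point as a kernel-weighted quantile of nearby observations, which is a local-constant simplification and not the exact method of Bacher et al. The quantile `q`, the bandwidths, the function names, and the data are all assumptions made for the example.

```python
import math


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches a fraction q of the total."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return v
    return pairs[-1][0]


def gaussian(u, bandwidth):
    return math.exp(-0.5 * (u / bandwidth) ** 2)


def clear_sky_estimate(obs, day, hour, q=0.85, h_day=30.0, h_hour=1.0):
    """obs: list of (day_of_year, hour_of_day, output) tuples.
    Observations close in time of year and time of day get higher weight."""
    values, weights = [], []
    for d, h, y in obs:
        dist_day = min(abs(d - day), 365 - abs(d - day))     # circular distance
        dist_hour = min(abs(h - hour), 24 - abs(h - hour))
        values.append(y)
        weights.append(gaussian(dist_day, h_day) * gaussian(dist_hour, h_hour))
    return weighted_quantile(values, weights, q)


def normalize(y, y_clear, floor=1e-6):
    """Clear sky normalization (4.1), guarded against division by zero."""
    return y / max(y_clear, floor)


# Hypothetical observations: ten outputs recorded at noon on day 180.
observations = [(180, 12.0, float(v)) for v in range(1, 11)]
y_cl = clear_sky_estimate(observations, day=180, hour=12.0)
y_norm = normalize(4.5, y_cl)
```

Choosing a high quantile places the estimate near the top of the observed outputs, which mimics the role of clear sky days in the fitted surface.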
in the models, where one step is the granularity of the data (15 minutes). If $y_t$ is
the output at time $t$, then the variables are constructed as follows:
\[ y_{\text{lag1step}} := y_{t-1}, \tag{4.2} \]
\[ y_{\text{lag2step}} := y_{t-2}, \tag{4.3} \]
\[ y_{\text{lag1day}} := y_{t-24 \cdot 4}, \tag{4.4} \]
\[ y_{\text{delta}} := y_{t-1} - y_{t-2}. \tag{4.5} \]
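The lagged variables in (4.2) to (4.5) can be built as in the following Python sketch (the thesis itself worked in R). The function name and the toy series are hypothetical; the dictionary keys mirror the variable names in the equations.

```python
STEPS_PER_DAY = 24 * 4  # 15-minute data gives 96 steps per day


def lag_features(y, t):
    """Lagged features for target index t of the output series y."""
    if t < STEPS_PER_DAY:
        raise ValueError("need at least one day of history")
    return {
        "y_lag1step": y[t - 1],
        "y_lag2step": y[t - 2],
        "y_lag1day": y[t - STEPS_PER_DAY],
        "y_delta": y[t - 1] - y[t - 2],
    }


series = [float(i) for i in range(200)]  # toy series: y_t = t
feats = lag_features(series, t=100)
```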
Temporal Variability
The variability of the NWP was computed by using the 15-minute lagged value, the
15-minute lead value and the actual value of the given weather parameter. The
idea was to create a variable that captures potential variability in the weather. For
example, high variability in effective cloud cover is likely to affect the energy output,
making it a potentially important input. As Bessa and Andrade found in their
study, there is a correlation between the NWP's variability and the energy
output variability.
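One way to build such a variable is sketched below in Python (the thesis itself worked in R). The text does not state which statistic was used, so taking the variance over the lagged, actual, and lead values is an assumption, as are the function name and the cloud-cover series.

```python
from statistics import pvariance


def temporal_variability(series):
    """Variance over (previous, current, next) values; None at the edges,
    where a lagged or lead value is missing."""
    out = [None] * len(series)
    for t in range(1, len(series) - 1):
        out[t] = pvariance(series[t - 1:t + 2])
    return out


cloud_cover = [0.2, 0.2, 0.2, 0.8, 0.2]  # hypothetical 15-minute forecasts
variability = temporal_variability(cloud_cover)
```

The feature is close to zero while the forecast is stable and rises around the abrupt change in cloud cover.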
much each parameter minimizes the cost function. From this, the GBRT chooses to
split with the feature that reduces the cost function the most, and thus it ranks the
features with relative importance. This is performed for each split until the tree is
entirely built. By relying on the most important features indicated from the GBRT,
we were able to perform a computationally efficient feature selection. Pan et al. [27]
showed that there exist GBRT based feature selection algorithms that are efficient,
thus relying on GBRT should give a good proxy for the most important features.
4.3 Cross-Validation
Many time series methods and ML techniques require hyperparameter optimization.
A common practice is to perform cross-validation (CV) to determine the optimal
choice of hyperparameters. One type of cross-validation is k-fold cross-validation.
In general, this refers to randomly dividing the set into k folds. The validation
principle picks one of the k folds as a test set and the rest of the observations as
the training set. The process then changes the test set to another fold while using
the rest of the dataset as training data. This is repeated for all k folds, and for each
fold a performance metric is computed. Once a performance metric is computed for
each fold, the average performance is computed. This can be performed for several
hyperparameter values to determine the value that yields the best performance metric
from the cross-validation. A common choice of k is between 5 and 10.
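The procedure can be sketched in Python as follows (the thesis itself relied on the R package caret for this step, so the code is only an illustration). The helper names, the mean-predictor model, and the toy data are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, fit, score, k=5, seed=0):
    """Average the score over k folds, each used once as the test set."""
    scores = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [x[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return sum(scores) / k


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def rmse_score(model, x_test, y_test):
    return (sum((yi - model) ** 2 for yi in y_test) / len(y_test)) ** 0.5


x_data = list(range(20))
y_data = [2.0] * 20
cv_rmse = cross_validate(x_data, y_data, fit_mean, rmse_score, k=5)
```

To tune a hyperparameter, one would call `cross_validate` once per candidate value and keep the value with the best average score.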
where $Y_i$ corresponds to the true value and $\hat{Y}_i$ is the forecast. As the RMSE squares
the errors before averaging, large errors such as outliers are given a higher weight.
where $\max(Y)$ is the maximum output recorded for a specific site. [4]
\[ SS_{\text{model}} = \frac{A_{\text{model}} - A_{\text{reference}}}{A_{\text{perfect model}} - A_{\text{reference}}}. \tag{4.9} \]
In this setting, $A$ refers to any performance metric and the reference model is
usually chosen to be the persistence or climatology model. When using RMSE, the
‘perfect model’ is equivalent to RMSE = 0. This yields:
\[ SS_{\text{model}}^{\text{RMSE}} = 1 - \frac{\text{RMSE}_{\text{model}}}{\text{RMSE}_{\text{reference}}}. \tag{4.10} \]
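The error metrics can be sketched in Python as follows (the thesis itself worked in R). The nRMSE is assumed here to be the RMSE divided by the maximum recorded output, based on the description of $\max(Y)$ above. The function names and the numbers are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of the forecasts."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
                     / len(y_true))


def nrmse(y_true, y_pred):
    # Assumption: RMSE normalized by the maximum recorded output of the site.
    return rmse(y_true, y_pred) / max(y_true)


def skill_score(rmse_model, rmse_reference):
    """Skill score (4.10): 1 is perfect, 0 matches the reference model."""
    return 1.0 - rmse_model / rmse_reference


observed = [0.0, 2.0, 4.0]
forecast = [0.0, 2.0, 2.0]
persistence = [0.0, 0.0, 2.0]  # hypothetical reference forecast
ss = skill_score(rmse(observed, forecast), rmse(observed, persistence))
```

A positive skill score means the model has a lower RMSE than the reference, and a negative score means it performs worse.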
4.5.1 Algorithm
Figure 4.6 shows an overview of the machine learning algorithm performed for each
site and model. The forecast horizons are from 15 minutes to 5 hours, divided
into hourly intervals. Each site is trained independently as they all have different
specifications. Furthermore, it allows for a comparison of how different models
perform on different weather regimes. Each model is explained in more detail below.
4.5.2 ARIMA
The ARIMA modeling followed a classical time series approach. Accordingly,
a training and a test set were set up, in which all data except the most recent half year
was considered training data and the last half year test data.
The series of energy output was normalized with the clear sky normalization to
reduce the non-stationarity. The model orders were then identified with ACF and PACF
plots with lags set to 200 to capture any seasonal patterns, in which lag 96 is the
daily seasonality.
Once the model had been determined, a Ljung-Box test was performed with the
stats package [30] to see if there still existed any autocorrelations on 20 and 200
lags respectively. Furthermore, coefficient tests were performed to identify if the
parameters were significant. Ultimately, a Q-Q plot of the model's residuals against
a normal distribution was used to see if the residuals followed a white noise process.
If any of these assumptions were violated, the validity of the ARIMA model could
be questioned, and it was decided not to model the ARIMA in such cases. If the
central assumptions of the ARIMA model were met, final predictions
were computed on the different time horizons.
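For illustration, the Ljung-Box statistic can be computed with the following Python sketch (the thesis itself used the R stats package for the test). The statistic is compared with a chi-square quantile: with 10 lags the 95% critical value is about 18.31, and for residuals of a fitted model the degrees of freedom are usually reduced by the number of estimated parameters. The function names and the periodic example series are hypothetical.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mu = sum(x) / n
    denom = sum((v - mu) ** 2 for v in x)
    num = sum((x[t] - mu) * (x[t - lag] - mu) for t in range(lag, n))
    return num / denom


def ljung_box_q(x, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k rho_k^2 / (n - k)."""
    n = len(x)
    return n * (n + 2) * sum(autocorrelation(x, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))


# Hypothetical "residuals" with a strong leftover periodic pattern.
periodic = [math.sin(2 * math.pi * t / 8) for t in range(200)]
q_stat = ljung_box_q(periodic, max_lag=10)
```

A statistic far above the critical value leads to rejecting the null hypothesis that all autocorrelations up to the chosen lag are zero.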
Hyperparameters
Hyperparameters
Hyperparameters
Hastie [16] claims that an interaction depth of $4 \leq J \leq 8$ performs well when
boosting and that the results are insensitive to choices in this range. For this reason,
the interaction depth is set to $J = 6$ and is not optimized further. Smaller values of the
shrinkage $v$ require larger values of $M$ to converge, and thus longer computational
time. Ideally one would want a very small shrinkage ($v < 0.1$) and a large $M$. In
this case, a set of different shrinkage parameters is tested in the belief that the
error will converge for some values of the shrinkage before reaching $M = 200$.
Hyperparameters
\[ \hat{Y}_{t+k} = \frac{1}{n} \sum_{h=1}^{100} Y_{t-h}, \tag{4.12} \]
Chapter 5
Results
In this chapter, the empirical results of the models are presented for the different
employed techniques.
5.1 ARIMA
The result of the clear sky normalization is presented in figures 5.1 and 5.2. The
Gaussian kernel smoothing manages to reduce the non-stationarity to some extent.
While the data is more centered during the summer months in the normalized time series,
some variation in variance and mean still exists during the winter seasons. Thus,
the non-stationarity is not completely removed.
Figure 5.1: Normalized energy output
Figure 5.2: Actual energy output
As night data was included in the ARIMA model, a seasonal pattern is evident, and
a seasonal differencing was therefore applied first. The ACF plot can be seen in
figure 5.3, where a slow but exponential decay is present. The exponential decay
indicates that a stationary series is attained, which also seems likely from the plot
of the seasonally differenced series in figure 5.4.
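As an illustration of this step, the sketch below applies a seasonal difference and computes a sample ACF on synthetic data. The seasonal period of 96 (a daily cycle at 15-minute resolution), the toy series and the function names are assumptions made for the example, not values taken from the fitted models.

```python
import math
import random


def seasonal_difference(series, period):
    """Return y_t - y_{t-period} for t = period, ..., len(series) - 1."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def acf(series, max_lag):
    """Sample autocorrelation function for lags 0, 1, ..., max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


PERIOD = 96  # assumption: 15-minute data with a daily season
rng = random.Random(0)

# Toy series: a clipped daily sine wave (zero at "night") plus AR(1) noise.
noise = [0.0]
for _ in range(10 * PERIOD - 1):
    noise.append(0.5 * noise[-1] + rng.gauss(0.0, 0.05))
raw = [max(0.0, math.sin(2.0 * math.pi * (t % PERIOD) / PERIOD)) + noise[t]
       for t in range(10 * PERIOD)]

differenced = seasonal_difference(raw, PERIOD)
acf_raw = acf(raw, 200)
acf_diff = acf(differenced, 200)
print(f"ACF at the seasonal lag before differencing: {acf_raw[PERIOD]:.2f}")
print(f"ACF at the seasonal lag after differencing:  {acf_diff[PERIOD]:.2f}")
```

In this toy setting the strong positive autocorrelation at the seasonal lag disappears after differencing; the remaining structure at that lag is what seasonal AR or MA terms would then have to model.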
When modeling the seasonal and non-seasonal ARMA terms, the exponential decay
in the ACF (figure 5.3) and the sharp cut-offs in the PACF (figure 5.5) indicated
SAR and AR terms of order one. Because of the evident partial autocorrelation at
the second lag, an ARIMA(1,0,0)(2,1,0) was fitted.
CHAPTER 5. RESULTS
[Figures 5.3 to 5.6: an ACF plot against lag (0 to 200), the seasonally differenced series against time (Jan 2016 to Jan 2017), and two partial ACF plots against lag.]
As can be seen in figure 5.6, the seasonality is reduced; however, the model does
not manage to remove it completely. Many significant autocorrelations also remain
among the lags, in particular at the first lags. After successively trying different
ARIMA setups according to the ACF and PACF, the best model found was an
ARIMA(4,0,0)(2,1,0).
Once the best model had been found with respect to the ACF and PACF, noise in the
ACF and PACF was observed between lags of around 10 and 90, as can be seen in
figures 5.7 and 5.8. This violates the ARIMA model assumption of uncorrelated
residuals.
Figure 5.7: ACF of ARIMA(4,0,0)(2,1,0) residuals Figure 5.8: PACF of ARIMA(4,0,0)(2,1,0) residuals
So as not to rely only on the ACF and PACF plots, a Ljung-Box test was performed on
both short-term and long-term lags, 10 and 200 respectively. As can be seen in Table
5.1, the p-value was very close to 0. This indicates that the null hypothesis, which
in a Ljung-Box test is that all autocorrelations are equal to zero, should be rejected.
It can thus be concluded that this test also indicated correlation between the
lags over time, which is a violation of the main assumption of the ARIMA.
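For reference, the Ljung-Box statistic is Q = n(n + 2) Σ ρ_k² / (n − k), summed over lags k = 1, ..., h, and is compared against a chi-square distribution. The sketch below is a minimal standard-library version, not the routine used in this work. It assumes an even number of degrees of freedom so that the chi-square tail has a closed form, and the `fitted_params` argument (the number of estimated ARMA parameters subtracted from the degrees of freedom) is a hypothetical name.

```python
import math
import random


def autocorrelations(series, max_lag):
    """Sample autocorrelations for lags 1, ..., max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]


def chi2_sf_even(x, df):
    """Upper tail P(X > x) of a chi-square distribution with an even df."""
    if df <= 0 or df % 2:
        raise ValueError("the closed form used here requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def ljung_box(residuals, lags, fitted_params=0):
    """Return the Ljung-Box Q statistic and its p-value."""
    n = len(residuals)
    rho = autocorrelations(residuals, lags)
    q = n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(rho, start=1))
    return q, chi2_sf_even(q, lags - fitted_params)


# Synthetic residuals (assumption): white noise versus a strongly correlated AR(1) series.
rng = random.Random(1)
white = [rng.gauss(0.0, 1.0) for _ in range(500)]
ar1 = [0.0]
for _ in range(499):
    ar1.append(0.8 * ar1[-1] + rng.gauss(0.0, 1.0))

q_white, p_white = ljung_box(white, lags=10)
q_ar1, p_ar1 = ljung_box(ar1, lags=10)
print(f"white noise: Q={q_white:.1f}, p={p_white:.3f}")
print(f"AR(1):       Q={q_ar1:.1f}, p={p_ar1:.3g}")
```

A p-value close to zero, as for the correlated series here, leads to rejecting the null hypothesis of zero autocorrelation, which is the situation reported for the ARIMA residuals.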
[Figure: cross-validation RMSE against number of trees, M, in two panels, for shrinkage v = 0.01, 0.03, 0.05, 0.07 and 0.10.]
As expected, a higher shrinkage gave faster convergence for the GBRT at both the
15-minute and five-hour horizons. For large M, the RMSE starts to grow due to
overfitting for all shrinkages except 0.01, which might not have converged. The
results for the other time horizons and sites were similar and are therefore omitted.
[Figure: cross-validation RMSE against number of neighbors, K.]

Lasso λ for each site and horizon:

Horizon (h)   Site 1   Site 2   Site 3   Site 4   Site 5
0.25          0.017    0.022    0.007    0.001    0.006
0.5           0.001    0.004    0.001    0.001    0.004
1             0.001    0.017    0.001    0.024    0.007
2             0.001    0.004    0.008    0.027    0.005
3             0.002    0.013    0.006    0.039    0.007
4             0.001    0.010    0.001    0.008    0.080
5             0.001    0.017    0.001    0.001    0.240
When looking at the Lasso, a small λ in the range of 0.001-0.08 is the most
common, with one exception at site five for the five-hour horizon. For the KNN, the
RMSE converges before reaching K = 99 for all time horizons. Similar results were
obtained for the other sites and are therefore omitted.
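The choice of K can be sketched as a scan over candidate values, keeping the one with the lowest validation RMSE. The example below is a simplified illustration under assumptions: one unscaled feature, a single hold-out split instead of cross-validation, synthetic data, and hypothetical function names.

```python
import math
import random


def knn_predict(x_train, y_train, query, k):
    """Average the targets of the k training points closest to the query."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - query))[:k]
    return sum(y_train[i] for i in nearest) / k


def validation_rmse(x_train, y_train, x_val, y_val, k):
    errors = [(knn_predict(x_train, y_train, q, k) - target) ** 2
              for q, target in zip(x_val, y_val)]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic toy data (assumption): a noisy sine curve.
rng = random.Random(0)
x_all = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(300)]
y_all = [math.sin(v) + rng.gauss(0.0, 0.2) for v in x_all]
x_train, y_train = x_all[:200], y_all[:200]
x_val, y_val = x_all[200:], y_all[200:]

# Scan odd K up to 99 and keep the value with the lowest validation RMSE.
curve = {k: validation_rmse(x_train, y_train, x_val, y_val, k)
         for k in range(1, 100, 2)}
best_k = min(curve, key=curve.get)
print(f"best K = {best_k}, validation RMSE = {curve[best_k]:.3f}")
```

With several features, the absolute difference would be replaced by a distance over scaled features, which is where feature scaling becomes important for the KNN.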
For the ANN, the number of layers and weight decays used for each model varied
widely, and no apparent pattern is observed. This could be linked to the random
initialization of the weights, which is further discussed in chapter 6.
Table 5.2: Numbers of layers and decay for different sites and horizons
5.2.2 RMSE
Overall, a lower RMSE is observed for the machine learning algorithms. The ML
algorithms perform similarly at short time horizons, while the more complex
algorithms, ANN and GBRT, perform better at longer time horizons in most cases.
[Figures: RMSE against time horizon (hours), shown as five pairs of panels. In each pair, one panel compares ANN, GBRT, KNN, Lasso, climatology and persistence, and the other compares ANN, GBRT, KNN and Lasso only.]
5.2.4 nRMSE
[Figure: nRMSE against time horizon (hours).]
In figure 5.17 one can see that ANN and GBRT have the lowest nRMSE across
sites and horizons.
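To make the metric concrete, the sketch below computes a normalized RMSE so that sites of different size can be compared. The normalizing constant used here, the installed capacity of each site, is an assumption made for the illustration; the numbers are hypothetical and the function names are not from this work.

```python
import math


def rmse(actual, forecast):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def nrmse(actual, forecast, capacity):
    """RMSE divided by a normalizing constant (here: installed capacity, an assumption)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return rmse(actual, forecast) / capacity


# Two hypothetical sites of different size with the same relative error.
small_site = nrmse([0.0, 5.0, 10.0], [1.0, 4.0, 9.0], capacity=10.0)
large_site = nrmse([0.0, 50.0, 100.0], [10.0, 40.0, 90.0], capacity=100.0)
print(small_site, large_site)
```

Both hypothetical sites obtain the same nRMSE although their RMSE values differ by a factor of ten, which is why a normalized metric is convenient when aggregating across sites.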
5.2.6 Run-time
Chapter 6
Discussion
In this chapter, we analyze and discuss the results of the study. The first sections
cover the results of the time series models and the ML techniques. In the later
sections, the discussion is directed towards general issues of the study as a whole.
CHAPTER 6. DISCUSSION
with linear regression, and thus it does not manage to capture possible non-linear
structures in the data.
The global radiation features had a high relative importance for the GBRT.
Global radiation depends on other weather conditions and therefore captures
many important climatic conditions in one variable. As this was one of the
essential features for the GBRT, and as our GBRT improved slightly on
Persson et al. [12] regarding nRMSE, we recommend that other studies incorporate
this feature.
Another issue to discuss is the improvement potential of each model. As the
study has not covered an in-depth analysis of different versions of the models, it is
likely that some models can be improved. We believe that the models with the
most potential to improve are the KNN and ANN.
The KNN's potential for improvement stems mainly from the fact that no
sophisticated feature selection procedure was performed, so it is likely that
there exist versions that would give better predictions. A proper feature selection
procedure may therefore improve the performance of the KNN.
There was no evident ANN parameter setup (number of layers and weight decay)
that outperformed the others. As the initial weights were initialized at random and
only a limited number of trials were conducted for each setup, further testing is
needed to infer an optimal structure. Moreover, even though the ANN outperformed
the other ML models, there is room for improvement, mainly because the model
has many variations. For example, bagging or boosting ANNs might mitigate the
risk of ending up with a poorly performing ANN due to unfavorable initial weights.
Another aspect that was taken up in the literature review is that certain optimization
algorithms can affect the network's ability to converge and thus improve the results.
We have not employed the genetic optimization algorithm that has previously been
shown to improve the performance of an ANN. A comparison of performance relative
to computational time is also relevant, in particular for the ANN. One could discuss
the trade-off between a gain in prediction accuracy and the accompanying increase
in computational time.
both regarding suppliers and the underlying model employed for producing an NWP
forecast. Through this, the variability of the input data errors could have been
lowered, which would have improved the results to some extent.
Another aspect to highlight is that we did not have access to NWP data over
a spatial grid surrounding the solar PV installation location. As pointed out in the
work of Andrade et al. [2], a spatial feature engineering process proved to enhance
the performance of the predictions, as it manages to capture how the weather is likely
to change over a 15-minute interval. To further improve the results, NWP data over
a spatial grid should be collected at each location to produce spatial features.
6.3.3 Methodology
Regarding methodology, there are aspects to consider to improve the results. One
option is to implement different models for different seasons. Splitting the year into
a winter and a summer season and then training the models independently would
likely improve the results. To connect this to the previously mentioned problem of
boosted performance metrics and the fact that models tend to perform better when
trained on more specific periods, one could try to model only peak hours. In this
setting, the model would not have to make a trade-off by adapting the parameters
in such a way that they perform well on morning and night data as well as during
mid-day hours.
The problem with using one model over a whole year is that the model will be
trained on data that vary with the changing weather conditions. This forces the
model to generalize more over the year. If one trained the model on winter months
only, it could focus solely on these points without having to make a trade-off
between fitting the summer data well and fitting the winter data well. On the
other hand, having different models depending on the climate would mean having to
change models from time to time, which may be tedious in an operational setting.
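The seasonal split discussed above could be organized as one model per season with a dispatcher that selects the model from the month of the forecast. The sketch below is hypothetical: the April to September definition of summer, the single-feature linear model standing in for the actual forecasting models, and all names are assumptions.

```python
SUMMER_MONTHS = {4, 5, 6, 7, 8, 9}  # assumption: April to September counts as summer


def season_of(month):
    return "summer" if month in SUMMER_MONTHS else "winter"


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


class SeasonalModel:
    """Holds one independently trained model per season."""

    def __init__(self):
        self.params = {}

    def fit(self, months, xs, ys):
        for season in ("summer", "winter"):
            idx = [i for i, m in enumerate(months) if season_of(m) == season]
            self.params[season] = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        return self

    def predict(self, month, x):
        # Dispatch to the model trained on the matching season.
        slope, intercept = self.params[season_of(month)]
        return slope * x + intercept


# Toy data (assumption): the response to the feature is four times stronger in summer.
months = [m for m in range(1, 13) for _ in range(5)]
xs = [float(i % 5) for i in range(len(months))]
ys = [(2.0 if season_of(m) == "summer" else 0.5) * x for m, x in zip(months, xs)]

model = SeasonalModel().fit(months, xs, ys)
print(model.predict(7, 10.0), model.predict(1, 10.0))
```

A single model fitted to the pooled toy data would have to compromise between the two slopes, which mirrors the trade-off described above; the price is that the operational system has to keep track of which model is active.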
Chapter 7
In this work, we have compared time series techniques and machine learning
techniques for solar energy forecasting across five different sites in Sweden. We find
that employing time series models is a complex procedure due to the non-stationary
energy time series. In contrast, machine learning techniques were more
straightforward to implement. In particular, we find that the Artificial Neural
Networks and Gradient Boosting Regression Trees perform best on average across
all sites.
This study has compared the different models on a general level. For further
research, we suggest continuing to compare different machine learning techniques in
depth while using feature engineering approaches for numerical weather predictions.
Bibliography
[1] IRENA. Renewable power generation costs in 2017. Technical report,
International Renewable Energy Agency, Abu Dhabi, January 2018.
[2] Jose R. Andrade and Ricardo J. Bessa. Improving renewable energy forecasting
with a grid of numerical weather predictions. IEEE Transactions on Sustainable
Energy, 8(4):1571–1580, October 2017.
[3] Rich H. Inman, Hugo T.C. Pedro, and Carlos F.M. Coimbra. Solar forecasting
methods for renewable energy integration. Progress in Energy and Combustion
Science, 39(6):535–576, 2013.
[6] Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli, and
Rob J. Hyndman. Probabilistic energy forecasting: Global energy forecasting
competition 2014 and beyond. International Journal of Forecasting,
32(3):896–913, 2016.
[8] Peder Bacher, Henrik Madsen, and Henrik Aalborg Nielsen. Online short-term
solar power forecasting. Solar Energy, 83(10):1772–1783, 2009.
[9] Hugo T.C. Pedro and Carlos F.M. Coimbra. Assessment of forecasting
techniques for solar power production with no exogenous inputs. Solar Energy,
86(7):2017–2028, 2012.
[10] Federica Davò, Stefano Alessandrini, Simone Sperati, Luca Delle Monache,
Davide Airoldi, and Maria T. Vespucci. Post-processing techniques and principal
component analysis for regional wind power and solar irradiance forecasting.
Solar Energy, 134:327–338, 2016.
[11] Changsong Chen, Shanxu Duan, Tao Cai, and Bangyin Liu. Online 24-h solar
power forecasting based on weather type classification using artificial neural
network. Solar Energy, 85(11):2856–2870, 2011.
[12] Caroline Persson, Peder Bacher, Takahiro Shiga, and Henrik Madsen. Multi-site
solar power forecasting using gradient boosted regression trees. Solar Energy,
150:423–436, 2017.
[13] J. Shi, W. J. Lee, Y. Liu, Y. Yang, and P. Wang. Forecasting power
output of photovoltaic systems based on weather classification and support vector
machines. IEEE Transactions on Industry Applications, 48(3):1064–1069, May
2012.
[14] Peter J Brockwell. Introduction to Time Series and Forecasting. Springer Texts
in Statistics. Springer, 3rd edition, 2016.
[18] D. Randall Wilson and Tony R. Martinez. The general inefficiency of batch
training for gradient descent learning. Neural Networks, 16(10):1429–1451,
2003.
[19] Christoph Bergmeir and José M. Benítez. Neural networks in R using the
Stuttgart neural network simulator: RSNNS. Journal of Statistical Software,
46(7):1–26, 2012.
[22] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams,
Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel,
the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca
Scrucca, Yuan Tang, Can Candan, and Tyler Hunt. caret: Classification and
Regression Training, 2017. R package version 6.0-78.
[24] Jay A Kratochvil, William Earl Boyson, and David L King. Photovoltaic array
performance model. Technical report, Sandia National Laboratories, 2004.
[25] Peder Bacher. Short-term solar power forecasting. Master’s thesis, Technical
University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark, 2008.
[27] Feng Pan, Tim Converse, David Ahn, Franco Salvetti, and Gianluca Donato.
Feature selection for ranking using boosted trees. In Proceedings of the 18th
ACM conference on Information and knowledge management, pages 2025–2028.
ACM, 2009.
[28] Christoph Bergmeir, Rob J. Hyndman, and Bonsoo Koo. A note on the
validity of cross-validation for evaluating autoregressive time series prediction.
Computational Statistics & Data Analysis, 120:70–83, 2018.
[29] Allan H Murphy. Skill scores based on the mean square error and their
relationships to the correlation coefficient. Monthly Weather Review,
116(12):2417–2424, 1988.
[31] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani.
Regularization paths for Cox's proportional hazards model via coordinate descent.
Journal of Statistical Software, 39(5):1–13, 2011.
[33] Greg Ridgeway with contributions from others. gbm: Generalized Boosted
Regression Models, 2017. R package version 2.1.3.
TRITA-SCI-GRU 2018:214
www.kth.se