2017 Gri-20362
FACULTY OF SCIENCES
SCHOOL OF INFORMATICS
DEPARTMENT OF COMPUTER SCIENCE
«KNOWLEDGE, DATA AND SOFTWARE TECHNOLOGIES»
Master Thesis:
Praxitelis-Nikolaos Kouroupetroglou
UID:629
Supervisor:
Grigorios Tsoumakas, Assistant Professor
September 2017
Abstract
Economic growth in the modern world depends directly on the availability of electric energy, especially
because most societies, industries, and economies depend almost entirely on its use. The availability of
a source of continuous, cheap, and reliable energy is of foremost economic importance. Electric load
forecasting is an important tool used to ensure that the energy supplied by utilities meets the consumers'
needs; to this end, a staff of trained personnel is needed to carry out this specialized function. Load
forecasting is commonly defined as the science, or art, of predicting the future load on a given
system for a specified period of time ahead. These predictions may be for just a fraction of an hour
ahead for operational purposes, or as much as 20 years into the future for planning purposes.
The purpose of this master thesis is to create accurate machine learning models for short-term electric
load demand forecasting (one day ahead) for the Greek electric grid.
The load data used in the thesis originate from IPTO, the Independent Power Transmission
Operator. Meteorological features were taken from the Dark Sky API. To compare the thesis predictions
against the system's, next-day load predictions from OoEM, the Operator of Electricity
Market in Greece, were used.
The datasets were downloaded from their sources, preprocessed and cleaned, and six machine learning
algorithms were applied. After four different types of experiments, a combination of SVM, XGBoost and
Model Trees models achieved a 2.4% prediction error, whereas OoEM's predictions give a 2.53% error.
The thesis models therefore improve prediction performance, reducing the prediction error by 4.74%
relative to OoEM's predictions.
The code of this master thesis can be found in a github repo, and the results and visualizations
are available on shinyapps.
Praxitelis-Nikolaos Kouroupetroglou
September - 2017
Περίληψη
Economic growth in the modern world depends directly on the availability of electric energy, especially
because most societies, industries and economies depend almost entirely on its use. The availability of
a continuous, cheap and reliable source of energy is of primary economic importance. Electric load
forecasting is an important tool used to ensure that the energy supplied by utilities meets the consumers'
needs. To this end, specialized personnel is required to carry out this specialized function. Load
forecasting is defined as the science, or art, of predicting the future load on a given system for a
specified period of time. These predictions may be for just a fraction of an hour ahead for operational
purposes, or as much as 20 years into the future for planning purposes.
The purpose of this thesis is the creation of accurate models for short-term forecasting of electric
load demand (up to one day ahead) for the Greek electric grid.
The electric load data originate from ΑΔΜΗΕ, the Independent Power Transmission Operator (IPTO).
Meteorological data were drawn from the Dark Sky API and, in order to compare the predictions of the
thesis, predictions from ΛΑΓΗΕ, the Operator of Electricity Market (OoEM), were used.
The datasets were retrieved from their sources and preprocessed, and six machine learning algorithms
were used. After four different types of experiments, the combination of SVM, XGBoost and Model Trees
models achieved a 2.4% prediction error, whereas ΛΑΓΗΕ gives a 2.53% prediction error. Thus, the models
of this thesis improve prediction performance, reducing the prediction error by 4.74% compared to
ΛΑΓΗΕ's predictions.
The code of the thesis can be found in the github repository, and the results and visualizations of the
analyses are available on shinyapps.
Acknowledgments
At first, I would like to thank my parents for their support, their patience, and their love all these years,
and for aiding me in my efforts. I want to thank my master thesis supervisor, Dr. Grigorios Tsoumakas, who
guided me through the master thesis and assisted me whenever I faced issues, problems and dead ends. Moreover,
I want to thank the other professors whom I met and in whose postgraduate courses I enrolled: Dr. Ioannis Vlahavas,
Dr. Eleutherios Aggelis, Dr. Athina Vakali, Dr. Nikolaos Vaseiliadis, Dr. Anastasios Gounaris, Dr.
Apostolos Papadopoulos and Dr. Konstantinos Tsichlas. Their postgraduate courses helped me explore
and understand the world of “Data Science” from many views and aspects. Lastly, I want to thank all those
who provide rich and useful information on Internet sites such as stackoverflow, stats.stackexchange,
datascience.stackexchange, tex.stackexchange, kdnuggets, analyticsvidhya.com,
machinelearningmastery, r-bloggers and, of course, wikipedia. If these sites had not existed, I could
not have finished this thesis.
Contents
Abstract
Περίληψη
Acknowledgments
3.2.4 SVM
3.2.5 Random Forests
3.2.6 Rule-Based
3.2.7 k-Nearest Neighbors
3.2.8 XGBoost
3.3 Commercial Software
3.3.1 ETAP's load forecasting software
3.3.2 Aiolos Forecasting Studio
3.3.3 Electric Load Forecasting Using Artificial Neural Networks
3.3.4 Escoware demand-forecasting
3.3.5 SAS® Energy Forecasting
3.3.6 Statgraphics
6 Future Work
7 Appendix of Tables
7.1 Tables of Selected Features per target variable
7.2 Tables of Best Tuning Parameters per ML Algorithm and Target Variable with full features
7.2.1 List of Best SVM tuning Parameters per target variable with full features
7.2.2 List of Best KNN tuning Parameters per target variable with full features
7.2.3 List of Best Random Forest tuning Parameters per target variable with full features
7.2.4 List of Best Neural Networks tuning Parameters per target variable with full features
7.2.5 List of Best XGBoost tuning Parameters per target variable with full features
7.2.6 List of Best Model Trees tuning Parameters per target variable with full features
7.3 Tables of Best Tuning Parameters per ML Algorithm and Target Variable with Feature Selection
7.3.1 List of Best SVM tuning Parameters per target variable with feature selection
7.3.2 List of Best K-Nearest Neighbors tuning Parameters per target variable with feature selection
7.3.3 List of Best Random Forest tuning Parameters per target variable with feature selection
7.3.4 List of Best Neural Networks tuning Parameters per target variable with feature selection
7.3.5 List of Best XGBoost tuning Parameters per target variable with feature selection
7.3.6 List of Best Model Trees tuning Parameters per target variable with feature selection
Bibliography
List of Figures
1.1 Electric load forecasting with minimum updating cycle and maximum horizon per business need
1.2 Availability of climate factors, economics, and land use information for load forecasting
1.3 Classification of different types of electric load forecasting based on features and factors
1.4 Applications of different types of electric load forecasts in business needs
7.1 Intersection of all selected features for all target variables load.x
7.2 SVM best tuning parameters with full features
7.3 K-Nearest Neighbors best tuning parameters with full features
7.4 Random Forest best tuning parameters with full features
7.5 Neural Networks best tuning parameters with full features
7.6 XGBoost best tuning parameters with full features
7.7 Model Trees best tuning parameters with full features
Chapter 1
Electric Load Forecasting - Introduction
In this chapter, an introduction to the subject of this master thesis is given, together with a summary of
the contents that follow.
The reasons for predicting electric load demand are important, indeed vital, for an electricity
production company.
Table 1.1 presents, for each business need described above, its lead-time range: the "minimum
updating cycle", which shows how often the future load prediction must be refreshed, and the "max
horizon" of the forecasts, which shows how far into the future forecasts can be made.
Table 1.1: Electric load forecasting with minimum updating cycle and maximum horizon per business need

Business need | Minimum updating cycle | Max horizon
Purchasing and producing electric power | 1 hour | 10 years and above
Transmitting, transferring and distributing electric power | 1 day | 30 years
Managing and maintaining the electric power sources | 15 minutes | 2 weeks
Managing the daily electric load demand | 15 minutes | 10 years and above
Financial and marketing | 1 month | 10 years and above
As shown in table 1.1, electric load prediction is a useful tool for a company that produces
electric power, because in the modern world electric companies must provide an adequate amount of
electrical power for the needs of consumers. In general, electric load prediction is defined as the
methodology for forecasting the load over a specific duration/horizon. The maximum horizon can be
at most 20 years [2].
There is no formal or typical procedure for electric load forecasting that fits every type of electric
company and every time horizon. The forecasting depends on various factors, such as the means and
sources of electricity production of each company, the demand for electric load, climate factors,
economic conditions and human activity [1].
Climate factors are those based on meteorological features such as temperature, humidity, wind
speed, precipitation, etc. According to the scientific literature, temperature plays an important role
in electric load consumption, and many electric load forecasting systems use temperature as a feature.
However, temperature can only be used over a small horizon, for example 1-2 days of forecast, and it is
an unreliable factor beyond that [2].
The impact of human activity on electric load consumption can be captured by various features.
One type is calendar features, such as the day of the week or the month of the year. Other such
features concern economic factors, like the economic activity in urban areas or economic transactions.
For long-term forecasting, e.g. forecasting a year ahead, economic factors play a vital role in the
prediction. Moreover, this kind of forecasting can use features of human activity in rural areas, such
as agricultural activities and changes in rural land use [1].
The different types of electric load forecasting, based on the available data/features that affect the
forecast in conjunction with the prediction horizon, can be grouped into the following categories:
1. Very Short Term Load Forecasting (VSTLF). It is used for a small time window, a few hours
at most, and it utilizes the previous hourly loads of the current day. It cannot use
information from economic factors or land use, because these do not change within such a short
amount of time.
2. Short Term Load Forecasting (STLF). This type of forecasting covers horizons from 24 hours up
to 2 weeks. The difference from the previous forecasting method is that here temperature, and
climate/meteorological factors in general, play a vital role in the forecast. STLF can be used
for decision making, such as how much energy an electric company should purchase from other
electric companies in the near future.
3. Medium Term Load Forecasting (MTLF). This forecasting method is used for horizons of 1 month to 3
years. Temperature and other climate factors cannot be used directly, because their own forecasts
are unavailable at this horizon; instead, their behavior is simulated, and the simulated values are
used in MTLF load forecasting. MTLF uses economic factors, because their impact plays a role in
daily life and in the needs of consumers. The same applies to land-use change in rural areas.
4. Long Term Load Forecasting (LTLF). This forecasting is used for time windows of 3-10 years.
Simulation techniques are used to estimate the climate factors and the economic transactions/activities.
In all four categories of load forecasting, the model used for prediction describes the relationship
between the electric load and the features that affect it, such as climate factors, economic factors,
land use, etc. However, for load forecasting over long periods, as in MTLF and LTLF, climate and
economic features cannot be determined precisely that far ahead.
The following table 1.2 summarizes the availability of climate features, economic factors and land
use for load forecasting [2].
Table 1.2: Availability of climate factors, economics, and land use information for load forecasting
In addition, the following table 1.3 categorizes the different types of forecasting based on the
prediction horizon and the different kinds of business needs [2].
Table 1.3: Classification of different types of electric load forecasting based on features and factors
Based on the previous table, every type of forecasting can be related to the suitable features/factors
it requires. Furthermore, the different types of forecasting can be related to the different business
needs mentioned above; the following table 1.4 presents this relationship. Because many business needs
span various time periods/horizons, several of the previous forecasting techniques can be used for the
same business need [2].
Table 1.4: Applications of different types of electric load forecasts in Business Needs
Chapter 2
Theoretical Background and Data Science Development Platforms
Machine learning tasks are typically classified into broad categories, depending on the nature of the
learning signal available to the system:
• Supervised learning: The computer is presented with example inputs and their desired outputs,
and the goal is to learn a general rule that maps inputs to outputs.
• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in
data) or a means towards an end (feature learning).
Another categorization of machine learning tasks arises when one considers the desired output of a
machine-learned system:
• In classification, inputs are divided into two or more classes, and the learner must produce a
model that assigns unseen inputs to one or more (multi-label classification) of these classes. This
is typically tackled in a supervised way. Spam filtering is an example of classification, where the
inputs are email (or other) messages and the classes are ”spam” and ”not spam”.
• In regression, also a supervised problem, the outputs are continuous rather than discrete.
• In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task.
Support Vector Machines can be applied not only to classification problems but also to the case of
regression. The method still contains all the main features that characterize the maximum-margin
algorithm: a non-linear function is learned by a linear learning machine mapping into a high-dimensional,
kernel-induced feature space, while the capacity of the system is controlled by parameters that do not
depend on the dimensionality of the feature space.
In the same way as with the classification approach, there is motivation to seek and optimize the
generalization bounds given for regression. These rely on defining a loss function that ignores errors
situated within a certain distance of the true value. This type of function is often called an
ε-insensitive loss function. Figure 2.1 shows an example of a one-dimensional linear regression
function with an ε-insensitive band. The slack variables measure the cost of the errors on the training
points; these are zero for all points inside the band [36, 37].
Figure 2.2 shows a similar situation for the non-linear regression case.
One of the most important ideas in Support Vector Classification and Regression is that presenting
the solution by means of a small subset of training points gives enormous computational advantages.
Using the ε-insensitive loss function we ensure the existence of a global minimum and, at the same time,
the optimization of a reliable generalization bound. In SVM regression, the input is first mapped onto an
m-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed
in this feature space. Using mathematical notation, the linear model (in the feature space) f(x, w) is
given by:

f(x, w) = ∑_{j=1}^{m} w_j g_j(x) + b

where g_j(x), j = 1, ..., m denotes a set of nonlinear transformations, and b is the “bias” term. Often
the data are assumed to be zero mean (this can be achieved by preprocessing), so the bias term is
dropped [36, 37].
Figure 2.3: Non-linear mapping of input examples into a high-dimensional feature space (classification
case; the same holds for regression).
The quality of the estimation is measured by the loss function L(y, f(x, w)). SVM regression uses a
new type of loss function, the ε-insensitive loss function proposed by Vapnik:

L(y, f(x, w)) = 0,                  if |y − f(x, w)| ≤ ε
L(y, f(x, w)) = |y − f(x, w)| − ε,  otherwise
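As a minimal illustration (a sketch in R, the language of the thesis experiments; not code from the thesis itself), the ε-insensitive loss can be written directly:

```r
# epsilon-insensitive loss: zero inside the epsilon tube, linear outside it
eps_insensitive_loss <- function(y, y_hat, eps = 0.1) {
  pmax(abs(y - y_hat) - eps, 0)
}

# errors of 0.05 and 0.30 with eps = 0.1 give losses of 0 and 0.2
eps_insensitive_loss(c(1.00, 1.00), c(1.05, 1.30), eps = 0.1)
```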
The empirical risk is:

R_emp(w) = (1/n) ∑_{i=1}^{n} L(y_i, f(x_i, w))
SVM regression performs linear regression in the high-dimensional feature space using the ε-insensitive
loss and, at the same time, tries to reduce model complexity by minimizing ||w||². This can be described
by introducing non-negative slack variables ξ_i, ξ_i*, i = 1, ..., n, which measure the deviation of
training samples outside the ε-insensitive zone. SVM regression is thus formulated as the minimization
of the following functional [36, 37]:

min (1/2)||w||² + C ∑_{i=1}^{n} (ξ_i + ξ_i*)

subject to:
y_i − f(x_i, w) ≤ ε + ξ_i*
f(x_i, w) − y_i ≤ ε + ξ_i
ξ_i, ξ_i* ≥ 0, i = 1, ..., n
This optimization problem can be transformed into the dual problem, whose solution is given by:

f(x) = ∑_{i=1}^{n_SV} (α_i − α_i*) K(x, x_i), with 0 ≤ α_i, α_i* ≤ C,

where n_SV is the number of support vectors and the kernel function is

K(x, x_i) = ∑_{j=1}^{m} g_j(x) g_j(x_i)
It is well known that SVM generalization performance (estimation accuracy) depends on a good setting
of the meta-parameters C and ε, and of the kernel parameters. The problem of optimal parameter
selection is further complicated by the fact that SVM model complexity (and hence its generalization
performance) depends on all three. Existing software implementations of SVM regression usually treat
the SVM meta-parameters as user-defined inputs. Selecting a particular kernel type and its parameters
is usually based on application-domain knowledge and should also reflect the distribution of the
input (x) values of the training data.
Figure 2.4: Performance of Support Vector Machine in regression case. The epsilon boundaries are given
with the green lines. Blue points represent data instances.
Parameter C determines the trade-off between the model complexity (flatness) and the degree to which
deviations larger than ε are tolerated in the optimization formulation; for example, if C is too large
(infinity), the objective is to minimize the empirical risk only, without regard to the model-complexity
part of the optimization formulation.
Parameter ε controls the width of the ε-insensitive zone used to fit the training data. The value of ε
can affect the number of support vectors used to construct the regression function: the bigger ε, the
fewer support vectors are selected. On the other hand, bigger ε-values result in more “flat” estimates.
Hence, both C and ε affect model complexity (but in different ways).
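To make the role of these two meta-parameters concrete, the sketch below fits an ε-SVR on synthetic data with the e1071 R package (one of several SVM implementations; the call and the values are illustrative, not the thesis's settings) and varies ε to show its effect on the number of support vectors:

```r
library(e1071)  # provides svm() with type = "eps-regression"

set.seed(42)
x <- seq(0, 4 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.15)   # noisy one-dimensional regression target

# two fits differing only in epsilon: a wider tube needs fewer support vectors
fit_narrow <- svm(y ~ x, type = "eps-regression", kernel = "radial",
                  cost = 1, epsilon = 0.01)
fit_wide   <- svm(y ~ x, type = "eps-regression", kernel = "radial",
                  cost = 1, epsilon = 0.50)

fit_narrow$tot.nSV   # many support vectors: most points fall outside the tube
fit_wide$tot.nSV     # few support vectors: most points fall inside the tube
```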
Figure 2.5: Performance of Support Vector Machine in regression case. The epsilon boundaries are given
with the green lines. Blue points represent data instances.
(Additional instances are introduced in this case; after being supplied to the model, they lie inside the
epsilon band. The regression function has changed; however, this causes some points that were initially
inside the interval to fall outside it.)
Figure 2.6: Performance of Support Vector Machine in regression case. The epsilon boundaries are given
with the green lines. Blue points represent data instances.
One of the advantages of Support Vector Machines, and of Support Vector Regression as part of the family,
is that they can be used to avoid the difficulties of using linear functions in the high-dimensional
feature space: the optimization problem is transformed into a dual convex quadratic programme. In the
regression case, the loss function is used to penalize only errors that are greater than the threshold ε.
Such loss functions usually lead to a sparse representation of the decision rule, giving significant
algorithmic and representational advantages [36, 37].
Figure 2.7: Histogram showing the accuracy of 1000 decision trees. While the average accuracy of
decision trees is 67.1%, the random forest model has an accuracy of 72.4%, which is better than 99% of
the decision trees.
Random forests are widely used because they are easy to implement and fast to compute. Unlike
most other models, a random forest can be made more complex (by increasing the number of trees) to
improve prediction accuracy without the risk of over-fitting.
A random forest is an example of an ensemble, which is a combination of predictions from different
models. In an ensemble, predictions could be combined either by majority-voting or by taking averages.
Below in figure 2.8 is an illustration of how an ensemble formed by majority-voting yields more accurate
predictions than the individual models it is based on [38, 39]:
Figure 2.8: Example of three individual models attempting to predict 10 outputs of either Blue or Red.
The correct predictions are Blue for all 10 outputs. An ensemble formed by majority voting based on the
three individual models yields the highest prediction accuracy.
A random forest is an ensemble of multiple decision trees; it leverages the “wisdom of the crowd” and is
often more accurate than any individual decision tree. This is because each individual model has its own
strengths and weaknesses in predicting certain outputs. As there is only one correct prediction but many
possible wrong predictions, individual models that yield correct predictions tend to reinforce each other,
while wrong predictions cancel each other out.
For this effect to work however, models included in the ensemble must not make the same kind of
mistakes. In other words, the models must be uncorrelated. This is achieved via a technique called
bootstrap aggregating (bagging) [38, 39].
In random forest, bagging is used to create thousands of decision trees with minimal correlation.
In bagging, a random subset of the training data is selected to train each tree. Furthermore, the model
randomly restricts the variables which may be used at the splits of each tree. Hence, the trees grown are
dissimilar, but they still retain certain predictive power [38, 39].
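A minimal bagging sketch with the randomForest R package (the data frame and variable names here are illustrative stand-ins, not the thesis's data):

```r
library(randomForest)

set.seed(1)
n <- 500
train <- data.frame(temp = runif(n, 0, 35), hour = sample(0:23, n, replace = TRUE))
train$load <- 4000 + 120 * train$hour - 35 * train$temp + rnorm(n, sd = 150)

# bagging plus random feature restriction at each split (mtry) decorrelates the trees
rf <- randomForest(load ~ temp + hour, data = train,
                   ntree = 1000,  # number of bootstrapped trees
                   mtry  = 1)     # variables tried at each split
rf                                # prints out-of-bag MSE and % variance explained
```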
Random forests are considered “black boxes”, because they comprise randomly generated decision trees
and are not guided by explicit guidelines in their predictions. We do not know exactly how the model
came to the conclusion that, say, a violent crime would occur at a specific location; we only know that
a majority of the 1000 decision trees thought so. This may raise ethical concerns when random forests
are used in areas like medical diagnosis [38, 39].
Random forests are also unable to extrapolate predictions for cases that have not been previously
encountered. For example, given that a pen costs $2, 2 pens cost $4, and 3 pens cost $6, how much
would 10 pens cost? A random forest would not know the answer if it had not encountered a situation
with 10 pens, but a linear regression model would be able to extrapolate a trend and deduce the answer
of $20 [38].
Choosing the optimal value for k is best done by first inspecting the data. In general, a large k value
is more precise, as it reduces the overall noise; the compromise is that the distinct boundaries within
the feature space become blurred. Cross-validation is another way to determine a good k value
retrospectively, by using an independent data set to validate it. For most datasets the optimal k is
10 or more, which produces much better results than 1-NN [40].
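As an illustration of selecting k by cross-validation, the sketch below uses the caret R package on synthetic data (a sketch only; the thesis does not prescribe this exact setup):

```r
library(caret)

set.seed(7)
n <- 300
d <- data.frame(temp = runif(n, 0, 35))
d$load <- 5000 - 40 * d$temp + rnorm(n, sd = 100)

# 10-fold cross-validation over a grid of k values; caret keeps the k with lowest RMSE
fit <- train(load ~ temp, data = d, method = "knn",
             trControl = trainControl(method = "cv", number = 10),
             tuneGrid  = expand.grid(k = c(1, 5, 10, 15, 20)))
fit$bestTune   # the chosen k
```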
There are several canonical activation functions. For instance, the sigmoid activation function:
σ(x) = 1 / (1 + e^(−x)).
We can form a network by chaining these nodes together. Usually this is done in layers - one node
layer’s outputs are connected to the next layer’s inputs.
Our goal is to train a network using labeled data so that we can then feed it a set of inputs and it
produces the appropriate outputs for unlabeled data. We can do this because we have both the input xi
and the desired target output yi in the form of data pairs. Training in this case involves learning the correct
edge weights to produce the target output given the input. The network and its trained weights form a
function (denoted h) that operates on input data. With the trained network, we can make predictions
given any unlabeled test input [41].
Figure 2.13: Training and testing in the neural network context. Note that a multilayer network is shown
here.
In supervised training, both the inputs and the outputs are provided. The network processes the
inputs and compares its resulting outputs against the desired outputs. Errors are then propagated back
through the system, causing it to adjust the weights which control the network. This process
occurs over and over as the weights are continually tweaked. The set of data which enables the training
is called the “training set”. During the training of a network, the same set of data is processed many
times as the connection weights are continually refined.
If a network simply cannot solve the problem, the designer then has to review the inputs and outputs, the
number of layers, the number of elements per layer, the connections between the layers, the summation,
transfer and training functions, and even the initial weights themselves. The changes required to create
a successful network constitute a process wherein the “art” of neural networking occurs [41].
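A minimal supervised-training sketch with the nnet R package (a single-hidden-layer network; the synthetic data and parameter values are illustrative only):

```r
library(nnet)

set.seed(3)
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2 + rnorm(n, sd = 0.05)

# size = hidden units; linout = TRUE gives a linear output node for regression
net <- nnet(y ~ x1 + x2, data = d, size = 8, linout = TRUE,
            decay = 1e-3, maxit = 500, trace = FALSE)
pred <- predict(net, d)
mean((d$y - pred)^2)   # training mean squared error
```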
XGBoost supports the main variations of the gradient boosting technique [42]:
• The Gradient Boosting algorithm, also called gradient boosting machine, including the learning rate.
• Stochastic Gradient Boosting, with sub-sampling at the row, column and column-per-split levels.
XGBoost algorithm was engineered for efficiency of compute time and memory resources. A design
goal was to make the best use of available resources to train the model. Some key algorithm implemen-
tation features include [42]:
• Continued Training so that you can further boost an already fitted model on new data.
XGBoost dominates structured or tabular datasets on classification and regression predictive mod-
eling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle
competitive data science platform [42].
The XGBoost library implements the gradient boosting decision tree algorithm. This algorithm goes
by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient
boosting or gradient boosting machines [42].
Gradient Boosting is an ensemble technique where new models are added to correct the errors made
by existing models. Models are added sequentially until no further improvements can be made. A popular
example is the AdaBoost algorithm that weights data points that are hard to predict [44].
Gradient boosting is an approach where new models are created that predict the residuals or errors of
prior models and then added together to make the final prediction. It is called gradient boosting because it
uses a gradient descent algorithm to minimize the loss when adding new models. This approach supports
both regression and classification predictive modeling problems [44].
The idea of boosting came out of the idea of whether a weak learner can be modified to become
better. The first realization of boosting that saw great success in application was Adaptive Boosting or
AdaBoost for short. AdaBoost works by weighting the observations, putting more weight on instances
that are difficult to classify and less on those already handled well. New weak learners are added
sequentially, and they focus their training on the more difficult patterns. Predictions are made by
majority vote of the weak learners' predictions, weighted by their individual accuracy. The most
successful early form of the AdaBoost algorithm targeted binary classification problems. AdaBoost and
related algorithms were later recast in a statistical framework, which treats boosting as a numerical
optimization problem whose objective is to minimize the loss of the model by adding weak learners
using a gradient-descent-like procedure [44].
This class of algorithms was described as a stage-wise additive model, because one new weak learner
is added at a time while the existing weak learners in the model are frozen and left unchanged [44].
The generalization allowed arbitrary differentiable loss functions to be used, expanding the technique
beyond binary classification problems to support regression, multi-class classification and more [44].
Gradient boosting involves three elements:
1. A loss function to be optimized.
2. A weak learner to make predictions.
3. An additive model that adds weak learners to minimize the loss function.
Basic gradient boosting can be further improved with regularization techniques:
• Shrinkage
• Random sampling
• Penalized learning
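A minimal XGBoost regression sketch in R with the xgboost package (the parameter values are illustrative, not the thesis's tuned settings); eta implements shrinkage, and subsample/colsample_bytree implement the random sampling listed above:

```r
library(xgboost)

set.seed(11)
n <- 600
X <- matrix(runif(2 * n), ncol = 2, dimnames = list(NULL, c("temp", "hour")))
y <- 3 * X[, "temp"] - 2 * X[, "hour"] + rnorm(n, sd = 0.1)

# eta = shrinkage (learning rate); subsample / colsample_bytree = stochastic boosting
bst <- xgboost(data = X, label = y, nrounds = 200,
               objective = "reg:squarederror", eta = 0.1, max_depth = 4,
               subsample = 0.8, colsample_bytree = 0.8, verbose = 0)
head(predict(bst, X))
```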
Notice that the if-then statements generated by a tree define a unique route to one terminal node for
any sample. A rule is a set of if-then conditions (possibly created by a tree) that have been collapsed into
independent conditions. For the example above, there would be three rules [45]:
The next step is to substitute the results at the leaves with linear regression models, creating
Rule-Based Model Trees, where each terminal node has rules that lead to a specific linear regression
formula. This is very useful because it yields predictive models with high accuracy, stability and ease
of interpretation that map non-linear relationships quite well, and they are adaptable to many kinds of
problems. For example, the following figure shows the result from the Cubist library: it creates 2 rules,
and each rule ends with a linear regression function [45]:
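A minimal rule-based model tree sketch with the Cubist R package (the same library used later in the thesis experiments; the data here are synthetic and illustrative):

```r
library(Cubist)

set.seed(5)
n <- 400
d <- data.frame(temp = runif(n, 0, 35), hour = sample(0:23, n, replace = TRUE))
d$load <- ifelse(d$hour < 7, 3000 - 10 * d$temp, 5000 - 45 * d$temp) +
  rnorm(n, sd = 80)

# cubist() builds if-then rules whose leaves are linear regression models
m <- cubist(x = d[, c("temp", "hour")], y = d$load, committees = 1)
summary(m)   # prints the rules and the per-rule linear formulas
```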
The energy consumption behavior on different days of the week may differ, for many reasons. For
instance, office buildings may be closed during the weekend, which causes lower loads than on work
days. People may get up later in the morning during the weekend, which shifts the morning peak one or
two hours later than on a normal work day; thus seven groups, one per day of the week, can be created.
On a daily basis, hours can be grouped into rush hours and non-rush hours.
Forecasting the loads of special days, e.g. holidays (national and religious), has been a challenging
issue in STLF, not only because the load profiles may vary across different holidays, and across the
same holiday in different years, but also due to the limited data history.
When comparing forecast methods on a single data set, the MAE is popular as it is easy to understand
and compute.
The percentage error is given by p_i = 100 e_i / y_i. Percentage errors have the advantage of being
scale-independent, and so are frequently used to compare forecast performance between different data
sets. The most commonly used measure is the Mean Absolute Percentage Error, MAPE = mean(|p_i|).
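The MAPE used throughout the experiments can be computed with a one-line helper (a sketch; the thesis does not show its exact implementation):

```r
# mean absolute percentage error, in percent
mape <- function(actual, predicted) {
  mean(abs(100 * (actual - predicted) / actual))
}

mape(actual = c(5000, 5200, 4800), predicted = c(5100, 5150, 4700))  # ~1.68
```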
R is an open-source high-level programming language and a statistics and graphics development
environment supported by the R Foundation for Statistical Computing. The R language is widely used by
statisticians, data miners and data analysts, and its popularity has increased over the years. R is a GNU
project whose source code is mainly written in C and Fortran, and it is available for free under the GNU
General Public License. It has a command-line environment as well as a variety of graphical environments
(e.g. R-Studio). R's predecessor is named S, which was initially developed at Bell Labs. R was created by
Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand, and was named after the first
letter of its creators' names. Its development began in 1992, with an initial release in 1995; the first
stable version appeared in 2000. R and its libraries implement a wide range of statistical and graphical
functions, including linear and nonlinear models, classical statistical tests, time series analysis,
machine learning classification models, clustering, etc. It can easily be extended because it is an
open-source project, and its community is particularly active in contributing improvements to its
libraries. [46]
Python is a widely known high-level language, created by Guido van Rossum and released in 1991.
Its design emphasizes simplicity and legibility of the code, using whitespace indentation to delimit
code blocks instead of the brackets used in other programming languages. The language has also adopted
conventions that allow developers to write code with fewer lines than C++ or Java. These features make it
suitable for easy-to-read code at small or large scale. [50] Python provides several libraries for data
analysis, some of which are:
• Scikit-learn:
Scikit-learn is a free Python library for data science and machine learning; its main features are
classification, regression, clustering, gradient boosting, k-means and DBSCAN, and it is designed
to work with other Python libraries (mainly NumPy and SciPy) for scientific and numerical computations [51].
• Pandas:
Pandas is a Python library for data management and analysis. Specifically, it provides suitable data
structures and methods for managing multidimensional arrays and time series. It is released under the
BSD license and is free software [52].
• Statsmodels:
Statsmodels is a Python library that allows users to explore data, estimate different statistical
models and apply statistical tests. It is a suitable library for descriptive statistics and descriptive
graphs, and it is part of the SciPy-based ecosystem of libraries [53].
• Anaconda:
Anaconda, created and promoted by Continuum Analytics, is a distribution of open-source libraries
for both Python and R, suitable for large-scale data analysis and processing, forecasting and
computing. Its purpose is to collect the appropriate libraries for these functions into one simplified
environment for managing large data. Anaconda aligns with the Open Data Science movement, relying
on open-source communities for better, higher-quality data analysis; it offers over 720 libraries for
data analysis and visualization, machine learning, deep learning and the management of big-data
resources. It includes the appropriate libraries for Apache Hadoop and Spark to handle big data from
Python, as well as libraries for machine learning, deep learning and image processing such as theano,
tensorflow, neon, h2o, keras, etc. [54]
RapidMiner is a commercial software product and a complete platform for machine learning, deep
learning, text analysis and predictive analytics. It is used for commercial, business, research and
educational purposes. It includes all the machine learning processes used for data preparation, data
visualization, verification of results and optimization of the data analysis. It is available in a free
version, but with capabilities limited to 1 logical CPU and 10000 training examples. The commercial
version provides all the software features at an initial price of $2500. The initial release was named
YALE (Yet Another Learning Environment) and was developed in 2001 by Ralf Klinkenberg, Ingo Mierswa
and Simon Fischer in the Artificial Intelligence Lab of the Technical University of Dortmund. In 2006,
its development continued under the company Rapid-I, founded by Ingo Mierswa and Ralf Klinkenberg.
RapidMiner has by now an enormous number of downloads of its free version and over 250,000 commercial
users, including BMW, Cisco, GE and Samsung [55].
Orange is an open-source software package for data visualization, machine learning and knowledge mining
from data. Its main features are visual programming and a front end for exploratory data analysis and
data visualization. Its development began at the University of Ljubljana in 1997, and it is currently at
version 3.4. It is based on Python, Cython, C++ and C. It is cross-platform software and follows the GNU
General Public License. [56]
KNIME (the Konstanz Information Miner) is an open-source software package for data analysis, data
visualization, data mining and machine learning. Its graphical environment is developed so that a user
can choose the tools, functions and transformations to apply, and the order of the data-processing
steps, without writing any code. This provides user-friendly usability for analyzing datasets. KNIME is
mostly used for pharmaceutical research, customer data analysis (CRM), business intelligence for
forecasting future assets, and data analysis and visualization models [57].
Weka is a free software package for machine learning and data analysis written in Java. It was developed
by the University of Waikato in New Zealand and complies with the GNU General Public License. It
includes a set of visualization tools for data analysis, prediction models, data pre-processing, and a
user-friendly interface. Historically, Weka was first developed at the University of Waikato in 1993
using Tcl/Tk and Makefiles. In 1997, development of a Java version of Weka began. In 2005, the software
received the SIGKDD Data Mining and Knowledge Discovery Service Award, and in 2006 Pentaho Corporation
licensed Weka as a component for business intelligence, data mining and predictive analytics. By 2011,
it had received 2,487,213 downloads from sourceforge.net [47].
SPSS is an IBM software product for statistical analysis. It was originally developed by SPSS Inc.,
which was acquired by IBM in 2009. SPSS is mainly used in areas such as marketing, health sciences,
social sciences, text and sentiment analysis, knowledge mining and data analysis. Its main functions are
descriptive statistics (variable frequencies, cross tabulation, ratio statistics), bivariate statistics
(correlations, ANOVA, non-parametric tests), prediction with multiple linear regression, factor
analysis, clustering analysis and discriminant analysis, as well as visualization of data and
examples. [48]
Microsoft Excel is a cross-platform spreadsheet application for Windows, macOS and Android and one of
the packages of the Microsoft Office software suite. Its features include various computations on
spreadsheet data, graphical depictions and representations, and script programming with Visual Basic
for Applications on spreadsheet data. Its key feature is the use of spreadsheet cells and columns for
data management, computational operations and graphical representations. Microsoft Excel includes a
large set of statistical and econometric functions and can display graphs, histograms, etc. Its first
version appeared in 1987, and the latest version, Excel 2016, was released in its stable and final form
in April 2016. An easy-to-use Microsoft Excel tool from the 2010, 2013 and 2016 editions is Power Pivot,
whose main goal is the development of linear regression models on spreadsheet data; it mainly uses DAX
(Data Analysis Expressions) as the primary language within the tool to create the models [49].
Chapter 3
Scientific Review and Commercial Software
In this chapter, various techniques from statistics to machine learning found in the scientific literature
will be presented.
of Puget Sound Power and Light Company from November 1988 to January 1989. Artificial neural
networks were also used in the above-mentioned survey of Pacific Gas and Electric Company [3], where
the neural network model was compared with one developed by regression analysis on the same training
data from 1986 to 1990. It was shown that artificial neural networks exhibit greater prediction
accuracy than regression analysis. The neural network model used was a Back Propagation (BPN) model.
An extension of the previous research was based on the use of a radial basis function neural network
(RBFN) [9], applied to the same company and the same data set. The new algorithm showed better
performance and accuracy than the BPN model, aided by the use of calendar variables for holidays etc.
For Greece, there have been remarkable studies using clusters of neural networks for short-term
forecasting of electricity consumption [10], and other similar investigations using artificial
intelligence techniques [11–13].
3.2.4 SVM
Support Vector Machines are widely used as models for predicting electricity consumption. For example,
in 2001 the European Network on Intelligent Technologies (EUNITE) organized a competition for
medium-term forecasting of electricity consumption, in which the best model was SVM-based [21]. SVM
for MTLF was also used to predict power consumption in the Istanbul electric grid, using average daily
temperature data, calendar factors and daytime electricity load demand records [22].
3.2.5 Random Forests
A random forest model has also been proposed for short-term forecasting of the load time series, which
exhibits nonstationarity, heteroscedasticity, trend and multiple seasonal cycles. The main advantages of
the model are its ability to generalize, built-in cross-validation and low sensitivity to parameter
values. The proposed forecasting model was applied to historical load data in Poland, and its
performance was compared with alternative models such as CART, ARIMA, exponential smoothing and neural
networks [27].
3.2.6 Rule-Based
Rule-based methods are also applied to the field of short-term load forecasting, as the scientific
literature verifies. For example, a study was made to investigate the possibility of applying symbolic
data mining methods to the load prediction problem, by employing rules extracted from data for
classification or prediction. When new data arrive, the current rules are applied to them, and their
weights are composed by the inference mechanism into the resulting weight of a given prediction. The
presented approach is applied to the problem of short-term electric load forecasting [28].
Another rule-based approach is discussed in an IEEE survey [29]. That survey discussed the formulation
of rules and their application in load forecasting. Rules were used to classify the load forecast
parameters into weather-sensitive and non-weather-sensitive categories. The rationale underlying the
development of rules for both the one-day and the seven-day forecast is presented. This exercise leads
to the identification and estimation of parameters related to the electric load, such as weather
variables, day types and seasons. Moreover, a self-learning process is described, which shows how the
rules governing the electric utility load can be updated [29].
3.2.8 XGBoost
Furthermore, XGBoost models can be found in the scientific literature on electric load forecasting.
For example, one study built a short-term power load extreme gradient boosting (XGBoost) model with
multi-information fusion, using the gradient boosting algorithm [32]. Another study used gradient
boosting models to compete in a Kaggle competition in 2012. The competition involved a hierarchical
load forecasting problem
pete in Kaggle competition in 2012. The competition involved a hierarchical load forecasting problem
for a US utility with 20 geographical zones. The available data consisted of the hourly loads for the
20 zones and hourly temperatures from 11 weather stations, for four and a half years. For each zone,
the hourly electricity load for nine different weeks needed to be predicted without having the location
of zones or stations. They used separate models for each hourly period, with component-wise gradi-
ent boosting to estimate each model using univariate penalized regression splines as base learners. The
models allow for the electricity demand to change with time-of-year, day-of-week, time-of-day, and on
public holidays, with the main predictors being current and past temperatures as well as past demand.
The team ranked 5th among 105 teams [33].
(Footnotes: 1. https://etap.com/product/load-forecasting-software; 2. Aiolos Forecast Studio;
3. Electric Load Forecasting Using Artificial Neural Networks)
historical data provided by the user. It works by minimizing the error of the predictions versus the
real values; when the error criterion is satisfied, the final model is given to the user.
This particular software is user-friendly and works only on Windows systems. It provides descriptive
statistics tools and uses calendar features (weekends, holidays, national days, etc.). It can utilize
ODBC database connections in order to retrieve information from databases. The horizon of prediction
ranges from short-term to long-term forecasting. Moreover, it has the ability to decompose time series
data into its components (trend, seasonality, cyclic).
3.3.6 Statgraphics
Statgraphics is a commercial software package for statistical analysis. It was developed in 1980 by
Dr. Neil Polhemus, professor of statistics at Princeton University. The current edition, Statgraphics
Centurion XVII, was released in 2014.
In terms of data analysis for electricity consumption, Statgraphics uses data structures in the form
of time series. Statgraphics has a plethora of time series visualization tools. It has tests for the
randomness of time series, for analyzing their oscillations, and for detecting any trend, seasonality,
periodicity, etc. In addition, it has mechanisms for descriptive data statistics and, more specifically
for time series of consumption data, autocorrelation functions to interpret how past data affect future
ones. Other tools, such as the periodogram, are useful for analyzing time series oscillations and
whether they maintain specific frequencies at regular intervals. It also has linear and non-linear
smoothing techniques for time series, especially those that carry enough noise to make the trend hard
to find reliably. It further offers techniques for decomposing time series into their basic components,
making it possible to find seasonal patterns and produce seasonally adjusted data. Finally, Statgraphics
has tools for forecasting time series, exploiting past data through various methods such as random walk,
moving averages, trend models, simple, linear and polynomial models, exponential smoothing, and
parametric ARIMA models. Statgraphics can also suggest to the user the model that minimizes the
prediction error of the time series.
Chapter 4
Application - ML Models for STLF in the Greek Electric Grid
4.2 Datasets
The datasets come from three sources.
The IPTO’s Loads Demand Datasets are in hourly format and span from 2010-10-01 to 2017-04-30
roughly 6.5 years of data.
We can see that from 00:00 to 07:00 (non-rush hours) load demand is decreasing, from 07:00 to 14:00
(rush hours) loads are increasing, from 14:00 to 18:00 (non-rush hours) the load is declining, from
18:00 to 21:00 loads are rising again, and finally from 21:00 to 00:00 loads are decreasing. This is a
typical load demand pattern during weekdays. The load demand trend and cycle may vary during weekends
and holidays, because electricity load demand depends on human behavior, and during weekends and
holidays human behavior changes, so the demand alters.
• Summary
Categorical variable, a summary of current weather conditions.
• Icon
Categorical variable, a more detailed version of the current weather conditions.
• Temperature
Continuous variable, the temperature at the selected location, measured in degrees Celsius.
• Dew Point
Continuous variable, the dew point at the selected location, measured in degrees Celsius.
• Humidity
Continuous variable, the humidity of the air at the selected location, measured as a percentage.
• Wind Speed
Continuous variable, the speed of the wind at the selected location, measured in meters per second.
• Wind Bearing
Continuous variable, the bearing of the wind at the selected location, measured in degrees.
• Visibility
Continuous variable, the visibility at the selected location, in kilometers.
• Cloud Cover
Continuous variable, the fraction of the sky covered by clouds at the selected location, measured as
a percentage.
• UV Index
Continuous variable, the intensity of the sun's radiation during the day at the selected location.
1. Fill the missing value with the mean of the hour-adjacent values of the same feature.
2. If those adjacent values are missing too, fill the missing value with the mean of the day-adjacent
values of the same feature.
3. If those values are also missing, copy one of the adjacent values of the same feature into the
missing value.
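A sketch of this three-step imputation in R (the helper and its name are hypothetical, not from the thesis; it assumes an hourly vector of one feature):

```r
# fill NA at position i from neighbors at a given offset (1 = hour, 24 = day)
fill_from_neighbors <- function(v, step) {
  for (i in which(is.na(v))) {
    lo <- if (i - step >= 1) v[i - step] else NA
    hi <- if (i + step <= length(v)) v[i + step] else NA
    if (!is.na(lo) && !is.na(hi)) v[i] <- (lo + hi) / 2  # mean of both neighbors
    else if (!is.na(lo)) v[i] <- lo                      # copy the one available
    else if (!is.na(hi)) v[i] <- hi
  }
  v
}

impute_feature <- function(v) {
  v <- fill_from_neighbors(v, step = 1)   # step 1: adjacent hours
  v <- fill_from_neighbors(v, step = 24)  # steps 2-3: adjacent days / copy one value
  v
}
```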
• isHoliday: categorical variable, indicating whether the day is a holiday (Greek national and religious
holidays).
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which
are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a feature importance
measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where
higher means more important. At every iteration, it checks whether a real feature has a higher
importance than the best of its shadow features (i.e. whether the feature has a higher Z score than
the maximum Z score of its shadow features) and constantly removes features which are deemed
highly unimportant.
3. Finally, the algorithm stops either when all features have been confirmed or rejected, or when it
reaches a specified limit of random forest runs.
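A minimal Boruta run in R (the Boruta package referenced above; X and y are illustrative stand-ins for the thesis's 1484-feature matrix and one load.x target):

```r
library(Boruta)

set.seed(9)
n <- 200
X <- data.frame(temp = runif(n), noise1 = runif(n), noise2 = runif(n))
y <- 10 * X$temp + rnorm(n, sd = 0.5)   # only temp is truly relevant

# shadow features + random forest importance, iterated until a decision is made
bor <- Boruta(x = X, y = y, maxRuns = 100)
getSelectedAttributes(bor)   # expected result: "temp"
```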
In this master thesis, the cases have a great number of features: 1484 features for 2394 cases. This
large feature set arose because a huge variety of meteorological and calendar information was used;
hence feature selection is integral to the experiments. To predict the short-term load, the loads were
separated by hour, so 24 target variables were created: load.0, load.1, ..., load.23.
For each of these loads, the Boruta feature selection algorithm was executed in order to find the relevant
features per target variable. As a result, the 1484 features were reduced to 180-220 per target
variable.
Because the number of selected features per load.x target variable is large, the intersection of the
selected features across all load.x targets is presented in table format at the end, in the appendix of
tables.
The load.x target variables were separated by hour, so 24 target variables were created: load.0, load.1, ..., load.23.
Thus, 24 models per machine learning algorithm were built, one to predict each target variable.
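The per-hour reshaping can be sketched as follows (assuming a long-format hourly data frame; the column names are illustrative, not the thesis's):

```r
# hypothetical long-format hourly loads: one row per (date, hour) pair
hourly <- data.frame(date = rep(as.Date("2017-01-01") + 0:2, each = 24),
                     hour = rep(0:23, times = 3),
                     load = runif(72, 4000, 7000))

# reshape to wide format: one row per day, 24 target columns load.0 ... load.23
wide <- reshape(hourly, idvar = "date", timevar = "hour",
                direction = "wide", v.names = "load")
names(wide)   # "date", "load.0", "load.1", ..., "load.23"
```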
The main parameters for the experiments are the following:
• Parameters: full features, feature selection, default ML model parameters, and grid search
Four experiments were performed in order to minimize the prediction error and increase the prediction
accuracy. The types of experiments are the following:
1. Experiment #1:
Run the machine learning algorithms with their default parameters on the cases as-is, with the full
feature set (no feature selection). The dataset partition was: train set: the first 5.5 years;
test set: the last year.
2. Experiment #2:
Run the machine learning algorithms with their default parameters on the cases with feature
selection. The dataset partition was: train set: the first 5.5 years; test set: the last year.
3. Experiment #3:
Run the machine learning algorithms and find the best tuning parameters through grid search, on the
cases as-is with the full feature set (no feature selection). The dataset partition was: train set:
the first 4.5 years; validation set: the following year; test set: the last year.
4. Experiment #4:
Run the machine learning algorithms and find the best tuning parameters through grid search, on the
cases with feature selection. The dataset partition was: train set: the first 4.5 years;
validation set: the following year; test set: the last year.
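A sketch of the grid-search setup of experiments #3 and #4 using caret's train() (the data and the tuning-grid values are illustrative, not the thesis's actual grids):

```r
library(caret)

set.seed(13)
n <- 300
train_df <- data.frame(temp = runif(n, 0, 35), lag_load = runif(n, 3000, 6000))
train_df$load.12 <- 5500 - 50 * train_df$temp +
  0.2 * train_df$lag_load + rnorm(n, sd = 120)

# grid search over SVM-RBF cost/sigma with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(C = c(0.25, 1, 4), sigma = c(0.001, 0.01, 0.1))

svm_fit <- train(load.12 ~ ., data = train_df, method = "svmRadial",
                 trControl = ctrl, tuneGrid = grid, metric = "RMSE")
svm_fit$bestTune   # the (sigma, C) pair with the lowest cross-validated RMSE
```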
Table 4.2: MAPE evaluation performance of the various models with default model parameters for the
24 hour-based target variables
target variable / models | svm | knn | random forest | nn | xgboost | Model Trees
load.0 2.76% 3.57% 2.91% 4.67% 2.23% 2.03%
load.1 2.63% 3.48% 2.7% 4.61% 2.43% 1.92%
load.2 2.57% 3.44% 2.58% 4.28% 2.22% 2.1%
load.3 2.7% 3.51% 2.62% 4.22% 2.53% 2.15%
load.4 2.73% 3.74% 2.72% 4.04% 2.6% 2.36%
load.5 2.71% 3.71% 2.71% 3.87% 2.56% 2.54%
load.6 3.01% 3.86% 2.92% 4.16% 2.94% 2.87%
load.7 3.78% 4.97% 3.27% 3.67% 3.3% 2.73%
load.8 5.58% 7.17% 3.74% 11.5% 3.83% 3%
load.9 6.08% 7.86% 4.22% 6.83% 3.65% 3.47%
load.10 5.72% 7.52% 3.77% 5.94% 3.3% 3.17%
load.11 5.08% 6.58% 3.68% 4.35% 3.04% 2.85%
load.12 4.53% 5.87% 3.37% 4.35% 3.07% 2.54%
load.13 4.47% 5.81% 3.21% 4.43% 2.72% 2.45%
load.14 4.68% 6.18% 3.33% 5.01% 3.12% 2.62%
load.15 5.23% 6.9% 3.77% 5.43% 2.98% 2.85%
load.16 5.63% 7.11% 4.07% 5.73% 4.07% 3.06%
load.17 5.46% 7.05% 4.14% 5.86% 3.93% 3.46%
load.18 5.67% 6.85% 4.29% 6.29% 3.83% 3.48%
load.19 5.74% 6.36% 3.99% 12.7% 3.95% 2.95%
load.20 6.03% 5.81% 3.59% 14.1% 3.35% 2.39%
load.21 4.29% 5.19% 3.06% 11.1% 2.63% 2.26%
load.22 3.57% 4.71% 3.15% 3.73% 2.74% 2.26%
load.23 3.01% 4.61% 3.42% 4.94% 2.87% 2.27%
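All performance tables in this chapter report mape, the mean absolute percentage error. For completeness, a standard definition, where $y_t$ is the actual load, $\hat{y}_t$ the predicted load, and $n$ the number of test days:

    \mathrm{MAPE} \;=\; \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|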
The following table 4.3 shows the mean mape of the models across all target variables per machine learning algorithm:
Table 4.3: mean mape for the 24 models per machine learning algorithm with default parameters and
full features
Some Remarks
• Model Trees with the Cubist library performed best, with a 2.65% mean error; it achieves the best accuracy on every target variable, with no other ML algorithm matching it (a minimal Cubist sketch follows Table 4.4).
• XGBoost comes second with 3.07% and Random Forest third with 3.38%; further experimentation should increase their accuracy.
• With a minimum mean mape prediction error of 2.65% from Model Trees, against OoEM's load prediction error of 2.53%, this experiment performs 4.74% worse than OoEM's predictions. The following table 4.4 describes the ensemble of models per target variable and shows the prediction error comparison between OoEM's existing model and the Model Trees models.
Table 4.4: classifier selection from target variable ensemble, mape evaluation and performance evalua-
tion for experiment #1
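Since Model Trees via the Cubist library dominate this experiment, a minimal sketch of fitting one such model in R follows. The data frame names and the committees value are assumptions, not the thesis's exact configuration.

    library(Cubist)  # rule-based model trees (Quinlan's Cubist)

    targets <- paste0("load.", 0:23)
    x <- train_12[, setdiff(names(train_12), targets)]  # predictors only
    y <- train_12$load.0                                # one hourly target

    # Fit a Cubist model tree; committees = 10 is an assumed setting
    mt   <- cubist(x = x, y = y, committees = 10)
    pred <- predict(mt, newdata = test_12[, names(x)])

    # mape on the test year
    100 * mean(abs((test_12$load.0 - pred) / test_12$load.0))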
Table 4.5: mape evaluation performance from various models with default model parameters and feature selection for the 24-hour based target variables
The following table 4.6 shows the mean mape evaluation from models for all the target variables per
machine learning algorithm:
Table 4.6: mean mape for the 24 models per machine learning algorithms with default parameters and
feature selection
Some Remarks
• Model Trees with the Cubist library again performed best: its prediction error improved from 2.65% to 2.6%.
• With feature selection, SVM started to perform better than in the previous experiment: its prediction error decreased from 4.31% to 3.07%, and on some target variables it outperforms Model Trees.
• Although XGBoost also improved its accuracy with feature selection compared to the previous experiment, it comes second in mean accuracy, and it outperforms SVM on most target variables.
• One significant tactic here is to combine different models per target variable based on the minimum mape. For the load targets from 00:00 to 06:00, SVM shows better performance than Model Trees, so we can ensemble the SVM models for those hours with Model Trees for the remaining hours. Doing so yields an even better prediction error of 2.55%, compared with the 2.6% mean mape of Model Trees alone. The load prediction error from OoEM is 2.53%, so this experiment still trails OoEM's predictions by 0.7%. The following table 4.7 describes the ensemble of models per target variable and shows the performance comparison against OoEM's existing model (a sketch of this per-hour model selection follows Table 4.7).
Table 4.7: classifier selection from target variable ensemble, mape evaluation and performance
evaluation for experiment #2
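The per-hour model selection itself is mechanical. A minimal sketch in base R, assuming a hypothetical 24 x 6 matrix mape_tbl of test-set mape values (rows load.0 ... load.23, columns the six algorithms):

    # Pick, for each hourly target, the algorithm with the minimum test mape
    best_model <- colnames(mape_tbl)[apply(mape_tbl, 1, which.min)]
    best_mape  <- apply(mape_tbl, 1, min)

    # Mean mape of the resulting per-hour ensemble (2.55% in this experiment)
    mean(best_mape)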
To select the best tuning parameters per machine learning algorithm, each combination of tuning parameters was evaluated on the validation set. The combination with the smallest mape was selected to train the final model, whose accuracy was then measured on the test set. The best tuned parameters per machine learning algorithm and per target variable with full features can be found in the appendix of tables.
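One way to express this validation-set selection in R is through the caret package. The sketch below is illustrative rather than the thesis's exact code: train_valid (the 4.5-year train set plus the 1-year validation set, holding the predictors and the single target load.0), the index vectors train_idx and valid_idx, and the SVM grid values are all assumptions; caret has no built-in mape, so MAE stands in as the selection metric here.

    library(caret)  # also needs kernlab for method = "svmRadial"

    # A fixed train/validation split expressed as a single caret "fold":
    # rows in train_idx fit each candidate model, rows in valid_idx score it.
    ctrl <- trainControl(method   = "cv",
                         index    = list(fold1 = train_idx),
                         indexOut = list(fold1 = valid_idx))

    # Assumed tuning grid for a radial-basis SVM
    grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(1, 10, 100))

    fit <- train(load.0 ~ ., data = train_valid,
                 method = "svmRadial", tuneGrid = grid,
                 trControl = ctrl, metric = "MAE")

    fit$bestTune  # the parameter combination with the best validation score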
24 models were trained per machine learning algorithm, one per target variable, as many as the hours of the day. The mape prediction error results on the test set are presented in the following table 4.9; the smallest mape per model and target variable is highlighted in green.
Table 4.9: mape evaluation performance from various models with tuned parameters and full features
for the 24-hour based target variables
The following table 4.10 shows the mean mape for all the target variables per machine learning
algorithm:
Table 4.10: mean mape for the 24 models per machine learning algorithms with tuned parameters and
full features
Some Remarks
• Again Model Trees has the best overall mean accuracy, with a 2.66% prediction error.
• XGBoost achieved even better accuracy than before, with a 2.73% prediction error. On the other hand, SVM's accuracy became worse than before, and KNN cannot match the accuracy of the other ML algorithms.
• Even though Model Trees has the best mean accuracy among the six machine learning algorithms, on some specific target variables the XGBoost models show better performance. Combining Model Trees and XGBoost per target variable by minimum mape yields an even smaller mean prediction error of 2.58%. Although 2.58% is slightly better than Model Trees alone (2.60%), it is worse than the 2.55% of the previous combination experiment. This suggests that combining the two tactics, feature selection and grid search, may lead to even better results. The load prediction error from OoEM is 2.53%, so this experiment again trails OoEM's predictions, by 1.97%. The following table 4.11 describes the ensemble of models per target variable and shows the performance comparison against OoEM's existing model.
Table 4.11: classifier selection from target variable ensemble, mape evaluation and performance evalu-
ation for experiment #3
To select the best tuning parameters per machine learning algorithm, each combination of tuning parameters was evaluated on the validation set. The combination with the smallest mape was selected to train the final model, whose accuracy was then measured on the test set. The best tuned parameters per machine learning algorithm and per target variable with feature selection can be found in the appendix of tables.
24 models were trained per machine learning algorithm, one per target variable, as many as the hours of the day. The mape prediction error results are presented in the following table 4.13; the smallest mape per model and target variable is highlighted in green.
Table 4.13: mape evaluation performance from various models with tuned model parameters and feature selection for the 24-hour based target variables

target variable  svm    knn    random forest  nn     xgboost  Model Trees
load.0 1.78% 3.57% 2.73% 2.46% 2.01% 2.05%
load.1 1.86% 3.4% 2.53% 2.59% 1.94% 2.23%
load.2 1.77% 3.22% 2.33% 2.57% 2.05% 1.99%
The following table 4.14 shows the mean mape across all target variables per machine learning algorithm:
Table 4.14: mean mape for the 24 models per machine learning algorithms with tuned parameters and
feature selection
Some Remarks
• Again Model Trees has the best overall mean accuracy, with a 2.60% prediction error.
• XGBoost achieved its best performance of all the experiments, with a 2.64% prediction error.
• SVM also achieved its best performance of all the experiments, with a 2.62% prediction error.
• Again, if we combine models per target variable by minimum mape, we get the best combined (ensemble) mean mape overall: 2.41%. The load prediction error from OoEM is 2.53%, so this experiment improves the predictive performance by 4.74%; it is the only experiment that reduces the prediction error below OoEM's. The following table 4.15 describes the ensemble of models per target variable and shows the performance comparison against OoEM's existing model.
Table 4.15: classifier selection from target variable ensemble, mape evaluation and performance evalu-
ation for experiment #4
The second page presents both the loads and the meteorological features in table format.
The third page shows descriptive statistics for both the loads and the meteorological features (histograms and boxplots).
The fourth page shows the correlation between electricity load demand and meteorological features.
And the fifth page shows the correlation between electricity load demand and meteorological features.
Chapter 5
Conclusions
As presented in the initial chapters, electric load forecasting is nowadays a vital process, due to the increased need for electrical power in consumers' daily lives and routines. Electrical companies should take this process into account and constantly improve their models for more accurate forecasts.
As seen in the scientific literature, many machine learning and statistical approaches exist for short-term load forecasting. Applying six of them (SVM, Random Forest, k-Nearest Neighbors, Neural Networks, XGBoost, Model Trees), the best result was a 2.41% prediction error from an ensemble of SVM, XGBoost and Model Trees models, improving on OoEM's predictions by 4.74%. One interesting finding is that combining models per target variable by their mape evaluation increases the overall prediction accuracy beyond that of any single model.
Chapter 6
Future Work
Many techniques seen in the literature have not been applied here; more importantly, if they are implemented, their ensemble combinations should give interesting results. Furthermore, techniques of multi-target regression can be applied. Moreover, because grid search takes a lot of time, the Spark library for distributed computing or parallel for-loops should be introduced to improve running time.
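As a low-effort starting point, R's foreach/doParallel backend already parallelizes caret's grid search. A minimal sketch, assuming four local cores are available:

    library(doParallel)

    cl <- makeCluster(4)   # assumed: 4 local cores
    registerDoParallel(cl)

    # With a backend registered, caret::train() evaluates the tuning grid and
    # resamples across the workers (allowParallel = TRUE is caret's default).
    # ... run the grid-search experiments here ...

    stopCluster(cl)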
Two cities, Athens and Thessaloniki, were used along with their meteorological and calendar features; introducing more cities to the models is a natural next step, and feature selection makes this easier. Another future idea is to produce short-term load forecasts for any future date: with the pre-trained models, predicting a future date only requires constructing new cases with the meteorological and calendar values for that date. Another idea is to give more liveness to the ipto-ml shinyapp application by fetching all the files from IPTO, OoEM and DarkSky "on the fly" and then preprocessing them and training new models. Finally, an "Energy analytics" dashboard would be a great visualization for load forecasting.
Chapter 7
Appendix of Tables
7.2.2 List of Best KNN tuning Parameters per target variable with full features
7.2.3 List of Best Random Forest tuning Parameters per target variable with full features
7.2.4 List of Best Neural Networks tuning Parameters per target variable with full features
7.2.5 List of Best XGBoost tuning Parameters per target variable with full features
7.2.6 List of Best Model Trees tuning Parameters per target variable with full features
7.3.2 List of Best K-Nearest Neighbors tuning Parameters per target variable with feature selection
7.3.3 List of Best Random Forest tuning Parameters per target variable with feature selection
7.3.4 List of Best Neural Networks tuning Parameters per target variable with feature selection
7.3.5 List of Best XGBoost tuning Parameters per target variable with feature selection
7.3.6 List of Best Model Trees tuning Parameters per target variable with feature selection
Bibliography
[1] Electrical Load Forecasting: Modeling and Model Construction, Soliman Abdel-hady Soliman (S.A. Soliman), Ahmad M. Al-Kandari.
[2] Short Term Electric Load Forecasting by Tao Hong, A dissertation submitted to the Graduate
Faculty of North Carolina State University in partial fulfillment of the requirements for the degree
of Doctor of Philosophy, Operations Research and Electrical Engineering, Raleigh, North Carolina
[4] T. Haida and S. Muto, "Regression based peak load forecasting using a transformation technique," IEEE Transactions on Power Systems.
[5] B. Krogh, E. S. de Llinas, and D. Lesser, "Design and Implementation of An on-Line Load Forecasting Algorithm," IEEE Transactions on Power Apparatus and Systems.
[6] M. T. Hagan and S. M. Behr, "The Time Series Approach to Short Term Load Forecasting," IEEE Transactions on Power Systems.
[7] N. Amjady, "Short-term hourly load forecasting using time-series modeling with peak load estimation capability," IEEE Transactions on Power Systems.
[8] D. C. Park, M. A. El-Sharkawi, R. J. Marks, II, L. E. Atlas, and M. J. Damborg, "Electric load forecasting using an artificial neural network," IEEE Transactions on Power Systems, vol. 6, pp. 442-449, 1991.
[10] A. Adamakos, "Short-Term Load Forecasting using a Cluster of Neural Networks for the Greek Energy Market", Electrical & Computer Engineer, Public Power Corporation, Greece.
[12] A. G. Bakirtzis, V. Petridis, S. J. Kiartzis, M. C. Alexiadis, A. H. Maissis, "A neural network short term load forecasting model for the Greek power system," IEEE Transactions on Power Systems, vol. 11, no. 2, May 1996, pp. 858-863.
[14] S. Rahman and R. Bhatnagar, "An expert system based algorithm for short term load forecast," IEEE Transactions on Power Systems, vol. 3, pp. 392-399, 1988.
[15] S. Rahman, "Formulation and analysis of a rule-based short-term load forecasting algorithm," Proceedings of the IEEE, vol. 78, pp. 805-816, 1990.
[16] K.-L. Ho, Y.-Y. Hsu, C.-F. Chen, T.-E. Lee, C.-C. Liang, T.-S. Lai, and K.-K. Chen, "Short term load forecasting of Taiwan power system using a knowledge-based expert system," IEEE Transactions on Power Systems, vol. 5, pp. 1214-1221, 1990.
[17] Y. Y. Hsu and K. L. Ho, "Fuzzy expert systems: an application to short-term load forecasting," IEE Proceedings - Generation, Transmission and Distribution, vol. 139, pp. 471-477, 1992.
[18] P. A. Mastorocostas, J. B. Theocharis, and A. G. Bakirtzis, "Fuzzy modeling for short term load forecasting using the orthogonal least squares method," IEEE Transactions on Power Systems, vol. 14, pp. 29-36, 1999.
[19] H. Mori and H. Kobayashi, "Optimal fuzzy inference for short-term load forecasting," IEEE Transactions on Power Systems, vol. 11, pp. 390-396, 1996.
[20] S. Rahman, "Formulation and analysis of a rule-based short-term load forecasting algorithm," Proceedings of the IEEE, vol. 78, pp. 805-816, 1990.
[21] N. Sapankevych and R. Sankar, "Time Series Prediction Using Support Vector Machines: A Survey," IEEE Computational Intelligence Magazine, vol. 4, pp. 24-38, 2009.
[22] Electrical Load Forecasting using Support Vector Machines, Belgin Emre, Dilara Demren, Istanbul Technical University, Turkey.
[23] C.-M. Huang and H.-T. Yang, ”Evolving wavelet-based networks for shortterm load forecasting,”
IEE Proceedings - Generation, Transmission and Distribution, vol. 148, pp. 222-228, 2001.
[24] Ensemble Deep Learning for Regression and Time Series Forecasting, Xueheng Qiu, Le Zhang, Ye Ren and P. N. Suganthan, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; Gehan Amaratunga, Department of Engineering, University of Cambridge, UK.
[25] Deep Neural Network Based Demand Side Short Term Load Forecasting, Seunghyoung Ryu,
Department of Electronic Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul
121-742, Korea; [email protected], Jaekoo Noh, Software Center, Korea Electric Power
Corporation (KEPCO)
[26] Short Term Electrical Load Forecasting Using Mutual Information Based Feature Selection with
Generalized Minimum-Redundancy and Maximum-Relevance Criteria, Nantian Huang, Zhiqiang
Hu, Guowei Cai and Dongfeng Yang, School of Electrical Engineering, Northeast Dianli
University, Jilin 132012 China.
[27] Short-Term Load Forecasting using Random Forests, Grzegorz Dudek, Department of Electrical
Engineering, Czestochowa University of Technology, Czestochowa, Poland
[28] Rule-Based Prediction of Short-term electric Load, Petr Berka, Prof, PhD, University of
Economics Prague & University of Finance and Administration Prague, Czech Republic
[29] Formulation and analysis of a rule-based short-term load forecasting algorithm, S. Rahman, Dept
of Electr. Eng., Virginia Polytech. Inst. & State Univ. Blacksburg, VA, USA,
ieeexplore.ieee.org/document/53400
[30] A Novel Hybrid Model Based on Extreme Learning Machine, k-Nearest Neighbor Regression and
Wavelet Denoising Applied to Short-Term Electric Load Forecasting, Weide Li, Demeng Kong and
Jinran Wu, School of Mathematics and Statistics, Lanzhou University, Gansu, China
[31] A composite k-nearest neighbor model for day-ahead load forecasting with limited temperature
forecasts, Rui Zhang, Yan Xu, Zhao Yang Dong, Weicong Kong, Kit Po Wong, School of Electrical
and Information Engineering, University of Sydney, NSW 2006, Australia
ieeexplore.ieee.org/document/7741097
[32] Short-Term Electricity Load Forecasting Based on the XGBoost Algorithm, Guangye Li, Wei Li,
Xiaolei Tian, Yifeng Che, State Grid Liaoning Electric Power Co., Ltd., Shenyang Liaoning
[33] A gradient boosting approach to the Kaggle load forecasting competition, Souhaib Ben Taieb, Machine Learning Group, Department of Computer Science, Faculty of Sciences, Université Libre de Bruxelles; Rob J Hyndman, Department of Econometrics and Business Statistics, Monash University, Clayton, VIC 3800, Australia.
[34] O. Chapelle and V. Vapnik, Model Selection for Support Vector Machines. In Advances in Neural
Information Processing Systems, Vol 12, (1999)
[35] M. T. Hagan and S. M. Behr, "The Time Series Approach to Short Term Load Forecasting," IEEE Transactions on Power Systems, vol. 2, pp. 785-791, 1987.
[36] svm regression tutorial
[37] svm regression tutorial matlab
[38] random forest tutorial
[39] random forest explainer
[40] KNN regression
[41] neural networks regression
[42] Introduction to Boosted Trees
[43] An Introduction to XGBoost
[44] An Introduction to Gradient Boosting
[45] Cubist - Model Trees Presentation
[46] R language wiki
[47] Weka wiki
[48] SPSS wiki
[49] Excel wiki
[50] python wiki
[51] scikit-learn wiki
[52] pandas wiki
[53] statsmodel wiki
[54] Anaconda main site
[55] Rapidminer wiki
[56] Orange wiki
[57] Knime wiki
[58] Boruta Feature Selection Algorithm