Supervised Machine Learning Techniques for Short-Term Load Forecasting
Abstract
Electric load forecasting is essential for utility companies to manage energy based on demand. Machine learning algorithms have been at the forefront of prediction techniques. This thesis aims to provide utility companies with better insight into the wide range of techniques available to forecast load demand under different scenarios. Supervised machine learning algorithms were used to arrive at the best possible solution for short-term electric load forecasting. The input data set contains the hourly load values, a weather data set and other details of each day. The models were evaluated using MAPE and R2 as the scoring criteria. Support Vector Machines yield the best results, with the lowest MAPE of 1.46 % and an R2 score of 92 %. The Recurrent Neural Network univariate model serves as the go-to model for time-series predictions, with a MAPE of 2.44 %. The observations from these machine learning models lead to the conclusion that the choice of model depends on the actual data set availability and on the application and scenario in play.
Document Type
Thesis
Degree Name
M.S.
Department
Electrical Engineering
First Advisor
Mohammad A. Matin, Ph.D.
Second Advisor
George Edwards, Ph.D.
Third Advisor
Mohammed Mahoor, Ph.D.
Keywords
Electric energy, Load forecasting, Machine learning, Neural networks, Time series predictions, Utility
companies
Subject Categories
Electrical and Computer Engineering | Engineering
Publication Statement
Copyright is held by the author. User is responsible for all copyright compliance.
A Thesis
Presented to
the Faculty of the Daniel Felix Ritchie School of Engineering and Computer
Science
University of Denver
In Partial Fulfillment of the Requirements for the Degree
Master of Science
by
Harish Amarasundar
August 2019
Acknowledgements
I would like to thank Dr. Mohammed Mahoor and Dr. Yun-bo Yi for their attendance, support and questions. I gained valuable knowledge, ideas and inspiration from all those great people. Finally, I owe many thanks to my parents, friends and family for all their support.
Table of Contents
Chapter 1 Introduction 1
    Electric Energy 1
    Load Forecasting 3
        Need for Load Forecasting 3
        Types of Load Forecasting 4
        Advantages of Load Forecasting 5
        Challenges in Load Forecasting 6
    Problem Statement 7
    Objective 8
Chapter 3 Dataset 34
    Data set Preparation 36
    Data set Pre-Processing 36
        Label Encoding 39
        One Hot Encoder 40
Chapter 5 Results 60
    Scenarios 60
        Case - I 60
        Case - II 61
        Case - III 61
        Case - IV 62
    ANN - Multivariate 62
    CNN - Univariate 63
    CNN - Multivariate 64
    RNN - Univariate 65
    RNN - Multivariate 66
    Support Vector Machines - Regression 67
    Performance Comparison 68
Chapter 6 Conclusion 69
    Future Work 71
Appendix A Definitions 72
Bibliography 79
List of Figures
List of Tables
Chapter 1
Introduction
Electric Energy
Electrical energy is a form of energy that results from the flow of electric charge. Electrical energy is not a material resource that can be stored and reused later; it has to be generated and transferred based on demand on a regular basis to help power the devices that run on electricity. Electricity to this day is among the most important inventions, because it serves as the baseline for many inventions to come. It is safe to say that electrical energy has become an essential part of everyday life. It is generated at electric power generation stations from resources such as natural gas, solar, coal, fossil fuels, nuclear, hydroelectric, wind turbines, geothermal, biomass and other sources, as shown in Fig 1.1 [1].
That is where load forecasting comes into play, giving the utility companies meaningful information. The utility companies make use of this data and prediction algorithms, which provide them with a better sense of the load demand for future consumption. The predicted load demand allows the utility companies to efficiently allocate resources and meet the demand of the consumers.
Load Forecasting
The term load forecasting can be defined as the methods used to predict the electricity demand of consumers, both residential and commercial, so that the required load can be supplied. As living standards rise and economies expand, the need for electrification grows, and with it the need for load allocation and planning for future generation facilities and transmission expansion. There is also a rising need for power suppliers to build their bidding strategies against their competitors so that consumers can later derive a plan to maximize their utility using electricity purchased from the pool [2, 3].
Figure 1.2. Types of Load Forecasting
• Medium-Term: Medium-term forecasts usually range from a few weeks to a few months ahead. This thesis's aim is to predict hour-ahead load demands, which falls under the short-term category.
Load forecasting is advantageous because understanding the future load demands helps the company plan generation and transmission investments; for example, utility companies can set up generation stations near where the demand is particularly high and reduce the transmission costs. It also helps the utility companies plan for the scheduled maintenance of the power systems. This shows why load forecasting is highly essential for utility companies; however, there are complications. Weather data is one of the inputs for training the models, but the unpredictable nature of the weather provides a challenge: it sometimes gets tricky while forecasting the load, given that the weather data set was itself produced by a forecasting model. However, it is only tricky for those areas where the weather is volatile. Most consumers pay rates based on the seasonal cost of electricity, which reflects the availability of generation sources, fuel costs, and power plant availability. Hence load forecasting helps utility companies plan out their cost of supply to their consumers. There are several factors to consider while load forecasting. Some of the most important factors are as follows:
• Highly volatile prices and load values that make the prediction process difficult.
• Electricity cannot easily be transported from one region to another, unlike goods in other markets.
Problem Statement
Electricity demand keeps growing in this day and age. It is expected that by the year 2040 there will be a substantial rise in population, and with it electricity demand, according to the US Census. Therein lies the problem statement: finding a way for utility companies to efficiently allocate their resources and come up with a pricing plan based on the demand from various consumers.
Objective
Load demand follows patterns, and that is where load forecasting comes in handy to help forecast future demand. The objective of this thesis is to determine which supervised machine learning models are best suited for a given real-life scenario.
Chapter 2
Literature Survey
Machine Learning
Machine learning can be defined as an algorithm that has the ability to automatically learn from data and observations and give out a classification or a prediction, with only the architecture being designed and not the actual program explicitly being coded. So why machine learning now, when it was introduced in the late 1950s? There was a lack of computational and processing power to deal with machine learning algorithms in the earlier years when they were introduced, and there was also a lack of storage resources to run and store such taxing computational tasks. The availability of relevant data sets was also scarce: machine learning algorithms learn better as the number of features and samples in a data set increases, and the data available then was not enough for the models to train on and efficiently learn to provide good results. The advent of the digitization age paved the way for machine learning to regain its popularity. The wide range of data that is available today to train the models has made machine learning attractive in many areas, and with modern processing power and storage resources these models are able to run at a much faster pace and store data with greater ease. Machine learning also offers a lot of flexibility to optimally tune the models and make them robust enough to handle the data set that is fed into them and still give accurate results. In the field of machine learning there are three broad categories of learning algorithms:
• Supervised Learning: the model learns from labelled training data, where each input sample is paired with a known target output.
• Unsupervised Learning: no category label is given; its goal is to infer the natural structure present within a set of training samples. Clustering algorithms and Boltzmann Machines are some of the unsupervised machine learning models.
• Reinforcement Learning: the only teaching feedback is a reward signal, and the agent learns by taking actions where it finds the best possible reward. Markov processes, for example, underpin reinforcement learning.
This research focuses only on supervised machine learning models, since the utility companies have the required labelled data set for training. This thesis focuses on supervised learning because, in reality, for load forecasting one deals with labelled data that are recorded by the system operators and made available online 24/7 [5]. Hence there is no real need for the other two categories here. Neural networks are a subset of machine learning algorithms, and this research explores three types of them.
History of Neural Networks
Neural networks are loosely based on the biological neural networks of the human brain. The human brain consists of three major parts, namely the temporal lobe, the occipital lobe and the frontal lobe, and the neurons connecting them. Each of these parts, along with the neurons, served as influence and inspiration for modelling the various types of neural networks available to the present day. The first perceptron was proposed by Frank Rosenblatt [11, 9]; it was intended to model how the human brain processed visual data and learned to recognize objects. Perceptrons are binary classifiers that decide whether an input, represented by a vector of numbers, belongs to some specific class. The idea was to combine simple units into an input layer, hidden layers and an output layer; in the modern sense, the perceptron is the basic building block of a neural network [11, 9]. For a simple perceptron the weights are initialized to zero at first and then passed through the layers; then, for each training sample, an output is obtained from the unit step activation function. The weights are then updated based on the output values with the goal of minimizing the errors. The output of the perceptron is given by the unit step function:
X = 1 if w · x + b > 0, and X = 0 otherwise    (2.1)
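A minimal sketch of this rule with NumPy, assuming a unit step activation, zero-initialized weights and a fixed learning rate; the function and variable names are illustrative and not taken from the thesis code:

import numpy as np

def unit_step(z):
    # Equation (2.1): output 1 if w.x + b > 0, else 0
    return np.where(z > 0, 1, 0)

def train_perceptron(X, y, epochs=10, lr=0.1):
    # Weights and bias start at zero, as described above
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = unit_step(np.dot(w, xi) + b)
            # Update the weights in the direction that reduces the error
            update = lr * (target - pred)
            w += update * xi
            b += update
    return w, b

# Toy usage: learn a logical AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(unit_step(X @ w + b))  # expected: [0 0 0 1]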
Backpropagation
Back-propagation is one of the simplest and most general methods for supervised training of neural networks, and it slowly paved the way for the different types of neural networks: the Artificial Neural Network, based on the temporal lobe, the Convolutional Neural Network, based on the occipital lobe, and the Recurrent Neural Network, based on the frontal lobe. The basic approach is to present a training pattern to the input layer, pass the signals through the net and determine the output at the output layer. The error function is some scalar function of the weights and is minimized when the network's outputs match the desired outputs. Thus the weights are adjusted to minimize the error. The gradient descent algorithm with the cost function in (2.2) was used:
J(w) = 0.5 * Σ_{k=1}^{c} (t_k − z_k)² = 0.5 * ||t − z||²    (2.2)
Figure 2.2. Back Propagation [8]
The weights are initialized at the start of training and are then changed in the direction that reduces the error:
δw = −η ∂J/∂w    (2.3)
where w corresponds to the weights, η is the learning rate, and J is the cost function that is being minimized.
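As a rough illustration of (2.2) and (2.3), the sketch below performs batch gradient descent on the squared-error cost for a single linear unit; the learning rate and synthetic data are made up for the example:

import numpy as np

def gradient_descent_step(w, X, t, lr=0.01):
    # Forward pass: network output z for every sample
    z = X @ w
    # Equation (2.2): J(w) = 0.5 * ||t - z||^2
    J = 0.5 * np.sum((t - z) ** 2)
    # Gradient of J with respect to w
    grad = -X.T @ (t - z)
    # Equation (2.3): delta_w = -eta * dJ/dw
    w = w - lr * grad
    return w, J

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w, J = gradient_descent_step(w, X, t)
print(np.round(w, 2))  # approaches [1.0, -2.0, 0.5]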
Activation Function
Each neuron takes a weighted sum of its inputs and adds a bias before passing the result into the activation function, which then decides whether to send that value to the next neuron or not. Say Y is that weighted sum:
Y = Σ_{t=1}^{n} (W_t * X_t) + b    (2.4)
f(Y) = max(0, Y)    (2.5)
The activation function only needs to satisfy a few simple conditions, such as continuity and differentiability. One is chosen that suits the problem at hand and at the same time ensures continuity and smoothness. There are different activation functions such as sigmoid, rectified linear unit (ReLU), tanh and so on. Sigmoid activation is used primarily in classification tasks where the output has to lie between zero and one, whereas ReLU, given in (2.5), is a common choice for regression problems. The ReLU activation function passes the value through if the input is positive and outputs zero otherwise.
Loss Function
During training we present a training pattern to the input layer, pass the signals through the net and determine the output at the output layer. These outputs are compared to the target values; any difference corresponds to an error. The weights are then adjusted to reduce this measure of error. Neural networks therefore require a loss function to quantify the error. Commonly used loss functions are mean squared error, mean absolute error, categorical cross entropy and so on. The first two losses are suited for a regression problem such as this one:
MSE = (1/n) * Σ_{t=1}^{n} (Actual_t − Predicted_t)²    (2.6)
In Equation (2.6), n refers to the number of samples in the input set; MSE corresponds to the mean of the squared difference between the actual load value and the predicted load value. MSE was used as the loss function during the training phase to minimize the errors. There is another loss function called Mean Absolute Error (MAE) used in regression analysis for neural networks. MAE is more robust to outliers since it does not square the differences. On the other hand, MSE is more useful if we are concerned about large errors, whose effect is magnified by the squaring:
MAE = (1/n) * Σ_{t=1}^{n} |Actual_t − Predicted_t|    (2.7)
It is the average, over the test samples, of the absolute differences between prediction and actual observation, where all individual differences have equal weight.
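A small sketch of these two losses with NumPy, assuming arrays of actual and predicted load values (the numbers are illustrative only):

import numpy as np

def mse(actual, predicted):
    # Equation (2.6): mean of squared errors
    return np.mean((actual - predicted) ** 2)

def mae(actual, predicted):
    # Equation (2.7): mean of absolute errors
    return np.mean(np.abs(actual - predicted))

actual = np.array([1020.0, 980.0, 1100.0, 1050.0])     # MWh
predicted = np.array([1000.0, 995.0, 1080.0, 1065.0])  # MWh
print(mse(actual, predicted), mae(actual, predicted))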
Figure 2.4. Overview of Gradient Descent Algorithms [13]
Optimizer
An optimizer is the algorithm used for minimizing the error/loss function stated in the section above and updating the system weights. There are different optimization algorithms like Adam, RMSProp, AdaGrad and stochastic gradient descent; the Adaptive Moment Estimation (Adam) optimizer was well suited for this research problem. Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper (poster) titled "Adam: A Method for Stochastic Optimization". It combines the advantages of two other extensions of stochastic gradient descent, namely AdaGrad and RMSProp. Instead of adapting the parameter learning rates based only on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance). Compiling a neural network for training requires both the loss function and the optimizer to be specified. It can be seen from Figure 2.4 that Adam and RMSProp were the most effective in terms of convergence.
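As a brief illustration, a Keras model of the kind described in Chapter 4 would typically be compiled with this optimizer and loss roughly as follows; the layer sizes, input shape and learning rate here are assumptions rather than values reported in the thesis:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(20, activation="relu", input_shape=(8,)),
    keras.layers.Dense(1),
])

# Adam optimizer with mean squared error as the training loss
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="mse",
    metrics=["mae"],
)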
Scoring Criterion
Mean Absolute Percentage Error (MAPE) was the primary scoring mechanism employed on the test set:
MAPE = (1/n) * Σ_{t=1}^{n} |(Actual_t − Predicted_t) / Actual_t| * 100    (2.8)
MAPE is a measure of how close the predicted values are to the actual values, and it is the most commonly used scoring mechanism for regression tasks of this kind. The coefficient of determination (R2) was also used as an evaluation metric for the training and testing accuracies. R2 gives a statistical measure of how close the given data are to the fitted regression curve, and it is one of the most popular evaluation metrics for continuous data or regression score functions. It can be computed as
R² = 1 − SS_res / SS_tot    (2.9)
y_m = (1/n) * Σ_{i=1}^{n} y_i    (2.10)
SS_tot = Σ_i (y_i − y_m)²    (2.11)
SS_res = Σ_i (y_i − f_i)²    (2.12)
In the equations above, (2.9)–(2.12), y_i is the true value, f_i is the predicted value, and y_m is the mean of the true values. Several sources of error contribute to the prediction errors; the bias and variance trade-off helps to understand these errors better.
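A short sketch of both scoring metrics, using NumPy for MAPE and scikit-learn's built-in r2_score; the sample values are illustrative:

import numpy as np
from sklearn.metrics import r2_score

def mape(actual, predicted):
    # Equation (2.8): mean absolute percentage error
    return np.mean(np.abs((actual - predicted) / actual)) * 100

actual = np.array([1020.0, 980.0, 1100.0, 1050.0])
predicted = np.array([1000.0, 995.0, 1080.0, 1065.0])
print(f"MAPE: {mape(actual, predicted):.2f} %")
print(f"R2:   {r2_score(actual, predicted):.3f}")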
Bias is the error arising from overly simple assumptions in the model, while variance is the error arising from sensitivity to small fluctuations in the training data. Models with high bias pay very little attention to the training data and oversimplify the model. Models with high variance pay a lot of attention to the training data and do not generalize to data they have not seen before. This leads to the problems of overfitting and underfitting the data.
Figure 2.5. Bias - Variance [15]
• Overfitting: the model fits the training data too closely and fails to perform well on the actual test data. It occurs when the model is trained on a noisy data set; such models tend to have low bias and high variance.
• Underfitting: the model is too simple to capture the underlying pattern of the data. These models usually have high bias and low variance. It usually occurs when the amount of training data is too small or the model is not complex enough.
The goal is to find the optimal balance in the bias-variance trade-off, so that the error is minimized for the case at hand. Thus it is extremely important to understand this trade-off. There are different methods that help machine learning models overcome the problems of overfitting and underfitting. This thesis uses the following:
• Train-Test-Validation Split
• Cross-Validation
• Dropout
Train-Test-Validation Split
It is common practice to split the entire data set into a training set and a test set. The training set contains known outputs, and the model learns from this data in order to generalize to other data later on. The test set (or subset) is kept aside in order to test the model's predictions on unseen data. The split is usually around 70/30 for the training and test sets. Furthermore, the training set is subjected to k-fold cross validation to split it into a part for training and a part for validation across the folds [8].
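A minimal sketch of such a split with scikit-learn; the 70/30 ratio follows the text, while the synthetic arrays and the decision to disable shuffling (to preserve the time order of an hourly load series) are assumptions:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: X holds input features, y the hourly load values
rng = np.random.default_rng(0)
X = rng.normal(size=(8760, 8))   # one year of hourly samples, 8 features
y = rng.normal(loc=1000, scale=100, size=8760)

# 70/30 split; shuffling disabled to keep the time order intact
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False
)
print(X_train.shape, X_test.shape)  # (6132, 8) (2628, 8)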
Cross-Validation
In k-fold cross-validation, sometimes called rotation estimation, the data set D is randomly split into k mutually exclusive folds of approximately equal size, where k is given by the user. The model is trained on k − 1 of the folds and tested on the remaining k-th fold, and the process is repeated so that each fold serves once as the validation set. It is usually done on the training data, with the training and validation split done beforehand. The figure below shows a data set with k = 5 being cross-validated over 5 folds [17, 16, 8].
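A brief sketch of 5-fold cross-validation with scikit-learn, reusing the training arrays from the split above; the plain linear regression is only a placeholder for the networks described later:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

kf = KFold(n_splits=5, shuffle=False)
scores = []
for train_idx, val_idx in kf.split(X_train):
    model = LinearRegression()
    model.fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[val_idx])
    scores.append(r2_score(y_train[val_idx], preds))
print(scores)  # one validation R2 score per fold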
Lasso and Ridge regression are the most commonly used regularizers for regression analysis when it comes to overfitting and underfitting. They are also referred to as the L1 and L2 norms respectively. They are used to penalize large weights so that the model does not overfit:
w* = argmin(ErrorFunction) + λ * Σ_{i=1}^{k} |w_i|    (2.13)
w* = argmin(ErrorFunction) + λ * Σ_{i=1}^{k} (w_i)²    (2.14)
Here λ controls the weight of the regularizing term while learning the weights for both norms, no matter what the error function is. The L1 norm is the sum of the absolute values of the weights, whereas the L2 norm is the sum of the squared values of the weights. The L2 norm has an analytical solution and is computationally efficient; the L1 norm, on the other hand, produces sparse solutions and is less efficient to compute. L1 norms have built-in feature selection whereas the L2 norm does not. Sparsity refers to non-zero entries being very scarce in a matrix or a vector. Feature selection refers to the ability of the model to select only the useful coefficients that contribute the most to the prediction.
Dropout
The key idea of dropout is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks, which is roughly equivalent to training an ensemble of networks with smaller weights. This significantly reduces overfitting and improves generalization [20].
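A sketch of how these two ideas are expressed in Keras, following the description used later for the models (an L1 kernel regularizer, an L2 activity regularizer and a dropout layer); the layer size, input shape, rates and strengths are illustrative rather than the exact values used in the thesis:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(
        20,
        activation="relu",
        input_shape=(8,),
        kernel_regularizer=regularizers.l1(1e-4),     # L1 norm on the weights
        activity_regularizer=regularizers.l2(1e-4),   # L2 norm on the activations
    ),
    layers.Dropout(0.2),   # randomly drop 20 % of the units during training
    layers.Dense(1),
])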
Tools and Resources Used
The problem statement gives the idea that the problem at hand is data driven, so Python and the following open-source libraries were used to build and evaluate the models.
Tensorflow
TensorFlow is a free and open-source software library, released under the Apache License 2.0, for dataflow and differentiable programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks.
Keras
Keras is also a Python package, available for free and licensed under MIT. Keras can run on top of TensorFlow, CNTK, Theano or PlaidML. It is designed to enable fast experimentation with deep neural networks.
Numpy
NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on them. It is licensed under BSD [23].
scikit-learn
scikit-learn is a free machine learning library for Python designed to inter-operate with the Python numerical and scientific libraries NumPy and SciPy.
General Architecture
The General Architecture of the models is shown in Fig. 2.10. The first
task is to find the right Data set to start analyzing and decide on the right
amount required for training. It is always good practice to obtain the data and carefully scan it for missing values and irregularities. Every
model goes through the initial process of preparing and cleaning the Data.
Cleaning the data refers to carefully combing through the data in search of
missing Data point or irregularities and getting rid of them either by finding
the right data or averaging the data that is available. Then the data is subjected to pre-processing and scaling, and split into the training set and the test set. The validation set is obtained from the training set as a split from different folds of the k-fold cross validation. The training set is fed into the model for training and carefully tuned
on the validation set based on the scoring criterion for the problem at hand, in
this case it will be the best possible MAPE and R2 score. When the training
is done, the results are predicted using the Test Set. These predicted Test
set Values are then compared with the Actual Load values using MAPE and
R2 for performance Evaluation. The real load values and the predicted load
values are plotted on an hourly basis to get a visual representation of how good the predictions are.
Figure 2.10. General Architecture
Chapter 3
Dataset
Machine learning algorithms require adequate data sets for the models to train and make predictions. The real load data set used in this research was obtained from the Electric Reliability Council of Texas (ERCOT) [25]. It contains the real load values that were supplied to the consumers in the city of Dallas Fort-Worth for the entire year of 2018, with the load values recorded for every hour. The weather plays an important role in how the load pattern behaves, so it was only natural to include the weather data in the input data set. The hourly weather data was obtained from the National Oceanic and Atmospheric Administration (NOAA) for the Dallas Fort-Worth area [26]. These data were properly indexed by the date and time. There are different types of variables that are encountered in this data set:
• Independent variable: a variable that is changed or controlled in a scientific experiment; here, the input features of the data set.
• Dependent variable: the variable being tested and predicted; here, the actual load demand.
• Categorical variable: a variable whose values are names or labels. From Table (3.1), Hour of the Day, Holiday or Not, and Weekend or Not are all categorical variables that are just labels with no numeric ordering.
Data set Preparation
The real load values were lagged by 24 hours from the current hour of the target dependent variable [4]. The load values were also averaged over the previous 24 hours for the current hour of the target dependent variable. Load from the previous day strongly influences the current demand and hence was included as part of the data set. The data set was carefully scanned for missing values. The input features are listed in Table 3.1.
Table 3.1. Input Features
Index Features
1 Date (MM/DD/YYYY)
2 Hour of the Day (0 – 23)
3 Holiday or not (0 or 1)
4 Weekend or not (0 or 1)
5 Lagged Load by 24 Hours (MWh)
6 Average Load from 24 Hours ago (MWh)
7 Weather Dataset
8 Actual Load Demand from ERCOT
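A rough sketch of how the lagged and averaged load features (indexes 5 and 6 above) could be built with pandas; the column names and frame are illustrative, not the ones used in the thesis:

import pandas as pd

# Hourly load indexed by row order (illustrative frame, one year of hours)
df = pd.DataFrame({
    "load_mwh": [1000.0, 1010.0, 990.0, 1020.0] * 2190,
})

# Feature 5: load lagged by 24 hours
df["load_lag_24h"] = df["load_mwh"].shift(24)

# Feature 6: average load over the previous 24 hours
df["load_avg_24h"] = df["load_mwh"].shift(1).rolling(window=24).mean()

# Drop the first day, which has no complete 24-hour history
df = df.dropna()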
Data set Pre-Processing
Data pre-processing was used to get rid of the noise and irregularities that existed in the data set. It may also be defined as a data mining technique that transforms raw data into a format suitable for machine learning models. It is clear from Table 3.1 that the actual load data set is the target dependent variable to be predicted, and that indexes 3 and 4 from the table correspond to the categorical variables of the data set. It is common practice to center the feature data set around zero and then normalize it. The load values and the weather data set were normalized with zero mean and unit variance using the StandardScaler:
z = (x − u) / s    (3.1)
From Equation (3.1), x refers to the input samples, u is the mean and s is the standard deviation of the data set [28]. Below is a sample table of what the scaled load values from the StandardScaler look like. There is another way in which the data can be pre-processed, which is the MinMaxScaler.
Table 3.2. Standard Scaler
Equations (3.2) and (3.3) describe the MinMaxScaler, where min and max are the user-input feature range values and x_min and x_max are the minimum and maximum of the feature on the training set:
X_std = (x − x_min) / (x_max − x_min)    (3.2)
X_scaled = X_std * (max − min) + min    (3.3)
This estimator scales and translates each feature individually such that it lies in the given range on the training set, e.g. between zero and one. Table (3.3) shows the scaling from a MinMaxScaler for the load values when the user-input feature range is (-2, 2) [29].
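A small sketch of both scalers with scikit-learn, fit on a toy load column; fitting on the training portion only is an assumption consistent with the splitting described later, and the numbers are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

load_train = np.array([[980.0], [1020.0], [1100.0], [1050.0]])  # MWh

# Zero mean, unit variance (Equation 3.1)
std_scaler = StandardScaler()
load_std = std_scaler.fit_transform(load_train)

# Scale into the user-supplied feature range (-2, 2) (Equations 3.2-3.3)
mm_scaler = MinMaxScaler(feature_range=(-2, 2))
load_mm = mm_scaler.fit_transform(load_train)

print(load_std.ravel())
print(load_mm.ravel())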
The Weekend or Not and Holiday or Not data were encoded using a combination of a One Hot Encoder and a Label Encoder, to establish the fact that these values are categories rather than ordered quantities.
Table 3.3. MinMaxScaler
Label Encoding
Label encoding is used to encode categorical variables with values from 0 to N−1, where N corresponds to the number of classes. The problem here is that, since there are different numbers in the same column, the model will misunderstand the data to be in some kind of order, 0 < 1 < 2. But this is not the case at all. To overcome this problem, one hot encoding is applied on top of the label-encoded values.
Table 3.4. Label Encoding
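A sketch of the two encoders with scikit-learn; the column values are illustrative stand-ins for the holiday and weekend flags described above:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

day_type = np.array(["weekday", "weekend", "weekday", "holiday"])

# Label encoding: map each category to an integer 0..N-1
labels = LabelEncoder().fit_transform(day_type)

# One hot encoding: one binary column per category, removing any false ordering
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(labels)
print(onehot)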
The entire data set is split into a training set and a test set. The training set contains the first full nine months of data and the test set the last three months; the split is about 67 percent for the training set and 33 percent for the test set. Further, the training set is subjected to k-fold cross validation, where the model is tuned on the validation set for each of the folds to obtain the best possible accuracy and allow the model to generalize. The main reason for splitting the data set is to tackle the problem of overfitting [8]. The test set is left undisturbed until the model is properly tuned on the validation set, and it is only used to evaluate and compare the accuracy of the different models explored in this research. The training set consists of 6552 samples and the test set of 2208 samples [27]. Some of the machine learning models in this thesis use only part of the data set and others use the entire data set, based on the scenario and desired application. Univariate analysis refers to analysis where the data being analyzed contains only one variable, whereas multivariate analysis uses multiple variables or features.
Chapter 4
Artificial Neural Networks
Artificial Neural Networks are based on the temporal lobe of the human brain, as stated before. They take their influence from the temporal lobe's ability to retain long-term memories, or its ability to learn from past experiences. The ANN model was first designed with the use of only the load values to see how
they perform. As expected, the ANN model performed poorly with only the load data set for training; the MAPE was high for single-feature training. This supports the fact that the ANN model benefits more as the number of features in the data set increases. The ANN model was then designed to fit the scenario where the data set comprises both the load values and their corresponding weather data. A Smart City or a Smart Grid is the perfect example where both the load values and the weather data set are available, and that is the reason for choosing both the load and weather data sets as input to train, validate and test. The ANN model consists of fully connected Dense layers: one input layer, one hidden layer and a single-value output layer. The hidden layer had about 20 neurons [8, 9]. With more than two hidden layers the network would be considered a deep learning model, which is not required in this case, as the number of input samples is small when
compared to the input samples used in deep learning models. The ANN is a feed-forward neural network where the data travels only in one direction, from the input to the output. The weights are updated using the back propagation algorithm. These layers are fitted with dropout regularization to tackle overfitting. The ReLU activation function was used since we are dealing with continuous values, and it also introduces non-linearity to the model [20]. The layers use the L1 norm as the kernel regularizer and the L2 norm as the activity regularizer, which helps tackle overfitting [19].
The model was compiled with the Adam optimizer and mean squared error (MSE) loss, trained with five folds of cross validation, and thoroughly tuned on the validation set for an optimal result without overfitting. The plot of the actual against the predicted load values for the month of October from the test set is shown in Fig (5.1); the average MAPE of 6.45 % was for the entire three months of the test set [4]. The ANN is the type of model that learns well from prior experiences, or epochs, and updates the weights after each epoch. The more meaningful the data set it has as input to train on, the better the accuracy. Hence, the ANN model is best suited for Smart Cities or Smart Grids, where ample load data and good weather data are available.
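A sketch of an ANN of this shape in Keras, following the description above (one hidden Dense layer of about 20 neurons, ReLU, dropout, L1/L2 regularization, Adam with MSE loss); the input dimension, dropout rate and regularization strengths are assumptions, since the thesis does not list them:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features = 8  # assumed number of input features (load lags, weather, calendar flags)

ann = keras.Sequential([
    layers.Dense(
        20,
        activation="relu",
        input_shape=(n_features,),
        kernel_regularizer=regularizers.l1(1e-4),
        activity_regularizer=regularizers.l2(1e-4),
    ),
    layers.Dropout(0.2),
    layers.Dense(1),  # single-value output: next-hour load
])

ann.compile(optimizer="adam", loss="mse")
# ann.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))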
Convolutional Neural Networks - Univariate
Convolutional Neural Networks are based on the occipital lobe of the human brain: their connectivity pattern resembles that of neurons in the visual cortex, and they were inspired by the organization of the visual field. CNNs are, at their core, standard three-layer networks, but because of the weight sharing the final output does not depend upon the position of the input pattern. Due to the smaller number of parameters, a CNN can be trained smoothly and does not suffer from overfitting as easily. A typical CNN consists of a convolutional layer, a pooling layer, an activation function and a fully connected layer, as shown in the figure. The weight vector, also known as the filter or kernel, slides over the input vector to generate the feature map. This method of sliding the filter horizontally and vertically extracts N features from the input in a single layer representation. Max pooling then takes the maximum, or largest, value in each patch of each feature map. Next comes dimensionality reduction, where the results are down-sampled or pooled feature maps that highlight the most present feature in the patch, rather than the average presence of the feature as in the case of average pooling. This has been found to work better in practice than average pooling for computer vision tasks like image classification. For the activation, the Rectified Linear Unit (ReLU) has proved itself better than other activation functions [8, 33].
1-Dimensional CNN
One-dimensional convolutions have their roots in digital signal processing (DSP). 1-D CNNs use the mathematical convolution from signal processing, where two signals are integrated with one of them time-flipped, also known as the discrete-time convolution. A 1-D CNN uses vectors as inputs and outputs rather than the matrices involved with 2-D CNNs. Convolutional neural networks are most often applied to image data, which are converted to a 2-dimensional array from which the model learns; in this case, the inputs are just 1-dimensional continuous load values. The input layer had 8 filters with a kernel size of 3 and the hidden layer had 16 filters with a kernel size of 3. Inputs fed into a CNN are processed according to the batch size, the number of steps and the number of channels to fit a 3-D tensor. The inputs were the previous 24 hours of real load leading up to the current hour, and the model trains on the 25th hour as the output. The CNN layers are also subjected to the L1 and L2 norms and Dropout classes to tackle overfitting [19]. Just before the output layer, the CNN model must flatten the 3-D tensor to get the appropriate shape for the output layer. The CNN model was fed with the training set and validated on different folds of the cross validation to optimally tune the model. After being fitted on the training set, the CNN model was used to predict the load values using the inputs from the test set. The average MAPE score for the entire test set was about 2.98 %, and the plot of the actual against the predicted load values for the month of October from the test set is shown in Fig. 3. The univariate CNN is well suited when only the load data set is available for manipulation. CNN models of this kind are best suited for scenarios where the load data set is the only data available for training, like a remote village or a hilltop where the weather data set is not recorded, or when the weather data set is highly unpredictable due to the region where it is located [34, 35, 5].
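A sketch of a univariate 1-D CNN of this shape in Keras (8 and 16 filters, kernel size 3, a flatten step and a single output neuron), consuming 24-hour windows of load; the dropout rate, the synthetic series and the window-building helper are illustrative assumptions:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_windows(series, width=24):
    # Build (samples, 24, 1) inputs and the 25th-hour targets
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X[..., np.newaxis], y

cnn = keras.Sequential([
    layers.Conv1D(8, kernel_size=3, activation="relu", input_shape=(24, 1)),
    layers.Conv1D(16, kernel_size=3, activation="relu"),
    layers.Dropout(0.2),
    layers.Flatten(),          # collapse the 3-D tensor before the output layer
    layers.Dense(1),           # predicted load for the next hour
])
cnn.compile(optimizer="adam", loss="mse")

# Illustrative usage on a synthetic daily load pattern
load = 1000 + 100 * np.sin(np.arange(8760) * 2 * np.pi / 24)
X, y = make_windows(load)
# cnn.fit(X, y, epochs=10, validation_split=0.2)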
Convolutional Neural Networks - Multivariate
In the multivariate case the input refers to all the features from the data set. The input layer had 8 filters with a kernel size of 3 and the hidden layer had 16 filters with a kernel size of 3, and the output layer had a single neuron for the predictions; the multivariate CNN otherwise uses the same parameters. After being trained on all the features from the data set, the multivariate CNN returned an average MAPE score of 4.10 percent, which is close to that of the univariate model but not an improvement over it. This model assumes the availability of the full data set for load forecasting, which is a type of scenario that only exists in Smart Cities and Smart Grids or other booming cities like New York, where the weather and load data sets are readily available [34, 35, 5].
Recurrent Neural Networks - Univariate
The RNN is based on the frontal lobe of the brain, which handles the things a human remembers for a short period of time. An RNN is not a feed-forward neural network, as it has in its architecture Long Short-Term Memory (LSTM) units, which allow the model to learn from recent experiences by feeding outputs back into the network. It is one of the most popular time series prediction models. The recurrent "unfolded" architecture, as shown in Fig (4.4), has output unit values fed back and duplicated as auxiliary inputs, augmenting the traditional feed-forward inputs. Plain recurrent networks struggle with long, dependent signals whose structure varies over fairly short periods, because the error gets diluted when passed back through the layers many times [33, 8, 36].
LSTM
LSTMs were introduced to overcome the vanishing gradient problem. Each LSTM cell has a cell state vector Ct, so that the next step can choose to read, write or reset the cell using an explicit gating mechanism. There are three gates in each LSTM cell, acting as binary gates.
Figure 4.4. LSTM [37]
The three gates are the input gate i_t, which decides whether the memory cell is updated, the forget gate f_t, which controls whether the memory cell is reset to zero, and the output gate o_t, which controls whether the information of the current cell state is made visible or not. The three gates are based on a sigmoid activation function because it constitutes a smooth curve from zero to one and the gate variable is differentiable. Apart from these three gates there is one other vector, C̄t, that modifies the cell state with a tanh activation function, because with a zero-centered range a long sum operation will distribute the gradients well. Each gate takes the hidden state and the current input x as its inputs, and the hidden state h_t is passed on to the next time step.
Figure 4.5. LSTM - Cell [37]
The RNN updates its weights by backpropagating the error through its short-term recurrent connections; this backpropagation through time paved the foundation for the RNN [36]. The RNN's input must be processed into samples, time steps and features. The difference between a CNN and an RNN is that the RNN can handle data with unknown lengths, or in other words it can handle dynamic lengths for both inputs and outputs. The RNN had three layers: the input, hidden and output layers. The input layer had 50 units, the hidden layer had 100 units, and the output layer had a single neuron for the prediction. An RNN can handle sequential data whereas a CNN cannot. The input
here is the same as it was for the CNN, using the previous 24 hours of load data as the input variables and the 25th hour as the target variable to be trained on, and this recurs over the entire data set. The network has one input LSTM layer, two hidden layers with Dropout classes to deal with overfitting, and one Dense output layer [20]. These layers are also subjected to the L1 and L2 norms, like the ones that were employed earlier in the ANN, to tackle overfitting [19]. The RNN model was fed with the training set and validated on different folds of the cross validation to optimally tune the model. After being fitted on the training set, the RNN model was used to predict the load values using the inputs from the test set. The resulting average MAPE was about 2.44 percent for the test set. The RNN takes the longest time to train when compared to the other models, and its accuracy improves when working on bigger data sets. Taking into account that the RNN only had the load data set to train on, it provided the best MAPE score for a univariate model. This also suits a scenario where a particular region has only the load values available for training.
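A sketch of a univariate LSTM model in Keras along these lines (LSTM layers of 50 and 100 units, dropout, a single-neuron Dense output), reusing the 24-hour window builder from the CNN sketch; the exact layer count and rates in the thesis are not fully specified, so treat these as assumptions:

from tensorflow import keras
from tensorflow.keras import layers

rnn = keras.Sequential([
    layers.LSTM(50, return_sequences=True, input_shape=(24, 1)),
    layers.Dropout(0.2),
    layers.LSTM(100),          # last recurrent layer returns only the final state
    layers.Dropout(0.2),
    layers.Dense(1),           # predicted load for the 25th hour
])
rnn.compile(optimizer="adam", loss="mse")
# rnn.fit(X, y, epochs=10, validation_split=0.2)   # X, y as built by make_windows()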
Recurrent Neural Networks - Multivariate
The RNN, being the most sought-after time series prediction model, was also evaluated in a multivariate setting. The multivariate RNN has eighteen features, which are time-lagged over 24 time stamps. It had three layers: the input, hidden and output layers. The input layer had 50 units, the hidden layer had 100 units, and there was a single neuron in the output layer for the load predictions. The multivariate RNN returned a MAPE of 3.06 percent on the test set, which is the best among the neural networks, and it only improves as the data set gets bigger. This model also rests on the assumption that all of the data set is readily available for training, and hence falls under the Smart Grid and Smart City scenario. The multivariate RNN takes the longest to run in terms of computational time, which should also be taken into account.
Support Vector Machines - Regression
A Support Vector Machine represents the training examples as points in space, mapped so that the examples of the separate categories are divided by as wide a gap as possible and can always be separated by a hyperplane. They are mainly used for classification, and are regarded, in short, as large-margin linear classifiers. In the soft-margin formulation (4.1), the ξ_i are known as the slack variables. The goal is to make the margin as wide as possible while keeping the slack small, by minimizing
J(w, w0, ξ) = (1/2) * ||w||² + C * Σ_{i=1}^{N} I(ξ_i)    (4.2)
In (4.2) the parameter C is a positive constant that controls the relative influence of the two competing terms; it is also referred to as the cost. For continuous targets such as load values, this is where Support Vector Regression (SVR) comes into play. SVR relies on pre-processing the data from the original feature space into a higher dimension, where a hyperplane can fit the data, and it does not run like a neural network. It uses the kernel trick to map the lower-dimensional data to a higher dimension. Hyperplanes are the surfaces used to separate or classify different data points in the data set, and SVR tries to maximize the boundary margin around the hyperplane. Support vectors are the data points, or vectors, that either lie on the hyperplanes or inside the boundary, and they constitute the weights for the boundary lines. In the same way as with classification, there are margin bounds given for regression. SVR relies on defining a loss function that ignores errors situated within a certain distance of the true value; this type of function is often called the epsilon-insensitive loss function. The figure shows the epsilon-insensitive band; the slack variables measure the cost of the errors on the training points and are zero for all points that are inside the band [40, 41, 42, 8]. SVM regression performs linear regression in the high-dimensional feature space using the epsilon-insensitive loss and, at the same time, tries to reduce model complexity by minimizing ||w||². This can be described by means of the following functional:
minimize J(w, w0, ξ, ξ*) = (1/2) * ||w||² + C * Σ_{i=1}^{N} (ξ_i + ξ_i*)    (4.3)
subject to: y_i − w · x_i − b <= ε + ξ_i    (4.4)
w · x_i + b − y_i <= ε + ξ_i*    (4.5)
ξ_i, ξ_i* >= 0    (4.6)
The SVR was initially designed with only the load values as input to the training set, to see how it performs with fewer features. With limited data and feature availability it provided prediction results that were fairly acceptable: a 5.61 % MAPE and almost a 70 % R2 score, given the limited availability of data and features. Below is the actual plot for this univariate case. Epsilon is the tolerance for the margin. The SVR was then trained on the training set comprising both the load values and their corresponding weather data. Epsilon was set to 0.01 after tuning on the validation set, and the cost of tolerance was set to 0.1. This SVR model used the radial basis function as the kernel. The training time was comparatively faster than for the other models. SVRs usually give better results with smaller data sets, as seen from Fig. (5.6),
plotted from the test set with the actual load values and the predicted load values. The kernel used in this model is the radial basis function, and after some trials the value of epsilon was optimized to yield the best results. The SVR gave about 1.46 percent MAPE on the test set and had the best result in terms of computational time. SVR generally overfits when there is a larger data set; hence it is well suited for a scenario where the model has access to both the load values and their corresponding weather data, given that the data set is smaller. SVRs are particularly good at load forecasting in a smaller, controlled environment with a much smaller data set. They usually underperform on larger data sets, where their complex kernel computations become expensive.
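A sketch of an SVR configured as described above (RBF kernel, epsilon 0.01); using scikit-learn's C parameter for the "cost" value and adding feature scaling in a pipeline are assumptions consistent with the pre-processing chapter, and the data are synthetic:

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative training data: input features and hourly load targets
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6552, 8))
y_train = 1000 + 100 * rng.normal(size=6552)

svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", epsilon=0.01, C=0.1),
)
svr.fit(X_train, y_train)
# predictions = svr.predict(X_test)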
Chapter 5
Results
Scenarios
The Machine Learning models in this Thesis were designed for real life
scenarios that the utility companies face. Some of the factors that affect these
scenarios are Data set availability, geographical location, weather and an ideal
situation.
Case - I
Machine learning models rarely have an ideal situation where the data set is free of noise and just large enough for training the model to get the best possible outcome. This means the data set has all the necessary and relevant features, like weather and load values, for every hour. This type of data set is only available in settings like a Smart Grid or a Smart City. This case serves as a benchmark for the other cases that follow, and this scenario can also be regarded as the ideal case.
Case - II
There are cases where the geographical location plays a very important role in data set availability. The data set is only fully available in big booming cities, which is not the case for remote cities that still need load forecasting. The weather and the load data set may not be fully available for these places, which ends up requiring substituted values that are not exactly in coherence with the actual pattern; this in turn leads to irregularities in the predicted values.
Case - III
This case depends on the integrity of the actual data set. Some places have noisy or unreliable weather recordings, which will end up harming the desired outcome of our load predictions. In these cases the weather data set is dropped due to the insignificance of its features. Hence, at the end, there are only the load values as input to the models.
Case - IV
The last case is a situation where the required data set is not actually enough to make proper predictions. In this case the models are designed to learn from smaller data sets, in the worst case with only one feature. This type of situation occurs quite often in developing cities where data set availability is limited.
ANN - Multivariate
The ANN multivariate model was the worst-performing model in terms of both the MAPE and R2 score, yet it provided acceptable errors of around 4 %. This will improve if the model is fed a larger data set; hence it falls under Case I, as it is best suited to larger data sets like the ones available in Smart Cities and Smart Grids.
CNN - Univariate
The one-dimensional CNN univariate model performed better than expected. It comes under both Cases II and III because of its ability to perform even with a limited data set. The important thing to note here is its computational time, which is much lower compared to traditional time series prediction models.
CNN - Multivariate
The CNN multivariate model only gave slight changes relative to its univariate counterpart. This model can fall under Case I if there is a time constraint where the predictions are to be given out in a short period of time. These models train faster when compared to the other multivariate models in this thesis.
RNN - Univariate
The RNN univariate model in this research gives one of the most stable results. These are traditional time series prediction models but train much more slowly when compared to the other models. They fall under Cases II and III.
RNN - Multivariate
The RNN multivariate model is the best among the time series prediction models. It performs better than its univariate counterpart, as expected, but is the slowest among the models. This model also fits Cases II and III if the full data set is available.
Support Vector Machines - Regression
The SVR model performs the best when compared to the other models. It fits Case I as well as Case IV, where there is a limitation on the data set, since SVMs perform well with smaller data sets.
Performance Comparison
The models in this thesis are scored on MAPE and the R-squared accuracy score, which are regarded as the conventional measures of accuracy for regression analysis. Time being an important resource, the models were also evaluated on the computational time for training over the training data set. The times observed in Table 5.7 are only in seconds, which is modest because we are only working with a year's worth of data for performance evaluation; with more data the actual training time would be a lot longer. Table 5.7 also shows the SVR as the most successful model, followed by the RNN model. The RNN takes the most time to train, as it deals with complex and taxing computations in its recurrent layers. The accuracies of the three neural network models will improve with a larger data set.
Chapter 6
Conclusion
No single model fits every situation, as each has its own benefits and shortcomings. The Recurrent Neural Network multivariate model and Support Vector Machines stand out with their results on the average MAPE, R-squared score and computational time. SVR works better with a smaller data set, given that it has all the required features for forecasting; hence it is used in a scenario where the environment is controlled, like a building where all the features for load forecasting are readily available but only in smaller data sets. The RNN's accuracy improves with a bigger data set provided to the model for training, but like SVR it requires all the features in the data set to perform at its best. The RNN multivariate model is well suited for a scenario where the data set has all the features and is also large; such an environment exists in Smart Cities or Smart Grids, where they have all the data they need to train their models and hence provide good results on the load forecasts [31]. The
Recurrent Neural Network and Convolutional Neural Network models would also perform even better in terms of their current MAPE and R-squared scores if they were fed a larger data set for training. The RNN and CNN univariate models were trained only with the load values; they did not use the weather data set for training, and even then they provided MAPE values which were almost similar. Even though the MAPE for these models is on the higher side compared to the other models, they are good results considering they were trained only on the actual load values as the primary feature. The RNN and CNN univariate models fit a scenario where there may not be a data set with all the features, like a remote city with only the actual load values available for training. Though the results of the CNN are quite similar to the RNN, in practice RNNs are much better time-series prediction models [6]. The RNN and CNN univariate models also work well for the scenario where the forecasting must be done only with the load values, which is what happens in many cases today where the weather data set is available only for a certain period of time and missing for the rest; this is mainly because the weather is highly unpredictable in certain areas and cannot really be relied upon for forecasting there. Hence, it really depends on the data set availability to determine which model should be used for load forecasting in a given scenario.
Future Work
The models explored in this thesis handled the available data and provided quick results, but they all focused on supervised learning. Unsupervised learning techniques could be used to extract structure from the data and then feed it into the actual networks, which gives rise to hybrid models for load forecasting. These hybrid models are time consuming compared to pure supervised learning models but would provide new insights from the data and potentially better forecasts.
Appendix A
Definitions
• Bias: the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
• Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a set of data containing observations (or instances) whose category membership is known.
• Unlabelled Data: observations whose category membership or target value is unlabeled.
• Dimensionality Reduction: transforming data into a lower-dimensional representation, for example for visualization or compression.
• Geothermal Energy: heat energy generated and stored in the earth.
• Gradient Descent: an optimization algorithm used to minimize a loss function by iteratively moving in the direction of steepest descent, used here to train the model.
• Labelling: takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful tag that is informative or desirable to know.
• Neural Network: a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns.
• Regression: a measure of the relation between the mean value of one variable and the corresponding values of other variables; here, predicting a continuous output from a given input.
• Raw Data: naturally created artifacts that you can obtain relatively easily from the world.
• Utility Company: an electric utility is a company in the electric power industry that generates and distributes electricity for sale, often in a regulated market.
• Testing Data Set: the testing data set is a separate portion of the same data set, held out to evaluate the trained model.
• Overfitting: when a model learns the random noise in the training data, rather than the intended output, and therefore gives poor predictions on new data.
Appendix B
• MNIST: Mixed National Institute of Standards and Technology
Bibliography
[9] Jacek M Zurada. Introduction to artificial neural systems. Vol. 8.
West publishing company St. Paul, 1992.
[10] Yann LeCun and M Ranzato. “Deep learning tutorial”. In: Tutorials
in International Conference on Machine Learning (ICML’13).
Citeseer. 2013, p. 35.
cross-validation-explained-evaluating-esimator-performance-e51e5430ff85
(visited on 12/10/2018).
[19] Feiping Nie et al. “Efficient and robust feature selection via joint ℓ2,1-norms minimization”. In: Advances in Neural Information Processing Systems. 2010, pp. 1813–1821.
[25] ERCOT (Electric Reliability Council of Texas). 2018 Load Data set.
2018. url: http://www.ercot.gov (visited on 01/25/2019).
[28] David Cournapeau. Standard Scaler. 2018. url:
https://scikit-learn.org/stable/modules/generated/sklearn.
preprocessing.StandardScaler.html (visited on 10/12/2018).
[37] colah’s blog. Understanding the LSTM Networks. 2015. url:
https://colah.github.io/posts/2015-08-Understanding-LSTMs
(visited on 10/10/2018).