
Applied Energy 350 (2023) 121705


Residential energy consumption forecasting using deep learning models


Paulo Vitor B. Ramos (a), Saulo Moraes Villela (a,*), Walquiria N. Silva (b), Bruno H. Dias (b)

(a) Department of Computer Science, Federal University of Juiz de Fora (UFJF), Juiz de Fora, Minas Gerais, Brazil
(b) Department of Electrical Energy, Federal University of Juiz de Fora (UFJF), Juiz de Fora, Minas Gerais, Brazil

* Corresponding author. E-mail address: [email protected] (S.M. Villela).
https://doi.org/10.1016/j.apenergy.2023.121705
Received 28 May 2023; Received in revised form 20 July 2023; Accepted 31 July 2023; Available online 17 August 2023
0306-2619/© 2023 Elsevier Ltd. All rights reserved.

ARTICLE INFO

Keywords: Deep learning; Demand forecasting; Electric energy; Feature selection; Multivariate time series

ABSTRACT

The energy sector plays an important role in socioeconomic and environmental development. Accurately forecasting energy demand across various time horizons can yield substantial advantages, such as better planning and management of energy resources. Different methodologies, including mathematical, statistical, and machine learning models, have been proposed for energy consumption prediction. Nevertheless, some studies claim that deep learning models can outperform other approaches when dealing with time series data characterized by a high level of granularity. There are different implementations of deep learning models, such as recurrent unit layers, dimensional convolutions, or Transformers. Hence, it is interesting to compare the performance of different architectural types using a methodology that utilizes feature selection and model interpretation to underscore the significance of meteorological features and timestamps in the forecasting task. This study presents a comparative methodology of forecasting approaches employing recurrence-based models: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Time Series Transformer (TST), while exploring the relationship between model performance and resource allocation related to the number of features and time resolution. The findings reveal that the Transformer architecture works better with fewer samples, despite the occurrence of overfitting. In addition, training the models enabled the creation of a voting ensemble, optimized by the Simulated Annealing metaheuristic, resulting in a 23% improvement at the hourly granularity.

1. Introduction

The role of the electric sector is crucial for socioeconomic and environmental development [1,2]. The level of electricity consumption in a society is an important indicator of its sustainable development, both in economic and environmental terms [3]. Therefore, it is essential to continuously improve studies related to electricity consumption issues to assist in various areas [4], such as consumption management to achieve resource savings, increased energy efficiency, driving technological innovation, promoting environmental sustainability, and other actions for the sector's development [5,6].

In recent years, forecasting energy consumption and related demand variables has been widely studied in the electric sector [7]. Inferring these predictions in different time horizons is attractive due to their fundamental applicability for the sector's operation and planning [4,8]. Case studies can be found focusing on predicting energy demand in residential [9] and industrial [10] environments, where the main objectives are subdivided into effective management of consumption, cost savings, continuous energy supply, increasing efficiency, and system operational performance [11]. Thus, demand forecasting models can assist in the intelligent coordination of inputs linked to technological innovations and sector digitization [12].

Forecast models are articulated by different sets of variables, such as the forecasting of electricity demand. Among the determinants of demand are the forecast horizons (long-term, medium-term, and short-term), the level of aggregation of the load, climate, and socioeconomic activities [13]. The time horizons also define the forecast applications to be made. Examples include (i.) long-term forecasts that provide inferences to support transmission investment decisions [14]; (ii.) medium-term forecasts applicable to the planning process of preventive maintenance of the distribution system, guaranteeing principles of continuity and security of the energy system [15]; and (iii.) short-term forecasts that consider daily seasonal aspects and have applications in the commercialization of energy generated by prosumers [16].


By considering these different time horizons, it is possible to address a wide range of demand forecasting applications. However, there are still areas that lack research. An example of this is explained by Hong et al. [17], which highlights the recurrence of case studies focused on high voltage, which have applications aimed at long-term inference. On the other hand, there has not yet been an extensive investigation into the current state of the art of low voltage level predictions for short-term applications: except at the smart meter level targeting residential applications [18], very few articles focus on the secondary or even primary level of the distribution network substation [19].

Short-term forecasting applications typically involve predicting electricity consumption in the hours, days, or weeks ahead [20]. They find utility in electricity market applications for daily supply planning and demand-side management [13]. For such applications, deep learning models (DLMs) can be effectively utilized [21], including both traditional approaches and those recently incorporated into the state of the art. Petropoulos et al. [22] highlight the use of statistical models and hybrid solutions that combine entirely new and customized vector machines or DLMs. The study also discusses the inference of objects using these approaches.

According to Petropoulos et al. [22], many approaches have been described for short-term demand forecasting, particularly those based on deep learning. These models often take into account climatic and temporal characteristics that are correlated with energy consumption. However, Hong and Wang [23] emphasized that despite the abundance of studies in this area, it is necessary to validate such methods, explore preprocessing techniques, and conduct additional analyses to assess the impact of each variable on the predicted outcome.

Following the lack of research on the field expressed by Haben et al. [19] and Hong et al. [17], this study aims to fill this knowledge gap by conducting a comprehensive literature review on deep learning models (DLMs) specifically focused on short-term energy demand forecasting. The objective is to identify different methodologies on data preprocessing, multivariate time series analysis, and algorithms for model optimization intended to improve the prediction of these DLMs. In doing so, we aim to provide sufficient contributions to the new body of knowledge in this field.

Guided by this objective, this study presents a comparison between recurrence-based models and transformer-based models for residential energy demand inference, both commonly seen in the literature review. Recurrent models, such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), belong to the traditional class, while transformer-based models represent the state-of-the-art approach. By comparing these models, we aim to contribute novel insights into the performance, interpretability, and feature importance analysis of different models for short-term energy demand forecasting.

The main contributions of this study are as follows:

• Analysis of model performance during the training and evaluation phases through case studies involving different sample sizes and model classes. This analysis provides novel insights into the specific characteristics and the impact of reducing readings during the training process.
• Verification of changes in model performance during the evaluation stage by selecting features in two situations: considering the entire dataset and considering the model specificity, here using Shapley Additive Explanations (SHAP) values. This analysis contributes to the understanding of the importance levels of all features linked to the forecast of the research object.
• Description of the ensemble by voting guided by the metaheuristic Simulated Annealing (SA) to reduce prediction error, and comparison of its performance with that obtained by each model in their respective case studies. Here, this study provides novel insights into ensemble techniques for energy demand forecasting by using the Mean Squared Error (MSE) metric as a guide function to improve the voting distribution.
• Comparison of the achieved performance by considering all factors related to the prediction of energy consumption: type of Deep Learning (DL) model, number of attributes involved, post-training selection, and ensemble by voting. This comparison contributes to the understanding of the strengths and limitations of different models and techniques in the context of short-term energy demand forecasting.

In summary, the main objective of this work is to compare different DL models for residential energy demand inference. The forecasting methods used include recurrent models such as the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), as well as the Time Series Transformer (TST). Using these methods, the knowledge gap in short-term energy demand forecasting at the low voltage level is addressed by providing novel insights into the performance and interpretability of recurrence-based and transformer-based models. The comparison allows describing the preprocessing process, the treatment of multivariate series in time windows, and the process of interpreting the level of impact of each attribute. In addition, this study addresses feature importance and selection within its scope. It can contribute to further studies in designing DL architectures and innovating pre-processing techniques or constructing new forms of training models.

The article is structured as follows: In Section 2, a background is presented, providing details of the main models and the technique used for ensemble construction. The methodology used for data analysis and processing, along with the evaluation metrics for recurrent and transformer models, is described in Section 3. In Section 4, the datasets used in this study are presented. The simulation results are reported in Section 5, highlighting the main findings of this study. Finally, the research concludes with final comments in Section 6.

2. Brief background

In this section, a comprehensive literature review is presented, highlighting pertinent aspects related to the current study. Additionally, the fundamental concepts surrounding recurrent neural networks and Transformers are discussed. Finally, to enhance the understanding of the core concepts crucial to this paper, the utilization of Simulated Annealing as a guiding framework for ensemble construction is introduced.

2.1. Literature review

When analyzing multivariate time series of consumption, the work of Forootani et al. [24] addresses the importance of resource analysis. They present case studies that demonstrate the improved accuracy of predictive models when reducing attributes. The comparison conducted in the study is limited to the behavior of the recurrent Gated Recurrent Unit (GRU) model, with a particular emphasis on attribute importance in K-Nearest Neighbors (KNN). The authors also explore other significant aspects, including outlier inspections, the need for comparisons between machine learning and deep learning models, and the analysis of short-term residential load forecasting using DL techniques.

Still regarding the work of Forootani et al. [24], only one type of recurrent DL model was addressed, the GRU. However, there are other approaches in the literature, such as the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). Despite being considered evolutionary architectures, it is interesting to compare and evaluate the accuracy achieved by these models in the context of consumption forecasting applications. For instance, Kumar et al. [25] investigated the effectiveness of neural network architectures based on LSTM and GRU in predicting energy load, providing a performance comparison between the networks.

Furthermore, another noteworthy study is conducted by Abumohsen et al. [26], which compared the performance of recurrent networks, specifically LSTM, GRU, and RNN, in predicting electrical loads. The results indicated that the GRU model achieved the best performance, with the highest accuracy and the lowest error rate.

In addition to recurrence methodologies, other methods have been used to select significant parameters for developing forecast models, such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), as demonstrated by Chu et al. [27]. They used these methods for feature selection for LSTM and 1-D Convolutional Neural Network (CNN) networks to predict daily electricity consumption in residential buildings.


The results showed that, overall, LSTM performed better than the 1-D CNN algorithm. This type of feature selection method can also be seen in Yang et al. [28], which describes multi-horizon forecasting.

Another study by Oliveira et al. [29] provides a comparison of different approaches in deep learning. They applied and compared the performance of several deep learning algorithms, namely LSTM, CNN, mixed CNN-LSTM, and the Temporal Convolutional Network (TCN), in a real test for energy forecasting. The experimental results suggest that TCN is the most reliable method for short-term prediction of instant energy consumption.

Although the TCN architecture is considered state-of-the-art, the comparison would be more complete with the inclusion of the Transformer architecture, as shown in the work of Zerveas et al. [30]. They present a new framework for learning multivariate representations of time series based on the transformer encoder architecture. When evaluating the proposed structure, the authors conclude that the proposed model presents superior performance to the available methods for regression and classification when tested on different sets of time series.

The aforementioned studies focused on evaluating the accuracy of DL models using inference error metrics. While it is critical to evaluate these models by such metrics, it is equally important to interpret the way they generate their predictions. Zeng et al. [31] proposed incorporating model interpretability to improve dataset preprocessing and gain a better understanding of so-called black-box models, although their study was not specifically related to energy consumption prediction. The SHapley Additive exPlanations (SHAP) tool introduced in that study was able to determine the impact of each feature for a single prediction and for the entire test set.

When analyzing consumption forecasts using a multivariate series approach, it is crucial to consider external factors, such as weather conditions, as they can introduce uncertainty and randomness in short-term forecasts. In this regard, the work of Fei and Zhigang [32] is noteworthy. They conducted a correlation analysis between meteorological factors and short-term load forecasting using a neural network algorithm known as backpropagation (BP). The inclusion of weather, precipitation, and humidity as attributes in the forecast model effectively improved the accuracy of the proposed forecasts.

Still aiming to address the uncertainty in short-term load forecasting, and considering the multidimensionality of the load data and the continuity of time series, Bian et al. [33] developed a method for short-term energy load prediction based on the cumulative temperature effect using a TCN combined with a BP neural network. Compared to simple TCN and BP models, the method proposed in their article increased the training time but reduced the prediction time loss and improved the prediction accuracy.

In addition to the importance of features in model prediction, it is interesting to highlight the impact factor of each attribute through two different perspectives: inference and subset. This perspective is discussed by Kim and Cho [34] when describing the autoencoder and by Kim and Cho [35] when evaluating the performance of long- and short-term features. Similarly, other researchers such as Gao and Ruan [36] and Kim and Cho [37] have developed specific models to showcase the impact of features. However, using the SHAP method provides an interesting alternative, as it determines these properties without the need to train a model specifically for that purpose.

The literature review carried out raised some interesting points about predicting energy consumption in a multivariate scenario, where multiple features influence the predicted outcome. As demonstrated, there are several approaches focusing on DL models, allowing a comparison between these models. However, it is necessary to restrict the metrics used for such comparisons. Based on the studies analyzed, including the review by Khalil et al. [9], recurrent models have been the traditional approach for object inference in this field of research.

In addition to evaluating performance using regression metrics, previous studies have emphasized the importance of feature selection to understand the impact of each attribute on the analyzed objects. This involves incorporating preprocessing steps such as using PCA to identify the attributes that best represent the dataset. Furthermore, the inclusion of model interpretability is crucial for understanding the collaboration between attributes and recognizing model characteristics [31]. Therefore, when evaluating the impact level of variables, it is essential to consider the specific characteristics of the highlighted models.

Furthermore, in order to obtain better results in terms of the accuracy of the separately trained models, a comparison enables the development of an ensemble through voting. Since there are many trainable models available for building this ensemble, as shown by Reddy A et al. [38], Wang et al. [39], and Hadjout et al. [40], an optimized way of distributing the collaboration weights of each model is needed to avoid creating a particular model for this purpose. Thus, using the models related to the ensemble, already trained with optimized weights and biases, can demonstrate better results when compared to the results obtained separately. Assigning appropriate weight values can be modeled as an optimization problem, whose solutions can be provided by the well-established means of metaheuristic algorithms [41].

2.2. Recurrent neural networks and ensemble

Artificial Neural Networks (ANN) are commonly employed for classification and regression tasks. One particular architecture, characterized by purely feedforward connections between Perceptrons, is referred to as the Multi-Layer Perceptron (MLP). In this architecture, individual nodes retain no internal state between inputs: past inputs influence the network only through the weight updates made during training.

Similar to tasks in Natural Language Processing (NLP), time series forecasting aims to infer certain values based on past readings of correlated features. To achieve this, the architecture of a Deep Learning Model (DLM) must be capable of incorporating past values to make accurate predictions [42].

The RNN architecture is relatively simple, as it includes a loop that represents the hidden state of the neuron. However, as the number of inputs increases, so does the complexity of adjusting the weights of the network. This can lead to a problem known as the vanishing gradient problem, as described by Bengio et al. [43].

To address the vanishing gradient problem, the LSTM architecture was introduced. The LSTM architecture overcomes the limitations of traditional RNNs by incorporating specialized memory cells and gating mechanisms that allow for better gradient flow and the capture of long-term dependencies in the data.

Due to the inability of the RNN model to handle long-term dependencies, the training process can become inefficient with the appearance of the vanishing gradient problem. To mitigate this, a review of weight manipulation considering a large number of layers is necessary, corresponding to a longer sequence of input data. However, good results can still be achieved through the implementation of the optimization function proposed by Kingma and Ba [44].

For the construction of short- and long-term memory, the LSTM model includes a component called the memory cell, along with the hidden state updated during the post-training steps [42,45]. Therefore, the neuron has both an output and a hidden state, providing a better representation of all the knowledge gained in previous training epochs. This information is filtered through gates to determine the usage of data.

The LSTM model, in addition to addressing the vanishing gradient problem, uses a complex decision-making system to guarantee the representativeness of the information. The development of the filtering mechanism through training allows the creation of short- and long-term memory characteristics. However, as described, the system can be complex due to the use of a new memory state. The GRU model becomes a compact solution to this decision-making process.

The GRU model, like LSTM, features separate memory states sent to other layers, but with a simpler structure. It accomplishes this using two gates: the reset gate, which controls the utilization of the previous state to create a temporary state, and the update gate, which combines this temporary state with another memory state generated during training. However, unlike LSTM, GRU passes only one memory state containing the updated information during the training process.
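To make the three recurrence-based architectures concrete, the following is a minimal sketch of how such forecasters can be declared in PyTorch, the framework used in this study (see Section 5). The hidden size, layer count, and class name are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class RecurrentForecaster(nn.Module):
    """One-step-ahead forecaster built on an interchangeable recurrent cell.

    `cell_type` selects nn.RNN, nn.LSTM, or nn.GRU; the hidden size and
    layer count here are illustrative, not the paper's exact settings.
    """

    def __init__(self, n_features: int, hidden_size: int = 64,
                 num_layers: int = 1, cell_type: str = "lstm"):
        super().__init__()
        cells = {"rnn": nn.RNN, "lstm": nn.LSTM, "gru": nn.GRU}
        # batch_first=True -> input shaped (batch, window, n_features)
        self.encoder = cells[cell_type](n_features, hidden_size,
                                        num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # single regression output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.encoder(x)          # out: (batch, window, hidden)
        return self.head(out[:, -1, :])   # predict from the last time step

# Example: a rolling window of 60 samples with 79 features, as in the paper.
model = RecurrentForecaster(n_features=79, cell_type="gru")
window = torch.randn(32, 60, 79)          # (batch, window, features)
prediction = model(window)                # shape: (32, 1)
```

Swapping `cell_type` is the only change needed to compare the three recurrent architectures under identical training conditions, which mirrors the comparison methodology of this study.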


The RNN, LSTM, and GRU architectures represent an evolution of models based on the persistence of state. All the knowledge accumulated in the training epochs is updated and forwarded to other layers to describe the information collected along the steps. Because of this recurrent nature, these models are commonly referred to as traditional models in the literature. However, there are also newer models that exhibit similar functionalities and processing mechanisms for input data.

The Transformer model is one of the state-of-the-art models that bring similar characteristics to recurrent models. Initially introduced by Vaswani et al. [46] for NLP tasks, it also exhibits a similar concept of state persistence. Additionally, Vaswani et al. [46] introduce a Multi-Head Attention mechanism that indexes sequences based on their importance, effectively representing the memory state incorporated in recurrent units.

However, considering the multivariate regression problem, Zerveas et al. [30] present the Time Series Transformer (TST) model, which is based on the Transformer architecture introduced by Vaswani et al. [46] but removes the decoder layer to present the model with only an encoder. The decoder, responsible for assigning the probabilities of a certain class, for example, was removed from the architecture. For the objective of regression, Zerveas et al. [30] describe the model with the feature of multiple indexing, using weighted layers of MLP networks to predict the variable.

By comparing separately trained DLM models, their individual characteristics can be combined through an ensemble approach. Each architecture described earlier has its unique characteristics, resulting in different error metrics. Constructing an ensemble allows for analyzing the potential improvement in accuracy compared to using each model in isolation.

The aforementioned paragraphs demonstrate that there are several models available that can be utilized as solutions for consumption forecasting. These models exhibit varying levels of accuracy. However, by leveraging their differences, it is possible to achieve better results compared to using them individually. This potential can be realized through the use of an ensemble methodology.

In the ensemble, a voting mechanism is employed, where the weights assigned to each model are determined based on the validation set. This process ensures an appropriate distribution of votes. To optimize the ensemble, an objective function can be defined to adjust the relative importance of the models within the ensemble, aiming for better performance.

Applying grid search for weight distribution in an ensemble can lead to a significant increase in computational cost. The time complexity of grid search is known to be O(n^x), where x is the number of models in the ensemble [47]. As the number of models increases, the time complexity of grid search grows exponentially, making it computationally expensive.

To mitigate this issue, using a metaheuristic algorithm guided by an objective function can be a more efficient approach. Metaheuristic algorithms, such as genetic algorithms or particle swarm optimization, offer a way to search for an optimal set of weights in a more cost-effective manner [48]. These algorithms leverage heuristics and optimization techniques to efficiently explore the weight space and find a set of weights that yields better ensemble performance while reducing the computational burden compared to grid search.

3. Methodology

As seen, one of the main objectives of this study is to compare different DL models for the inference of residential energy demand. This involves the preprocessing of the data and the process of interpreting the level of impact of each attribute. Therefore, the methodology of this study involves several steps, consisting of:

1. Preprocessing based on the Knowledge Discovery in Databases (KDD) technique [49–51], obtaining a feature selection based on PCA;
2. Training the models on datasets composed of all the features processed in the feature engineering step and those identified as significant by PCA in the feature selection step [52], with the loss measured by the MSE indicator in the validation and evaluation steps;
3. Building an optimized ensemble using the SA metaheuristic to obtain a solution;
4. Finally, using SHAP to interpret the behavior of the trained models, where the degree of impact of the features is identified for a single inference of the object and for the entire evaluation set.

The proposed methodology can be divided into steps described as preprocessing of the data and post-training model analysis, as described in Fig. 1. The following sections detail the decisions made regarding data collection, model training using a rolling window, and other features of the analysis method.

Fig. 1. Summary of the proposed approach.

3.1. Preprocessing step

The initial step involves collecting descriptive data on the object of the study's inference. It is important for the dataset to contain meteorological data or other sets that facilitate the estimation of the correlation of energy consumption to linked attributes.

Data collection includes a set of features that have numerous data gaps. Thus, before merging consumption readings with other related readings, feature engineering techniques are applied for data processing. During this step, collected data are organized by sampling time, null readings are addressed, and new features are derived by inference using a multivariate approach. The final step involves normalizing the data considering the range of [-1, 1]. If the chosen dataset lacks original meteorological data, other external datasets can be added to the descriptive consumption set, as long as the collection interval and location match.

The feature engineering step results in a dataset that has multiple attributes related to the studied consumption profile. The product of this preprocessing is a set of distinct datasets that describe the consumption profile of a person, property, or location. Since there are multiple linked properties, it is possible to use PCA to ascertain the importance of each of them to the object of inference. Therefore, by reducing the dimensionality of the dataset while maintaining its representativeness, PCA is used to construct a new dataset restricted to the most descriptive components of the entire previously analyzed dataset.

Two datasets are thus obtained: the originally preprocessed one and the dataset composed of the attributes selected as most important by PCA. These datasets go through the training, validation, and evaluation process, with respective proportions of 70%, 20%, and 10% of the total available samples. It is worth noting that, in this step, the temporal windowing method is applied to adapt the training: samples are grouped and linked to an expected result, to be used as an input parameter for model training. Fig. 2 graphically demonstrates the rolling window method used.

Fig. 2. Illustration of the rolling window method used.

Having the data divided into subsets for training, validation, and evaluation in the rolling window format concludes the preprocessing step. Furthermore, the PCA dataset, divided into the same subsets, provides a better analysis of the model training in two circumstances: using all the previously processed features and using those seen as most important by the PCA technique. With these two feature sets, the post-training step, presented in Section 3.2, describes the analysis involving the training and evaluation of the models.


3.2. Post-training step

After the models have been trained using the subsets in appropriate proportions to ensure fairness, it is important to evaluate the selected models using metrics commonly used in regression tasks. Each model is saved at the optimal epoch for each trained dataset, and the subsets are then evaluated based on metrics such as the MSE, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the R² score.

The MSE is calculated as the average squared difference between the actual values yᵢ and the predicted values ŷᵢ. The MAE represents the average absolute difference between the actual and predicted values. The MAPE is the average percentage difference between the actual and predicted values, expressed as a percentage. Lastly, the R² score is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables. These metrics provide a comprehensive evaluation of the models' performance and their ability to accurately predict the energy demand.
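Stated explicitly, with yᵢ the actual values, ŷᵢ the predicted values, ȳ their mean, and n the number of evaluated samples, these standard definitions are:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i\right\rvert
\qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i}\right\rvert
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```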
The evaluation step is based on the best epochs of the trained models, which enables the use of SHAP values to interpret the impact of each attribute for a single consumption inference, taking into perspective the best weights of the innermost layers of each DL model. For this purpose, the same test set is used to graphically verify the average contribution of each feature.

Using SHAP values not only contributes to interpreting the average impact of each attribute, for a single inference and for a set of inferences, but also establishes a ranking of the most important variables. Since each deep learning model is evaluated separately, it is possible to select the most impactful features while respecting the particularities of each model. Therefore, a new dataset is selected to be used again as input for the models and to be evaluated according to the specificity of the training architecture. This step serves to verify whether selecting the most important attributes for the specific model brings improvements in the evaluative results. It is important to note that this comparison is made with the other feature sets and not with the ensemble.

Since the preprocessed and the PCA-selected input sets provide different evaluative metrics for the models, this step also comprises the construction of a voting ensemble capable of decreasing the MSE.
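The SHAP-based interpretation step described above can be sketched with the SHAP library, which provides gradient-based explainers for PyTorch models. The variable names, the choice of `GradientExplainer`, and the sample counts are assumptions for illustration.

```python
import numpy as np
import shap
import torch

# `model` is one trained forecaster; `X_train` / `X_eval` are the windowed
# arrays from the preprocessing sketch in Section 3.1.
background = torch.tensor(X_train[:100], dtype=torch.float32)  # reference set
samples = torch.tensor(X_eval[:200], dtype=torch.float32)

explainer = shap.GradientExplainer(model, background)
shap_values = np.squeeze(np.asarray(explainer.shap_values(samples)))

# Average the absolute contributions over samples and window positions to
# obtain one importance score per feature, then rank features by impact.
importance = np.abs(shap_values).mean(axis=(0, 1))
ranking = np.argsort(importance)[::-1]
print("most impactful feature indices:", ranking[:10])
```

The per-sample values explain a single inference, while the averaged scores give the evaluation-set-wide ranking used later for model-specific feature selection.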


To achieve an optimized weight search without a high computational cost, metaheuristic algorithms are used to distribute the collaborative percentage of each trained model. Initially, the weights are equally distributed, and the algorithm uses the validation set to better distribute the weights of each model so that, in the end, a weighted average is made. In this study, the SA algorithm was chosen as the optimizer, with the MSE as the guide function, so that the algorithm benefits models with lower error indicators. Consequently, this choice provides a better weight distribution to the best models. The algorithm receives the weights and inferences of the models and reduces the error until obtaining an optimal result, that is, until the optimizer finds the system crystallization.

Inspired by Brito et al. [48], the SA algorithm guided by the MSE as the objective function can be described in the following five steps:

1. Generate a disturbance in the system by randomly selecting two positions and a float value;
2. Create a new set of weights by applying the disturbance to the current set of weights;
3. Compute the energy generated by the new set of weights:
   • If the difference between the current energy and the previous energy is greater than zero, the new set of weights is considered as the current set, and the energy generated by this set is included in the entropy analysis;
   • If the difference is negative, a probability value is computed by p = e^(-ΔE/(T·B)), where B is the Boltzmann constant value. If a randomly generated value falls within the range [0, p], the new set of weights is accepted. Otherwise, the process starts anew.
4. Compute the difference between the present entropy and the previous i-th entropy. If the difference is less than the limit ε, consider thermal balance reached and decrease the temperature by the α index. Start a new iteration with the updated temperature. Otherwise, generate a new set of weights using the same temperature;
5. The process continues until the temperature approaches zero, which is known as "system crystallization". The set of weights generated at this point is considered a solution because it produces the lowest values of the objective function.

The construction of the voting ensemble concludes the comparison of models by obtaining the smallest possible error in the inference of the models. With this comparison, it is possible to observe the behavior of the models and the errors obtained during the inference, taking into account: (i.) the trained architecture, (ii.) the time resolution, and (iii.) the cases of feature selection by different methods. In addition to these case studies, improvements are expected in the results through the voting ensemble optimized by SA.

4. Case study

To analyze the proposed methodology, a case study is conducted using different datasets to compare the errors of the adopted DL models. The benchmarking process utilizes the temporal granularity available in multivariate series and selects attributes that are deemed the most important for the entire dataset, based on the specificity of the predictive model. This methodology is primarily based on KDD and can be applied to any dataset that describes energy consumption as the object of inference. Therefore, any data source can be used to construct the time window and input it into the models.

Two datasets are used for benchmarking in this study. The energy consumption readings describe the residential profile of different localities and are applicable to the short- to medium-term timeframe. The first dataset selected was the Hourly Usage Energy (HUE) dataset, which is published by the Harvard Dataverse [53]. It contains energy consumption readings from 28 households in British Columbia, Canada, collected at hourly intervals.

The consumption samples are synthesized into a single attribute describing the total consumption of the household during the interval. Although the series is initially univariate, the Harvard Dataverse also provides meteorological readings covering the same time interval as the consumption samples. Thus, through the steps of the feature engineering process, the two data sources are merged, the series becomes multivariate, and it becomes compatible with the benchmarking to be constructed.

Although the HUE dataset provides a residential energy consumption scenario with multiple variables related to the inference object, it is only available at the hourly granularity. As the comparison extends to the case study focused on the sample interval, the Pecan Street dataset [54] was chosen as the second dataset to be analyzed in the methodology.

As a dataset also focused on residential energy consumption, the dataset provided by Pecan Street Inc. includes collections by residential appliances. Available at different time resolutions for Austin (the capital of Texas), New York, and California, the student version of the dataset shows the energy consumption read by equipment. For this work, the chosen data was from Austin, as it is available at resolutions of 1 s, 1 min, and 15 min, and has a 99% completeness of reading collection.

There are highlights to be raised for this last dataset worked on by the analysis methodology: (i.) the consumption is disaggregated by equipment, (ii.) the lack of original meteorological data from the publisher, and (iii.) the lack of hourly resolution. To solve the first issue, during the feature engineering stage, all attributes with the same unit were summed to construct the total inference object read at the sample resolution. To include meteorological data, the World Weather Online (WWO) Application Programming Interface (API) was used to include data at the same resolution, period, and region as the data provided by Pecan Street. Finally, to obtain the hourly resolution of the dataset in question, the work uses a re-sampling of readings by the sample mean.

The use of different resolutions enables the inclusion of a case study focused on the number of samples available for model training. Because the Pecan Street dataset spans an interval of only one year, the number of samples decreases as the temporal resolution becomes less disaggregated. To provide a brief overview of the available observations and characteristics in each dataset, Table 1 demonstrates the described aspects of each dataset after passing through the data preprocessing stage, corresponding to feature engineering. Due to the inclusion of a feature representing the consumption readings of the residence in the dataset, it becomes possible to derive statistical metrics related to these readings. To this end, Table 1 presents the mean, standard deviation, and median values of each consumption reading.

The selection of the two datasets, namely the HUE dataset and the Pecan Street dataset, was based on their suitability for capturing and recording consumption readings in specific areas or projects related to smart grids. These datasets were chosen to validate the proposed method across different levels of granularity and to consider various relevant features in energy consumption forecasting.

Relying solely on the HUE dataset, which provides hourly readings, would not offer a comprehensive representation to fully achieve the objective of this study. To address this limitation and ensure a more comprehensive and robust analysis, the Pecan Street dataset was incorporated in addition to the HUE dataset. This incorporation allows for the capture of a broader range of consumption patterns and enhances the validity of the study.

It is important to note that the proposed method of analysis and data pre-processing is designed to be flexible and adaptable. It can accommodate other data sources, as long as consumption readings for specific residences can be obtained. This flexibility enables the inclusion of additional datasets in future studies, enhancing the replicability and generalizability of the findings.
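The two feature-engineering operations described above, re-sampling to hourly resolution by the sample mean and merging external weather readings on matching timestamps, can be sketched with pandas. The file and column names here are illustrative assumptions, not the datasets' actual schemas.

```python
import pandas as pd

# 1-min appliance readings indexed by timestamp; columns sharing the same
# unit are summed to build the total-consumption inference object.
readings = pd.read_csv("pecan_street_1min.csv",
                       parse_dates=["timestamp"], index_col="timestamp")
readings["total_kw"] = readings.filter(like="kw").sum(axis=1)

# Hourly resolution obtained by re-sampling with the sample mean.
hourly = readings.resample("1h").mean()

# Merge weather data collected for the same period, region, and resolution.
weather = pd.read_csv("wwo_austin_hourly.csv",
                      parse_dates=["timestamp"], index_col="timestamp")
dataset = hourly.join(weather, how="inner")  # keep overlapping timestamps
```

The inner join discards any timestamp present in only one source, which enforces the matching collection interval and location required by the preprocessing step.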


Fig. 3. Training comparison of recurrence-based models using Pecan Street dataset.

Table 1
Descriptive aspects of the datasets used in their original versions.

| Dataset           | Participant / Type of residence | Original dataset format (obs. x features) | Mean of demand | Median demand | Std. of demand | Completeness percentage | Absolute description of energy demand? |
|-------------------|---------------------------------|-------------------------------------------|----------------|---------------|----------------|-------------------------|----------------------------------------|
| Pecan Street Inc. | 661 / Family residence         | 1 min: (525.530, 79)                      | -0.49900       | -0.49697      | 0.19958        | 99%                     | No                                     |
|                   |                                 | 15 min: (35.032, 79)                      | -0.39251       | -0.45907      | 0.27793        |                         |                                        |
|                   |                                 | 1 h: (8.759, 79)                          | -0.69181       | -0.83377      | 0.34275        |                         |                                        |
| HUE               | 1 / Bungalow                   | 1 h: (29.202, 3)                          | -0.82132       | -0.85709      | 0.11107        | 100%                    | Yes                                    |

5. Simulations and results

This section presents the simulations and results obtained from applying the analysis method described in Section 3 to the case study discussed in Section 4. The simulations were implemented using Python version 3.9.2, and the DL models were developed using the PyTorch Lightning 1.6.3 framework with PyTorch 1.11.0.

The hardware used for the simulations and training of the models consisted of an NVIDIA GTX 1660 Super graphics card with 6 GB of GDDR6 memory. The Graphics Processing Unit (GPU) was controlled by the CUDA 11.3 package and the cuDNN 8.7 framework. The Central Processing Unit (CPU) is an Intel Core i5-9400F processor with 6 cores, operating with 16 GB of DDR4 memory in dual-channel mode. The complete implementation code and the training logs saved during model training are publicly available on GitHub (https://github.com/pv08/consumption-forecasting) and Weights & Biases (https://wandb.ai/pvbr08/EnergyConsumption), respectively.

Due to hardware limitations, and in pursuit of standardizing the experiments, default parameters were defined statically for all tests related to model training. These parameters include (i.) a batch size of 32; (ii.) a number of workers of 1; (iii.) a number of epochs of 200; and (iv.) a rolling window width of 60 samples. It is important to point out that these hyperparameters associated with the input data for training ideally ought to be optimized during the validation step. Nevertheless, due to hardware limitations, the batch size and the number of workers were configured according to these constraints. It is also important to note that no early stopping criteria were established during the tests, since the runs were defined as exhaustive and aimed to identify even the smallest fluctuations in the training and validation curves.

In addition to the hyperparameterization of the architectures, all the models employed in this study are configured with the objective of accomplishing simplicity. The emphasis lies in exploring their use in consumption forecasting by employing the most straightforward choice of hyperparameters. Additionally, this uncomplicated configuration enables the validation of the training process using the proposed approach of pre-processing and feature selection analysis.

As previously mentioned, one of the objectives of this work is to analyze the performance of prediction based on recurrence models and transformers, using the inference error as a comparative parameter. Analyzing the recurrence models, the first analysis was to verify the training behavior of the RNN, LSTM, and GRU models and check their similarity. Fig. 3 shows the training comparison of these models, firstly using the Pecan Street dataset in all its granularities.

Although all three models show good performance with low error fluctuations, it can be observed in Fig. 3 that as the temporal granularity increases, the validation errors also increase. Furthermore, it is also possible to observe a reduction in accuracy during the validation step. This reduction may be a result of the limited number of samples available, which is more evident at the hourly granularity.

It is noteworthy that the two datasets used in the study differ by 20,443 available observations, due to the time interval of collection of readings from the data publishers. To assess the impact of the number of readings on model training, Fig. 4 presents a comparison of TST model training on both data sources.

When analyzing the comparison of resolutions presented in Fig. 4(a), it is possible to notice that the same behavior observed in the other models of Fig. 3 is seen for the resolutions of 1 min and 15 min. However, at the hourly resolution, where there is a reduced number of samples available, overfitting occurs, indicating the need for a larger number of samples for the training step, since the amount provided resulted in a bias in the DL model. To show the impact of this factor on the hourly resolution, Fig. 4(b) shows the training of the same model using data from the HUE source, which has a larger number of available samples. In this case, the model does not deteriorate.

In order to evaluate the performance of the models, all feature sets were trained. Fig. 5 presents the evaluation metrics obtained for each feature set. It is observed that using the feature set selected by PCA generally resulted in a smaller error. However, a closer look at Fig. 5(a) shows that using all pre-processed features led the RNN model to an MSE of 0.00279, a 53.1% improvement over the MSE obtained with the set selected by PCA.

Fig. 5 not only provides a comparison of errors across different feature sets but also shows how the best models vary according to the available training samples. It is observed that the TST model, despite having presented overfitting during the validation stage, obtained a lower MSE compared to the other recurrent models. This suggests that, although it is a more complex model, the epoch before overfitting was enough for the model to present a superior performance. However, this result does not apply to the Transformer-based model trained with the HUE dataset, as shown in Fig. 6.

Now that it is possible to verify the similar behavior between traditional recurrent models for regression problems and those based on Transformers, the next analysis refers to the observation of the effect that each attribute has in the context of each model's particularities.
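For reproducibility, the fixed settings above translate into a short training setup. The sketch assumes the tensors produced by the windowing step and a hypothetical LightningModule wrapper, `ForecastingTask`, around a network with an MSE objective; neither name comes from the paper's repository.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

# Static defaults used for every experiment in this study.
BATCH_SIZE, NUM_WORKERS, MAX_EPOCHS, WINDOW_WIDTH = 32, 1, 200, 60

def to_t(a):
    return torch.tensor(a, dtype=torch.float32)

train_loader = DataLoader(TensorDataset(to_t(X_train), to_t(y_train)),
                          batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                          shuffle=False)  # preserve temporal ordering
val_loader = DataLoader(TensorDataset(to_t(X_val), to_t(y_val)),
                        batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)

# Exhaustive runs: no early-stopping callback, so even small fluctuations
# in the training and validation curves remain visible in the logs.
trainer = pl.Trainer(max_epochs=MAX_EPOCHS, accelerator="gpu", devices=1)
trainer.fit(ForecastingTask(), train_loader, val_loader)
```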

Fig. 4. Impact of the number of attributes on Transformer-based model training.

Fig. 5. Accuracy achieved using Pecan Street data on different sets of attributes.

Fig. 6. Accuracy achieved using HUE data on different sets of attributes.

To verify the similarities and differences between the two types of deep learning models, we can highlight the RNN model, using all features at the 1-min resolution of the Pecan Street dataset, and the TST model, on the same dataset but at the hourly resolution. This highlight is given by the observations of the results described in Fig. 5.

The SHAP values tool was used to investigate the impact of each feature on the predictions made by the best-trained model. Initially, the tool was applied to a single prediction, using the best epoch during training and the evaluation set. Fig. 7 shows the features that had the greatest influence on the prediction, which was still normalized. It is possible to observe the importance of meteorological attributes and weather patterns for predicting the value of the next observation. This result demonstrates the relevance of including multiple features, including weather-related ones, for accurate energy consumption forecasting.

Fig. 7 shows the importance of different attributes for a single inference. However, it is necessary to verify the entire evaluation set to obtain a more comprehensive understanding of the average importance across the other samples.

To achieve this, Fig. 8 presents the average attribute importance in descending order when considering the entire evaluation set. Since each model exhibits unique characteristics regarding attribute correlations for prediction inference, distinct rankings are generated.

To observe this distinctiveness in attribute importance, it is possible to see in Fig. 8(a) the collaboration of timestamp attributes. However, when observing Fig. 8(b), the TST model gives more importance to meteorological attributes, with little influence from the timestamp and the consumption itself. Therefore, for the latter, the consumption variable has little importance in its own inference, unlike for the RNN model. The ranking of attribute importance makes it possible to perform an attribute selection on the original collection and a retraining of each model. This new training serves to verify whether this model-specific selection brings better results compared to the original dataset and the attributes previously selected using PCA.

As highlighted, this work mainly focuses on comparing recurrent models and Transformer-based models. However, other models that are not widely explored in the literature are presented as potential solutions. These DL models were also trained to form the voting ensemble, which aims to combine the strengths of each model into a unified inference set. To avoid the costly distribution of voting weights through grid search, the selection of optimizer methods is necessary for this task.

In order to achieve a more effective distribution of weights without incurring significant time complexity, a metaheuristic technique called Simulated Annealing was used. Using the validation set and the MSE, the optimizer was used to distribute the importance of each model until a better MSE value than that achieved individually by each model was obtained.
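The model-specific selection step just described amounts to keeping the top-k features by mean absolute SHAP value and rebuilding the dataset for retraining. The cut-off `k` and the variable names below are assumptions; `importance` is the averaged score from the explainer sketch in Section 3.2.

```python
import numpy as np

K = 20  # illustrative cut-off for the most impactful attributes

# Rank this model's features by mean |SHAP| in descending order.
top_features = np.argsort(importance)[::-1][:K]

# Rebuild the windows restricted to the model's own most impactful
# attributes, then retrain and re-evaluate the same architecture.
X_train_sel = X_train[:, :, top_features]
X_eval_sel = X_eval[:, :, top_features]
```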

Fig. 7. Impact of attributes analyzed using SHAP values on the Pecan Street dataset.

Fig. 8. Average attribute importance measured by SHAP values using Pecan Street data.

The weight configuration is tested until a thermal equilibrium is reached, i.e., until the guiding function does not decrease after a fixed number t of iterations.

To illustrate the weight distribution achieved by the ensemble, Fig. 9 shows the influence of each model in three situations trained on the Pecan Street dataset at the 1-min resolution: (i) using all previously processed features; (ii) employing the attributes selected using PCA; and (iii) selecting the attributes following the specificity of the model as analyzed by SHAP.

To construct the ensemble, the distribution of weights obtained through SA is restricted to DL models. Therefore, other models that are part of the comparison, but not included in the DL category, are not part of the set. Examples of such models include the Seasonal Auto-Regressive Integrated Moving Average with eXogenous variables (SARIMAX), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost), which were used for comparison but not included in the ensemble construction.

Based on the validation set, it is important to use the impact factor to construct the ensemble on the evaluation set. Continuing with the example of Pecan Street data, Fig. 10 shows, on a logarithmic scale, the MSE achieved by the ensemble and the models trained separately. Fig. 11 shows the use of HUE's hourly data.

When looking at the values achieved by each model, regardless of the dataset used, a pattern of errors can be observed. However, when the ensemble, labeled "Simulated Annealing", is examined, there is a significant improvement, regardless of the resolution. This outcome is expected because the algorithm uses the error-based guide function to distribute the impact of each model, ensuring that the best of each model is included in the new prediction made by the ensemble.

The models previously mentioned, namely RNN and TST, are particularly noteworthy, as depicted in Fig. 10. It emphasizes their performances and underscores the effectiveness of the ensemble, enabling the observation of excellent performances from models based on convolutional layers, such as ConvRNN and TCN. These outstanding results further validate the points emphasized by Oliveira et al. [29]. A comprehensive overview of the model performances can be found in Table 2, which lists all the models trained in the described scenarios. In this table, the model with the best performance is highlighted for each set of features and time resolution analyzed. This table allows for the analysis of metrics across different scenarios, including time resolution, data collection, and feature reduction. Other metrics are provided in the Appendix.
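To make the weight search concrete, the following is a minimal sketch of the SA procedure from Section 3.2, guided by the validation MSE. The initial temperature, cooling rate, Boltzmann constant, and the fixed number of proposals per temperature (used here in place of the entropy-based equilibrium test) are illustrative assumptions.

```python
import numpy as np

def ensemble_mse(weights, preds, target):
    """MSE of the weighted-average prediction; `preds` has one row per model."""
    w = np.clip(weights, 0, None)
    w = w / w.sum()
    return float(np.mean((target - w @ preds) ** 2))

def simulated_annealing(preds, target, t0=1.0, alpha=0.95, t_min=1e-4,
                        boltzmann=1.0, steps_per_temp=50, rng=None):
    rng = rng or np.random.default_rng(0)
    n_models = preds.shape[0]
    weights = np.full(n_models, 1.0 / n_models)  # start equally distributed
    energy = ensemble_mse(weights, preds, target)
    t = t0
    while t > t_min:                             # until "crystallization"
        for _ in range(steps_per_temp):
            # Disturb the system: move a random amount between two positions.
            i, j = rng.choice(n_models, size=2, replace=False)
            delta = rng.uniform(0, weights[i])
            cand = weights.copy()
            cand[i] -= delta
            cand[j] += delta
            cand_energy = ensemble_mse(cand, preds, target)
            # Always accept improvements; accept worse moves with
            # probability exp(-dE / (T * B)), as in step 3 of Section 3.2.
            accept = cand_energy < energy or rng.random() < np.exp(
                -(cand_energy - energy) / (t * boltzmann))
            if accept:
                weights, energy = cand, cand_energy
        t *= alpha                               # cool the system
    return weights, energy

# preds: (n_models, n_samples) validation inferences; target: (n_samples,)
# best_weights, best_mse = simulated_annealing(preds, y_val)
```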

Fig. 9. Weight distribution achieved by the simulated annealing algorithm using the 1-min resolution of the Pecan Street dataset.

Fig. 10. MSE achieved by all models, including the ensemble constructed by SA, using the Pecan Street dataset.

5.1. Balancing complexity and efficiency for model redesign and optimiza-
tion

Deep learning models are renowned for their ability to learn com-
plex patterns and extract insights from datasets. They excel in identi-
fying non-linear relationships and capturing long-term dependencies in
time series data, making them particularly valuable in cases where the
context exhibits non-trivial behavior or patterns that evolve.
This capacity of deep learning models to generalize makes them
applicable to a wide range of time series, irrespective of feature spec-
ifications. Therefore, although SARIMAX can demonstrates satisfac-
tory performance on specific time series, deep learning models offer
the advantage of applicability in diverse scenarios and domains of
study [56].
Nevertheless, it is important to note that selecting the appropriate
model depends on the features and context of the analyzed time series.
Factors such as dataset size, computational resource availability, and
interpretability of the results presented by the models must be con-
Fig. 11. MSE achieved by the models using the HUE dataset.
sidered. Therefore, conducting a comparative and meticulous analysis
is crucial to determine whether SARIMAX or a deep learning-based
approach is more suitable for each specific case [57].
However, when the focus is the comprehension of these models, SHAP Values are utilized for this purpose because the architecture of the model is considered a “black-box”. Since the technique ranks the importance of each feature for predictions, it is feasible to leverage this approach for the redesign of deep learning model architectures. By providing a means to interpret predictions, the model can be manipulated to influence its predictions by identifying features with the highest SHAP Values.
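A minimal sketch of this feature-ranking step, assuming a generic trained regressor and the model-agnostic KernelExplainer from the shap package, is shown below; the variable names and the cutoff k are illustrative, not the exact setup of this study.

```python
import numpy as np
import shap

# model.predict maps a 2-D array of input windows to predictions;
# X_background and X_explain are samples drawn from the training and
# validation sets, respectively (all names here are assumptions).
explainer = shap.KernelExplainer(model.predict, X_background)
shap_values = explainer.shap_values(X_explain)

# Global importance: mean absolute SHAP value per feature.
mean_impact = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(mean_impact)[::-1]

k = 10                   # illustrative cutoff for a reduced feature set
selected = ranking[:k]   # candidate features for a simplified redesign
```

Features at the bottom of such a ranking are the natural candidates for the exclusion discussed in the redesign paragraphs below.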


Table 2
MSE of all trained models across various scenarios.
Model Pecan Street HUE
1 min 15 min 1 h 1 h
All PCA SHAP All PCA SHAP All PCA SHAP All PCA SHAP
RNN 0.00279 0.00595 0.00461 0.01065 0.01025 0.01080 0.06524 0.05395 0.05022 0.00847 0.00737 0.00802
LSTM 0.01372 0.00571 0.00315 0.01060 0.01025 0.01388 0.08315 0.06360 0.10878 0.00866 0.00879 0.00876
GRU 0.01676 0.00483 0.00295 0.01231 0.01235 0.01358 0.06731 0.05515 0.05887 0.00844 0.00748 0.00760
TST 0.00771 0.00785 0.00505 0.01712 0.01537 0.01744 0.05204 0.04512 0.06251 0.00798 0.01207 0.00964
ConvRNN 0.00584 0.00474 0.00747 0.02124 0.01657 0.01671 0.07268 0.06644 0.14099 0.00755 0.00776 0.01276
MLP 0.00847 0.01333 0.01274 0.02432 0.02016 0.02914 0.08480 0.07605 0.08350 0.00833 0.00851 0.00952
FCN 8.25760 10.3296 1.45320 0.41447 0.27995 0.69230 0.94316 4.68446 0.21022 0.08018 0.48447 0.15399
ResNet 2.23833 2.89366 2.13189 2.31748 0.89168 1.08180 1.53051 16.41117 0.98816 0.08627 0.77796 0.27713
TCN 0.00417 0.00501 0.01332 0.05626 0.06188 0.05751 0.09842 0.09936 0.09582 0.01166 0.01027 0.01641
Ensemble 0.00289 0.00376 – 0.01024 0.00999 – 0.05753 0.04566 – 0.00714 0.00656 –
SVR 7.42879 3.96012 – 3.08206 0.01072 – 0.03403 0.03004 – 0.00551 0.47276 –
XGBoost 0.01823 0.02271 – 0.02981 0.05983 – 0.00133 0.00025 – 0.00377 0.34326 –
SARIMAX 0.00337 – – 0.01059 – – 0.03733 – – 0.20541 – –
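The error metrics reported in Table 2 and in Tables 3–5 of the Appendix can be reproduced with standard scikit-learn calls; `y_true` and `y_pred` are placeholders for the test targets and a model's forecasts.

```python
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

# y_true: observed consumption; y_pred: model forecast (assumed names).
mse = mean_squared_error(y_true, y_pred)               # Table 2
mae = mean_absolute_error(y_true, y_pred)              # Table 3
mape = mean_absolute_percentage_error(y_true, y_pred)  # Table 4
r2 = r2_score(y_true, y_pred)                          # Table 5
```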

Through the redesign of these models, considering the most influential features selected by this technique, model simplification becomes attainable. Hence, features with the lowest impact, indicating minimal influence on predictions, can be excluded from the model. This reduction can lead to improved performance, but it may also have a negative impact depending on feature engineering and model selection.
In addition to simplification, model optimization can be achieved by utilizing SHAP values. These values can provide insights into which features may require more attention or refinement, thereby guiding hyperparameter tuning, regularization techniques, or architecture modifications to enhance overall model performance.
In the context of electricity management, a dispatch manager who utilizes SHAP values to improve and design their model would primarily consider weather readings and other features relevant to their geographical area or the energy demand and production context. By understanding the level of importance of each characteristic within the dataset, SHAP values can provide insights and interpretability crucial for designing and optimizing the model to its fullest potential.
These insights can help in finding the balance between implementation complexity and the required level of precision in real-world scenarios. By employing the techniques presented throughout this work and conducting simulations, it is possible to explore more effective ways to analyze sensitivity and identify efficient solutions in terms of performance and computational resources.
Regarding model hyperparameterization, leveraging the SHAP Values can help comprehend their significance and prevent their excessive use. This approach can contribute to reducing complexity and optimizing training time, enabling the optimal utilization of computational resources.
In summary, considering sensitivity analysis and the need to simplify the model for real-world applicability can lead to more efficient use of computational resources. Striking the right balance is crucial to ensure model efficiency. The use of SHAP Values, along with the comprehensive analysis and preprocessing methodology proposed, provides researchers with valuable insights and ensures a well-balanced trade-off between complexity and efficiency.
6. Conclusions

This study conducted a comprehensive performance comparison between recurrent models and Transformer models for energy demand forecasting in different scenarios. The KDD methodology was employed to ensure a systematic analysis approach. Two case studies were conducted using the HUE and Pecan Street datasets, allowing for comparisons across various resource settings.
The first analysis was to evaluate the training performance of the RNN, LSTM, and GRU models through case studies linked to the number of samples and the class of models in question. The three models performed well with regard to low error fluctuations. However, as the temporal granularity increases, higher errors were observed during the validation phase.
When considering the feature set reduced by PCA, similar error patterns were observed during the validation phase compared to the complete set of pre-processed features. Interestingly, the TST model demonstrated superior performance, achieving a lower MSE compared to the other recurrent models, despite exhibiting overfitting during validation. This suggests that the TST model, despite its complexity, was able to capture valuable information during the training process before overfitting occurred.
Using the SHAP Values tool allowed for an investigation into the impact of each feature on the predictions. It was observed that meteorological attributes and climate patterns play a significant role in predicting the value of the next observation, highlighting the importance of including multiple features for accurate energy consumption forecasts. Regarding architectures, the TST model gives more importance to meteorological attributes, while timestamps and consumption have less influence. This suggests that the consumption variable itself is not as crucial for predicting energy consumption in the TST model compared to the RNN model.
To achieve a better weight distribution without incurring high time complexity, the metaheuristic technique called SA was employed. The weight configuration was continuously tested until a thermal equilibrium was reached. The ensemble created by SA showed significant improvement across different resolutions, indicating the effectiveness of the approach.
The proposed pre-processing method, the results achieved, and the comparison of reference DL architectures should help energy managers. By understanding the main features of energy forecasting and their level of importance, managers can choose the most suitable model to predict energy demand with respect to their geographical area or other scope of application. The optimization by SA shows a decrease in error because the same metric is used to reach better results. Hence, following this step should give network managers a lower error rate when their dispatch plan is established, because a committee of different predictors is weighted by their errors.
This comparison not only demonstrates the performance of each trained model but also highlights the importance of using multivariate series for predicting energy consumption. The SHAP values clearly show that other attributes can influence the inference of the primary object. With this comparison, the prediction of energy consumption can be expanded and linked to other issues to promote energy efficiency. As a suggestion for future work, it is advisable to verify the enhancement of the models by optimizing the hyperparameters utilized for capturing input data and configuring the architectures employed in this study. Alongside this verification, the proposed analysis and pre-processing methodology can be validated on another dataset that describes energy demand. Additionally, another viable proposal could involve optimizing the purchase-sale iteration of excess energy in relation to anticipatory consumption within a specific residential context. At this point, the comparison of the obtained performances, together with the development of the ensemble by metaheuristics, can be included in the development of reinforcement learning models to generate profit from prosumer transactions.
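For reference, the PCA-reduced feature sets compared throughout (the "PCA" columns in Tables 2–5) correspond to a standard scikit-learn transformation; the retained-variance threshold and variable names below are assumptions for illustration.

```python
from sklearn.decomposition import PCA

# X_train / X_val: pre-processed feature matrices (assumed names).
pca = PCA(n_components=0.95)              # keep components explaining ~95% of variance (assumed)
X_train_red = pca.fit_transform(X_train)  # fit the projection on training data only
X_val_red = pca.transform(X_val)          # reuse the fitted projection on validation data
```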


Table 3
MAE of all trained models across various scenarios.
Model Pecan Street HUE
1 min 15 min 1 h 1 h
All PCA SHAP All PCA SHAP All PCA SHAP All PCA SHAP
RNN 0.02650 0.06192 0.05041 0.05967 0.05339 0.05454 0.16139 0.12345 0.13206 0.05835 0.05107 0.05552
LSTM 0.10407 0.05833 0.02353 0.05701 0.05430 0.06807 0.19378 0.16430 0.25806 0.05781 0.05603 0.05982
GRU 0.12142 0.05074 0.02573 0.06728 0.06519 0.07016 0.13235 0.12082 0.12571 0.06372 0.05528 0.05679
TST 0.06021 0.06627 0.04091 0.09608 0.08182 0.08792 0.14368 0.13829 0.18738 0.05451 0.07641 0.06423
ConvRNN 0.05607 0.04361 0.06179 0.09559 0.08232 0.08900 0.17565 0.17111 0.28312 0.05987 0.06305 0.08548
MLP 0.06581 0.07008 0.06483 0.10038 0.09819 0.10875 0.20491 0.17724 0.19723 0.05359 0.05277 0.05961
FCN 2.46375 2.72781 0.93816 0.51989 0.43285 0.68219 0.76949 1.70751 0.38178 0.22906 0.60603 0.35249
ResNet 1.09798 1.34136 1.18267 1.11981 0.75079 0.84797 0.97476 3.70548 0.79325 0.22062 0.63251 0.41432
TCN 0.03321 0.03708 0.09722 0.16833 0.20496 0.16348 0.20806 0.23536 0.18053 0.06936 0.06448 0.09441
Ensemble 0.02652 0.03592 – 0.05358 0.05258 – 0.13363 0.12349 – 0.05145 0.04874 –
SVR 2.36098 1.53873 – 1.43284 0.09158 – 0.17547 0.16271 – 0.04510 0.45296 –
XGBoost 0.06970 0.07683 – 0.11077 0.00467 – 0.02385 0.01094 – 0.04146 0.35482 –
SARIMAX 0.02221 – – 0.05553 – – 0.10968 – – 0.26312 – –

Table 4
MAPE of all trained models across various scenarios.
Model Pecan Street HUE
1 min 15 min 1 h 1 h
All PCA SHAP All PCA SHAP All PCA SHAP All PCA SHAP
RNN 0.14220 0.30428 0.21210 0.92183 0.81904 0.90441 4.88546 3.79263 4.11293 0.09153 0.08186 0.08780
LSTM 0.48304 0.28010 0.17468 1.00041 0.81856 0.99095 4.33073 4.20825 3.74409 0.09147 0.08986 0.09359
GRU 0.53133 0.23991 0.16368 0.91839 1.06218 1.06875 4.35518 4.21207 4.20590 0.09654 0.08631 0.08784
TST 0.34994 0.29644 0.27696 1.00007 0.87233 0.96062 4.56583 4.59006 4.25748 0.08580 0.08631 0.09964
ConvRNN 0.22020 0.25833 0.28317 1.37777 1.02070 1.00774 5.13323 4.70171 6.05600 0.08850 0.09114 0.53038
MLP 0.26471 0.44704 0.47655 1.24312 1.14281 1.43518 5.38558 5.71011 5.36075 0.08622 0.08589 0.09459
FCN 8.73249 8.90999 3.97944 3.23451 1.50761 4.25863 11.02001 17.97226 6.95738 0.29699 0.76546 0.43055
ResNet 3.36829 4.12096 5.85828 6.37581 3.80967 6.07990 4.32818 42.33065 3.73976 0.28268 0.77889 0.53038
TCN 0.21167 0.27051 0.53340 1.85153 1.26917 2.110639 4.56645 4.16572 4.88300 0.10587 0.10071 0.14056
Ensemble 0.15628 0.23502 – 0.89930 0.81167 – 4.61107 4.29652 – 0.08183 0.07766 –
SVR 9.44577 5.78149 – 6.31611 0.36384 – 0.79564 0.79641 – 0.76676 0.37610 –
XGBoost 0.95270 1.08521 – 1.19227 0.03143 – 0.40294 0.09572 – 0.10496 0.46293 –
SARIMAX 0.15100 – – 0.92641 – – 3.39812 – – 0.26070 – –

Table 5
R² score of all trained models across various scenarios.
Model Pecan Street HUE
1 min 15 min 1 h 1 h
All PCA SHAP All PCA SHAP All PCA SHAP All PCA SHAP
RNN 0.91983 0.84811 0.88221 0.81609 0.82299 0.81351 0.20192 0.33999 0.38558 0.13196 0.24420 0.17742
LSTM 0.68827 0.85417 0.91949 0.81702 0.82298 0.76050 −0.01724 0.22197 −0.33072 0.11176 0.09883 0.10221
GRU 0.59062 0.87661 0.92451 0.78781 0.78684 0.76568 0.17656 0.32530 0.27976 0.13425 0.23313 0.22091
TST 0.83398 0.79954 0.87104 0.70449 0.73470 0.69895 0.36334 0.44801 0.23541 0.18152 −0.23784 −0.30817
ConvRNN 0.81516 0.87894 0.80934 0.63349 0.71427 0.71152 0.11085 0.18772 −0.72470 0.22580 0.20413 −0.30817
MLP 0.77968 0.45713 0.67474 0.58034 0.65212 0.49707 −0.03074 0.06963 −0.02154 0.14578 0.12751 0.02395
FCN −251.65753 −262.63471 −36.08856 −6.15132 −3.83041 −10.94495 −10.53742 −56.30353 −1.57157 −7.21670 −48.64789 −14.78137
ResNet −77.99939 −72.85197 −53.40999 −38.98575 −14.38502 −17.66547 −17.72228 −199.75251 −11.08786 −7.74121 −78.72427 −27.39951
TCN 0.88122 0.87207 0.65979 0.02914 −0.06783 0.00780 −0.20402 −0.12550 −0.17217 −0.19586 0.05263 −0.68257
Ensemble 0.92601 0.90394 – 0.82324 0.82756 – 0.29615 0.44145 – 0.26783 0.32702 –
SVR −189.19805 −100.39026 – −52.69748 0.81307 – 0.57062 0.62091 – 0.98515 −0.27461 –
XGBoost 0.53312 0.41844 – 0.48049 0.99907 – 0.98317 0.99683 – 0.98983 0.07454 –
SARIMAX 0.91347 – – 0.81527 – – 0.52844 – – 0.44619 – –

CRediT authorship contribution statement

Paulo Vitor B. Ramos: Conceptualization, Methodology, Software, Visualization, Writing – original draft. Saulo Moraes Villela: Writing – review & editing, Supervision. Walquiria N. Silva: Conceptualization, Visualization, Writing – original draft. Bruno H. Dias: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

We have shared the link to our data in the manuscript.


Acknowledgments

The authors thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) under Grant 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under the grant 404068/2020-0, Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) under the grant APQ-03609-17, Instituto Nacional de Energia Elétrica (INERGE) and the Smart4grids research group for supporting this work.

Appendix. Complementary results

In order to complement the results seen during the evaluation of the trained models, Tables 3–5 show, respectively, the MAE, MAPE, and R² score of the case study. For each, the best-performing model in each set of features and time resolution is highlighted.

References

[1] Marimuthu R, Sankaranarayanan B, Ali SM, de Sousa Jabbour ABL, Karuppiah K. Assessment of key socio-economic and environmental challenges in the mining industry: Implications for resource policies in emerging economies. Sustain Prod Consum 2021.
[2] dos Santos Ferreira G, Martins dos Santos D, Luciano Avila S, Viana Luiz Albani V, Cardoso Orsi G, Cesar Cordeiro Vieira P, Nilson Rodrigues R. Short- and long-term forecasting for building energy consumption considering IPMVP recommendations, WEO and COP27 scenarios. Appl Energy 2023.
[3] Karintseva O, Kharchenko M, Boon EK, Derykolenko O, Melnyk V, Kobzar O. Environmental determinants of energy-efficient transformation of national economies for sustainable development. Int J Glob Energy Issues 2021.
[4] Zhu J, Dong H, Zheng W, Li S, Huang Y, Xi L. Review and prospect of data-driven techniques for load forecasting in integrated energy systems. Appl Energy 2022.
[5] Nate S, Bilan Y, Cherevatskyi D, Kharlamova G, Lyakh O, Wosiak A. The impact of energy consumption on the three pillars of sustainable development. Energies 2021.
[6] Himeur Y, Ghanem K, Alsalemi A, Bensaali F, Amira A. Artificial intelligence based anomaly detection of energy consumption in buildings: A review, current trends and new perspectives. Appl Energy 2021.
[7] Wang Z, Hong T, Li H, Piette MA. Predicting city-scale daily electricity consumption using data-driven models. Adv Appl Energy 2021.
[8] Albuquerque PC, Cajueiro DO, Rossi MD. Machine learning models for forecasting power electricity consumption using a high dimensional dataset. Expert Syst Appl 2022.
[9] Khalil M, McGough AS, Pourmirza Z, Pazhoohesh M, Walker S. Machine Learning, Deep Learning and Statistical Analysis for forecasting building energy consumption — A systematic review. Eng Appl Artif Intell 2022.
[10] Dong H, Zhu J, Li S, Wu W, Zhu H, Fan J. Short-term residential household reactive power forecasting considering active power demand via deep transformer sequence-to-sequence networks. Appl Energy 2023.
[11] Ahmad T, Zhang D, Huang C, Zhang H, Dai N, Song Y, Chen H. Artificial intelligence in sustainable energy industry: Status Quo, challenges and opportunities. J Clean Prod 2021.
[12] Pourarshad M, Noorollahi Y, Atabi F, Panahi M. Modelling and optimisation of long-term forecasting of electricity demand in oil-rich area, South Iran. Int J Ambient Energy 2022.
[13] Mir AA, Alghassab M, Ullah K, Khan ZA, Lu Y, Imran M. A review of electricity demand forecasting in low and middle income countries: The demand determinants and horizons. Sustainability 2020.
[14] Rice R, North K, Hansen G, Pearson D, Schaer O, Sherman T, Vassallo D. Time-series forecasting energy loads: A case study in Texas. In: 2022 systems and information engineering design symposium. SIEDS, 2022, p. 196–201.
[15] Klyuev RV, Morgoev ID, Morgoeva AD, Gavrina OA, Martyushev NV, Efremenkov EA, Mengxu Q. Methods of forecasting electric energy consumption: A literature review. Energies 2022.
[16] Lin W, Wu D, Boulet B. Spatial-temporal residential short-term load forecasting via graph neural networks. IEEE Trans Smart Grid 2021.
[17] Hong T, Pinson P, Wang Y, Weron R, Yang D, Zareipour H. Energy forecasting: A review and outlook. IEEE Open Access J Power Energy 2020.
[18] Wang Y, Chen Q, Hong T, Kang C. Review of smart meter data analytics: Applications, methodologies, and challenges. IEEE Trans Smart Grid 2018.
[19] Haben S, Arora S, Giasemidis G, Voss M, Vukadinović Greetham D. Review of low voltage load forecasting: Methods, applications, and recommendations. Appl Energy 2021.
[20] Hernandez L, Baladron C, Aguiar JM, Carro B, Sanchez-Esguevillas AJ, Lloret J, Massana J. A survey on electric power demand forecasting: future trends in smart grids, microgrids and smart buildings. IEEE Commun Surv Tutor 2014.
[21] Shin S-Y, Woo H-G. Energy consumption forecasting in Korea using machine learning algorithms. Energies 2022.
[22] Petropoulos F, Apiletti D, Assimakopoulos V, Zied Babai M. Forecasting: theory and practice. Int J Forecast 2022.
[23] Hong T, Wang P. Artificial intelligence for load forecasting: history, illusions, and opportunities. IEEE Power Energy Mag 2022.
[24] Forootani A, Rastegar M, Sami A. Short-term individual residential load forecasting using an enhanced machine learning-based approach based on a feature engineering framework: a comparative study with deep learning methods. Electr Power Syst Res 2022.
[25] Kumar S, Hussain L, Banarjee S, Reza M. Energy load forecasting using deep learning approach-LSTM and GRU in spark cluster. In: 2018 fifth international conference on emerging applications of information technology. EAIT, 2018, p. 1–4.
[26] Abumohsen M, Owda AY, Owda M. Electrical load forecasting using LSTM, GRU, and RNN algorithms. Energies 2023.
[27] Chu Y, Mitra D, Cetin K. Data-driven energy prediction in residential buildings using LSTM and 1-D CNN. ASHRAE Trans 2020.
[28] Yang W, Shi J, Li S, Song Z, Zhang Z, Chen Z. A combined deep learning load forecasting model of single household resident user considering multi-time scale electricity consumption behavior. Appl Energy 2022.
[29] Oliveira N, Sousa N, Praça I. Deep learning for short-term instant energy consumption forecasting in the manufacturing sector. In: Omatu S, Mehmood R, Sitek P, Cicerone S, Rodríguez S, editors. Distributed computing and artificial intelligence, 19th international conference. Cham: Springer International Publishing; 2023, p. 165–75.
[30] Zerveas G, Jayaraman S, Patel D, Bhamidipaty A, Eickhoff C. A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. KDD '21, Association for Computing Machinery; 2021, p. 2114–24.
[31] Zeng X, Hu Y, Shu L, Li J, Duan H, Shu Q, Li H. Explainable machine-learning predictions for complications after pediatric congenital heart surgery. Sci Rep 2021.
[32] Fei X, Zhigang W. Analysis of correlation between meteorological factors and short-term load forecasting based on machine learning. In: 2018 international conference on power system technology. POWERCON, 2018, p. 4449–54.
[33] Bian H, Wang Q, Xu G, Zhao X. Research on short-term load forecasting based on accumulated temperature effect and improved temporal convolutional network. Energy Rep 2022.
[34] Kim J-Y, Cho S-B. Explainable prediction of electric energy demand using a deep autoencoder with interpretable latent space. Expert Syst Appl 2021.
[35] Kim J-Y, Cho S-B. Predicting residential energy consumption by explainable deep learning with long-term and short-term latent variables. Cybern Syst 2023.
[36] Gao Y, Ruan Y. Interpretable deep learning model for building energy consumption prediction based on attention mechanism. Energy Build 2021.
[37] Kim J-Y, Cho S-B. Interpretable deep learning with hybrid autoencoders to predict electric energy consumption. In: 15th international conference on soft computing models in industrial and environmental applications (SOCO 2020). 2021, p. 133–43.
[38] Reddy A S, Akashdeep, Harshvardhan, Kamath S S. Ensemble learning approach for short-term energy consumption prediction. In: 5th joint international conference on data science & management of data (9th ACM IKDD CODS and 27th COMAD). Association for Computing Machinery; 2022, p. 284–5.
[39] Wang J, Xing Q, Zeng B, Zhao W. An ensemble forecasting system for short-term power load based on multi-objective optimizer and fuzzy granulation. Appl Energy 2022.
[40] Hadjout D, Torres J, Troncoso A, Sebaa A, Martínez-Álvarez F. Electricity consumption forecasting based on ensemble deep learning with application to the Algerian market. Energy 2022.
[41] Onan A, Korukoğlu S, Bulut H. A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Syst Appl 2016.
[42] Bedi J, Toshniwal D. Deep learning framework to forecast electricity demand. Appl Energy 2019.
[43] Bengio Y, Frasconi P, Simard PY. The problem of learning long-term dependencies in recurrent networks. In: IEEE international conference on neural networks. 1993.
[44] Kingma D, Ba J. Adam: a method for stochastic optimization. In: International conference on learning representations. 2014.
[45] Chitalia G, Pipattanasomporn M, Garg V, Rahman S. Robust short-term electrical load forecasting framework for commercial buildings using deep recurrent neural networks. Appl Energy 2020.
[46] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems. Curran Associates, Inc.; 2017, p. 1–11.
[47] Khairalla M. Meta-heuristic search optimization and its application to time series forecasting model. Intell Syst Appl 2022.
[48] Brito AS, Vieira MB, Villela SM, Tacon H, Chaves HL, Maia HA, Concha DT, Pedrini H. Weighted voting of multi-stream convolutional neural networks for video-based action recognition using optical flow rhythms. J Vis Commun Image Represent 2021.
[49] Fayyad UM, Piatetsky-Shapiro G, Smyth P, et al. Knowledge discovery and data mining: Towards a unifying framework. In: KDD. 1996, p. 82–8.
[50] Rotondo A, Quilligan F. Evolution paths for knowledge discovery and data mining process models. SN Comput Sci 2020.
[51] Serrano-Guerrero X, Briceño-León M, Clairand J-M, Escrivá-Escrivá G. A new interval prediction methodology for short-term electric load forecasting based on pattern recognition. Appl Energy 2021.
[52] Hasan BMS, Abdulazeez AM. A review of principal component analysis algorithm for dimensionality reduction. J Soft Comput Data Min 2021.
[53] Makonin S. HUE: the hourly usage of energy dataset for buildings in British Columbia. 2018, Accessed: 2022-10-01.
[54] Pecan Street. Pecan Street Dataport. 2022, URL: https://dataport.pecanstreet.org, Accessed: 2022-10-01.
[55] Ensafi Y, Amin SH, Zhang G, Shah B. Time-series forecasting of seasonal items sales using machine learning – A comparative analysis. Int J Inf Manag Data Insights 2022.
[56] Crone SF, Dhawan R. Forecasting seasonal time series with neural networks: A sensitivity analysis of architecture parameters. In: 2007 international joint conference on neural networks. 2007, p. 2099–104.
[57] Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: Current status and future directions. Int J Forecast 2021.
