5.a Review of Reinforcement Learning For Controlling Building Energy Systems From A Computer Science Perspective
Keywords: Building Energy System; HVAC; Heating; Cooling; Reinforcement learning; Machine learning; RL; ML

Energy efficient control of energy systems in buildings is a widely recognized challenge due to the use of low temperature heating, renewable electricity sources, and the incorporation of thermal storage. Reinforcement Learning (RL) has been shown to be effective at minimizing the energy usage in buildings with maintained thermal comfort despite the high system complexity. However, RL has certain disadvantages that make it challenging to apply in engineering practices. In this review, we take a computer science approach to identifying three main categories of challenges of using RL for control of Building Energy Systems (BES). The three categories are the following: RL in single buildings, RL in building clusters, and multi-agent aspects. For each topic, we analyse the main challenges and the state-of-the-art approaches to alleviate them. We also identify several future research directions on subjects such as sample efficiency, transfer learning, and the theoretical properties of RL in building energy systems. In conclusion, our review shows that the work on RL for BES control is still in its initial stages. Although significant progress has been made, more research is needed to realize the goal of RL-based control of BES at scale.
Fig. 1. Number of publications per year in the field of RL for HVAC control. The results were obtained from Web of Science using the search query topic=(reinforcement learning AND (hvac OR heating OR cooling)).

next generation of thermal systems, which will use temperature levels much closer to room temperatures and integrate thermal storage solutions. The main challenge is the increased thermal inertia of such next-generation systems–the reduced energy transfer capacity requires longer control horizons to maintain sufficient thermal comfort. Model Predictive Control (MPC), a method for dynamic long-term planning of control systems, has been proposed as a solution to these problems (Afram, Janabi-Sharifi, Fung, & Raahemifar, 2017; Jin, Baker, Christensen, & Isley, 2017; Manjarres, Mera, Perea, Lejarazu, & Gil-Lopez, 2017). However, such methods require accurate building plant models, which are inherently hard to obtain due to the complex and time-varying dynamics.

Reinforcement Learning (RL) is getting increased attention in the field of building Heating, Ventilation and Air Conditioning (HVAC) and energy network control research. It has proven to be an attractive alternative to MPC due to its model-free approaches. Commonly, RL algorithms do not strictly require a model of the environment, but learn control policies through interaction. While this alleviates the problems of building modelling, RL suffers from other drawbacks, such as the need for massive quantities of data and long training times.

Several advancements have been made in the field of RL in the last decade, and many effective learning algorithms have emerged. Some of these algorithms have been applied to a wide range of applications in BES (Chen, Norford, Samuelson, & Malkawi, 2018; Du, Zandi, et al., 2021; Gao, Li, & Wen, 2020; Wei, Wang, & Zhu, 2017). The number of publications per year on RL for HVAC control can be seen in Fig. 1. Due to the growth of the field, there is a need for an up-to-date literature review summarizing the main research gaps and previous research.

1.2. Previous reviews

Previous reviews focus on RL methodology, but do not consider the full extent of the computer science related challenges of BES control. Vázquez-Canteli and Nagy review the application of RL for demand response applications (Vázquez-Canteli & Nagy, 2019). RL theory focusing on Q-learning and the exploration/exploitation trade-off is treated. Commonly used algorithms and action selections are also presented. Future research directions are identified, including the incorporation of expert knowledge and reduction of the state–action space to improve the sample efficiency of the RL algorithms. The need for multi-agent systems to facilitate simultaneous control of building clusters is also emphasized. Wang and Hong review the use of RL for applications in building control such as window opening, lighting, and HVAC (Wang & Hong, 2020). Many aspects of RL, such as the popularity of various algorithms, the exploration/exploitation trade-off, and the choice of states and actions are investigated. Reduction of the state–action space and multi-agent systems are suggested as future research directions. Han, et al. review the use of RL for control with various comfort objectives such as indoor air quality, noise, and thermal comfort (Han, et al., 2019). Commonly used algorithms as well as exploration strategies are analysed. A discussion on value-based versus policy-based methods is included. Multi-agent RL is identified as a future research direction of significant importance. Although not strictly a review, Nweye, Liu, Stone, and Nagy identify nine practical challenges of RL in grid-interactive buildings (Nweye et al., 2022). Sample efficiency, partial observability, and explainability are, among other topics, identified as important future research directions. An example of off-line learning of an RL controller in the proposed test-environment CityLearn is also provided.

Some previous reviews treat RL in combination with other topics related to building HVAC and energy systems control. Royapoor, Antony, and Roskilly review holistic control of a building, with HVAC control included as a section (Royapoor et al., 2018). Popular control methods, including RL, are investigated. The use of thermal comfort and occupancy modelling for control to reduce operational energy usage is discussed. A survey on industry familiarity with various control strategies is also included. Hong, Wang, Luo, and Zhang review the use of ML in the entire building life cycle, with a section on HVAC control (Hong et al., 2020). Supervised Learning (SL) for personal comfort and occupancy modelling is treated. MPC, RL, and their respective advantages for HVAC control are discussed. Potential benefits of planning in RL methods are mentioned. Pinto, Wang, Roy, Hong, and Capozzoli review the use of transfer learning in smart buildings (Pinto, et al., 2022). Transfer learning for both RL and various kinds of predictive modelling is treated, both in the context of single buildings and building clusters.

1.3. Contributions of this review

The contributions of this review to the existing literature consist of four key aspects:

• Many challenges of applying RL to BES control have their roots in computer science, not in building energy research. We provide a guide for building energy researchers to become familiar with the computer science related challenges, and to identify key research gaps where meaningful contributions can be made.
• Previous treatment of the computer science related challenges of RL for BES control is distributed across a vast body of literature, which is difficult to survey. We address this by identifying and structuring the main challenges into three categories, outlined in Sections 5–7.
• Previous reviews address only limited parts of the computer science related challenges of RL for BES control. We address this by providing a comprehensive treatment of all the main challenges.
• The study identifies promising future research directions to address the computer science related challenges of applying RL to BES control.

The three categories of challenges that are identified in this review are referred to as reinforcement learning in single buildings, reinforcement learning in building clusters, and multi-agent aspects. To further justify our contributions, the coverage by previous reviews of the three categories can be seen in Table 1. Note that coverage of the main challenges in previous reviews can mean two things: (1) the reviewers investigate previous work concerning the challenges, or (2) they discuss them as future research directions. It is evident that no previous review treats all categories of challenges.

1.4. Organization of the review

In Section 2, we introduce the BES and discuss the challenges of control in such systems. This section is followed by a primer on the theory behind RL in Section 3. Given an overview of key theoretical aspects, we discuss the computer science perspective, and review state-of-the-art approaches to RL-based BES control in Sections 4–7. In Section 8, we make a future outlook and identify promising research directions. Lastly, we conclude the paper and summarize our findings in Section 9.
Table 1
Coverage of the three main challenges in previous reviews. Review work and identification as a future research direction are denoted by RW and FR, respectively.

Author                     Reference                          RL in single buildings   RL in building clusters   Multi-agent aspects
Vázquez-Canteli and Nagy   Vázquez-Canteli and Nagy (2019)    FR/RW                    –                         FR
Wang and Hong              Wang and Hong (2020)               RW                       –                         FR
Han et al.                 Han, et al. (2019)                 RW                       –                         FR
Nweye et al.               Nweye et al. (2022)                FR                       –                         FR
Royapoor et al.            Royapoor et al. (2018)             FR                       –                         –
Hong et al.                Hong et al. (2020)                 FR/RW                    FR                        –
Pinto et al.               Pinto, et al. (2022)               RW                       FR/RW                     –
Fig. 2. A generic illustration of a BES, and the associated measurement and control system. The dashed lines represent exchange of sensor measurements and control signals.
2. Problem formulation

The Building Energy System (BES) can be thought of as all systems in a building that use energy. This includes both electrical appliances and lighting, and the thermal energy system. In this review, we limit the definition of BES to encompass heating equipment, thermal storage, emission systems, and ventilation and air conditioning, as well as their auxiliary systems, as shown in Fig. 2. Even though the constituent parts of the BES have their own intricate principles of operation, we will not cover them in detail. The purpose of this section is merely to establish the role of RL in BES control. For case-specific treatment of BES components, we refer the reader to Frederiksen and Werner (2013), Tymkow, Tassou, Kolokotroni, and Jouhara (2020).

All the BES components in Fig. 2 are part of an interconnected system: heat is generated using heating equipment, injected into a thermal storage, and distributed in the building using an emission system. On top of this, ventilation and air conditioning are added to ensure an acceptable indoor environment. Naturally, all components must be controlled to ensure a comfortable indoor climate, preferably at a low monetary cost while using as little energy as possible. An important example of this is optimal control of thermal storage systems to shift the time of heat production away from times of high load. For optimal operation, the charging of the storage tank must be scheduled such that the load can always be supplied, while keeping the cost and energy usage to a minimum. But the components of a BES are highly interactive, and achieving these goals using traditional control methods is prohibitively difficult. Using Fig. 2 as an example, the controller must coordinate the operation of all components in the BES while taking into account complications such as the system's dynamic response and time delays in sensor measurements and actuators. In large buildings, where the number of measured and controlled variables increases, this coordination problem becomes highly complex, so the need for new control methods is apparent.

2.1. Data collection in buildings

To control and monitor a BES, operational data must be collected and stored. This is typically done in a Building Management System (BMS), which integrates sensors, actuators, and networking devices to provide information about all subsystems in a building. For control purposes, the BMS operates on multiple levels which represent distinct levels of abstraction of the system. For example, the level closest to sensors and actuators is known as the field level (Levermore, 2000). Devices at the field level receive instructions from higher-level controllers to write specified output signals to the actuators. They also read sensor values and pass them back to the higher-level devices. The higher-level controllers receive the sensor data, which is stored and processed to determine suitable control signals to send to the field-level devices.
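As an illustration of this hierarchy, the following minimal sketch mimics the field-level read/write loop and the higher-level decision step. All class names, the sensor and actuator interfaces, and the control rule are our own illustration, not part of any BMS standard or product.

```python
# Illustrative two-level BMS loop: a field-level device reads sensors and
# applies actuator commands, while a higher-level controller decides outputs.

class FieldDevice:
    def __init__(self, sensor, actuator):
        self.sensor, self.actuator = sensor, actuator

    def read(self):
        return self.sensor()            # pass the measurement up to the controller

    def write(self, signal):
        self.actuator(signal)           # apply the output signal received from above

class HighLevelController:
    def decide(self, measurement):
        # Placeholder logic: e.g., increase heating power when too cold.
        return 1.0 if measurement < 20.0 else 0.0

def control_step(device, controller):
    m = device.read()                   # field level -> higher level
    u = controller.decide(m)            # higher level computes the control signal
    device.write(u)                     # higher level -> field level
```

In a real BMS the `decide` step is where a learned controller could be plugged in, which is the scenario treated in the remainder of this review.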
This is demonstrated using dashed lines in Fig. 2, where at each BES component there are sensors that measure the variables which we want to control. These measurements are communicated to a central controller, which computes the control signals to be applied
BES is a vast area on its own, and for brevity we will not discuss it in more detail here. We instead refer the interested reader to Levermore (2000), Sofos, et al. (2020).

These days, BMS systems are common and serve as promising platforms for integrating data-driven control into BES. Stored operational data can be used inside ML workflows to create control systems that autonomously optimize the energy performance of a building. In particular, BMS systems facilitate large-scale data collection in clusters of buildings, paving the way for coordination and control of entire energy communities. In the next section, we explain the role of RL in this scenario, and how it can be integrated to solve control problems in complex BES.

Table 2
Advantages and disadvantages of physics-based and RL-based approaches.

               Advantages                            Disadvantages
Physics-based  ∙ Incorporation of expert knowledge   ∙ Difficult in complex systems
               ∙ Interpretable system dynamics       ∙ Governing physical laws might not exist
RL-based       ∙ No system model required            ∙ Requires large quantities of data
               ∙ Can control very complex systems    ∙ Few stability or robustness guarantees

2.2. The need for reinforcement learning

policy, which describes the optimal actions in a given state. The optimal policy may be found by solving

    \max_{\pi} \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r(\mathbf{s}_t, \mathbf{a}_t) \Big],    (1)
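To make the objective in Eq. (1) concrete, the sketch below rolls a policy out once and accumulates the discounted return. A Gym-style environment interface is assumed, and all names are illustrative.

```python
def discounted_return(env, policy, gamma=0.99, horizon=1000):
    """Roll out `policy` once and accumulate sum_t gamma^t * r_t, i.e. one
    sample of the expectation maximized in Eq. (1)."""
    s, _ = env.reset()
    G = 0.0
    for t in range(horizon):
        a = policy(s)                                  # a = pi(s): action in state s
        s, r, terminated, truncated, _ = env.step(a)
        G += (gamma ** t) * r                          # discount future rewards
        if terminated or truncated:
            break
    return G
```

Averaging this quantity over many rollouts estimates the expected discounted return of the policy, which is the value RL algorithms seek to maximize.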
Table 3
Classification of popular RL algorithms.

              On-policy                                                                    Off-policy
Value-based   SARSA (Sutton & Barto, 2018)                                                 DQN (Mnih, et al., 2013)
Policy-based  TRPO (Schulman, Levine, et al., 2017), PPO (Schulman, Wolski, et al., 2017)  DDPG (Lillicrap, et al., 2019), SAC (Haarnoja et al., 2018)
2013). Once the Q-function has been estimated, the optimal policy may be extracted by choosing the action with the highest Q-value in a given state.
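This greedy extraction step can be illustrated with a tabular Q-function; the array shapes here are arbitrary and purely for illustration.

```python
import numpy as np

# Illustrative only: a tabular Q-function over 5 discrete states and 3 actions.
Q = np.random.rand(5, 3)          # Q[s, a] ~ estimated value of action a in state s

def greedy_policy(s):
    """Extract the greedy policy: pick the action with the highest Q-value."""
    return int(np.argmax(Q[s]))   # a*(s) = argmax_a Q(s, a)

actions = [greedy_policy(s) for s in range(Q.shape[0])]
```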
Policy-based algorithms, on the other hand, directly parametrize the policy as a function 𝜋_𝜃 : 𝒮 → 𝒜, or a probability distribution over actions 𝜋_𝜃(𝐚|𝐬). The optimal parameters are found by directly solving the problem

    \max_{\theta} \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r(\mathbf{s}, \mathbf{a}) \Big],    (4)

where the expectation is over the stationary state distribution 𝐬 ∼ 𝜌 and the policy 𝐚 ∼ 𝜋_𝜃 (Sutton & Barto, 2018). There are many popular policy-based algorithms in RL, such as Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG) and the Soft Actor Critic (SAC) (Haarnoja, Zhou, Abbeel, & Levine, 2018; Lillicrap, et al., 2019; Schulman, Levine, Moritz, Jordan, & Abbeel, 2017; Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017).
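As a sketch of how an objective of the form (4) is optimized in practice, the snippet below implements a REINFORCE-style gradient step in PyTorch. The network size, action space, and inputs are illustrative and do not correspond to any of the cited algorithms.

```python
import torch

# Illustrative policy network: 4-dimensional states, logits over 2 actions.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, returns):
    """One gradient ascent step on E[G_t * log pi_theta(a_t | s_t)]."""
    logits = policy(states)
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(returns * logp).mean()   # minimize the negative of the objective
    opt.zero_grad(); loss.backward(); opt.step()
```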
3.2. On-policy and off-policy algorithms

Another useful classification of RL algorithms is whether the data used to improve the current policy must be collected in the system using that same policy. On-policy algorithms have this requirement, whereas off-policy algorithms allow for the use of data collected in the system using another policy.

In BMS systems, sensors are typically read with a sampling time interval of several minutes—sometimes even hourly. This slow collection of data is, as will be shown in the following sections, problematic as RL algorithms need large quantities of data to find the optimal control policy. Hence, off-policy algorithms appear attractive for BES control due to their ability to reuse all data collected throughout the learning process. With on-policy algorithms, the duration of data collection increases, as it is required to collect larger quantities of data after every policy update. In Table 3, a summary of the workings of some popular RL algorithms is shown.
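To make the distinction concrete, the sketch below contrasts the two data-handling patterns. The update callbacks stand in for, e.g., a DQN- or PPO-style learning step, and the structure is an illustration rather than any specific library's API.

```python
import random

replay = []                                   # off-policy: transitions persist

def off_policy_step(transition, update):
    """Off-policy pattern: every transition ever collected stays reusable."""
    replay.append(transition)
    if len(replay) > 100_000:
        replay.pop(0)                         # simple FIFO cap on buffer size
    batch = random.sample(replay, k=min(64, len(replay)))
    update(batch)                             # e.g., a DQN- or SAC-style update

def on_policy_step(collect_rollout, update):
    """On-policy pattern: fresh data must be gathered under the current policy."""
    batch = collect_rollout()                 # new rollout after every update
    update(batch)                             # e.g., a PPO-style update
    # batch is now discarded: the next update needs data from the new policy
```

With the slow sampling rates found in BMS systems, the off-policy pattern's reuse of old transitions is precisely what makes it attractive.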
4. The computer science perspective

Many of the commonly encountered challenges in applications of RL for BES control do not have their roots solely in energy systems research, but also in computer science. Therefore, in this review, we aim to investigate these challenges by an interdisciplinary approach. In the context of the problem at hand, this means studying the limitations of RL in BES and how these issues can be resolved. We do this by dividing the review into three main sections:

• RL in single buildings
• RL in building clusters
• Multi-agent aspects

Firstly, the section on RL in single buildings addresses the approaches taken by building energy researchers to alleviate the issue of learning inefficiencies in RL. RL algorithms typically require massive quantities of data and extensive interaction with the building before convergence is achieved (Yu, 2018). This implies that the training process could be prohibitively long for RL to be practically feasible in real-life BES. In this topic, we restrict our review to solutions treating single buildings in isolation—we do not allow the reuse of data from other buildings. The main remedies turn out to be related to the choice of learning algorithm, the use of learnt plant models, as well as the incorporation of expert knowledge.

Secondly, in the section on RL in building clusters we review approaches taken by building energy researchers which focus on exploiting data from other buildings when deploying RL in building retrofitting practices, or in newly built ones. Here, unlike in the first topic, we focus on approaches which transcend the isolated treatment of single buildings. Instead, we review methods that rely on transferring information for efficient learning in clusters of buildings. The most prominent approaches are related to various forms of transfer learning—either by directly transferring control policies, or by transferring plant models between buildings.

Lastly, the section on multi-agent aspects treats simultaneous learning of RL controllers in building clusters using multi-agent RL. Under imposed privacy restrictions or severe communication constraints, agents might not be able to mutually share information, so we outline the distinction between independent and joint learning. Moreover, we discuss the safety and scalability aspects of multi-agent control of BES in building clusters.

In summary, these three sections provide an overview of the most important challenges of applying RL to BES control. They also provide detailed descriptions of the tools and approaches adopted to address these challenges. However, it is critical to remember that although the challenges are treated in isolation here, in engineering they may arise simultaneously. For example, in the case of using transfer learning for RL in building clusters, it may be advisable to still use some approaches outlined in the section on RL in single buildings. Similarly, even if one is faced with a multi-agent RL problem, it is essential to still consider the potential benefits of, e.g., transfer learning to speed up the training process. Also, even though the challenges of applying RL to BES control are considered here on the scale of single buildings and building clusters, they could in principle arise on smaller scales. For example, the issue of controlling the BES in a multi-zone building can be solved by having separate controllers for each thermal zone. In such an arrangement, each thermal zone could possibly utilize information from the other zones in a transfer learning-like arrangement. The task now strongly resembles RL in building clusters, but on the scale of a single building. In the next three sections, the three main challenges mentioned earlier will be reviewed in detail.

5. Reinforcement learning in single buildings

We now turn our attention to the problem of data collection and learning in BES. Data collection in BES can be very time-consuming because of the low measurement sampling frequency. The total time between sensor readings is often in the order of several minutes. This is troublesome as most RL methods require substantial quantities of data to produce useful results. To capture the seasonal variations inherent in the surrounding weather conditions, it is potentially necessary to collect data over several years. This would lead to unacceptably long training times for practical applications. However, the solution is not as simple as sampling more often, as the underlying reason for the low sampling frequency is the slow dynamics of the BES. Oversampling in such a system would not capture any meaningful information, because the important variations are taking place on a time-scale of hours, or even days. Moreover, in a BES control problem, RL methods naturally require direct interaction with the building installations. Given the suboptimal behaviour of a randomly initialized controller, there could be periods of time when the building becomes uninhabitable due to the controller's stochastic exploration of heating and cooling settings. If, however, some
other aspects, such as learning a policy 𝜋 : 𝒮 → 𝒜 as opposed to directly optimizing over an action sequence. One advantage of doing so is that model-free methods can be used for learning. Chen et al. use MPC and an environment dynamics model to perform planning of actions for controlling the supply-water temperature from a district heating connection (Chen et al., 2020). It is implemented in conjunction with a PPO agent to improve a parametrized policy. This leads to large improvements in terms of sample efficiency over the purely model-free alternative.

Real-time learning of a dynamics model is also possible. For instance, the system developed by Nagy, Kazmi, Cheaib, and Driesen simultaneously learns a neural network model of the environment dynamics combined with MPC to perform planning of the supply power to an air-source heat pump (Nagy et al., 2018). They conclude that the model-based method is consistently better in terms of both sample efficiency and the resulting average reward. Zhang, Kuppannagari, Kannan, and Prasanna confirm the findings by noting that the combination of a neural network dynamics model and MPC converges to its final average reward an order of magnitude faster than PPO for temperature setpoint control in a simulated datacenter (Zhang et al., 2019). Comparable results are obtained by Ding, Du, and Cerpa, who use model predictive path integral control in combination with learning a dynamics model using a neural network to control VAV system temperature setpoints in a five-zone building (Ding et al., 2020). Their proposed model-based controller also reduces the training time compared to a PPO agent by an order of magnitude.
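A planning pattern shared by these model-based works can be sketched as a learned one-step dynamics model combined with random-shooting MPC. The model interface, horizon, and cost function below are illustrative assumptions, not the cited implementations.

```python
import numpy as np

def mpc_action(model, cost, s, horizon=24, n_candidates=256, action_dim=1):
    """Random-shooting MPC using a learned one-step model s' = model(s, a)."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s_sim, total = s, 0.0
        for a in seq:                      # roll the candidate sequence through the model
            s_sim = model(s_sim, a)
            total += cost(s_sim, a)        # e.g., energy use plus comfort violation
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]                     # apply only the first action, then replan
```

Replanning at every step keeps the controller responsive to model errors, which is one reason the model-based variants above remain competitive even with imperfect learned dynamics.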
5.2. Expert knowledge

Expert knowledge incorporation in the learning process is the second virtue of an algorithm for RL in single buildings. In some cases in RL, the training time can be reduced by using information that is available a priori. This is particularly useful for developing control systems in building stocks with renovation practices. The prior knowledge can take on many forms in BES control. For instance, Vázquez-Canteli, Ulyanin, Kämpf, and Nagy pre-train the action value function network of a DQN controller for the temperature setpoint of an air-to-water heat pump using the fitted Q-iterations algorithm (Vázquez-Canteli et al., 2019). The algorithm amounts to running the DQN algorithm offline with an experience buffer filled with data from an already existing rule-based controller. Approximately 20 days of historical data are used for offline training. The pre-training gives the DQN a head start, and it performs on the same level as the rule-based controller at deployment. After deployment, learning continues online, and the DQN agent shortly outperforms the rule-based controller in terms of electricity cost. Another approach to pre-training is investigated by Chen et al.: a parametrized policy is pre-trained using SL on data from an existing district heating supply-water temperature controller (Chen et al., 2020). It is then deployed and refined using PPO.
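A common shape of this pre-training step, supervised regression of a policy onto logged rule-based control data before online RL, is sketched below. The network, data shapes, and loss are illustrative and not the exact fitted Q-iteration setup cited above.

```python
import torch

# Illustrative warm start: regress a policy onto (state, action) pairs logged
# from an existing rule-based controller.
policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 1))        # setpoint output
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def pretrain(states, actions, epochs=50):
    """Supervised pre-training on the logged rule-based actions."""
    for _ in range(epochs):
        loss = torch.nn.functional.mse_loss(policy(states), actions)
        opt.zero_grad(); loss.backward(); opt.step()

# After pre-training, the same network is refined online with an RL algorithm.
```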
Providing the agent with more granular information about the environment is another way of incorporating expert knowledge in an RL algorithm. Du, Li, et al. use multitask learning to jointly learn the separate tasks of heating and cooling a building by controlling the indoor temperature setpoint (Du, Li, et al., 2021). A binary task ID is fed to the policy network of a DDPG agent to signify whether heating or cooling is to be performed. The resulting learning process is twice as fast as the single-task implementation. Moreover, restricting the action space using the ideas discussed earlier is another way of introducing information about the environment. If certain actions are known a priori to be unreasonable, or even forbidden, they can be eliminated.
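The task-ID mechanism amounts to simply augmenting the state vector before it enters the policy network; the sketch below uses invented names for illustration.

```python
import numpy as np

HEATING, COOLING = 0.0, 1.0   # binary task ID, as in the multitask setup above

def augment_state(s, task_id):
    """Append the task ID so one policy network can serve both tasks."""
    return np.concatenate([s, [task_id]])

# The same network then receives, e.g., augment_state(s, HEATING) in winter
# and augment_state(s, COOLING) in summer.
```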
6. Reinforcement learning in building clusters

Reusing information from previous building installations to ensure efficient training of new RL agents in BES control is highly desirable. Rather than training an agent from scratch, it can be given a ''warm start'' using transfer learning from previous successful implementations. However, all buildings are unique, both in terms of envelope and design, climate, heat loss status, and internal heat gains. What works in one building might not work in another. Moreover, the state–action space 𝒮 × 𝒜 may vary between buildings due to the changing availability of measured state information and controllable variables. As a result, the possibility of transferring information between buildings might diminish because this drastically changes the structure of the problem.

Identical state–action spaces in similar building types can also be imagined. Using digital twins, which are defined as digital representations of the building installation and energy systems, for simulating operation is an important example of this: training the RL agent in the simulated environment and then deploying it in the real building can be seen as a special case of transferring information between buildings. In that case, the main challenge is instead the discrepancy between the simulated and real-life building performances. But even if the simulated dynamics are validated and consistent, the effects of exogenous variables and disturbances can have a profound impact when transferring a controller to a new environment. The RL controller could as a result be forced into parts of the state–action space unseen during training.

RL in building clusters concerns the performance gained by transferring knowledge from one building to another. Controlling different buildings constitutes separate but related tasks, which can be exploited in the training process. It should be noted that RL in building clusters and in single buildings are distinctly different. The end goal is the same–to reduce the training time or increase the final average reward of the RL controller–but the means of achieving it are different. RL in single buildings concerns the utilization of data and knowledge available within a single building. RL in building clusters, on the other hand, focuses on transferring information between buildings, as visualized in Fig. 5. Here, the focus will be on transfer learning for RL applications, but for a more general review on transfer learning for smart buildings, we refer the reader to the publication by Pinto, et al. (2022).

6.1. Transferring policies

Transferring information from one building to another is required to facilitate RL-based BES control in building clusters. However, the term information is ambiguous regarding what is in fact being transferred. If there exists a policy that works well for controlling one building, transferring it to a new building with a different state–action space 𝒮 × 𝒜 requires a corresponding change to the policy and value function approximations. It is possible that some learned features about the states can be transferred among buildings and the approximation be fine-tuned for a new task with a new action space. However, to the authors' best knowledge, this remains unexplored in the context of BES control.

Assuming a fixed state–action space is common, but changing dynamics and exogenous variables are still novel challenges. Wei et al. note that despite two buildings being identical, surrounding climate characteristics such as weather patterns have an impact on the RL controller's learning process when applied to VAV-system control (Wei et al., 2017). This suggests that even if an RL agent is trained successfully for one building, it cannot be seamlessly transferred to a similar building in another climate without further fine-tuning. This raises the important question of whether performance can be improved by transferring policies between buildings without re-training from scratch. Du, Zandi, et al. transfer a learned policy to several new simulated buildings with varying thermal mass to investigate generalizability and robustness to changing transition dynamics when controlling indoor temperature setpoints in a building (Du, Zandi, et al., 2021). The transferred RL controller outperforms a rule-based controller in terms of temperature violation-time, with a slight increase in electricity cost for all test buildings. On the other hand, compared to a fixed setpoint
Fig. 5. RL in building clusters concerns the transfer of information between buildings to increase control performance. The arrows represent flow of information.
controller, the RL controller reduces the electricity cost by 15% with a slight increase in temperature violation-time. The issue is further studied by Biemann, Scheller, et al., who investigate the robustness to varying weather conditions when controlling the setpoint of a cooling coil and the fan air flow in a simulated datacenter (Biemann, Scheller, et al., 2021). By randomizing the weather patterns used in each training episode to be one from a few different climates, the resulting agent becomes more robust to such changes. It is also reported that the electricity consumption is consistently reduced by approximately 15% compared to a rule-based controller when evaluated on weather conditions not seen during training. Zhang, et al. perform a limited study of transferring policies between buildings with varying user preferences and appliance parameters (Zhang, et al., 2020). It is found that the training time is reduced by transferring policies when the buildings are similar. The advantages diminish when the buildings are more distinct.
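Under the simplifying assumption of identical state–action spaces, the transfer step common to these studies can be sketched as copying the trained weights and fine-tuning in the target building. The architecture and the choice to freeze the feature layer are illustrative, not taken from the cited works.

```python
import torch

# Illustrative policy transfer: warm-start the target building's policy with
# the source building's weights, then fine-tune only the later layers.
source_policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                                    torch.nn.Linear(64, 4))
target_policy = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                                    torch.nn.Linear(64, 4))
target_policy.load_state_dict(source_policy.state_dict())   # warm start

for p in target_policy[0].parameters():   # freeze the shared feature layer
    p.requires_grad = False

opt = torch.optim.Adam((p for p in target_policy.parameters() if p.requires_grad),
                       lr=1e-4)           # reduced learning rate for fine-tuning
# ...continue RL training in the target building from this initialization.
```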
6.2. Transferring dynamics models

In model-based RL, information about the environment can be transferred instead of policies. In a new building, efficient learning of the environment transition dynamics 𝑇(𝐬_{𝑡+1}|𝐬_𝑡, 𝐚_𝑡) is enabled by transfer learning. On a high level, the idea is that a model of 𝑇 in an old building can be transferred to new buildings. This would reduce the training time and increase the prediction performance compared to training the model from scratch. Fan, et al. use transfer learning to create predictive models for building energy usage (Fan, et al., 2020). Models are trained on a set of source buildings and are then transferred and fine-tuned on another set of target buildings. Remarkably large performance improvements of up to 60% in terms of relative reduction of the prediction error are observed. The improvements are most apparent when the available datasets at the target buildings are limited in size. Similarly, Qian, Gao, Yang, and Yu use transfer learning to bridge the sim-to-real gap by pre-training an energy prediction model in a simulator and using real building data for fine-tuning (Qian et al., 2020). An increase in prediction performance of 10% can be observed, and the benefit is the highest when only limited real-world measurement data is available.
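The pretrain/fine-tune pattern reported in these works can be sketched as follows. The model, the placeholder data, and the epoch counts are invented for illustration and do not reproduce the cited setups.

```python
import torch

# Illustrative dynamics-model transfer: pretrain on pooled source-building
# logs, then fine-tune gently on the scarce target-building data.
model = torch.nn.Sequential(torch.nn.Linear(8 + 2, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 8))       # predicts next state

def fit(sa_pairs, next_states, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = torch.nn.functional.mse_loss(model(sa_pairs), next_states)
        opt.zero_grad(); loss.backward(); opt.step()

# Placeholder tensors standing in for logged (state, action) -> next-state data.
source_sa, source_next = torch.randn(5000, 10), torch.randn(5000, 8)
target_sa, target_next = torch.randn(100, 10), torch.randn(100, 8)

fit(source_sa, source_next, epochs=200, lr=1e-3)   # large pooled source dataset
fit(target_sa, target_next, epochs=20, lr=1e-4)    # small target dataset
```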
in the system. In the case of representing a policy or value function
7. Multi-agent aspects using a neural network, the training would require vast amounts of
data to converge. Wei et al. recognizes this problem in the context of
In the context of a local energy community, it is likely that the control of the air flow in a VAV-system in a building with multiple
energy systems in several buildings need to be synchronized for re- thermal zones (Wei et al., 2017). They propose maintaining separate
newable energy sharing purposes, such as geothermal and photovoltaic policies for each thermal zone, and thus implicitly formulating it as a
production. From an RL perspective, controlling multiple buildings multi-agent problem. This transforms the problem into training several
simultaneously translates to a multi-agent control problem. It is a neural networks of manageable sizes, which leads to a larger reduction
system of many agents interacting simultaneously with a common envi- of energy cost and temperature violations compared to the single-
ronment, and with each other. One of the main difficulties concerning agent formulation. The RL control system is evaluated in simulations of
Multi-Agent Reinforcement Learning (MARL) is that the environment three different buildings using weather data from two distinct locations,
8
Fig. 6. Illustration of a generic multi-agent system with centralized control. The arrows indicate flow of information.
namely Riverside and Los Angeles. Compared to the single-agent formulation, the multi-agent system consistently achieves a larger relative cost reduction. Compared to a rule-based control of the VAV-system, the multi-agent RL controller gives a reduction of the monetary energy cost of 20%–70%. The wide range of cost reductions is believed to arise due to the difference in weather patterns in the two test locations.

It is important to note the distinct difference between agents learning independently and jointly. It is possible to formulate the multi-agent control problem as each agent solving a task independently of one another, but with a reward function that is dependent on the joint performance of all agents. Vázquez-Canteli et al. define the reward function of each agent to be the sum of the monetary costs of all the agents' individual energy usages (Vázquez-Canteli et al., 2019). In contrast to the purely independent learning case, there is now an incentive for agents to avoid using a lot of energy simultaneously, such that the agents must learn to coordinate. The algorithm is evaluated on a task of controlling the temperature setpoints of air-to-water heat pumps in two large residential buildings equipped with PV panels. It is demonstrated that the multi-agent formulation leads to lower energy cost compared to rule-based controllers. When photovoltaic arrays are included in only one of the buildings, the multi-agent controller outperforms independent control in terms of energy cost. This is believed to be due to coordination of the photovoltaic production, so that the building without photovoltaic panels may consume more electricity from the grid when the other is self-sufficient. Li, Zhang, et al. also study MARL with joint learning for HVAC control, but use a more sophisticated measure of thermal comfort, namely the Predicted Mean Vote (PMV) metric (Li, Zhang, et al., 2021). When applied to controlling the indoor temperature and humidity setpoints of a multi-zone laboratory building, it is concluded that MARL outperforms rule-based control by up to 15% in terms of operational energy usage without sacrificing thermal comfort. Moreover, the MARL formulation outperforms the approach of using a single agent to control all actions in terms of the PMV. This again suggests that coordination between agents can be beneficial in BES control.
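The shared-reward construction described above can be sketched in a few lines; the cost variables are placeholders.

```python
# Illustrative shared-reward construction for jointly learning agents: each
# agent is penalized by the community's total cost, not only its own.
def shared_rewards(costs):
    """costs[i]: monetary cost of building i at this time step."""
    total = sum(costs)
    return [-total for _ in costs]   # identical reward -> incentive to coordinate

# Purely independent learning would instead use r_i = -costs[i].
```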
Finally, the choice of independent or joint learning depends on whether the actions of one agent affect the others. If it is believed that 𝑇(𝐬′|𝐬, 𝐚_1, 𝐚_2, …, 𝐚_𝑁) = 𝑇(𝐬′|𝐬, 𝐚_1), then independent learning can be performed because there is no need for coordination. However, this would be oversimplifying, as parts of the heating/cooling system might be common to many buildings. Thermal storage and PV production are examples of such common utilities that create the need for coordination. Moreover, if a power or energy limit is persistent in the system, coordination is needed to avoid exceeding this constraint.

7.2. Safety and scalability

Apart from learning efficiency, an especially important aspect of RL algorithms for BES control is their scalability. As the number of buildings in an energy community increases, the control problem should not become intractable. MARL methods have the desirable property of preserving computational tractability, even in systems with many agents. Yu, et al. conclude that MARL is indeed scalable by evaluating their proposed algorithm in a system of 30 agents (Yu, et al., 2021). They draw a conclusion which can be summarized as follows. In a system of 30 agents where each agent has access to 10 discrete actions, using a single-agent formulation–amounting to central control of all actions in the entire system–would yield 10^30 different joint actions that can be applied in the system. Such an enormous number of actions would make the problem impossible to solve in practice. When instead formulated as a MARL problem, each agent learns a separate policy, so each agent only needs to consider 10 different actions. This makes the system scalable even as more agents are added.

Another important aspect of safe and reliable heating and cooling systems in local energy communities is the power capacity limit. It is undesirable for agents to require power simultaneously. Even if RL-control of a building cluster leads to a reduced collective energy signature, there may be other undesirable effects related to the power grid supplying electricity to the BES. An example of such a situation is when the heat in the building cluster is supplied by heat pumps. Vazquez-Canteli, Henze, and Nagy study several grid impact factors when applying multi-agent SAC to controlling BES with heat pumps, photovoltaic arrays, and thermal storage (Vazquez-Canteli et al., 2020). It is found that their MARL algorithm shows dramatically reduced peak electrical load compared to rule-based control. The ramping of the power is also reduced, suggesting less sharp peaks. Effects on the power grid are particularly important to consider because power capacity shortage is an increasing problem in general.

8. Future outlook

The research on RL for BES control is undoubtedly still in its early stages. In this review, we have identified the three main computer science related challenges in the area. Each of the three challenges still has unexplored research directions that require attention. In this section, we will highlight such directions that the authors believe are promising and meaningful.

8.1. Reinforcement learning in single buildings

An algorithm with high sample efficiency is crucial in BES control. A promising approach which possesses such a virtue is model-based RL. When using MPC as a basis for model-based RL, the biggest bottleneck is the requirement for an accurate environment model. However, if the learning of the model were to happen online, this problem would be alleviated. Moreover, online learning of a model, integrated with a model-free method for learning a policy in a so-called DYNA scheme, has been successful in many other applications of RL (Feinberg, et al.,
2018; Janner, Fu, Zhang, & Levine, 2019; Sutton, 1991). However, it is not clear how detailed the environment dynamics model must be in the case of BES control. BES are simple systems in the sense that one might not need a complicated model to describe their dynamics, but they are nonetheless susceptible to change over time. This would require some degree of adaptivity in the models used for planning in an RL algorithm.

Pre-training of policies by imitating already existing suboptimal controllers is another approach. There are, however, issues of such strategies that remain untreated in the context of BES. The most prominent drawback of pre-training is that it might affect exploration when learning the final policy. Therefore, application of the current state of the art in imitation learning, along with systematic empirical studies of pre-training, would also provide valuable insights.

8.2. Reinforcement learning in building clusters

Transferring information between buildings is highly desirable in practice and deserves more attention in the future. However, there are major obstacles, like differing state–action spaces between buildings, making such a process difficult. Even under the assumption of a fixed state–action space, there is need for algorithms that produce generally capable agents that can be deployed with reasonable results ''out of the box''. Fast fine-tuning of the policy would then be required to obtain optimal behaviour in as little time as possible. A promising approach to obtain agents that are robust to changing environment dynamics is domain randomization (Tobin, et al., 2017). Such a training framework would be useful when making the sim-to-real transfer–i.e., training an agent in a simulation and then deploying it in the real world–due to the mismatch between simulation and reality. Another approach is to use meta-RL to train agents with the goal of quickly adapting to new tasks, rather than being good at one task (Finn, Abbeel, & Levine, 2017; Nichol, Achiam, & Schulman, 2018).
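A domain-randomization training loop of the kind proposed by Tobin, et al. (2017) can be sketched as follows for the weather-variation case; the environment constructor, climate labels, and agent interface are illustrative assumptions.

```python
import random

CLIMATES = ["coastal", "continental", "subarctic"]   # illustrative labels only

def train_domain_randomized(make_env, agent, episodes=1000):
    """Sketch of domain randomization: resample the simulated weather at every
    episode so the policy is trained across many climates, not just one."""
    for _ in range(episodes):
        env = make_env(weather=random.choice(CLIMATES))   # randomized dynamics
        s = env.reset()
        done = False
        while not done:
            a = agent.act(s)
            s_next, r, done = env.step(a)
            agent.update(s, a, r, s_next)                 # any RL update rule
            s = s_next
```

The intuition is that a policy that performs well across randomized simulated climates is less likely to overfit to the simulator, easing the sim-to-real transfer discussed above.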
In connection to the topic of model-based RL, it would be desirable to use transfer learning when learning a model of the environment dynamics. Doing so could potentially alleviate some scalability issues of, e.g., MPC. Comprehensive investigations into the effects of transferring dynamics models as well as entire policies between buildings would be beneficial. The ideas of meta-learning can potentially be useful here as well.

8.3. Multi-agent aspects

Multi-agent RL is a promising approach to control communities with multiple buildings and allows for coordination to reduce the collective energy usage on the community level. The popular approach of defining a common reward function to incentivize coordination between agents does however have its limits. For instance, defining the reward as the sum of individual rewards can create unfairness among the agents. Further research into reward constellations that take fairness into account is needed to ensure proper functionality in a multi-agent system. Another topic that requires more attention is that coordination between agents is not always possible. This can be due to the lack of a central server for data processing, or privacy constraints which make the transmission of raw data impossible. Hence, there should be investigation into the effects on the collective performance of having completely decentralized control. Such a scenario would be especially important to consider in the context of power grid effects such as peak loads.

8.4. Theoretical guarantees

The past research on RL for BES control is to a considerable extent empirical. But to gain further intuition about the challenges of sample efficiency and performance in terms of the final reward, those problems should be studied theoretically. One of the important theoretical questions that needs answering is whether a bound can be placed on the sample complexity of an RL algorithm under reasonable assumptions. Such a bound would offer insight into the number of environment interactions needed to guarantee a certain expected reward. Furthermore, it is also desirable to upper bound the performance degradation, in terms of the expected reward, of transferring an agent from one building to another when the buildings have different transition dynamics. If such a bound were obtained, it would be helpful when studying the transfer of policies between BES.

8.5. Explainability

There exists an increasing body of literature on attempting to explain or visualize why RL agents take the actions that they do (Gunning, et al., 2019; Puiutta & Veith, 2020). This interest is not purely academic, but a strong demand from industry is also persistent due to the ethical–and sometimes legal–obligations to motivate the actions taken in autonomous systems. BES are no exception, and studies on which features are influential when controlling them would provide a better understanding of the problem. Moreover, explainability could aid in feature selection when designing the state space in RL for BES control, which could in turn improve the sample efficiency of the algorithms.

8.6. System design

Apart from RL methodology and theory, there is the issue of designing a scalable computation infrastructure to process measurement data in BES. Today, measurements are typically processed locally–e.g., in a microcontroller–on the building premises. An alternative approach is to instead transmit the measurements to a datacenter for processing and storage in the cloud. Such data collection opens up for the use of online processing using RL for controlling the BES without the need for installing computational hardware in the building itself. On the other hand, this would require a channel of communication between the sensors and actuators in the BES, and the cloud provider. Therefore, there is a demand for interoperability between all system components, which is a major challenge in practice.

Moreover, when controlling BES, data naturally arrives sequentially in time as a stream. In the case of controlling a cluster of buildings concurrently, data could arrive asynchronously from any of the buildings and would require fast processing. Moreover, since each building could have unique control logic, the computation cannot be statically allocated but must be dynamic. Considering the design of the data processing pipeline is an important future research direction, as it is crucial for making large-scale control possible.

9. Conclusions

In this review, we have identified the main computer science related challenges in BES control using RL. They are grouped into three categories: RL in single buildings, RL in building clusters, and multi-agent aspects. The main conclusions that can be drawn from this review in connection to the three main categories are the following:

• There exist empirical studies of model-based RL methods which show promising results in terms of sample efficiency. However, the requirement for historical data to model the building and the energy systems remains problematic.
• Transferring policies between buildings with similar state-transition dynamics has empirically shown indications of being highly effective in BES control.
• Multi-agent RL is a promising framework for controlling building clusters when shared utilities such as thermal storage and on-site electricity production are present. However, the challenges of sample efficiency persist.
• There is a shortage of theoretical analysis of the problems encountered in RL for BES control.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Afram, A., & Janabi-Sharifi, F. (2014). Theory and applications of HVAC control systems – A review of model predictive control (MPC). Building and Environment, 72, 343–355. http://dx.doi.org/10.1016/j.buildenv.2013.11.016
Afram, A., Janabi-Sharifi, F., Fung, A. S., & Raahemifar, K. (2017). Artificial neural network (ANN) based model predictive control (MPC) and optimization of HVAC systems: A state of the art review and case study of a residential HVAC system. Energy and Buildings, 141, 96–113. http://dx.doi.org/10.1016/j.enbuild.2017.02.012
Biemann, M., Liu, X., Zeng, Y., & Huang, L. (2021). Addressing partial observability in reinforcement learning for energy management. In Proceedings of the 8th ACM international conference on systems for energy-efficient buildings, cities, and transportation (pp. 324–328). Coimbra, Portugal: ACM. http://dx.doi.org/10.1145/3486611.3488730
Biemann, M., Scheller, F., Liu, X., & Huang, L. (2021). Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control. Applied Energy, 298, Article 117164. http://dx.doi.org/10.1016/j.apenergy.2021.117164
Chen, B., Cai, Z., & Bergés, M. (2020). Gnu-RL: A practical and scalable reinforcement learning solution for building HVAC control using a differentiable MPC policy. Frontiers in Built Environment, 6, Article 562239. http://dx.doi.org/10.3389/fbuil.2020.562239
Chen, Y., Norford, L. K., Samuelson, H. W., & Malkawi, A. (2018). Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy and Buildings, 169, 195–205. http://dx.doi.org/10.1016/j.enbuild.2018.03.051
Ding, X., Du, W., & Cerpa, A. E. (2020). MB2C: Model-based deep reinforcement learning for multi-zone building control. In Proceedings of the 7th ACM international conference on systems for energy-efficient buildings, cities, and transportation (pp. 50–59). Virtual Event, Japan: ACM. http://dx.doi.org/10.1145/3408308.3427986
Drgoňa, J., Arroyo, J., Cupeiro Figueroa, I., Blum, D., Arendt, K., Kim, D., et al. (2020). All you need to know about model predictive control for buildings. Annual Reviews in Control, 50, 190–232. http://dx.doi.org/10.1016/j.arcontrol.2020.09.001
Du, Y., Li, F., Munk, J., Kurte, K., Kotevska, O., Amasyali, K., et al. (2021). Multi-task deep reinforcement learning for intelligent multi-zone residential HVAC control. Electric Power Systems Research, 192, Article 106959. http://dx.doi.org/10.1016/j.epsr.2020.106959
Du, Y., Zandi, H., Kotevska, O., Kurte, K., Munk, J., Amasyali, K., et al. (2021). Intelligent multi-zone residential HVAC control strategy based on deep reinforcement learning. Applied Energy, 281, Article 116117. http://dx.doi.org/10.1016/j.apenergy.2020.116117
European Commission (2021). 2030 Digital compass: the European way for the digital decade. European Commission. https://eur-lex.europa.eu/legal-content/en/
Gao, G., Li, J., & Wen, Y. (2020). DeepComfort: energy-efficient thermal comfort control in buildings via reinforcement learning. IEEE Internet of Things Journal, 7(9), 8472–8484. http://dx.doi.org/10.1109/JIOT.2020.2992117
Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., & Yang, G.-Z. (2019). XAI—Explainable artificial intelligence. Science Robotics, 4(37), eaay7120. http://dx.doi.org/10.1126/scirobotics.aay7120
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv:1801.01290.
Han, M., May, R., Zhang, X., Wang, X., Pan, S., Yan, D., et al. (2019). A review of reinforcement learning methodologies for controlling occupant comfort in buildings. Sustainable Cities and Society, 51, Article 101748. http://dx.doi.org/10.1016/j.scs.2019.101748
Hong, T., Wang, Z., Luo, X., & Zhang, W. (2020). State-of-the-art on research and applications of machine learning in the building life cycle. Energy and Buildings, 212, Article 109831. http://dx.doi.org/10.1016/j.enbuild.2020.109831
International Energy Agency (2022). Buildings. IEA. https://www.iea.org/reports/buildings
Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to trust your model: model-based policy optimization. In Proceedings of the 33rd international conference on neural information processing systems (pp. 12519–12530). Red Hook, NY, USA: Curran Associates Inc.
Jiang, Z., Risbeck, M. J., Ramamurti, V., Murugesan, S., Amores, J., Zhang, C., et al. (2021). Building HVAC control with reinforcement learning for reduction of energy cost and demand charge. Energy and Buildings, 239, Article 110833. http://dx.doi.org/10.1016/j.enbuild.2021.110833
Jin, X., Baker, K., Christensen, D., & Isley, S. (2017). Foresee: A user-centric home energy management system for energy efficiency and demand response. Applied Energy, 205, 1583–1595. http://dx.doi.org/10.1016/j.apenergy.2017.08.166
Kurte, K., Amasyali, K., Munk, J., & Zandi, H. (2021). Comparative analysis of model-free and model-based HVAC control for residential demand response. In Proceedings of the 8th ACM international conference on systems for energy-efficient buildings, cities, and transportation (pp. 309–313). Coimbra, Portugal: ACM. http://dx.doi.org/10.1145/3486611.3488727
Lee, Z. E., & Zhang, K. M. (2021). Scalable identification and control of residential heat pumps: A minimal hardware approach. Applied Energy, 286, Article 116544. http://dx.doi.org/10.1016/j.apenergy.2021.116544
Levermore, G. J. (2000). Building energy management systems: applications to low energy HVAC and natural ventilation control (2nd ed.). London; New York: E & FN Spon.
Li, Y., O'Neill, Z., Zhang, L., Chen, J., Im, P., & DeGraw, J. (2021). Grey-box modeling and application for building energy simulations - A critical review. Renewable and Sustainable Energy Reviews, 146, Article 111174. http://dx.doi.org/10.1016/j.rser.2021.111174
Li, J., Zhang, W., Gao, G., Wen, Y., Jin, G., & Christopoulos, G. (2021). Toward intelligent multizone thermal control with multiagent deep reinforcement learning. IEEE Internet of Things Journal, 8(14), 11150–11162. http://dx.doi.org/10.1109/JIOT.2021.3051400
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2019). Continuous control with deep reinforcement learning. arXiv:1509.02971.
Manjarres, D., Mera, A., Perea, E., Lejarazu, A., & Gil-Lopez, S. (2017). An energy-efficient predictive control for HVAC systems applied to tertiary buildings based on regression techniques. Energy and Buildings, 152, 409–417. http://dx.doi.org/10.1016/j.enbuild.2017.07.056
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al.
TXT/?uri=CELEX:52021DC0118. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602.
Fan, C., Sun, Y., Xiao, F., Ma, J., Lee, D., Wang, J., et al. (2020). Statistical Nagy, A., Kazmi, H., Cheaib, F., & Driesen, J. (2018). Deep reinforcement learning for
investigations of transfer learning-based methodology for short-term building optimal control of space heating. arXiv:1805.03777.
energy predictions. Applied Energy, 262, Article 114499. http://dx.doi.org/10. Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms.
1016/j.apenergy.2020.114499, URL: https://linkinghub.elsevier.com/retrieve/pii/ [Cs] arXiv:1803.02999. URL: http://arxiv.org/abs/1803.02999.
S0306261920300118. Nweye, K., Liu, B., Stone, P., & Nagy, Z. (2022). Real-world challenges for multi-
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., & Levine, S. (2018). agent reinforcement learning in grid-interactive buildings. http://dx.doi.org/10.
Model-based value estimation for efficient model-free reinforcement learning. [Cs, 1016/j.egyai.2022.100202. URL: http://arxiv.org/abs/2112.06127. [cs, eess] arXiv:
Stat] arXiv:1803.00101. URL: http://arxiv.org/abs/1803.00101. 2112.06127.
Finn, C., Abbeel, P., & Levine, S. (2017). Mode-agnostic meta-learning for fast Pinto, G., Wang, Z., Roy, A., Hong, T., & Capozzoli, A. (2022). Transfer learning
adaptation of deep networks. [Cs] arXiv:1703.03400. URL: http://arxiv.org/abs/ for smart buildings: A critical review of algorithms, applications, and future
1703.03400. perspectives. Advances in Applied Energy, 5, Article 100084. http://dx.doi.org/
Frederiksen, S., & Werner, S. (2013). District heating and cooling. Lund: Studentlitteratur 10.1016/j.adapen.2022.100084, URL: https://linkinghub.elsevier.com/retrieve/pii/
AB. S2666792422000026.
Puiutta, E., & Veith, E. M. S. P. (2020). Explainable reinforcement learning: A survey. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine learning and knowledge extraction. Vol. 12279 (pp. 77–95). Cham: Springer International Publishing, http://dx.doi.org/10.1007/978-3-030-57321-8_5, URL: http://link.springer.com/10.1007/978-3-030-57321-8_5.
Qian, F., Gao, W., Yang, Y., & Yu, D. (2020). Potential analysis of the transfer learning model in short and medium-term forecasting of building HVAC energy consumption. Energy, 193, Article 116724. http://dx.doi.org/10.1016/j.energy.2019.116724, URL: https://linkinghub.elsevier.com/retrieve/pii/S0360544219324193.
Royapoor, M., Antony, A., & Roskilly, T. (2018). A review of building climate and plant controls, and a survey of industry perspectives. Energy and Buildings, 158, 453–465. http://dx.doi.org/10.1016/j.enbuild.2017.10.022, URL: https://linkinghub.elsevier.com/retrieve/pii/S0378778817318522.
Salsbury, T. I. (2005). A survey of control technologies in the building automation industry. IFAC Proceedings Volumes, 38(1), 90–100. http://dx.doi.org/10.3182/20050703-6-CZ-1902.01397, URL: https://linkinghub.elsevier.com/retrieve/pii/S1474667016374092.
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2017). Trust region policy optimization. arXiv:1502.05477 [cs]. URL: http://arxiv.org/abs/1502.05477.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347 [cs]. URL: http://arxiv.org/abs/1707.06347.
Sofos, M., Langevin, J., Deru, M., Gupta, E., Benne, K., Blum, D., et al. (2020). Innovations in sensors and controls for building energy management: research and development opportunities report for emerging technologies: Technical Report NREL/TP–5500-75601, DOE/GO–102019-5234, 1601591. http://dx.doi.org/10.2172/1601591, URL: https://www.osti.gov/servlets/purl/1601591/.
Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 160–163. http://dx.doi.org/10.1145/122344.122377, URL: https://dl.acm.org/doi/10.1145/122344.122377.
Sutton, R. S., & Barto, A. G. (2018). Adaptive computation and machine learning series, Reinforcement learning: an introduction (2nd ed.). Cambridge, Massachusetts: The MIT Press.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (pp. 23–30). Vancouver, BC: IEEE, http://dx.doi.org/10.1109/IROS.2017.8202133, URL: http://ieeexplore.ieee.org/document/8202133/.
Tymkow, P., Tassou, S., Kolokotroni, M., & Jouhara, H. (2020). Building services design for energy efficient buildings (2nd ed.). New York: Routledge.
Vázquez-Canteli, J. R., Henze, G., & Nagy, Z. (2020). MARLISA: Multi-agent reinforcement learning with iterative sequential action selection for load shaping of grid-interactive connected buildings. In Proceedings of the 7th ACM international conference on systems for energy-efficient buildings, cities, and transportation (pp. 170–179). Virtual Event Japan: ACM, http://dx.doi.org/10.1145/3408308.3427604, URL: https://dl.acm.org/doi/10.1145/3408308.3427604.
Vázquez-Canteli, J. R., & Nagy, Z. (2019). Reinforcement learning for demand response: A review of algorithms and modeling techniques. Applied Energy, 235, 1072–1089. http://dx.doi.org/10.1016/j.apenergy.2018.11.002, URL: https://linkinghub.elsevier.com/retrieve/pii/S0306261918317082.
Vázquez-Canteli, J. R., Ulyanin, S., Kämpf, J., & Nagy, Z. (2019). Fusing TensorFlow with building energy simulation for intelligent energy management in smart cities. Sustainable Cities and Society, 45, 243–257. http://dx.doi.org/10.1016/j.scs.2018.11.021, URL: https://linkinghub.elsevier.com/retrieve/pii/S2210670718314380.
Wang, Z., & Hong, T. (2020). Reinforcement learning for building controls: The opportunities and challenges. Applied Energy, 269, Article 115036. http://dx.doi.org/10.1016/j.apenergy.2020.115036, URL: https://linkinghub.elsevier.com/retrieve/pii/S0306261920305481.
Wei, T., Wang, Y., & Zhu, Q. (2017). Deep reinforcement learning for building HVAC control. In Proceedings of the 54th annual design automation conference 2017 (pp. 1–6). Austin TX USA: ACM, http://dx.doi.org/10.1145/3061639.3062224, URL: https://dl.acm.org/doi/10.1145/3061639.3062224.
Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of the twenty-seventh international joint conference on artificial intelligence (pp. 5739–5743). Stockholm, Sweden: International Joint Conferences on Artificial Intelligence Organization, http://dx.doi.org/10.24963/ijcai.2018/820, URL: https://www.ijcai.org/proceedings/2018/820.
Yu, L., Sun, Y., Xu, Z., Shen, C., Yue, D., Jiang, T., et al. (2021). Multi-agent deep reinforcement learning for HVAC control in commercial buildings. IEEE Transactions on Smart Grid, 12(1), 407–419. http://dx.doi.org/10.1109/TSG.2020.3011739, URL: https://ieeexplore.ieee.org/document/9146920/.
Zhang, X., Jin, X., Tripp, C., Biagioni, D. J., Graf, P., & Jiang, H. (2020). Transferable reinforcement learning for smart homes. In Proceedings of the 1st international workshop on reinforcement learning for energy management in buildings & cities (pp. 43–47). Virtual Event Japan: ACM, http://dx.doi.org/10.1145/3427773.3427865, URL: https://dl.acm.org/doi/10.1145/3427773.3427865.
Zhang, C., Kuppannagari, S. R., Kannan, R., & Prasanna, V. K. (2019). Building HVAC scheduling using reinforcement learning via neural network based model approximation. In Proceedings of the 6th ACM international conference on systems for energy-efficient buildings, cities, and transportation (pp. 287–296). New York NY USA: ACM, http://dx.doi.org/10.1145/3360322.3360861, URL: https://dl.acm.org/doi/10.1145/3360322.3360861.