0% found this document useful (0 votes)

39 views12 pages

5.a Review of Reinforcement Learning For Controlling Building Energy Systems From A Computer Science Perspective

Uploaded by

a.maker

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views12 pages

5.a Review of Reinforcement Learning For Controlling Building Energy Systems From A Computer Science Perspective

Uploaded by

a.maker

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Sustainable Cities and Society 89 (2023) 104351

Contents lists available at ScienceDirect

Sustainable Cities and Society

journal homepage: www.elsevier.com/locate/scs

A Review of Reinforcement Learning for Controlling Building Energy

Systems From a Computer Science Perspective✩
David Weinberg a ,∗, Qian Wang a,d , Thomas Ohlson Timoudas c , Carlo Fischione b
a
Department of Civil and Architectural Engineering, KTH Royal Institute of Technology, Teknikringen 78, 114 28, Stockholm, Sweden
b
Department of Network and Systems Engineering, KTH Royal Institute of Technology, Teknikringen 33, 114 28, Stockholm, Sweden
c
RISE Research Institutes of Sweden, Division Digital Systems, Computer Science, Isafjordsgatan 28 A, 164 40, Kista, Sweden
d
Uponor AB, Hackstavägen 1, 721 32, Västerås, Sweden

ARTICLE INFO ABSTRACT

Keywords: Energy efficient control of energy systems in buildings is a widely recognized challenge due to the use of low
Building Energy System temperature heating, renewable electricity sources, and the incorporation of thermal storage. Reinforcement
HVAC Learning (RL) has been shown to be effective at minimizing the energy usage in buildings with maintained
Heating
thermal comfort despite the high system complexity. However, RL has certain disadvantages that make
Cooling
it challenging to apply in engineering practices. In this review, we take a computer science approach to
Reinforcement learning
Machine learning
identifying three main categories of challenges of using RL for control of Building Energy Systems (BES).
RL The three categories are the following: RL in single buildings, RL in building clusters, and multi-agent aspects.
ML For each topic, we analyse the main challenges, and the state-of-the-art approaches to alleviate them. We also
identify several future research directions on subjects such as sample efficiency, transfer learning, and the
theoretical properties of RL in building energy systems. In conclusion, our review shows that the work on
RL for BES control is still in its initial stages. Although significant progress has been made, more research is
needed to realize the goal of RL-based control of BES at scale.

1. Introduction sustainability goals (Vázquez-Canteli & Nagy, 2019). These control

systems are particularly important in the transition to an energy-
According to a study by the International Energy Agency (IEA) in neutral, carbon-free future. On the one hand, the increasing use of
2022, buildings account for 30% of global final energy usage and 27% low-temperature heating and the integration of renewables into BES
of carbon emissions (International Energy Agency, 2022). As buildings require the development of novel control methods to optimize sys-
contribute significantly to energy usage and emissions, they must be tem performance. In addition, coordinating building clusters in local
targeted for change in order to create a sustainable future society. The energy communities creates an unprecedented level of system com-
IEA states that the next decade will be critical for implementing the plexity, requiring scalable BES control methods that can be applied
new technologies necessary to make 20% of the building stock zero- to multiple-building systems. With the increasing trend of large-scale
carbon-ready by 2030. In this regard, the European Commission’s 2021 data collection in buildings, there is an opportunity to use Machine
Digital Compass notes that the goal of accelerating the sustainable Learning (ML) approaches to control BES and reduce final energy usage
transformation is closely tied to digitalization (European Commission, and carbon emissions.
2021).
Digitalization trends in buildings, such as increased device connec- 1.1. Background
tivity and large-scale data collection, necessitate a reconsideration of
how building energy systems (BES) are operated. The development of Today, most building heating and cooling systems are operated
data-driven BES control systems, which have been shown to reduce using rule-based control logic (Salsbury, 2005). However, these tra-
operating energy usage by 10%–20%, is a crucial enabler for achieving ditional control systems have proved insufficient for operating the

✩ Financial Disclosure: This work is financially supported by Swedish Energy Agency with project grant agreement No. 51544-1. Special thanks to RISE Research
Institutes of Sweden for providing in-kind supports.
∗ Corresponding author.
E-mail addresses: [email protected] (D. Weinberg), [email protected] (Q. Wang), [email protected] (T.O. Timoudas), [email protected]
(C. Fischione).

https://doi.org/10.1016/j.scs.2022.104351
Received 25 May 2022; Received in revised form 11 December 2022; Accepted 11 December 2022
Available online 14 December 2022
2210-6707/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

with various comfort objectives such as indoor air quality, noise, and
thermal comfort (Han, et al., 2019). Commonly used algorithms as
well as exploration strategies are analysed. A discussion on value-based
versus policy-based methods is included. Multi-agent RL is identified
as a future research direction of significant importance. Although not
strictly a review, Nweye, Liu, Stone, and Nagy identify nine practical
challenges of RL in grid-interactive buildings (Nweye et al., 2022). Sam-
ple efficiency, partial observability, and explainability are, among other
topics, identified as important future research directions. An example
of off-line learning of an RL controller in the proposed test-environment
CityLearn is also provided.
Some previous reviews treat RL in combination with other top-
ics related to building HVAC and energy systems control. Royapoor,
Fig. 1. Number of publications per year in the field of RL for HVAC control. The results Antony, and Roskilly reviews holistic control of a building with HVAC
were obtained from Web of Science using the search query topic=(reinforcement control included as a section (Royapoor et al., 2018). Popular control
learning AND (hvac OR heating OR cooling)).
methods, including RL, are investigated. The use of thermal comfort
and occupancy modelling for control to reduce operational energy
usage is discussed. A survey on industry familiarity with various control
next generation of thermal systems, which will use temperature lev- strategies is also included. Hong, Wang, Luo, and Zhang reviews the
els much closer to room temperatures and integrate thermal storage use of ML in the entire building life cycle, with a section on HVAC
solutions. The main challenge is the increased thermal inertia of such control (Hong et al., 2020). Supervised Learning (SL) for personal
next-generation systems–the reduced energy transfer capacity requires comfort and occupancy-modelling is treated. MPC, RL and their respec-
longer control horizons to maintain a sufficient thermal comfort. Model tive advantages for HVAC control, are discussed. Potential benefits of
Predictive Control (MPC), a method for dynamic long-term planning planning in RL methods are mentioned. Pinto, Wang, Roy, Hong, and
of control systems, has been proposed as a solution to these prob- Capozzoli review the use of transfer learning in smart buildings (Pinto,
lems (Afram, Janabi-Sharifi, Fung, & Raahemifar, 2017; Jin, Baker, et al., 2022). Transfer learning for both RL and various kinds of
Christensen, & Isley, 2017; Manjarres, Mera, Perea, Lejarazu, & Gil- predictive modelling are treated both in the context of single buildings
Lopez, 2017). However, such methods require accurate building plant and building clusters.
models, which are inherently hard to obtain due to the complex and
time-varying dynamics. 1.3. Contributions of this review
Reinforcement Learning (RL) is getting increased attention in the
field of building Heating, Ventilation and Air Conditioning (HVAC)
The contributions of this review to the existing literature consist of
and energy network control research. It has proven to be an attractive
four key aspects:
alternative to MPC due to its model-free approaches. Commonly, RL
algorithms do not strictly require a model of the environment, but learn • Many challenges of applying RL to BES control have their roots in
control policies through interaction. While this alleviates the problems computer science, not in building energy research. We provide a
of building modelling, RL suffers from other drawbacks, such as the guide for building energy researchers to become familiar with the
need for massive quantities of data and long training times. computer science related challenges, and to identify key research
Several advancements have been made in the field of RL in the last gaps where meaningful contributions can be made.
decade, and many effective learning algorithms have emerged. Some • Previous treatment of the computer science related challenges of
of these algorithms have been applied to a wide range of applications RL for BES control is distributed across a vast body of literature,
in BES (Chen, Norford, Samuelson, & Malkawi, 2018; Du, Zandi, et al., which is difficult to overlook. We address this by identifying and
2021; Gao, Li, & Wen, 2020; Wei, Wang, & Zhu, 2017). The number of structuring the main challenges into three categories, outlined in
publications per year on RL for HVAC control can be seen in Fig. 1. Due Sections 5–7.
to the growth of the field, there is a need for an up-to-date literature • Previous reviews address only limited parts of the computer sci-
review summarizing the main research gaps and previous research. ence related challenges of RL for BES control. We address this by
providing a comprehensive treatment of all the main challenges.
1.2. Previous reviews • The study identifies promising future research directions to ad-
dress the computer science related challenges of applying RL to
Previous reviews focus on RL methodology, but do not consider BES control.
the full extent of the computer science related challenges of BES
control. Vázquez-Canteli and Nagy review the application of RL for The three categories of challenges that are identified in this review
demand response applications (Vázquez-Canteli & Nagy, 2019). RL are referred to as reinforcement learning in single buildings, reinforcement
theory focusing on Q-learning and the exploration/exploitation trade- learning in building clusters, and multi-agent aspects. To further justify our
off is treated. Commonly used algorithms and action selections are contributions, the coverage by previous reviews of the three categories
also presented. Future research directions are identified, including the can be seen in Table 1. Note that coverage of the main challenges in
incorporation of expert knowledge, and reduction of the state–action previous reviews can mean two things: (1) The reviewers investigate
space to improve the sample efficiency of the RL algorithms. The need previous work concerning the challenges, or (2) discuss them as future
for multi-agent systems to facilitate simultaneous control of building research directions. It is evident that no previous review treats all
clusters is also emphasized. Wang and Hong review the use of RL categories of challenges.
for applications in building control such as window opening, lighting,
and HVAC (Wang & Hong, 2020). Many aspects of RL, such as the 1.4. Organization of the review
popularity of various algorithms, the exploration/exploitation trade-
off, and the choice of states and actions are investigated. Reduction In Section 2, we introduce the BES and discuss the challenges of
of the state–action space and multi-agent systems are suggested as control in such systems. This section is followed by a primer on the
future research directions. Han, et al. reviews the use of RL for control theory behind RL in Section 3. Given an overview of key theoretical

2
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Table 1
Coverage of the three main challenges in previous reviews. Review work, and identification as a future research direction are denoted by RW
and FR, respectively.
Author Reference RL in single RL in building Multi-agent
buildings clusters aspects
Vázquez-Cante and Nagy Vázquez-Canteli and Nagy (2019) FR/RW – FR
Wang and Hong Wang and Hong (2020) RW – FR
Han et al. Han, et al. (2019) RW – FR
Nweye et al. Nweye et al. (2022) FR – FR
Royapoor et al. Royapoor et al. (2018) FR – –
Hong et al. Hong et al. (2020) FR/RW FR –
Pinto et al. Pinto, et al. (2022) RW FR/RW –

Fig. 2. A generic illustration of a BES, and the associated measurement and control system. The dashed lines represent exchange of sensor measurements and control signals.

aspects, we discuss the computer science perspective, and review state- such that the load can always be supplied, while keeping the cost and
of-the-art approaches to RL-based BES control in Sections 4–7. In energy usage to a minimum. But the components of a BES are highly
Section 8, we make a future outlook and identify promising research interactive, and achieving these goals using traditional control methods
directions. Lastly, we conclude the paper and summarize our findings is prohibitively difficult. Using Fig. 2 as an example, the controller must
in Section 9. coordinate the operation of all components in the BES while taking
into account complications such as the system’s dynamic response, and
2. Problem formulation time delays in sensor measurements and actuators. In large buildings,
where the number of measured and controlled variables increases, this
The Building Energy System (BES) can be thought of as all systems coordination problem becomes highly complex, so the need for new
in a building that use energy. This includes both electrical appliances control methods is apparent.
and lighting, and the thermal energy system. In this review, we limit
the definition of BES to encompass heating equipment, thermal storage, 2.1. Data collection in buildings
emission systems, and ventilation and air conditioning, as well as their
auxiliary systems, as shown in Fig. 2. Even though the constituent To control and monitor a BES, operational data must be collected
parts of the BES have their own intricate principles of operation, we and stored. This is typically done in a Building Management System
will not cover them in detail. The purpose of this section is merely to (BMS), which integrates sensors, actuators, and networking devices to
establish the role of RL in BES control. For case-specific treatment of provide information about all subsystems in a building. For control
BES components, we refer the reader to Frederiksen and Werner (2013), purposes, the BMS operates on multiple levels which represent distinct
Tymkow, Tassou, Kolokotroni, and Jouhara (2020). levels of abstraction of the system. For example, the level closest to
All the BES components in Fig. 2 are part of an interconnected sensors and actuators is known as the field level (Levermore, 2000).
system: heat is generated using heating equipment, injected into a Devices at the field level receive instructions from higher level con-
thermal storage, and distributed in the building using an emission trollers to write specified output signals to the actuators. They also
system. On top of this, ventilation, and air conditioning are added to read sensor values and passes them back to the higher-level devices.
ensure an acceptable indoor environment. Naturally, all components The higher-level controllers receive the sensor data, which is stored
must be controlled to ensure a comfortable indoor climate, preferably and processed to determine suitable control signals to send to the field
at a low monetary cost while using as little energy as possible. An level devices. This is demonstrated using dashed lines in Fig. 2 where
important example of this is optimal control of thermal storage systems at each BES-component, there are sensors that measure the variables
to shift the time of heat production away from times of high load. For which we want to control. These measurements are communicated to
optimal operation, the charging of the storage tank must be scheduled a central controller, which computes the control signals to be applied

3
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

to the actuators at the BES-components. Even in buildings where BMS Table 2

Pros and cons of physics-based, and RL-based control.
systems are not installed, the use of networked sensors and actuators
allows for scaled down control solutions. Measurement technology in Pros Cons

BES is a vast area on its own, and for brevity we will not discuss it in ∙ Incorporation of expert ∙ Difficult in complex
knowledge systems
more detail here. We instead refer the interested reader to Levermore Physics-based
∙ Interpretable system ∙ Governing physical laws
(2000), Sofos, et al. (2020). dynamics might not exist
These days, BMS systems are common and serve as promising
∙ No system-model ∙ Requires large quantities of
platforms for integrating data-driven control into BES. Stored opera- required data
RL-based
tional data can be used inside ML workflows to create control systems ∙ Can control very complex ∙ Few stability- or robustness
that autonomously optimize the energy performance of a building. In systems guarantees
particular, BMS systems facilitate large scale data collection in clusters
of buildings, paving the way for coordination and control of entire
energy communities. In the next section, we explain the role of RL in policy, which describes the optimal actions in a given state. The optimal
this scenario, and how it can be integrated to solve control problem in policy may be found by solving
complex BES. [∞ ]
∑
max E 𝛾 𝑡 𝑟(𝐬𝑡 , 𝐚𝑡 ) , (1)
𝜋
2.2. The need for reinforcement learning 𝑡=0

where 𝑟 ∶  ×  → R is a so-called reward function (Sutton &

The possibility of large scale data collection in buildings allows for Barto, 2018), and 𝛾 ∈ [0, 1) is a discount factor. This reward function
different forms of data-driven control. An important example is to use measures the immediate benefit of taking the action 𝐚𝑡 in state 𝐬𝑡 , so
data to calibrate the parameters of a physics-based BES model, in a the objective is to find a control policy 𝜋 that maximizes the expected
so-called grey-box modelling arrangement (Li, O’Neill, et al., 2021). long-term reward. Note that in the equation above, 𝐚𝑡 is a random
The calibrated system model could then be used for optimal control. variable with density 𝜋(𝐚𝑡 |𝐬𝑡 ). The density of 𝐬𝑡 is given by the state-
This approach, taken by many researchers in the MPC community, has transition dynamics of the system, which are typically assumed to be
been demonstrated to suffer from reliability issues tied to the model’s Markovian, resulting in a so-called Markov Decision Process (MDP). We
agreement with the real system (Drgoňa, et al., 2020). According to Lee shall denote the transition dynamics by 𝑇 (𝐬𝑡+1 |𝐬𝑡 , 𝐚𝑡 ).
and Zhang, grey-box techniques may be used to automate the param- In practice, the reward function may be difficult to specify for a
eter estimate process in physics-based modelling (Lee & Zhang, 2021). given problem, and the resulting optimal policy 𝜋 changes accordingly.
It is possible that the use of a fairly generic physics-based grey-box The form of the reward function is typically left to the control systems
model allows for it to be transferred between multiple BES, removing engineer as a design choice, which is non-trivial in general. In the case
the need for re-modelling of each building. RL should be viewed as of BES control, there is a consensus that a suitable reward function
a complement to such approaches, with the added benefit of being should penalize high energy usage in the BES, as well as violations of
capable of controlling buildings with complex dynamics, and perhaps any thermal comfort requirements. For a more detailed overview of the
removing altogether the requirement to comply with a pre-specified choice of reward function, we refer the reader to Han, et al. (2019),
system model. Vázquez-Canteli and Nagy (2019), Wang and Hong (2020).
RL offers a data-driven approach to learning the optimal control
signals to send to each component in the BES by interacting with the 3.1. Value-based and policy-based algorithms
physical system. No model of the BES is strictly required by many RL
algorithms, thanks to its purely data-driven procedures. Moreover, RL is RL provides algorithms to solve the problem in Eq. (1) without
in principle agnostic to the underlying technologies used in the BES—it explicit knowledge of the dynamics 𝑇 . Instead, RL algorithms rely on
does not matter whether it uses a boiler, heat pump, district heating, interactions with the physical system to update the control policy 𝜋
or something else. Even the integration of renewables and thermal towards the optimum, essentially by trial and error. In fact, there are
storage can be done conveniently using the RL-formulation by including several benefits of using RL over physics-based modelling, but as shall
relevant measurement information from the corresponding components be clarified later in the review, there are also drawbacks. The pros and
in the BES. However, as shall be clarified in the next section, the cons of RL compared to physics-based modelling are summarized in
problem remains identical from an RL point of view. Table 2.
There is a plethora of algorithms for finding the optimal policy for
3. A primer on reinforcement learning a specific system using RL. They are typically classified as either value-
based, or policy-based algorithms. Value-based algorithms typically try
Clearly, from a control system perspective, certain variables of to find a policy which maximizes the so-called action value function,
interest in the BES must be measured. For instance, it is natural to which is defined as
[∞ ]
measure the indoor air temperature, as it is directly related to the ∑ |
|
𝑄𝜋 (𝐬, 𝐚) ∶= E 𝛾 𝑘 𝑟𝑡+𝑘+1 |𝐬𝑡 = 𝐬, 𝐚𝑡 = 𝐚 . (2)
thermal comfort. But many other variables might also be useful in the |
𝑘=0 |
control problem, such as the relative humidity and the thermal energy
usage of the building. In RL jargon, the collection of such relevant A prominent approach is to use a parametrized representation 𝑄𝜃 ∶
measured information is referred to as the system state, and is denoted  ×  → R, such as a neural network, for the action-value function.
by the real valued vector 𝐬𝑡 ∈ . The subscript 𝑡 is a discrete time The parameters would be estimated by minimizing the mean squared
index, and the set  of possible state values is called the state space. error between the true action-value function and the approximation.
[ ( )2 ]
Similarly, there are actuators in the BES such as valves, pumps and 1
min E 𝑄𝜃 (𝐬, 𝐚) − 𝑄(𝐬, 𝐚) . (3)
power regulators which allows for manipulating the system. In RL, the 𝜃 2
control signals sent to the actuators are referred to as actions, and are The target 𝑄 is typically not known exactly, but an estimate can be
denoted by the vector 𝐚𝑡 ∈ . Here, the set  is the set of possible provided by running the controller in the system. The parameters of 𝑄𝜃
actions, called the action space. RL algorithms have the goal of finding would then be updated using stochastic gradient descent or variations
the optimal action 𝐚𝑡 to take given a measured state 𝐬𝑡 . More formally, thereof Sutton and Barto (2018). For instance, this is the general idea
they attempt to find a probability density function 𝜋(𝐚𝑡 |𝐬𝑡 ) known as a behind the famous Deep Q-Networks (DQN) algorithm (Mnih, et al.,

4
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Table 3
Classification of popular RL algorithms.
On-policy Off-policy
Value-based SARSA (Sutton & Barto, 2018) DQN (Mnih, et al., 2013)
Policy-based TRPO (Schulman, Levine, et al., 2017), PPO (Schulman, Wolski, et al., 2017) DDPG (Lillicrap, et al., 2019), SAC (Haarnoja et al., 2018)

2013). Once the Q-function has been estimated, the optimal policy may algorithm, the use of learnt plant models, as well as incorporation of
be extracted by choosing the action with the highest 𝑄-value in a given expert knowledge.
state. Secondly, in the section on RL in building clusters we review
Policy-based algorithms, on the other hand, directly parametrize the approaches taken by building energy researchers which focus on ex-
policy as a function 𝜋𝜃 ∶  → , or a probability distribution over ploiting data from other buildings when deploying RL in building
actions 𝜋𝜃 (𝐚|𝐬). The optimal parameters are found by directly solving retrofitting practices, or in newly built. Here, unlike in the first topic,
the problem we focus on approaches which transcend the isolated treatment of
[∞ ] single buildings. Instead, we review methods that rely on transfer-
∑
max E 𝛾 𝑡 𝑟(𝐬, 𝐚) , (4) ring information for efficient learning in clusters of buildings. The
𝜃
𝑡=0 most prominent approaches are related to various forms of trans-
where the expectation is over the stationary state distribution 𝐬 ∼ 𝜌 fer learning—either by directly transferring control policies, or by
and the policy 𝐚 ∼ 𝜋𝜃 (Sutton & Barto, 2018). There are many popular transferring plant models between buildings.
policy-based algorithms in RL, such as Trust Region Policy Optimiza- Lastly, the section on multi-agent aspects treats simultaneous learn-
tion (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic ing of RL-controllers in building clusters using multi-agent RL. Un-
Policy Gradient (DDPG) and the Soft Actor Critic (SAC) (Haarnoja, der imposed privacy restrictions or severe communication constraints,
Zhou, Abbeel, & Levine, 2018; Lillicrap, et al., 2019; Schulman, Levine, agents might not be able to mutually share information, so we outline
Moritz, Jordan, & Abbeel, 2017; Schulman, Wolski, Dhariwal, Radford, the distinction between independent and joint learning. Moreover, we
& Klimov, 2017). discuss the safety and scalability aspects of multi-agent control of BES
in building clusters.
3.2. On-policy and off-policy algorithms In summary, these three sections provide an overview of the most
important challenges of applying RL to BES control. They also provide
detailed descriptions of the tools and approaches adopted to address
Another useful classification of RL algorithms is whether the data
these challenges. However, it is critical to remember that although the
used to improve the current policy must be collected in the system using
challenges are treated in isolation here, in engineering, they may arise
that same policy. On-policy algorithms have this requirement, whereas
simultaneously. For example, in the case of using transfer learning for
off-policy algorithms allow for the use of data collected in the system
RL in building clusters, it may be advisable to still use some approaches
using another policy.
outlined in the section on RL in single buildings. Similarly, even if
In BMS systems, sensors are typically read with a sampling time in-
one is faced with a multi-agent RL problem, it is essential to still
terval of several minutes—sometimes even hourly. This slow collection
consider the potential benefits of, e.g., transfer learning to speed up
of data is, as will be shown in the following sections, problematic as
the training process. Also, even though the challenges of applying RL
RL algorithms need large quantities of data to find the optimal control
to BES control are considered here on the scale of single buildings and
policy. Hence, off-policy algorithms appear attractive for BES control
building clusters, they could in principle arise on smaller scales. For
due to their ability to reuse all data collected throughout the learning
example, the issue of controlling the BES in a multi-zone building can
process. With on-policy algorithms, the duration of data collection
be solved by having separate controllers for each thermal zone. In such
increases, as it is required to collect larger quantities of data after every
an arrangement, each thermal zone could possibly utilize information
policy update. In Table 3, a summary of the workings of some popular
from the other zones in a transfer learning-like arrangement. The task
RL algorithms is shown.
now strongly resembles RL in building clusters, but on the scale of a
single building. In the next three sections, the three main challenges
4. The computer science perspective
mentioned earlier will be reviewed in detail.

Many of the commonly encountered challenges in applications of 5. Reinforcement learning in single buildings
RL for BES control do not have their roots solely in energy systems
research, but also in computer science. Therefore, in this review, we We now turn our attention to the problem of data collection, and
aim to investigate these challenges by an interdisciplinary approach. In learning in BES. Data collection in BES can be very time-consuming
the context of the problem at hand, this means studying the limitations because of the low measurement sampling frequency. The total time
of RL in BES and how these issues can be resolved. We do this by between sensor readings is often in the order of several minutes. This is
dividing the review into 3 main sections: troublesome as most RL methods require substantial quantities of data
• RL in single buildings to produce useful results. To capture the seasonal variations inherent in
• RL in building clusters the surrounding weather conditions, it is potentially necessary to collect
• Multi-agent aspects data over several years. This would lead to unacceptably long training
times for practical applications. However, the solution is not as simple
Firstly, the section on RL in single buildings addresses the approaches as sampling more often, as the underlying reason for the low sampling
taken by building energy researchers to alleviate the issue of learning frequency is the slow dynamics of the BES. Oversampling in such a
inefficiencies in RL. RL algorithms typically require massive quantities system would not capture any meaningful information, because the
of data and extensive interaction with the building before convergence important variations are taking place on a time-scale of hours, or even
is achieved (Yu, 2018). This implies that the training process could be days. Moreover, in a BES control problem, RL methods naturally require
prohibitively long for RL to be practically feasible in real-life BES. In direct interaction with the building installations. Given the suboptimal
this topic, we restrict our review to solutions treating single buildings behaviour of a randomly initialized controller, there could be periods of
in isolation—we do not allow the reuse of data from other buildings. time when the building becomes uninhabitable due to the controller’s
The main remedies turn out to be related to the choice of learning stochastic exploration of heating and cooling settings. If, however, some

5
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

2020). Discretization of the action space increases its cardinality expo-

nentially, and exploring the state–action space  ×  during training
requires more interaction samples. Moreover, the maximization over
action-values as in the DQN algorithm may be intractable for large or
continuous action spaces. Gao et al. also conclude that using DDPG–
which is compatible with continuous action spaces–leads to a fast
convergence and to a high final average reward. Du, Zandi, et al.
confirm these findings by noting that DDPG converges almost one order
of magnitude faster than DQN in terms of the number of training
episodes (Du, Zandi, et al., 2021). Evidently, using continuous actions
is not always an option if discrete setpoints are all that is available. In
such cases, the size of the action space  can still be reduced through
other approaches. Jiang, et al. propose a preprocessing of the actions
Fig. 3. Illustration of RL in single buildings as the intersection of high sample efficiency
suggested by a DQN agent (Jiang, et al., 2021). During peak load hours,
and incorporation of expert knowledge.
the indoor temperature setpoint is clipped at the comfort boundary.
Any higher setpoint suggested by the DQN agent would inevitably
lead to a lower reward. Jiang, et al. report that such reduction of the
size of the action space helps produce better agents that consistently
outperform the non-reduced version.

5.1.2. Off-policy algorithms

On-policy or off-policy operation is another determinant of the
sample efficiency of an RL algorithm. Many researchers have used
DQN and DDPG, which are both off-policy algorithms, for BES con-
trol (Du, Zandi, et al., 2021; Gao et al., 2020; Jiang, et al., 2021;
Wei et al., 2017). They have the benefit of being able to utilize a
buffer of collected experience that can be used throughout training,
despite having been collected using an old policy. Biemann, Scheller,
Liu, and Huang compare several on-policy and off-policy methods on
a problem of controlling the temperature setpoints and fan mass flow
Fig. 4. The main approaches taken in BES control research to increase sample rates in a simulated datacenter with two thermal zones (Biemann,
efficiency of RL algorithms.
Scheller, et al., 2021). It is concluded that the off-policy SAC agent
converges one order of magnitude faster than its on-policy counterparts
in terms of the number of environment interactions. Due to its high
exploratory actions are known to be unreasonable a priori, considering sample efficiency, they hypothesize that the SAC algorithm will have
them less often might speed up the training phase. Hence, incorpo- a key role to play in the next generation of HVAC control systems.
rating expert knowledge and guidance before deployment can be of On the other hand, successful implementations of on-policy algorithms
time-saving benefit. have also been produced (Chen, Cai, & Bergés, 2020). To increase the
The observations above motivate the need for RL algorithms with data efficiency of on-policy algorithms, it is possible to use recurrent
a short training time and a high final average reward within a single policies. Biemann, Liu, Zeng, and Huang demonstrate that doing so can
building. More precisely, an algorithm suitable for RL in single buildings drastically increase the sample efficiency of the algorithm (Biemann,
should possess the following two virtues: it should utilize the collected Liu, et al., 2021). The reason is pointed out as being that the state
data as efficiently as possible, and it should incorporate prior knowl- process is non-Markovian, so that an internal state must be maintained
edge before training as shown in Fig. 3. In this section, previous work with a recurrent neural network.
treating the two virtues in the context of BES control will be reviewed.
5.1.3. Model-based RL
5.1. Sample efficiency Model-based RL is an alternative measure to improve sample ef-
ficiency. The idea is that knowledge of the environment dynamics
Sample efficiency is the first virtue of an RL algorithm for BES 𝑇 (𝐬𝑡+1 |𝐬𝑡 , 𝐚𝑡 ) can be leveraged to reduce the required number of environ-
control. Because the data collection process in buildings can be slow, it ment interactions. Using a learned model of the environment dynamics
is desirable for an algorithm to exhaust the collected data as much as for optimizing actions is strongly related to MPC (Afram & Janabi-
possible, even if it leads to a higher computational load. Moreover, it Sharifi, 2014; Drgoňa, et al., 2020). In fact, MPC can be regarded as a
is important to consider what variables are monitored. For a discussion special instance of model-based RL where instead of finding an optimal
about what measurements should make up the system state in an RL policy, one directly maximizes over an action sequence, c.f. Eq. (1):
algorithm for BES control, we refer to the review by Wang and Hong
(2020). Suitable algorithms should obtain the highest possible final ∑
𝑡+𝐻
max 𝑟(𝐬𝑖 , 𝐚𝑖 ) (5)
performance from the smallest number of environment interactions. 𝐚𝑡 …𝐚𝑡+𝐻
𝑖=𝑡
The approaches taken in BES control to increase the sample efficiency
s.t. 𝐬𝑖+1 = 𝑓𝜙 (𝐬𝑖 , 𝐚𝑖 ) . (6)
are illustrated in Fig. 4 and discussed in detail below.
In the optimization problem above, 𝑓𝜙 ∶  ×  →  is a parametrized
5.1.1. MDP modifications model of the environment dynamics. Kurte, Amasyali, Munk, and Zandi
High dimensionality of the action space  has been pointed out compare MPC and DQN for indoor temperature setpoint control (Kurte
as one of the fundamental issues of RL algorithms in BES control in et al., 2021). It is found that the relative energy cost savings are
several publications: Gao et al. emphasize that algorithms such as Q- approximately 60% larger for the MPC controller than for the DQN
learning, DQN and SARSA are unsuitable for the HVAC control problem agent. This is to be expected since the MPC controller has access to
due to their inability to handle continuous action spaces (Gao et al., a detailed model of the building physics. Model-based RL also includes

6
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

other aspects such as learning a policy 𝜋 ∶  →  as opposed to start’’ using transfer learning from previous successful implementa-
directly optimizing over an action sequence. One advantage of doing tions. However, all buildings are unique; both in terms of envelope
so is that model free methods can be used for learning. Chen et al. and design, climate, heat loss status and internal heat gains. What
use MPC and an environment dynamics model to perform planning works in one building might not work in another. Moreover, the state–
of actions for controlling the supply-water temperature from a district action space  ×  may vary between buildings due to the changing
heating connection (Chen et al., 2020). It is implemented in conjunction availability of measured state information and controllable variables.
with a PPO agent to improve a parametrized policy. This leads to large As a result, the possibility of transferring information between buildings
improvements in terms of sample efficiency over the purely model free might diminish because this drastically changes the structure of the
alternative. problem.
Real time learning of a dynamics model is also possible. For in- Identical state–action spaces in similar building types can also be
stance, the system developed by Nagy, Kazmi, Cheaib, and Driesen imagined. Using digital twins, which are defined as digital represen-
simultaneously learns a neural network model of the environment tations of the building installation and energy systems, for simulating
dynamics combined with MPC to perform planning of the supply power operation is an important example of this: Training the RL agent in
to an air-source heat pump (Nagy et al., 2018). They conclude that the simulated environment and then deploying it in the real building
the model-based method is consistently better in terms of both sample can be seen as a special case of transferring information between
efficiency and the resulting average reward. Zhang, Kuppannagari, buildings. In that case, the main challenge is instead the discrepancy
Kannan, and Prasanna confirm the findings by noting that the com- between the simulated and real-life building performances. But even
bination of a neural network dynamics model and MPC converges if the simulated dynamics are validated and consistent, the effects
to its final average reward an order of magnitude faster than PPO of exogenous variables and disturbances can have a profound impact
for temperature setpoint control in a simulated datacenter (Zhang when transferring a controller to a new environment. The RL controller
et al., 2019). Comparable results are obtained by Ding, Du, and Cerpa could as a result be forced into parts of the state–action space unseen
who use model predictive path integral control in combination with during training.
learning a dynamics model using a neural network to control VAV RL in building clusters concerns the performance gained by trans-
system temperature setpoints in a five-zone building (Ding et al., 2020). ferring knowledge from one building to another. Controlling buildings
Their proposed model-based controller also reduces the training time are separate, but related tasks which can be exploited in the training
compared to a PPO agent by an order of magnitude. process. It should be noted that RL in building clusters and in single
buildings are distinctly different. The end goal is the same–to reduce the
5.2. Expert knowledge training time or increase the final average reward of the RL controller–
but the means of achieving it are different. RL in single buildings
Expert knowledge incorporation in the learning process is the sec- concerns the utilization of data and knowledge available within a
ond virtue of an algorithm for RL in single buildings. In some cases, in single building. RL in building clusters, on the other hand, focuses
RL, the training time can be reduced by using information that is avail- on transferring information between buildings, as visualized in Fig. 5.
able a priori. This is particularly useful for developing control system in Here, the focus will be on transfer learning for RL applications, but for
building stock with renovation practices. The prior knowledge can take a more general review on transfer learning for smart buildings, we refer
on many forms in BES control. For instance, Vázquez-Canteli, Ulyanin, the reader to the publication by Pinto, et al. (2022).
Kämpf, and Nagy pre-train the action value function network of a DQN-
controller for the temperature setpoint of an air-to-water heat pump 6.1. Transferring policies
using the fitted Q-iterations algorithm (Vázquez-Canteli et al., 2019).
The algorithm amounts to running the DQN algorithm offline with an Transferring information from one building to another is required to
experience buffer filled with data from an already existing rule-based facilitate RL-based BES control in building clusters. However, the infor-
controller. Approximately 20 days of historical data are used for offline mation term is ambiguous regarding what is in fact being transferred.
training. The pre-training gives the DQN a head start, and it performs If there exists a policy that works well for controlling one building,
on the same level as the rule-based controller at deployment. After transferring it to a new building with a different state–action space
deployment, learning continues online, and the DQN agent shortly out-  ×  requires a corresponding change to the policy and value function
performs the rule-based controller in terms of electricity cost. Another approximations. It is possible that some learned features about the
approach to pre-training is investigated by Chen et al. A parametrized states can be transferred among buildings and the approximation be
policy is pre-trained using SL on data from an existing district heating fine-tuned for a new task with a new action space. However, to the
supply-water temperature controller (Chen et al., 2020). It is then authors’ best knowledge, this remains unexplored in the context of BES
deployed and refined using PPO. control.
Providing the agent with more granular information about the Assuming a fixed state–action space is common, but changing dy-
environment is another way of incorporating expert knowledge in an namics and exogenous variables are still novel challenges. Wei et al.
RL algorithm. Du, Li, et al. use multitask learning to jointly learn the note that despite two buildings being identical, surrounding climate
separate tasks of heating and cooling a building by controlling the characteristics such as weather patterns have an impact on the RL
indoor temperature setpoint (Du, Li, et al., 2021). A binary task ID is controller’s learning process when applied to VAV-system control (Wei
fed to the policy network of a DDPG agent to signify whether heating et al., 2017). This suggests that even if an RL agent is trained success-
or cooling is to be performed. The resulting learning process is twice as fully for one building, it cannot be seamlessly transferred to a similar
fast as the single task implementation. Moreover, restricting the action building in another climate without further fine-tuning. This raises the
space  using the ideas discussed earlier is another way of introducing important question of whether performance can be improved by trans-
information about the environment. If certain actions are known a ferring policies between buildings without re-training from scratch. Du,
priori to be unreasonable, or even forbidden, they can be eliminated. Zandi, et al. transfer a learned policy to several new simulated build-
ings with varying thermal mass to investigate generalizability and
6. Reinforcement learning in building clusters robustness to changing transition dynamics when controlling indoor
temperature setpoints in a building (Du, Zandi, et al., 2021). The
Reusing information from previous building installations to ensure transferred RL controller outperforms a rule-based controller in terms
efficient training of new RL agents in BES control is highly desirable. of temperature violation-time, with a slight increase in electricity cost
Rather than training an agent from scratch, it can be given a ‘‘warm for all test buildings. On the other hand, compared to a fixed setpoint

7
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Fig. 5. RL in building clusters concerns the transfer of information between buildings to increase control performance. The arrows represent flow of information.

controller, the RL controller reduces the electricity cost by 15% with transition dynamics 𝑇 depend on the joint action of all agents. Hence,
a slight increase in temperature violation-time. The issue is further if a single agent does not have information about what the others
studied by Biemann, Scheller, et al. who investigate the robustness to are doing, the environment will appear non-stationary because, in
varying weather conditions when controlling the setpoint of a cooling general, 𝑇 (𝐬′ |𝐬, 𝐚1 , 𝐚2 … 𝐚𝑁 ) ≠ 𝑇 (𝐬′ |𝐬, 𝐚1 , 𝐚′2 … 𝐚′𝑁 ). Another complication
coil and the fan air flow in a simulated datacenter (Biemann, Scheller, of MARL is the problem of credit assignment. In the case of agents
et al., 2021). By randomizing the weather patterns used in each training receiving a collective reward, it is not clear how this reward should
episode to be one from a few different climates, the resulting agent be divided among the agents. Hence, a single agent can be solely
becomes more robust to such changes. It is also reported that the responsible for a high collective reward, despite the other agents not
electricity consumption is consistently reduced by approximately 15% acting optimally. For these reasons, the naive application of the single-
compared to a rule-based controller when evaluated on weather condi- agent approaches from the previous sections in a multi-agent setting
tions not seen during training. Zhang, et al. perform a limited study of may be problematic.
transferring policies between buildings with varying user preferences There are also practical implications of introducing multi-agent
and appliance parameters (Zhang, et al., 2020). It is found that the control systems. For instance, rather than installing powerful compu-
training time is reduced by transferring policies when the buildings are tational hardware locally in each building, the system designer has
similar. The advantages diminish when the buildings are more distinct. the choice of confining the data processing to a single location as
depicted in Fig. 6. By transmitting data to a central processing server,
6.2. Transferring dynamics models the need for investing in hardware for every building is mitigated.
However, depending on the quantity of data and the rate of trans-
In model-based RL, information about the environment can be mission, communication can introduce a major bottleneck, especially
transferred instead of policies. In a new building, efficient learning of for communication intensive methods in RL. As processing power is
the environment transition dynamics 𝑇 (𝐬𝑡+1 |𝐬𝑡 , 𝐚𝑡 ) is enabled by transfer becoming increasingly affordable and available, using distributed pro-
learning. On a high level, the idea is that a model of 𝑇 in an old building cessing on the edge is a viable alternative to central server processing in
can be transferred to new buildings. This would reduce the training smart buildings. Furthermore, with companies becoming increasingly
time and increase the prediction performance compared to training the restrictive about sharing their collected data, the imposed privacy
model from scratch. Fan, et al. use transfer learning to create predictive restrictions can make transmission of raw data impossible.
models for building energy usage (Fan, et al., 2020). Models are trained
on a set of source buildings and are then transferred and fine-tuned on 7.1. Independent- and joint learning
another set of target buildings. Remarkably large performance improve-
ments of up to 60% in terms of relative reduction of the prediction error It is imaginable that, given the ever-increasing available compu-
are observed. The improvements are most apparent when the available tational power, one could formulate the control of many buildings
datasets at the target buildings are limited in size. Similarly, Qian, in a network as one big, central RL problem. Accordingly, the joint
Gao, Yang, and Yu use transfer learning to bridge the sim-to-real gap action of every building in existence would be controlled using a single
by pre-training an energy prediction model in a simulator and using policy. Under the assumption of not having the communication con-
real building data for fine-tuning (Qian et al., 2020). An increase in straints outlined earlier, this effectively outlines a single-agent problem.
prediction performance of 10% can be observed, and the benefit is the However, as emphasized in the section on RL in single buildings, the
highest when only limited real-world measurement data is available. state–action space would grow exponentially with the number of agents
in the system. In the case of representing a policy or value function
7. Multi-agent aspects using a neural network, the training would require vast amounts of
data to converge. Wei et al. recognizes this problem in the context of
In the context of a local energy community, it is likely that the control of the air flow in a VAV-system in a building with multiple
energy systems in several buildings need to be synchronized for re- thermal zones (Wei et al., 2017). They propose maintaining separate
newable energy sharing purposes, such as geothermal and photovoltaic policies for each thermal zone, and thus implicitly formulating it as a
production. From an RL perspective, controlling multiple buildings multi-agent problem. This transforms the problem into training several
simultaneously translates to a multi-agent control problem. It is a neural networks of manageable sizes, which leads to a larger reduction
system of many agents interacting simultaneously with a common envi- of energy cost and temperature violations compared to the single-
ronment, and with each other. One of the main difficulties concerning agent formulation. The RL control system is evaluated in simulations of
Multi-Agent Reinforcement Learning (MARL) is that the environment three different buildings using weather data from two distinct locations,

8
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Fig. 6. Illustration of a generic multi-agent system with centralized control. The arrows indicate flow of information.

namely Riverside and Los Angeles. Compared to the single-agent for- not become intractable. MARL methods have the desirable property
mulation, the multi-agent system consistently achieves a larger relative of preserving computational tractability, even in systems with many
cost reduction. Compared to a rule-based control of the VAV-system, agents. Yu, et al. conclude that MARL is indeed scalable by evaluating
the multi-agent RL controller gives a reduction of the monetary energy their proposed algorithm in a system of 30 agents (Yu, et al., 2021).
cost of 20%–70%. The wide range of cost reductions are believed to They draw a conclusion which can be summarized as follows. In a
arise due to the difference in weather patterns in the two test locations. system of 30 agents where each agent has access to 10 discrete actions,
It is important to note the distinct difference between agents learn- using a single-agent formulation–amounting to central control of all
ing independently and jointly. It is possible to formulate the multi-agent actions in the entire system–would yield 1030 different actions that
control problem as each agent solving a task independent of one can be applied in the system. Such an enormous number of actions
another, but with a reward function that is dependent on the joint would make the problem impossible to solve in practice. When instead
performance of all agents. Vázquez-Canteli et al. define the reward formulated as a MARL problem, each agent learns a separate policy, so
function of each agent to be the sum of the monetary cost of all
each agent only needs to consider 10 different actions. This makes the
the agents’ individual energy usages (Vázquez-Canteli et al., 2019).
system scalable even as more agents are added.
In contrast to the purely independent learning case, there is now an
Another important aspect of safe and reliable heating and cooling
incentive for agents to avoid using a lot of energy simultaneously, such
systems in local energy communities is the power capacity limit. It
that the agents must learn to coordinate. The algorithm is evaluated
is undesirable for agents to require power simultaneously. Even if
on a task of controlling the temperature setpoints of air-to-water heat
pumps in two large residential building equipped with PV panels. It is RL-control of a building cluster leads to a reduced collective energy sig-
demonstrated that the multi-agent formulation leads to lower energy nature, there may be other undesirable effects related to the power grid
cost compared to rule-based controllers. When photovoltaic arrays are supplying electricity to the BES. An example of such a situation is when
included in only one of the buildings, the multi-agent controller outper- the heat in the building cluster is supplied by heat pumps. Vazquez-
forms independent control in terms of energy cost. This is believed to be Canteli, Henze, and Nagy study several grid impact factors when apply-
due to coordination of the photovoltaic production so that the building ing multi-agent SAC to controlling BES with heat pumps, photovoltaic
without photovoltaic panels may consume more electricity from the arrays, and thermal storage (Vazquez-Canteli et al., 2020). It is found
grid when the other is self-sufficient. Li, Zhang, et al. also study MARL that their MARL algorithm shows dramatically reduced peak electrical
with joint learning for HVAC control, but uses a more sophisticated load compared to rule-based control. The ramping of the power is also
measure of thermal comfort, namely the Predicted Mean Vote (PMV) reduced, suggesting less sharp peaks. Effects on the power grid are
metric (Li, Zhang, et al., 2021). When applied to controlling the indoor particularly important to consider because power capacity shortage is
temperature- and humidity setpoints of a multi-zone laboratory build- an increasing problem in general.
ing, it is concluded that MARL outperforms rule-based control by up to
15% in terms of operational energy usage without sacrificing thermal 8. Future outlook
comfort. Moreover, the MARL formulation outperforms the approach
of using a single agent to control all actions in terms of the PMV. This The research on RL for BES control is undoubtedly still in its early
again suggests that coordination between agents can be beneficial in stages. In this review, we have identified the three main computer
BES control. science related challenges in the area. Each of the three challenges
Finally, the choice of independent or joint learning depends on still has unexplored research directions that require attention. In this
whether the actions of one agent affect the others. If it is believed section, we will highlight such directions that the authors believe are
that 𝑇 (𝐬′ |𝐬, 𝐚1 , 𝐚2 … 𝐚𝑁 ) = 𝑇 (𝐬′ |𝐬, 𝐚1 ), then independent learning can be
promising and meaningful.
performed because there is no need for coordination. However, this
would be oversimplifying, as parts of the heating/cooling system might
8.1. Reinforcement learning in single buldings
be common to many buildings. Thermal storage and PV production are
examples of such common utilities that create the need for coordina-
tion. Moreover, if a power or energy limit is persistent in the system, An algorithm with high sample efficiency is crucial in BES control.
coordination is needed to avoid exceeding this constraint. A promising approach which possesses such a virtue is model-based RL.
When using MPC as a basis for model-based RL, the biggest bottleneck
7.2. Safety and scalability is the requirement for an accurate environment model. However, if
the learning of the model were to happen online, this problem would
Apart from learning efficiency, an especially important aspect of RL be alleviated. Moreover, online learning of a model, integrated with a
algorithms for BES control is their scalability. As the number of build- model-free method for learning a policy in a so-called DYNA scheme,
ings in an energy community increases, the control problem should has been successful in many other applications of RL (Feinberg, et al.,

9
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

2018; Janner, Fu, Zhang, & Levine, 2019; Sutton, 1991). However, it is sample complexity of an RL algorithm under reasonable assumptions.
not clear how detailed the environment dynamics model must be in the Such a bound would offer insight into the number of environment inter-
case of BES control. BES are simple systems in the sense that one might actions needed to guarantee a certain expected reward. Furthermore, it
not need a complicated model to describe their dynamics, but they are is also desirable to upper bound the performance degradation in terms
susceptible to change over time, nonetheless. This would require some of the expected reward of transferring an agent to from one building to
degree of adaptivity in the models used for planning in an RL algorithm. another when the buildings have different transition dynamics. If such
Pre-training of policies by imitating already existing suboptimal a bound were obtained, it would be helpful when studying the transfer
controllers is another approach. There are however issues of such of policies between BES.
strategies that remain untreated in the context of BES. The most promi-
nent drawback of pre-training is that it might affect exploration when 8.5. Explainability
learning the final policy. Therefore, application of the current state of
the art in imitation learning along with systematic empirical studies of There exists an increasing body of literature on attempting to ex-
pre-training would also provide valuable insights. plain or visualize why RL agents take the actions that they do (Gunning,
et al., 2019; Puiutta & Veith, 2020). This interest is not purely aca-
8.2. Reinforcement learning in building clusters demic, but a strong demand from industry is also persistent due to the
ethical–and sometimes legal–obligations to motivate the actions taken
Transferring information between buildings is highly desirable in in autonomous systems. BES are no exception, and studies on which
practice and deserves more attention in the future. However, there are features are influential when controlling them would provide a better
major obstacles like differing state–action spaces between buildings, understanding of the problem. Moreover, explainability could aid in
making such a process difficult. Even under the assumption of a fixed feature selection when designing the state space in RL for BES control,
state–action space, there is need for algorithms that produce generally which could in turn improve the sample efficiency of the algorithms.
capable agents that can be deployed with reasonable results ‘‘out of the
box’’. Fast fine-tuning of the policy would then be required to obtain 8.6. System design
optimal behaviour in as little time as possible. A promising approach
to obtain agents that are robust to changing environment dynamics is Apart from RL-methodology and theory, there is the issue of design-
domain randomization (Tobin, et al., 2017). Such a training framework ing a scalable computation infrastructure to process measurement data
would be useful when making the sim-to-real transfer–i.e., training an in BES. Today, measurements are typically processed locally–e.g., in
agent in a simulation and then deploying it in the real word–due to a microcontroller–on the building premise. An alternative approach is
the mismatch between simulation and reality. Another approach is to to instead transmit the measurements to a datacenter for processing
use meta-RL to train agents with the goal of quickly adapting to new and storage in the cloud. Such data collection opens for the use of
tasks, rather than being good at one task (Finn, Abbeel, & Levine, 2017; online processing using RL for controlling the BES without the need
Nichol, Achiam, & Schulman, 2018). for installing computational hardware in the building itself. On the
In connection to the topic of model-based RL, it would be desirable other hand, this would require a channel of communication between
to use transfer learning when learning a model of the environment dy- the sensors and actuators in the BES, and the cloud provider. Therefore,
namics. Doing so could potentially alleviate some scalability issues of, there is a demand for interoperability between all system components,
e.g., MPC. Comprehensive investigations into the effects of transferring which is a major challenge in practice.
dynamics models as well as entire policies between buildings would be Moreover, when controlling BES, data naturally arrives sequentially
beneficial. The ideas of meta-learning can potentially be useful here as in time as a stream. In the case of controlling a cluster of buildings
well. concurrently, data could arrive asynchronously from any of the build-
ings and would require fast processing. Moreover, since each building
8.3. Multi-agent aspects could have unique control logic, the computation cannot be statically
allocated but must be dynamic. Considering the design of the data
Multi-agent RL is a promising approach to control communities processing pipeline is an important future research direction, as it is
with multiple buildings and allows for coordination to reduce the crucial for making large-scale control possible.
collective energy usage on the community level. The popular approach
of defining a common reward function to incentivize coordination 9. Conclusions
between agents does however have its limits. For instance, defining
the reward as the sum of individual rewards can create unfairness In this review, we have identified the main computer science related
among the agents. Further research into reward constellations that take challenges in BES control using RL. They grouped into three categories:
fairness into account are needed to ensure proper functionality in a RL in single buildings, RL in building clusters, and multi-agent aspects.
multi-agent system. Another topic that requires more attention is that The main conclusions that can be drawn from this review in connection
coordination between agents is not always possible. It can be due to the to the three main categories are the following:
lack of a central server for data processing or privacy constraints which
makes the transmission of raw data impossible. Hence, there should be • There exist empirical studies of model-based RL methods which
investigation into the effects on the collective performance of having show promising results in terms of sample efficiency. However,
completely decentralized control. Such a scenario would be especially the requirement for historical data to model the building and the
important to consider in the context of power grid effects such as peak energy systems system remains problematic.
loads. • Transferring policies between buildings with similar state-
transition dynamics has empirically shown indications to be
8.4. Theoretical guarantees highly effective in BES control.
• Multi-agent RL is a promising framework for controlling building
The past research on RL for BES control is to a considerable extent clusters when shared utilities such as thermal storage and on-
empirical. But to gain further intuition about the challenges of sample site electricity production are present. However, the challenges
efficiency and performance in terms of the final reward, those problems of sample efficiency persist.
should be studied theoretically. One of the important theoretical ques- • There is a shortage of theoretical analysis of the problems encoun-
tions that needs answering is whether a bound can be placed on the tered in RL for BES control.

10
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Declaration of competing interest Gao, G., Li, J., & Wen, Y. (2020). DeepComfort: energy-efficient thermal comfort control
in buildings via reinforcement learning. IEEE Internet of Things Journal, 7(9), 8472–
8484. http://dx.doi.org/10.1109/JIOT.2020.2992117, URL: https://ieeexplore.ieee.
The authors declare that they have no known competing finan- org/document/9085925/.
cial interests or personal relationships that could have appeared to Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., & Yang, G.-Z. (2019). XAI—
influence the work reported in this paper. Explainable artificial intelligence. Science Robotics, 4(37), eaay7120. http://dx.
doi.org/10.1126/scirobotics.aay7120, URL: https://www.science.org/doi/10.1126/
scirobotics.aay7120.
Data availability Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: off-policy
maximum entropy deep reinforcement learning with a stochastic actor. URL: http:
No data was used for the research described in the article. //arxiv.org/abs/1801.01290. [Cs, Stat] arXiv:1801.01290.
Han, M., May, R., Zhang, X., Wang, X., Pan, S., Yan, D., et al. (2019). A re-
view of reinforcement learning methodologies for controlling occupant comfort
References in buildings. Sustainable Cities and Society, 51, Article 101748. http://dx.doi.
org/10.1016/j.scs.2019.101748, URL: https://linkinghub.elsevier.com/retrieve/pii/
Afram, A., & Janabi-Sharifi, F. (2014). Theory and applications of HVAC control systems S2210670719307589.
– A review of model predictive control (MPC). Building and Environment, 72, 343– Hong, T., Wang, Z., Luo, X., & Zhang, W. (2020). State-of-the-art on research and
355. http://dx.doi.org/10.1016/j.buildenv.2013.11.016, URL: https://linkinghub. applications of machine learning in the building life cycle. Energy and Buildings,
elsevier.com/retrieve/pii/S0360132313003363. 212, Article 109831. http://dx.doi.org/10.1016/j.enbuild.2020.109831, URL: https:
Afram, A., Janabi-Sharifi, F., Fung, A. S., & Raahemifar, K. (2017). Artificial neural //linkinghub.elsevier.com/retrieve/pii/S0378778819337879.
network (ANN) based model predictive control (MPC) and optimization of HVAC International Energy Agency (2022). Buildings. IEA, URL: https://www.iea.org/reports/
systems: A state of the art review and case study of a residential HVAC system. buildings.
Energy and Buildings, 141, 96–113. http://dx.doi.org/10.1016/j.enbuild.2017.02. Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to trust your model: model-
012, URL: https://linkinghub.elsevier.com/retrieve/pii/S0378778816310799. based policy optimization. In Proceedings of the 33rd international conference on
neural information processing systems (1122), (pp. 12519–12530). Red Hook, NY,
Biemann, M., Liu, X., Zeng, Y., & Huang, L. (2021). Addressing partial observability
USA: Curran Associates Inc.
in reinforcement learning for energy management. In Proceedings of the 8th
ACM international conference on systems for energy-efficient buildings, cities, and Jiang, Z., Risbeck, M. J., Ramamurti, V., Murugesan, S., Amores, J., Zhang, C., et
transportation (pp. 324–328). Coimbra Portugal: ACM, http://dx.doi.org/10.1145/ al. (2021). Building HVAC control with reinforcement learning for reduction of
3486611.3488730, URL: https://dl.acm.org/doi/10.1145/3486611.3488730. energy cost and demand charge. Energy and Buildings, 239, Article 110833. http://
dx.doi.org/10.1016/j.enbuild.2021.110833, URL: https://linkinghub.elsevier.com/
Biemann, M., Scheller, F., Liu, X., & Huang, L. (2021). Experimental evaluation of
retrieve/pii/S0378778821001171.
model-free reinforcement learning algorithms for continuous HVAC control. Applied
Jin, X., Baker, K., Christensen, D., & Isley, S. (2017). Foresee: A user-centric home
Energy, 298, Article 117164. http://dx.doi.org/10.1016/j.apenergy.2021.117164,
energy management system for energy efficiency and demand response. Applied
URL: https://linkinghub.elsevier.com/retrieve/pii/S0306261921005961.
Energy, 205, 1583–1595. http://dx.doi.org/10.1016/j.apenergy.2017.08.166, URL:
Chen, B., Cai, Z., & Bergés, M. (2020). Gnu-RL: A practical and scalable reinforce-
https://linkinghub.elsevier.com/retrieve/pii/S0306261917311856.
ment learning solution for building HVAC control using a differentiable MPC
Kurte, K., Amasyali, K., Munk, J., & Zandi, H. (2021). Comparative analysis of model-
policy. Frontiers in Built Environment, 6, Article 562239. http://dx.doi.org/10.3389/
free and model-based HVAC control for residential demand response. In Proceedings
fbuil.2020.562239, URL: https://www.frontiersin.org/articles/10.3389/fbuil.2020.
of the 8th ACM international conference on systems for energy-efficient buildings, cities,
562239/full.
and transportation (pp. 309–313). Coimbra Portugal: ACM, http://dx.doi.org/10.
Chen, Y., Norford, L. K., Samuelson, H. W., & Malkawi, A. (2018). Optimal control of
1145/3486611.3488727, URL: https://dl.acm.org/doi/10.1145/3486611.3488727.
HVAC and window systems for natural ventilation through reinforcement learning.
Lee, Z. E., & Zhang, K. M. (2021). Scalable identification and control of res-
Energy and Buildings, 169, 195–205. http://dx.doi.org/10.1016/j.enbuild.2018.03.
idential heat pumps: A minimal hardware approach. Applied Energy, 286,
051, URL: https://linkinghub.elsevier.com/retrieve/pii/S0378778818302184.
Article 116544. http://dx.doi.org/10.1016/j.apenergy.2021.116544, URL: https://
Ding, X., Du, W., & Cerpa, A. E. (2020). MB2C: Model-based deep reinforcement
linkinghub.elsevier.com/retrieve/pii/S0306261921000945.
learning for multi-zone building control. In Proceedings of the 7th ACM international
Levermore, G. J. (2000). Building energy management systems: applications to low energy
conference on systems for energy-efficient buildings, cities, and transportation (pp. 50–
HVAC and natural ventilation control (2nd ed.). London ; New York: E & FN Spon.
59). Virtual Event Japan: ACM, http://dx.doi.org/10.1145/3408308.3427986, URL:
Li, Y., O’Neill, Z., Zhang, L., Chen, J., Im, P., & DeGraw, J. (2021). Grey-
https://dl.acm.org/doi/10.1145/3408308.3427986.
box modeling and application for building energy simulations - A critical
Drgoňa, J., Arroyo, J., Cupeiro Figueroa, I., Blum, D., Arendt, K., Kim, D., et al. (2020).
review. Renewable and Sustainable Energy Reviews, 146, Article 111174. http:
All you need to know about model predictive control for buildings. Annual Reviews
//dx.doi.org/10.1016/j.rser.2021.111174, URL: https://linkinghub.elsevier.com/
in Control, 50, 190–232. http://dx.doi.org/10.1016/j.arcontrol.2020.09.001, URL:
retrieve/pii/S1364032121004639.
https://linkinghub.elsevier.com/retrieve/pii/S1367578820300584.
Li, J., Zhang, W., Gao, G., Wen, Y., Jin, G., & Christopoulos, G. (2021). Toward
Du, Y., Li, F., Munk, J., Kurte, K., Kotevska, O., Amasyali, K., et al. (2021). Multi- intelligent multizone thermal control with multiagent deep reinforcement learning.
task deep reinforcement learning for intelligent multi-zone residential HVAC IEEE Internet of Things Journal, 8(14), 11150–11162. http://dx.doi.org/10.1109/
control. Electric Power Systems Research, 192, Article 106959. http://dx.doi. JIOT.2021.3051400, URL: https://ieeexplore.ieee.org/document/9321466/.
org/10.1016/j.epsr.2020.106959, URL: https://linkinghub.elsevier.com/retrieve/ Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2019).
pii/S0378779620307574. Continuous control with deep reinforcement learning. [Cs, Stat] arXiv:1509.02971.
Du, Y., Zandi, H., Kotevska, O., Kurte, K., Munk, J., Amasyali, K., et al. (2021). URL: http://arxiv.org/abs/1509.02971.
Intelligent multi-zone residential HVAC control strategy based on deep rein- Manjarres, D., Mera, A., Perea, E., Lejarazu, A., & Gil-Lopez, S. (2017). An energy-
forcement learning. Applied Energy, 281, Article 116117. http://dx.doi.org/10. efficient predictive control for HVAC systems applied to tertiary buildings based
1016/j.apenergy.2020.116117, URL: https://linkinghub.elsevier.com/retrieve/pii/ on regression techniques. Energy and Buildings, 152, 409–417. http://dx.doi.org/
S030626192031535X. 10.1016/j.enbuild.2017.07.056, URL: https://linkinghub.elsevier.com/retrieve/pii/
European Commission (2021). 2030 Digital compass: the European way for the digital S037877881731321X.
decade. European Commission. URL: https://eur-lex.europa.eu/legal-content/en/ Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al.
TXT/?uri=CELEX:52021DC0118. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602.
Fan, C., Sun, Y., Xiao, F., Ma, J., Lee, D., Wang, J., et al. (2020). Statistical Nagy, A., Kazmi, H., Cheaib, F., & Driesen, J. (2018). Deep reinforcement learning for
investigations of transfer learning-based methodology for short-term building optimal control of space heating. arXiv:1805.03777.
energy predictions. Applied Energy, 262, Article 114499. http://dx.doi.org/10. Nichol, A., Achiam, J., & Schulman, J. (2018). On first-order meta-learning algorithms.
1016/j.apenergy.2020.114499, URL: https://linkinghub.elsevier.com/retrieve/pii/ [Cs] arXiv:1803.02999. URL: http://arxiv.org/abs/1803.02999.
S0306261920300118. Nweye, K., Liu, B., Stone, P., & Nagy, Z. (2022). Real-world challenges for multi-
Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., & Levine, S. (2018). agent reinforcement learning in grid-interactive buildings. http://dx.doi.org/10.
Model-based value estimation for efficient model-free reinforcement learning. [Cs, 1016/j.egyai.2022.100202. URL: http://arxiv.org/abs/2112.06127. [cs, eess] arXiv:
Stat] arXiv:1803.00101. URL: http://arxiv.org/abs/1803.00101. 2112.06127.
Finn, C., Abbeel, P., & Levine, S. (2017). Mode-agnostic meta-learning for fast Pinto, G., Wang, Z., Roy, A., Hong, T., & Capozzoli, A. (2022). Transfer learning
adaptation of deep networks. [Cs] arXiv:1703.03400. URL: http://arxiv.org/abs/ for smart buildings: A critical review of algorithms, applications, and future
1703.03400. perspectives. Advances in Applied Energy, 5, Article 100084. http://dx.doi.org/
Frederiksen, S., & Werner, S. (2013). District heating and cooling. Lund: Studentlitteratur 10.1016/j.adapen.2022.100084, URL: https://linkinghub.elsevier.com/retrieve/pii/
AB. S2666792422000026.

11
D. Weinberg et al. Sustainable Cities and Society 89 (2023) 104351

Puiutta, E., & Veith, E. M. S. P. (2020). Explainable reinforcement learning: A Vazquez-Canteli, J. R., Henze, G., & Nagy, Z. (2020). MARLISA: Multi-agent rein-
survey. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine forcement learning with iterative sequential action selection for load shaping of
learning and knowledge extraction. Vol. 12279 (pp. 77–95). Cham: Springer Inter- grid-interactive connected buildings. In Proceedings of the 7th ACM international
national Publishing, http://dx.doi.org/10.1007/978-3-030-57321-8_5, URL: http: conference on systems for energy-efficient buildings, cities, and transportation (pp. 170–
//link.springer.com/10.1007/978-3-030-57321-8_5. 179). Virtual Event Japan: ACM, http://dx.doi.org/10.1145/3408308.3427604,
Qian, F., Gao, W., Yang, Y., & Yu, D. (2020). Potential analysis of the transfer learning URL: https://dl.acm.org/doi/10.1145/3408308.3427604.
model in short and medium-term forecasting of building HVAC energy consump- Vázquez-Canteli, J. R., & Nagy, Z. (2019). Reinforcement learning for demand response:
tion. Energy, 193, Article 116724. http://dx.doi.org/10.1016/j.energy.2019.116724, A review of algorithms and modeling techniques. Applied Energy, 235, 1072–
URL: https://linkinghub.elsevier.com/retrieve/pii/S0360544219324193. 1089. http://dx.doi.org/10.1016/j.apenergy.2018.11.002, URL: https://linkinghub.
Royapoor, M., Antony, A., & Roskilly, T. (2018). A review of building climate and plant elsevier.com/retrieve/pii/S0306261918317082.
controls, and a survey of industry perspectives. Energy and Buildings, 158, 453–465. Vázquez-Canteli, J. R., Ulyanin, S., Kämpf, J., & Nagy, Z. (2019). Fusing TensorFlow
http://dx.doi.org/10.1016/j.enbuild.2017.10.022, URL: https://linkinghub.elsevier. with building energy simulation for intelligent energy management in smart cities.
com/retrieve/pii/S0378778817318522. Sustainable Cities and Society, 45, 243–257. http://dx.doi.org/10.1016/j.scs.2018.
Salsbury, T. I. (2005). A survey of control technologies in the building automa- 11.021, URL: https://linkinghub.elsevier.com/retrieve/pii/S2210670718314380.
tion industry. IFAC Proceedings Volumes, 38(1), 90–100. http://dx.doi.org/10. Wang, Z., & Hong, T. (2020). Reinforcement learning for building controls: The oppor-
3182/20050703-6-CZ-1902.01397, URL: https://linkinghub.elsevier.com/retrieve/ tunities and challenges. Applied Energy, 269, Article 115036. http://dx.doi.org/10.
pii/S1474667016374092. 1016/j.apenergy.2020.115036, URL: https://linkinghub.elsevier.com/retrieve/pii/
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2017). Trust region S0306261920305481.
policy optimization. URL: http://arxiv.org/abs/1502.05477. [Cs] arXiv:1502.05477. Wei, T., Wang, Y., & Zhu, Q. (2017). Deep reinforcement learning for building HVAC
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal control. In Proceedings of the 54th annual design automation conference 2017 (pp.
policy optimization algorithms. URL: http://arxiv.org/abs/1707.06347. [Cs] arXiv: 1–6). Austin TX USA: ACM, http://dx.doi.org/10.1145/3061639.3062224, URL:
1707.06347. https://dl.acm.org/doi/10.1145/3061639.3062224.
Sofos, M., Langevin, J., Deru, M., Gupta, E., Benne, K., Blum, D., et al. (2020). Innova- Yu, Y. (2018). Towards sample efficient reinforcement learning. In Proceedings of
tions in sensors and controls for building energy management: research and development the twenty-seventh international joint conference on artificial intelligence (pp. 5739–
opportunities report for emerging technologies: Technical Report NREL/TP–5500-75601, 5743). Stockholm, Sweden: International Joint Conferences on Artificial Intelligence
DOE/GO–102019-5234, 1601591, http://dx.doi.org/10.2172/1601591, URL: https: Organization, http://dx.doi.org/10.24963/ijcai.2018/820, URL: https://www.ijcai.
//www.osti.gov/servlets/purl/1601591/. org/proceedings/2018/820.
Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and Yu, L., Sun, Y., Xu, Z., Shen, C., Yue, D., Jiang, T., et al. (2021). Multi-agent
reacting. ACM SIGART Bulletin, 2(4), 160–163. http://dx.doi.org/10.1145/122344. deep reinforcement learning for HVAC control in commercial buildings. IEEE
122377, URL: https://dl.acm.org/doi/10.1145/122344.122377. Transactions on Smart Grid, 12(1), 407–419. http://dx.doi.org/10.1109/TSG.2020.
Sutton, R. S., & Barto, A. G. (2018). Adaptive computation and machine learning series, 3011739, URL: https://ieeexplore.ieee.org/document/9146920/.
Reinforcement learning: an introduction (2nd ed.). Cambridge, Massachusetts: The Zhang, X., Jin, X., Tripp, C., Biagioni, D. J., Graf, P., & Jiang, H. (2020). Transferable
MIT Press. reinforcement learning for smart homes. In Proceedings of the 1st international
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain workshop on reinforcement learning for energy management in buildings & cities (pp.
randomization for transferring deep neural networks from simulation to the real 43–47). Virtual Event Japan: ACM, http://dx.doi.org/10.1145/3427773.3427865,
world. In 2017 IEEE/RSJ international conference on intelligent robots and systems URL: https://dl.acm.org/doi/10.1145/3427773.3427865.
(pp. 23–30). Vancouver, BC: IEEE, http://dx.doi.org/10.1109/IROS.2017.8202133, Zhang, C., Kuppannagari, S. R., Kannan, R., & Prasanna, V. K. (2019). Building
URL: http://ieeexplore.ieee.org/document/8202133/. HVAC scheduling using reinforcement learning via neural network based model
Tymkow, P., Tassou, S., Kolokotroni, M., & Jouhara, H. (2020). Building services design approximation. In Proceedings of the 6th ACM international conference on systems
for energy efficient buildings (2nd ed.). New York: Routledge. for energy-efficient buildings, cities, and transportation (pp. 287–296). New York NY
USA: ACM, http://dx.doi.org/10.1145/3360322.3360861, URL: https://dl.acm.org/
doi/10.1145/3360322.3360861.

Energy Plus With RL
No ratings yet
Energy Plus With RL
112 pages
VTT Book Advanced Energy Systems in ZeroPositive Energy Buildings Communities and Districts
No ratings yet
VTT Book Advanced Energy Systems in ZeroPositive Energy Buildings Communities and Districts
313 pages
Modeling and Forecasting Building Energy Consumption - A Review of Data-Driven Techniques
No ratings yet
Modeling and Forecasting Building Energy Consumption - A Review of Data-Driven Techniques
27 pages
First Oral Presentation
No ratings yet
First Oral Presentation
60 pages
(Step-Up) Samir S. Shah, Brian Alverson, Jeanine Ronan - Step-Up To Pediatrics-LWW (2013)
100% (5)
(Step-Up) Samir S. Shah, Brian Alverson, Jeanine Ronan - Step-Up To Pediatrics-LWW (2013)
607 pages
Journal Pre-Proof: Journal of Building Engineering
No ratings yet
Journal Pre-Proof: Journal of Building Engineering
43 pages
Review On Advanced Storage Control Applied To Optimized Operation of Energy Systems For Buildings and Districts: Insights and Perspectives
No ratings yet
Review On Advanced Storage Control Applied To Optimized Operation of Energy Systems For Buildings and Districts: Insights and Perspectives
26 pages
A Low Cost IoT Based Buildings Management
No ratings yet
A Low Cost IoT Based Buildings Management
18 pages
Brandi 2020 146
No ratings yet
Brandi 2020 146
53 pages
Systematic Review of Deep Learning and Machine Learning For Building Energy
No ratings yet
Systematic Review of Deep Learning and Machine Learning For Building Energy
48 pages
Applications of Reinforcement Learning in Energy Systems
No ratings yet
Applications of Reinforcement Learning in Energy Systems
23 pages
RockettHathway BR - I Final
No ratings yet
RockettHathway BR - I Final
23 pages
RL For HVAC Control-A Technical and Conceptual Review-Ref 65
No ratings yet
RL For HVAC Control-A Technical and Conceptual Review-Ref 65
18 pages
Reinforcement Learning For HVAC Control in Intelligent Buildings - A Technical and Conceptual Review - 1-s2.0-S235271022401653X-Main
No ratings yet
Reinforcement Learning For HVAC Control in Intelligent Buildings - A Technical and Conceptual Review - 1-s2.0-S235271022401653X-Main
25 pages
1 s2.0 S0306261921015932 Main
No ratings yet
1 s2.0 S0306261921015932 Main
16 pages
Multi Agent Deep Reinforcement Learning Optimization Framework - 2022 - Applied
No ratings yet
Multi Agent Deep Reinforcement Learning Optimization Framework - 2022 - Applied
17 pages
Data-Driven Prediction and Optimization Toward Net-Zero and Positive-Energy Buildings - A Systematic Review
No ratings yet
Data-Driven Prediction and Optimization Toward Net-Zero and Positive-Energy Buildings - A Systematic Review
19 pages
Intelligent Energy Management Systems: A Review: Stavros Mischos Eleanna Dalagdi Dimitrios Vrakas
No ratings yet
Intelligent Energy Management Systems: A Review: Stavros Mischos Eleanna Dalagdi Dimitrios Vrakas
40 pages
Journal of Building Engineering: Srinivas Yelisetti, Vikash Kumar Saini, Rajesh Kumar, Ravita Lamba, Akash Saxena
No ratings yet
Journal of Building Engineering: Srinivas Yelisetti, Vikash Kumar Saini, Rajesh Kumar, Ravita Lamba, Akash Saxena
41 pages
Anomoly Detcetion
No ratings yet
Anomoly Detcetion
41 pages
1 s2.0 S0360544224024101 Main
No ratings yet
1 s2.0 S0360544224024101 Main
18 pages
Aplication in Citylearn Gym
No ratings yet
Aplication in Citylearn Gym
12 pages
1 s2.0 S0306261920314306 Main
No ratings yet
1 s2.0 S0306261920314306 Main
17 pages
Building Energy Management With Reinforcement Learning and Model Predictive Control A Survey
No ratings yet
Building Energy Management With Reinforcement Learning and Model Predictive Control A Survey
10 pages
1 s2.0 S221067071730135X Main
No ratings yet
1 s2.0 S221067071730135X Main
14 pages
Physically Consistent Neural Networks For Building
No ratings yet
Physically Consistent Neural Networks For Building
20 pages
Forecasting of Residential Unit's Heat Demands: A Comparison of Machine Learning Techniques in A Real World Case Study
No ratings yet
Forecasting of Residential Unit's Heat Demands: A Comparison of Machine Learning Techniques in A Real World Case Study
35 pages
1 s2.0 S0378778823001081 Main
No ratings yet
1 s2.0 S0378778823001081 Main
11 pages
4.ED-DQN An Event-Driven Deep Reinforcement Learning Control Method For Multi-Zone Residential Buildings
No ratings yet
4.ED-DQN An Event-Driven Deep Reinforcement Learning Control Method For Multi-Zone Residential Buildings
17 pages
J Cleaner Production 2023 v405 136942
No ratings yet
J Cleaner Production 2023 v405 136942
18 pages
1 s2.0 S2666792423000148 Main
No ratings yet
1 s2.0 S2666792423000148 Main
24 pages
1 s2.0 S2666546820300434 Main
No ratings yet
1 s2.0 S2666546820300434 Main
9 pages
A Comparison of ML Algorithms
No ratings yet
A Comparison of ML Algorithms
18 pages
Pruvost Et Al Ontology Based Expert System For Automated Monitoring of Building Energy Systems
No ratings yet
Pruvost Et Al Ontology Based Expert System For Automated Monitoring of Building Energy Systems
11 pages
Lionel Corbett - The Religious Function of The Psyche (1996)
100% (1)
Lionel Corbett - The Religious Function of The Psyche (1996)
273 pages
Study On Deep Reinforcement Learning Techniques For Building Energy
No ratings yet
Study On Deep Reinforcement Learning Techniques For Building Energy
14 pages
A Survey of Reinforcement Learning For Optimization in Automation
No ratings yet
A Survey of Reinforcement Learning For Optimization in Automation
8 pages
Usage of GAMS-Based Digital Twins and Clustering To Improve Energetic Systems Control
No ratings yet
Usage of GAMS-Based Digital Twins and Clustering To Improve Energetic Systems Control
17 pages
1 s2.0 S0360544223012331 Main
No ratings yet
1 s2.0 S0360544223012331 Main
9 pages
Officelearn: An Openai Gym Environment For Reinforcement Learning On Occupant-Level Building'S Energy Demand Response
No ratings yet
Officelearn: An Openai Gym Environment For Reinforcement Learning On Occupant-Level Building'S Energy Demand Response
6 pages
On-Line Building Energy Optimization Using Deep Reinforcement Learning
No ratings yet
On-Line Building Energy Optimization Using Deep Reinforcement Learning
11 pages
B 275
No ratings yet
B 275
14 pages
Reinforcement Learning For Building Management Systems
No ratings yet
Reinforcement Learning For Building Management Systems
9 pages
Environmental and Economic Benefits of Building Retrofit Measures For The Residential Sector by Utilizing Sensor Data and Advanced Calibrated Models
No ratings yet
Environmental and Economic Benefits of Building Retrofit Measures For The Residential Sector by Utilizing Sensor Data and Advanced Calibrated Models
30 pages
Challenges and Opportunities in European Smart Bu - 2024 - Renewable and Sustain
No ratings yet
Challenges and Opportunities in European Smart Bu - 2024 - Renewable and Sustain
24 pages
SSRN Id4232295
No ratings yet
SSRN Id4232295
30 pages
Exploring Deep Reinforcement Learning For Holistic Smart Building Control
No ratings yet
Exploring Deep Reinforcement Learning For Holistic Smart Building Control
14 pages
Reinforcement Learning-Based BEMS Architecture For Energy Usage Optimization
No ratings yet
Reinforcement Learning-Based BEMS Architecture For Energy Usage Optimization
33 pages
Gokhale 2024 Safe
No ratings yet
Gokhale 2024 Safe
5 pages
Deep Learning in Energy Modeling Application in Smart Buildings With Distributed Energy Generation
No ratings yet
Deep Learning in Energy Modeling Application in Smart Buildings With Distributed Energy Generation
23 pages
Electronics 13 01459
No ratings yet
Electronics 13 01459
17 pages
Building Simulation: Ten Challenges: Tianzhen Hong ( ), Jared Langevin, Kaiyu Sun
No ratings yet
Building Simulation: Ten Challenges: Tianzhen Hong ( ), Jared Langevin, Kaiyu Sun
28 pages
Document From ?
No ratings yet
Document From ?
25 pages
Gokhale 2024 Eeposter
No ratings yet
Gokhale 2024 Eeposter
2 pages
Smart Management Energy Systems in Indus
No ratings yet
Smart Management Energy Systems in Indus
3 pages
Building Energy Prediction
No ratings yet
Building Energy Prediction
14 pages
Demand Response Whit Functional Building
No ratings yet
Demand Response Whit Functional Building
6 pages
Towards Optimal District Heating Temperature Contr-5
No ratings yet
Towards Optimal District Heating Temperature Contr-5
1 page
Unit1 (Integrationcurriculum) - Millie Tapia
50% (4)
Unit1 (Integrationcurriculum) - Millie Tapia
16 pages
Energies 17 04277
No ratings yet
Energies 17 04277
35 pages
Guide To Coursera For Business 2019
100% (1)
Guide To Coursera For Business 2019
31 pages
Roberto Larghetti
No ratings yet
Roberto Larghetti
14 pages
BSBCRT511 Critical Thinking Skills
100% (7)
BSBCRT511 Critical Thinking Skills
18 pages
Deep Reinforcement Learning For HVAC Control in Smart Buildings
No ratings yet
Deep Reinforcement Learning For HVAC Control in Smart Buildings
6 pages
Achieving Goals
No ratings yet
Achieving Goals
19 pages
DLP in EDUC 105 - GROUP2
No ratings yet
DLP in EDUC 105 - GROUP2
5 pages
4'as LESSON PLAN
0% (1)
4'as LESSON PLAN
3 pages
Training CalendaR 2020@EDI PDF
No ratings yet
Training CalendaR 2020@EDI PDF
28 pages
Self Esteem
0% (1)
Self Esteem
3 pages
Introduction On Microsoft Project
100% (1)
Introduction On Microsoft Project
20 pages
Conditional Sentence0
No ratings yet
Conditional Sentence0
4 pages
Panini 90%
No ratings yet
Panini 90%
2 pages
Planner 20250528080818 Class 5 Holiday HW 25-26
No ratings yet
Planner 20250528080818 Class 5 Holiday HW 25-26
6 pages
Learning & Development 2
No ratings yet
Learning & Development 2
10 pages
EMB 602 Human Resource Management 1 PDF
No ratings yet
EMB 602 Human Resource Management 1 PDF
5 pages
DARWIN Form 2
No ratings yet
DARWIN Form 2
148 pages
Instrument: Fagerstrom Test For Nicotine Dependence (FTND) : Description
No ratings yet
Instrument: Fagerstrom Test For Nicotine Dependence (FTND) : Description
3 pages
Lecture-5 Applications of Determinants
No ratings yet
Lecture-5 Applications of Determinants
16 pages
Ayushi Singh: Objective Interest
No ratings yet
Ayushi Singh: Objective Interest
1 page
Review On Remote Sensing Methods For Landslide Detection Using Machine and Deep Learning
No ratings yet
Review On Remote Sensing Methods For Landslide Detection Using Machine and Deep Learning
24 pages
Gmail - Your ADHD Test Results Are Inside
No ratings yet
Gmail - Your ADHD Test Results Are Inside
2 pages
Intergrated B.Ed-M.Ed: Cluster University of Jammu
No ratings yet
Intergrated B.Ed-M.Ed: Cluster University of Jammu
18 pages
Tuition Fees For 2019/20: A: Senior School Validated Programmes U
No ratings yet
Tuition Fees For 2019/20: A: Senior School Validated Programmes U
5 pages
Advertisement&applicatiosjmmsse2024 25
No ratings yet
Advertisement&applicatiosjmmsse2024 25
3 pages
Checklist For Enrollment of Providers
No ratings yet
Checklist For Enrollment of Providers
7 pages
Effectiveness of Finger Held Relaxation On The Decrease in Intensity of Pain in Patient of Post-Sectio Caesarea in RSUD Sorong Regency
No ratings yet
Effectiveness of Finger Held Relaxation On The Decrease in Intensity of Pain in Patient of Post-Sectio Caesarea in RSUD Sorong Regency
4 pages
Qualitative Data Worksheet: Historical Design Teacher's Feedback Working Title
No ratings yet
Qualitative Data Worksheet: Historical Design Teacher's Feedback Working Title
5 pages
Pre-Test: Submitted By: Clifford Llupar Grade 12 Stem
No ratings yet
Pre-Test: Submitted By: Clifford Llupar Grade 12 Stem
3 pages
Vocabulary: Country Nationality Language
No ratings yet
Vocabulary: Country Nationality Language
2 pages

5.a Review of Reinforcement Learning For Controlling Building Energy Systems From A Computer Science Perspective

Uploaded by

5.a Review of Reinforcement Learning For Controlling Building Energy Systems From A Computer Science Perspective

Uploaded by

Sustainable Cities and Society 89 (2023) 104351

Contents lists available at ScienceDirect

Sustainable Cities and Society

A Review of Reinforcement Learning for Controlling Building Energy

ARTICLE INFO ABSTRACT

1. Introduction sustainability goals (Vázquez-Canteli & Nagy, 2019). These control

to the actuators at the BES-components. Even in buildings where BMS Table 2

where 𝑟 ∶  ×  → R is a so-called reward function (Sutton &

2020). Discretization of the action space increases its cardinality expo-

5.1.2. Off-policy algorithms

You might also like