OfficeLearn: An OpenAI Gym Environment for Reinforcement Learning on Occupant-Level Building's Energy Demand Response

Lucas Spangher 1 Akash Gokul 1 Joseph Palakapilly 1 Utkarsha Agwan 1 Manan Khattar 1 Wann-Jiun Ma 2
Costas Spanos 1

Abstract

Energy Demand Response (DR) will play a crucial role in balancing renewable energy generation with demand as grids decarbonize. There is growing interest in developing Reinforcement Learning (RL) techniques to optimize DR pricing, as pricing set by electric utilities often cannot take behavioral irrationality into account. However, so far, attempts to standardize RL efforts in this area do not exist. In this paper, we present a first-of-its-kind OpenAI Gym environment for testing DR with occupant-level building dynamics. We demonstrate the flexibility with which a researcher can customize our simulated office environment through the explicit input parameters we have provided. We hope that this work enables future work in DR in buildings.

1. Introduction

Efforts to address climate change are impossible without a quick and safe transition to renewable energy. Barring strategies to address the "volatility" of generation in renewable energy sources, grids with increasing shares of renewable energy will face daunting consequences. These range from wasting of energy through curtailment (i.e., the shutting off of green energy sources when generation exceeds supply) (Spangher et al., 2020) to voltage instability and damage to the physical infrastructure of the grid. Indeed, the California grid of 2019 needed to curtail roughly 3% of its energy, with some days seeing up to 33% of solar energy curtailed.

One solution to curtailment commonly touted is energy Demand Response (DR), which entails deferring energy demand from when it arises to when it is most opportune to serve. DR is essentially costless, as it requires no new infrastructure, so it is important as a direct solution.

One primary area of application for DR is in buildings. Buildings make up a significant and increasing component of US energy demand. In residential and commercial buildings, plug loads represent 30% of total electricity use ((Lanzisera et al., 2013), (Srinivasan et al., 2011)). In addition, the quantity of energy used by plugs is increasing more quickly than any other load type in both residential and commercial buildings (Comstock & Jarzomski, 2012).

Machine Learning (ML), while transformative in many sectors of the economy, is somewhat underdeveloped when it comes to energy applications. The AI in Climate Change community exists to bridge this gap, encouraging collaborative research that develops and applies a broad array of techniques to an equally broad array of applications. To encourage exploration in occupant-level building DR, we propose to formalize an OpenAI Gym environment for the testing of Reinforcement Learning (RL) agents within a single office building.

2. Related Works

Deep RL is a subfield of ML that trains an agent to choose actions that maximize its rewards in an environment (Sutton & Barto, 1998). RL has had extensive success in complex control environments like Atari games (Mnih et al., 2013) and in previously unsolved domains like PSPACE-complete Sokoban planning (Feng et al., 2020). It has had limited success in energy: Google implemented RL controls to reduce energy consumption in data centers by 40%. When DeepMind used a form of multi-agent RL to beat the world champion in the complex and strategic game of Go (Borowiec, 2016), researchers called for similar advances in RL for power systems (Li & Du, 2018).

OpenAI Gym environments are a series of standardized environments that provide a platform for benchmarking the progress of learning agents (Brockman et al., 2016). Gym environments allow researchers to create custom environments under a general and widely accepted API format that immediately allows deployment of a suite of out-of-the-box RL techniques; as a result, Gym environments tend to concentrate work around the specific problem that they describe.

Other groups have produced OpenAI Gym environments around similar goals but with different scopes. CityLearn aims to integrate multi-agent RL applications for DR in connected communities (Vázquez-Canteli et al., 2019). The environment supports customizing an array of variables in the creation of a heterogeneous group of buildings, including the number of buildings, type of buildings, and demand profile. A competition was hosted in CityLearn, in which the creators solicited submissions of agents that could learn appropriately in their environment (Kathirgamanathan et al., 2020).
We are unaware of an effort that attempts to focus study around occupant-level energy DR in a Gym environment: that is, an effort that focuses on occupant-level DR within a building. Therefore, we endeavor in this work to present the OfficeLearn Gym environment, an environment that may serve as preparation grounds for experiments implementing DR within real-world test buildings.

2.1. Paper Outline

We have contextualized the creation of our Gym environment within the broader effort of applying ML techniques to climate change in Sections 1 and 2. In Section 3, we describe the technical details of our environment and how those will differ in future iterations. In Section 4, we illustrate the dynamics of the system by comparing key design choices in environmental setup. In Sections 5, 6, and 7, we conclude, note how you might use the environment, and discuss future directions.

3. Description of Gym Environment

3.1. Overview

In this section, we highlight a summary of the environment and the underlying Markov Decision Process. The flow of information is succinctly expressed in Figure 1.

Figure 1. A schematic showing the interplay between agent and office environment, and the ensuing energy responses. The agent receives prices from the grid, then transforms them into "points" (called as such for differentiation). Office workers engage with the points in the way an individual might engage with their home energy bill, which is reasonable assuming the behavioral incentives detailed in (Spangher et al., 2020). The office receives these points at the beginning of the "day". Workers proceed to use energy throughout the day, and at night the system delivers a record of their energy consumption, which is reduced into a reward that trains the agent.

The environment takes the format of the following Markov Decision Process (MDP), (S, A, p, r):

• State Space S: The prices, energy usage, and baseline energy, all 10-dimensional vectors.

• Action Space A: A 10-dimensional vector (continuous or discrete) containing the agent's points.

• Transition probability p.

• Reward r, defined in Section 3.5.

We describe our design choices and variants of the MDP below.
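To make the shape of this MDP concrete, the following is a minimal, self-contained sketch of a Gym-style environment with the same state and action dimensions. The class name, the placeholder price curve and dynamics, and all numeric values are illustrative assumptions, not the actual OfficeLearn implementation.

```python
import gym
import numpy as np
from gym import spaces


class OfficeEnvSketch(gym.Env):
    """Toy skeleton mirroring the MDP above; not the actual OfficeLearn code."""

    def __init__(self):
        # S: grid prices, prior energy usage, and baseline energy, each a 10-dim hourly vector.
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(30,), dtype=np.float32)
        # A: 10-dim vector of "points", one per working hour.
        self.action_space = spaces.Box(low=0.0, high=10.0, shape=(10,), dtype=np.float32)
        self._prices = np.full(10, 0.2, dtype=np.float32)                 # placeholder flat price curve
        self._baseline = np.linspace(30.0, 50.0, 10, dtype=np.float32)    # placeholder baseline energy
        self._prior_energy = self._baseline.copy()

    def _obs(self):
        return np.concatenate([self._prices, self._prior_energy, self._baseline])

    def reset(self):
        self._prior_energy = self._baseline.copy()
        return self._obs()

    def step(self, action):
        # Placeholder dynamics: a linear office-worker response (Section 3.4.1).
        demand = np.clip(self._baseline - action, 0.0, None).astype(np.float32)
        reward = -float(demand @ self._prices)   # placeholder reward; see Section 3.5
        self._prior_energy = demand
        done = False                             # one step corresponds to one simulated day
        return self._obs(), reward, done, {}
```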
3.2. State space

The steps of the agent are currently formulated day by day, with ten-hour working days considered. Therefore, while the state space has several different components (described below), each is ten-dimensional, as each component is hourly in nature.

3.2.1. Grid Price Regimes

Utilities are increasingly moving towards time-dependent energy pricing, especially for bigger consumers such as commercial office buildings with the capacity to shift their energy usage. Time of use (TOU) pricing is a simple, two-level daily price curve that changes seasonally and is declared ahead of time. We use PG&E's TOU price curves from 2019. Real time pricing (RTP), meanwhile, is dynamic for every hour and changes according to supply and demand in the energy market. We simulate it by subtracting the solar energy from the demand of a sample building. There is significant seasonal variation in prices depending on geography; e.g., in warmer climates, the increased cooling load during summer can cause an increase in energy prices.
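As a rough illustration of the two regimes, the snippet below builds a stylized TOU curve and an RTP-like signal over the ten working hours. All numbers are invented placeholders rather than PG&E's actual 2019 tariff.

```python
import numpy as np

# Stylized two-level TOU curve over ten working hours: off-peak shoulders, a peak midday block.
tou_prices = np.array([0.08] * 3 + [0.12] * 5 + [0.08] * 2)

# RTP-like signal: net demand (building demand minus solar generation) rescaled into a price.
building_demand = np.random.uniform(20.0, 60.0, size=10)
solar_generation = np.random.uniform(0.0, 40.0, size=10)
net_demand = np.clip(building_demand - solar_generation, 0.0, None)
rtp_prices = 0.05 + 0.10 * net_demand / max(net_demand.max(), 1e-6)
```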
3.2.2. Energy of the Prior Steps

The default instantiation of the environment includes the energy use of office workers in the prior step. This allows the agent to directly consider a day-to-day time dependence. The simulated office workers in this version are currently memoryless from day to day in their energy consumption, but a future simulation will allow for weekly deferrable energy demands, simulating weekly work that can be deferred and then accomplished.

The energy of the prior steps may be optionally excluded from the state space by those who use our environment.

3.2.3. Grid Prices of the Prior Step

Users may optionally include the grid price from prior steps in the state space. This would allow the agent to directly consider the behavioral hysteresis that past grid prices may have on a real office worker's energy consumption. Although this is a noted phenomenon in human psychology generally (Richards & Green, 2003), it is not well quantified, and so we have not included it in how we calculate our simulated human agents.

3.2.4. Baseline Energy

Baseline energy may optionally be included in the state space. If the agent directly observes its own action and the baseline energy, it observes all of the information necessary to calculate certain simpler simulated office worker responses. Therefore, inclusion of this element will make the problem fully observable, and truly an MDP rather than a Partially Observable MDP (POMDP).

3.3. Action space

The agent's action space expresses the points that the agent delivers to the office. The action space is by default a continuous value between zero and ten, but it may optionally be discretized to integer values if the learning algorithm outputs discrete values.

The purpose of the action is to translate the grid price into one that optimizes for behavioral response to points. Therefore, the policy will learn over time how people respond to the points given and maximally shift their demand towards the prices that the grid gives.
3.4. Office workers: simulated response functions

In this section, we summarize the various simulated responses that office workers may exhibit.

3.4.1. "Deterministic Office Worker"

We include three types of deterministic response, with the option for the user to specify a mixed office of all three.

In the linear response, we define a simple office worker who decreases their energy consumption linearly below a baseline with respect to the points given. Therefore, if b_t is the baseline energy consumption at time t and p_t are the points given, the energy demand d at time t is d_t = b_t − p_t, clipped at d_min and d_max as defined in Section 3.5.

In the sinusoidal response, we define an office worker who responds well to points towards the middle of the distribution and poorly to points at the extremes. Therefore, the energy demand d at time t is d_t = b_t − sin(p_t), clipped at d_min and d_max.

In the threshold exponential response, we define an office worker who does not respond to points until they are high, at which point they respond exponentially. Therefore, the energy demand is d_t = b_t − exp(p_t) · 1(p_t > 5), clipped at d_min and d_max.
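The three deterministic responses above reduce to a few lines of NumPy. A minimal sketch, with illustrative clipping bounds standing in for the d_min and d_max of Section 3.5, might be:

```python
import numpy as np

D_MIN, D_MAX = 5.0, 95.0   # illustrative stand-ins for the 5th/95th percentile bounds

def linear_response(baseline, points):
    return np.clip(baseline - points, D_MIN, D_MAX)

def sinusoidal_response(baseline, points):
    return np.clip(baseline - np.sin(points), D_MIN, D_MAX)

def threshold_exponential_response(baseline, points, threshold=5.0):
    # No response until points exceed the threshold, then an exponential drop in demand.
    return np.clip(baseline - np.exp(points) * (points > threshold), D_MIN, D_MAX)
```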
3.4.2. "Curtail and Shift Office Worker"

Office workers need to consume electricity to do their work, and may not be able to curtail their load below a minimum threshold, e.g., the minimum power needed to run a PC. They may have the ability to shift their load over a definite time interval, e.g., choosing to charge their laptops ahead of time or at a later time. We model a response function that exhibits both of these behaviors. We can model the aggregate load of a person (b_t) as a combination of fixed inflexible demand (b_t^fixed), curtailable demand (b_t^curtail), and shiftable demand (b_t^shift), i.e., b_t = b_t^fixed + b_t^curtail + b_t^shift. All of the curtailable demand is curtailed for the T_curtail hours (set to 3 hours in practice) with the highest points, and for every hour t the shiftable demand is shifted to the hour within [t − T_shift, t + T_shift] with the lowest energy price.
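A sketch of that behavior, under the assumption that the three demand components are represented as ten-dimensional NumPy vectors, could look like the following; the function name, parameter names, and window default are illustrative.

```python
import numpy as np

def curtail_and_shift_response(b_fixed, b_curtail, b_shift, points, prices,
                               t_curtail=3, t_shift=2):
    """Sketch of the curtail-and-shift office worker described above."""
    hours = len(points)
    demand = b_fixed.astype(float).copy()

    # Curtail: drop all curtailable demand in the T_curtail hours with the highest points.
    curtailed_hours = np.argsort(points)[-t_curtail:]
    remaining_curtailable = b_curtail.astype(float).copy()
    remaining_curtailable[curtailed_hours] = 0.0
    demand += remaining_curtailable

    # Shift: move each hour's shiftable demand to the cheapest hour in [t - T_shift, t + T_shift].
    for t in range(hours):
        lo, hi = max(0, t - t_shift), min(hours, t + t_shift + 1)
        cheapest = lo + int(np.argmin(prices[lo:hi]))
        demand[cheapest] += b_shift[t]

    return demand
```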
3.5. Reward

Specification of the reward function is notoriously difficult, as it is generally hand-tailored and must reduce a rich and often multi-dimensional environmental response into a single metric. Although we include many possible rewards in the code, we outline the two rewards that we feel most accurately describe the environment. As we already demonstrated in prior work the ability to reduce overall energy consumption (Spangher et al., 2019), we endeavor to direct this agent away from reducing consumption and towards optimally shifting energy consumption to favorable times of day.

3.5.1. Scaled Cost Distance

This reward is defined as the difference between the day's total cost of energy and the ideal cost of energy. The ideal cost of energy is obtained using a simple convex optimization. If d is the vector of actual energy demands computed for the day, g is the vector of grid prices for the day, E is the total amount of energy, and d_min, d_max are the 5% and 95% values of energy observed over the past year, then the ideal demands are calculated by optimizing the objective d* = argmin_d d^T g subject to the constraints Σ_{t=0}^{10} d_t = E and d_min < d < d_max. Then the reward becomes R(d) = (d^T g − d*^T g) / (d*^T g), i.e., taking the difference from the ideal cost and scaling by the total ideal cost to normalize the outcome.
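A sketch of this computation using an off-the-shelf convex solver (here cvxpy, as one possible choice; the original implementation may use a different solver) is:

```python
import cvxpy as cp
import numpy as np

def scaled_cost_distance(d, g, d_min, d_max):
    """Sketch of the Scaled Cost Distance reward (Section 3.5.1)."""
    total_energy = float(np.sum(d))

    # Ideal demand: minimize cost subject to the same total energy and per-hour bounds.
    d_star = cp.Variable(len(d))
    problem = cp.Problem(
        cp.Minimize(d_star @ g),
        [cp.sum(d_star) == total_energy, d_star >= d_min, d_star <= d_max],
    )
    problem.solve()

    ideal_cost = float(d_star.value @ g)
    actual_cost = float(d @ g)
    # Difference from the ideal cost, normalized by the ideal cost (smaller is better).
    return (actual_cost - ideal_cost) / ideal_cost
```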

3.5.2. Log Cost Regularized

Although the concept of the ideal cost is intuitive, the simplicity of the convex optimizer means that the output energy is often an unrealistic, quasi step function. Therefore, we propose an alternate reward, log cost regularized. Following the notation from above, the reward is R(d) = −d^T g − λ Σ_{t=0}^{10} 1(d_t < 0.5 · b_max), where b_max refers to the maximum value of the baseline. In practice, we set λ to some high value like 100. The purpose of the regularizer is to penalize the agent for driving down energy across the domain, and instead encourage it to shift energy.
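Following that formula, a sketch of the computation (treating the indicator sum literally, with λ = 100 as suggested above) might be:

```python
import numpy as np

def log_cost_regularized(d, g, b_max, lam=100.0):
    """Sketch of the Log Cost Regularized reward (Section 3.5.2)."""
    cost = float(d @ g)
    # Count the hours whose demand falls below half of the baseline maximum; the
    # penalty discourages the agent from simply suppressing consumption everywhere.
    suppressed_hours = int(np.sum(d < 0.5 * b_max))
    return -cost - lam * suppressed_hours
```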
4. Illustration of Features

We will now demo the environment's functioning. All comparisons are done with a vanilla Soft Actor Critic (SAC) RL agent that learns over 10,000 steps (where one step is equal to one day), with a TOU pricing regime fixed at a single day. The agent's points are scaled between -1 and 1.
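For reference, a training setup of this kind can be assembled from an off-the-shelf SAC implementation; the snippet below uses stable-baselines3 as one example, and the OfficeLearn import path and constructor are hypothetical placeholders.

```python
from stable_baselines3 import SAC

from officelearn import OfficeEnv   # hypothetical import path for the environment

env = OfficeEnv()                    # TOU pricing regime fixed at a single day, as above
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)  # one step = one simulated day
```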
4.0.1. Comparison of Reward Types

We present the effect of using the Log Cost Regularized and the Scaled Cost Distance rewards. Please see Figure 2, in the Appendix, for a side-by-side comparison of the reward types. In this figure, you can see not only that the agent is capable of learning an action sequence that accomplishes a lower cost than if the simulated office workers were to respond directly to the untransformed grid prices, but also that the two rewards differ in how the learning is guided. The log cost regularized reward accomplishes smoother prices that result in the agent deferring most of the energy to the end of the day, whereas the scaled cost distance reward allows for more energy earlier in the day, guiding the simulated office worker to increase energy gradually throughout the day.

4.0.2. Comparison of Office Worker Response Functions

We present the effect of using different simulated office workers on the output of energy demand. Please see Figure 3, in the Appendix, for a comparison of two types of simulated office workers. In the exponential response, we see an example of how the office worker's energy demand responds to points – that is, perhaps, too coarsely for a learner to make much difference. Meanwhile, the Curtail and Shift response demonstrates a much richer response, which enables a learner to learn the situation and perform better than the control.
5. Conclusion

We present technical details of a novel Gym environment for the testing of RL for energy DR within a single building. We detail the design choices that we made while constructing the environment. We then show demos of different reward types and simulated office responses.

6. Simulating DR in your building

The environment we provide contains many ways to customize your own building. You may choose the number of occupants, their response types, baseline energies, grid price regimes, and the frequency with which grid price regimes change. You may also choose from a host of options when it comes to customizing the agent and its state space. Please contact us if you are interested in deeper customization and would like a tutorial on the code.
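To give a flavor of what such customization could look like, here is a hypothetical constructor call; every keyword name and value below is an illustrative assumption rather than the environment's actual signature.

```python
from officelearn import OfficeEnv   # hypothetical import path

env = OfficeEnv(
    number_of_occupants=10,                     # hypothetical parameter names throughout
    response_type="curtail_and_shift",          # e.g., "linear", "sinusoidal", "threshold_exp"
    price_regime="TOU",                         # or "RTP"
    price_regime_change_frequency="seasonal",
    include_prior_energy_in_state=True,
    include_baseline_energy_in_state=True,
)
```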
sponds to points – that is, perhaps, too coarsely for a learner 8. Acknowledgements
to make much difference. Meanwhile, the Curtail and Shift We would like to thank our wonderful and responsive peers
response demonstrates a much richer response, which en- Peter Henderson and Andreea Bobu for their thoughtful ad-
ables a learner to learn the situation and perform better than vice and commentary on the environment. We would like to
the control. also extend thanks to Alex Devonport, Adam Bouyamourn,
and Akaash Tawade for their help in earlier versions. We
5. Conclusion would all, finally, like to thank our respective parents.

We present technical details of a novel gym environment for


the testing of RL for energy DR within a single building. We

References

Borowiec, S. AlphaGo seals 4-1 victory over Go grandmaster Lee Sedol. The Guardian, 15, 2016.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Comstock, O. and Jarzomski, K. Consumption and saturation trends of residential miscellaneous end-use loads. ACEEE Summer Study on Energy Efficiency in Buildings, Pacific Grove, CA, USA, 2012.

Feng, D., Gomes, C. P., and Selman, B. Solving hard AI planning instances using curriculum-driven deep reinforcement learning. arXiv preprint arXiv:2006.02689, 2020.

Kathirgamanathan, A., Twardowski, K., Mangina, E., and Finn, D. A centralised soft actor critic deep reinforcement learning approach to district demand side management through CityLearn, 2020.

Lanzisera, S., Dawson-Haggerty, S., Cheung, H. Y. I., Taneja, J., Culler, D., and Brown, R. Methods for detailed energy data collection of miscellaneous and electronic loads in a commercial office building. Building and Environment, 65:170–177, 2013.

Li, F. and Du, Y. From AlphaGo to power system AI: What engineers can learn from solving the most complex board game. IEEE Power and Energy Magazine, 16(2):76–84, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Richards, T. J. and Green, G. P. Economic hysteresis in variety selection. Journal of Agricultural and Applied Economics, 35(1):1–14, 2003.

Spangher, L., Tawade, A., Devonport, A., and Spanos, C. Engineering vs. ambient type visualizations: Quantifying effects of different data visualizations on energy consumption. In Proceedings of the 1st ACM International Workshop on Urban Building Energy Sensing, Controls, Big Data Analysis, and Visualization, UrbSys '19, pp. 14–22, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450370141. doi: 10.1145/3363459.3363527. URL https://doi.org/10.1145/3363459.3363527.

Spangher, L., Gokul, A., Khattar, M., Palakapilly, J., Tawade, A., Bouyamourn, A., Devonport, A., and Spanos, C. Prospective experiment for reinforcement learning on demand response in a social game framework. In Proceedings of the 2nd International Workshop on Applied Machine Learning for Intelligent Energy Systems (AMLIES) 2020, 2020.

Srinivasan, R. S., Lakshmanan, J., Santosa, E., and Srivastav, D. Plug load densities for energy analysis: K-12 schools. Energy and Buildings, 43:3289–3294, 2011.

Sutton, R. S. and Barto, A. G. Reinforcement learning I: Introduction, 1998.

Vázquez-Canteli, J. R., Kämpf, J., Henze, G., and Nagy, Z. CityLearn v1.0: An OpenAI Gym environment for demand response with deep reinforcement learning. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, BuildSys '19, pp. 356–357, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450370059. doi: 10.1145/3360322.3360998. URL https://doi.org/10.1145/3360322.3360998.

9. Appendix

Figure 2. A comparison of the Log Cost Regularized and the Scaled Cost Distance rewards. The energy output of the simulated office workers is drawn in light blue and corresponds to the primary axes. The grid prices are drawn in red, reflect TOU pricing, and correspond to the secondary axes. The agent's actions are drawn in dark blue, are scaled between -1 and 1 to improve readability of the plots, and correspond to the secondary axes.

Figure 3. A comparison of the "Exponential Deterministic Office Worker" to the "Curtail and Shift Office Worker". The energy output of the simulated office workers is drawn in light blue and corresponds to the primary axes. The grid prices are drawn in red and correspond to the secondary axes. The agent's actions are drawn in dark blue, are scaled between -1 and 1, and correspond to the secondary axes.
