OfficeLearn: An OpenAI Gym Environment for Reinforcement Learning on Occupant-Level Building's Energy Demand Response
Lucas Spangher 1 Akash Gokul 1 Joseph Palakapilly 1 Utkarsha Agwan 1 Manan Khattar 1 Wann-Jiun Ma 2
Costas Spanos 1
3.2.3. Grid Prices of the Prior Step

Users may optionally include the grid price from prior steps in the state space. This would allow the agent to directly consider the behavioral hysteresis that past grid prices may have on a real office worker's energy consumption. Although this is a noted phenomenon in human psychology generally (Richards & Green, 2003), it is not well quantified, and so we have not included it in how we calculate our simulated human agents.

3.2.4. Baseline Energy

Baseline energy may optionally be included in the state space. If the agent directly observes its own action and the baseline energy, it observes all of the information necessary to calculate certain simpler simulated office worker responses. Therefore, inclusion of this element will make the problem fully observable, and truly an MDP rather than a Partially Observable MDP (POMDP).

3.3. Action space

The agent's action space expresses the points that the agent delivers to the office. The action space is by default a continuous value between zero and ten, but may optionally be discretized to integer values if the learning algorithm outputs discrete values.

The purpose of the action is to translate the grid price into one that optimizes for behavioral response to points. Therefore, the policy will learn over time how people respond to the points given and maximally shift their demand towards the prices that the grid gives.
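As a concrete illustration, the default continuous action space and its optional discretization could be declared with OpenAI Gym's spaces API roughly as follows. This is a minimal sketch that assumes one point value per hour of a ten-hour working day (see Section 3.5); the variable names are ours, not the environment's exact implementation.

```python
import numpy as np
from gym import spaces

# Sketch only: one point value per hour of a ten-hour working day.
# Default: continuous points in [0, 10].
continuous_points = spaces.Box(low=0.0, high=10.0, shape=(10,), dtype=np.float32)

# Optional discretization: integer points 0..10 per hour for discrete-output learners.
discrete_points = spaces.MultiDiscrete([11] * 10)

sample_action = continuous_points.sample()  # the points the agent would deliver to the office
```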
3.4. Office workers: simulated response functions

In this section, we will summarize the various simulated responses that office workers may exhibit.

3.4.1. "Deterministic Office Worker"

We include three types of deterministic response, with the option for the user to specify a mixed office of all three.

In the linear response, we define a simple office worker who decreases their energy consumption linearly below a baseline with respect to the points given. Therefore, if $b_t$ is the baseline energy consumption at time $t$ and $p_t$ are the points given, the energy demand $d$ at time $t$ is $d_t = b_t - p_t$, clipped at $d_{min}$ and $d_{max}$ as defined in Section 3.5.

In the sinusoidal response, we define an office worker who responds well to points towards the middle of the distribution and not well to points at the extremes. Therefore, the energy demand $d$ at time $t$ is $d_t = b_t - \sin p_t$, clipped at $d_{min}$ and $d_{max}$.

In the threshold exponential response, we define an office worker who does not respond to points until they are high, at which point they respond exponentially. Therefore, the energy demand $d$ is $d_t = b_t - \exp(p_t) \cdot \mathbb{1}(p_t > 5)$, clipped at $d_{min}$ and $d_{max}$.
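These three responses can be written down compactly. The sketch below is illustrative only (the function name and signature are ours); it follows the formulas above and clips each response to $[d_{min}, d_{max}]$.

```python
import numpy as np

def deterministic_response(b, p, d_min, d_max, kind="linear"):
    """Illustrative sketch of the three deterministic office worker responses.

    b: hourly baseline energy consumption, p: hourly points given by the agent.
    """
    if kind == "linear":
        d = b - p                           # d_t = b_t - p_t
    elif kind == "sinusoidal":
        d = b - np.sin(p)                   # d_t = b_t - sin(p_t)
    elif kind == "threshold_exponential":
        d = b - np.exp(p) * (p > 5)         # responds only once points exceed 5
    else:
        raise ValueError(f"unknown response type: {kind}")
    return np.clip(d, d_min, d_max)         # clipped at d_min and d_max
```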
3.4.2. "Curtail and Shift Office Worker"

Office workers need to consume electricity to do their work, and may not be able to curtail their load below a minimum threshold, e.g. the minimum power needed to run a PC. They may have the ability to shift their load over a definite time interval, e.g. choosing to charge their laptops ahead of time or at a later time. We model a response function that exhibits both of these behaviors. We can model the aggregate load of a person ($b_t$) as a combination of fixed inflexible demand ($b_t^{fixed}$), curtailable demand ($b_t^{curtail}$), and shiftable demand ($b_t^{shift}$), i.e., $b_t = b_t^{fixed} + b_t^{curtail} + b_t^{shift}$. All of the curtailable demand is curtailed for the $T_{curtail}$ hours (set to 3 hours in practice) with the highest points, and for every hour $t$ the shiftable demand is shifted to the hour within $[t - T_{shift}, t + T_{shift}]$ with the lowest energy price.
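A sketch of this behavior, with illustrative names and hourly arrays, might look as follows: curtailable load is dropped in the $T_{curtail}$ highest-point hours, and each hour's shiftable load moves to the cheapest hour within its window.

```python
import numpy as np

def curtail_and_shift_response(b_fixed, b_curtail, b_shift, points, prices,
                               t_curtail=3, t_shift=1):
    """Illustrative sketch of the "Curtail and Shift" office worker (hourly arrays)."""
    hours = len(points)
    demand = b_fixed.astype(float)                    # inflexible demand is always consumed

    # Curtail: drop all curtailable load in the t_curtail hours with the highest points.
    highest_point_hours = np.argsort(points)[-t_curtail:]
    kept_curtailable = b_curtail.astype(float)
    kept_curtailable[highest_point_hours] = 0.0
    demand = demand + kept_curtailable

    # Shift: move each hour's shiftable load to the cheapest hour in [t - t_shift, t + t_shift].
    for t in range(hours):
        lo, hi = max(0, t - t_shift), min(hours, t + t_shift + 1)
        cheapest = lo + int(np.argmin(prices[lo:hi]))
        demand[cheapest] += b_shift[t]
    return demand
```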
3.5. Reward

Specification of the reward function is notoriously difficult, as it is generally hand-tailored and must reduce a rich and often multi-dimensional environmental response into a single metric. Although we include many possible rewards in the code, we outline the two rewards that we feel most accurately describe the environment. As we already demonstrated the ability to reduce overall energy consumption in prior work (Spangher et al., 2019), we endeavor to direct this agent away from reducing consumption and towards optimally shifting energy consumption to favorable times of day.

3.5.1. Scaled Cost Distance

This reward is defined as the difference between the day's total cost of energy and the ideal cost of energy. The ideal cost of energy is obtained using a simple convex optimization. If $\vec{d}$ is the actual energy demand computed for the day, $\vec{g}$ is the vector of the grid prices for the day, $E$ is the total amount of energy, and $d_{min}$, $d_{max}$ are the 5% and 95% values of energy observed over the past year, then the ideal demands are calculated by optimizing the objective $d^* = \arg\min_d d^T g$ subject to the constraints $\sum_{t=0}^{10} d_t = E$ and $d_{min} < d < d_{max}$. Then, the reward becomes $R(d) = \frac{d^{*T} g - d^T g}{d^{*T} g}$, i.e. taking the difference and scaling by the total ideal cost to normalize the outcome.
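The ideal-demand problem and the resulting reward can be sketched with a generic convex solver such as cvxpy; this is an illustration of the math above, not necessarily how the repository implements it.

```python
import cvxpy as cp
import numpy as np

def scaled_cost_distance(d, g, E, d_min, d_max):
    """Illustrative sketch of the Scaled Cost Distance reward.

    d: realized hourly demand, g: hourly grid prices, E: total daily energy,
    d_min/d_max: 5th/95th percentile demand values from the past year.
    """
    g = np.asarray(g, dtype=float)
    d_star = cp.Variable(len(g))                               # ideal hourly demand
    problem = cp.Problem(cp.Minimize(d_star @ g),              # ideal cost: min_d d^T g
                         [cp.sum(d_star) == E, d_star >= d_min, d_star <= d_max])
    problem.solve()
    ideal_cost = float(d_star.value @ g)
    actual_cost = float(np.asarray(d, dtype=float) @ g)
    return (ideal_cost - actual_cost) / ideal_cost             # 0 when the day matches the ideal
```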
3.5.2. Log Cost Regularized

Although the concept of the ideal cost is intuitive, the simplicity of the convex optimizer means that the output energy is often an unrealistic, quasi step function. Therefore, we propose an alternate reward, the log cost regularized. Following the notation from above, the reward is $R(d) = -d^T g - \lambda \, \mathbb{1}\left(\sum_t d_t < 10 \cdot (0.5 \cdot b_{max})\right)$, where $b_{max}$ refers to the max value from the baseline. In practice, we set $\lambda$ to some high value like 100. The purpose of the regularizer is to penalize the agent for driving down energy across the domain, and instead encourage it to shift energy.
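A sketch of this reward, with illustrative names, follows the formula above.

```python
import numpy as np

def log_cost_regularized(d, g, b_max, lam=100.0):
    """Illustrative sketch of the log cost regularized reward.

    Penalizes the day's energy cost and adds a large penalty lam whenever total
    consumption falls below half of the maximum baseline over the ten-hour day,
    discouraging the agent from simply driving demand down instead of shifting it.
    """
    cost = float(np.asarray(d, dtype=float) @ np.asarray(g, dtype=float))
    drove_demand_down = float(np.sum(d) < 10 * (0.5 * b_max))   # regularization indicator
    return -cost - lam * drove_demand_down
```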
4. Illustration of Features

We will now demo the environment's functioning. All comparisons are done with a vanilla Soft Actor Critic (SAC) RL agent that learns throughout 10,000 steps (where one step is equal to one day), with a TOU pricing regime fixed at a single day. The agent's points are scaled between -1 and 1.
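Because the agent's output lives in [-1, 1] while points are delivered on the zero-to-ten scale of Section 3.3, a small rescaling helper (the name is ours, not the repository's) bridges the two:

```python
import numpy as np

def to_points(raw_action):
    """Map an agent action in [-1, 1] to the environment's 0-10 point scale."""
    return np.clip((np.asarray(raw_action) + 1.0) * 5.0, 0.0, 10.0)
```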
4.0.1. Comparison of Reward Types

We present the effect of using the Log Cost Regularized and the Scaled Cost Distance rewards. Please see Figure 2, in the Appendix, for a side-by-side comparison of the reward types. In this figure, you can see that not only is the agent capable of learning an action sequence that accomplishes a lower cost than if the simulated office workers were to respond directly to the untransformed grid prices, but also that the two rewards differ in how the learning is guided. The log cost regularized reward accomplishes smoother prices that result in the agent deferring most of the energy to the end of the day, whereas the scaled cost distance reward allows for more energy earlier in the day, guiding the simulated office worker to increase energy gradually throughout the day.

4.0.2. Comparison of Office Worker Response Functions

We present the effect of using different simulated office workers on the output of energy demand. Please see Figure 3, in the Appendix, for a comparison of two types of simulated office workers. In the exponential response, we see an example of how the office worker's energy demand responds to points – a response that is, perhaps, too coarse for a learner to make much difference. Meanwhile, the Curtail and Shift response demonstrates a much richer response, which enables a learner to learn the situation and perform better than the control.

5. Conclusion

We detail the design choices that we made while constructing the environment, and then show demos of different reward types and simulated office responses.
6. Simulating DR in your building

The environment we provide contains many ways to customize your own building. You may choose the number of occupants, their response types, baseline energies, grid price regimes, and the frequency with which grid price regimes change. You may also choose from a host of options when it comes to customizing the agent and its state space. Please contact us if you are interested in deeper customization and would like a tutorial on the code.
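As an illustration only, a customized building might be configured along these lines; the keyword names below are placeholders standing in for the options described above, not the repository's exact constructor arguments.

```python
# Placeholder configuration mirroring the options described above (names are hypothetical).
config = {
    "num_occupants": 10,                    # number of simulated office workers
    "response_type": "curtail_and_shift",   # or one of the deterministic responses
    "baseline_energy": "office_default",    # baseline energy profile
    "price_regime": "TOU",                  # grid price regime
    "regime_change_frequency": "weekly",    # how often the price regime changes
}
# env = OfficeLearnEnv(**config)            # hypothetical constructor name
```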
7. Future Work

7.1. Variants of the MDP

We plan to offer the user the choice between a step size that is a day's length and a step size that is an hour's length. This alteration can provide a more efficient state space representation that allows for a fully observable MDP for the agent, as well as a longer trajectory for action sequences (i.e., ten steps for every trajectory to determine the ten hours rather than a single step producing all ten hours), a setting at which RL tends to excel.

7.2. Reality Gap

Similar to existing simulations, e.g. Sim2Real (CITE SIM2REAL), there is a gap between our environment and reality. Future work in this direction will build more realistic response functions by relying on the existing modelling literature (CITE Planning Circuit from Fall'19).

7.3. OfficeLearn Competition

We plan to host an OfficeLearn competition in the future. This competition will prioritize agents that can maximize sample efficiency, due to the realistic time constraints of a social game, and deferability of energy on a test set of simulated office workers.

8. Acknowledgements

We would like to thank our wonderful and responsive peers Peter Henderson and Andreea Bobu for their thoughtful advice and commentary on the environment. We would also like to extend thanks to Alex Devonport, Adam Bouyamourn, and Akaash Tawade for their help with earlier versions. We would all, finally, like to thank our respective parents.