Deep Reinforcement Learning For Smart Home Energy Management
Abstract
In this paper, we investigate an energy cost minimization problem for a smart home in the absence of a building thermal
dynamics model with the consideration of a comfortable temperature range. Due to the existence of model uncertainty, parameter
uncertainty (e.g., renewable generation output, non-shiftable power demand, outdoor temperature, and electricity price) and
temporally-coupled operational constraints, it is very challenging to design an optimal energy management algorithm for scheduling
Heating, Ventilation, and Air Conditioning (HVAC) systems and energy storage systems in the smart home. To address the
challenge, we first formulate the above problem as a Markov decision process, and then propose an energy management algorithm
based on Deep Deterministic Policy Gradients (DDPG). It is worth mentioning that the proposed algorithm does not require prior knowledge of the uncertain parameters or the building thermal dynamics model. Simulation results based on real-world traces
demonstrate the effectiveness and robustness of the proposed algorithm.
Index Terms
Smart home, energy management, deep reinforcement learning, energy cost, thermal comfort, energy storage systems, HVAC
systems
I. INTRODUCTION
As a next-generation power system, smart grid is typified by an increased use of information and communications technology
(e.g., Internet of Things) in the generation, transmission, distribution, and consumption of electrical energy. In smart grid
environment, there are many opportunities for saving the energy cost of smart homes, which evolve from traditional homes by adopting three components, i.e., internal networks, intelligent controls, and home automation [1]. For example, dynamic
electricity prices could be utilized to reduce energy cost by scheduling Energy Storage Systems (ESS) and thermostatically
controllable loads intelligently. As one kind of thermostatically controllable loads, Heating, Ventilation, and Air Conditioning
(HVAC) systems consume about 40% of total energy in a household [2], which results in energy cost concerns for smart home
owners. Since the primary purpose of HVAC systems is to maintain thermal comfort for the occupants, it is of great importance
to optimize the energy cost of smart homes without sacrificing thermal comfort.
In this paper, we investigate an energy optimization problem for a smart home with renewable energies, ESS, HVAC systems,
and non-shiftable loads (e.g., televisions) in the absence of a building thermal dynamics model. To be specific, our objective is
to minimize the energy cost of the smart home during a time horizon with the consideration of a comfortable indoor temperature
range. However, it is very challenging to achieve the above aim due to the following reasons. Firstly, it is often intractable to
obtain accurate dynamics of indoor temperature, which can be affected by many factors [3]. Secondly, it is difficult to know
the statistical distributions of all combinations of random system parameters (e.g., renewable generation output, power demand
of non-shiftable loads, outdoor temperature, and electricity price). Thirdly, there are temporally-coupled operational constraints
associated with ESS and HVAC systems, which means that the current action would affect the future decisions. To address
the above challenge, we propose a Deep Deterministic Policy Gradients (DDPG)-based energy management algorithm, which can make decisions about ESS charging/discharging power and HVAC input power based only on the current observation information.
The main contributions of this paper are summarized as follows.
• We investigate an energy cost minimization problem for smart homes in the absence of a building thermal dynamics model
with the consideration of a comfortable temperature range, energy exchange between the smart home and the utility grid,
ESS charging/discharging, HVAC input power adjustment, and parameter uncertainties. Then, we reformulate the problem
as a Markov Decision Process (MDP), where environment state, action and reward function are designed.
L. Yu, W. Xie, D. Xie, Y. Zou, Z. Sun, L. Zhang are with Key Laboratory of Broadband Wireless Communication and Sensor Network Technology of Ministry
of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, P. R. China.
D. Zhang is with Jiangsu Key Laboratory of Broadband Wireless Communication and Internet of Things, School of Internet of Things, Nanjing University of
Posts and Telecommunications, Nanjing 210003, P. R. China.
Y. Zhang is with the Department of Engineering, University of Leicester, Leicester LE1 7RH, U.K.
T. Jiang is with Wuhan National Laboratory for Optoelectronics, School of Electronic Information and Communications, Huazhong University of Science and
Technology, Wuhan 430074, P. R. China.
• We propose an energy management algorithm to jointly schedule the ESS and the HVAC system based on DDPG. Since the proposed algorithm makes decisions based only on the current environment state, it does not require prior knowledge of uncertain parameters or the building thermal dynamics model.
• Extensive simulation results based on real-world traces show that the proposed algorithm can save energy cost by 8.10%-
15.21% without sacrificing thermal comfort when compared with two baselines. Moreover, the robustness testing shows
that the proposed algorithm has the potential of providing a more efficient and practical tradeoff between maintaining
thermal comfort and reducing energy cost than an “optimal” strategy.
The remainder of this paper is organized as follows. In Section II, we introduce related works. In Section III, system model
and problem formulation are given. Then, we propose a DDPG-based energy management algorithm in Section IV and its
effectiveness is verified by simulation results in Section V. Finally, we conclude this paper and discuss future work in Section VI.
A. Model-based approaches
In [4], Angelis et al. presented a home energy management approach to minimize the energy cost related to task execution,
energy storage, energy selling and heat pump without violating the given comfortable temperature range and other constraints.
In [5], Fan et al. proposed an online home energy management scheme to minimize the energy cost associated with electric
water heaters and HVAC systems with the consideration of indoor temperature ranges. In [6], Zhang et al. developed a home
energy management strategy to minimize energy cost related to the HVAC load and deferrable loads without violating the
given comfortable temperature range. In [7], Pilloni et al. proposed a Quality of Experience (QoE)-aware smart home energy
management system to save energy cost while minimizing the annoyance perceived by users. In [8], Yu et al. proposed an
online home energy management algorithm to minimize the sum of energy cost and thermal discomfort cost (here, the thermal discomfort cost is a function of the deviation between the indoor temperature and the comfortable temperature level).
In [11], Franceschelli et al. proposed a heuristic approach to optimize the peak-to-average power ratio of a large population
of thermostatically controlled loads considering comfortable temperature ranges. Although some advances have been made in
the above-mentioned works, their approaches need to model building thermal dynamics with simplified mathematical models,
e.g., Equivalent Thermal Parameters (ETP) model.
A. ESS Model
Let Bt be the stored energy in the ESS at time slot t. Then, the ESS storage dynamics model is given by
B_{t+1} = B_t + η_c c_t + d_t/η_d, ∀ t, (1)
where ηc ∈ (0, 1] and ηd ∈ (0, 1] are the charging and discharging efficiency coefficients, respectively; ct and dt are ESS
charging power and discharging power, respectively. Here, ct and dt are assigned with different signs (i.e., ct ≥ 0 and dt ≤ 0),
which contributes to the design of the action in Section III-F.
Since ESS cannot be charged above its capacity B max or discharged below the minimal energy level B min , we have
B min ≤ Bt ≤ B max , ∀ t. (2)
Due to the existence of ESS charging and discharging rate limitations, we have
0 ≤ ct ≤ cmax , ∀ t, (3)
−dmax ≤ dt ≤ 0, ∀ t, (4)
where cmax and dmax are maximum charging and discharging power of the ESS, respectively.
To avoid the simultaneous ESS charging and discharging, we have
ct · dt = 0, ∀ t. (5)
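To make the ESS model concrete, the following minimal Python sketch applies (1)-(5) to advance the stored energy by one slot; it assumes a one-hour slot (so power and energy are numerically interchangeable), the efficiency values match the simulation setup later in the paper, and the capacity and rate limits are illustrative placeholders rather than the paper's settings.

```python
def ess_step(B_t, c_t, d_t, eta_c=0.95, eta_d=0.95,
             B_min=0.0, B_max=6.0, c_max=3.0, d_max=3.0):
    """Advance the ESS energy level by one slot according to (1)-(5).

    c_t >= 0 is the charging power and d_t <= 0 is the discharging power
    (opposite signs), and simultaneous charging/discharging is ruled out by (5).
    Capacity and rate limits here are illustrative, not the paper's settings.
    """
    assert 0.0 <= c_t <= c_max and -d_max <= d_t <= 0.0   # (3), (4)
    assert c_t * d_t == 0.0                               # (5)
    B_next = B_t + eta_c * c_t + d_t / eta_d              # (1)
    assert B_min <= B_next <= B_max                       # (2)
    return B_next

# Example: charge at 2 kW for one slot starting from 1 kWh stored.
print(ess_step(B_t=1.0, c_t=2.0, d_t=0.0))  # 1.0 + 0.95*2 = 2.9
```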
B. HVAC Model
The HVAC system can be dynamically adjusted to maintain thermal comfort of the occupants in the smart home. Since
thermal comfort depends on many factors (e.g., air temperature, mean radiant temperature, relative humidity, air speed, clothing
insulation, and metabolic rate), its representation is very complex. In existing studies, many modeling approaches and parameter
measurement methods associated with thermal comfort have been developed [16] [22]–[28]. Similar to [3]–[6], this paper uses
a comfortable temperature range as the representation of thermal comfort for simplicity, i.e.,
T min ≤ Tt ≤ T max , ∀ t, (6)
where T^min and T^max are the minimum and maximum comfort level, respectively.
In this paper, we consider an HVAC system with inverter in the smart home, i.e., the HVAC system can adjust its input
power e_t continuously [8]. Let e^max be the rated power of the HVAC system; then, we have
0 ≤ et ≤ emax , ∀ t. (7)
C. Power Balancing
To keep the power balance in the smart home, the aggregated power supply should be equal to the served power demand.
Then, we have
gt + pt − dt = bt + et + ct , ∀ t, (8)
where gt , pt , bt are power drawn from the utility grid, renewable generation output, and non-shiftable power demand,
respectively. If gt < 0, it means that energy from the smart home will be sold to the utility grid. Otherwise, the smart
home will purchase energy from the utility grid.
D. Cost Model
Let vt and ut be the buying and selling price of energy, respectively. Then, the energy cost of the smart home at time slot
t can be calculated by
C_{1,t} = (v_t − u_t)/2 · |g_t| + (v_t + u_t)/2 · g_t, ∀ t, (9)
where the intuition behind (9) is that just one variable gt is needed to reflect the behavior of electricity buying or selling. For
example, when gt ≥ 0, C1,t = vt gt . For the case gt < 0, C1,t = ut gt .
It is well known that frequent discharging or charging would do harm to the lifetime of the ESS. To capture this phenomenon,
ESS depreciation cost at time slot t is introduced as follows [29]
C2,t = ψ(|ct | + |dt |), ∀ t, (10)
where ψ denotes ESS depreciation coefficient in $/kW.
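As an illustration of (8)-(10), the sketch below derives the grid exchange g_t from the power balance and then evaluates the energy cost and the ESS depreciation cost. Variable names follow the text; the depreciation coefficient ψ and the numerical values in the example are purely illustrative, with the selling price set to 0.9 times the buying price as in the simulation setup.

```python
def grid_power(p_t, b_t, e_t, c_t, d_t):
    """Power drawn from the grid via the balance (8):
    g_t + p_t - d_t = b_t + e_t + c_t  =>  g_t = b_t + e_t + c_t + d_t - p_t
    (recall d_t <= 0, so discharging reduces the grid import)."""
    return b_t + e_t + c_t + d_t - p_t

def energy_cost(g_t, v_t, u_t):
    """Energy cost (9): equals v_t*g_t when buying (g_t >= 0), u_t*g_t when selling."""
    return 0.5 * (v_t - u_t) * abs(g_t) + 0.5 * (v_t + u_t) * g_t

def ess_depreciation(c_t, d_t, psi=0.01):
    """ESS depreciation cost (10); psi in $/kW is an illustrative value."""
    return psi * (abs(c_t) + abs(d_t))

# Example: surplus solar is sold back to the grid at u_t = 0.9 * v_t.
g = grid_power(p_t=4.0, b_t=1.0, e_t=1.0, c_t=0.0, d_t=0.0)   # g = -2.0 (selling)
print(energy_cost(g, v_t=0.2, u_t=0.18))                      # 0.18 * (-2.0) = -0.36
```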
The operational constraints associated with the ESS and the HVAC system are temporally coupled, which means that the current action would affect future decisions. To handle this "time-coupling" property, typical methods are based on
dynamic programming [8], which suffers from “the curse of dimensionality” problem. In this paper, we provide a way of solving
P1 without requiring the dynamics of indoor temperature and prior knowledge of random system parameters. In particular, we
reformulate the above-mentioned sequential decision making problem as a MDP problem. Then, we develop a DDPG-based
energy management algorithm for the problem.
F. MDP Formulation
In the smart home, the indoor temperature at next time slot is only determined by the indoor temperature, HVAC power input,
and environment disturbances (e.g., outdoor temperature and solar irradiance intensity) in the current time slot [6] [7] [30] [31].
Moreover, the ESS energy level at next time slot just depends on the current energy level and current discharging/charging
power according to (1), which is independent of previous states and actions. Thus, both ESS scheduling and HVAC control can be regarded as an MDP. In the following parts, we will formulate the sequential decision-making problem associated with smart home energy management as an MDP. It is worth noting that the MDP formulation is an approximate description of the smart home energy management problem since some components of the environment state may not be Markovian in practice, e.g., renewable generation output and electricity price. According to existing works [15] [32], even though the environment is
not strictly MDP, the corresponding problem can still be solved by reinforcement learning based algorithms empirically, which
is also validated by simulation results in this paper. For non-Markovian environment, many approaches could be adopted
to improve the performance of reinforcement learning based algorithms, e.g., approximate state [32] [33], recurrent neural
networks [34], gated end-to-end memory policy networks [35], and eligibility traces [33].
Fig. 2. The agent-environment interaction in the MDP.
A discounted MDP is formally defined as a five-tuple M = (S, A, P, R, γ), where S is the set of environment states and
A is the set of actions. P : S × A × S → [0, 1] is the transition probability function, which models the uncertainty in the
evolution of states of the system based on the action taken by the agent [36]. R : S × A → R is the reward function and
γ ∈ [0, 1] is a discount factor. In this paper, the agent denotes the learner and decision maker (i.e., the HEMS agent), while the environment comprises the objects outside the agent (e.g., renewable generators, non-shiftable loads, the ESS, the HVAC system, the utility grid, and indoor/outdoor temperatures). The interaction between the agent and the environment is depicted in
Fig. 2, where the HEMS agent observes environment state st and takes action at . Then, environment state becomes st+1 and
the reward Rt+1 is returned. In the following parts, we will design key components of the MDP, including environment state,
action and reward function.
1) Environment State: The environment state consists of seven kinds of information, i.e., renewable generation output pt ,
non-shiftable power demand bt , ESS energy level Bt , outdoor temperature Ttout , indoor temperature Tt , buying electricity price
vt , and time slot index in a day t′ (t′ = mod (t, 24)). Since selling electricity price ut is typically related to buying electricity
price vt (e.g., ut = δvt [37]–[39], δ is a constant), ut is not selected as a part of the environment state. For brevity, st is
adopted to describe the environment state, i.e., st = (pt , bt , Bt , Ttout , Tt , vt , t′ ).
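For concreteness, the seven state components can be packed into a fixed-order vector as sketched below. The min-max normalization anticipates the preprocessing function φ introduced in Section IV; the (min, max) ranges here are illustrative assumptions, whereas in practice they would be taken from the training traces.

```python
import numpy as np

# Fixed ordering of the seven state components: (p_t, b_t, B_t, T_t^out, T_t, v_t, t').
STATE_KEYS = ("p", "b", "B", "T_out", "T_in", "v", "hour")

# Illustrative (min, max) ranges used for min-max normalization to [0, 1].
STATE_RANGES = {
    "p": (0.0, 6.0), "b": (0.0, 5.0), "B": (0.0, 6.0),
    "T_out": (60.0, 110.0), "T_in": (60.0, 90.0), "v": (0.0, 0.5), "hour": (0.0, 23.0),
}

def make_state(p, b, B, T_out, T_in, v, t):
    """Build s_t = (p_t, b_t, B_t, T_t^out, T_t, v_t, t') with t' = t mod 24,
    normalized component-wise to [0, 1]."""
    raw = dict(p=p, b=b, B=B, T_out=T_out, T_in=T_in, v=v, hour=t % 24)
    return np.array([
        (raw[k] - STATE_RANGES[k][0]) / (STATE_RANGES[k][1] - STATE_RANGES[k][0])
        for k in STATE_KEYS
    ], dtype=np.float32)

print(make_state(p=1.2, b=0.8, B=3.0, T_out=95.0, T_in=74.0, v=0.2, t=38))
```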
2) Action: The aim of HEMS agent is to optimally decide the amount of energy exchange between the smart home and
the utility grid (i.e., gt ), ESS charging power (i.e., ct ), ESS discharging power (i.e., dt ), and HVAC input power et . After ct ,
dt , and et are jointly decided, gt can be known immediately according to (8). Therefore, the action of the MDP consists of
ESS charging/discharging power ct /dt and HVAC input power et . Since adopting ct and dt simultaneously would complicate
the design of the energy management algorithm, we use just one variable f_t, where the range of f_t is [−d^max, c^max]. When f_t ≥ 0, c_t = f_t and d_t = 0. When f_t ≤ 0, c_t = 0 and d_t = f_t. Therefore, constraints (3)-(5) can be guaranteed. To guarantee the feasibility of (1)-(2), 0 ≤ c_t ≤ min{c^max, (B^max − B_t)/η_c} when f_t ≥ 0, and max{−d^max, (B^min − B_t)η_d} ≤ d_t ≤ 0 when f_t ≤ 0. According to (7), the range of e_t is [0, e^max]. When the indoor temperature T_t is lower than T^min, e_t should be zero to avoid further temperature deviation. Similarly, when T_t > T^max, the feasible e_t should be positive. For brevity,
at is used to describe the action, i.e., at = (ft , et ).
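The mapping from the raw action a_t = (f_t, e_t) to feasible ESS and HVAC commands described above can be sketched as follows. The clipping mirrors constraints (1)-(7); the capacity, rate, and rated-power values are illustrative, while the comfort bounds match the cooling-mode setting used later in the simulations.

```python
def decode_action(f_t, e_t, B_t, *, c_max=3.0, d_max=3.0,
                  B_min=0.0, B_max=6.0, eta_c=0.95, eta_d=0.95,
                  e_max=2.0, T_t=None, T_min=66.2, T_max=75.2):
    """Map f_t in [-d_max, c_max] to (c_t, d_t) and clip e_t to [0, e_max].

    The ESS bounds enforce (1)-(5); the HVAC clipping follows the text:
    for a cooling HVAC, e_t is forced to zero when T_t < T_min.
    """
    if f_t >= 0.0:      # charging branch
        c_t = min(f_t, c_max, (B_max - B_t) / eta_c)
        d_t = 0.0
    else:               # discharging branch (d_t <= 0)
        c_t = 0.0
        d_t = max(f_t, -d_max, (B_min - B_t) * eta_d)
    e_t = min(max(e_t, 0.0), e_max)
    if T_t is not None and T_t < T_min:
        e_t = 0.0       # avoid further cooling below the comfort band
    return c_t, d_t, e_t

print(decode_action(f_t=-2.5, e_t=1.4, B_t=1.0, T_t=78.0))  # (0.0, -0.95, 1.4)
```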
3) Reward: According to the MDP theory in [33], the transition of the environment state from st−1 to st could be triggered
by the execution of at−1 . Finally, the reward Rt will be obtained. Since the aim of the agent is to minimize the total energy
cost while maintaining the comfortable temperature range, the corresponding reward consists of three parts, namely the penalty for energy cost, the penalty for ESS depreciation, and the penalty for temperature deviation. Since the energy cost of the smart home at slot t−1 is C_{1,t−1}, the first part of R_t can be represented by −C_{1,t−1}(s_{t−1}, a_{t−1}). Similarly, the second part of R_t can be described by −C_{2,t−1}(s_{t−1}, a_{t−1}). To maintain the comfortable temperature range,
the third part of Rt can be computed by −C3,t (st ), where
C_{3,t}(s_t) = [T_t − T^max]^+ + [T^min − T_t]^+, ∀ t, (12)
which means that C3,t = 0 if T min ≤ Tt ≤ T max . Otherwise, C3,t = Tt − T max if Tt > T max , and C3,t = T min − Tt if
Tt < T min .
Taking the three parts into consideration, the final reward function can be designed as follows:
R_t = −β(C_{1,t−1}(s_{t−1}, a_{t−1}) + C_{2,t−1}(s_{t−1}, a_{t−1})) − C_{3,t}(s_t),
where β denotes a positive weight coefficient in °C/$.
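Putting the three penalty terms together, a minimal sketch of the reward computation is given below. The comfort bounds match the simulation setting in °F; β and the numerical values in the example are illustrative.

```python
def comfort_penalty(T_t, T_min=66.2, T_max=75.2):
    """C3 in (12): positive part of the temperature deviation outside [T_min, T_max]."""
    return max(T_t - T_max, 0.0) + max(T_min - T_t, 0.0)

def reward(C1_prev, C2_prev, T_t, beta=1.0):
    """R_t = -beta * (C1_{t-1} + C2_{t-1}) - C3_t(s_t); beta weights cost against comfort."""
    return -beta * (C1_prev + C2_prev) - comfort_penalty(T_t)

# Example: $0.30 of energy cost, $0.02 of ESS depreciation, indoor temperature 77 °F.
print(reward(C1_prev=0.30, C2_prev=0.02, T_t=77.0))   # -1.0*0.32 - 1.8 ≈ -2.12
```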
4) Action-Value Function: When jointly controlling the ESS and the HVAC system at time slot t, the HEMS agent intends
to maximize the expected return it receives over the future. In particular, the return is defined as the sum of the discounted rewards [33], i.e., R_t = Σ_{i=1}^{∞} γ^{i−1} R_{t+i}. Let Q^π(s, a) be the action-value function under a policy π (note that a policy is a mapping from states to probabilities of selecting each possible action), which represents the expected return if action a_t = a is taken in state s_t = s under the policy π. Then, the optimal action-value function Q^*(s, a) = max_π Q^π(s, a) and can be calculated by the following Bellman optimality equation in a recursive manner, i.e.,
Q^*(s, a) = E[R_{t+1} + γ max_{a′} Q^*(s_{t+1}, a′) | s_t = s, a_t = a]
         = Σ_{s′,r} P(s′, r | s, a)[r + γ max_{a′} Q^*(s′, a′)],
where s′ ∈ S, r ∈ R, a′ ∈ A, and P(s′, r | s, a) ∈ P.
To obtain Q∗ (s, a), system state transition probabilities P (s′ , r|s, a) are required. Since indoor temperature in the smart
home could be affected by many disturbances, it is difficult to accurately obtain state transition probabilities. To overcome this
challenge, Q-learning methods could be used, which do not require the knowledge of state transition probabilities. To support
the case with continuous system states, a function approximator could be adopted to estimate the Q-function. When a neural network with weights θ is adopted as the non-linear function approximator, we refer to it as a Q-network. In [15], a deep Q-network
(DQN) algorithm was proposed, which can use experience replay and target network to ensure the stability of reinforcement
learning methods when function approximators are adopted. However, DQN cannot be directly applied to problems with continuous action spaces, since it needs to discretize the action space, which leads to an explosion in the number of actions. As a result, low computational efficiency, decreased performance, and the requirement for more training data would be incurred
[16] [40].
A. Algorithmic Design
To solve the MDP problem defined in Section III-F, we propose a DDPG-based energy management algorithm. Different
from DQN, DDPG is capable of dealing with continuous states and actions. For example, just two network outputs are needed
to represent continuous actions in this paper, which avoids the explosion of the number of actions. Since DDPG is a kind of
actor-critic methods (i.e., methods that learn approximations to both the policy function and the value function), an actor network and a critic network are incorporated, as shown in Fig. 3. The input and output of the actor network are the environment state s_t and the action a, respectively. Then, a and s_t are adopted as the input of the critic network, whose output is the action-value function Q(s_t, a). Next, the policy gradient can be computed and used to update the weights of the actor network. Before computing Q(s_t, a), the weights of the critic network should be updated based on two mechanisms, i.e., memory replay and target networks.
More details will be introduced when explaining Algorithm 2.
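A minimal PyTorch sketch of the two networks is shown below. The hidden-layer counts follow Fig. 4 (two for the actor, four for the critic), while the layer widths, the tanh output squashing, and the simple state-action concatenation at the critic input are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 7, 2, 64   # s_t has 7 components, a_t = (f_t, e_t)

class Actor(nn.Module):
    """mu(phi(s)|theta_mu): two hidden layers (per Fig. 4); outputs in [-1, 1]^2,
    to be rescaled to [-d_max, c_max] and [0, e_max] outside the network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(phi(s), a|theta_Q): four hidden layers (per Fig. 4); the action is
    concatenated with the state at the input for simplicity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```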
The proposed DDPG-based energy management algorithm can be found in Algorithm 1, where the key step is to load the weights of the actor network θ^µ, which are trained by Algorithm 2. In each time slot, the actor network selects an action on ESS charging/discharging power and HVAC input power according to the current environment state s_t. Then, the action
at is executed and the environment state becomes st+1 . Meanwhile, the reward Rt+1 is obtained. In Algorithm 2, we first
initialize a replay memory D with capacity N, which stores the transition tuple (s_t, a_t, R_{t+1}, s_{t+1}). Moreover, a preprocessing function φ(s_t) is introduced to facilitate the learning process by normalizing the input data. Specifically, each component of the environment state at time slot t (e.g., κ_t) should be normalized to the range [0, 1] using the following expression: (κ_t − min_t κ_t)/(max_t κ_t − min_t κ_t). Then, we randomly initialize the critic network Q(φ(s), a|θ^Q) and the actor network µ(φ(s)|θ^µ) with weights θ^Q and θ^µ, respectively. Their architectures in the proposed energy management algorithm are described by Fig. 4, where there are two hidden layers in the actor network and four hidden layers in the critic network. Next, we initialize the weights of the target critic network Q(φ(s), a|θ^{Q′}) and the target actor network µ(φ(s)|θ^{µ′}) by copying, i.e., θ^{Q′} ← θ^Q and θ^{µ′} ← θ^µ. In each time slot of each episode, an action is selected based on the following expression in line 8, i.e.,
a_t = µ(φ(s_t)|θ^µ) + N_t, (13)
where N_t is the exploration noise. In this paper, we use the following way to introduce exploration noise, i.e.,
a_t = µ(φ(s_t)|θ^µ) if ω_t > ξ_t, and a_t = (U_{t,1}, U_{t,2}) if ω_t ≤ ξ_t, (14)
where ω_t, U_{t,1}, and U_{t,2} follow uniform distributions with parameters (0, 1), (−d^max/max{c^max, d^max}, c^max/max{c^max, d^max}), and (0, 1), respectively; ξ_t = max(ξ_t − ζ·(episode − N/P), ξ_min), ξ_0 = 1, and 0 < ζ < 1. After a_t is obtained, it will be
applied to ESS and the HVAC system. At the end of time slot t, the new state st+1 and the reward Rt+1 are returned from
the environment. Then, the transition tuple (φ(s_t), a_t, R_{t+1}, φ(s_{t+1})) will be stored in the memory for the training of the actor and critic networks, as shown in line 10. Next, K transitions are randomly sampled for training the deep neural networks, i.e., the actor network, critic network, target actor network, and target critic network. As shown in lines 12-14, Q(φ(s_i), a_i) and y_i generated by the critic network and the target networks are used to calculate the mean square error loss. By minimizing the loss function, the weights of the critic network can be updated. Then, we can calculate the sampled policy gradient as shown in line 15,
which is used to update the weight of actor network. Finally, the weights of target actor network and target critic network
could be updated as shown in lines 17-19. Note that a small τ should be selected in order to improve the learning stability.
Typically, 0 < τ ≪ 1.
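The core of one training iteration of Algorithm 2 as described above (target value, critic loss, policy gradient, soft target update) can be sketched as follows. It assumes the Actor/Critic classes and dimensions from the previous sketch; γ matches the simulation setting, while τ, the learning rates, and the toy mini-batch are illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_tgt, critic_tgt,
                actor_opt, critic_opt, batch, gamma=0.995, tau=0.001):
    """One DDPG update on a sampled mini-batch (s, a, r, s_next):
    form the target y with the target networks, minimize the MSE critic loss,
    ascend the sampled policy gradient, then softly update the target networks."""
    s, a, r, s_next = batch

    # Critic update: y_i = r + gamma * Q'(s_next, mu'(s_next))
    with torch.no_grad():
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(s, mu(s)), i.e., minimize -Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta', 0 < tau << 1
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_tgt, p in zip(tgt.parameters(), src.parameters()):
            p_tgt.data.mul_(1.0 - tau).add_(tau * p.data)

# Example wiring (K = 4 fake transitions just to exercise the function).
actor, critic = Actor(), Critic()
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
batch = (torch.rand(4, STATE_DIM), torch.rand(4, ACTION_DIM) * 2 - 1,
         torch.rand(4, 1), torch.rand(4, STATE_DIM))
ddpg_update(actor, critic, actor_tgt, critic_tgt, actor_opt, critic_opt, batch)
```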
The computational complexity of the proposed energy management algorithm in the testing process can be described by O(H_test). Given a fixed testing time horizon, a shorter time slot duration would result in a larger H_test. However, the time slot duration cannot be selected arbitrarily in practice due to the following reasons. On one hand, a too long duration would result in the loss of many control opportunities for saving energy cost and maintaining a comfortable temperature range. On the other hand, a too short duration may affect the training convergence of DRL-based algorithms, since the control actions taken by the DRL agent cannot take effect immediately in terms of environment states (e.g., indoor temperature) [17]. Therefore, the duration of a time slot should be selected appropriately in practice. In existing works, the
typical duration of a time slot is several minutes or one hour (e.g., 15 minutes [3], 1 hour [17]), which is far greater than the
computation time of the proposed energy management algorithm in a time slot. Therefore, the proposed energy management
algorithm can be implemented in a real-time way.
V. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed energy management algorithm. We first describe the simulation
setup. Then, we describe the baselines used for performance comparisons. Finally, we provide simulation results about the algorithmic convergence process, algorithmic performance under varying β, algorithmic effectiveness, and algorithmic robustness.
A. Simulation setup
In simulations, we use real-world traces related to solar generation, non-shiftable power demand, outdoor temperature, and
electricity price, which are extracted from the Pecan Street database1. Note that this database is the largest real-world open energy database on the planet and includes data related to home energy consumption and solar generation of the Mueller neighborhood in Austin, Texas, USA. For simplicity, the cooling mode of a residential HVAC system is considered. Since
summers in Austin are very hot2 , we use the data during the period from June 1 to August 31, 2018 for model training and
testing. To be specific, the data in June and July is used to train neural network models and the data in August is adopted for
performance testing. Some important system parameters are configured as follows: ut = 0.9vt [37], γ = 0.995, ηc = ηd = 0.95
[41], ζ = 0.0005, ξ_min = 0.1, T^min = 66.2°F (19°C) [3], T^max = 75.2°F (24°C) [3], other parameter configurations are
shown in TABLE I, where αa and αc denote the learning rate of actor network and critic network, respectively. In TABLE I,
Na and Nc denote the number of neurons in each hidden layer of actor network and critic network, respectively. To simulate the
environment, we adopt the following indoor temperature dynamics model for simplicity, i.e., T_{t+1} = εT_t + (1 − ε)(T_t^out − η_hvac e_t / A) [6] [7] [30] [31], where ε = 0.7 [42], η_hvac = 2.5 [30], and A = 0.14 kW/°F [30]. Note that a variant of the proposed energy management algorithm can be applied to any indoor temperature dynamics model by incorporating more environment-related variables in the system state, e.g., relative humidity and solar radiation intensity.
1 https://www.pecanstreet.org/
2 https://en.wikipedia.org/wiki/Austin, Texas#Climate
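The simulated environment's temperature update can be written directly from this model. The sketch below uses the stated values of ε, η_hvac, and A, assumes a cooling HVAC with temperatures in °F and e_t in kW, and exposes an optional disturbance term that anticipates the robustness tests; the example numbers are illustrative.

```python
import random

def indoor_temp_step(T_in, T_out, e_t, eps=0.7, eta_hvac=2.5, A=0.14, disturbance=0.0):
    """T_{t+1} = eps*T_t + (1 - eps)*(T_t^out - eta_hvac * e_t / A) + disturbance.

    Temperatures in °F, e_t in kW, A in kW/°F; eps, eta_hvac, A follow the
    simulation setup. 'disturbance' models the random term used in the robustness tests.
    """
    return eps * T_in + (1.0 - eps) * (T_out - eta_hvac * e_t / A) + disturbance

# One hot-afternoon slot: 1.2 kW of cooling roughly holds the indoor temperature.
print(indoor_temp_step(T_in=76.0, T_out=98.0, e_t=1.2))            # ≈ 76.2
print(indoor_temp_step(T_in=76.0, T_out=98.0, e_t=1.2,
                       disturbance=random.uniform(-1.8, 1.8)))     # robustness case
```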
TABLE I
MAIN PARAMETER SETTINGS
B. Baselines
To evaluate the performance of the proposed algorithm, we adopt three baselines as follows.
• Baseline1: this scheme adopts the ON/OFF policy [3] for building HVAC control but does not consider the use of the ESS. Specifically, the HVAC system will be turned on if T_t > T^max and it will be turned off if T_t < T^min (a minimal sketch of this policy is given after this list).
• Baseline2: this scheme uses the DDPG-based control policy in this paper for HVAC control but without considering the use
of the ESS, i.e., cmax = dmax = 0. Based on the performance comparison between Baseline2 and the proposed algorithm,
the energy cost saving caused by the use of the ESS can be known. Similarly, the energy cost saving incurred by the use
of the DDPG-based control policy can be obtained by comparing the performance of Baseline2 with that of Baseline1.
• Baseline3: this scheme intends to minimize the cumulative cost during the testing period H_test (i.e., Σ_{t=1}^{H_test}(C_{1,t} + C_{2,t}))
with the consideration of constraints (1)-(8), assuming that all uncertain system parameters and the dynamics model of
indoor temperature can be known beforehand. Although the optimal solution of this scheme is not achievable in practice
due to the existence of parameter and model uncertainties, it can provide the lower bound for the performance of the
proposed algorithm when all constraints in P1 are satisfied.
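As referenced in the Baseline1 item above, a minimal sketch of the ON/OFF policy is given here. The rated power e_max is an illustrative value, and holding the previous on/off state inside the dead band is an assumption about the policy's behavior between the two thresholds.

```python
def on_off_hvac(T_t, e_prev, T_min=66.2, T_max=75.2, e_max=2.0):
    """Baseline1 (cooling mode): run the HVAC at rated power when T_t > T_max,
    switch it off when T_t < T_min, and otherwise keep the previous command."""
    if T_t > T_max:
        return e_max
    if T_t < T_min:
        return 0.0
    return e_prev   # inside the dead band: hold the last on/off state

print(on_off_hvac(T_t=77.0, e_prev=0.0))   # 2.0 (turn on)
print(on_off_hvac(T_t=70.0, e_prev=2.0))   # 2.0 (stay on inside the band)
```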
C. Simulation Results
1) Algorithmic convergence process: According to Algorithm 1, the proposed energy management algorithm needs to know
the training result of Algorithm 2 before testing. In Fig. 5, the reward received during each episode generally increases.
Since the minimum exploration probability ξmin is 0.1 and system parameters (e.g., solar radiation power, non-shiftable power
demand, outdoor temperature, and electricity price) are varying in each episode, the episode reward fluctuates within a small
range. To show the changing trend of rewards more clearly, we provide the average value of the past 50 episodes. In Fig. 5,
it can be found that the average reward generally increases and becomes more and more stable.
2) Algorithmic performance under varying β: Since many random number generators are adopted in neural network initialization, mini-batch sampling for training, and action selection, the performance of the proposed algorithm varies even when the same system parameters are configured. To show the impact of β on the performance of the proposed algorithm
more clearly, mean values of total energy cost (i.e., the sum of energy cost and ESS depreciation cost) and total temperature
deviation with 95% confidence interval across 40 runs are considered and the corresponding results can be found in Fig. 6.
It can be observed that the mean values of total energy cost and total temperature deviation generally decrease and increase with the increase of β, respectively. Such a tendency is intuitive since a larger β places more importance on energy cost and less on temperature deviation. By taking the mean values of total energy cost and total temperature deviation into consideration, a proper value of β is 1, under which the mean value of total temperature deviation is less than 1°C.
3) Algorithmic effectiveness: Performance comparisons among the four schemes are shown in Fig. 7 (β = 0.6), where the proposed
energy management algorithm achieves better performance than Baseline1 and Baseline2. To be specific, the proposed energy
management algorithm can reduce the mean value of total energy cost by 15.21% and 8.10% when compared with Baseline1
and Baseline2, respectively. Moreover, the mean value of total temperature deviation under the proposed algorithm is smaller
than Baseline1 and Baseline2, which can be illustrated by Figs. 7(b) and (c). Compared with Baseline1, Baseline2 and the
proposed algorithm could save energy cost by increasing/decreasing HVAC input power when electricity price is low/high,
which can be depicted by Figs. 8(a) and (b). Compared with Baseline2, the proposed algorithm could reduce energy cost by
charging/discharging ESS when electricity price is low/high, which can be shown in Figs. 8(a) and (c). Though Baseline3
achieves the best performance, it requires all prior knowledge of uncertain system parameters and thermal dynamics model.
Thus, Baseline3 is adopted only as a performance reference. By observing the performance gap between the proposed algorithm and Baseline3, it can be seen that there is still considerable potential for reducing the mean value of total energy cost. In future work, more training data and more advanced DRL-based energy management algorithms would be adopted to reduce this performance gap.
4) Algorithmic robustness: Note that the thermal dynamics model used in the above-mentioned simulations cannot capture thermal disturbances in practice, e.g., thermal disturbances from solar irradiance, lighting systems, and computers. Thus, we evaluate the robustness of the proposed algorithm when a random thermal disturbance is introduced. To be specific, T_{t+1} = εT_t + (1 − ε)(T_t^out − η_hvac e_t / A) + ǫ_t [10], where the error term ǫ_t is assumed to follow a uniform distribution over [ϑ_l, ϑ_u] °F. In this scenario, three cases are considered, i.e., ϑ_u = −ϑ_l = 1.8, 3.6, 5.4. In Fig. 9, it can be observed that the proposed algorithm achieves better performance than Baseline1 in all three cases. Compared with Baseline3, the proposed
algorithm can save up to 10% of the total energy cost with a small increase in the total temperature violation. Moreover, unlike Baseline3, the proposed algorithm does not require any prior knowledge of uncertain parameters or the thermal dynamics model.
Therefore, the proposed algorithm has the potential of providing a more efficient and practical tradeoff between maintaining
thermal comfort and reducing energy cost than Baseline3.
VI. CONCLUSION
In this paper, we proposed a DDPG-based energy management algorithm for a smart home to efficiently control HVAC
systems and energy storage systems in the absence of a building thermal dynamics model, with the consideration of a
comfortable temperature range and many parameter uncertainties. Extensive simulation results based on real-world traces
showed the effectiveness and robustness of the proposed algorithm. In future work, more reasonable thermal comfort models
and more types of controllable loads (e.g., electric vehicles, electric water heaters) would be incorporated. In addition, more opportunities for saving energy cost could be captured by utilizing real-world occupant behavior information [43], which requires the adoption of more advanced deep neural network architectures/algorithms.
REFERENCES
[1] S. Wu, J. Rendall, M. Smith, S. Zhu, J. Xu, Q. Yang, H. Wang, and P. Qin, “Survey on prediction algorithms in smart homes,” IEEE Internet of Things
Journal, vol. 4, no. 3, pp. 636-644, June 2017.
[2] A. Afram and F. Janabi-Sharif, “Effects of dead-band and set-point settings of on/off controllers on the energy consumption and equipment switching
frequency of a residential HVAC system,” Journal of Process Control, vol. 47, pp. 161-174, 2016.
[3] T. Wei, Y. Wang, and Q. Zhu, “Deep reinforcement learning for building HVAC control,” The 54th Annual Design Automation Conference, 2017.
[4] F. Angelis, M. Boaro, D. Fuselli, S. Squartini, F. Piazza, and Q. Wei, “Optimal home energy management under dynamic electrical and thermal constraints,”
IEEE Trans. on Industrial Informatics, vol. 9, no. 3, pp. 1518-1527, Aug. 2013.
[5] W. Fan, N. Liu, and J. Zhang, “An event-triggered online energy management algorithm of smart home: lyapunov optimization approach,” Energies, vol.
9, no. 5, pp. 381-404, 2016.
[6] D. Zhang, S. Li, M. Sun, and Z. O’Neill, “An optimal and learning-based demand response and home energy management system,” IEEE Trans. on Smart
Grid, vol. 7, no. 4, pp. 1790-1801, July 2016.
[7] V. Pilloni, A. Floris, A. Meloni and L. Atzori, “Smart home energy management including renewable sources: a QoE-driven approach,” IEEE Trans. on
Smart Grid, vol. 9, no. 3, pp. 2006-2018, May 2018.
[8] L. Yu, T. Jiang, and Y. Zou, “Online energy management for a sustainable smart home with an HVAC load and random occupancy,” IEEE Trans. on
Smart Grid, vol. 10, no. 2, pp. 1646-1659, March 2019.
[9] M. Shad, A. Momeni, R. Errouissi, C.P. Diduch, M.E. Kaye, and L. Chang, “Identification and estimation for electric water heaters in direct load control
programs”, IEEE Trans. on Smart Grid, vol. 8, no. 2, pp. 947-955, Nov. 2015.
[10] E.C. Kara, M. Bergés, and G. Hug, “Impact of disturbances on modeling of thermostatically controlled loads for demand response”, IEEE Trans. on
Smart Grid, vol. 6, no. 5, pp. 2560-2568, Nov. 2015.
[11] M. Franceschelli, A. Pilloni, and A. Gasparri, “A heuristic approach for online distributed optimization of multi-agent networks of smart sockets and
thermostatically controlled loads based on dynamic average consensus”, 2018 European Control Conference, 2018.
[12] R. Lu, S. Hong, and M. Yu, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Trans.
on Smart Grid, DOI: 10.1109/TSG.2019.2909266, 2019.
[13] F. Ruelens, B. Claessens, S. Vandael, B. Schutter, R. Babuška, and R. Belmans, “Residential demand response of thermostatically controlled loads using
batch reinforcement learning,” IEEE Trans. on Smart Grid, vol. 8, no. 5, pp. 2149-2159, Sept. 2017.
[14] J. Vázquez-Canteli, and Z. Nagy, “Reinforcement learning for demand response: A review of algorithms and modeling techniques,” Applied Energy,
vol. 235, pp. 1072-1089, 2019.
[15] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-541, 2015.
[16] G. Gao, J. Li, Y. Wen, “Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning,” arXiv:1901.04693v1, 2019.
[17] Z. Zhang and K.P. Lam, “Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system,” The 5th ACM
International Conference on Systems for Built Environments, 2018.
[18] W. Valladares, M. Galindo, J. Gutiérrez, W. Wu, K. Liao, J. Liao, K. Lu, and C. Wang, “Energy optimization associated with thermal comfort and indoor
air control via a deep reinforcement learning algorithm,” Building and Environment, vol. 155, pp. 105-117, 2019.
[19] Z. Wan, H. Li, and H. He, “Residential energy management with deep reinforcement learning,” International Joint Conference on Neural Networks
(IJCNN), 2018.
[20] A.I. Nousdilis, E.O. Kontis, G.C. Kryonidis, G.C. Christoforidis, and G.K. Papagiannis, “Economic assessment of lithium-ion battery storage systems
in the nearly zero energy building environment,” 20th International Symposium on Electrical Apparatus and Technologies, 2018.
[21] M. Yousefi, A. Hajizadeh, and M. Soltani, “A comparison study on stochastic modeling methods for home energy management system,” IEEE Trans.
on Industrial Informatics, DOI: 10.1109/TII.2019.2908431, 2019.
[22] B. Yang, X. Cheng, D. Dai, T. Olofsson, H. Li, and A. Meier, “Real-time and contactless measurements of thermal discomfort based on human poses
for energy efficient control of buildings,” Building and Environment, vol. 162, pp. 1-10, 2019.
[23] X. Cheng, B. Yang, A. Hedman, T. Olofsson, H. Li, and L. Gool, “A pilot study of online non-invasive measuring technology based on video magnification
to determine skin temperature,” Building and Environment, vol. 198, pp. 340-352, 2019.
[24] X. Cheng, B. Yang, T. Olofsson, G. Liu, and H. Li, “A pilot study of online non-invasive measuring technology based on video magnification to determine
skin temperature”, Building and Environment, vol. 121, pp. 1-10, 2017.
[25] Y. Wang, and Z. Lian, “A thermal comfort model for the non-uniform thermal environments,” Energy and Buildings, vol. 172, pp. 397-404, 2018.
[26] W. Li, J. Zhang, T. Zhao, and R. Liang, “Experimental research of online monitoring and evaluation method of human thermal sensation in different
active states based on wristband device,” Energy and Buildings, vol. 173, pp. 613-622, 2018.
[27] L. Yang, Z. Zheng, J. Sun, D. Wang, and X. Li, “A domain-assisted data driven model for thermal comfort prediction in buildings,” The ninth ACM
International Conference on Future Energy Systems, 2018.
[28] L. Yu, D. Xie, T. Jiang, Y. Zou, and K. Wang, “Distributed real-time hvac control for cost-efficient commercial buildings under smart grid environment,”
IEEE Internet of Things Journal, vol. 5, no. 1, pp. 44-55, Feb. 2018.
[29] H. Xu, X. Li, X. Zhang, and J. Zhang, “Arbitrage of energy storage in electricity markets with deep reinforcement learning,” arXiv:1904.12232v1, 2019.
[30] P. Constantopoulos, F. C. Schweppe, and R. C. Larson, “Estia: A realtime consumer control scheme for space conditioning usage under spot electricity
pricing,” Computers & Operations Research, vol. 18, no. 8, pp. 751-765, 1991.
[31] A.A. Thatte and L. Xie, “Towards a unified operational value index of energy storage in smart grid environment,” IEEE Trans. on Smart Grid, vol. 3,
no. 3, pp. 1418-1426, Sep. 2012.
[32] Z. Zhang, A. Chong, Y. Pan, C. Zhang, S. Lu, and K. Lam, “A deep reinforcement learning approach to using whole building energy model for hvac
optimal control,” 2018 Building Performance Modeling Conference and SimBuild co-organized by ASHRAE and IBPSA-USA, 2018.
[33] R.S. Sutton and A.G. Barto, “Reinforcement learning: an introduction,” The MIT Press, London, England, 2018.
[34] J. Schmidhuber, “Reinforcement learning in Markovian and non-Markovian environments,” Proceedings of the 3rd International Conference on Neural
Information Processing Systems, 1990.
[35] J. Perez and T. Silander, “Non-Markovian control with gated end-to-end memory policy networks,” arXiv:1705.10993v1, 2017.
[36] S. Padakandla, K.J. Prabuchandran and S. Bhatnagar, “Reinforcement learning in non-stationary environments,” https://arxiv.org/pdf/1905.03970.pdf
[37] L. Yu, T. Jiang and Y. Cao, “Energy cost minimization for distributed internet data centers in smart microgrids considering power outages,” IEEE Trans.
on Parallel and Distributed Systems, vol. 26, no. 1, pp. 120-130, Jan. 2015.
[38] L. Yu, T. Jiang, and Y. Zou, “Distributed real-time energy management in data center microgrids,” IEEE Trans. on Smart Grid, vol. 9, no. 4, pp.
3748-3762, July 2018.
[39] Y. Zhang, N. Gatsis, and G. B. Giannakis, “Robust management of distributed energy resources for microgrids with renewables,” IEEE Trans. on
Sustainable Energy, vol. 4, no. 4, pp. 944-953, Oct. 2013.
[40] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,”
International Conference on Learning Representations, 2016.
[41] Y. Xu, L. Xie, and C. Singh, “Optimal scheduling and operation of load aggregator with electric energy storage in power markets,” North American
Power Symposium 2010, 2010.
[42] R. Deng, Z. Zhang, J. Ren, and H. Liang, “Indoor temperature control of cost-effective smart buildings via real-time smart grid communications,” IEEE
Globecom, 2016.
[43] S. Chen, T. Liu, F. Gao, J. Ji, Z. Xu, B. Qian, H. Wu, and X. Guan, “Butler, not servant: A human-centric smart home energy management system,”
IEEE Communications Magazine, vol. 55, no. 2, pp. 27-33, Feb. 2017.