
Deep reinforcement learning for energy management

in a microgrid with flexible demand


Taha Abdelhalim Nakabi and Pekka Toivanen
University of Eastern Finland, School of Computing, Kuopio Campus, P.O. Box 1627, 70211 Kuopio, Finland
[email protected], [email protected]

ABSTRACT

In this paper, we study the performance of various deep reinforcement learning algorithms to enhance the energy management system of a microgrid. We propose a novel microgrid model that consists of a wind turbine generator, an energy storage system, a set of thermostatically controlled loads, a set of price-responsive loads, and a connection to the main grid. The proposed energy management system is designed to coordinate among the different flexible sources by defining the priority resources, direct demand control signals, and electricity prices. Seven deep reinforcement learning algorithms were implemented and are empirically compared in this paper. The numerical results show that the deep reinforcement learning algorithms differ widely in their ability to converge to optimal policies. By adding an experience replay and a second semi-deterministic training phase to the well-known asynchronous advantage actor-critic algorithm, we achieved the highest model performance as well as convergence to superior policies in terms of energy efficiency and economic value.

Keywords:
Artificial intelligence, Deep reinforcement learning, Demand response, Dynamic pricing, Energy management system, Microgrid, Neural networks, Price-responsive loads, Smart grid, Thermostatically controlled loads.

1. INTRODUCTION

The ongoing transformation of the power network is mainly related to the transition from conventional centralized energy resources to distributed energy resources (DERs) that have low impacts on the environment. This transition requires innovative solutions to deal with the challenges arising from the intermittent nature of renewable energy resources. Smart grid technologies such as advanced metering infrastructures, energy storage systems (ESSs), and home energy management systems are being deployed around the globe to support this transformation. Microgrid systems use these technologies alongside DERs to efficiently meet the local power demand and support the decentralization of the power supply. Microgrids are usually low-voltage networks with a limited local electricity supply and demand compared with the main grid. Microgrids can either operate in parallel with the grid, buying and selling energy through the electricity market, or autonomously, using local generation and storage [1]. Therefore, they offer technical and economic benefits, including system reliability, local energy delivery, and additional sources of capital investment for the DERs.

In order to maintain the reliability of the microgrid, two levels of control are required. The lower-level control consists of regulating the electricity voltage and current, and the frequency of the power grid, typically achieved at the power electronics interface. The higher-level control consists of an energy management system (EMS) that maintains the energy reserve, maximizing the overall system efficiency and optimizing the dispatch of local resources. Because of the nature of the microgrid, the EMS faces major challenges, primarily related to the small scale, volatility, uncertainty, and intermittency of DERs, as well as the demand uncertainty and the dynamic electricity market prices. To overcome these challenges, further improvements in microgrid architecture and control are required. On the architecture level, additional sources of flexibility must be exploited to balance the high volatility of DERs. In addition, new control mechanisms and intelligent control methods are needed to optimize the energy dispatch and overcome the uncertainties of the microgrid components.

Typically, microgrid components include DERs, electric loads, and an ESS. The DERs consist of renewable energy resources, typically based on wind turbines [2] or solar PV [3], and commonly backed up by an energy generator using a natural gas [4] or diesel engine [5]. The emerging interest in DERs stems from their potential to reduce the inconveniences of centralized energy production. In addition, DERs can provide the microgrid with high autonomy, leading to less dependency on traditional, high-carbon-emitting energy resources. The electric load components of a microgrid can be either residential loads [6] or industrial loads [7]. The ESSs are typically based on batteries and can either be distributed in the microgrid [8] or centralized [9].

Several innovative components have been presented in the literature to improve the reliability and flexibility of microgrids. Innovative electric loads, which can offer high demand-side flexibility, have been proposed in
several works. These include directly controllable loads [10], thermostatically controlled loads (TCLs) [6], [11], [12], price-responsive loads [11], [13], and electric vehicles [7], [14]. However, the combination of these demand-side flexibility sources in a microgrid has been rarely studied in the literature. In this paper, we propose to increase the flexibility in demand by combining groups of TCLs and price-responsive loads participating in a demand response (DR) program, alongside a shared ESS, a wind power resource, and a connection to the main grid. TCLs can provide significant flexibility due to their thermal conservation of energy [15], [16], whereas price-responsive loads can offer more flexibility by shifting their consumption to periods of high energy production [17].

On the control level, we differentiate between two categories of EMS, namely, model-based EMS and model-free EMS. In model-based approaches, an explicit model is used to formulate the dynamics of the microgrid and the different interactions between its components. The uncertainties are estimated using a predictor, and the control problem is solved using a scheduling optimizer. Model predictive control is the most commonly and successfully used algorithm in the literature [18]–[20], consisting of repeated optimizations of the predictive model over a progressing time period. Model-based approaches rely heavily on domain expertise for constructing accurate models and parameters for a microgrid. Therefore, model-based approaches are neither transferable nor scalable, which leads to high development costs. Furthermore, if the uncertainties in the microgrid change over time, the model, predictor, and solver must be redesigned correspondingly, which significantly increases the maintenance costs.

Model-free or data-driven approaches require identifying the optimal control strategy and uncertainties in the microgrid from its operational data. Learning-based methods have been introduced in recent years as an alternative to model-based approaches, as they can reduce the need for an explicit system model, improve the EMS scalability, and reduce the maintenance costs of the EMS [21]. One of the most promising learning-based EMS methods is the reinforcement learning (RL) paradigm [22], in which an agent learns the dynamics of the microgrid by interacting with its components. Several works have proposed successful implementations of reinforcement learning-based EMSs in different microgrid architectures, either within a single-agent [23]–[25] or multi-agent framework [26]–[29]. However, the basic and most popular RL methods, such as Q-learning [30], face several challenges related to inefficient data usage, high dimensionality, state space continuity, and transition function uncertainty. A batch RL algorithm was proposed in [23] to overcome the problem of inefficient data usage using a batch of past experiences. In [28], a fuzzy Q-learning algorithm was proposed to cope with the continuous nature of the state and action spaces. In [31], an extremely randomized trees algorithm was proposed as a regression algorithm for state-action pairs, to solve the problem of uncertainty. Deep reinforcement learning (DRL) methods, however, use artificial neural networks as function approximators, capable of learning continuous state-action transitions under uncertainty [32]. The neural networks enable the use of continuous and high-dimensional state spaces and can extract hidden features from the state space. This enables the DRL agent to overcome the uncertainty and partial observability of the environment [33].

Driven by the recent successes of DRL in solving complex tasks, such as the superhuman performance achieved using AlphaZero in many challenging games [34], several works have shown interest in DRL applications for microgrid control problems. Based on the control mechanisms, specifically the effect of the control actions on the components of the microgrid, two approaches to DRL-based EMSs can be distinguished in the literature. The first approach consists of simple control mechanisms that manage individual components of the microgrid. Optimal management of the ESS under uncertainties in electricity consumption and production has been proposed using a variety of methods, including deep Q-learning (DQN) [35], SARSA [36], and double DQN [37]. In [38], a batch DQN was proposed to optimally control a cluster of TCLs. The second approach revealed in the literature consists of jointly managing multiple components of the microgrid using complex objective functions. In this approach, the DRL algorithm is based on a complex action space, which combines the actions related to each component. DQN algorithms have been used to optimize local energy trading and sharing by either controlling battery charging/discharging and buying/selling operations with the main grid [39] or managing the consumer's energy sharing options [40]. Another energy exchange strategy, based on the energy internet concept, was proposed in [41], in which the control mechanisms included an A3C algorithm for management of the backup generators, fuel cells, and ESS. A real-time energy management approach using a DQN algorithm was proposed in [42] to jointly schedule backup generator utilization and manage ESS and grid operations, whereas [43] proposed jointly controlling the hydrogen storage, diesel generation, and ESS using the same algorithm. However, the control mechanisms adopted in these works did not consider DR programs as flexibility providers in the microgrid. DR acts as an indirect control mechanism, which provides optimal incentives, or price signals, that steer the consumption toward periods of high production.

Furthermore, several works have proposed RL methods to control the consumption and exploit demand flexibility in smart grids [44]–[46]. However, the combination of DR programs and direct control mechanisms, such as TCL control, in the EMS of a microgrid is still absent from the literature. Therefore, we propose a novel microgrid EMS architecture that
maximizes flexibility by combining direct control of typical microgrid components, direct control of a TCL cluster, and indirect control of price-responsive loads through price-based DR. Moreover, the abovementioned DRL methods focused mainly on DQN algorithms and rarely investigated the recent policy gradient and actor-critic algorithms. Additionally, no thorough comparison of the algorithm performances has been reported in the context of a microgrid EMS. Therefore, in this paper, we present and compare the performances of seven state-of-the-art DRL algorithms and two baseline control methods. We also propose two improvements on the A3C and proximal policy optimization (PPO) methods, which demonstrate better performances than the other algorithms proposed in this study. The algorithms are tested in different scenarios using a realistic microgrid simulation, based on real electricity price and renewable energy production data from Finland. The performance of each algorithm is evaluated through the lenses of energy efficiency and cost reduction. Other issues addressed in this paper include DRL algorithm overfitting and the premature convergence to suboptimal deterministic policies.

The major contributions of this study are as follows:
· A novel microgrid model that includes TCLs and price-responsive loads as flexibility resources alongside the typical microgrid components.
· A novel control mechanism that combines direct control of TCLs with indirect control of price-responsive loads, alongside priority management of the ESS and main grid in case of energy deficiency or excess.
· An MDP formulation of the control problem, considering the various control actions, along with a multi-objective reward function that considers the energy cost, return on operations, ESS state-of-charge, and pricing constraints.
· A comprehensive and numerical comparison of value-based DRL algorithms (DQN, Double DQN, SARSA) and policy-based algorithms (REINFORCE, Actor-critic, A3C, PPO).
· Novel variations of the A3C and PPO algorithms, which incorporate an additional experience replay to avoid inefficient data usage and destructive searching, and a second semi-deterministic training to exploit the optimal local policies. The proposed variations outperformed the abovementioned methods on energy efficiency and cost reduction bases.

The remainder of this paper is organized as follows: In section 2, we present the microgrid model and the EMS control mechanisms. In section 3, we formulate the problem as a Markov decision process. Section 4 presents a theoretical framework for the DRL algorithms used in this study. Experimental implementation details are presented in section 5. Numerical results for each method and their numerical comparisons are presented and discussed in Section 6. Section 7 presents our conclusions.

Figure 1: The proposed microgrid architecture.

2. MICROGRID MODEL AND PROBLEM FORMULATION

This study considers a microgrid with an independent supply and demand infrastructure. The microgrid is managed by an aggregator or a utility company that is responsible for supplying the electricity to meet the local demand. The microgrid has its own wind-turbine-based DER but is also connected to the main grid, through which it continuously buys or sells energy on the electricity markets.

The architecture of the microgrid is illustrated in Figure 1. It consists of three layers: the physical, information, and control layers. The physical layer includes a wind-based DER, a communal ESS, a group of TCLs, and a group of residential price-responsive loads. The information layer consists of the smart meters and the system for two-way communication between each of the individual components and the EMS. Information such as electricity prices, battery states of charge, and energy generation is transferred through this layer. The control layer represents the infrastructure through which the EMS sends control signals to the controllable components of the grid. As illustrated in Figure 1, there are three direct control points, namely, the TCL on/off control, ESS charge/discharge control, and energy grid buy/sell control. Given this architecture, we modeled our microgrid as a multi-agent system in which each component operated as an autonomous agent, interacting with the environment and other agents. The simple or complex behavior of each component was
governed by an internal model. In the following sections, we present the models adopted for each component of the microgrid.

Figure 2: The control mechanisms used in the microgrid management system.

2.1 Energy storage system model (ESS)

For technical and economic convenience [47], we adopted a community ESS instead of individual household battery storages. The utilized ESS is capable of covering up to 40% of the energy demand of the microgrid for 1 h. At each time step t, the storage dynamics of the ESS were modeled by

B_{t+1} = B_t + η_c c_t − d_t / η_d,   (1)

where B_t ∈ [0, B_max] is the stored energy in the ESS at time t, B_max is the ESS maximum capacity, and (η_c, η_d) ∈ ]0,1] are the charging and discharging efficiency coefficients, respectively. The variables c_t ∈ [0, C_max] and d_t ∈ [0, D_max] are the charging and discharging powers, respectively, which are constrained by the ESS charging and discharging rate limitations C_max and D_max. We also defined the ESS state-of-charge variable as

SoC^ESS_t = B_t / B_max.   (2)

The behavior of the ESS in response to the charge/discharge control signals is represented by the energy provided to and requested from the batteries. In the case of a charging signal, the ESS agent receives a power rate for storage in the batteries, verifies the feasibility of the charging operations (based on maximum capacity and maximum charging rate), stores the energy accordingly, and returns the remaining power to be sold to the main grid. Similarly, in the discharging case, the ESS agent receives a power request from the grid, verifies the supply conditions, and returns the available power accordingly. If the requested power cannot be completely supplied by the ESS, the difference is automatically supplied from the main grid.

Figure 3: The intermediary role of the TCL aggregator.

2.2 Distributed energy resource model (DER)

The microgrid in this study is considered to be equipped with wind turbines capable of generating varying energy quantities, depending on the weather conditions. Instead of using a model for the energy generation, we utilized real wind energy production data from Finland [48]. The DER agent shares the information about the current energy generation G_t with the EMS and supplies the energy generated directly to the local grid.

2.3 Main electricity grid

The microgrid is connected to a main grid that acts as a regulation reserve. The supply and demand in the microgrid cannot be balanced using the DERs alone, due to the intermittent and uncontrollable nature of these resources. The main electricity grid can instantly supply power to the microgrid in the case of an energy deficiency or accept the excess power in the case of surplus. The transactions between the main grid and the microgrid happen in real time using the up-regulation and down-regulation market prices. The main grid agent shares the real-time up- and down-regulation prices, represented respectively as (p↑_t, p↓_t), with the EMS. In the model, we implemented real up- and down-regulation price data from the balancing electricity market in Finland provided in [49]. To define the priority supply source in the case of a deficiency and the priority power discharge source in the case of an excess, the EMS controls only the electrical switch to the main grid, as illustrated in Figure 2. After each time step, the EMS receives information on the energy E_t purchased or sold to the main grid, where positive values indicate purchased energy and negative values sold energy.
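To make the ESS behavior of section 2.1 concrete, the following is a minimal Python sketch of the bookkeeping in equations (1) and (2) together with the feasibility checks described verbally above. It is an illustration only: the class, method, and variable names are ours, and the logic follows the verbal description rather than the authors' actual simulation code.

class SimpleESS:
    # Minimal community battery model following equations (1)-(2).
    # With one-hour steps, power (kW) and energy (kWh) coincide numerically.
    def __init__(self, capacity=400.0, c_max=200.0, d_max=200.0, eta_c=0.9, eta_d=0.9):
        self.capacity = capacity    # B_max
        self.c_max = c_max          # maximum charging rate C_max
        self.d_max = d_max          # maximum discharging rate D_max
        self.eta_c = eta_c          # charging efficiency
        self.eta_d = eta_d          # discharging efficiency
        self.energy = 0.0           # stored energy B_t

    @property
    def soc(self):
        # Equation (2): state of charge relative to capacity.
        return self.energy / self.capacity

    def charge(self, power):
        # Store as much of `power` as feasible; return the leftover power
        # that the EMS can sell to the main grid.
        rate_limited = min(power, self.c_max)
        room = (self.capacity - self.energy) / self.eta_c
        accepted = min(rate_limited, room)
        self.energy += self.eta_c * accepted       # B_{t+1} = B_t + eta_c * c_t
        return power - accepted                    # surplus returned to the grid

    def discharge(self, request):
        # Supply as much of `request` as feasible; return the shortfall
        # that must be purchased from the main grid.
        rate_limited = min(request, self.d_max)
        available = self.energy * self.eta_d
        delivered = min(rate_limited, available)
        self.energy -= delivered / self.eta_d      # B_{t+1} = B_t - d_t / eta_d
        return request - delivered                 # deficit covered by the grid

For example, with the default parameters, ess.charge(250.0) would accept the rate-limited 200 kW, store 180 kWh after the 0.9 charging efficiency, and report the remaining 50 kW as surplus to be sold to the main grid, mirroring the charging behavior described above.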
2.4 Thermostatically controlled loads (TCLs)

A cluster of TCLs can provide a significant source of flexibility because of their thermal conservation of energy. We assumed that most of the households in the microgrid were equipped with a TCL, such as an air conditioner, heat pump, water heater, or refrigerator. These TCLs are directly controllable at each time step t, using control signals from the TCL aggregator. In order to preserve end user comfort levels, each TCL is equipped with a backup controller that maintains the temperatures in an acceptable range. The backup controller receives the on/off action u_{i,t} from the TCL aggregator, verifies the temperature constraints, and modifies the action as follows:

u^b_{i,t} =  0         if T_{i,t} > T_max
             u_{i,t}   if T_min < T_{i,t} < T_max     (3)
             1         if T_{i,t} < T_min

where u^b_{i,t} is the final on/off action after the decision of the backup controller; T_{i,t} is the operational temperature of TCL i at time t; and T_max and T_min are the upper and lower temperature boundaries set by the end user, respectively. The temperature dynamics of each TCL were modeled using a second-order model based on [38]:

Ṫ_{i,t} = (1/C_a) [(T_{m,i,t} − T_{i,t}) + (T^out_t − T_{i,t}) + q_i + P_i u^b_{i,t}],
Ṫ_{m,i,t} = (1/C_m) (T_{i,t} − T_{m,i,t}),     (4)

where T_{i,t} is the measured indoor air temperature; T_{m,i,t} is the non-observable building mass temperature; T^out_t is the outdoor temperature; C_a and C_m are the thermal masses of the air and the building materials, respectively; q_i is the internal heating in the building; and P_i is the nominal power of the TCL. Finally, a state-of-charge measure SoC_{i,t}, which determines the relative position of T_{i,t} in the desired temperature range, was defined for each TCL as

SoC_{i,t} = (T_{i,t} − T_min) / (T_max − T_min).   (5)

2.5 Residential price-responsive loads

The residential loads represent the electricity demand from households in the microgrid that cannot be directly controlled. We assumed that these loads follow a daily pattern, with a variable component that can be affected by the electricity prices. In the model, each household i has a sensitivity factor ε_{i,t} ∈ ]0,1] that determines its response to the price variations. The electric load of household i at time t was modeled using the following equations:

L_{i,t} = L^b_{i,t} + L^f_{i,t},   (6)

L^f_{i,t} = ε_{i,t} β_i (p̄ − p_t),   (7)

where L^b_{i,t} > 0 indicates the basic load, which follows the daily consumption pattern [50]. This pattern can be deduced from the average daily consumption curve of the residential area in which the microgrid is implemented. The variable L^f_{i,t} denotes the flexible component of the household consumption, which is positive in low price scenarios and negative in high price scenarios. Furthermore, β_i > 0 is the variable load of household i, whereas p_t and p̄ are the price level at time t and the medium price level determined by the microgrid manager, respectively. Finally, the sensitivity of each customer was considered to decrease every time they reduce their power consumption in response to high prices, according to

ε_{i,t+1} =  ε_{i,t}                     if p_t − p̄ < 0
             max(ε_{i,t} − 0.1, 0.1)     if p_t − p̄ > 0     (8)

This sensitivity updating method is important in describing user behavior in response to electricity prices. The users will most likely shift their consumption only a few times per day, after which their price sensitivity will decrease if prices continue to be high. This represents a notably simplistic model of price-responsive loads. However, this model is only used herein as a proof of concept for DRL methods that can learn an abstract representation of the residential load price responsiveness based only on the reward feedback.

2.6 EMS agent

The proposed EMS agent uses the information provided by the different grid components and the observable environment to determine the optimal supply/demand balancing strategy. The agent performs overall management of the microgrid using four control mechanisms: TCL direct control, price level control, energy deficiency actions, and energy excess actions. These mechanisms are illustrated in Figure 2 and detailed in the following sections.

A. TCL direct control

At each time step t, the EMS agent allocates a certain amount of energy for use in TCL operations. This energy is then dispatched through an intermediate agent, the TCL aggregator, to the individual TCLs. Based on the energy allocation issued by the EMS agent, the aggregator determines the on/off actions of each TCL based on their SoC priority; TCLs with the lowest SoC values are served before TCLs with higher SoCs. The TCL aggregator also operates as an information aggregator, by communicating the real-time average SoC of the TCL cluster to the EMS agent, as illustrated in Figure 3. It is worth noting that because of the backup controller at each TCL, the allocated energy is not always equal to the actual energy used by the TCLs.
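The TCL-side rules of equations (3) and (5) and the aggregator's lowest-SoC-first dispatch can be sketched as follows. This is a hedged illustration: the function names, the heating-mode orientation of the backup rule, and the use of a fixed nominal power per TCL are our assumptions, not the authors' code.

import numpy as np

def backup_controller(T_in, u, T_min=19.0, T_max=25.0):
    # Equation (3): override the aggregator's on/off action when the
    # indoor temperature leaves the comfort band (heating-mode TCL).
    if T_in > T_max:
        return 0              # too warm: force off
    if T_in < T_min:
        return 1              # too cold: force on
    return u                  # inside the band: keep the requested action

def tcl_soc(T_in, T_min=19.0, T_max=25.0):
    # Equation (5): relative position of T_in inside the comfort band.
    return (T_in - T_min) / (T_max - T_min)

def dispatch_tcls(energy_budget, tcl_socs, P_nominal=1.5):
    # TCL aggregator: switch TCLs on in increasing SoC order until the
    # energy allocated by the EMS for this hour is exhausted.
    actions = np.zeros(len(tcl_socs), dtype=int)
    for i in np.argsort(tcl_socs):            # lowest SoC served first
        if energy_budget < P_nominal:
            break
        actions[i] = 1
        energy_budget -= P_nominal
    return actions

In the simulation, the individual backup controllers act after this dispatch, which is why the energy actually consumed by the cluster can differ from the allocation issued by the EMS.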
B. Pricing mechanism

The EMS agent determines the pricing level to use at every time step in order to exploit the household price elasticity. For practical reasons related to the action space discontinuity discussed in the next section, we consider that the DR program is based on discrete price levels [51]. The agent decides which price level p_t to use at each time step t. The price level serves as a tool for shifting the loads from peak periods toward periods with more power availability. The prices are also used to compensate for the costs of energy bought from the grid. However, the agent should not raise the prices for long periods of the day. Hence, we implemented the constraint that the cumulative sum of the price levels at any time step should not exceed the corresponding sum of the medium prices over the control period. We defined a pricing counter ρ_t to control the pricing levels as

ρ_t = Σ_{k=0}^{t} p_k ≤ (t + 1) p̄,   ∀ t ∈ {0, 1, …, D − 1},   (9)

where D is the control period of one episode.

C. Energy deficiency action

When the local DERs are not able to meet the demand, the local microgrid can either use the energy stored in the ESS or purchase energy from the main grid and save the ESS energy for later use. At each time step, the EMS agent sets the usage priority among these two resources. Consequently, when there is a voltage drop in the microgrid, the energy can be supplied automatically from the priority resource. In the event that the priority resource is the ESS and the required energy cannot be entirely fulfilled, the remaining demand is automatically supplied from the main grid.

D. Energy excess action

The energy generated by the local DERs can also exceed the demand. In this case, the excess energy must be either stored in the ESS or sold to the main grid. The EMS agent specifies the priority option for excess energy usage in advance, similarly to the energy deficiency scenario. If the ESS is the priority option and the battery capacity is reached, the remaining energy is automatically transferred to the main grid.

3. MARKOV DECISION PROCESS FORMALISM

Figure 4: Interaction process between the control agent and microgrid environment.

The RL paradigm refers to the set of control methods in which an agent learns the optimal control policies by interacting with an environment. In this study, this agent learning was achieved through a Markov decision process formalism [52]. At each time step, the agent performs an action based on the current state of the environment and, in return, receives a reward and information about the next state, as illustrated in Figure 4. In order to find the optimal policy, the agent must estimate the quality of its actions using information from the previously explored states of the environment. In the following formulation, the MDP is characterized by a state space S, an action space A, a transition function T, and a reward function R. The transition function T describes the probability of a transition from state s ∈ S to s' ∈ S given an action a, such that

T : S × A × S → [0,1],
T(s, a, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a).   (10)

The reward function r_t ∈ ℝ describes the immediate reward received by the agent after it performs the transition to state s_{t+1} from state s_t, given action a_t. The objective is to optimize a stochastic policy π, defined by the probability distribution over the possible actions a given a state s, mathematically represented as

π : S → [0,1],
π(s) = Pr(a | s).   (11)

The notation π(a‖s) = Pr(a | s) is used herein to refer to the probability of choosing action a given a state s. At each time step t, the agent receives a state s_t and selects an action a_t from the set of possible actions A, according to its policy π. In return, the agent receives the next state s_{t+1} and a reward r_t. The process continues until the agent reaches a terminal state, after which the process restarts. Following policy π, the goal is to maximize the expected cumulative discounted reward, given by

R_t = Σ_{k=0}^{D−t−1} γ^k r_{t+k},   (12)

where γ ∈ ]0,1] is the discount factor, which determines the importance of the rewards in the next step compared with that of the immediate reward.
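The discounted return of equation (12) is also the quantity accumulated for each transition by the policy-gradient algorithms of section 4 (the "Calculate R_t" step of Algorithms 2-4). A minimal sketch, assuming one finished episode of rewards:

def discounted_returns(rewards, gamma=0.9):
    # R_t = sum_k gamma^k * r_{t+k}, computed backwards from the terminal step.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-step episode
print(discounted_returns([1.0, 0.0, 0.0, 2.0], gamma=0.9))   # [2.458, 1.62, 1.8, 2.0]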
The state value under policy π is simply the expected return for following policy π from state s_t:

V^π : S → ℝ,
V^π(s_t) = E[R_t].   (13)

V^π can also be defined as

V^π(s_t) = E[r_t + γ V^π(s_{t+1})],   ∀ t < D − 1,   (14)

V^π(s_{D−1}) = E[r_{D−1}].   (15)

Conversely, the action-value function, or Q-function, describes the expected return for selecting an action a_t at state s_t and following policy π onward:

Q^π : S × A → ℝ,
Q^π(s_t, a_t) = E[r_t + γ V^π(s_{t+1})].   (16)

The agent begins searching for the optimal policy π*, at an initial state s_0, that maximizes the action-value function:

π* = argmax_π Q^π(s_0, a),   (17)

Q*(s, a) = max_π Q^π(s, a).   (18)

The following section describes the states, actions, and reward function specific to our problem formulation.

3.1 State description

Based on the problem formulation in the previous section, the state space consists of the information that the agent uses in the decision-making process at each time step t. The state consists of a controllable state component s^c_t, an exogenous state component s^e_t, and a time-dependent component s^τ_t. The controllable state information includes all environmental variables that the agent affects directly or indirectly. In this work, the average SoC of the TCLs SoC^TCL_t, the state-of-charge of the ESS SoC^ESS_t, and the pricing counter ρ_t comprise the controllable state. The exogenous information consists of all the variables that the agent has no control over, such as the temperature T^out_t, energy generation G_t, and electricity prices at the regulation market (p↑_t, p↓_t). We assumed that the controller could accurately forecast these three variables for the following hour. The time-dependent component reflects the time-dependent behavioral patterns of the environment. In this work, h_t denotes the hour of day t, and L^b_t the current load value of the daily consumption pattern. Therefore, the state space can be described as

s_t ∈ S = S^c × S^e × S^τ,   (19)
s_t = (SoC^TCL_t, SoC^ESS_t, ρ_t, T^out_t, G_t, p↑_t, p↓_t, h_t, L^b_t).

3.2 Action description

The agent acts on the environment using the control mechanisms described in section 2. The action space consists of four components: the TCL action space A^TCL, price action space A^price, energy deficiency action space A^def, and energy excess action space A^exc. The TCL action space consists of four possible actions that specify the energy allocation level for the TCLs, the price action space consists of five possible actions that specify the price level, and the energy deficiency and energy excess action spaces each have two possible actions that specify the battery or grid priorities. Therefore, the action space represents the 80 potential combinations of the possible actions of these four components, given by

a_t = (a^TCL_t, a^price_t, a^def_t, a^exc_t),   (20)
a_t ∈ A = A^TCL × A^price × A^def × A^exc.
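The combined action space of equation (20) contains 4 × 5 × 2 × 2 = 80 joint actions. The sketch below shows one straightforward way to enumerate them and map a flat index (for example, the argmax of an 80-way network output) back to the four component actions; the concrete level values anticipate those listed later in Tables 1 and 2, while the rest is illustrative.

from itertools import product

TCL_LEVELS = [0, 50, 100, 150]      # energy allocated to the TCLs
PRICE_LEVELS = [0, 1, 2, 3, 4]      # discrete price level index
DEFICIENCY = ["ESS", "Grid"]        # priority source when supply falls short
EXCESS = ["ESS", "Grid"]            # priority sink when supply is in excess

# Equation (20): every joint action is a 4-tuple; 4*5*2*2 = 80 combinations.
ACTIONS = list(product(TCL_LEVELS, PRICE_LEVELS, DEFICIENCY, EXCESS))
assert len(ACTIONS) == 80

def decode(index):
    # Map a flat action index to (a_tcl, a_price, a_deficiency, a_excess).
    return ACTIONS[index]

print(decode(0))    # (0, 0, 'ESS', 'ESS')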
3.3 Reward function

We designed our reward function to maximize the autonomy of the microgrid and the economic profit from operations while keeping the electricity prices in an acceptable range. The four components of the reward function r_t are the supply reward r^s_t, describing the return generated from supplying the microgrid demand, the grid reward r^g_t, describing the return generated from the buy/sell operations with the main grid, the penalties related to violating the pricing constraint r^p_t, and the ESS reward r^ESS_t. The latter is related to the compensation for the energy stored in the ESS but not used during the control period, which can be used in future control episodes. The reward components are formulated as follows:

r^s_t = p_t (L_t + L^TCL_t),   (21)

r^g_t = p↓_t E^s_t − Φ(E^b_t),   (22)

where L_t and L^TCL_t are the total residential and TCL consumption at time t, E^s_t and E^b_t are the energies sold to and purchased from the main grid, respectively, and Φ is the cost function related to purchasing energy from the grid. We used the following quadratic cost function to penalize the purchase of large amounts of energy from the grid:

Φ(E^b_t) = p↑_t E^b_t + k (E^b_t)²,   (23)

where k is a constant describing the extent of the penalization chosen for the grid purchases.

r^p_t =  0                                         if ρ_t − (t + 1) p̄ ≤ 0
         −λ ‖r^s_t + r^g_t‖ (ρ_t − (t + 1) p̄)      if ρ_t − (t + 1) p̄ > 0     (24)

In equation 24, λ is the penalty factor, and r^p_t is a negative value proportional to the profit generated from the operations and the magnitude of the violation. At the terminal state, a small reward is added to the reward function if the ESS is charged, given by

r^ESS_t =  0                                       if t < D − 1
           μ ‖r^s_t + r^g_t + r^p_t‖ SoC^ESS_t     if t = D − 1     (25)

where μ is a factor that determines the compensation for the unused energy stored in the ESS. Finally, the total reward function is described as follows:

r_t = r^s_t + r^g_t + r^p_t + r^ESS_t.   (26)

4. PROPOSED METHODS

In this section, we present an overview of the DRL methods used in this work. The DRL methods are divided into two main categories, namely, value-based and policy-based methods. In value-based methods, the neural network learns the Q-function of each action a given a state s, whereas in policy-based methods, the neural network learns a probability distribution of the actions given a state s:

Value-based:  s_t −→ NN(θ) −→ Q(s_t, a_t),
Policy-based: s_t −→ NN(θ) −→ π(a_t‖s_t).

4.1 Value-based DRL methods

Based on the MDP formulation notations, the Q-function is represented as an approximator using a neural network with parameters θ, as shown in Figure 5. Deep Q-learning (DQN) is one of the most commonly used algorithms to tune the parameters θ and aims to directly approximate the optimal Q-function: Q(s, a, θ) ≈ Q*(s, a). Below, we present the three DQN variations used in this work:

Figure 5: The DQN neural network architecture.

A. One-step DQN

In one-step DQN, the parameters θ are learned by iteratively minimizing a sequence of loss functions, where the loss function is defined as

L(θ) = E[(r_t + γ max_{a'} Q(s_{t+1}, a', θ) − Q(s_t, a_t, θ))²].   (27)

The Q-function is updated toward the one-step return r_t + γ max_{a'} Q(s_{t+1}, a', θ). To increase the efficient usage of previously accumulated experience, we also included an experience replay mechanism [53]. In experience replay, the learning phase is logically separated from experience gain. Experience replay uses randomly sampled batches of transitions (s_t, a_t, r_t, s_{t+1}) from an experience dataset. Through this process, the neural network can overcome the limitation of non-stationary data distributions, resulting in better algorithm convergence.

B. SARSA

The deep SARSA algorithm [54] is an on-policy version of DQN. Therefore, updates to the parameters are related to the actual agent behavior instead of the optimal actions. The SARSA loss function is defined as

L(θ) = E[(r_t + γ Q(s_{t+1}, a_{t+1}, θ) − Q(s_t, a_t, θ))²].   (28)

The target value used by SARSA is r_t + γ Q(s_{t+1}, a_{t+1}, θ), where a_{t+1} is the action taken in state s_{t+1}. SARSA learns a near-optimal policy while following an ε-greedy exploration strategy.

C. Double DQN/Target network

Double DQN [55] uses a separate neural network for target setting, called a target network. The target network is a frozen copy of the original network that has a lower update frequency for the parameters θ'. The loss function becomes

L(θ) = E[(r_t + γ max_{a'} Q(s_{t+1}, a', θ') − Q(s_t, a_t, θ))²].   (29)

After several steps, the target network is updated by copying the parameters from the original network: θ' ← θ. To be effective, the interval between updates must be large enough to provide adequate time for the original network to converge. The target network provides stable Q-function targets for the loss function, which allows the original network to converge to the desired Q-function. Algorithm 1 outlines the pseudocode for the DQN with a target network. In practice, we first ran a random agent (taking random actions) to collect experiences from the environment and store the transitions (s_t, a_t, r_t, s_{t+1}) in the replay memory.

Algorithm 1: DQN with experience replay and target network
1: Initialize replay memory RM with experiences collected by the random agent.
2: Initialize DQN network parameters θ randomly
3: Initialize target network parameters θ' ← θ
4: For episode = 1, Max_episodes do:
5:   For t = 0, D−1 do:
6:     Get state s_t
7:     Take action a_t with ε-greedy policy based on Q(s_t, a; θ)
8:     Receive new state s_{t+1} and reward r_t
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in RM
10:    Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from RM
11:    For each sample, set y_j = r_j if j = D−1, otherwise y_j = r_j + γ max_{a'} Q(s_{j+1}, a', θ')
12:    Perform a gradient descent step on E[(y_j − Q(s_j, a_j, θ))²] w.r.t. θ
13:    Every N steps, update the target network θ' ← θ
14:  End for
15:  Reset environment
16: End for
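A minimal sketch of the target construction in step 11 of Algorithm 1 and equation (29). It assumes a q_target callable that returns an (N, 80) array of target-network Q-values for a stacked batch of states; everything else (names, the done flag) is illustrative rather than the authors' implementation.

import numpy as np

def dqn_targets(batch, q_target, gamma=0.9):
    # batch: list of (s, a, r, s_next, done) transitions sampled from RM.
    next_states = np.stack([s_next for (_, _, _, s_next, _) in batch])
    next_q = q_target(next_states)              # Q(s', a'; theta') from the frozen network
    targets = []
    for i, (_, _, r, _, done) in enumerate(batch):
        if done:                                # terminal step: y = r
            targets.append(r)
        else:                                   # y = r + gamma * max_a' Q(s', a'; theta')
            targets.append(r + gamma * np.max(next_q[i]))
    return np.array(targets)

The gradient descent step of line 12 then regresses Q(s, a; θ) toward these targets, while θ' is refreshed only every N steps.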
4.2 Policy-based DRL methods

In contrast to value-based methods, policy-based methods [56] aim to directly find an optimal policy π by parametrizing the policy function π(a_t‖s_t; θ) and updating the parameters θ using a gradient ascent on the expected return E[R_t]. Deep policy gradient (DPG) is one of the most common policy-based DRL algorithms, in which the parametrized model π(a_t‖s_t; θ) is based on a DNN that estimates the probability of taking an action a_t in a specific state s_t:

π(a_t‖s_t; θ) = Pr(a_t | s_t, θ).   (30)

Therefore, the DNN becomes a probability density function over its inputs, the states of the environment. The DPG is then reduced to solving the following optimization problem: max_θ E[R_t]. In the following sections, we present the four DPG algorithms used in this work.

A. REINFORCE

In standard REINFORCE algorithms [57], the gradient ascent of E[R_t] is achieved by updating the policy parameters θ in the direction of R_t ∇_θ log π(a_t‖s_t; θ), the unbiased gradient estimate of the objective function:

∇_θ E[R_t] ∝ E[R_t ∇_θ log π(a_t‖s_t; θ)].   (31)

Therefore, an objective function can be defined as

J(θ) = Ê[R_t log π(a_t‖s_t; θ)],   (32)

where Ê indicates the empirical average over a finite batch of samples in an algorithm that alternates between sampling and optimization. The learning process is summarized in Algorithm 2.

Algorithm 2: REINFORCE algorithm
1: Initialize network parameters θ randomly
2: Initialize training memory with capacity N_s
3: Initialize agent memory
4: For episode = 1, Max_episodes do:
5:   For t = 0, D−1 do:
6:     Get state s_t.
7:     Take action a_t with ε-greedy policy based on π(s_t, a).
8:     Receive new state s_{t+1} and reward r_t.
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in agent memory.
10:  End for
11:  Calculate R_t for each transition in the agent memory.
12:  Store (s_t, a_t, R_t) in the training memory.
13:  Reset agent memory
14:  Reset environment
15:  If training memory is full:
16:    Perform a gradient ascent step on J(θ) w.r.t. θ.
17:    Reset training memory
18: End for

B. Actor-Critic

The idea behind actor-critic methods is to use the value function information to assist in the policy update. Actor-critic methods [58] use two models, which may optionally share parameters:
· The critic model is used to calculate an estimate of the state value function V(s_t; θ_v) ≈ V^π(s_t) that is used as a baseline for updating the policy parameters. The critic model updates its parameters using basic supervised learning from past experiences.
· The actor model updates the policy parameters θ for π(a_t‖s_t; θ) in the direction suggested by the advantage function A. The advantage function specifies the advantage of choosing an action a_t in a state s_t over the expected value of s_t and is defined as

A(s_t, a_t) = Q(s_t, a_t) − V(s_t).   (33)

The advantage can also be estimated using R_t and V(s_t; θ_v) as

A(s_t, a_t) = R_t − V(s_t; θ_v),   (34)

because R_t is an estimate of Q(s_t, a_t) and V(s_t; θ_v) is an estimate of V^π(s_t). Therefore, the actor updates its parameters in the direction of A(s_t, a_t) ∇_θ log π(a_t‖s_t; θ).

Figure 6: The actor-critic neural network architecture.

For the actor and critic models in this work, we used a shared DNN. The DNN, shown in Figure 6, features two output layers that predict the policy π using SoftMax activation and the baseline V(s_t) using linear activation.
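The shared actor-critic network described above can be expressed in a few lines of Keras, which the authors list among their tools in section 5. The layer sizes and input dimension below are our assumptions; only the two-headed structure (SoftMax policy head, linear value head) follows the text.

from tensorflow.keras import layers, Model

def build_actor_critic(state_dim=9, n_actions=80, hidden=128):
    # Shared trunk with a SoftMax policy head and a linear state-value head.
    states = layers.Input(shape=(state_dim,))
    x = layers.Dense(hidden, activation="relu")(states)
    x = layers.Dense(hidden, activation="relu")(x)
    policy = layers.Dense(n_actions, activation="softmax", name="policy")(x)
    value = layers.Dense(1, activation="linear", name="value")(x)
    return Model(inputs=states, outputs=[policy, value])

model = build_actor_critic()
model.summary()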
The learning process uses a loss function that combines the critic and actor loss functions:

L(θ) = E[α_v L_v(θ) + α_π L_π(θ)],   (35)

where α_v and α_π are the predefined learning rates for the value function and the policy parameter updates, respectively. The definitions of L_π and L_v are

L_π(θ) = − A(s_t, a_t) log π(a_t‖s_t; θ),   (36)

L_v(θ) = ‖V(s_t) − V(s_t; θ_v)‖².   (37)

Interestingly, the value function in this case, V(s_t), is calculated using the collected experience, by accumulating the discounted rewards received following a given policy π. In other words, V(s_t) is simply the discounted return R_t. Therefore,

L_v(θ) = ‖R_t − V(s_t; θ_v)‖².   (38)

The actor-critic algorithm is described in Algorithm 3.

Algorithm 3: Actor-Critic algorithm
1: Initialize network parameters θ randomly
2: Initialize training memory with capacity N_s
3: Initialize agent memory
4: For episode = 1, Max_episodes do:
5:   For t = 0, D−1 do:
6:     Get state s_t.
7:     Take action a_t with ε-greedy policy based on π(s_t, a).
8:     Receive new state s_{t+1} and reward r_t.
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in agent memory.
10:  End for
11:  Calculate R_t for each transition in the agent memory.
12:  Store (s_t, a_t, R_t) in the training memory.
13:  Reset agent memory
14:  Reset environment
15:  If training memory is full:
16:    Calculate A(s_t, a_t), log π(a_t‖s_t; θ), L_π(θ), and L_v(θ) for the whole batch.
17:    Perform a gradient descent step on L(θ) w.r.t. θ.
18:    Reset training memory
19: End for

C. Proximal policy optimization (PPO)

PPO [59] is a variation of the standard actor-critic algorithm that addresses the problem of destructive, large policy updates. The aim is to constrain the policy updates to a small range to ensure that the updated policy does not move too far from the old policy, in which the training data were collected. Therefore, a clipped surrogate objective function is implemented, in which log π(a_t‖s_t; θ) is replaced by a probability ratio of the old policy π_old to the new policy π, defined as

ω_t(θ) = π(a_t‖s_t; θ) / π_old(a_t‖s_t; θ_old).   (39)

This ratio is clipped inside a small interval around 1 to prevent large policy updates. The clipped surrogate objective function is defined as

L^CLIP(θ) = Ê[min(ω_t(θ) A(s_t, a_t), clip(ω_t(θ), 1 − δ, 1 + δ) A(s_t, a_t))],   (40)

where δ is a hyperparameter that determines the policy update interval. The clip function is defined as

clip(ω_t(θ), m, M) =  m         if ω_t(θ) < m
                      ω_t(θ)    if m < ω_t(θ) < M     (41)
                      M         if ω_t(θ) > M

The final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. In this scheme, the change in probability ratio is ignored only when it would cause the objective to improve and included when it would worsen the objective. Thus, in an actor-critic framework, the PPO loss function is defined as

L^PPO(θ) = Ê[L^CLIP(θ) − α_v L_v(θ)].   (42)
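A numerical sketch of the clipped surrogate of equations (39)-(41), operating on per-sample probabilities of the taken actions and their advantage estimates (all array and parameter names are illustrative):

import numpy as np

def ppo_clip_objective(new_probs, old_probs, advantages, delta=0.2):
    ratio = new_probs / old_probs                        # omega_t(theta), equation (39)
    clipped = np.clip(ratio, 1.0 - delta, 1.0 + delta)   # equation (41)
    # Equation (40): pessimistic (minimum) term, averaged over the batch.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

In training, the negative of this quantity is minimized together with the weighted critic loss, as in equation (42).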
The pseudocode of the PPO algorithm is presented in Algorithm 4.

Algorithm 4: PPO algorithm
1: Initialize network parameters θ randomly
2: Initialize training memory with capacity N_s
3: Initialize agent memory
4: For episode = 1, Max_episodes do:
5:   For t = 0, D−1 do:
6:     Get state s_t.
7:     Take action a_t with ε-greedy policy based on π(s_t, a).
8:     Receive new state s_{t+1} and reward r_t.
9:     Store transition (s_t, a_t, r_t, s_{t+1}) in agent memory.
10:  End for
11:  Calculate R_t for each transition in the agent memory.
12:  Store (s_t, a_t, R_t) in the training memory.
13:  Reset agent memory
14:  Reset environment
15:  If training memory is full:
     Store the current parameters as θ_old ← θ
     For k = 0, Max_epochs do:
16:    Calculate A(s_t, a_t), ω_t(θ), L^CLIP(θ), and L_v(θ) for the entire batch.
17:    Perform a gradient descent step on L^PPO(θ) w.r.t. θ.
     End for
18:  Reset training memory
19: End for

D. Asynchronous Advantage Actor-Critic (A3C)

A3C [60] is a multi-thread version of the actor-critic method that aims to reduce the convergence time, widen the search area, and overcome the problem of highly correlated samples gathered by a single agent. In the A3C algorithm, multiple actor agents interact in parallel with copies of the environment and collect experience that is used by multiple learners to train the model. The A3C learning process is illustrated in Figure 7. After each episode, the actor agents input the n-step experience (s_t, a_t, R_t, s_{t+1}) into the training memory. When the maximum number of samples N_s is reached, the learner agent uses these samples to update the network parameters using the actor-critic loss function and resets the training memory.

Figure 7: Asynchronous Advantage Actor-Critic (A3C) framework.

E. Proposed variations

According to the experiments conducted in this work, the A3C algorithm suffered from experience losses due to the memory reset operation. Additionally, when the policy was updated to regions outside the regions from which the latest samples were collected, the accuracy of the model was damaged beyond recovery. In order to overcome these challenges, we introduced an experience replay technique similar to that introduced in the DQN algorithm. In the added experience replay, the learner agents update the parameters using random samples from all experiences collected during the entire process, without a memory reset. This A3C algorithm with an additional experience replay is hereafter referred to as A3C+. We also propose a second semi-deterministic training in which the search follows an ε-greedy strategy. In the normal case, the selection of an action follows a stochastic process and an ε-greedy strategy by choosing a random action with a probability ε:

Pr(a_t = a) = ε / |A|.   (43)

Subsequently, with probability 1 − ε, an action is chosen using the policy, as in equation (30). In the proposed semi-deterministic training, equation (43) is still used to follow the ε-greedy strategy, but with probability 1 − ε, an action is deterministically selected according to the maximum value of the policy:

a_t = argmax_a(π(a‖s_t, θ)).   (44)

The following section details the improved performance and robustness of the algorithm achieved through this process. This variation is notably different from the deterministic DPG algorithm because it requires a preliminary stochastic training before the deterministic training, and it retains the ε-greedy strategy for exploration. We refer to this variation as A3C++ (A3C + experience replay + semi-deterministic training). The PPO algorithm was also run in a multi-thread framework, similar to the A3C algorithm, and given the name PPO++ (PPO + semi-deterministic training).
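The difference between the stochastic and the semi-deterministic training phases reduces to how an action is drawn from the policy head once the ε-greedy coin flip has been made. A minimal sketch (names are ours):

import numpy as np

def select_action(policy_probs, epsilon, deterministic):
    n_actions = len(policy_probs)
    if np.random.rand() < epsilon:                     # equation (43): random exploration
        return np.random.randint(n_actions)
    if deterministic:                                  # equation (44): semi-deterministic phase
        return int(np.argmax(policy_probs))
    return int(np.random.choice(n_actions, p=policy_probs))   # equation (30): stochastic phase

A3C++ first trains with deterministic=False and then continues training the same network with deterministic=True, keeping the ε-greedy exploration in both phases.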
5. EXPERIMENTAL IMPLEMENTATION

5.1 Implementation details

We used an OpenAI toolkit called Gym [62] to implement the microgrid simulation. OpenAI Gym is an open-source toolkit for developing and comparing RL algorithms. The parameters of the components used in the simulation are summarized in Table 1 and are based on the microgrid model and notations described in section 2. The environment represents several days of energy management in which each episode corresponds to 1 day. For each episode, one of the 10 first days was chosen arbitrarily.

Table 1: Microgrid component parameters

Parameter | Value
ESS parameters
Charging efficiency η_c | 0.9
Discharging efficiency η_d | 0.9
Maximum charging rate C_max | 200 kW
Maximum discharging rate D_max | 200 kW
Capacity B_max | 400 kW
DER parameters
Energy generation G_t | 1‰ of the hourly wind energy generation records in Finland (kW) [48]
Grid parameters
Down-regulation price p↓_t | Down-regulation prices in Finland (€/kW) [49]
Up-regulation price p↑_t | Up-regulation prices in Finland (€/kW) [49]
TCL parameters
Number of TCLs | 100
Outdoor temperature T^out_t | Temperature records in Helsinki (ºC) [61]
Thermal mass of air C_a (mean, std) | (0.004, 0.0008)
Thermal mass of building materials C_m (mean, std) | (0.3, 0.004)
Internal heating q_i (mean, std) | (0.0, 0.01)
Nominal power P_i (mean, std) | (1.5, 0.01) (kW)
Minimum temperature T_min | 19.0 ºC
Maximum temperature T_max | 25.0 ºC
Residential load parameters
Number of households | 100
Daily basic load pattern L^b_t | Represented in Figure 15.
Variable load β_i (mean, std) | (3.0, 1.0) (kW)
Initial price sensitivity ε_{i,0} (mean, std) | (0.5, 0.3)
General parameters
Control period D | 24
Price levels (index) | {0, 1, 2, 3, 4}
Price levels (actual prices according to [51]) | {2.8, 5.8, 7.8, 9.3, 10.81} (€cents/kW)

The DRL algorithms were implemented separately using Python, TensorFlow, and Keras [63]. The parameters related to the RL algorithms are summarized in Table 2.

Table 2: Hyper-parameters of the DRL algorithms

Parameter | Value
MDP parameters
Number of actions |A| | 80
TCL action space A^TCL | {0, 50, 100, 150}
Price action space A^price | {0, 1, 2, 3, 4}
Energy deficiency action space A^def | {ESS, Grid}
Energy excess action space A^exc | {ESS, Grid}
Discount factor γ | 0.9
Grid purchase penalization k | 0.015
Pricing penalty factor λ | 2.0
ESS compensation factor μ | 0.5
Time step | 1 hour
Exploration rate ε | Starts at 0.4 and decays to 0.001
DQN parameters
Max_episodes | 1000
Replay memory capacity | 500
Batch size | 200
Policy gradient parameters
Training memory capacity N_s | 200
Learning rate | 0.007
Critic loss weight α_v | 1.0
PPO clip parameter δ | 0.2

5.2 Baselines

In addition to the reinforcement methods, we implemented two deterministic management methods for comparing the results. The following describes the actions included in each baseline considered.

A. Baseline 1:
· TCL action: The TCLs are totally controlled by their backup controllers. This means that the energy allocation for the TCLs is zero for the whole control period.
· Price action: The price levels are designed to match the expected peaks and valleys of demand, as shown in Figure 16.
· Energy deficiency action: The ESS has priority.
· Energy excess action: The ESS has priority.

B. Baseline 2:
· TCL action: The TCL control is proportional to the energy availability. Thus, if the energy generated is minimal, the energy allocated to the TCLs is zero, and vice versa.
· Price action: The same as in Baseline 1.
· Energy deficiency action: The main grid has priority.
· Energy excess action: The main grid has priority.
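The two baselines are simple rules over the same four-component action space. The sketch below is a hedged rendering of their verbal description; the price_schedule lookup and the proportional mapping onto the four TCL allocation levels are our assumptions, not the authors' implementation.

def baseline_1(hour, price_schedule):
    # TCLs left to their backup controllers, fixed daily price schedule
    # matching the expected demand peaks and valleys, ESS priority.
    return 0, price_schedule[hour], "ESS", "ESS"

def baseline_2(hour, price_schedule, generation, max_generation,
               tcl_levels=(0, 50, 100, 150)):
    # TCL allocation proportional to the available generation, same price
    # schedule as Baseline 1, main-grid priority.
    share = generation / max_generation if max_generation > 0 else 0.0
    idx = min(int(share * len(tcl_levels)), len(tcl_levels) - 1)
    return tcl_levels[idx], price_schedule[hour], "Grid", "Grid"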
6. RESULTS AND DISCUSSION

6.1 Total rewards and training time

We ran the RL algorithms in the simulated environment and recorded both the training performance and the total reward per day averaged over 10 days. The learning processes are shown in Figure 8 and Figure 9 for each of the RL algorithms. The learning curves indicate high instability in the basic one-step DQN algorithm and non-recoverable destruction of both the REINFORCE and actor-critic algorithms. The remainder of the algorithms displays a satisfactory learning process that led to a reasonable convergence. The double DQN algorithm appears to have the most stable learning curve, which supports the argument that the target network contributes advantageously to the stability of the DQN algorithm, as outlined in section 4. In addition, the SARSA algorithm demonstrates good stability compared with the policy gradient algorithms.

Figure 8: Deep Q-learning algorithms: learning curves. Average daily rewards over 10 days surrounded by an envelope containing 90% of the daily rewards. Rewards are scaled by dividing the original reward function by 1000.

Figure 9: Deep policy gradient algorithms: learning curves. Average daily rewards of 10 days surrounded by an envelope containing 90% of the daily rewards.

Table 3: Average total rewards at the end of the training

Algorithm | Training time (min) | Reward (×10³)
1-step DQN | 7.76 | −0.675
SARSA | 7.78 | 0.212
Double DQN | 11.30 | −0.48
REINFORCE | 3.96 | −1.961
Actor-Critic | 3.97 | −1.848
PPO (16 threads) | 0.36 | 0.026
PPO++ | 0.36 + 0.55 | 0.219
A3C (16 threads) | 0.29 | −0.159
A3C+ | 4.03 | 0.037
A3C++ | 4.03 + 1.69 | 0.266
Baseline 1 | - | −0.062
Baseline 2 | - | −0.031

The improved version of A3C, named A3C+, does not appear to significantly improve on the stability of A3C. However, A3C+ managed to converge to a better reward level than A3C, which indicates the considerable advantage of using experience replay in an A3C algorithm. The PPO algorithm also achieved a good final reward but did not demonstrate a meaningful improvement in stability compared with A3C. Overall, deep Q-learning algorithms exhibit better learning stability than policy-based algorithms. On the other hand, policy-based algorithms seem to be able to reach superior reward levels at the end of the learning process. Figure 10 displays the learning processes of the proposed variations A3C++ and PPO++, in which the models were clearly able to find new optimal policies by starting from a trained network and following the semi-deterministic search described in section 4.

Figure 10: Second learning process of A3C++ and PPO++. Average daily rewards of 10 days surrounded by an envelope containing 90% of the daily rewards.

A summary of the average total reward result from the test simulations and the training time for each algorithm is presented in Table 3. The results show that only five out of the 10 DRL algorithms outperformed the baselines. The DRL algorithms that were significantly outperformed by the baselines are not considered for further performance analysis in the following section.

6.2 Energy efficiency and peak shaving

In this section, we investigate different aspects of the performance derived from the resulting policies. The associated models were tested in different energy availability and weather condition scenarios to evaluate their ability to minimize the energy cost, reduce grid dependency, and flatten the load curve. Only algorithms with favorable results in terms of rewards, namely, SARSA, A3C, A3C++, PPO, PPO++, baseline 1, and baseline 2, are considered.

We ran the models over 10 days of simulation and registered the total daily energy cost, total hourly consumption, and total hourly energy purchased from and sold to the grid. Figure 11 shows the average energy cost, considering the quadratic cost described in section 3. A3C++ achieved operation at a very low energy cost compared with the rest of the DRL algorithms and the baselines. Contrastingly, PPO++ did not reduce the energy cost compared with the basic PPO. Except for the A3C algorithm, the DRL algorithms outperformed the baselines in terms of cost reduction.

In order to analyze the autonomy of each model, we examined the amount of energy exchanged with the grid on an hourly basis. The results are presented in Figure 12, wherein positive values reflect energy purchased from the grid and negative values indicate energy sold to the grid. From the boxplot distributions, it is clear that A3C++ and the baseline models did not sell any energy to the grid, but they also have a lower amount of energy purchased compared to the models that sold energy to the grid. Interestingly, even though baseline 2 has a set priority to sell excess power to the grid, no energy was sold. This can be explained by the proportional form of the TCL energy allocation. It is clear that A3C++ outperformed the two baselines by having lower median and lower maximum of the hourly energy exchange. Similarly, SARSA outperformed A3C, PPO, and PPO++. However, an
obvious best method among A3C++ and SARSA cannot be determined based on energy exchange with the grid. A3C++ provides more autonomy to the microgrid by purchasing less energy, whereas SARSA generates better economic value by contributing to the main grid.

Figure 11: Average energy cost per day.

Figure 12: Distribution of the amount of energy exchanged with the grid per hour using different algorithms. Negative values represent energy sold to the grid and positive values represent energy purchased.

Figure 13: Average total power consumption profiles with a surrounding envelope representing 90% of the power consumption profiles over 10 days.

The total consumption of the power grid per hour is analyzed using the total load curves depicted in Figure 13. Most notably, A3C++ has a distinct shape and a remarkable peak reduction compared with the rest of the DRL algorithms. The baselines also display a unique shape with larger periods of high consumption. For a clearer analysis of the peak shaving performance, we further analyzed the shape of the consumption curves [64]. To do so, we defined two load shape indices: the load factor and the peak valley factor. The load factor LF represents the peak load information and the whole load curve shape and can be defined as

LF = L_avg / L_max,   (45)

where L_avg is the average load over a single day and L_max is the maximum load of the day. High values of LF indicate a more uniform load curve profile. The peak valley factor PVF denotes the variation interval of the load and is defined by

PVF = (L_max − L_min) / L_max.   (46)

A low value of PVF can indicate small differences between the peaks and valleys and thus a more uniform load curve profile.

Figure 14: Average load factor and peak valley factor calculated over 10 days.

The load factor and peak valley factor results are shown in Figure 14 and indicate an advantage of the baselines over most of the DRL algorithms. This can be explained by the nature of the pricing policy of the baselines, which follows the expected load shape. It is also worth noting that the average consumption of the baselines, seen in Figure 13, appears higher than the average consumption of the DRL methods, which therefore gives an advantage to the baselines in terms of LF. A3C++ appears to outperform the other DRL methods in both indices, which confirms the observations in Figure 13. PPO++ improves over PPO according to both indices. Finally, SARSA demonstrates a similar performance to PPO++ on both indices, but the load profile has a slightly lower peak in SARSA.
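The two load shape indices can be computed directly from an hourly consumption profile; a small sketch follows (the PVF expression follows the reconstruction of equation (46) given above, and the example profile is illustrative):

import numpy as np

def load_shape_indices(hourly_load):
    load = np.asarray(hourly_load, dtype=float)
    lf = load.mean() / load.max()                     # load factor, equation (45)
    pvf = (load.max() - load.min()) / load.max()      # peak valley factor, equation (46)
    return lf, pvf

lf, pvf = load_shape_indices([120, 110, 100, 150, 220, 180, 140, 130])
print(f"LF = {lf:.2f}, PVF = {pvf:.2f}")   # LF = 0.65, PVF = 0.55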
6.3 Detailed single day results

In this section, we present the results of each method according to the data of an individual day presented in Figure 15, Figure 16, and Figure 17. However, the algorithms and simulations are available in [65], and the reader is invited to examine the results using different days of the simulation. The results for each method presented in Figure 16 include the electricity sale prices, TCL energy allocation and consumption, distribution of individual TCL states of charge, and distribution of individual residential loads. The energy purchased from the grid, energy sold to the grid, and ESS state-of-charge are presented in Figure 17.

Figure 15: Simulation environment data for 1 day of the simulation: outdoor temperature, energy generated, grid prices, and expected individual basic load.

The first notable result is that SARSA and PPO++ did not use the option of dynamic pricing but rather the medium price level during the entire control period, which led to uniform residential load distributions. These two methods also allocated the same amount of energy to the TCLs during the entire control period. Despite the good results in terms of rewards, these methods constitute clear cases of premature convergence to suboptimal deterministic policies. The rest of the methods have exploited the potential offered by dynamic pricing, which clearly affected the distribution of residential loads.

By jointly observing Figure 15 and Figure 16, a correlation between the expected basic load and the sale prices can be seen in A3C++ and, as expected, in the baseline methods. This is an interesting result because A3C++ demonstrates a successful recognition that the electricity prices must follow the expected peaks and valleys without being explicitly taught to do so. Another observation of note is that A3C++ allocated large amounts of energy to the TCLs at times of low grid price and high energy availability. Subsequently, the allowance for the rest of the day was cut when the grid prices increased. This is a clear exploitation of the thermal flexibility of the TCLs. The success of A3C++ can also be represented by the amount of energy purchased per hour; this algorithm significantly reduced the peak energy purchases compared with the rest of the DRL methods. Conversely, PPO and A3C did not demonstrate clear, understandable relationships between the prices, the TCL energy allocations, and the state variables.

In terms of energy storage and energy sold to the grid illustrated in Figure 17, unlike the baseline methods, all the DRL methods managed to either sell some energy to the grid, charge the battery, or, as in the cases of PPO and A3C, do both. Another interesting result can be observed in relation to the battery charging policy of PPO++. PPO++ charged the battery to 80% of its capacity without using any of its energy to meet later demand. Clearly, PPO++ has learned to take advantage of the compensation for unused energy stored in the ESS, r^ESS. This explains the reward increase seen in PPO++ over that in PPO without the presence of a superior policy increasing the energy efficiency. Therefore, PPO++ has essentially managed to “cheat” the environment.

The prior observations indicate that the proposed variation A3C++ outperforms the other methods presented in this paper, including the baselines. Furthermore, PPO++ cannot be considered to have found a superior policy to that of the basic PPO, despite the appealing reward results. The same assumption can be made for SARSA, as it has shown clear premature convergence to suboptimal policies. The A3C algorithm additionally demonstrated little success in terms of energy efficiency and produced the lowest-quality results in almost every aspect. The success of A3C++ suggests that adding memory replay and a second semi-deterministic training to the A3C algorithm can significantly improve the search quality, in turn leading to superior policy results.

7. CONCLUSION

In this paper, we studied a multi-task EMS for a residential microgrid with multiple sources of flexibility. The proposed microgrid model considers the potential demand flexibility offered by price-responsive loads and TCLs. The proposed EMS coordinates between the ESS, the main grid, the TCLs, and the price-responsive loads to ensure optimal management of the local resources. The uncertain nature of the microgrid components and the high dimensionality of their variables incentivize the use of intelligent learning-based methods in the EMS, such as DRL algorithms. In this paper, we also presented a
comprehensive experimental comparison of the
current state-of-the-art DRL algorithms and proposed
improved versions of the A3C and PPO algorithms that
outperformed the existing algorithms. The numerical
results show that the classic DRL algorithms (DQN,
REINFORCE, and Actor-critic) could not converge to
an optimal policy. The remainder of the algorithms
achieved different levels of convergence, though to
suboptimal policies. The results revealed that adding an
experience replay and a second semi-deterministic
training phase to the A3C algorithm improved the
convergence and resulted in superior policies.
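To make the structure of this modification concrete, the sketch below illustrates the two added ingredients, an experience replay memory and a semi-deterministic second training phase, in a deliberately simplified setting: a tabular one-step actor-critic on a toy chain environment. This is not the paper's A3C++ implementation, which applies the idea to deep networks trained by asynchronous workers and is available in [65]; the environment, buffer size, learning rates, and exploration rate are illustrative assumptions, and replaying policy-gradient updates from stored transitions is shown without the off-policy corrections a rigorous treatment would require.

import random
from collections import deque
import numpy as np

# Simplified illustration of replay + two-phase training; not the paper's A3C++ (see [65]).
# Toy chain MDP: action 1 moves right, action 0 moves left; reward 1 on reaching the last state.
N_STATES, GAMMA = 6, 0.95

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

H = np.zeros((N_STATES, 2))   # actor: softmax action preferences
V = np.zeros(N_STATES)        # critic: state values
memory = deque(maxlen=5000)   # experience replay buffer

def policy(s):
    e = np.exp(H[s] - H[s].max())
    return e / e.sum()

def replay_update(batch, lr_pi=0.05, lr_v=0.1):
    # One-step actor-critic update applied to transitions sampled from the replay memory.
    for s, a, r, s2, done in batch:
        target = r if done else r + GAMMA * V[s2]
        delta = target - V[s]                 # TD error used as the advantage estimate
        V[s] += lr_v * delta
        grad_log = -policy(s)
        grad_log[a] += 1.0                    # gradient of log softmax policy w.r.t. preferences
        H[s] += lr_pi * delta * grad_log

def train(episodes, semi_deterministic=False, eps=0.1, batch_size=32, max_steps=200):
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            pi = policy(s)
            if semi_deterministic and random.random() > eps:
                a = int(np.argmax(pi))        # phase 2: mostly greedy ("semi-deterministic")
            else:
                a = int(np.random.choice(2, p=pi))
            s2, r, done = step(s, a)
            memory.append((s, a, r, s2, done))
            replay_update(random.sample(list(memory), min(batch_size, len(memory))))
            if done:
                break
            s = s2

train(episodes=300, semi_deterministic=False)   # phase 1: stochastic, policy-driven exploration
train(episodes=100, semi_deterministic=True)    # phase 2: semi-deterministic fine-tuning
print("greedy policy:", [int(np.argmax(policy(s))) for s in range(N_STATES)])

In the paper, the corresponding modification is applied to the full deep A3C agent; the toy version above only illustrates how a replay buffer and a mostly greedy second phase can be layered on top of an existing actor-critic training loop.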
Designing and implementing a successful EMS for
future microgrids constitutes a challenging task
because of the high dimensionality and uncertainty of
the microgrid components. Though DRL methods have
proved successful in the gaming field, they are far from
perfect, and such methods face implementation
difficulties in real problems due to their data
inefficiency, instability, and slow convergence.
Currently, concerted efforts are being made to improve
the performance of DRL algorithms and enhance their
applicability in real-world problems.
The research presented in this paper can be extended in several directions. First, we used a simple model to describe the responsiveness of the residential loads to the electricity prices. This model can be replaced by
intelligent agents in future work, which would learn the
optimal consumption considering the trade-off between
comfort and the electricity bill, given the price
uncertainty. The TCLs can also be modeled as
intelligent price-responsive agents that adjust their
power according to the price levels in future models.
Second, in the proposed control mechanism, the ESS and grid
priorities were only determined in cases of energy
deficiency or excess. However, charging and
discharging of the ESS can also be achieved through
exchanging electricity with the main grid or
neighboring microgrids. Additionally, this model
assumed that the electricity balancing from the grid
was performed automatically in real-time using up- and
down-regulation markets. However, in reality, utility
companies buy and sell electricity in advance in the
day-ahead and intraday markets using energy demand
forecasts. Therefore, a DRL-based planning module
can be added to determine the optimal purchases and
sales in the day-ahead and intraday markets. We plan to
implement this model and investigate its potential in
future work. Finally, we intend to study the ability of
these algorithms to transfer their knowledge learned in
simulations to a physical microgrid with a similar
setup.
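As a rough illustration of the day-ahead planning problem mentioned above, the fragment below compares the cost of committing a forecast-based purchase in the day-ahead market with the cost of leaving the demand to be settled at up- and down-regulation prices. The price levels, forecast, and settlement rule are illustrative assumptions only, not market data or the balancing model used in this paper.

import numpy as np

def settlement_cost(actual, bid, p_da, p_up, p_down):
    # Day-ahead purchase cost plus imbalance settlement for one day (EUR).
    # Illustrative settlement rule: shortfalls bought at the up-regulation price,
    # surpluses sold back at the down-regulation price.
    actual, bid = np.asarray(actual, float), np.asarray(bid, float)
    shortfall = np.maximum(actual - bid, 0.0)
    surplus = np.maximum(bid - actual, 0.0)
    return float(np.sum(p_da * bid + p_up * shortfall - p_down * surplus))

rng = np.random.default_rng(0)
forecast = 50 + 10 * np.sin(np.linspace(0, 2 * np.pi, 24))   # forecast hourly demand (kWh)
actual = forecast + rng.normal(0, 5, 24)                      # realized demand with forecast error
p_da, p_up, p_down = 40e-3, 60e-3, 20e-3                      # EUR/kWh, illustrative price levels

print("bid = forecast:", round(settlement_cost(actual, forecast, p_da, p_up, p_down), 2), "EUR")
print("no day-ahead bid:", round(settlement_cost(actual, np.zeros(24), p_da, p_up, p_down), 2), "EUR")

Under these illustrative prices, committing the forecast in the day-ahead market is cheaper than settling the entire demand in real time, which is the kind of gap a DRL-based planning module would aim to exploit.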

Figure 16: Energy management details related to pricing, TCL energy allocations, the distribution of TCL states of charge, and the
distribution of the individual residential loads for different algorithms.

Figure 17: Energy management results on energy purchased, energy sold, and energy stored in the ESS for different algorithms.

REFERENCES
[1] Q. Jiang, M. Xue, and G. Geng, “Energy management of microgrid in grid-connected and stand-alone modes,” IEEE Trans. Power Syst., vol. 28, no. 3, pp. 3380–3389, 2013, doi: 10.1109/TPWRS.2013.2244104.
[2] L. Guo et al., “Energy management system for stand-alone wind-powered-desalination microgrid,” IEEE Trans. Smart Grid, vol. 7, no. 2, p. 1, 2014, doi: 10.1109/TSG.2014.2377374.
[3] M. Patterson, N. F. Macia, and A. M. Kannan, “Hybrid microgrid model based on solar photovoltaic battery fuel cell system for intermittent load applications,” IEEE Trans. Energy Convers., vol. 30, no. 1, pp. 359–366, 2015, doi: 10.1109/TEC.2014.2352554.
[4] X. Xu, H. Jia, H.-D. Chiang, D. C. Yu, and D. Wang, “Dynamic modeling and interaction of hybrid natural gas and electricity supply system in microgrid,” IEEE Trans. Power Syst., vol. 30, no. 3, pp. 1212–1221, 2015, doi: 10.1109/TPWRS.2014.2343021.
[5] S. Krishnamurthy, T. M. Jahns, and R. H. Lasseter, “The operation of diesel gensets in a CERTS microgrid,” in Proc. IEEE Power and Energy Society General Meeting: Conversion and Delivery of Electrical Energy in the 21st Century, 2008, pp. 1–8, doi: 10.1109/pes.2008.4596500.
[6] M. Tasdighi, H. Ghasemi, and A. Rahimi-Kian, “Residential microgrid scheduling based on smart meters data and temperature dependent thermal load modeling,” IEEE Trans. Smart Grid, vol. 5, no. 1, pp. 349–357, 2014, doi: 10.1109/TSG.2013.2261829.
[7] S. Y. Derakhshandeh, A. S. Masoum, S. Deilami, M. A. S. Masoum, and M. E. H. Hamedani Golshan, “Coordination of generation scheduling with PEVs charging in industrial microgrids,” IEEE Trans. Power Syst., vol. 28, no. 3, pp. 3451–3461, 2013, doi: 10.1109/TPWRS.2013.2257184.
[8] Y. Xu, W. Zhang, G. Hug, S. Kar, and Z. Li, “Cooperative control of distributed energy storage systems in a microgrid,” IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 238–248, 2015, doi: 10.1109/TSG.2014.2354033.
[9] K. M. M. Huq, M. E. Baran, S. Lukic, and O. E. Nare, “An Energy Management System for a community energy storage system,” in Proc. 2012 IEEE Energy Conversion Congress and Exposition (ECCE), 2012, pp. 2759–2763, doi: 10.1109/ecce.2012.6342532.
[10] C. Zhang, Y. Xu, Z. Y. Dong, and J. Ma, “Robust operation of microgrids via two-stage coordinated energy storage and direct load control,” IEEE Trans. Power Syst., vol. 32, no. 4, pp. 2858–2868, 2017, doi: 10.1109/TPWRS.2016.2627583.
[11] T. A. Nakabi and P. Toivanen, “Optimal price-based control of heterogeneous thermostatically controlled loads under uncertainty using LSTM networks and genetic algorithms,” F1000Research, vol. 8, p. 1619, Sep. 2019, doi: 10.12688/f1000research.20421.1.
[12] F. Ruelens, B. J. Claessens, P. Vrancx, F. Spiessens, and G. Deconinck, “Direct load control of thermostatically controlled loads based on sparse observations using deep reinforcement learning,” Jul. 2017.
[13] C. Zhang, Y. Xu, Z. Y. Dong, and K. P. Wong, “Robust coordination of distributed generation and price-based demand response in microgrids,” IEEE Trans. Smart Grid, vol. 9, no. 5, pp. 4236–4247, 2018, doi: 10.1109/TSG.2017.2653198.
[14] L. Jian, H. Xue, G. Xu, X. Zhu, D. Zhao, and Z. Y. Shao, “Regulated charging of plug-in hybrid electric vehicles for minimizing load variance in household smart microgrid,” IEEE Trans. Ind. Electron., vol. 60, no. 8, pp. 3218–3226, 2013, doi: 10.1109/TIE.2012.2198037.
[15] H. Hao, B. M. Sanandaji, K. Poolla, and T. L. Vincent, “Aggregate flexibility of thermostatically controlled loads,” IEEE Trans. Power Syst., vol. 30, no. 1, pp. 189–198, 2015, doi: 10.1109/TPWRS.2014.2328865.
[16] J. L. Mathieu, M. Kamgarpour, J. Lygeros, G. Andersson, and D. S. Callaway, “Arbitraging intraday wholesale energy market prices with aggregations of thermostatic loads,” IEEE Trans. Power Syst., vol. 30, no. 2, pp. 763–772, 2015, doi: 10.1109/TPWRS.2014.2335158.
[17] C. De Jonghe, B. F. Hobbs, and R. Belmans, “Value of price responsive load for wind integration in unit commitment,” IEEE Trans. Power Syst., vol. 29, no. 2, pp. 675–685, 2014, doi: 10.1109/TPWRS.2013.2283516.
[18] Y. Zhang, L. Fu, W. Zhu, X. Bao, and C. Liu, “Robust model predictive control for optimal energy management of island microgrids with uncertainties,” Energy, vol. 164, pp. 1229–1241, 2018, doi: 10.1016/j.energy.2018.08.200.
[19] A. Parisio, E. Rikos, and L. Glielmo, “A model predictive control approach to microgrid operation optimization,” IEEE Trans. Control Syst. Technol., vol. 22, no. 5, pp. 1813–1827, 2014, doi: 10.1109/TCST.2013.2295737.
[20] A. Parisio, E. Rikos, G. Tzamalis, and L. Glielmo, “Use of model predictive control for experimental microgrid optimization,” Appl. Energy, vol. 115, pp. 37–46, 2014, doi: 10.1016/j.apenergy.2013.10.027.
[21] S. Baldi, I. Michailidis, C. Ravanis, and E. B. Kosmatopoulos, “Model-based and model-free ‘plug-and-play’ building energy efficient control,” Appl. Energy, vol. 154, pp. 829–841, 2015, doi: 10.1016/j.apenergy.2015.05.081.
[22] M. Wiering and M. van Otterlo, Eds., Reinforcement Learning: State-of-the-Art, vol. 12. Springer, 2012.
[23] B. Mbuwir, F. Ruelens, F. Spiessens, and G. Deconinck, “Battery energy management in a microgrid using batch reinforcement learning,” Energies, vol. 10, no. 11, p. 1846, 2017, doi: 10.3390/en10111846.
[24] P. Kofinas, G. Vouros, and A. I. Dounis, “Energy management in solar microgrid via reinforcement learning using fuzzy reward,” Adv. Build. Energy Res., vol. 12, no. 1, pp. 97–115, Jan. 2018, doi: 10.1080/17512549.2017.1314832.
[25] B. C. Phan and Y. C. Lai, “Control strategy of a hybrid renewable energy system based on reinforcement learning approach for an isolated microgrid,” Appl. Sci., vol. 9, no. 19, Oct. 2019, doi: 10.3390/app9194001.
[26] B. G. Kim, Y. Zhang, M. Van Der Schaar, and J. W. Lee, “Dynamic pricing and energy consumption scheduling with reinforcement learning,” IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2187–2198, Sep. 2016, doi: 10.1109/TSG.2015.2495145.
[27] S. Zhou, Z. Hu, W. Gu, M. Jiang, and X.-P. Zhang, “Artificial intelligence based smart energy community management: A reinforcement learning approach,” CSEE J. Power Energy Syst., 2019, doi: 10.17775/CSEEJPES.2018.00840.
[28] P. Kofinas, A. I. Dounis, and G. A. Vouros, “Fuzzy Q-Learning for multi-agent decentralized energy management in microgrids,” Appl. Energy, vol. 219, pp. 53–67, Jun. 2018, doi: 10.1016/j.apenergy.2018.03.017.
[29] E. Foruzan, L. K. Soh, and S. Asgarpoor, “Reinforcement learning approach for optimal distributed energy management in a microgrid,” IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5749–5758, Sep. 2018, doi: 10.1109/TPWRS.2018.2823641.
[30] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, no. 3–4, pp. 279–292, 1992, doi: 10.1007/BF00992698.
[31] B. V. Mbuwir, D. Geysen, F. Spiessens, and G. Deconinck, “Reinforcement learning for control of flexibility providers in a residential microgrid,” IET Smart Grid, Sep. 2019, doi: 10.1049/iet-stg.2019.0196.
[32] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in Proc. 4th International Conference on Learning Representations (ICLR), 2016.
[33] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015, doi: 10.1038/nature14236.
[34] D. Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018, doi: 10.1126/science.aar6404.
[35] V. François-Lavet, R. Fonteneau, and D. Ernst, “Deep reinforcement learning solutions for energy microgrids management,” in European Workshop on Reinforcement Learning, 2016, pp. 1–7.
[36] N. Ebell, F. Heinrich, J. Schlund, and M. Pruckner, “Reinforcement learning control algorithm for a PV-battery-system providing frequency containment reserve power,” 2018.
[37] V.-H. Bui, A. Hussain, and H.-M. Kim, “Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties,” IEEE Trans. Smart Grid, vol. 11, no. 1, pp. 457–469, Jun. 2019, doi: 10.1109/TSG.2019.2924025.
[38] B. J. Claessens, P. Vrancx, and F. Ruelens, “Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control,” IEEE Trans. Smart Grid, vol. 9, no. 4, pp. 3259–3269, Jul. 2018, doi: 10.1109/TSG.2016.2629450.
[39] T. Chen and W. Su, “Local energy trading behavior modeling with deep reinforcement learning,” IEEE Access, vol. 6, pp. 62806–62814, 2018, doi: 10.1109/ACCESS.2018.2876652.
[40] A. Prasad and I. Dusparic, “Multi-agent deep reinforcement learning for zero energy communities,” in Proc. IEEE ISGT Europe, 2019, doi: 10.1109/ISGTEurope.2019.8905628.
[41] H. Hua, Y. Qin, C. Hao, and J. Cao, “Optimal energy management strategies for energy Internet via deep reinforcement learning approach,” Appl. Energy, vol. 239, pp. 598–609, 2019, doi: 10.1016/j.apenergy.2019.01.145.
[42] Y. Ji, J. Wang, J. Xu, X. Fang, and H. Zhang, “Real-time energy management of a microgrid using deep reinforcement learning,” Energies, vol. 12, no. 12, 2019, doi: 10.3390/en12122291.
[43] N. Tomin, A. Zhukov, and A. Domyshev, “Deep reinforcement learning for energy microgrids management considering flexible energy sources,” EPJ Web Conf., vol. 217, p. 01016, 2019, doi: 10.1051/epjconf/201921701016.
[44] R. Lu and S. H. Hong, “Incentive-based demand response for smart grid with reinforcement learning and deep neural network,” Appl. Energy, vol. 236, pp. 937–949, Feb. 2019, doi: 10.1016/j.apenergy.2018.12.061.
[45] T. A. Nakabi, K. Haataja, and P. Toivanen, “Computational intelligence for demand side management and demand response programs in smart grids,” in Proc. 8th International Conference on Bioinspired Optimization Methods and their Applications, Paris, 2018.
[46] J. R. Vázquez-Canteli and Z. Nagy, “Reinforcement learning for demand response: A review of algorithms and modeling techniques,” Appl. Energy, vol. 235, pp. 1072–1089, Feb. 2019, doi: 10.1016/j.apenergy.2018.11.002.
[47] E. Barbour, D. Parra, Z. Awwad, and M. C. González, “Community energy storage: A smart choice for the smart grid?,” Appl. Energy, vol. 212, pp. 489–497, Feb. 2018, doi: 10.1016/j.apenergy.2017.12.056.
[48] Fingrid, “Wind power generation in Finland - Hourly data.” [Online]. Available: https://data.fingrid.fi/open-data-forms/search/fi/?selected_datasets=75. [Accessed: 12 Dec. 2019].
[49] Fingrid, “Fingrid open datasets.” [Online]. Available: https://data.fingrid.fi/open-data-forms/search/en/index.html. [Accessed: 12 Dec. 2019].
[50] T. A. Nakabi and P. Toivanen, “An ANN-based model for learning individual customer behavior in response to electricity prices,” Sustain. Energy Grids Netw., vol. 18, 2019, doi: 10.1016/j.segan.2019.100212.
[51] Austin Energy, “Residential electric rates & line items.” [Online]. Available: https://austinenergy.com/ae/residential/rates/residential-electric-rates-and-line-items. [Accessed: 16 Dec. 2019].
[52] M. L. Littman, “Markov decision processes,” in International Encyclopedia of the Social & Behavioral Sciences, Elsevier, 2001, pp. 9240–9242.
[53] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Mach. Learn., vol. 8, no. 3–4, pp. 293–321, May 1992, doi: 10.1007/BF00992699.
[54] D. Zhao, H. Wang, K. Shao, and Y. Zhu, “Deep reinforcement learning with experience replay based on SARSA,” in Proc. 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 2016, doi: 10.1109/SSCI.2016.7849837.
[55] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 30th AAAI Conference on Artificial Intelligence, 2016, pp. 2094–2100.
[56] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[57] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992, doi: 10.1023/A:1022672621406.
[58] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003, doi: 10.1137/S0363012901385691.
[59] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[60] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783, 2016.
[61] Finnish Meteorological Institute, “Weather observations, Kaisaniemi observation station, Helsinki.” [Online]. Available: https://en.ilmatieteenlaitos.fi/download-observations. [Accessed: 12 Dec. 2019].
[62] G. Brockman et al., “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
[63] F. Chollet, “Keras documentation,” keras.io, 2015. [Online]. Available: https://keras.io/.
[64] B. Peng et al., “A two-stage pattern recognition method for electric customer classification in smart grid,” in Proc. 2016 IEEE International Conference on Smart Grid Communications (SmartGridComm), 2016, pp. 758–763, doi: 10.1109/SmartGridComm.2016.7778853.
[65] T. A. Nakabi, “DRL for Microgrid Energy Management,” Zenodo, 2020, doi: 10.5281/zenodo.3598386.
