Approach
Feiye Zhang1 , Qingyu Yang2
1. School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]
2. SKLMSE Lab, MOE Key Laboratory for Intelligent Networks and Network Security, School of Automation Science and
Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]
Abstract: To achieve efficient operation of the smart grid, an appropriate energy trading strategy plays an important role in reducing the costs of the participating agents and in alleviating grid pressure. However, as the number of participants in the smart grid increases, keeping energy trading stable and effective becomes increasingly challenging. In this paper, we propose a deep reinforcement learning-based double auction trading strategy for energy. Through the deep reinforcement learning algorithm, buyers and sellers gradually learn the environment by treating three elements, the total supply, the total demand, and their own supply or demand, as the state, and by treating both the bidding price and the bidding quantity as the bidding strategy. Simulation results indicate that as learning proceeds and reaches convergence, the cost that buyers pay in the auction decreases significantly, while the profit that sellers earn in the auction increases.
Key Words: Smart Grid, Energy Trading, Double Auction, Deep Reinforcement Learning
The MGO provides the following trading services for all participants: 1) Collecting the trading information from buyers and sellers, the MGO monitors and regulates the operation of the energy market. 2) Carrying out a reasonable auction mechanism, the MGO guarantees the balance of supply and demand. 3) Adopting payment and allocation rules, the MGO ensures the bilateral flow of electricity and price information. The basic structure of the market model is shown in Figure 1.

The electricity is transacted according to a double auction mechanism over discrete time slots under the regulation of the MGO. The buyers aim to purchase power from the grid at a relatively low price, while the sellers are willing to sell power to the grid at higher prices to gain greater benefits. In a discrete time slot t, when there are both sellers and buyers in the market, the MGO first gathers the buyers' bid information $(v_{b,i}, p_{b,i})$, indicating the total volume and unit price that each buyer is willing to pay, and the sellers' bid information $(v_{s,j}, p_{s,j})$ that each seller is willing to accept. Then the MGO determines the trading price and volume of each buyer and seller based on the valid price and the allocation rules. Finally, the MGO allocates energy from the sellers to the buyers and transfers the money from the buyers to the sellers.

Figure 1: Market Structure

The detailed trading process is as follows:

1): At the beginning of time slot t, each buyer from set B and each seller from set S report their demand and supply to the MGO.

2): The MGO computes and announces the total demand and supply at time slot t. Each participant submits a bid $(v, p)$ to the MGO based on the total demand and supply together with its own needs, where v represents the amount of energy it is willing to buy or sell, and p represents the price it is willing to pay or accept for each unit of energy.

3): The MGO calculates the valid prices $p_{v,b,t}$ and $p_{v,s,t}$ for buyers and sellers, respectively. Any buyer that satisfies $v_{b,i} > p_{v,b,t}$ and any seller that satisfies $v_{s,j} < p_{v,s,t}$ wins the bid.

4): Based on the winners in step 3), the MGO updates the winning buyer set and the winning seller set:
$$B_t = \{\, i \mid v_{b,i} > p_{v,b,t} \,\}, \qquad S_t = \{\, j \mid v_{s,j} < p_{v,s,t} \,\} \quad (1)$$

5): The MGO decides the trading amounts of the winning agents, which are derived according to the following two cases.
Case A: $\sum_{i \in B_t} v_{b,i} \le \sum_{j \in S_t} v_{s,j}$:
$$Q(b, i) = v_{b,i}, \qquad Q(s, j) = v_{s,j} - \frac{\Delta}{|S_t|} \quad (2)$$
Case B: $\sum_{j \in S_t} v_{s,j} \le \sum_{i \in B_t} v_{b,i}$:
$$Q(b, i) = v_{b,i} - \frac{\Delta}{|B_t|}, \qquad Q(s, j) = v_{s,j} \quad (3)$$
Notice that $|B_t|$ and $|S_t|$ represent the sizes of the winning buyer and seller sets at time step t, and $\Delta$ is the difference between total demand and total supply.

6): The MGO updates the energy demands or supplies for the next time step.

From the trading process above, we can see that computing the valid price is a key step in the double auction. It proceeds as follows:

1): For the active buyers $i \in B_t$, sort the prices they bid in descending order, and for the active sellers $j \in S_t$, sort the prices they ask in ascending order:
$$p_{b,1} > p_{b,2} > \dots > p_{b,n}, \qquad p_{s,1} < p_{s,2} < \dots < p_{s,m} \quad (4)$$

2): The valid prices are then set from the sorted prices at indices l and k, according to how the sorted bid and ask prices cross:
$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \quad (6)$$
Case C: if $p_{s,k+1} \ge p_{b,l} \ge p_{s,k}$ and $\sum_{i=1}^{l-1} v_{b,i} \le \sum_{j=1}^{k} v_{s,j} \le \sum_{i=1}^{l} v_{b,i}$, then
$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \quad (7)$$
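To make the clearing procedure concrete, the Python sketch below implements the quantity allocation of Eqs. (2)-(3) and the price sorting of Eq. (4), assuming the winning sets of Eq. (1) and the valid prices have already been determined by the MGO; all function and variable names here are illustrative and not taken from the paper.

```python
# Minimal sketch of the MGO allocation step (Eqs. 2-3), assuming the winning
# buyer/seller sets are already known and non-empty. Names are illustrative.

def allocate_quantities(demand, supply):
    """demand: {buyer_id: v_b_i}   requested volumes of winning buyers
       supply: {seller_id: v_s_j}  offered volumes of winning sellers
       Returns (buyer_allocations, seller_allocations)."""
    total_demand = sum(demand.values())
    total_supply = sum(supply.values())
    delta = abs(total_demand - total_supply)  # imbalance Delta between demand and supply

    if total_demand <= total_supply:
        # Case A (Eq. 2): buyers receive exactly what they asked for; the surplus
        # delta is shared equally among the |S_t| winning sellers.
        q_buy = dict(demand)
        q_sell = {j: v - delta / len(supply) for j, v in supply.items()}
    else:
        # Case B (Eq. 3): sellers sell everything they offered; the shortage
        # delta is shared equally among the |B_t| winning buyers.
        q_buy = {i: v - delta / len(demand) for i, v in demand.items()}
        q_sell = dict(supply)
    return q_buy, q_sell


def sorted_prices(bid_prices, ask_prices):
    """Eq. (4): bids in descending order, asks in ascending order,
    as preparation for picking the valid-price indices l and k."""
    return sorted(bid_prices, reverse=True), sorted(ask_prices)
```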
We first formulate the double auction scheme, and then introduce the deep reinforcement learning approach used to learn the optimal strategies of buyers and sellers.

3.1 Markov Decision Process
In this paper, we apply a finite Markov decision process (MDP) with discrete time steps to formulate the double auction scheme. Specifically, the buyers and sellers involved in the trading are regarded as agents who aim to pay the least cost and gain the most benefit, respectively, and the state of the current time t is only related to the state and action of the previous time.

S denotes the system state space, and $S_b$ and $S_s$ respectively denote the possible states of the buyers and of the sellers. We consider $s(b, i, t) \in S_b$ to be the state of buyer i at time t. We propose to form the state space from the buyers' total demand D, the sellers' total supply P, and each buyer's own demand $d_i$, because the demand and supply relationship has a decisive impact on each buyer's bidding. When supply exceeds demand, buyers will choose to raise the bidding price and reduce the bidding amount; when demand exceeds supply, the result is the opposite. Therefore the state of buyer i at time t is defined as:
$$s(b, i, t) = \{D, P, d_i\} \quad (8)$$
The above analysis applies equally to the sellers: $s(s, j, t) \in S_s$ is the state of seller j at time t, defined in the same way.

A represents the set of available actions of the trading participants, and a(t) is the bidding price and quantity at time slot t. In our trading model, we propose a two-dimensional tuple $a(t) = \{p_t, q_t\}$, combining both bidding price and quantity, to represent the action for deep reinforcement learning. The action of buyer i at time t is represented as $a(b, i, t) = \{p_{b,i}, v_{b,i}\}$, where $p_{b,i}$ is the bidding price and $v_{b,i}$ is the purchase amount. Notice that, to be practical, we assume that the purchase amount of buyer i at time t cannot exceed its demand:
$$v_{b,i} \le d_{i,t} \quad (10)$$
Similarly, the action of seller j at time t is $a(s, j, t) = \{p_{s,j}, v_{s,j}\}$, and the sold amount of seller j at time t cannot exceed its supply:
$$v_{s,j} \le u_{j,t} \quad (11)$$
P is defined as the transition function. We collectively represent the state of the buyers and sellers at time t as s(t). In this paper, the state transition probability from state s(t) to s(t + 1) is denoted as $p_t : s(t) \times a(t) \to s(t + 1)$, which meets the definition of an MDP in that the state of the current time is only related to the state and action of the previous time.

R is the reward function, and the immediate reward at time step t is defined as $r_t$. For buyer i, the reward at the current time t consists of two parts: the cost of purchasing energy and the dissatisfaction from not meeting its demand at time t:
$$r(b, i, t) = \alpha \, p_{v,b} \, Q(b, i) + (1 - \alpha)\,\big(d_{i,t} - Q(b, i)\big) \quad (12)$$
where α is a coefficient in the range [0, 1] that balances the cost and the dissatisfaction. Also, the reward of seller j at the current time is its profit.
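As a concrete illustration of the buyer-side formulation, the sketch below assembles the state of Eq. (8), enforces the volume constraint of Eq. (10), and evaluates the reward of Eq. (12); the class and method names are our own, and the structure is only a minimal reading of the definitions above.

```python
# Minimal sketch of the buyer-side MDP quantities (Eqs. 8, 10, 12).
# The dataclass and its field names are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class BuyerMDP:
    alpha: float  # trade-off coefficient in [0, 1] between cost and dissatisfaction

    def state(self, total_demand, total_supply, own_demand):
        # Eq. (8): s(b, i, t) = {D, P, d_i}
        return (total_demand, total_supply, own_demand)

    def action(self, bid_price, bid_volume, own_demand):
        # Eq. (10): the purchase amount may not exceed the buyer's own demand
        return (bid_price, min(bid_volume, own_demand))

    def reward(self, valid_price, allocated, own_demand):
        # Eq. (12): weighted sum of purchase cost and unmet-demand dissatisfaction
        cost = valid_price * allocated
        dissatisfaction = own_demand - allocated
        return self.alpha * cost + (1.0 - self.alpha) * dissatisfaction
```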
3.2 Deep Reinforcement Learning
In the energy trading market, it is difficult for buyers and sellers to decide their bidding strategies via an analytical approach, due to the uncertainty of future energy prices and of the supply-demand relations. Since deep reinforcement learning (DRL) is an effective way to obtain optimal strategies in a specific environment, we utilize DRL in the trading model to find the optimal bidding strategy for both buyers and sellers. The structure of the proposed deep reinforcement learning based double auction scheme is illustrated in Figure 2.

Figure 2: Deep Reinforcement Learning Structure

In our reinforcement learning model, both buyers and sellers are considered as agents that learn their best bidding strategies by observing the rewards of their interactions with the MGO over time. At a discrete time step t, the agent observes the supply and demand relationship and obtains the state value s(t). It then takes an action $a(t) = \{p_t, q_t\}$ based on the output of the neural network, the state-action value Q(s, a), which indicates the cumulative reward obtained by the agent when interacting with the environment using action a in state s. Notice that, for buyers, the minimum value in the output is selected, while for sellers the choice is the opposite. Next, the MGO determines the valid price and the energy allocation for each seller and buyer.
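Following this selection rule, a small sketch of how an action could be drawn from the Q-network output is shown below; the discretized set of candidate (price, quantity) actions is our assumption for illustration, since the paper does not spell out the discretization.

```python
import numpy as np

# Sketch of action selection from the Q-network output, following the rule in
# the text: buyers pick the action with the minimum Q-value (least expected
# cost), sellers the one with the maximum (largest expected profit).
def select_action(q_values, actions, agent_type):
    """q_values: array of Q(s, a) for every candidate action
       actions:  list of (price, quantity) tuples, same length as q_values"""
    idx = np.argmin(q_values) if agent_type == "buyer" else np.argmax(q_values)
    return actions[idx]
```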
Once the trading is complete, the transition of the environment from state s(t) to s(t + 1) generates a reward r(t), reflecting the immediate evaluation of the action a(t) at state s(t). The state s(t), action a(t), reward r(t), and next-step state s(t + 1) form an experience tuple, defined as [s(t), a(t), r(t), s(t + 1)], which describes one interaction with the environment and is stored in the replay buffer for the training process. We use two similar buffers to store the experiences of buyers and sellers, respectively.
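One straightforward way to realize the two replay buffers described above is sketched below; the deque-based storage and the capacity value are our own choices rather than details given in the paper.

```python
import random
from collections import deque

# Sketch of a replay buffer holding experience tuples [s_t, a_t, r_t, s_{t+1}].
# One instance is kept for buyers and one for sellers, as described in the text;
# the capacity is an arbitrary placeholder.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)


buyer_buffer = ReplayBuffer()
seller_buffer = ReplayBuffer()
```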
To approximate the state-action value, a deep neural network is introduced that takes the states as input and generates the state-action value $Q(s, a) \approx Q(s, a, \theta)$, where θ is the parameter of the neural network. The proposed deep neural network is a fully-connected network that consists of five layers, as shown in Figure 3.

Figure 3: Deep Neural Network Structure

We use an RNN layer as the input layer, which is fed by the time series of state values of length m and is represented by the green circles in Figure 3. Specifically, the input of the first cell is $s_{t-m}$, which represents the state at time t − m; the first RNN cell's parameters I and output $y_{t-m}$ are passed into the second cell, and this process is repeated until the last cell. By concatenating the state information, the output of the RNN layer is fed into one linear layer, followed by two noisy layers, represented by the red and blue circles in Figure 3, respectively. The limitations of the traditional ε-greedy exploration policy are clear in many conditions: weights with greater uncertainty introduce more variability into the decisions made by the policy, which has potential for exploratory actions [10]. The scheme of the noisy layer is as follows:
$$y = (\mu^{\omega} + \sigma^{\omega} \odot \varepsilon^{\omega})\,x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b} \quad (14)$$
where $\varepsilon^{\omega}$ and $\varepsilon^{b}$ are random variables. By doing so, Eq. (14) can be used in place of the standard linear layer $y = \omega x + b$.
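The noisy layer of Eq. (14) can be implemented, for instance, as the following PyTorch-style module; the initialization constants and the use of independent Gaussian noise are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the noisy linear layer of Eq. (14):
#   y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b,
# with the noise eps resampled at every forward pass. This is a generic
# implementation, not the authors' code; initial values are illustrative.
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)  # fresh noise each forward pass
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```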
The last layer is the dueling layer; the dueling network is proposed to obtain better policy evaluation in the presence of many similar-valued actions [11]. In the double auction problem, the state s is only related to the supply and demand relations in the market, and an agent's actions do not affect the state estimation in any relevant way, since the agent does not know the supply and demand of the others. The proposed dueling network can be seen as a single network with two streams, illustrated by the purple circles in Figure 3. The two streams share the same preceding part of the network and are merged by a special aggregator to produce the state-action value:
$$Q(s, a) = v_{\lambda}\big(f_{\zeta}(s)\big) + a_{\psi}\big(f_{\zeta}(s), a\big) - \frac{1}{N} \sum_{a'} a_{\psi}\big(f_{\zeta}(s), a'\big) \quad (15)$$
where ζ, λ, and ψ are respectively the parameters of the shared part, the value stream, and the action advantage stream.
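The aggregator of Eq. (15) reduces to a one-line operation on the two stream outputs, as in the sketch below; the value head $v_\lambda$ and advantage head $a_\psi$ themselves are assumed to be built from the layers described above and are not shown.

```python
import torch

# Sketch of the dueling aggregator of Eq. (15): the value stream and the
# advantage stream are merged by subtracting the mean advantage over the
# N candidate actions. Shapes follow common dueling-DQN practice.
def dueling_q_values(value, advantage):
    """value:     (batch, 1) tensor, v_lambda(f_zeta(s))
       advantage: (batch, n_actions) tensor, a_psi(f_zeta(s), a)"""
    return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example with random stream outputs for a batch of 4 states and 9 candidate actions
q = dueling_q_values(torch.randn(4, 1), torch.randn(4, 9))
```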
Finally, we use two networks with this structure in the learning process: the main network with parameters θ and the target network with parameters θ′, in order to avoid overestimation. Both networks have the same parameters initially, and θ′ is updated to be equal to θ once every C steps.

After sampling a minibatch $[S_t, A_t, R_t, S_{t+1}]$ of size B from the buffer, we use $S_t$ as the input of the main network and use $A_t$ to choose the action-state value in the output of the main network $Q(S_t, \theta)$, which gives the evaluation Q value, illustrated by the red lines in Figure 2:
$$Q_{eval} = Q(S_t, A_t, \theta) \quad (16)$$
To calculate the target Q value, we first find the best action $a^{*}$ of the state $S_{t+1}$, the one that corresponds to the minimum (for buyers) or maximum (for sellers) action-state value $Q(S_{t+1}, a', \theta)$ in the main network with input $S_{t+1}$. Then the selected action $a^{*}$ and the reward $R_t$ from the minibatch are used to calculate the target Q value in the target network, whose input is also $S_{t+1}$, as indicated by the blue lines in Figure 2:
$$Q_{tar} = R_t + \gamma \, Q\big(S_{t+1}, \arg\min_{a'} Q(S_{t+1}, a', \theta), \theta'\big) \quad (17)$$
where γ is the discount factor, which indicates the degree of influence of the future reward on the current reward: the smaller γ is, the more the agent pays attention to the current reward, and vice versa. We update the parameters of the main network θ by performing gradient descent on the loss function computed from the difference between the target Q value and the evaluation Q value:
$$L(t) = \sum_{i=1}^{B} \big(Q_{tar} - Q_{eval}\big)^2 \quad (18)$$
The pseudocode of our algorithm is given in Algorithm 1.
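Putting Eqs. (16)-(18) together, one possible form of the update for a buyer agent is sketched below; `main_net`, `target_net`, the discount value, and the summed squared error are our assumptions standing in for the network of Figure 3, and the (price, quantity) pairs are assumed to be discretized into an indexed action set as in the selection sketch earlier. A seller agent would use argmax in place of argmin.

```python
import torch
import torch.nn.functional as F

# Sketch of the double-DQN style update of Eqs. (16)-(18) for a buyer agent.
# `main_net` and `target_net` map a batch of states to Q-values over all
# candidate actions; both are assumptions standing in for the paper's network.
def td_loss(main_net, target_net, batch, gamma=0.95):
    states, actions, rewards, next_states = batch  # [S_t, A_t, R_t, S_{t+1}]
    # Eq. (16): evaluation Q value of the actions actually taken
    q_eval = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Eq. (17): choose a* with the main network, evaluate it with the target network
        a_star = main_net(next_states).argmin(dim=1, keepdim=True)  # argmax for sellers
        q_next = target_net(next_states).gather(1, a_star).squeeze(1)
        q_tar = rewards + gamma * q_next
    # Eq. (18): squared difference summed over the minibatch
    return F.mse_loss(q_eval, q_tar, reduction="sum")
```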
4 PERFORMANCE EVALUATION
In this section, we introduce the performance evaluation to demonstrate the effectiveness of our energy trading strategies. We first present the simulation setup and then show the evaluation results.

The buyers' average cost decreases and the sellers' average profit increases, which reflects that, by using reinforcement learning, the agents can make better bidding strategies. Once learning reaches convergence, the performance of the reinforcement learning agents is better than that of the empirical ones.

5 CONCLUSION
In this paper, a deep reinforcement learning based algorithm is proposed for synchronous bilateral energy auctions in the trading market, using the supply and demand relations as well as the agents' own needs as the input. Simulation experiments demonstrate that buyers and sellers can autonomously learn effective bidding strategies from the trading environment. Through deep reinforcement learning, buyers can reduce their costs and sellers can increase their profits in the market, respectively.

REFERENCES
[1] J. Gao, Y. Xiao, J. Liu, W. Liang, and C. L. P. Chen, A survey of communication/networking in smart grids, Future Generation Computer Systems, Vol.28, No.2, 391-404, 2012.
[2] Vytelingum P., Cliff D., Jennings N. R., Strategic bidding in continuous double auctions, Artificial Intelligence, Vol.172, No.14, 1700-1729, 2008.
[3] An Dou, Yang Qingyu, Yu Wei, Yang Xinyu, Fu Xinwen, Zhao Wei, SODA: strategy-proof online double auction scheme for multimicrogrids bidding, IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol.48, No.7, 1177-1190, 2017.
[4] PankiRaj J. S., Yassine A., Choudhury S., An auction mechanism for profit maximization of peer-to-peer energy trading in smart grids, Procedia Computer Science, Vol.151, 361-368, 2019.
[5] Ramachandran B., Srivastava S. K., Edrington C. S., Cartes D. A., An intelligent auction scheme for smart grid market using a hybrid immune algorithm, IEEE Transactions on Industrial Electronics, Vol.58, No.10, 4603-4612, 2010.
[6] Ma J., Deng J., Song L., Han Z., Incentive mechanism for demand side management in smart grid using auction, IEEE Transactions on Smart Grid, Vol.5, No.3, 1379-1388, 2014.
[7] Xu H., Sun H., Nikovski D., Kitamura S., Mori K., Hashimoto H., Deep reinforcement learning for joint bidding and pricing of load serving entity, IEEE Transactions on Smart Grid, 2019.
[8] Wang H., Huang T., Liao X., Abu-Rub H., Chen G., Reinforcement learning for constrained energy trading games with incomplete information, IEEE Transactions on Cybernetics, Vol.47, No.10, 3404-3416, 2016.
[9] Wang N., Xu W., Shao W., Xu Z., A Q-cube framework of reinforcement learning algorithm for continuous double auction among microgrids, Energies, Vol.12, No.15, 2891, 2019.
[10] Fortunato M., Azar M. G., Piot B., Menick J., Osband I., Graves A., et al., Noisy networks for exploration, arXiv preprint arXiv:1706.10295, 2017.
[11] Wang Z., Schaul T., Hessel M., Van Hasselt H., Lanctot M., De Freitas N., Dueling network architectures for deep reinforcement learning, arXiv preprint arXiv:1511.06581, 2015.
[12] Sutton R. S., Barto A. G., Introduction to Reinforcement Learning, Cambridge: MIT Press, 1998.