
Energy Trading in Smart Grid: A Deep Reinforcement Learning-based Approach
Feiye Zhang1 , Qingyu Yang2
1. School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]
2. SKLMSE Lab, MOE Key Laboratory for Intelligent Networks and Network Security, School of Automation Science and
Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]

Abstract: To achieve efficient operation of the smart grid, an appropriate energy trading strategy plays an important role in reducing the costs of the participating agents and in alleviating pressure on the grid. However, as the number of participants in the smart grid grows, stable and effective operation of energy trading becomes increasingly challenging. In this paper, we propose a deep reinforcement learning-based double auction trading strategy. Through the deep reinforcement learning algorithm, buyers and sellers gradually learn the environment by treating the total supply, the total demand, and their own supply or demand as the state, and by treating both the bidding price and the bidding quantity as the bidding strategy. Simulation results indicate that as learning proceeds and reaches convergence, the cost that buyers pay in the auction decreases significantly and the profit that sellers earn in the auction increases.

Key Words: Smart Grid, Energy Trading, Double Auction, Deep Reinforcement Learning

1 INTRODUCTION

The smart grid, as a typical cyber-physical system, has attracted significant research interest due to its economic and technological benefits [1]. To improve energy trading efficiency, the double auction is often used to solve the problem of bidding with multiple participants in the energy trading market [2]. Research on double auction mechanism design has focused on two aspects: the first is novel energy auction frameworks [3][6]; the other is applying optimization techniques to determine the bidding price and the allocation rule in the trading scheme so as to improve the benefits of the participants [4][5]. For instance, Dou An et al. [3] present an energy trading theory for smart grids and design a strategy-proof online double auction scheme. PankiRaj et al. [4] propose a profit maximization algorithm for energy suppliers by utilizing a peer-to-peer energy trading scheme in a smart grid.

The application of the double auction mechanism in the smart grid is often limited by a dynamic and uncertain bidding environment. The bidding price and quantity have great impacts on the allocation results, and traditional optimization methods can hardly handle this complex bidding behavior, which reduces the long-term profitability of trading participants. Deep reinforcement learning algorithms have been applied to solve decision-making problems in auctions [7]-[9]. For instance, Hanchen Xu et al. apply the deep deterministic policy gradient algorithm to solve for optimal bidding and pricing policies. The existing literature shows the potential of applying a deep reinforcement learning algorithm to a continuous double auction mechanism, but most of these papers aim either at maximizing the payoffs of sellers or at reducing the costs of buyers. Therefore, putting forward a new DRL-based double auction scheme whose optimization model covers both buyers and sellers is challenging and desirable.

In this paper, we first investigate a double auction mechanism that allows all agents to submit their price and volume to participate in the auctions. To promote trading efficiency, i.e., to maximize the benefits of all agents, the long-term reward must be optimized. To this end, we propose a deep reinforcement learning (DRL) framework to maximize the long-term profit of both sellers and buyers in the energy trading market. The simulation results suggest that the profit of the sellers increases noticeably and the cost of the buyers is also reduced.

The remainder of this paper is arranged as follows. We introduce the double auction mechanism and present our trading model in Section 2. In Section 3, we present our trading framework using deep reinforcement learning in detail. In Section 4, we show the performance evaluation and results. Finally, we conclude this paper in Section 5.

This work was supported in part by the National Science Foundation of China under Grants 61973247 and 61673315, the China Postdoctoral Science Foundation under Grant 2018M643659, and the Shaanxi Postdoctoral Science Foundation under Grant 2017BSHEDZZ82.
2 TRADING SCHEME

2.1 Energy Trading Market

We consider an energy trading market with multiple sellers and buyers. The microgrid operator (MGO) coordinates the energy trading market and provides the following trading services for all participants: 1) collecting the trading information from buyers and sellers, the MGO monitors and regulates the operation of the energy market; 2) carrying out a reasonable auction mechanism, the MGO guarantees the balance between supply and demand; 3) adopting payment and allocation rules, the MGO ensures the bilateral flow of electricity and price information. The basic structure of the market model is shown in Figure 1.

[Figure 1: Market Structure]

Electricity is transacted according to a double auction mechanism in discrete time slots under the regulation of the MGO. The buyers aim to purchase power from the grid at a relatively low price, while the sellers are willing to sell power to the grid at higher prices to gain greater benefits. In a discrete time slot $t$, when there are both sellers and buyers in the market, the MGO first gathers the buyers' bid information $(v_{b,i}, p_{b,i})$, indicating the total volume and the unit price that each buyer is willing to pay, and the sellers' bid information $(v_{s,j}, p_{s,j})$ that the sellers are willing to accept. Then the MGO determines the trading price and volume of each buyer and seller based on the valid price and the allocation rules. Finally, the MGO allocates energy from the sellers to the buyers and transfers the corresponding payments from the buyers to the sellers.
2.2 Double Auction Mechanism

The trading mechanism used in this paper is a typical double auction. Notice that the buyers and sellers must make their decisions based on the current state, without knowing the trading information that follows. Buyers and sellers from the sets $B$ and $S$ participate in the auction at time slot $t \in T$. First, each participant submits its bid to the MGO according to its demand or supply. Then the MGO decides the valid price and the allocations. The workflow of the double auction scheme used in this paper is as follows:

1): At the beginning of time slot $t$, each buyer from set $B$ and each seller from set $S$ report their demand and supply to the MGO.

2): The MGO computes and announces the total demand and supply at time slot $t$. Each participant submits its bid $(v, p)$ to the MGO based on the total demand and supply together with its own needs, where $v$ represents the amount of energy it is willing to buy or sell, and $p$ represents the price it is willing to pay or accept for each unit of energy.

3): The MGO calculates the valid prices $p_{v,b,t}$ and $p_{v,s,t}$ for buyers and sellers, respectively. Any buyer whose bid price satisfies $p_{b,i} > p_{v,b,t}$ and any seller whose ask price satisfies $p_{s,j} < p_{v,s,t}$ wins the bid.

4): Based on the winners in step 3), the MGO updates the winning buyer set and the winning seller set:

$$B_t = \{\, i \mid p_{b,i} > p_{v,b,t} \,\}, \qquad S_t = \{\, j \mid p_{s,j} < p_{v,s,t} \,\} \qquad (1)$$

5): The MGO decides the trading amount of the winning agents, which is derived according to the following two cases.

Case A: $\sum_{i \in B_t} v_{b,i} \le \sum_{j \in S_t} v_{s,j}$:

$$Q(b,i) = v_{b,i}, \qquad Q(s,j) = v_{s,j} - \frac{\Delta}{|S_t|} \qquad (2)$$

Case B: $\sum_{j \in S_t} v_{s,j} \le \sum_{i \in B_t} v_{b,i}$:

$$Q(b,i) = v_{b,i} - \frac{\Delta}{|B_t|}, \qquad Q(s,j) = v_{s,j} \qquad (3)$$

Notice that $|B_t|$ and $|S_t|$ represent the numbers of winning buyers and sellers at time step $t$, and $\Delta$ is the difference between total demand and total supply.

6): The MGO updates the energy demands or supplies for the next time step.
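To make the allocation rule of steps 4)-5) concrete, the following Python sketch (our illustration, not the authors' code; the function and variable names are hypothetical) spreads the imbalance $\Delta$ evenly across the longer side of the market, as in Equations (2)-(3):

```python
from typing import Dict

def allocate(winning_demands: Dict[str, float],
             winning_supplies: Dict[str, float]) -> Dict[str, float]:
    """Sketch of the allocation rule in Equations (2)-(3).

    winning_demands:  volume v_{b,i} of each winning buyer i.
    winning_supplies: volume v_{s,j} of each winning seller j.
    Returns the traded quantity Q for every winning agent.
    """
    total_demand = sum(winning_demands.values())
    total_supply = sum(winning_supplies.values())
    delta = abs(total_demand - total_supply)   # imbalance Delta

    trades = {}
    if total_demand <= total_supply:
        # Case A: buyers are fully served, sellers share the surplus Delta equally.
        for i, v in winning_demands.items():
            trades[i] = v
        for j, v in winning_supplies.items():
            trades[j] = v - delta / len(winning_supplies)
    else:
        # Case B: sellers are fully cleared, buyers share the shortage Delta equally.
        for i, v in winning_demands.items():
            trades[i] = v - delta / len(winning_demands)
        for j, v in winning_supplies.items():
            trades[j] = v
    return trades

# Example: total demand 7 vs. total supply 9, so each of the two sellers gives up Delta/2 = 1.
print(allocate({"b1": 3.0, "b2": 4.0}, {"s1": 5.0, "s2": 4.0}))
```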
From the trading process above, computing the valid price is a key step of the double auction; it is carried out as follows:

1): For the active buyers $i \in B_t$, sort the prices they bid in descending order, and for the active sellers $j \in S_t$, sort the prices they ask in ascending order:

$$p_{b,1} > p_{b,2} > \dots > p_{b,n}, \qquad p_{s,1} < p_{s,2} < \dots < p_{s,m} \qquad (4)$$

2): Sort all buyers' volumes according to their prices in descending order, and all sellers' volumes according to their prices in ascending order.

3): To determine the valid price, we distinguish the following three cases.

Case A: $p_{s,m} \le p_{b,n}$:

$$p_{v,b} = p_{b,n}, \qquad p_{v,s} = p_{s,m} \qquad (5)$$

Case B: $p_{b,l} \ge p_{s,k} \ge p_{b,l+1}$ and $\sum_{j=1}^{k-1} v_{s,j} \le \sum_{i=1}^{l} v_{b,i} \le \sum_{j=1}^{k} v_{s,j}$:

$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \qquad (6)$$

Case C: $p_{s,k+1} \ge p_{b,l} \ge p_{s,k}$ and $\sum_{i=1}^{l-1} v_{b,i} \le \sum_{j=1}^{k} v_{s,j} \le \sum_{i=1}^{l} v_{b,i}$:

$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \qquad (7)$$
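A Python sketch of this valid-price rule is given below for illustration; the function name, the index search, and the fallback branch for unmatched cases are our own assumptions:

```python
from typing import List, Tuple

def valid_prices(bids: List[Tuple[float, float]],
                 asks: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Sketch of the valid-price rule in Equations (4)-(7).

    bids: (price p_{b,i}, volume v_{b,i}) of active buyers.
    asks: (price p_{s,j}, volume v_{s,j}) of active sellers.
    Returns (p_{v,b}, p_{v,s}).
    """
    # Steps 1)-2): buyers sorted by price in descending order, sellers in ascending order.
    bids = sorted(bids, key=lambda x: x[0], reverse=True)
    asks = sorted(asks, key=lambda x: x[0])
    pb = [p for p, _ in bids]
    vb = [v for _, v in bids]
    ps = [p for p, _ in asks]
    vs = [v for _, v in asks]
    n, m = len(pb), len(ps)

    # Case A (Eq. 5): even the highest ask does not exceed the lowest bid.
    if ps[-1] <= pb[-1]:
        return pb[-1], ps[-1]

    cum_b = [sum(vb[:i + 1]) for i in range(n)]   # cumulative buyer volume
    cum_s = [sum(vs[:j + 1]) for j in range(m)]   # cumulative seller volume

    # Case B (Eq. 6): p_{b,l} >= p_{s,k} >= p_{b,l+1} and
    # sum_1^{k-1} v_s <= sum_1^l v_b <= sum_1^k v_s (0-based indices below).
    for l in range(n - 1):
        for k in range(m):
            price_ok = pb[l] >= ps[k] >= pb[l + 1]
            lo = cum_s[k - 1] if k > 0 else 0.0
            if price_ok and lo <= cum_b[l] <= cum_s[k]:
                return pb[l], ps[k]

    # Case C (Eq. 7): p_{s,k+1} >= p_{b,l} >= p_{s,k} and
    # sum_1^{l-1} v_b <= sum_1^k v_s <= sum_1^l v_b.
    for k in range(m - 1):
        for l in range(n):
            price_ok = ps[k + 1] >= pb[l] >= ps[k]
            lo = cum_b[l - 1] if l > 0 else 0.0
            if price_ok and lo <= cum_s[k] <= cum_b[l]:
                return pb[l], ps[k]

    # Fallback when no case matches (the paper does not specify this situation).
    return pb[-1], ps[-1]

# Example: two buyers and two sellers with crossing prices.
print(valid_prices([(1.2, 3.0), (0.9, 2.0)], [(0.7, 2.0), (1.0, 4.0)]))
```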
3 OUR APPROACH

In this paper, we first present the MDP model of the double auction scheme, and then we introduce the deep reinforcement learning approach used to learn the optimal strategies of buyers and sellers.

3.1 Markov Decision Process

We apply a finite Markov decision process (MDP) with discrete time steps to formulate the double auction scheme. Specifically, the buyers and sellers involved in the trading are regarded as agents who aim to pay the least cost and to gain the most benefit, respectively, and the state at the current time $t$ is only related to the state and action at the previous time $t-1$. The MDP is defined by a four-element tuple $(S, A, P, R)$.

$S$ denotes the system state space, where $S_b$ and $S_s$ reflect the possible states of the buyers and sellers, respectively. We consider $s(b,i,t) \in S_b$ to be the state of buyer $i$ at time $t$. We propose to choose the buyers' total demand $D$, the sellers' total supply $P$, and each buyer's own demand $d_i$ to form the state space, because the supply and demand relationship has a decisive impact on each buyer's bidding. When supply exceeds demand, buyers will choose to raise the bidding price and reduce the bidding amount; when demand exceeds supply, the behavior is the opposite. Therefore the state of buyer $i$ at time $t$ is defined as:

$$s(b,i,t) = \{D, P, d_i\} \qquad (8)$$

The above analysis applies equally to the sellers. With $s(s,j,t) \in S_s$ being the state of seller $j$ at time $t$, we define the state of the seller as:

$$s(s,j,t) = \{D, P, p_j\} \qquad (9)$$

$A$ represents the set of available actions of the trading participants, and $a(t)$ is the bidding price and quantity at time slot $t$. In our trading model, we propose a two-dimensional tuple $a(t) = \{p_t, q_t\}$ combining both the bidding price and the bidding quantity to represent the action for deep reinforcement learning.

The action of buyer $i$ at time $t$ is represented as $a(b,i,t) = \{p_{b,i}, v_{b,i}\}$, where $p_{b,i}$ is the bidding price and $v_{b,i}$ is the purchase amount. Notice that, to be practical, we assume that the purchase amount of buyer $i$ at time $t$ cannot exceed its demand:

$$v_{b,i} \le d_{i,t} \qquad (10)$$

Similarly, the action of seller $j$ at time $t$ is $a(s,j,t) = \{p_{s,j}, v_{s,j}\}$, and the sold amount of seller $j$ at time $t$ cannot exceed its supply:

$$v_{s,j} \le u_{j,t} \qquad (11)$$

$P$ is defined as the transition function. We collectively represent the state of buyers and sellers at time $t$ as $s(t)$. In this paper, the state transition from state $s(t)$ to $s(t+1)$ is denoted as $p_t : s(t) \times a(t) \rightarrow s(t+1)$, which meets the definition of an MDP in that the state at the current time is only related to the state and action at the previous time.

$R$ is the reward function. The immediate reward at time step $t$ is defined as $r_t$. For buyer $i$, the reward at the current time $t$ consists of two parts: the cost of purchasing energy and the dissatisfaction caused by unmet demand at time $t$:

$$r(b,i,t) = \alpha\,\big(p_{v,b} \cdot Q(b,i)\big) + (1-\alpha)\,\big(d_{i,t} - Q(b,i)\big) \qquad (12)$$

where $\alpha$ is a coefficient in the range $[0,1]$ that balances the cost and the dissatisfaction. Also, the reward of seller $j$ at the current time is its profit, defined as follows:

$$r(s,j,t) = p_{v,s} \cdot Q(s,j) \qquad (13)$$
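For concreteness, the state and reward definitions of Eqs. (8), (12) and (13) can be sketched in Python as follows; the function names and the default value of $\alpha$ are illustrative assumptions, not the authors' code:

```python
import numpy as np

def buyer_state(total_demand: float, total_supply: float, own_demand: float) -> np.ndarray:
    # Eq. (8): s(b, i, t) = {D, P, d_i}
    return np.array([total_demand, total_supply, own_demand], dtype=np.float32)

def buyer_reward(p_vb: float, q_bi: float, demand: float, alpha: float = 0.5) -> float:
    # Eq. (12): weighted sum of purchase cost and unmet-demand dissatisfaction.
    # Both terms are costs, so the buyer agent tries to make this quantity small.
    return alpha * (p_vb * q_bi) + (1.0 - alpha) * (demand - q_bi)

def seller_reward(p_vs: float, q_sj: float) -> float:
    # Eq. (13): the seller's profit from the cleared quantity.
    return p_vs * q_sj

# Example: a buyer with demand 4 is allocated 3 units at valid price 1.1,
# and a seller clears 3 units at valid price 0.9.
print(buyer_state(12.0, 10.0, 4.0), buyer_reward(1.1, 3.0, 4.0), seller_reward(0.9, 3.0))
```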

3.2 Deep Reinforcement Learning

In the energy trading market, it is difficult for buyers and sellers to decide their bidding strategies via an analytical approach, due to the uncertainty of future energy prices and supply-demand relations. Since deep reinforcement learning (DRL) is an effective way to obtain optimal strategies in a specific environment, we utilize DRL in the trading model to find the optimal bidding strategy for both buyers and sellers. The structure of the proposed deep reinforcement learning-based double auction scheme is illustrated in Figure 2.

[Figure 2: Deep Reinforcement Learning Structure]

In our reinforcement learning model, both buyers and sellers are considered as agents that learn their best bidding strategies by observing the rewards of their interactions with the MGO over time. At a discrete time step $t$, the agent observes the supply and demand relationship and gets the state value $s(t)$. It then takes an action $a(t) = \{p_t, q_t\}$ based on the output of the neural network, namely the state-action value $Q(s,a)$, which indicates the cumulative reward obtained by the agent when it interacts with the environment using action $a$ in state $s$. Notice that, for buyers, the action with the minimum value in the output is selected, while for sellers the choice is the opposite. Next, the MGO determines the valid price and the energy allocation for each seller and buyer. Once the trading is complete, the transition of the environment from state $s(t)$ to $s(t+1)$ generates a reward $r(t)$, reflecting the immediate evaluation of action $a(t)$ in state $s(t)$.

The state $s(t)$, action $a(t)$, reward $r(t)$ and next state $s(t+1)$ form an experience tuple, defined as $[s(t), a(t), r(t), s(t+1)]$, which describes one interaction with the environment and is stored in the replay buffer for training. We use two similar buffers to store the experiences of buyers and sellers, respectively.

To approximate the state-action value, a deep neural network is introduced that takes the states as input and generates the state-action value $Q(s,a) \approx Q(s,a,\theta)$, where $\theta$ denotes the parameters of the neural network. The proposed deep neural network is a fully connected network that consists of five layers, as shown in Figure 3.

[Figure 3: Deep Neural Network Structure]

We use an RNN layer as the input layer, which is fed by the time series of state values of length $m$ and is represented by the green circles in Figure 3. Specifically, the input of the first cell is $s_{t-m}$, the state at time $t-m$; the first RNN cell's parameters $I$ and output $y_{t-m}$ are passed into the second cell, and this process is repeated until the last cell.

By concatenating the state information, the output of the RNN layer is fed into one linear layer, followed by two noisy layers, represented by the red and blue circles in Figure 3, respectively. The limitations of the traditional $\epsilon$-greedy exploration policy are clear in many settings; weights with greater uncertainty introduce more variability into the decisions made by the policy, which has potential for exploratory actions [10]. The noisy layer is given by:

$$y = (\mu^{\omega} + \sigma^{\omega} \odot \varepsilon^{\omega})\,x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b} \qquad (14)$$

where $\varepsilon^{\omega}$ and $\varepsilon^{b}$ are random variables. In this way, Eq. (14) can be used in place of the standard linear layer $y = \omega x + b$.

The last layer is the dueling layer; the dueling network is proposed to obtain better policy evaluation in the presence of many similar-valued actions [11]. In the double auction problem, the state $s$ is only related to the supply and demand relations in the market, and the agent's actions do not affect the state estimation in any relevant way, since agents do not know the supply and demand of the others. The proposed dueling network can be seen as a single network with two streams, illustrated by the purple circles in Figure 3: it separates the value estimation into one stream for the state value and another for the action advantage over the $N$ actions, represented by $v_{\lambda}$ and $a_{\psi}$ respectively. The two streams share the preceding part of the network and are merged by a special aggregator to produce the state-action value:

$$Q(s,a) = v_{\lambda}(f_{\xi}(s)) + a_{\psi}(f_{\xi}(s), a) - \frac{1}{N}\sum_{a'} a_{\psi}(f_{\xi}(s), a') \qquad (15)$$

where $\xi$, $\lambda$ and $\psi$ are the parameters of the shared part, the value stream and the action-advantage stream, respectively.
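To make the network structure concrete, the PyTorch sketch below combines an RNN input layer, noisy layers in the spirit of Eq. (14), and the dueling aggregation of Eq. (15). It is an illustration under stated assumptions (the class names, the RNN hidden size, and the initialization scheme are ours), not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Noisy linear layer in the spirit of Eq. (14); a sketch, not the authors' code."""
    def __init__(self, in_features: int, out_features: int, sigma0: float = 0.5):
        super().__init__()
        scale = in_features ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-1, 1) / scale)
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0 / scale))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0 / scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps_w = torch.randn_like(self.sigma_w)   # epsilon^omega in Eq. (14)
        eps_b = torch.randn_like(self.sigma_b)   # epsilon^b in Eq. (14)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w, self.mu_b + self.sigma_b * eps_b)

class DuelingQNet(nn.Module):
    """RNN input -> one linear layer -> two noisy layers -> dueling head (Eq. (15)).
    Layer widths loosely follow Section 4.1.2; the RNN hidden size is our assumption."""
    def __init__(self, state_dim: int = 3, n_actions: int = 100):
        super().__init__()
        self.rnn = nn.RNN(input_size=state_dim, hidden_size=20, batch_first=True)
        self.fc = nn.Linear(20, 20)
        self.noisy1 = NoisyLinear(20, 100)
        self.noisy2 = NoisyLinear(100, 200)
        self.value = nn.Linear(200, 1)               # state-value stream v_lambda
        self.advantage = nn.Linear(200, n_actions)   # advantage stream a_psi

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, sequence_length, state_dim) time series of past states.
        _, h = self.rnn(states)                      # last hidden state summarises the sequence
        x = F.relu(self.fc(h.squeeze(0)))
        x = F.relu(self.noisy1(x))
        x = F.relu(self.noisy2(x))
        v = self.value(x)                            # (batch, 1)
        a = self.advantage(x)                        # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)   # aggregator of Eq. (15)

q = DuelingQNet()
print(q(torch.randn(8, 4, 3)).shape)  # torch.Size([8, 100])
```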
Finally, we use two networks with this structure in the learning process: the main network with parameters $\theta$ and the target network with parameters $\theta'$, which is used to avoid overestimation. Both networks have the same parameters initially, and $\theta'$ is updated to equal $\theta$ once every $C$ steps.

After sampling a minibatch $[S_t, A_t, R_t, S_{t+1}]$ of size $B$ from the buffer, we use $S_t$ as the input of the main network and use $A_t$ to select the corresponding state-action value from the output $Q(S_t, \theta)$ of the main network, which gives the evaluation Q value, illustrated by the red lines in Figure 2:

$$Q_{eval} = Q(S_t, A_t, \theta) \qquad (16)$$

To calculate the target Q value, we first find the best action $a^{*}$ of state $S_{t+1}$, i.e., the action corresponding to the minimum (for buyers) or maximum (for sellers) state-action value $Q(S_{t+1}, a', \theta)$ in the main network with input $S_{t+1}$. Then the selected action $a^{*}$ and the reward $R_t$ from the minibatch are used to calculate the target Q value in the target network, also with input $S_{t+1}$, as indicated by the blue lines in Figure 2:

$$Q_{tar} = R_t + \gamma\, Q\big(S_{t+1}, \arg\min\nolimits_{a'} Q(S_{t+1}, a', \theta), \theta'\big) \qquad (17)$$

(with $\arg\max$ in place of $\arg\min$ for sellers), where $\gamma$ is the discount factor indicating how strongly the future reward influences the current one: the smaller $\gamma$ is, the more the agent pays attention to the current reward, and vice versa. We update the parameters $\theta$ of the main network by performing gradient descent on the loss function computed from the difference between the target Q value and the evaluation Q value:

$$L(t) = \sum_{i=1}^{B} (Q_{tar} - Q_{eval})^2 \qquad (18)$$

The pseudocode of our algorithm is given in Algorithm 1.
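As a complementary illustration of Eqs. (16)-(18) (a PyTorch sketch of ours, distinct from Algorithm 1 below), the main and target networks could be instances of the DuelingQNet sketch above; the function name and the `minimize` flag, which reflects that buyers select the minimum Q value while sellers select the maximum, are assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_loss(main_net, target_net, batch, gamma: float, minimize: bool = True):
    """Sketch of Eqs. (16)-(18): evaluation Q, target Q, and the squared loss.

    batch: tensors (states, actions, rewards, next_states).
    minimize: True for buyers (smallest Q preferred), False for sellers.
    """
    states, actions, rewards, next_states = batch

    # Eq. (16): Q values of the actions that were actually taken.
    q_eval = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Action selection in the main network, evaluation in the target network,
        # i.e., the decoupling used to avoid overestimation.
        next_q_main = main_net(next_states)
        best = next_q_main.argmin(1) if minimize else next_q_main.argmax(1)
        next_q_target = target_net(next_states).gather(1, best.unsqueeze(1)).squeeze(1)
        # Eq. (17): bootstrapped target.
        q_tar = rewards + gamma * next_q_target

    # Eq. (18): squared differences summed over the minibatch.
    return F.mse_loss(q_eval, q_tar, reduction="sum")
```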
4 PERFORMANCE EVALUATION

In this section, we present the performance evaluation that demonstrates the effectiveness of our energy trading strategies. We first describe the simulation setup and then show the evaluation results.
Algorithm 1: Deep reinforcement learning algorithm in the double auction
1  Randomly initialize the parameters θ of the main network.
2  Initialize the parameters of the target network θ' = θ.
3  Initialize the prioritized replay buffer to capacity N.
4  Initialize the greedy number ε and the increment b.
5  for episode = 1 to E do
6      Determine the initial state s_{t,0} by Equations (8) and (9).
7      for step = 1 to T do
8          if a random number x ≥ ε then
9              Randomly choose the action a_t from the action space A.
10         else
11             Select the action a_t = argmin_a Q(s_t, a, θ) (argmax for sellers).
12         end
13         The MGO determines the valid price and the allocation rule.
14         The agent gets the reward r_t and the next state s_{t+1}.
15         Store the transition [s_t, a_t, r_t, s_{t+1}] in the buffer.
16         Randomly sample a minibatch of size B from the buffer.
17         Set Q_eval = Q(S_t, A_t, θ).
18         Set Q_tar = R_t + γ · Q(S_{t+1}, argmin_{a'} Q(S_{t+1}, a', θ), θ').
19         Perform gradient descent on Σ_{i=1}^{B} (Q_tar − Q_eval)^2 with respect to the main network parameters θ.
20         ε = ε + b.
21         Every C steps, reset θ' = θ.
22     end
23 end
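A minimal training-loop skeleton mirroring Algorithm 1 for a single agent might look as follows. It reuses the DuelingQNet and dqn_loss sketches given earlier and assumes a hypothetical `env` wrapper around the MGO auction, so it is an illustration rather than the authors' implementation:

```python
import random
from collections import deque

import torch

def train(env, main_net, target_net, optimizer, n_actions: int,
          episodes: int = 10000, steps: int = 24, gamma: float = 0.9,
          eps: float = 0.1, eps_inc: float = 0.01, batch_size: int = 32,
          buffer_size: int = 500, copy_every: int = 5, minimize: bool = True):
    """Skeleton of Algorithm 1; env.reset()/env.step() are assumed helpers that
    build states per Eqs. (8)-(9) and let the MGO clear the auction."""
    buffer = deque(maxlen=buffer_size)             # replay buffer of capacity N
    step_count = 0
    for episode in range(episodes):
        state = env.reset()                        # initial state (a (seq_len, 3) tensor)
        for _ in range(steps):
            # epsilon-greedy exploration: random action while the greedy number is small.
            if random.random() >= eps:
                action = random.randrange(n_actions)
            else:
                q = main_net(state.unsqueeze(0)).squeeze(0)
                action = int(q.argmin() if minimize else q.argmax())
            reward, next_state = env.step(action)  # MGO determines valid price and allocation
            buffer.append((state, action, reward, next_state))

            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                s, a, r, s2 = (torch.stack(x) if torch.is_tensor(x[0]) else torch.tensor(x)
                               for x in zip(*batch))
                loss = dqn_loss(main_net, target_net, (s, a, r, s2), gamma, minimize)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            eps = min(1.0, eps + eps_inc)          # line 20; the cap at 1 is our assumption
            step_count += 1
            if step_count % copy_every == 0:       # line 21: refresh the target network
                target_net.load_state_dict(main_net.state_dict())
            state = next_state
```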

4.1 Simulation Setup

4.1.1 Trading Model

To perform the simulation of the energy trading model, we make the following assumptions about the model parameters. The number of training episodes $E$ is set to 10000, and each episode contains 24 trading steps. The participants in the energy trading are one MGO, $a$ buyers, and $b$ sellers; to facilitate the simulation, we let $a = b$. The emerging demand or supply generated by each participant at step $t$ is a discrete number selected from the set $[0, 5]$. We assume that the demand $d_{i,t}$ and supply $u_{j,t}$ of each buyer and seller that failed to trade at the previous step are inherited into the next step, with the inheritance rate $\lambda$ set to 0.9:

$$d_{i,t+1} = \lambda\,\big(d_{i,t} - Q(b,i)\big) + \hat{d}_{i,t+1}, \qquad u_{j,t+1} = \lambda\,\big(u_{j,t} - Q(s,j)\big) + \hat{u}_{j,t+1} \qquad (19)$$

where $\hat{d}_{i,t+1}$ and $\hat{u}_{j,t+1}$ denote the newly emerging demand and supply at step $t+1$.

Notice that dimensional explosion is a common issue in reinforcement learning: learning efficiency is greatly reduced, or even wrong results are obtained, when the dimensionality gets higher [12]. To address this problem, we discretize the actions of the trading participants. The bidding price $p$ is selected from $[0.6, 1.5]$ with spacing 0.1, and the bidding volume is selected from $[0, 5]$ with spacing 0.5. Finally, the balancing coefficients $\alpha$ and $\beta$ for buyers and sellers are set to 0.5.
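A small sketch of the action discretization and the demand/supply inheritance of Eq. (19) is given below; the uniform draw of the emerging demand and the helper names are our assumptions (the paper only states that the emerging value is a discrete number from $[0, 5]$):

```python
import random

import numpy as np

# Discretised action grid from Section 4.1.1: bidding prices and bidding volumes.
PRICES = np.round(np.arange(0.6, 1.5 + 1e-9, 0.1), 2)   # 0.6, 0.7, ..., 1.5
VOLUMES = np.round(np.arange(0.0, 5.0 + 1e-9, 0.5), 1)  # 0.0, 0.5, ..., 5.0

def decode_action(index: int):
    """Map a flat action index onto a (price, volume) pair."""
    return PRICES[index // len(VOLUMES)], VOLUMES[index % len(VOLUMES)]

def next_demand(d_t: float, traded: float, lam: float = 0.9) -> float:
    """Eq. (19): untraded demand is inherited with rate lambda and new demand arrives."""
    emerging = random.randint(0, 5)   # newly emerging demand drawn from {0, ..., 5}
    return lam * (d_t - traded) + emerging

print(decode_action(37), next_demand(4.0, 3.0))
```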


  
4.1.2 Learning Model

The learning rate of the reinforcement learning model is set to 0.001. The interval $C$ for replacing the target network parameters with the main network parameters is set to 5. The size of the replay buffer is set to 500, and the minibatch size is set to 32. In addition, the greedy coefficient $\epsilon$ is set to 0.1 initially and increases by 0.01 after each learning step. The number of neurons in the input layer is set to the state dimension 3, and the time-series length of the RNN cell is set to 4. We place one linear layer and two noisy layers after the input layer, with 20, 100 and 200 nodes, respectively. For the dueling layer, the number of neurons in the action-advantage stream is set to the number of actions $|A| = |n_p| \times |n_v| = 100$, and the number of neurons in the state-value stream is set to 1.
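For convenience, the hyper-parameters reported in Sections 4.1.1 and 4.1.2 can be gathered into a single configuration sketch; the key names are ours:

```python
# Hyper-parameters stated in Sections 4.1.1-4.1.2 (a convenience sketch).
CONFIG = {
    "episodes": 10000,            # training episodes E
    "steps_per_episode": 24,      # trading steps per episode
    "learning_rate": 1e-3,
    "target_update_interval": 5,  # C
    "replay_buffer_size": 500,
    "batch_size": 32,
    "epsilon_init": 0.1,
    "epsilon_increment": 0.01,
    "inheritance_rate": 0.9,      # lambda in Eq. (19)
    "alpha": 0.5,                 # balancing coefficient in Eq. (12)
    "state_dim": 3,
    "rnn_sequence_length": 4,
    "n_actions": 100,             # |A| = |n_p| x |n_v|
}
```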

4.2 Evaluation Results

To assess the optimization effect of reinforcement learning on the trading model, we introduce an Empirical Algorithm (EA) [8] for comparison: at each time step $t$, the buyer and the seller only care about obtaining the minimum payment and the maximum return in the current time period, not the future reward. That is, the discount factor $\gamma$ in reinforcement learning is set to 0.

Tables 1 and 2 show the results of the simulation described in this article. We use the average buyer cost and the average seller profit over the last 1000 episodes as indicators of learning performance. The results cover different numbers of buyers and sellers in the trading, namely 10, 20, 30, 40 and 50. Additionally, the evaluation results of the different algorithms are plotted in Figures 4 and 5, where the red and blue lines represent the simulation results of reinforcement learning and of the empirical algorithm, respectively.

Table 1: Buyer Average Cost
Number   10       20       30       40       50
EA       2.0374   2.3788   3.4200   5.3936   7.7363
DRL      1.7922   1.8750   1.8845   1.8875   2.0736

Table 2: Seller Average Profit
Number   10       20       30       40       50
EA       0.3675   0.6216   0.5981   0.7486   1.2349
DRL      2.4245   7.7363   11.013   12.814   16.808

[Figure 4: Buyer Average Cost]   [Figure 5: Seller Average Profit]

From Figures 4 and 5, we can see that as the number of competitors in energy trading increases, the average cost of buyers increases and the average profit of sellers decreases. The simulation results also show that the proposed approach can effectively help buyers and sellers in energy trading to obtain lower costs and greater benefits, respectively.

Finally, we use learning curves with 10 participants to represent the convergence process, as illustrated in Figures 6 and 7. It can be seen intuitively from the figures that during the initial phase of learning, the performance of the reinforcement learning agents is not as good as that of the empirical ones. However, as the learning process continues and more experience is accumulated, the buyers' average cost decreases and the sellers' average profit increases, which reflects that by using reinforcement learning the agents can make better bidding strategies. Once convergence is reached, the performance of the reinforcement learning agents is better than that of the empirical ones.

[Figure 6: Buyer's Convergence Process]   [Figure 7: Seller's Convergence Process]

5 CONCLUSION

In this paper, a deep reinforcement learning-based algorithm is proposed for synchronous bilateral energy auctions in the trading market, using the supply and demand relations as well as the agents' own needs as the input. Simulation experiments demonstrate the effectiveness of autonomous bidding strategy learning for buyers and sellers from the trading environment. Through deep reinforcement learning, buyers can reduce their costs and sellers can increase their profits in the market, respectively.

REFERENCES

[1] J. Gao, Y. Xiao, J. Liu, W. Liang, and C. L. P. Chen, A survey of communication/networking in smart grids, Future Generation Computer Systems, Vol.28, No.2, 391-404, 2012.
[2] Vytelingum P., Cliff D., Jennings N. R., Strategic bidding in continuous double auctions, Artificial Intelligence, Vol.172, No.14, 1700-1729, 2008.
[3] An Dou, Yang Qingyu, Yu Wei, Yang Xinyu, Fu Xinwen, Zhao Wei, SODA: strategy-proof online double auction scheme for multi-microgrids bidding, IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol.48, No.7, 1177-1190, 2017.
[4] PankiRaj, Jema Sharin, Abdulsalam Yassine, Salimur Choudhury, An auction mechanism for profit maximization of peer-to-peer energy trading in smart grids, Procedia Computer Science, 151, 361-368, 2019.
[5] Ramachandran B., Srivastava S. K., Edrington C. S., Cartes D. A., An intelligent auction scheme for smart grid market using a hybrid immune algorithm, IEEE Transactions on Industrial Electronics, Vol.58, No.10, 4603-4612, 2010.
[6] Ma J., Deng J., Song L., Han Z., Incentive mechanism for demand side management in smart grid using auction, IEEE Transactions on Smart Grid, Vol.5, No.3, 1379-1388, 2014.
[7] Xu H., Sun H., Nikovski D., Kitamura S., Mori K., Hashimoto H., Deep reinforcement learning for joint bidding and pricing of load serving entity, IEEE Transactions on Smart Grid, 2019.
[8] Wang H., Huang T., Liao X., Abu-Rub H., Chen G., Reinforcement learning for constrained energy trading games with incomplete information, IEEE Transactions on Cybernetics, Vol.47, No.10, 3404-3416, 2016.
[9] Wang N., Xu W., Shao W., Xu Z., A Q-cube framework of reinforcement learning algorithm for continuous double auction among microgrids, Energies, Vol.12, No.15, 2891, 2019.
[10] Fortunato M., Azar M. G., Piot B., Menick J., Osband I., Graves A., et al., Noisy networks for exploration, arXiv preprint arXiv:1706.10295, 2017.
[11] Wang Z., Schaul T., Hessel M., Van Hasselt H., Lanctot M., De Freitas N., Dueling network architectures for deep reinforcement learning, arXiv preprint arXiv:1511.06581, 2015.
[12] Sutton R. S., Barto A. G., Reinforcement Learning: An Introduction, Cambridge: MIT Press, 1998.
