Approach
Feiye Zhang1 , Qingyu Yang2
1. School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]
2. SKLMSE Lab, MOE Key Laboratory for Intelligent Networks and Network Security, School of Automation Science and
Engineering, Xi’an Jiaotong University, Xi’an 710049, China
E-mail: [email protected]
Abstract: To achieve efficient operation of the smart grid, an appropriate energy trading strategy plays an important role in reducing the costs of the participating agents and in alleviating grid pressure. However, as the number of participants in the smart grid increases, keeping energy trading stable and effective becomes increasingly challenging. In this paper, we propose a deep reinforcement learning-based double auction trading strategy for energy. Through the deep reinforcement learning algorithm, buyers and sellers gradually learn the environment by treating three elements, the total supply, the total demand, and their own supply or demand, as the state, and by treating both the bidding price and the bidding quantity as the bidding strategy. Simulation results indicate that as learning proceeds and reaches convergence, the cost that buyers pay in the auction decreases significantly, while the profit that sellers earn in the auction increases.
Key Words: Smart Grid, Energy Trading, Double Auction, Deep Reinforcement Learning
The MGO provides the following trading services for all participants: 1) Collecting the trading information from buyers and sellers, the MGO monitors and regulates the operation of the energy market. 2) Carrying out a reasonable auction mechanism, the MGO guarantees the balance of supply and demand. 3) Adopting payment and allocation rules, the MGO ensures the bilateral flow of electricity and price information. The basic structure of the market model is shown in Figure 1.

The electricity is transacted according to a double auction mechanism over discrete time slots under the regulation of the MGO. The buyers aim to purchase power from the grid at a relatively low price, while the sellers are willing to sell power to the grid at higher prices to gain greater benefits. In a discrete time slot t, when there are both sellers and buyers in the market, the MGO first gathers the buyers' bid information $(v_{b,i}, p_{b,i})$, indicating the total volume and unit price that each buyer is willing to pay, and the sellers' bid information $(v_{s,j}, p_{s,j})$ that each seller is willing to accept. Then the MGO determines the trading price and volume of each buyer and seller based on the valid price and the allocation rules. Finally, the MGO allocates energy from the sellers to the buyers and transfers the money from the buyers to the sellers.

Figure 1: Market Structure

The detailed trading process is as follows:

1): At the beginning of time slot t, each buyer from set B and each seller from set S report their demand and supply to the MGO.

2): The MGO computes and announces the total demand and supply at time slot t. Each participant submits a bid $(v, p)$ to the MGO based on the total demand and supply together with its own needs, where v represents the amount of energy it is willing to buy or sell, and p represents the price it is willing to pay or accept for each unit of energy.

3): The MGO calculates the valid prices $p_{v,b,t}$ and $p_{v,s,t}$ for buyers and sellers, respectively. Any buyer that satisfies $v_{b,i} > p_{v,b,t}$ and any seller that satisfies $v_{s,j} < p_{v,s,t}$ wins the bid.

4): Based on the winners in step 3), the MGO updates the winning buyer set and the winning seller set:
$$B_t = \{\, i \mid v_{b,i} > p_{v,b,t} \,\}, \qquad S_t = \{\, j \mid v_{s,j} < p_{v,s,t} \,\} \quad (1)$$

5): The MGO decides the trading amounts of the winning agents, which are derived according to the following two cases.
Case A: $\sum_{i \in B_t} v_{b,i} \le \sum_{j \in S_t} v_{s,j}$:
$$Q(b, i) = v_{b,i}, \qquad Q(s, j) = v_{s,j} - \frac{\Delta}{|S_t|} \quad (2)$$
Case B: $\sum_{j \in S_t} v_{s,j} \le \sum_{i \in B_t} v_{b,i}$:
$$Q(b, i) = v_{b,i} - \frac{\Delta}{|B_t|}, \qquad Q(s, j) = v_{s,j} \quad (3)$$
Notice that $|B_t|$ and $|S_t|$ represent the sizes of the winning buyer and seller sets at time step t, and $\Delta$ is the difference between total demand and total supply.

6): The MGO updates the energy demands or supplies for the next time step.

From the trading process above, we can see that computing the valid price is a key step in the double auction. It proceeds as follows:

1): For the active buyers $i \in B_t$, sort the prices they bid in descending order, and for the active sellers $j \in S_t$, sort the prices they ask in ascending order:
$$p_{b,1} > p_{b,2} > \dots > p_{b,n}, \qquad p_{s,1} < p_{s,2} < \dots < p_{s,m} \quad (4)$$

2): The valid prices are then set from the sorted prices at indices l and k, according to how the sorted bid and ask prices cross:
$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \quad (6)$$
Case C: if $p_{s,k+1} \ge p_{b,l} \ge p_{s,k}$ and $\sum_{i=1}^{l-1} v_{b,i} \le \sum_{j=1}^{k} v_{s,j} \le \sum_{i=1}^{l} v_{b,i}$, then
$$p_{v,b} = p_{b,l}, \qquad p_{v,s} = p_{s,k} \quad (7)$$
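To make the clearing procedure concrete, the Python sketch below implements the quantity allocation of Eqs. (2)-(3) and the price sorting of Eq. (4), assuming the winning sets of Eq. (1) and the valid prices have already been determined by the MGO; all function and variable names here are illustrative and not taken from the paper.

```python
# Minimal sketch of the MGO allocation step (Eqs. 2-3), assuming the winning
# buyer/seller sets are already known and non-empty. Names are illustrative.

def allocate_quantities(demand, supply):
    """demand: {buyer_id: v_b_i}   requested volumes of winning buyers
       supply: {seller_id: v_s_j}  offered volumes of winning sellers
       Returns (buyer_allocations, seller_allocations)."""
    total_demand = sum(demand.values())
    total_supply = sum(supply.values())
    delta = abs(total_demand - total_supply)  # imbalance Delta between demand and supply

    if total_demand <= total_supply:
        # Case A (Eq. 2): buyers receive exactly what they asked for; the surplus
        # delta is shared equally among the |S_t| winning sellers.
        q_buy = dict(demand)
        q_sell = {j: v - delta / len(supply) for j, v in supply.items()}
    else:
        # Case B (Eq. 3): sellers sell everything they offered; the shortage
        # delta is shared equally among the |B_t| winning buyers.
        q_buy = {i: v - delta / len(demand) for i, v in demand.items()}
        q_sell = dict(supply)
    return q_buy, q_sell


def sorted_prices(bid_prices, ask_prices):
    """Eq. (4): bids in descending order, asks in ascending order,
    as preparation for picking the valid-price indices l and k."""
    return sorted(bid_prices, reverse=True), sorted(ask_prices)
```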
We first formulate the double auction scheme, and then introduce the deep reinforcement learning approach used to learn the optimal strategies of buyers and sellers.

3.1 Markov Decision Process
In this paper, we apply a finite Markov decision process (MDP) with discrete time steps to formulate the double auction scheme. Specifically, the buyers and sellers involved in the trading are regarded as agents who aim to pay the least cost and gain the most benefit, respectively, and the state of the current time t is only related to the state and action of the previous time.

S denotes the system state space, and $S_b$ and $S_s$ respectively denote the possible states of the buyers and of the sellers. We consider $s(b, i, t) \in S_b$ to be the state of buyer i at time t. We propose to form the state space from the buyers' total demand D, the sellers' total supply P, and each buyer's own demand $d_i$, because the demand and supply relationship has a decisive impact on each buyer's bidding. When supply exceeds demand, buyers will choose to raise the bidding price and reduce the bidding amount; when demand exceeds supply, the result is the opposite. Therefore the state of buyer i at time t is defined as:
$$s(b, i, t) = \{D, P, d_i\} \quad (8)$$
The above analysis applies equally to the sellers: $s(s, j, t) \in S_s$ is the state of seller j at time t, defined in the same way.

A represents the set of available actions of the trading participants, and a(t) is the bidding price and quantity at time slot t. In our trading model, we propose a two-dimensional tuple $a(t) = \{p_t, q_t\}$, combining both bidding price and quantity, to represent the action for deep reinforcement learning. The action of buyer i at time t is represented as $a(b, i, t) = \{p_{b,i}, v_{b,i}\}$, where $p_{b,i}$ is the bidding price and $v_{b,i}$ is the purchase amount. Notice that, to be practical, we assume that the purchase amount of buyer i at time t cannot exceed its demand:
$$v_{b,i} \le d_{i,t} \quad (10)$$
Similarly, the action of seller j at time t is $a(s, j, t) = \{p_{s,j}, v_{s,j}\}$, and the sold amount of seller j at time t cannot exceed its supply:
$$v_{s,j} \le u_{j,t} \quad (11)$$
P is defined as the transition function. We collectively represent the state of the buyers and sellers at time t as s(t). In this paper, the state transition probability from state s(t) to s(t + 1) is denoted as $p_t : s(t) \times a(t) \to s(t + 1)$, which meets the definition of an MDP in that the state of the current time is only related to the state and action of the previous time.

R is the reward function, and the immediate reward at time step t is defined as $r_t$. For buyer i, the reward at the current time t consists of two parts: the cost of purchasing energy and the dissatisfaction from not meeting its demand at time t:
$$r(b, i, t) = \alpha \, p_{v,b} \, Q(b, i) + (1 - \alpha)\,\big(d_{i,t} - Q(b, i)\big) \quad (12)$$
where α is a coefficient in the range [0, 1] that balances the cost and the dissatisfaction. Also, the reward of seller j at the current time is its profit.
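As a concrete illustration of the buyer-side formulation, the sketch below assembles the state of Eq. (8), enforces the volume constraint of Eq. (10), and evaluates the reward of Eq. (12); the class and method names are our own, and the structure is only a minimal reading of the definitions above.

```python
# Minimal sketch of the buyer-side MDP quantities (Eqs. 8, 10, 12).
# The dataclass and its field names are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class BuyerMDP:
    alpha: float  # trade-off coefficient in [0, 1] between cost and dissatisfaction

    def state(self, total_demand, total_supply, own_demand):
        # Eq. (8): s(b, i, t) = {D, P, d_i}
        return (total_demand, total_supply, own_demand)

    def action(self, bid_price, bid_volume, own_demand):
        # Eq. (10): the purchase amount may not exceed the buyer's own demand
        return (bid_price, min(bid_volume, own_demand))

    def reward(self, valid_price, allocated, own_demand):
        # Eq. (12): weighted sum of purchase cost and unmet-demand dissatisfaction
        cost = valid_price * allocated
        dissatisfaction = own_demand - allocated
        return self.alpha * cost + (1.0 - self.alpha) * dissatisfaction
```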
3.2 Deep Reinforcement Learning
In the energy trading market, it is difficult for buyers and sellers to decide their bidding strategies via an analytical approach, due to the uncertainty of future energy prices and of the supply-demand relations. Since deep reinforcement learning (DRL) is an effective way to obtain optimal strategies in a specific environment, we utilize DRL in the trading model to find the optimal bidding strategy for both buyers and sellers. The structure of the proposed deep reinforcement learning based double auction scheme is illustrated in Figure 2.

Figure 2: Deep Reinforcement Learning Structure

In our reinforcement learning model, both buyers and sellers are considered as agents that learn their best bidding strategies by observing the rewards of their interactions with the MGO over time. At a discrete time step t, the agent observes the supply and demand relationship and obtains the state value s(t). It then takes an action $a(t) = \{p_t, q_t\}$ based on the output of the neural network, the state-action value Q(s, a), which indicates the cumulative reward obtained by the agent when interacting with the environment using action a in state s. Notice that, for buyers, the minimum value in the output is selected, while for sellers the choice is the opposite. Next, the MGO determines the valid price and the energy allocation for each seller and buyer.
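Following this selection rule, a small sketch of how an action could be drawn from the Q-network output is shown below; the discretized set of candidate (price, quantity) actions is our assumption for illustration, since the paper does not spell out the discretization.

```python
import numpy as np

# Sketch of action selection from the Q-network output, following the rule in
# the text: buyers pick the action with the minimum Q-value (least expected
# cost), sellers the one with the maximum (largest expected profit).
def select_action(q_values, actions, agent_type):
    """q_values: array of Q(s, a) for every candidate action
       actions:  list of (price, quantity) tuples, same length as q_values"""
    idx = np.argmin(q_values) if agent_type == "buyer" else np.argmax(q_values)
    return actions[idx]
```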
Once the trading is complete, the transition of the environment from state s(t) to s(t + 1) generates a reward r(t), reflecting the immediate evaluation of the action a(t) at state s(t). The state s(t), action a(t), reward r(t), and next-step state s(t + 1) form an experience tuple, defined as [s(t), a(t), r(t), s(t + 1)], which describes one interaction with the environment and is stored in the replay buffer for the training process. We use two similar buffers to store the experiences of buyers and sellers, respectively.
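One straightforward way to realize the two replay buffers described above is sketched below; the deque-based storage and the capacity value are our own choices rather than details given in the paper.

```python
import random
from collections import deque

# Sketch of a replay buffer holding experience tuples [s_t, a_t, r_t, s_{t+1}].
# One instance is kept for buyers and one for sellers, as described in the text;
# the capacity is an arbitrary placeholder.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)


buyer_buffer = ReplayBuffer()
seller_buffer = ReplayBuffer()
```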
To approximate the state-action value, a deep neural network is introduced that takes the states as input and generates the state-action value $Q(s, a) \approx Q(s, a, \theta)$, where θ is the parameter of the neural network. The proposed deep neural network is a fully-connected network that consists of five layers, as shown in Figure 3.

Figure 3: Deep Neural Network Structure

We use an RNN layer as the input layer, which is fed by the time series of state values of length m and is represented by the green circles in Figure 3. Specifically, the input of the first cell is $s_{t-m}$, which represents the state at time t − m; the first RNN cell's parameters I and output $y_{t-m}$ are passed into the second cell, and this process is repeated until the last cell. By concatenating the state information, the output of the RNN layer is fed into one linear layer, followed by two noisy layers, represented by the red and blue circles in Figure 3, respectively. The limitations of the traditional ε-greedy exploration policy are clear in many conditions: weights with greater uncertainty introduce more variability into the decisions made by the policy, which has potential for exploratory actions [10]. The scheme of the noisy layer is as follows:
$$y = (\mu^{\omega} + \sigma^{\omega} \odot \varepsilon^{\omega})\,x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b} \quad (14)$$
where $\varepsilon^{\omega}$ and $\varepsilon^{b}$ are random variables. By doing so, Eq. (14) can be used in place of the standard linear layer $y = \omega x + b$.
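The noisy layer of Eq. (14) can be implemented, for instance, as the following PyTorch-style module; the initialization constants and the use of independent Gaussian noise are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the noisy linear layer of Eq. (14):
#   y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b,
# with the noise eps resampled at every forward pass. This is a generic
# implementation, not the authors' code; initial values are illustrative.
class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(
            torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)  # fresh noise each forward pass
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```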
The last layer is the dueling layer; the dueling network is proposed to obtain better policy evaluation in the presence of many similar-valued actions [11]. In the double auction problem, the state s is only related to the supply and demand relations in the market, and an agent's actions do not affect the state estimation in any relevant way, since the agent does not know the supply and demand of the others. The proposed dueling network can be seen as a single network with two streams, illustrated by the purple circles in Figure 3. The two streams share the same preceding part of the network and are merged by a special aggregator to produce the state-action value:
$$Q(s, a) = v_{\lambda}\big(f_{\zeta}(s)\big) + a_{\psi}\big(f_{\zeta}(s), a\big) - \frac{1}{N} \sum_{a'} a_{\psi}\big(f_{\zeta}(s), a'\big) \quad (15)$$
where ζ, λ, and ψ are respectively the parameters of the shared part, the value stream, and the action advantage stream.
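The aggregator of Eq. (15) reduces to a one-line operation on the two stream outputs, as in the sketch below; the value head $v_\lambda$ and advantage head $a_\psi$ themselves are assumed to be built from the layers described above and are not shown.

```python
import torch

# Sketch of the dueling aggregator of Eq. (15): the value stream and the
# advantage stream are merged by subtracting the mean advantage over the
# N candidate actions. Shapes follow common dueling-DQN practice.
def dueling_q_values(value, advantage):
    """value:     (batch, 1) tensor, v_lambda(f_zeta(s))
       advantage: (batch, n_actions) tensor, a_psi(f_zeta(s), a)"""
    return value + advantage - advantage.mean(dim=1, keepdim=True)

# Example with random stream outputs for a batch of 4 states and 9 candidate actions
q = dueling_q_values(torch.randn(4, 1), torch.randn(4, 9))
```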
Finally, we use two networks with this structure in the learning process: the main network with parameters θ and the target network with parameters θ′, in order to avoid overestimation. Both networks have the same parameters initially, and θ′ is updated to be equal to θ once every C steps.

After sampling a minibatch $[S_t, A_t, R_t, S_{t+1}]$ of size B from the buffer, we use $S_t$ as the input of the main network and use $A_t$ to choose the action-state value in the output of the main network $Q(S_t, \theta)$, which gives the evaluation Q value, illustrated by the red lines in Figure 2:
$$Q_{eval} = Q(S_t, A_t, \theta) \quad (16)$$
To calculate the target Q value, we first find the best action $a^{*}$ of the state $S_{t+1}$, the one that corresponds to the minimum (for buyers) or maximum (for sellers) action-state value $Q(S_{t+1}, a', \theta)$ in the main network with input $S_{t+1}$. Then the selected action $a^{*}$ and the reward $R_t$ from the minibatch are used to calculate the target Q value in the target network, whose input is also $S_{t+1}$, as indicated by the blue lines in Figure 2:
$$Q_{tar} = R_t + \gamma \, Q\big(S_{t+1}, \arg\min_{a'} Q(S_{t+1}, a', \theta), \theta'\big) \quad (17)$$
where γ is the discount factor, which indicates the degree of influence of the future reward on the current reward: the smaller γ is, the more the agent pays attention to the current reward, and vice versa. We update the parameters of the main network θ by performing gradient descent on the loss function computed from the difference between the target Q value and the evaluation Q value:
$$L(t) = \sum_{i=1}^{B} \big(Q_{tar} - Q_{eval}\big)^2 \quad (18)$$
The pseudocode of our algorithm is given in Algorithm 1.
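Putting Eqs. (16)-(18) together, one possible form of the update for a buyer agent is sketched below; `main_net`, `target_net`, the discount value, and the summed squared error are our assumptions standing in for the network of Figure 3, and the (price, quantity) pairs are assumed to be discretized into an indexed action set as in the selection sketch earlier. A seller agent would use argmax in place of argmin.

```python
import torch
import torch.nn.functional as F

# Sketch of the double-DQN style update of Eqs. (16)-(18) for a buyer agent.
# `main_net` and `target_net` map a batch of states to Q-values over all
# candidate actions; both are assumptions standing in for the paper's network.
def td_loss(main_net, target_net, batch, gamma=0.95):
    states, actions, rewards, next_states = batch  # [S_t, A_t, R_t, S_{t+1}]
    # Eq. (16): evaluation Q value of the actions actually taken
    q_eval = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Eq. (17): choose a* with the main network, evaluate it with the target network
        a_star = main_net(next_states).argmin(dim=1, keepdim=True)  # argmax for sellers
        q_next = target_net(next_states).gather(1, a_star).squeeze(1)
        q_tar = rewards + gamma * q_next
    # Eq. (18): squared difference summed over the minibatch
    return F.mse_loss(q_eval, q_tar, reduction="sum")
```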
4 PERFORMANCE EVALUATION
In this section, we introduce the performance evaluation to demonstrate the effectiveness of our energy trading strategies. We first present the simulation setup and then show the evaluation results.

The buyers' average cost decreases and the sellers' average profit increases, which reflects that, by using reinforcement learning, the agents can make better bidding strategies. Once learning reaches convergence, the performance of the reinforcement learning agents is better than that of the empirical ones.

5 CONCLUSION
In this paper, a deep reinforcement learning based algorithm is proposed for synchronous bilateral energy auctions in the trading market, using the supply and demand relations as well as the agents' own needs as the input. Simulation experiments demonstrate that buyers and sellers can autonomously learn effective bidding strategies from the trading environment. Through deep reinforcement learning, buyers can reduce their costs and sellers can increase their profits in the market, respectively.

REFERENCES
[1] J. Gao, Y. Xiao, J. Liu, W. Liang, and C. L. P. Chen, A survey of communication/networking in smart grids, Future Generation Computer Systems, Vol.28, No.2, 391-404, 2012.
[2] Vytelingum P., Cliff D., Jennings N. R., Strategic bidding in continuous double auctions, Artificial Intelligence, Vol.172, No.14, 1700-1729, 2008.
[3] An Dou, Yang Qingyu, Yu Wei, Yang Xinyu, Fu Xinwen, Zhao Wei, SODA: strategy-proof online double auction scheme for multimicrogrids bidding, IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol.48, No.7, 1177-1190, 2017.
[4] PankiRaj J. S., Yassine A., Choudhury S., An auction mechanism for profit maximization of peer-to-peer energy trading in smart grids, Procedia Computer Science, Vol.151, 361-368, 2019.
[5] Ramachandran B., Srivastava S. K., Edrington C. S., Cartes D. A., An intelligent auction scheme for smart grid market using a hybrid immune algorithm, IEEE Transactions on Industrial Electronics, Vol.58, No.10, 4603-4612, 2010.
[6] Ma J., Deng J., Song L., Han Z., Incentive mechanism for demand side management in smart grid using auction, IEEE Transactions on Smart Grid, Vol.5, No.3, 1379-1388, 2014.
[7] Xu H., Sun H., Nikovski D., Kitamura S., Mori K., Hashimoto H., Deep reinforcement learning for joint bidding and pricing of load serving entity, IEEE Transactions on Smart Grid, 2019.
[8] Wang H., Huang T., Liao X., Abu-Rub H., Chen G., Reinforcement learning for constrained energy trading games with incomplete information, IEEE Transactions on Cybernetics, Vol.47, No.10, 3404-3416, 2016.
[9] Wang N., Xu W., Shao W., Xu Z., A Q-cube framework of reinforcement learning algorithm for continuous double auction among microgrids, Energies, Vol.12, No.15, 2891, 2019.
[10] Fortunato M., Azar M. G., Piot B., Menick J., Osband I., Graves A., et al., Noisy networks for exploration, arXiv preprint arXiv:1706.10295, 2017.
[11] Wang Z., Schaul T., Hessel M., Van Hasselt H., Lanctot M., De Freitas N., Dueling network architectures for deep reinforcement learning, arXiv preprint arXiv:1511.06581, 2015.
[12] Sutton R. S., Barto A. G., Introduction to Reinforcement Learning, Cambridge: MIT Press, 1998.