Smart Microgrid Optimization using Deep Reinforcement Learning by utilizing the Energy Storage Systems
Abstract—The need for power is expanding quickly, and as fossil fuels are depleted, alternatives based on renewable energy are emerging. In recent years, microgrids, which are based on distributed generation and storage systems, have been on an increasing trend. This generates various new opportunities, one of which is the capacity to monitor, control, and regulate the energy flow inside and outside of the microgrid. By making the microgrid's distributed energy resources (DERs) and energy storage components economically viable, with artificial intelligence and machine learning incorporated into the cost-optimization process, the demand for and growth of these technologies will be accelerated significantly. This paper focuses on employing reinforcement learning (RL) algorithms to control energy flow in an AC microgrid. By incorporating artificial intelligence and machine learning into the energy management system (EMS), the paper aims to optimize costs and facilitate the integration of renewable energy sources. The RL agent is designed to trade energy with the main grid, taking advantage of the energy storage system and achieving cost savings. The RL agent is tested using real spot price data from Norway in a simulation model that combines Python and MATLAB-Simulink for efficient co-simulation. Results show significant cost savings of around 14% for a simple model and 7.5% for a complex dynamic model.

Index Terms—Smart Microgrids, Reinforcement Learning, PPO, Co-Simulation, Energy Management System

I. INTRODUCTION

Contemporary civilization requires a reliable supply of electricity to consumers and prosumers with a high level of power quality. In many nations and areas, the structure of the power grid is constantly evolving, creating challenges with energy flow changes, capacity constraints, and expensive investment costs to modernize the power grid. With new technology come new possibilities, and for many years, grids have been digitalized to enable centralized monitoring and management of the electricity network [1]. This "smart" digitalized grid has been a reality for the grid's high-voltage segment for a substantial amount of time. In the coming years, grids will increasingly rely on renewable energy, and customers will benefit from smart technology such as electric vehicle chargers and smart meters [2]. This creates a possibility to increase digitization of the grid's distribution segment. Furthermore, distributed generation and microgrids have been integrated into the power grid to increase its reliability and efficiency [3]. Microgrids are defined by the US Department of Energy as a "group of interconnected loads and distributed energy resources within clearly defined electrical boundaries that act as a single controllable entity with respect to the grid" [4]. A microgrid can operate in grid-tied mode or in standalone mode using a backup power supply. One of the most used backup power supplies in microgrids is a battery storage system, because of its quick response and reliability in the face of any power disruption or blackout to support critical loads [5]. To benefit from this costly investment, it is important to utilize the battery in other avenues, such as peak shaving and electricity trading. However, the microgrid is a complex and dynamic system with several distributed generation units and loads, and due to the intermittent nature of renewable sources and the difficulty of accurately predicting power generation and consumption [6], effectively optimizing the energy flow using the battery is a challenging problem. Reinforcement learning (RL) is among the potential approaches used in the literature to solve this problem [7], [8]. This paper uses reinforcement learning by training agents that receive power measurements from renewable sources, the grid, and the load and perform actions that optimize the operational cost of the microgrid. The rest of the paper is structured as follows: Section II provides a brief introduction to reinforcement learning, Section III explores related work and illustrates the novelty of this paper, Section IV explains the modeling and implementation approach, Section V shows the results, and Section VI concludes the paper.

II. REINFORCEMENT LEARNING

A reinforcement learning system consists of four components: a policy, a reward, a value function, and, in certain situations, a model of the environment (the microgrid), as can be seen in Fig. 1. An RL system's policy is the mapping of observations/inputs to outputs/actions. A reward is a number that the RL agent receives in the subsequent time step after performing an action, and it should indicate the quality of that action. The RL agent's only purpose is to maximize the overall reward. To prevent the RL agent from prioritizing the immediate benefit above the cumulative gain over the long run, a value function is employed to estimate the future projected return. The last, optional, component is the environment model, which attempts to replicate the environment in order to anticipate its behavior, typically the next observation and reward. Such approaches are referred to as model-based, while model-free methods are purely empirical with no explicit model of the environment [9].
[Fig. 1 shows the structure of the RL system: the agent (policy and value function) interacts with the microgrid environment (ESS, PV, wind turbine, load, and grid), sending an action (the battery power reference P_ref) to the ESS and receiving a reward together with observations of P_Grid, P_Load, P_wind+pv, the electricity price, and the day-ahead electricity prices.]

Fig. 1. Structure of the RL agent used in the proposed microgrid

For very complex dynamic systems, it may become extremely difficult, if not impossible, to construct an agent that models the entire system mathematically and so performs the optimal actions consistently. Rather than describing these rules explicitly, RL will continually learn and update its parameters to improve performance. Even if the environment changes over time, the RL agent will learn to adapt. For this learning to occur, however, random exploration of the environment is key, and thus one of the most critical and challenging criteria to select is the trade-off between exploration and exploitation. If the agent's exploration parameter is set to a high value, it will select nearly random actions, but if it is set to a low value, it will take the first and best solution it discovers and get trapped in a local minimum.

III. RELATED WORK AND NOVELTY

Several works in the literature have explored the problem of optimizing microgrid operational cost using reinforcement learning. The Deep Q-Network (DQN) algorithm used in [10] improved the efficiency and cost of a microgrid located in Belgium. However, real-time prices were not used, and the energy cost was set at a constant 2C/kWh. The proposed scheme in [11] created three different consumption profiles based on the customers' requirements and obtained a highly profitable scheme using the DDPG algorithm, but the results only covered a short time horizon of some weeks, and the battery simply got discharged at the end for one of the schemes, which does not prove how the trained algorithm would work in the long term. The authors in [12] compared ten different RL algorithms on a scenario of 10 days. The data they used was from Finland, and they found that a modified version of the Advantage Actor-Critic algorithm (A3C++) performed best and that Proximal Policy Optimization (PPO) achieved negative results. Moreover, they proposed in the future work section that the day-ahead electricity spot prices should be added to the observations. In another example, [13] used RL to reduce the operational cost by 20.75%, but the agent was only tested on a simple mathematical model. This paper will show the impact of integrating an RL agent into a complex environment that simulates the effect of switching converters, reactive power, and primary and secondary control. Additionally, the use of real variable spot prices and weather data from Norway will be introduced, and different algorithms and pricing schemes will be investigated.

The following points summarize the novelty and contributions of this paper:

1) Development and training of a reinforcement learning agent capable of optimizing and reducing the operational cost of a microgrid using real electricity prices, real weather data, and load data from a Norwegian community.
2) Development of a full-scale complex model with power electronic converters controlled using model predictive control (MPC) to verify the performance of the RL agent and compare it with the simplified mathematical model used for training.

IV. MODELING AND IMPLEMENTATION

A. Modeling Approach

The chosen framework for training the RL agent is Python, mainly using the stable-baselines3 library. To train an RL agent efficiently, a simplified model that expresses the most relevant dynamics of the microgrid should be made to reduce the time needed for training. A complex model developed in Simulink is then used to test the efficacy of the trained RL agent.
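The paper does not detail the interface used to couple the Python agent with the Simulink models. As an illustration only, the sketch below assumes a coupling through the MATLAB Engine API for Python, with a hypothetical model name and block path; the actual co-simulation setup may differ.

```python
import matlab.engine

MODEL = "microgrid_complex"   # hypothetical Simulink model name

eng = matlab.engine.start_matlab()
eng.load_system(MODEL, nargout=0)
eng.set_param(MODEL, "SimulationCommand", "start", nargout=0)
eng.set_param(MODEL, "SimulationCommand", "pause", nargout=0)

def cosim_step(p_ref_kw: float) -> dict:
    """Write the agent's battery power reference, advance the paused simulation by one
    step, and read back logged signals (assumed to be saved to the MATLAB workspace)."""
    eng.set_param(f"{MODEL}/Battery/Pref", "Value", str(p_ref_kw), nargout=0)
    eng.set_param(MODEL, "SimulationCommand", "step", nargout=0)
    return {name: eng.eval(f"{name}.Data(end)") for name in ("Pgrid", "Pload", "Pres", "SOC")}
```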
1) Modeling the PV Modules: For the PV modules, the dynamics of the PV, inverter, and filter are neglected, since the main concern of a cost-optimization algorithm is the output power, not the frequency or voltage. Since the relationship between the solar irradiance and the power generation is positive with a high correlation coefficient [14], [15], the PV modules can simply be modeled using a linear equation as:

$P_{pv} = G \cdot P_{pv,rated}$  (1)

where $P_{pv}$ is the power output of the PV system, $G$ is the irradiance, and $P_{pv,rated}$ is the rated power of the PV modules.

2) Modeling of the Wind Turbine: The wind turbine power was modeled using the Cubic Power with Cut-off model, where the wind power is given by (2) [16], [17].

$P_{wind} = V_{wind}^{3}\,\dfrac{P_{nom}}{U_{nom}^{3}-U_{min}^{3}} - \dfrac{P_{nom}\,U_{min}^{3}}{U_{nom}^{3}-U_{min}^{3}}$  (2)
where $P_{wind}$ is the output power of the wind turbine, $V_{wind}$ is the wind speed in m/s, $P_{nom}$ is the nominal power of the wind turbine, $U_{nom}$ is the nominal wind speed in m/s, and $U_{max}$ and $U_{min}$ are the maximum and minimum operating wind speeds in m/s, respectively. It is worth noting that the output power will be zero if the wind speed is outside the operating range (less than $U_{min}$ or greater than $U_{max}$), and it will be limited to the nominal power $P_{nom}$ if the result from (2) is higher than that.
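As a compact illustration of (1) and (2), the sketch below implements the two generation models in Python; the rated powers and wind-speed limits are illustrative placeholders, not the parameters of the studied microgrid.

```python
def pv_power(irradiance: float, p_pv_rated: float) -> float:
    """Simplified PV model from (1): output scales linearly with the irradiance G (per unit)."""
    return irradiance * p_pv_rated


def wind_power(v_wind: float, p_nom: float, u_min: float, u_nom: float, u_max: float) -> float:
    """Cubic power with cut-off model from (2), with cut-in/cut-out and rated-power limits."""
    if v_wind < u_min or v_wind > u_max:          # outside the operating range -> no output
        return 0.0
    p = (v_wind**3 * p_nom - p_nom * u_min**3) / (u_nom**3 - u_min**3)
    return min(p, p_nom)                           # clamp at the nominal (rated) power


# Illustrative values only (not the paper's plant ratings)
print(pv_power(0.6, 50e3))                         # 30 kW at 60% of rated irradiance
print(wind_power(9.0, 100e3, 3.0, 12.0, 25.0))     # cubic region between cut-in and rated speed
```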
B. Modeling of the Energy Storage System

The energy storage system used is a lithium-ion battery, and it is modeled as a simple system with power as input and state of charge (SOC) as output. The sign of the power determines whether the battery is charging or discharging: a positive sign means it is charging, while a negative sign means it is discharging. The state of charge is calculated based on the efficiency of the battery, the battery capacity, and the charging/discharging power using (3).

$SOC(k+1) = SOC(k) + P(k) \times T \times \dfrac{\eta}{C_{Bat}}$  (3)

where $SOC(k+1)$ is the state of charge in the next time step, $SOC(k)$ is the current state of charge, $P(k)$ is the input power to the battery, $T$ is the timestep, i.e., the time span over which the battery is charged/discharged with the given power, $\eta$ is the efficiency of the battery, and $C_{Bat}$ is the total capacity of the battery in kWh.
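A minimal sketch of the SOC update in (3); the efficiency and capacity values are placeholders, and the SOC is expressed here as a fraction of the capacity rather than in percent.

```python
def soc_update(soc: float, p_bat_kw: float, dt_h: float = 1.0,
               eta: float = 0.95, c_bat_kwh: float = 200.0) -> float:
    """SOC update from (3): positive power charges the battery, negative power discharges it.
    Efficiency eta and capacity c_bat_kwh are illustrative placeholders."""
    soc_next = soc + p_bat_kw * dt_h * eta / c_bat_kwh
    return min(max(soc_next, 0.0), 1.0)            # keep the SOC physically bounded in [0, 1]
```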
C. Modeling of the Main Grid and Cost Calculation

The grid is modeled as an ideal power source that is able to supply and accept infinite power. Therefore, the grid power is simply the load power minus the summation of the wind, PV, and battery powers, to achieve power balance as shown in (4).

$P_{Grid} = P_{Load} - (P_{PV} + P_{Wind} - P_{Bat})$  (4)

Cost Calculation: The instantaneous electricity cost (denoted by $H$) during a given hour $n$ is simply the grid power multiplied by the spot price (denoted by $S$) at that hour, as shown in (5).

$H(n) = S(n) \times P_{Grid}(n)$  (5)

The cumulative annual cost (denoted by $J$) is then obtained by summing the instantaneous cost over the entire year (8760 hours), as shown in (6).

$J = \sum_{n=1}^{8760} H(n)$  (6)
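The cost model in (4)-(6) reduces to a few lines of Python; the sketch below assumes hourly series of spot prices (in NOK/kWh) and grid power (in kW), consistent with the hourly resolution used above.

```python
def grid_power(p_load: float, p_pv: float, p_wind: float, p_bat: float) -> float:
    """Power balance from (4): positive values are imports from the grid, negative are exports.
    A charging battery (p_bat > 0) increases the import."""
    return p_load - (p_pv + p_wind - p_bat)


def annual_cost(spot_price_nok_per_kwh, p_grid_kw) -> float:
    """Cumulative cost from (5)-(6): sum of hourly spot price times grid power (8760 samples/year)."""
    return sum(s * p for s, p in zip(spot_price_nok_per_kwh, p_grid_kw))
```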
D. The EMS Environment

Reinforcement learning problems are modeled as Markov Decision Processes, with an agent that interacts with the environment by sending actions and receiving observations and rewards. At each timestep, the agent carries out an action and receives the next state observations and a reward. The observation space is the set of observations received by the agent. In grid-connected mode, these observations are the grid power, the total renewable power generation (PV power + wind power), the load power, the SOC of the battery, and the day-ahead spot prices, i.e., the list of electricity prices for the next 24 hours, since the day-ahead prices are usually published one day in advance and the agent has access to that information. The action space consists of a single action, which is the power reference given to the battery. The observation space has 28 dimensions, while the action space has only one.
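The sketch below shows one way the observation and action spaces described above could be declared with the Gymnasium API used by stable-baselines3. The class name, bounds, scaling, episode length, and the placeholder helpers are assumptions; the paper's actual environment implementation may differ.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MicrogridEMSEnv(gym.Env):
    """Hypothetical EMS environment sketch: a 28-dimensional observation (grid power,
    renewable power, load power, SOC, and 24 day-ahead spot prices) and a single
    continuous action (the battery power reference, normalized to [-1, 1])."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(28,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.t = 0

    def _get_obs(self):
        # Placeholder: real observations would come from the simplified model or the co-simulation.
        return np.zeros(28, dtype=np.float32)

    def _reward(self, p_ref_kw: float) -> float:
        # Placeholder: the actual reward follows (7)-(10) in Section IV-E.
        return 0.0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self._get_obs(), {}

    def step(self, action):
        p_ref_kw = float(action[0]) * 100.0        # scale to kW using the nominal power (assumed)
        reward = self._reward(p_ref_kw)
        self.t += 1
        terminated = self.t >= 8760                # one simulated year per episode (assumption)
        return self._get_obs(), reward, terminated, False, {}
```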
E. The Reward Function

The reward function has to be designed carefully based on the objectives of the system. For the energy management system, the objectives are as follows:

1) Keep the battery state of charge between 10% and 90% to preserve the health of the battery and avoid degradation.
2) Perform economic optimization by minimizing the annual electricity cost incurred by the microgrid using the battery.
3) Disable economic optimization when the state of charge is below 30%.

To address the first objective, a severe penalty of -10 is applied to discourage the battery's state of charge (SOC) from exceeding the operational limits, preventing it from discharging to 0% or charging to 100%. This is formally expressed in (7).

$R_{SOC} = \begin{cases} -10, & SOC > 90\% \;\|\; SOC < 10\% \\ 0, & 10\% \leq SOC \leq 90\% \end{cases}$  (7)

Achieving the second objective, which focuses on encouraging the agent to make beneficial decisions regarding power trading with the grid, proved challenging due to the intermittent nature of renewable sources. Rewarding or punishing based on power buying or selling is unreliable, as it depends heavily on environmental factors. Instead, the paper proposes a different approach: creating a baseline called the "No Action Cost," defined as the instantaneous electricity cost when the agent takes no action. This approach eliminates the impact of luck on the reward function, ensuring that the agent is rewarded only when its actions outperform doing nothing. We can then mathematically define the reward as the difference between the no-action instantaneous cost $H_n$ and the instantaneous cost with action $H_a$, as shown in (8).

$R_{EO} = \dfrac{H_n - H_a}{N}$  (8)

where $N$ is a constant used to normalize the error to ensure the value is not very high. It is worth noting that the reward is limited between -2 and 2 to avoid sharp spikes and ensure that the penalties remain significant.

The third objective is to make sure there is enough charge in the battery in case of grid disconnection. Without this, the agent would try to exploit the battery energy and sell as much power as possible, keeping the SOC at around 10%. We therefore added another term to the reward, which penalizes the agent based on the amount of grid power supplied or absorbed by the grid. This ensures that, when the SOC is too low, the agent will attempt to direct any excess energy from renewables to charge the battery rather than sell it to the grid, and it will also not discharge the battery into the grid to decrease the SOC further. The mathematical representation is found in (9).

$R_{G} = \dfrac{|P_{Grid}|}{P_{Nom}}$  (9)

where $P_{Grid}$ is the grid power when taking an action and $P_{Nom}$ is the nominal grid power, which is 100 kW in this case. The complete reward function is simply $R_{EO}$ minus the penalties, as shown in (10).

$R = R_{EO} - R_{SOC} - R_{G}$  (10)

where $R_{EO}$ is the reward from the economic optimization in (8), $R_{SOC}$ is the penalty that keeps the SOC within the acceptable limits from (7), and $R_{G}$ is the penalty from (9) that depends on the grid power when the SOC falls below 30%.
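A sketch of the reward terms (7)-(10) in Python. The normalization constant N is not given in the text, so a placeholder value is used; the SOC term is applied here with the sign that makes the -10 value of (7) act as a penalty, which appears to be the intent; and the grid term (9) is activated only below 30% SOC, per the third objective.

```python
def ems_reward(soc: float, h_no_action: float, h_with_action: float,
               p_grid_kw: float, n_norm: float = 1000.0, p_nom_kw: float = 100.0) -> float:
    """Reward sketch following (7)-(10); n_norm is an assumed placeholder for N."""
    # (7) SOC limit term: -10 outside the 10%-90% band, 0 inside it
    r_soc = -10.0 if (soc > 0.9 or soc < 0.1) else 0.0

    # (8) economic term relative to the "No Action Cost" baseline, clipped to [-2, 2]
    r_eo = (h_no_action - h_with_action) / n_norm
    r_eo = max(-2.0, min(2.0, r_eo))

    # (9) grid-exchange penalty, active only while the SOC is below 30%
    r_g = abs(p_grid_kw) / p_nom_kw if soc < 0.3 else 0.0

    # (10) total reward; r_soc is added so that its -10 value acts as a penalty
    return r_eo + r_soc - r_g
```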
F. The Learning Algorithm

There is a variety of reinforcement learning algorithms used in different problems, depending on the action space, the observation space, and the nature of the problem, but generally there are two approaches to solving reinforcement learning problems: either through the value function or directly through optimizing the policy. Actor-critic methods use both the value function and the policy to learn: the critic estimates the value function, and the actor updates the policy based on the value-function estimate provided by the critic [9]. The most widely used algorithm for microgrids in the literature is the Deep Q-Network (DQN), which was used in [13] and [18], while [19] used Q-learning and [20] used Double DQN, which are similar to DQN. Other algorithms such as Monte Carlo with a deep neural network [21] and DDPG [22] were used as well. In addition, policy-based algorithms have started gaining more traction: the Vanilla Policy Gradient (VPG) algorithm was employed in [23], and a comparison between Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO) was done in [24]. The authors in [25] recently utilized the PPO algorithm for scheduling optimization and found that it displayed superior performance to DQN. Furthermore, [12] compared seven different algorithms and proposed improved versions of PPO and the Asynchronous Advantage Actor-Critic (A3C) that provided superior performance. After testing a variety of algorithms including DQN, DDPG, Advantage Actor-Critic (A2C), and PPO, we found that PPO delivered the best results, which is why we decided to use PPO for training the RL agents. PPO was first introduced in [26] and was found to be superior to many of the existing algorithms such as A2C and DQN. It is a variant of an algorithm called Trust Region Policy Optimization [27] and maintains many of its advantages while being much simpler to implement. It works by alternating between sampling data from the policy and performing several rounds of first-order optimization.
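Training with PPO from stable-baselines3 then only requires wrapping the environment; the hyperparameters shown below are the library defaults given for illustration, not the values tuned in this work, and MicrogridEMSEnv refers to the hypothetical sketch in Section IV-D.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

env = MicrogridEMSEnv()
check_env(env)                                     # verify the environment follows the Gymnasium API

model = PPO("MlpPolicy", env, learning_rate=3e-4, gamma=0.99, verbose=1)
model.learn(total_timesteps=500_000)               # illustrative training budget
model.save("ppo_microgrid_ems")

# Deployment: the trained policy maps observations to a battery power reference
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```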
G. Complex Model and Model Predictive Control

A model that takes the switching of the power electronics and all layers of a hierarchical control structure into account was created to test the behavior of the RL agent within this system with more advanced dynamics. In this model, the rated power of the components is the same, and all the primary controllers for the inverters use Finite Set Model Predictive Control (FS-MPC), which tries to keep the DC-link voltage constant while producing the maximum power using MPPT algorithms. The battery controller also uses FS-MPC, and it receives a power reference from the RL agent. The control of the three-phase PV inverter is based on the strategy using the αβ-transformation developed by [28], while the battery and wind turbine converters were implemented based on the strategy found in [29]. The cost function components of the PV inverter MPC controller are depicted in (11)-(13).

$f_1 = \left(I_{s\alpha}^{*}(k) - I_{s\alpha}(k+1)\right)^2 + \left(I_{s\beta}^{*}(k) - I_{s\beta}(k+1)\right)^2$  (11)

$f_2 = \left(V_{Cf\alpha}^{*}(k) - V_{Cf\alpha}(k+1)\right)^2 + \left(V_{Cf\beta}^{*}(k) - V_{Cf\beta}(k+1)\right)^2$  (12)

$f_3 = \left(I_{Lf\alpha}^{*}(k) - I_{Lf\alpha}(k+1)\right)^2 + \left(I_{Lf\beta}^{*}(k) - I_{Lf\beta}(k+1)\right)^2$  (13)

where $I_{s\alpha}^{*}(k)$ and $I_{s\beta}^{*}(k)$ are the αβ components of the reference grid current, $I_{s\alpha}(k+1)$ and $I_{s\beta}(k+1)$ are the αβ components of the predicted grid current, $V_{Cf\alpha}^{*}(k)$ and $V_{Cf\beta}^{*}(k)$ are the αβ components of the reference filter capacitor voltage, $V_{Cf\alpha}(k+1)$ and $V_{Cf\beta}(k+1)$ are the αβ components of the predicted filter capacitor voltage, $I_{Lf\alpha}^{*}(k)$ and $I_{Lf\beta}^{*}(k)$ are the αβ components of the reference filter inductor current, and $I_{Lf\alpha}(k+1)$ and $I_{Lf\beta}(k+1)$ are the αβ components of the predicted filter inductor current.

The overall cost function $F$ is the weighted sum of $f_1$, $f_2$, and $f_3$, as shown in (14):

$F = f_1 + \lambda_v f_2 + \lambda_i f_3$  (14)

where $\lambda_v$ and $\lambda_i$ are weighting factors for the filter capacitor voltage and filter inductor current cost functions. More details about these equations can be found in [28].

The battery and wind turbine MPC cost function depends on the control of the active and reactive powers, as displayed in (15):

$f = |P^{*} - P(k+1)| + |Q^{*} - Q(k+1)|$  (15)

where $P^{*}$ and $Q^{*}$ are the reference active and reactive powers generated by the EMS in the case of the battery and by the speed controller in the case of the wind turbine, and $P(k+1)$ and $Q(k+1)$ are the predicted active and reactive powers obtained using the method described in [29].
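The following sketch illustrates how an FS-MPC controller evaluates the cost in (11)-(14) over the finite set of switching states and applies the one with the lowest cost; the prediction function is a hypothetical placeholder standing in for the converter and filter models of [28].

```python
def fsmpc_select(candidates, refs, predict, lambda_v: float = 0.1, lambda_i: float = 0.1):
    """Pick the switching state minimizing F = f1 + lambda_v*f2 + lambda_i*f3 from (11)-(14).
    `predict(state)` is a placeholder for the prediction model of [28]; it must return the
    predicted alpha-beta pairs (i_s, v_cf, i_lf) at step k+1 for the candidate state."""
    best_state, best_cost = None, float("inf")
    for state in candidates:                       # e.g. the 8 switch states of a two-level inverter
        i_s, v_cf, i_lf = predict(state)
        f1 = (refs["i_s"][0] - i_s[0]) ** 2 + (refs["i_s"][1] - i_s[1]) ** 2
        f2 = (refs["v_cf"][0] - v_cf[0]) ** 2 + (refs["v_cf"][1] - v_cf[1]) ** 2
        f3 = (refs["i_lf"][0] - i_lf[0]) ** 2 + (refs["i_lf"][1] - i_lf[1]) ** 2
        cost = f1 + lambda_v * f2 + lambda_i * f3
        if cost < best_cost:
            best_state, best_cost = state, cost
    return best_state
```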
[Figure: implementation workflow — acquire and process the data, explore the data, and develop the Simulink phasor and complex models.]
V. RESULTS

[Fig. 3: cumulative electricity cost over one year (time in hours vs. price in thousand NOK), comparing the cumulative cost without EMS with the cost obtained in the phasor simulation, the complex model, and the Python model.]

Fig. 3. Comparison of the achieved cumulative cost savings in (%)

[Fig. 4: normalized spot price and SOC versus time in hours; (a) result for 1 year, (b) result for the last 1000 hours.]

Fig. 4. Normalized results for the spot price and the SOC

[Fig. 5: normalized spot price, SOC, and RES power versus time in hours over two days.]

Fig. 5. Normalized results for the spot price, $P_{RES}$, and the SOC for 2 random days

Fig. 5 shows the renewable power along with the SOC and the spot price for two days. There is a correlation between the RES production and the SOC, where the SOC follows the graph of the renewable power production.
VI. CONCLUSION

This paper introduced a data-driven optimization technique that utilizes RL, exploiting both real weather data and electricity prices, and tested it on a full-scale complex model. Key differences between the complex and simplified models included reactive power, switching devices, controllers, and component dynamics. The algorithm's performance was less successful in the complex environment, with minor variations in power production and communication latency between the Simulink and Python models. The PPO algorithm, applied to the microgrid, achieved cost savings of 13.7%, which reduced to 7.48% in the complex model due to changing spot prices. Using reinforcement learning algorithms such as PPO for energy optimization yields promising results, and training with simplified models is a valid approach for complex environments. This research contributes to understanding how reinforcement learning can enhance microgrid performance and reduce operational costs.

ACKNOWLEDGMENTS

This work was supported in part by EEA and Norway Grants financed by Innovation Norway: DOITSMARTER (2022/337335), ENERGEIA (2022/346660), and Increased Knowledge on RES and Energy Efficiency (2022/346705).

REFERENCES

[1] M. C. Falvo, L. Martirano, D. Sbordone, and E. Bocci, "Technologies for smart grids: A brief review," in 2013 12th International Conference on Environment and Electrical Engineering, IEEE, Wroclaw, Poland, 2013, pp. 369–375.
[2] S. Kakran and S. Chanana, "Smart operations of smart grids integrated with distributed generation: A review," Renewable and Sustainable Energy Reviews, vol. 81, pp. 524–535, 2018.
[3] Y. Yoldaş, A. Önen, S. Muyeen, A. V. Vasilakos, and I. Alan, "Enhancing smart grid with microgrids: Challenges and opportunities," Renewable and Sustainable Energy Reviews, vol. 72, pp. 205–214, 2017.
[4] Department of Energy Office of Electricity Delivery and Energy Reliability, "Summary report: 2012 DOE microgrid workshop," 2012. [Online]. Available: https://www.energy.gov/sites/prod/files/2012%20Microgrid%20Workshop%20Report%2009102012.pdf [Accessed: 2022-05-24].
[5] M. Faisal, M. A. Hannan, P. J. Ker, A. Hussain, M. B. Mansor, and F. Blaabjerg, "Review of energy storage system technologies in microgrid applications: Issues and challenges," IEEE Access, vol. 6, pp. 35143–35164, 2018.
[6] J. Del Ser, D. Casillas-Perez, L. Cornejo-Bueno, et al., "Randomization-based machine learning in renewable energy prediction problems: Critical literature review, new results and perspectives," Applied Soft Computing, p. 108526, 2022.
[7] E. O. Arwa and K. A. Folly, "Reinforcement learning techniques for optimal power control in grid-connected microgrids: A comprehensive review," IEEE Access, vol. 8, pp. 208992–209007, 2020.
[8] D. Zhang, X. Han, and C. Deng, "Review on the research and practice of deep learning and reinforcement learning in smart grids," CSEE Journal of Power and Energy Systems, vol. 4, no. 3, pp. 362–370, 2018.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[10] V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, "Deep reinforcement learning solutions for energy microgrids management," in European Workshop on Reinforcement Learning (EWRL 2016), Barcelona, Spain, 2016.
[11] P. Chen, M. Liu, C. Chen, and X. Shang, "A battery management strategy in microgrid for personalized customer requirements," Energy, vol. 189, p. 116245, 2019.
[12] T. A. Nakabi and P. Toivanen, "Deep reinforcement learning for energy management in a microgrid with flexible demand," Sustainable Energy, Grids and Networks, vol. 25, p. 100413, 2021.
[13] Y. Ji, J. Wang, J. Xu, X. Fang, and H. Zhang, "Real-time energy management of a microgrid using deep reinforcement learning," Energies, vol. 12, no. 12, p. 2291, 2019.
[14] Y.-K. Wu, C.-R. Chen, and H. Abdul Rahman, "A novel hybrid model for short-term forecasting in PV power generation," International Journal of Photoenergy, vol. 2014, pp. 1–9, 2014.
[15] M. Abuella and B. Chowdhury, "Solar power probabilistic forecasting by using multiple linear regression analysis," in SoutheastCon 2015, IEEE, Fort Lauderdale, FL, USA, 2015, pp. 1–5.
[16] V. Thapar, G. Agnihotri, and V. K. Sethi, "Critical analysis of methods for mathematical modelling of wind turbines," Renewable Energy, vol. 36, no. 11, pp. 3166–3177, 2011.
[17] R. Chedid, H. Akiki, and S. Rahman, "A decision support technique for the design of hybrid solar-wind power systems," IEEE Transactions on Energy Conversion, vol. 13, no. 1, pp. 76–83, 1998.
[18] D. Domínguez-Barbero, J. García-González, M. A. Sanz-Bobi, and E. F. Sánchez-Úbeda, "Optimising a microgrid system by deep reinforcement learning techniques," Energies, vol. 13, no. 11, p. 2830, 2020.
[19] E. Foruzan, L.-K. Soh, and S. Asgarpoor, "Reinforcement learning approach for optimal distributed energy management in a microgrid," IEEE Transactions on Power Systems, vol. 33, no. 5, pp. 5749–5758, 2018.
[20] V.-H. Bui, A. Hussain, and H.-M. Kim, "Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties," IEEE Transactions on Smart Grid, vol. 11, no. 1, pp. 457–469, 2019.
[21] Y. Du and F. Li, "Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning," IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1066–1076, 2019.
[22] H. Bian, X. Tian, J. Zhang, and X. Han, "Deep reinforcement learning algorithm based on optimal energy dispatching for microgrid," in 2020 5th Asia Conference on Power and Electrical Engineering (ACPEE), IEEE, Tianjin, China, 2020, pp. 169–174.
[23] M. ELamin, F. Elhassan, and M. A. Manzoul, "Enhancing energy trading between different islanded microgrids: A reinforcement learning algorithm case study in northern Kordofan state," in 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), IEEE, Khartoum, Sudan, pp. 1–6.
[24] M. ELamin, F. Elhassan, and M. A. Manzoul, "Comparison of deep reinforcement learning algorithms in enhancing energy trading in microgrids," in 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), IEEE, Khartoum, Sudan, 2021, pp. 1–6.
[25] Y. Ji, J. Wang, J. Xu, and D. Li, "Data-driven online energy scheduling of a microgrid based on deep reinforcement learning," Energies, vol. 14, no. 8, p. 2120, 2021.
[26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, PMLR, Lille, France, 2015, pp. 1889–1897.
[28] C. Xue, D. Zhou, and Y. Li, "Hybrid model predictive current and voltage control for LCL-filtered grid-connected inverter," IEEE Journal of Emerging and Selected Topics in Power Electronics, vol. 9, no. 5, pp. 5747–5760, 2021.
[29] M. P. Akter, S. Mekhilef, N. M. L. Tan, and H. Akagi, "Model predictive control of bidirectional AC-DC converter for energy storage system," Journal of Electrical Engineering and Technology, vol. 10, no. 1, pp. 165–175, 2015.