Stochastics and Statistics

Stochastic search for a parametric cost function approximation: Energy storage with rolling forecasts

Saeed Ghadimi a,∗, Warren B. Powell b

a Department of Management Sciences, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
b Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, 08544, USA

Article info

Article history: Received 13 April 2022; Accepted 1 August 2023; Available online xxx

Keywords: Stochastic programming; Energy storage; Simulation optimization; Parametric cost function approximation; Rolling forecast

Abstract

Rolling forecasts have been almost overlooked in the renewable energy storage literature. In this paper, we provide a new approach for handling uncertainty not just in the accuracy of a forecast, but in the evolution of forecasts over time. Our approach shifts the focus from modeling the uncertainty in a lookahead model to accurate simulations in a stochastic base model. We develop a robust policy for making energy storage decisions by creating a parametrically modified lookahead model, where the parameters are tuned in the stochastic base model. Since computing unbiased stochastic gradients with respect to the parameters requires restrictive assumptions, we propose a simulation-based stochastic approximation algorithm based on numerical derivatives to optimize these parameters. While numerical derivatives, calculated from noisy function evaluations, provide biased gradient estimates, an online variance reduction technique built into the framework of our proposed algorithm enables us to control the accumulated bias errors and establish the finite-time rate of convergence of the algorithm. Our numerical experiments show the performance of this algorithm in finding policies that outperform the deterministic benchmark policy.

© 2023 Elsevier B.V. All rights reserved.

1. Introduction

Over the past 20 years, wind and solar have become important sources of clean energy and are becoming cost competitive with fossil fuels, but at the price of dealing with variability and uncertainty. For example, the total output of all the wind farms for PJM can drop from 5000 MW to 1000 MW in an hour or two. Overestimating the available supply from these sources may result in paying high prices to buy energy from other sources or not satisfying the demand. On the other hand, underestimation can result in missing these clean sources of energy.

In this paper, we focus on handling the uncertainty of energy from wind, where forecasts are notoriously inaccurate. There are many papers that handle the uncertainty in forecasts by solving stochastic models, but this prior work has ignored the presence of rolling forecasts, which are updated every 10 minutes. Classical methods based on Bellman's equation are not able to handle this, because it means that the forecast has to be included in the state variable, which dramatically increases the complexity of the value function.

Standard solution strategies tend to fix a forecast and hold it constant. A deterministic lookahead will assume the forecast is perfect, while classical stochastic models will capture the error in the forecast but ignore the ability to continuously update the forecast within the lookahead model. Fixing the forecast in the lookahead model means that it is being treated as a latent variable. Not only does this eliminate any ability to claim optimality, it can introduce significant errors, since it ignores the ability to delay making a decision now to exploit more accurate forecasts later.

We introduce a new approach that uses a parameterized deterministic lookahead, but where the parameters are tuned in a simulator that fully captures the rolling forecasts. Thus, we shift the emphasis from creating and solving a more realistic lookahead model to tuning an approximate lookahead model in a more accurate simulator. The approach is simple and practical, and produces results that are significantly better than a classical deterministic lookahead using rolling forecasts.

∗ Corresponding author.
E-mail addresses: [email protected] (S. Ghadimi), [email protected] (W.B. Powell).


There is a body of literature that focuses on developing energy storage policies. Here, we just present a list of them based on the three widely used policies: exact or approximate value functions (Jiang & Powell, 2015; Sioshansi, Madaeni, & Denholm, 2014; Xi & Sioshansi, 2016; Xi, Sioshansi, & Marano, 2014), policy function approximations (such as a "buy low, sell high" policy) (Dokka & Frimpong, 2019; Keerthisinghe, Chapman, & Verbič, 2019), and model predictive control (Almassalkhi & Hiskens, 2015; Kiaei & Lotfifard, 2018; Kumar et al., 2018; Zafar, Ravishankar, Fletcher, & Pota, 2018). To the best of our knowledge, in all existing energy storage models, forecasting is either ignored, or the set of forecasts over the entire horizon is fixed (see e.g., Dicorato, Forte, Pisani, & Trovato, 2012; Zhang, Zhang, Huang, & Lee, 2018). This assumption undermines real world problems where sources of uncertainty, in particular from renewable sources, are changing every few minutes. This highlights a broader challenge in sequential stochastic decision making problems when a given forecast for a source of uncertainty is updated frequently over the time horizon.

Several approaches have been proposed to generate forecasts of different sources of uncertainty in energy systems (which are then assumed to be fixed over the horizon). For example, time series models (Šaltytė Benth, Benth, & Jalinskas, 2007; Taylor & Buizza, 2003) and neural networks (Abhishek, Singh, Ghosh, & Anand, 2012; Khotanzad, Davis, Abaye, & Maratukulam, 1996) have been used to forecast the weather (temperature), which can affect both demand and supply. There are also different approaches to forecasting renewable energy such as solar radiation (Akarslan, Hocaoğlu, & Edizkan, 2014; Arbizu-Barrena, Ruiz-Arias, Rodríguez-Benítez, Pozo-Vázquez, & Tovar-Pescador, 2017) and wind speed (Liu, Shi, & Erdem, 2010; Traiteur, Callicutt, Smith, & Roy, 2012). More recently, a vector autoregression model was proposed in Liu, Roberts, & Sioshansi (2018) for forecasting temperature, wind speed, and solar radiation.

Our approach formalizes an idea that has been widely used in industry: an effective way to solve complex stochastic optimization problems is to shift the modeling of uncertainty from a lookahead approximation to the stochastic base model, which is captured by a simulator that includes the updating of rolling forecasts, as well as any other dynamics relevant to the problem. We first presented this idea more conceptually in Powell & Ghadimi (2022) for sequential decision making problems under uncertainty with updating forecasts. This approach, which we call parametric cost function approximations (CFAs), requires that we a) design a parameterized deterministic optimization problem and b) tune the parameters in a simulator. While the idea of parameterized policies is well known in the form of linear decision rules (also called affine policies), step functions such as order-up-to rules for inventory problems, or even neural networks, our idea of parameterizing an optimization model is new to the stochastic optimization community. We do not minimize the challenges of the two aforementioned steps, but they are done offline, and represent the research required to design a policy that is both robust yet no more difficult to compute than basic deterministic lookaheads. In this paper, we mainly focus on applying this idea to an energy storage problem under the presence of rolling forecasts and discuss its associated computational challenges.

We make three main contributions in this paper. First, we apply the idea of using a parametric CFA to handle uncertainty in the context of an energy storage problem with rolling forecasts. In contrast with a basic parametric model, our parameterized optimization model performs critical scaling functions and makes it possible to handle high-dimensional decisions. Second, we present a new simulation-based stochastic approximation (SA) algorithm, based on the Gaussian random smoothing technique, to optimize (tune) the parameters in the parametric CFA model while using only two function evaluations at each iteration. Our proposed algorithm is equipped with an online variance reduction technique which makes it more robust than the vanilla stochastic gradient method using numerical derivatives. Furthermore, we establish the finite-time convergence of this algorithm and show that its sample complexity is of the same order as the one presented in Nesterov & Spokoiny (2017), with slightly better dependence on the problem parameters, when applied to nonsmooth nonconvex problems. Finally, we propose several policies for parameterization of the CFA model and show that, when optimized with our proposed algorithm, they can outperform a deterministic benchmark policy using vanilla point forecasts for an energy storage problem.

The rest of this paper is organized as follows. We discuss the issue of rolling forecasts and their importance in sequential decision making under uncertainty in Section 2. We then present our energy storage model in Section 3. We discuss solution strategies in Section 4 and present our parametric CFA approach. We then propose a stochastic policy search algorithm to optimize the parameters within the parametric CFA model in Section 5 and establish its finite-time rate of convergence. We further show the performance of this algorithm in optimizing the aforementioned policies for an energy storage problem in Section 6 and conclude the paper with some remarks in Section 7.

2. Rolling forecasts

The problem of planning in the presence of rolling forecasts, which exhibit potentially high errors, is difficult and has been largely overlooked in the energy storage literature. A basic feature of rolling forecasts is that noise accumulates as we predict farther into the future. For example, at time t, denoting the forecast of energy available from wind for time t' by $\{f^E_{t,t'}\}_{t'\ge t}$ and assuming that $\{f^E_{0,t'}\}_{t'=0,\dots,\min(H,T)}$ is given, one can generate the forecasts as
\[
f^E_{t+1,t'} = f^E_{t,t'} + \varepsilon_{t+1,t'}, \qquad t = 0,\dots,T-1,\quad t' = t+1,\dots,\min(t+H,T), \tag{1}
\]
where T is the problem horizon, H is the size of the lookahead, $\varepsilon_{t+1,t'} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, and $\sigma_\varepsilon$ depends on $\rho_E f^E_{t,t'}$ for some constant $\rho_E$. This model is usually known as the "martingale model of forecast evolution" (see e.g., Graves, Dasu, & Qiu, 1986; Heath & Jackson, 1994; Sapra & Jackson, 2004).
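To make the forecast process (1) concrete, the following Python sketch (our own illustration, not code from the paper) simulates one sample path of rolling wind forecasts, assuming the noise standard deviation is $\sigma_\varepsilon = \rho_E f^E_{t,t'}$, consistent with the description after (1); the seeding of newly entering horizon points and the clipping at zero are our own assumptions.

```python
import numpy as np

def simulate_rolling_forecasts(f0, T, H, rho_E, rng=None):
    """Simulate one sample path of rolling wind-energy forecasts following Eq. (1).

    f0    : initial forecasts f^E_{0,t'} for t' = 0, ..., min(H, T)
    T, H  : problem horizon and lookahead size
    rho_E : noise scale; we assume sigma_eps = rho_E * f^E_{t,t'} (an assumption)
    Returns a dict mapping t to the array (f^E_{t,t'} : t' = t, ..., min(t+H, T)).
    """
    rng = rng or np.random.default_rng()
    forecasts = {0: np.asarray(f0, dtype=float)}
    for t in range(T):
        prev = forecasts[t]                       # prev[k] corresponds to f^E_{t, t+k}
        nxt = []
        for tp in range(t + 1, min(t + H, T) + 1):
            # previous forecast for time tp; if tp enters the horizon for the first
            # time, seed it from the farthest available forecast (our assumption)
            f_prev = prev[tp - t] if tp - t < len(prev) else prev[-1]
            eps = rng.normal(0.0, rho_E * f_prev)  # martingale increment of Eq. (1)
            nxt.append(max(f_prev + eps, 0.0))     # clip at zero (our assumption)
        forecasts[t + 1] = np.asarray(nxt)
    return forecasts
```

With H = 23, T = 24 and $\rho_E = 0.2$ (the values used in Section 6), `forecasts[t][0]` plays the role of the realized energy $E_t = f^E_{t,t}$ at time t.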
The standard approach to handling forecasts is to fix them over the planning horizon (ignoring the reality that they will actually be changing) and optimize over a deterministic future. However, an optimal policy would require modeling the evolution of forecasts over time, something that we have never seen done in a lookahead model. An alternative is to fix the forecast (say, at time t) over a horizon t' ∈ {t, t+1, ..., t+H} and solve a stochastic dynamic program. With this strategy, the forecast becomes a latent variable in the lookahead model. These models are computationally difficult, and it would be hard to do this, for example, every 5–10 minutes as might be required for an energy storage problem.

The failure to capture rolling forecasts represents a more significant modeling error than has been recognized in the research literature. Fixing the forecast as a latent variable ignores our ability to wait to make decisions at a later time with a more accurate forecast. Properly modeling rolling forecasts and their associated errors represents a surprisingly complex challenge in a lookahead model. If we have a rolling forecast extending 24 hours into the future, including the forecast in the state variable introduces a 24-dimensional component of the state variable into the model, without any particular structure that we can exploit.

Indeed, there is an important tradeoff: including a dynamically varying forecast in the state variable produces a more complex, higher dimensional state variable, but one which does not have to be re-optimized when the forecast changes. By contrast, treating the forecast as a latent variable, as has been done in classical dynamic programming models using Bellman's equation, simplifies the model, but requires that the model be re-optimized when the forecast changes.

For this reason, we are going to adopt a completely different approach. Rather than developing a more accurate lookahead model, we are going to use a parameterized, deterministic lookahead model, where the parameters are tuned in the simulator which captures the updating of rolling forecasts. While the parameterization needs to be carefully designed, this strategy shifts the focus from solving a complex lookahead model to using a realistic simulator, where it is much easier to handle complex dynamics. However, this parameterized policy brings the challenge of optimizing its parameters, which requires an efficient stochastic search method. We will present details of our proposed algorithm in Section 5. We will also discuss the base model and lookahead models in more detail in Section 4.

3. Energy storage model

In this section, we describe an energy storage model involving rolling forecasts of wind. Assume that a smart grid manager must satisfy a recurring power demand with a stochastic supply of renewable energy, an unlimited supply of energy from the main power grid at a stochastic price, and access to local rechargeable storage devices. At the beginning of each period, the manager combines different sources of available energy from the grid, local storage, and wind to satisfy the current load. Moreover, depending on the main power grid price, they may decide to purchase more energy from the grid and recharge the local storage if it has any remaining capacity. In the case of excess energy from wind, they may also decide to sell it to the main power grid. We also consider different energy prices for the grid and customers to allow more flexibility in our model.

We now formally present our energy storage model by introducing the five key elements of sequential decision making under uncertainty (Powell, 2019; Powell & Meisel, 2016a), namely, state variables, decision variables, exogenous information variables, the transition function, and the objective function.

The state variables. The state variable at time t, $S_t$, includes the following.
$R_t$: The level of energy in storage satisfying $R_t \in [0, R^{\max}]$, where $R^{\max} > 0$ represents the storage capacity.
$\{f^E_{t,t'}\}_{t'\ge t}$: The forecast of energy from wind at time t' made at time t, where the current energy $E_t = f^E_{t,t}$.
$P^g_{[t]}$: The forward curve of spot prices of electricity from the grid, with the notation $[t] = \{t'\}_{t'\ge t}$.
$P^m_{[t]}$: The market price of electricity.
$D_{[t]}$: The load curve.
Hence the state of the system can be represented by the vector $S_t = (R_t, f^E_{t,t'}, P^m_{[t]}, P^g_{[t]}, D_{[t]})\ \forall t' \ge t$.

The decision variables. At time t, several decisions should be made to satisfy the load and replenish the storage device for the future.
$x^{wd}_t$: The available energy from the wind used to satisfy the load.
$x^{rd}_t$: The allocated energy from the storage used to satisfy the load.
$x^{gd}_t$: The purchased energy from the grid used to satisfy the load.
$x^{wr}_t$: The available energy from the wind transferred to storage.
$x^{gr}_t$: The purchased energy from the grid used to store.
$x^{rg}_t$: The stored energy to be sold to the grid.
Hence, the manager's decision variables at time t are defined as the vector
\[
x_t = \left(x^{wd}_t, x^{rd}_t, x^{gd}_t, x^{wr}_t, x^{gr}_t, x^{rg}_t\right) \ge 0,
\]
which should satisfy the following constraints:
\[
\begin{aligned}
& x^{wd}_t + \beta^d x^{rd}_t + x^{gd}_t \le D_t, \qquad
 x^{rd}_t + x^{rg}_t \le R_t, \qquad
 x^{wr}_t + x^{wd}_t \le E_t, \\
& \beta^c \left(x^{wr}_t + x^{gr}_t\right) - x^{rd}_t - x^{rg}_t \le R^{\max} - R_t, \qquad
 x^{wr}_t + x^{gr}_t \le \gamma^c, \qquad
 x^{rd}_t + x^{rg}_t \le \gamma^d,
\end{aligned} \tag{2}
\]
where $\beta^c, \beta^d \in (0,1)$ are the charge and discharge efficiencies, and $\gamma^c$ and $\gamma^d$ are the maximum amounts of energy that can be charged to or discharged from the storage device. Fig. 1 summarizes the model.

[Fig. 1. Energy storage model.]

The exogenous information. The exogenous information $W_t$ describes the information that first becomes known at time t. For our energy storage model, we assume that the spot price of electricity from the grid, the market price of electricity, and the load are deterministic. However, for the sake of general modeling, we include their changes from period t to t+1 in the exogenous information. It also includes the change in the forecasts of the wind. It should be pointed out that the exogenous information for the next period may depend on the current decision and/or state of the system.

The transition function. The transition function $S^M(\cdot)$ explicitly describes the relationship between the state of the model at time t and t+1, such that $S_{t+1} = S^M(S_t, x_t, W_{t+1})$. More specifically, the relationship of storage levels between periods is defined as
\[
R_{t+1} = R_t - x^{rd}_t + \beta^c\left(x^{wr}_t + x^{gr}_t\right) - x^{rg}_t. \tag{3}
\]
The forecast for the wind is also updated according to (1). The spot price of electricity from the grid, the market price of electricity, and the load are assumed to be fixed over the horizon and do not change once they are given at t = 1.

The objective function. To evaluate the effectiveness of a policy or sequence of decisions, we need an objective function representing the expected sum of the costs $C_t(S_t, x_t)$ in each time period t over a finite horizon. Denoting the penalty of not satisfying the demand by $C^P$, for a given state $S_t$ and decision $x_t$, the cost realized at t is given by
\[
C_t(S_t, x_t) = C^P D_t - \left(C^P + P^m_t\right)\left(x^{wd}_t + \beta^d x^{rd}_t + x^{gd}_t\right) - P^g_t\left(\beta^d x^{rg}_t - x^{gr}_t - x^{gd}_t\right). \tag{4}
\]
Therefore, we seek to find the policy that solves
\[
\min_{\pi \in \Pi} \mathbb{E}\left[\sum_{t=0}^{T} C_t\left(S_t, X^{\pi}_t(S_t)\right) \,\Big|\, S_0\right], \tag{5}
\]
where $X^{\pi}_t(S_t)$ denotes the decision function (policy) determining the decision variable $x_t$, and the initial state $S_0$ is assumed to be known. If the cost function, transition function and constraints are linear, a deterministic lookahead policy can be constructed as a linear program if point forecasts of exogenous information are provided. Eq. (5), along with the transition function and the exogenous information process, is called the base model, which can be used to model virtually any sequential stochastic decision making problem, with possibly minor twists in the objective function for specific classes of problems such as risk measures.
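As an illustration of the base model dynamics, the sketch below (our own illustration with hypothetical container names, not the authors' code) applies the storage transition (3) and the one-period cost (4) to a given decision vector.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    # the six nonnegative decision variables of Section 3
    x_wd: float
    x_rd: float
    x_gd: float
    x_wr: float
    x_gr: float
    x_rg: float

def storage_transition(R_t, x, beta_c):
    """Eq. (3): next storage level given the current decision."""
    return R_t - x.x_rd + beta_c * (x.x_wr + x.x_gr) - x.x_rg

def period_cost(D_t, P_m, P_g, C_P, x, beta_d):
    """Eq. (4): realized cost at time t (penalized unmet demand minus revenues)."""
    served = x.x_wd + beta_d * x.x_rd + x.x_gd
    return C_P * D_t - (C_P + P_m) * served - P_g * (beta_d * x.x_rg - x.x_gr - x.x_gd)
```

In the experiments of Section 6 the efficiencies are set to $\beta^c = \beta^d = 1$, so charging and discharging are lossless there.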


4. Designing policies

In this section, we first review the existing solution strategies to solve the base model and then present our new approach, namely, the parametric cost function approximation.

4.1. The four classes of policies

There are two general strategies for designing policies to solve the base model. The first is to use policy search, where we have to tune the parameters of a policy so that it works well over time. The second is to build a policy that makes the best decision now, minimizing costs now and into the future (we call these lookahead policies).

The policy search class can be divided into two classes: policy function approximations (PFAs), where the policy is an analytical function that maps states to actions (such as a linear model or a neural network); and cost function approximations (CFAs), which consist of an optimization problem that has been parameterized so that it produces good solutions over time.

The lookahead class can also be divided into two classes. The class of value function approximations (VFAs) is the familiar approach based on Bellman's equation, where we might compute (more often, approximate) the value of being in the downstream state produced by a decision now. The class of direct lookahead approximations (DLAs) is based on direct lookaheads, where we optimize over some planning horizon. The challenge with DLAs is how to handle uncertainty as we optimize over the horizon. Most practical tools such as Google Maps use a deterministic approximation. Building uncertainty explicitly into the lookahead model is challenging. The ultimate stochastic lookahead would require solving
\[
X^{*}_t(S_t) = \operatorname*{argmin}_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + \mathbb{E}\left[ \min_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t'=t+1}^{T} C_{t'}\left(S_{t'}, X^{\pi}_{t'}(S_{t'})\right) \,\Big|\, S_{t+1} \right] \,\Big|\, S_t, x_t \right] \right).
\]
In special cases, the lookahead portion of the above equation can be computed exactly using Bellman's equation:
\[
V_t(S_t) = \min_{x_t} \left( C_t(S_t, x_t) + \mathbb{E}\left[ V_{t+1}(S_{t+1}) \mid S_t, x_t \right] \right),
\]
where $V_{t+1}(\cdot)$ denotes the value of the downstream impact of a decision $x_t$ made in state $S_t$. More often, we have to replace this value function with an approximation, but this only works when we can exploit structure such as convexity, linearity or monotonicity. Often, we have to directly approximate the lookahead by creating a lookahead model, opening the door to a variety of approximation strategies, including the use of deterministic lookaheads, approximating the state variable and exogenous information process (this is where we can ignore the presence of rolling forecasts), along with the use of restricted policies. However, the best approach depends on the problem setting (see Powell & Meisel, 2016 for more details).

4.2. The parametric cost function approximation

In this subsection, we propose using a hybrid policy combining deterministic lookaheads with parametrically modified CFAs, which can efficiently handle the issue of rolling forecasts. Consider a deterministic lookahead policy given by
\[
\begin{aligned}
X^{\text{D-LA}}_t(S_t) = \operatorname*{argmin}_{x_t, \tilde{x}_t}\ & C_t(S_t, x_t) + \sum_{t'=t+1}^{\min(t+H,T)} C_{t'}(S_{t'}, \tilde{x}_{t,t'}), \\
\text{s.t.}\quad & (2),\ (3),\quad x_t, \tilde{x}_t \ge 0, \text{ and} \\
& \tilde{x}^{wd}_{t,t'} + \beta^d \tilde{x}^{rd}_{t,t'} + \tilde{x}^{gd}_{t,t'} \le D_{t'}, \\
& \tilde{x}^{rd}_{t,t'} + \tilde{x}^{rg}_{t,t'} \le R_{t'}, \\
& \tilde{x}^{wr}_{t,t'} + \tilde{x}^{wd}_{t,t'} \le f^E_{t,t'}, \\
& \beta^c\left(\tilde{x}^{wr}_{t,t'} + \tilde{x}^{gr}_{t,t'}\right) - \tilde{x}^{rd}_{t,t'} - \tilde{x}^{rg}_{t,t'} + R_{t'} \le R^{\max}, \\
& \tilde{x}^{wr}_{t,t'} + \tilde{x}^{gr}_{t,t'} \le \gamma^c, \\
& \tilde{x}^{rd}_{t,t'} + \tilde{x}^{rg}_{t,t'} \le \gamma^d, \\
& R_{t'} - \tilde{x}^{rd}_{t,t'} + \beta^c\left(\tilde{x}^{wr}_{t,t'} + \tilde{x}^{gr}_{t,t'}\right) - \tilde{x}^{rg}_{t,t'} = R_{t'+1},
\end{aligned} \tag{6}
\]
where $\tilde{x}_t = (\tilde{x}_{t,t'})_{t'=t+1}^{\min(t+H,T)}$. When we solve the above model, we keep $x_t$ to compute the portion of the cost function at time t, discard all $\tilde{x}_{t,t'}$, and repeat this process as we move forward over the problem horizon.

In the parametric CFA approach, we parameterize the lookahead model in (6), in which the parametric terms can be added to the cost function and/or constraints. In this paper, we focus on parameterizing constraints that include noisy forecasts. Hence, our hybrid policy $X^{\pi}_t(S_t \mid \theta)$ is defined as the solution to the linear programming model (6) in which the wind energy constraint is updated as
\[
\tilde{x}^{wr}_{t,t'} + \tilde{x}^{wd}_{t,t'} \le b_{t'}\left(f^E_{t,t'}, \theta\right), \tag{7}
\]
where $b_{t'}$ is a real valued function and θ is the set of constraint parameters.

We then need to optimize the values of the parameters θ, for a given policy π, by solving
\[
\min_{\theta} F^{\pi}(\theta) := \mathbb{E}_{\omega}\left[\bar{F}^{\pi}(\theta, \omega)\right] = \mathbb{E}\left[\sum_{t=0}^{T} C_t\left(S_t(\omega), X^{\pi}_t(S_t(\omega) \mid \theta)\right) \,\Big|\, S_0\right], \tag{8}
\]
where ω denotes the randomness in the model, for which $\bar{F}^{\pi}(\theta, \omega)$ represents the stochastic cumulative cost of the parametrized policy over the horizon. It should be pointed out that the more general optimization problem associated with the parametric CFA approach is to optimize over the structure of policies and their parameterization simultaneously. However, our focus in this paper is to solve problem (8) to optimize the parameters for a given structure of a policy. The tuned parameters capture the proper dynamics of forecasts, unlike an optimal solution to a stochastic lookahead that uses a fixed forecast. However, tuning is not easy since the above problem is usually nonconvex, and we will discuss an approximation algorithm in Section 5 to solve it. We refer interested readers to the companion paper (Powell & Ghadimi, 2022), in which we describe the idea of the parametric CFA approach in more detail and for general decision making problems under uncertainty.
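To clarify how a noisy evaluation of (8) is produced, the sketch below (our own illustration) rolls the parameterized lookahead policy forward through one simulated sample path. Here `lookahead_policy` is a hypothetical stand-in for a solver of the LP (6) with the modified constraint (7), and `simulate_rolling_forecasts`, `storage_transition` and `period_cost` refer to the earlier sketches; the `params` dictionary keys are our own naming.

```python
def simulate_policy_cost(theta, lookahead_policy, T, H, rho_E, f0,
                         prices_g, prices_m, loads, R0, params, rng=None):
    """One noisy evaluation of F-bar^pi(theta, omega): the cumulative cost of the
    parameterized lookahead policy along a single simulated trajectory."""
    forecasts = simulate_rolling_forecasts(f0, T, H, rho_E, rng)  # exogenous sample path
    R_t, total_cost = R0, 0.0
    for t in range(T + 1):
        # lookahead_policy is assumed to solve (6) with wind constraint (7) and
        # return only the implemented first-period decision x_t (hypothetical API)
        x_t = lookahead_policy(R_t, forecasts[t], prices_g[t:], prices_m[t:],
                               loads[t:], theta, params)
        total_cost += period_cost(loads[t], prices_m[t], prices_g[t],
                                  params["C_P"], x_t, params["beta_d"])
        if t < T:
            R_t = storage_transition(R_t, x_t, params["beta_c"])  # Eq. (3)
    return total_cost
```

Averaging such evaluations over many simulated trajectories approximates $F^\pi(\theta)$, which is how the tuned policies are scored in Section 6.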


An important step in the parametric CFA approach is to consider meaningful parameterized policies in the model. This step is truly domain dependent and can be significantly different from one problem to another. Indeed, this step is the art of modeling that draws on a statistical model or the knowledge and insights of the domain experts. For the energy storage problem in this paper, we assume that uncertainties only exist in the wind forecasts. Therefore, we propose the following (a code sketch of these three parameterizations follows the list):

• Constant parameterization (π = const) - This parameterization uses a single scalar to modify the forecast of energy from wind for the entire horizon, such that $b_{t'}$ in (7) is set to $b_{t'}(f^E_{t,t'}, \theta) = \theta \cdot f^E_{t,t'}$.

• Lookup table parameterization (π = lkup) - Overestimating or underestimating forecasts of energy from wind influences how aggressively a policy will store energy. We can modify the forecast for each period of the lookahead model with a unique parameter $\theta_\tau$. This parameterization is a lookup table representation because there is a different $\theta_\tau$ for each lookahead period, $\tau = 0, 1, 2, \dots$ This implies that $b_{t'}(f^E_{t,t'}, \theta) = \theta_{t'-t} \cdot f^E_{t,t'}$, where $t' \in [t+1, \min(t+H, T)]$ and $\tau = t'-t$. If $\theta_\tau < 1$ the policy will be more conservative and decrease the risk of running out of energy. Conversely, if $\theta_\tau > 1$ the policy will be more aggressive and less adamant about maintaining large energy reserves. This is a time-independent (or stationary) parameterization since the modification of the forecast at each time period depends only on how far in the future the forecasts are provided.

• Exponential decay parameterization (π = exp) - Instead of calculating a set of parameters for every period within the lookahead model, we can make our parameterization a function of time and a few parameters. Intuitively, we can assume the forecasts become worse the farther we look into the future. Hence, it might be good to try some decaying functions of the parameters to decrease the impact of errors in forecasts for the far future. To do this, we suggest using the following exponential function of two variables, which also limits the search space of parameters to a two dimensional plane, i.e., $b_{t'}(f^E_{t,t'}, \theta) = f^E_{t,t'} \cdot \theta_1 \cdot e^{\theta_2 (t'-t)}$.

Similar parameterization schemes can also be proposed for the right-hand sides of other constraints in the lookahead model, if they include noisy forecasts. The combination of these parameterizations can then be used in the parametric CFA model, but tuning the higher dimensional parameter vector becomes harder.
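The following sketch (our own illustration) implements the three choices of $b_{t'}(f^E_{t,t'}, \theta)$ described above as plain functions of the forecast value and the lookahead offset τ = t' − t; each returns the right-hand side of the modified wind constraint (7).

```python
import numpy as np

def b_const(f_forecast, theta, tau):
    """pi = const: one scalar scales the forecast for every lookahead period."""
    return theta * f_forecast

def b_lookup(f_forecast, theta, tau):
    """pi = lkup: theta is a vector with one multiplier per lookahead offset tau."""
    return theta[tau - 1] * f_forecast          # tau = t' - t >= 1, stored 0-based

def b_exp(f_forecast, theta, tau):
    """pi = exp: theta = (theta1, theta2) gives f * theta1 * exp(theta2 * tau)."""
    theta1, theta2 = theta
    return f_forecast * theta1 * np.exp(theta2 * tau)
```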
5. The stochastic search algorithm

Our goal in this section is to solve problem (8) under specific assumptions on $F^\pi(\theta)$. Even for a simple parameterization, this function is possibly nonconvex and nonsmooth, which makes the optimization problem hard to solve. On the other hand, computing unbiased (sub)gradient estimates of the objective function w.r.t. the parameters may be prohibitive or impossible. We first present a result on computing an unbiased stochastic (sub)gradient of $F^\pi(\theta)$ under certain conditions. We then discuss the setting in which we cannot compute these stochastic (sub)gradients and only have access to noisy evaluations of $F^\pi(\theta)$. We present a simulation-based optimization algorithm based on a randomized Gaussian smoothing technique and establish its finite-time rate of convergence to a stationary point of problem (8) when $F^\pi(\theta)$ is possibly nonsmooth and nonconvex.

Stochastic approximation algorithms require computing stochastic (sub)gradients of the objective function iteratively. Due to the special structure of $F^\pi(\theta)$, its (sub)gradient can be computed recursively under certain conditions, as shown in the next result.

Proposition 5.1. Assume $\bar{F}^\pi(\cdot, \omega)$ is convex/concave for every $\omega \in \Omega$, and $F^\pi(\cdot)$ is finite valued in a neighborhood of θ. If the distribution of ω is independent of θ, we have
\[
\nabla_\theta F^\pi(\theta) = \mathbb{E}\left[\nabla_\theta \bar{F}^\pi(\theta, \omega)\right],
\]
where
\[
\nabla_\theta \bar{F}^\pi(\theta) = \frac{\partial C_0}{\partial X^\pi_0} \cdot \frac{\partial X^\pi_0}{\partial \theta}
+ \sum_{t=1}^{T}\left[ \frac{\partial C_t}{\partial S_t} \cdot \frac{\partial S_t}{\partial \theta}
+ \frac{\partial C_t}{\partial X^\pi_t} \cdot \left( \frac{\partial X^\pi_t}{\partial S_t} \cdot \frac{\partial S_t}{\partial \theta} + \frac{\partial X^\pi_t}{\partial \theta} \right)\right], \tag{9}
\]
and
\[
\frac{\partial S_t}{\partial \theta} = \frac{\partial S_t}{\partial S_{t-1}} \cdot \frac{\partial S_{t-1}}{\partial \theta}
+ \frac{\partial S_t}{\partial X^\pi_{t-1}} \cdot \left( \frac{\partial X^\pi_{t-1}}{\partial S_{t-1}} \cdot \frac{\partial S_{t-1}}{\partial \theta} + \frac{\partial X^\pi_{t-1}}{\partial \theta} \right),
\]
in which ω is dropped for simplicity.

Proof. If $\bar{F}^\pi(\cdot, \omega)$ is convex or concave for every $\omega \in \Omega$, and $F^\pi(\cdot)$ is finite valued in a neighborhood of θ, then we have $\nabla_\theta \mathbb{E}[\bar{F}^\pi(\theta, \omega)] = \mathbb{E}[\nabla_\theta \bar{F}^\pi(\theta, \omega)]$ by Strassen (1965). Applying the chain rule, we find
\[
\begin{aligned}
\nabla_\theta \bar{F}^\pi(\theta) &= \frac{d}{d\theta}\left[ C_0(S_0, X^\pi_0) + \sum_{t=1}^{T} C_t(S_t, X^\pi_t) \right]
= \frac{\partial C_0}{\partial X^\pi_0} \cdot \frac{d X^\pi_0}{d\theta} + \sum_{t=1}^{T} \frac{d}{d\theta}\, C_t(S_t, X^\pi_t) \\
&= \frac{\partial C_0}{\partial X^\pi_0} \cdot \frac{d X^\pi_0}{d\theta} + \sum_{t=1}^{T}\left[ \frac{\partial C_t}{\partial S_t} \cdot \frac{\partial S_t}{\partial \theta} + \frac{\partial C_t}{\partial X^\pi_t} \cdot \frac{d X^\pi_t}{d\theta} \right] \\
&= \frac{\partial C_0}{\partial X^\pi_0} \cdot \frac{d X^\pi_0}{d\theta} + \sum_{t=1}^{T}\left[ \frac{\partial C_t}{\partial S_t} \cdot \frac{\partial S_t}{\partial \theta} + \frac{\partial C_t}{\partial X^\pi_t} \cdot \left( \frac{\partial X^\pi_t}{\partial S_t} \cdot \frac{\partial S_t}{\partial \theta} + \frac{\partial X^\pi_t}{\partial \theta} \right)\right],
\end{aligned}
\]
where
\[
\frac{\partial S_t}{\partial \theta} = \frac{\partial S_t}{\partial S_{t-1}} \cdot \frac{\partial S_{t-1}}{\partial \theta}
+ \frac{\partial S_t}{\partial X^\pi_{t-1}} \cdot \left( \frac{\partial X^\pi_{t-1}}{\partial S_{t-1}} \cdot \frac{\partial S_{t-1}}{\partial \theta} + \frac{\partial X^\pi_{t-1}}{\partial \theta} \right). \qquad \square
\]

Note that if $\bar{F}^\pi(\theta)$ is not differentiable, then its subgradient can still be computed using (9). However, when $\bar{F}^\pi(\theta)$ is not convex (concave), its subgradient may not exist and the concept of a generalized subgradient should be employed. If $\nabla_\theta \bar{F}^\pi(\theta, \omega)$ exists for every $\omega \in \Omega$, the ability to calculate its unbiased estimator allows us to use SA-type techniques such as stochastic gradient descent (SGD) to determine the optimal parameter $\theta^*$. However, this is not always the case. The function $F^\pi(\theta)$ can be generally nonsmooth and nonconvex, and hence its subgradient may not exist everywhere. Moreover, calculating (9) may not be easy. Therefore, we propose an alternative way to estimate the gradient of $F^\pi(\theta)$.

To simplify our notation, we drop the superscript π for the policies in the definition of the objective function, and in the rest of this section π only refers to the number π. Before we proceed, we assume that the objective function is Lipschitz continuous w.r.t. θ with constant $L_0 > 0$, for any $\omega \in \Omega$, i.e.,
\[
|\bar{F}(\theta_1, \omega) - \bar{F}(\theta_2, \omega)| \le L_0 \|\theta_1 - \theta_2\| \qquad \forall\, \theta_1, \theta_2,
\]
which consequently implies that $F(\theta)$ is Lipschitz continuous with constant $L_0$. This is a reasonable assumption for most applications, as the cost (objective function) does not make sudden changes w.r.t. small changes of resources (policies). This property will be used to establish the convergence analysis of our proposed algorithm. Furthermore, we assume that noisy evaluations of $F(\theta)$ can be obtained through simulations, and hence we can use techniques from simulation-based optimization, where even the shape of the function may not be known (see e.g., Fu, 2015 and the references therein).


zeroth-order SA algorithm and establish its finite-time convergence for ∇ Fηk (θ k ) at each iteration k, which is a convex combination
analysis to solve problem (8). of all generated zeroth-order gradient estimators up to iteration
A smooth approximation of the function F (θ ) can be defined by k. Indeed, taking this weighted average of gradient estimators will
the following convolution: help use to reduce the variance of these estimates. When working
 with zeroth-order estimators, we have an additional level of noise
1
Fη (θ ) = F (θ + ηv )e− 2 v dv = Ev [F (θ + ηv )].
1 2
(10) (Gaussian noise in our case) to use in the finite-difference formula.
( 2π ) 2
d

Hence, making the choice of αk less than 1, will act as an online


where η > 0 is the smoothing parameter and v ∈ Rd is a Gaussian variance reduction technique which is more beneficial in the case
random vector whose mean is zero and covariance is the identity of derivative-free setting.
matrix. The following result in Nesterov & Spokoiny (2017) pro- On the other hand, if the unbiased stochastic gradient of f is
vides some properties of Fη (· ). available and used instead of Gηk (θ k , ωk ), Algorithm 1 reduces to
a variant of the algorithm proposed in Ghadimi, Ruszczynski, &
Lemma 5.1. The following statements hold for any Lipschitz continu- Wang (2020) for nested problems. Thus, one may easily establish
ous function F with constant L0 . the convergence analysis of Algorithm 1 assuming it is applied
(a) The function Fη is differentiable and its gradient is given by to minimize Fη (θ ). However, this does not specify the choice of
 smoothing parameter η. In addition, the smoothing parameter can
1 F (θ +ηv )−F (θ )
∇ Fη (θ ) =
1
ve− 2
2
v be different at each iteration and hence, the convergence analysis
η
( 2π ) 2
d
of Ghadimi et al. (2020) is not directly applicable as the smoothing
  function is changed every iteration.
F (θ +ηv )−F (θ )
d v = Ev η v . (11)
In the next result, we provide the main convergence analysis of
our proposed algorithm.
(b) The gradient of Fη is Lipschitz continuous with constant Lη =
√ Theorem 5.1. Let {θk } be generated by Algorithm 1, F̄ (θ ) be Lipschitz
d
η L0 , and for any θ ∈ Rd , we have continuous with constant L0 , and F (θ ) be bounded below by F ∗ . If
 parameters are chosen such that
|Fη (θ ) − F (θ )| ≤ ηL0 d, (12)
Algorithm 1 The stochastic averaging numerical gradient method
    (SANG).
Ev F (θ + ηv ) − F (θ ) v 2 ≤ η2 L20 (d + 4 )2 . (13)
1: Input: θ 0 ∈ Rd , Ḡ0 = 0, an iteration limit N, positive sequences
{ηk }k≥1 , {βk }k≥1 , {αk }k≥1 ∈ (0, 1 ), and a probability mass func-
tion (PMF) PR (· ) supported on {1, . . . , N}.
We also need the following result about using different smooth-
2: Generate a random number R according to PR (· ).
ing parameters.

Lemma 5.2. Assume that the function F is Lipschitz continuous with For k = 1, . . . , R:
constant L0 and η1 , η2 > 0. Then, for any θ ∈ Rd , we have 3: Update policy parameters as

2 L 0 d | η2 − η1 | θyk = θ k−1 − βk Ḡk−1 , (16)


∇ Fη2 (θ ) − ∇ Fη1 (θ ) ≤ . (14)
η2 θ = (1 − αk )θ
k k−1
+α θk
k y. (17)

Proof. Noting (11), we have


4: Generate a trajectory ωk where k (ω k ) =
St+1
∇ Fη2 (θ ) − ∇ Fη1 (θ ) SM (Stk (ωk ), Xtπ (Stk (ωk )|θ k ), Wt+1 (ωk )),
    and a random
 
= Ev η1 F (θ +η2 v )−η2 F (ηθ2+ηη1 1 v )+(η2 −η1 )F (θ ) v  Gaussian vector vk to compute the gradient estimator as
   Gηk (θ k , ωk ) = F̄ (θ +ηk v ,ωη )−F̄ (θ ,ω ) vk ,
k k k k k
(18)
|F (θ +η2 v )−F (θ +η1 v )|
≤ Ev η2 + |η2 −η1 ||F (ηθ2+ηη1 1 v )−F (θ )| v k

and set
2 L 0 d | η2 − η1 |
≤ , Ḡk = (1 − αk )Ḡk−1 + αk Gηk (θ k , ωk ). (19)
η2
where the last inequality follows from the Lipschitz continuity of End For
F and the fact that Ev [ v 2 ] = d. 
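A minimal sketch of Algorithm 1 in Python is given below (our own illustration, not the authors' code). The noisy objective `F_bar(theta, rng)` stands for one simulated cumulative cost $\bar{F}(\theta, \omega)$, for example a wrapper around the `simulate_policy_cost` sketch of Section 4; the sequences $\alpha_k, \beta_k, \eta_k$ are supplied by the caller, and R is drawn with probabilities proportional to $\alpha_k\beta_k$ as in (39) of Corollary 5.1.

```python
import numpy as np

def sang(F_bar, theta0, N, alpha, beta, eta, rng=None):
    """Stochastic averaging numerical gradient (SANG) method, Algorithm 1.

    F_bar : callable (theta, rng) -> noisy objective value F-bar(theta, omega)
    theta0: initial parameter vector
    N     : iteration limit
    alpha, beta, eta: callables k -> alpha_k, beta_k, eta_k (1-indexed)
    Returns the randomly selected iterate theta^R and the averaged estimator G-bar^R.
    """
    rng = rng or np.random.default_rng()
    d = len(theta0)
    # sample the output index R with P_R(k) proportional to alpha_k * beta_k, Eq. (39)
    w = np.array([alpha(k) * beta(k) for k in range(1, N + 1)])
    R = rng.choice(np.arange(1, N + 1), p=w / w.sum())

    theta, G_bar = np.array(theta0, dtype=float), np.zeros(d)
    for k in range(1, R + 1):
        theta_y = theta - beta(k) * G_bar                     # Eq. (16)
        theta = (1 - alpha(k)) * theta + alpha(k) * theta_y   # Eq. (17)
        v = rng.standard_normal(d)                            # Gaussian direction v^k
        seed = rng.integers(2**31)                            # same trajectory omega^k for both calls
        diff = (F_bar(theta + eta(k) * v, np.random.default_rng(seed))
                - F_bar(theta, np.random.default_rng(seed)))
        G = diff / eta(k) * v                                 # two-point estimator, Eq. (18)
        G_bar = (1 - alpha(k)) * G_bar + alpha(k) * G         # online averaging, Eq. (19)
    return theta, G_bar
```

Each iteration uses exactly two noisy function evaluations, as stated in the contributions; the stepsize β_k may also be set adaptively, e.g. with the RMSProp rule (42) used in Section 6.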

In the next result, we provide the main convergence analysis of our proposed algorithm.

Theorem 5.1. Let $\{\theta^k\}$ be generated by Algorithm 1, $\bar{F}(\theta)$ be Lipschitz continuous with constant $L_0$, and $F(\theta)$ be bounded below by $F^*$. If the parameters are chosen such that
\[
\beta_0 = 1, \quad \eta_{k+1} \le \eta_k, \quad \beta_{k+1} \le \beta_k, \quad \beta_k \le \frac{c_2 \eta_k}{L_0 \sqrt{d}}, \quad \sum_{i=k}^{N} \alpha_i A_i \le c_1 A_k \quad \forall k \ge 1 \tag{20}
\]
for some positive constants $c_1, c_2 > 0$, where
\[
A_k = \prod_{i=1}^{k} (1 - \alpha_i) \quad \forall k \ge 1, \tag{21}
\]
then we have
\[
\sum_{k=2}^{N} \alpha_k \beta_k\, \mathbb{E}\left[\|\bar{G}^{k-1}\|^2\right] \le F^* - F(\theta^0) + 2\eta_1 L_0 \sqrt{d} + \frac{L_0^2 (d+4)^2 (4 + c_2)}{2} \sum_{k=1}^{N} \beta_k \alpha_k^2, \tag{22}
\]
\[
\sum_{k=1}^{N} \alpha_k \beta_k\, \mathbb{E}\left[\|\nabla F_{\eta_k}(\theta^k)\|^2\right] \le 2\left(1 + 2 c_1 c_2^2\right)\left(F^* - F(\theta^0) + 2\eta_1 L_0 \sqrt{d}\right)
+ L_0^2 (d+4)^2 \sum_{k=1}^{N} \beta_k \left[ \frac{8 c_1 (\eta_k - \eta_{k-1})^2}{\alpha_k \eta_k^2} + \left(2(1 + c_1) + (4 + c_2)\left(1 + 2 c_1 c_2^2\right)\right) \alpha_k^2 \right], \tag{23}
\]
where the expectation is taken w.r.t. the random vector ω and the Gaussian random vector v.


E[ Gηk (θ k , ωk ) − Ḡk−1 2
] ≤ 4L20 (d + 4 )2 ,

N   
which together with (26) and the assumption that Lηk βk ≤ c2 , im-
αk βk E ∇ Fηk (θ k ) 2 ≤ 2(1 + 2c1 c22 )(F ∗ − F (θ 0 )+2η1 L0 d )
ply that
k=1
 2αk βk Ḡk−1 2
+ βk Ḡk 2
≤ βk−1 Ḡk−1 2

N
8c1 (ηk − ηk−1 )2
+ L20 (d + 4 ) 2
βk  
αk ηk2 + 2 Fηk (θ ) − Fηk (θ
k k−1
) + L20 (d + 4 )2 (4 + c2 )βk αk2 .
k=1
Summing up the above inequalities, and rearranging the terms, and
+ [2(1 + c1 ) + (4 + c2 )(1 + 2c1 c22 )]αk2 , (23) noting the fact that Ḡ0 = 0, we obtain


N 
N
where the expectation is taken w.r.t the random vector ω, and Gaus- 2 αk βk E[ Ḡk−1 2 ] ≤ 2N + L20 (d + 4 )2 (4 + c2 ) βk αk2 ,
sian random vector v. k=2 k=1

Proof. First note that F (θ ) is Lipschitz continuous with constant L 


where N = FηN (θ N ) − Fη1 (θ 0 ) + N−1 [F (θ k ) − Fηk+1 (θ k )]. Noting
k=1 ηk
due to the same assumption on F̄ (θ ). Hence, the gradient of Fη (θ )
(12), the fact that F (θ ) ≤ F ∗ for any θ ∈ Rd , (10), Lipschitz continu-
is Lipschitz continuous with constant Lη due to Lemma 5.1.b which
ity of F , and the assumption that ηk+1 ≤ ηk we have
together with the fact that 
 
θ k − θ k−1 = αk θyk − θ k−1 = −αk βk Ḡk−1 (24) FηN (θ N ) − Fη1 (θ 0 ) ≤ F ∗ − F (θ 0 ) + (η1 + ηN )L0 d,

due to (16) and (17), imply that Fηk (θ ) − Fηk+1 (θ ) = Ev [F (θ + ηk v ) − F (θ + ηk+1 v )]


k k k k

Lηk ≤ L0 (ηk − ηk+1 )Ev [ v ] ≤ L0 d (ηk − ηk+1 ),
Fηk (θ k−1 ) ≥ Fηk (θ k ) + ∇ Fηk (θ k ), θ k−1 − θ k  − θ k − θ k−1 2

2
which clearly implies that N ≤ F ∗ − F (θ 0 ) + 2η1 L0 d. We then
Lη α 2 β 2
= Fηk (θ k ) + αk βk ∇ Fηk (θ k ), Ḡk−1  − k k k Ḡk−1 2 immediately conclude (22).
2 Now, observe that we need to bound ∇ Fηk (θ k ) − Ḡk 2 as
= Fηk (θ k ) + αk βk δk + Gηk (θ k , ωk ), Ḡk−1   
Lηk α 2
β2 ∇ Fηk (θ k ) 2
≤2 ∇ Fηk (θ k ) − Ḡk 2
+ Ḡk 2

− k k
Ḡk−1 2
, (25) 
2
≤2 ∇ Fηk (θ k ) − Ḡk 2
+ (1 − αk ) Ḡk−1 2
where δk := ∇ Fηk (θ k ) − Gηk (θ k , ωk ). Moreover, by (19), we have 
Ḡ k 2
= Ḡ − Ḡ k k−1 2
+ Ḡ k−1 2
+ 2 Ḡ − Ḡk k−1
, Ḡ k−1
 + αk Gηk (θ k , ωk ) 2
, (28)
= α Gηk (θ , ω ) − Ḡ
2
k
k
+ Ḡ k k−1 2 k−1 2
where the second inequality comes from the convexity of · 2

+ 2αk Gηk (θ , ω ) − Ḡ , Ḡk−1 .


k k k−1
and (19). Noting (19) and the definition of δk , we have
Multiplying the above relation by βk /2, combining it by (25), not- ∇ Fηk (θ k ) − Ḡk = (1 − αk )(∇ Fηk (θ k ) − Ḡk−1 ) + αk δk , (29)
ing the fact that βk ≤ βk−1 re-arranging the terms, we obtain
which implies that
βk−1
Fηk (θ k ) + β2k Ḡk 2
≤ Fηk (θ k−1 ) + 2
Ḡk−1 2
− αk βk Ḡk−1 2
∇ Fηk (θ k ) − Ḡk 2
= (1 − αk )(∇ Fηk (θ k ) − Ḡk−1 ) 2
+ αk2 δk 2

−αk βk δk , Ḡ  k−1
+ 2αk (1 − αk ) ∇ Fηk (θ ) − Ḡ
k k−1
, δk .
  (30)
β α2
+ k2 k Gηk (θ k , ωk )−Ḡk−1 2 +Lηk βk Ḡk−1 2 . Moreover, by the convexity of · 2 and we have
(26) (1 − αk )(∇ Fηk (θ ) − Ḡ k k−1
) 2
= (1 − αk )(∇ Fηk−1 (θ k−1 ) − Ḡk−1 )
Dividing both sides of (19) by Ak (defined in (21)), summing them + αk ek 2

up, and noting that Ḡ0 = 0, we obtain ≤ (1 − αk ) ∇ Fηk−1 (θ k−1 ) − Ḡk−1 2


k
αi + αk ek 2
, (31)
Ḡk = Ak Gηi (θ i , ωi ), (27)
Ai where
i=1

which together with the convexity of · 2 and the fact that (1 − αk )  


ek = ∇ Fηk (θ k ) − ∇ Fηk−1 (θ k−1 ) . (32)
αk

k
αi α1 
k
αi α1 
k
1 1 1
= + = + − = −1 Hence, by (30) and (31),
Ai A1 Ai A1 Ai Ai−1 Ak
i=1 i=2 i=2
∇ Fηk (θ k ) − Ḡk 2
≤ (1 − αk ) ∇ Fηk−1 (θ k−1 ) − Ḡk−1 2
due to (21), imply
+ αk ek 2 + αk2 δk 2

k
αi
Ḡk 2
≤ ( 1 − Ak )Ak Gηi (θ i , ωi ) 2
. + 2αk (1 − αk ) ∇ Fηk (θ k ) − Ḡk−1 , δk . (33)
Ai
i=1
Thus, similar to (27), we can obtain
Moreover, by Lipschitz continuity of F̄ (·, ω ), (13), and (15), we have
∇ Fηk (θ k ) − Ḡk 2
  
E[δk ] = 0,

k
αi αi2
≤ Ak ∇ Fη0 (θ 0 ) 2
+ ei 2
+ δi 2
Ai Ai
E[ Gηk (θ k , ωk ) 2
] ≤ L20 (d + 4 )2 , i=1


k
αi 
k
αi (1 − αi )
E[ Ḡk 2
] ≤ L20 (d + 4 )2 Ak ≤ L20 (d + 4 )2 , + 2Ak ∇ Fηi (θ i ) − Ḡi−1 , δi . (34)
Ai Ai
i=1 i=1


In the rest of the proof, only for the sake of simplicity, we assume that $\eta_0$ and $\theta^0$ are chosen such that $\nabla F_{\eta_0}(\theta^0) = 0$. Now, by (24), the Lipschitz continuity of $\bar{F}(\cdot, \omega)$ and $\nabla F_\eta$, (13), and (14), we have
\[
\left\|\nabla F_{\eta_k}(\theta^k) - \nabla F_{\eta_{k-1}}(\theta^{k-1})\right\|^2
\le 2\left( \left\|\nabla F_{\eta_k}(\theta^k) - \nabla F_{\eta_k}(\theta^{k-1})\right\|^2 + \left\|\nabla F_{\eta_k}(\theta^{k-1}) - \nabla F_{\eta_{k-1}}(\theta^{k-1})\right\|^2 \right)
\le 2\left( (L_{\eta_k} \alpha_k \beta_k)^2 \|\bar{G}^{k-1}\|^2 + \frac{4 L_0^2 d^2 (\eta_k - \eta_{k-1})^2}{\eta_k^2} \right),
\]
\[
\mathbb{E}\left[\|\delta_k\|^2\right] \le 2\left( \left\|\nabla F_{\eta_k}(\theta^k)\right\|^2 + \mathbb{E}\left[\left\|G_{\eta_k}(\theta^k, \omega^k)\right\|^2\right] \right) \le 2 L_0^2 (d+4)^2, \qquad
\mathbb{E}\left[\left\langle \nabla F_{\eta_k}(\theta^k) - \bar{G}^{k-1}, \delta_k \right\rangle\right] = 0.
\]
Therefore, taking expectations on both sides of (34) and noting (32), we obtain
\[
\begin{aligned}
\sum_{k=1}^{N} \beta_k \alpha_k\, \mathbb{E}\left[\left\|\nabla F_{\eta_k}(\theta^k) - \bar{G}^k\right\|^2\right]
&\le \sum_{k=1}^{N} \beta_k \alpha_k A_k \left( \sum_{i=1}^{k} \frac{\alpha_i}{A_i}\, \mathbb{E}[\|e_i\|^2] + \sum_{i=1}^{k} \frac{\alpha_i^2}{A_i}\, \mathbb{E}[\|\delta_i\|^2] \right) \\
&= \sum_{k=1}^{N} \frac{\alpha_k}{A_k}\left( \sum_{i=k}^{N} \beta_i \alpha_i A_i \right)\left( \mathbb{E}[\|e_k\|^2] + \alpha_k\, \mathbb{E}[\|\delta_k\|^2] \right) \\
&\le \sum_{k=1}^{N} \frac{\alpha_k \beta_k}{A_k}\left( \sum_{i=k}^{N} \alpha_i A_i \right)\left( \mathbb{E}[\|e_k\|^2] + \alpha_k\, \mathbb{E}[\|\delta_k\|^2] \right) \\
&\le c_1 \sum_{k=1}^{N} \alpha_k \beta_k \left( \mathbb{E}[\|e_k\|^2] + \alpha_k\, \mathbb{E}[\|\delta_k\|^2] \right) \\
&\le \sum_{k=1}^{N} \left( 2 c_1 c_2^2\, \alpha_k \beta_k\, \mathbb{E}[\|\bar{G}^{k-1}\|^2] + \frac{4 L_0^2 d^2 \beta_k (\eta_k - \eta_{k-1})^2}{\alpha_k \eta_k^2} \right)
+ 2 c_1 L_0^2 (d+4)^2 \sum_{k=1}^{N} \beta_k \alpha_k^2,
\end{aligned} \tag{35}
\]
where the second to fourth inequalities follow from the assumptions in (20). Combining the above relation with (22) and (28), we obtain (23). $\square$

In the next result, we specialize the rate of convergence of Algorithm 1 by specifying its parameters.

Corollary 5.1. Let the assumptions in the statement of Theorem 5.1 hold and an iteration limit N ≥ 1 be given. If the parameters are set to
\[
\alpha_k = \frac{1}{\sqrt{\delta (d+4) N}}, \qquad \eta_k = \frac{\delta}{L_0 \sqrt{d}}, \qquad \beta_k = \frac{\delta}{L_0^2 d}, \qquad k = 1, \dots, N, \tag{36}
\]
for some δ > 0, then we have
\[
\mathbb{E}\left[\|\bar{G}^R\|^2\right] \le \frac{L_0^2 (d+4)^{\frac{3}{2}}\left(F^* - F(\theta^0) + 5\right)}{\sqrt{\delta N}}, \tag{37}
\]
\[
\mathbb{E}\left[\left\|\nabla F_{\eta_R}(\theta^R)\right\|^2\right] \le \frac{6 L_0^2 (d+4)^{\frac{3}{2}}\left(F^* - F(\theta^0) + 6\right)}{\sqrt{\delta N}}, \tag{38}
\]
where the expectation is also taken w.r.t. the random integer R, whose probability distribution is supported on $\{1, \dots, N\}$ and is given by
\[
P_R(R = k) = \frac{\alpha_k \beta_k}{\sum_{k=1}^{N} \alpha_k \beta_k}, \qquad k \in \{1, \dots, N\}. \tag{39}
\]

Proof. First, note that by the choices of parameters in (36), we have
\[
\sum_{k=1}^{N} \beta_k \alpha_k = \frac{\delta N}{L_0^2 d \sqrt{\delta (d+4) N}} = \frac{1}{L_0^2 d}\sqrt{\frac{\delta N}{d+4}}, \qquad
\sum_{k=1}^{N} \beta_k \alpha_k^2 = \frac{\delta N}{L_0^2 d\, \delta (d+4) N} \le \frac{1}{L_0^2 d (d+4)},
\]
\[
\sum_{i=k}^{N} \alpha_i A_i = \sum_{i=k}^{N} A_i\, \frac{A_{i-1} - A_i}{A_{i-1}} = \sum_{i=k}^{N} (A_{i-1} - A_i)(1 - \alpha_i) = \sum_{i=k}^{N} (A_i - A_{i+1}) = A_k - A_{N+1} \le A_k \quad \forall k \ge 1,
\]
which implies that the assumptions in (20) hold with $c_1 = c_2 = 1$, and together with (22) and (23) imply (37) and (38). $\square$

We now add a few remarks about the above results. First, note that a sufficient condition to obtain an ε-stationary point of the smooth approximation problem (any $\bar{\theta} \in \mathbb{R}^d$ such that $\mathbb{E}[\|\nabla F_\eta(\bar{\theta})\|^2] \le \varepsilon$) is to make the RHS of (38) less than the target accuracy ε, which, after neglecting some constants, implies that the total number of function evaluations (2N) is bounded by
\[
\mathcal{O}\!\left( \frac{L_0^4 d^3}{\delta \varepsilon^2} \right). \tag{40}
\]
This bound is slightly better than the one obtained in Nesterov & Spokoiny (2017) (for the weighted average of $\mathbb{E}[\|\nabla F_\eta(\theta^k)\|^2]$ without introducing the random index R) in terms of the dependence on $L_0$. It should be noted that due to the choice of η in (36), the parameter δ controls the error between the original objective function and its smooth approximation, i.e., $|F(\theta) - F_{\delta/(L_0\sqrt{d})}(\theta)| \le \delta$ for any given θ. Hence, as δ goes to zero, the output of Algorithm 1 will be closer to a stationary point of problem (8).

Second, we can adaptively choose $\beta_k$ and $\eta_k$ such that they gradually converge to zero. For example, if both $\beta_k$ and $\eta_k$ are of the order of $1/k^\gamma$ for some $\gamma \in (0,1)$, the algorithm is still convergent, albeit with a worse complexity than (40). In this case, we do not need to use a very small smoothing parameter in the beginning iterations of the algorithm.

Third, the weighted average of stochastic gradients in (19) is used to reduce the variance associated with the gradient estimates. To further reduce this variance, one can use a mini-batch of samples to compute (18). In particular, given a batch size of $m_k$ and generating samples $\omega^k = \{\omega^{k,i}\}_{i=1}^{m_k}$, the stochastic gradient used in (19) is computed as
\[
G_{\eta_k}(\theta^k, \omega^k) = \frac{1}{m_k} \sum_{i=1}^{m_k} G_{\eta_k}(\theta^k, \omega^{k,i}). \tag{41}
\]
This additional averaging will further improve the practical performance of the algorithm, as shown in the next section. Also, it is worth noting that $\mathbb{E}[\|\nabla F_{\eta_R}(\theta^R) - \bar{G}^R\|^2]$ converges to zero at the same rate presented in Corollary 5.1. Hence, $\bar{G}^k$ can be used as an online certificate to assess the quality of generated solutions without taking an extra batch of samples. This is another advantage of using the weighted average of stochastic gradients to update the policy at each iteration of the SANG method.


Fig. 2. Performance of the constant forecast parameterization.

Fig. 3. Averaged performance of the lookup parameterization policy under perfect forecasts. Each curve represents the performance of the lookup policy when changing one θi (i = 1, ..., 9) while θj = 1 ∀ j ≠ i. The remaining θi (i > 9) have similar behavior and are removed to increase the readability of the graph.

Fig. 4. Performance of the exponential decay parameterization policy.


Finally, when the smoothing parameter is fixed, $\beta_k$ can be set to any number while changing the rate of convergence by a constant factor. Hence, practically successful stepsize policies can be tried. For example, one can use the adaptive stepsize formula widely used in the machine learning community for stochastic optimization, namely, Root Mean Square Propagation (RMSProp) (Tieleman & Hinton, 2012), given by
\[
\beta_k = \frac{b}{\sqrt{\bar{g}_k}}, \qquad \bar{g}_k = (1 - \gamma_k)\, \bar{g}_{k-1} + \gamma_k \|G_k\|^2, \tag{42}
\]
where b is a tunable parameter, $\gamma_k \in (0,1)$ is the learning rate, and $G_k$ is the gradient estimate at the kth iteration. This stepsize policy performs well in our experiments, as shown in the next section.
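A small sketch of this stepsize rule, combined with the mini-batch estimator (41), is given below (our own illustration); it can replace the fixed β_k and the single-sample estimator inside the SANG loop sketched in Section 5.

```python
import numpy as np

def rmsprop_stepsize(g_bar_prev, G_k, b=1.0, gamma=0.1):
    """Eq. (42): RMSProp-style stepsize from a running average of squared gradient norms."""
    g_bar = (1.0 - gamma) * g_bar_prev + gamma * float(np.dot(G_k, G_k))
    return b / np.sqrt(g_bar), g_bar

def minibatch_gradient(F_bar, theta, eta_k, m_k, d, rng):
    """Eq. (41): average m_k two-point estimators, one per simulated trajectory."""
    G = np.zeros(d)
    for _ in range(m_k):
        v = rng.standard_normal(d)
        seed = rng.integers(2**31)                      # same trajectory for both evaluations
        diff = (F_bar(theta + eta_k * v, np.random.default_rng(seed))
                - F_bar(theta, np.random.default_rng(seed)))
        G += diff / eta_k * v
    return G / m_k
```

Section 6 tunes b, the weight multiplier a of (44), and the batch size m_k, settling on b = 1, a = 2, m_k = 1.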

6. Numerical experiments

In this section, we test the performance of the parametric CFA approach in designing different parameterization policies for the energy storage problem discussed in Section 3. To do so, we compare the objective function for a given policy π parameterized by θ with that of the deterministic benchmark policy (6), denoted by $F^{D\text{-}LA}$. We then report the improvement as
\[
\Delta F^{\pi}(\theta) = \frac{F^{\pi}(\theta) - F^{D\text{-}LA}}{|F^{D\text{-}LA}|}, \tag{43}
\]
where $F^{\pi}$ here is the average of the noisy objective values given by (8) with the contribution function (4), in thousands of dollars. In all of our experiments, we generate 1000 sample paths and report the averaged function evaluations to approximate the objective function, i.e., $F^{\pi}(\theta) = \sum_{i=1}^{1000} \bar{F}^{\pi}(\theta, \omega_i)/1000$.

To better show the practical performance of the parametric CFA approach, we assume that we are only given forecasts for the renewable energy over the rest of the day and all other information is known in the energy storage problem. In particular, we assume that the forecasts are generated according to (1) with the choice of H = 23, T = 24 and $\rho_E = 0.2$ unless otherwise stated. We also assume that the spot prices of electricity from the grid and the load are fixed over the horizon, while changing slightly from one sample to another in our data set. As the observed load is usually cyclic, the load curve is generated with a sinusoidal function. The forward curve of spot prices is also generated from the load curve, as they are correlated. Moreover, the energy storage model parameters are set to the values in Table 1.

Table 1. Energy storage model parameters.
β^d = 1, β^c = 1, γ^d = 25, γ^c = 25, R^max = 400, R_0 = 20, C^P = 30, P^m_t = 10.

[Fig. 5. Average performance of Algorithm 1 to optimize the time-independent lookup table parameterization policy with ρ_E = 0.2 over 5 runs of 200 iterations. Top: different learning rates (b in (42)) while a = m_k = 1; Middle: different weighted average parameters (a in (44)) while b = m_k = 1; Bottom: different batch sizes (m_k in (41)) while b = 1, a = 2.]

In our first set of experiments, we evaluate the performance of the constant forecast parameterization (π = const). We generate five data sets of forecasts with different levels of noise (ρ_E ∈ {0, 0.1, 0.2, 0.3, 0.4}) and perform a grid search over the values of θ. We then compare the averaged objective values with the benchmark policy (θ = 1). Since the range of these values is high, we show the normalized objective improvement in Fig. 2 for the purpose of better presentation. Under perfect forecasts (ρ_E = 0) the benchmark policy works best, as expected. However, under the presence of noisy forecasts, the optimal policy changes from θ = 1 and the constant parameterization improves the objective function.

We also examine the performance of the lookup table parameterization policy with H = 23 under perfect forecasts (ρ_E = 0). In particular, we first set all values of θ to 1 and then do a one-dimensional search over each coordinate of θ. As can be seen from Fig. 3, under perfect forecasts, the optimal value of each coordinate of θ equals 1 when the others are set to 1.

In our next experiment, we evaluate the performance of the exponential decay parameterization policy (π = exp). In this case, there are two parameters, θ1 and θ2, to be optimized. We do a grid search to find the best possible values of these parameters. Our search shows that the best improvements are obtained when θ1 ∈ [1, 1.2] and θ2 ∈ [10^{-3}, 10^{-2}]. Indeed, for most choices of parameters outside of these ranges, the performance of the proposed policy is worse than that of the benchmark. Thus, we focus on these ranges, and the results are shown in Fig. 4. When θ2 = 0, the exponential decay parametrized policy corresponds to the constant one, and the best performance is achieved when θ1 = 1.1, as already shown in Fig. 2 (when ρ_E = 0.2). When θ1 = 1, we get a little improvement for a few values of θ2, while the performance gets worse than the benchmark policy for its other values. Similar behavior can be observed when θ1 = 1.2. However, when θ1 = 1.1, we get significant improvement over the benchmark policy. In this case, the exponential decay parameterized policy also outperforms the constant one when θ2 > 0. Indeed, it can achieve almost 100% improvement over the benchmark policy, while the constant one achieves around 40% improvement.
search shows that the best improvements are obtained when θ1 ∈ expected as we have more ability to capture the stochasticity of


Fig. 6. Performance of Algorithm 1 and CMA-ES to optimize the time-independent lookup table parameterization policy with ρ_E = 0.2 over 10 runs, with the starting point θi = 1.

Fig. 7. Performance of Algorithm 1 run from four different starting points to optimize the time-independent lookup table parameterization policy with ρ_E = 0.2 over 10 runs of 500 iterations. The algorithm parameters are set to a = 2, b = 1, m_k = 1.

This is intuitively expected, as we have more ability to capture the stochasticity of the wind forecast under the exponential decay parameterized policy.

In our last set of experiments, we test the performance of Algorithm 1 in optimizing the parameters for the parametric CFA approach. We focus on the lookup table parameterization (π = lkup), which has a larger search space ($\theta \in \mathbb{R}^{23}$). To do so, we first need to fine-tune the parameters used in the algorithm. We use the stepsize policy in (42) with the choice of $\gamma_k = 0.1$ (as is common in the machine learning literature), different batch sizes at each iteration to compute (18) with (41), and the weight coefficients in (36) multiplied by a constant factor, i.e.,
\[
\alpha_k = \frac{a}{\sqrt{\delta (d+4) N}}, \tag{44}
\]
for some a > 0. We also set δ = 1 and $\eta_k = 0.1$. We then fine-tune a in (44), b in (42), and $m_k$ in (41) by running Algorithm 1 five times, each for 200 iterations starting from $\theta_i = 1\ \forall i$. We then evaluate the objective improvement after using 40 samples (20 iterations when $m_k = 1$). First, we fine-tune the algorithm w.r.t. b by setting $a = m_k = 1$. As can be seen from the top graph in Fig. 5, the best performance is achieved by the choice of b = 1. Using this choice of b and $m_k = 1$, we fine-tune the parameter a and report the results in the middle graph of Fig. 5. As can be observed, the choice of a = 2 achieves the best performance. Finally, we fine-tune the batch size $m_k$ by setting b = 1, a = 2. As can be seen from the bottom graph of Fig. 5, the best performance is obtained by the choice of $m_k = 1$. Hence, we set the parameters to a = 2, b = 1, $m_k = 1$ for the rest of this experiment. We should emphasize that a better approach for fine-tuning more than one hyperparameter might be a bilevel programming model, in which the upper level problem is defined as an optimization problem w.r.t. the hyperparameters over a validation set, while the lower optimization problem is defined w.r.t. the parametrized policies over the training data set. Recently, gradient-based models have been developed to solve such problems efficiently; however, this is out of the scope of this paper.

After fine-tuning the aforementioned parameters, we compare the performance of Algorithm 1 against that of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is a well-known derivative-free global optimization method. Since this method is designed for deterministic problems, we use the sample average approximation (SAA) approach for estimating the function values. We use a fixed number of samples (50 samples in total in each iteration of CMA-ES), run both methods 10 times, and report the average and standard deviation of the percentage of objective improvement in Fig. 6.


As can be seen, Algorithm 1 outperforms CMA-ES in terms of both the quality of the solutions and their variability. The possible reason is that good estimates of the objective function are a key point in the success of CMA-ES, which requires a large number of samples and is computationally very expensive in our case.

We also test the performance of Algorithm 1 by running it for 500 iterations using four different starting points. One of them is $\theta_i = 1\ \forall i$ and the other three are randomly chosen such that $\theta_i \in [0.5, 1.5]\ \forall i$. Again, we repeat the runs 10 times and report the average and standard deviation of the percentage of objective improvement in Fig. 7. As can be seen, Algorithm 1 is able to find good parameterized policies regardless of the starting points. Moreover, it can improve the benchmark policy by up to 150%, which is significantly higher than that of the exponential decay (100%) and constant (40%) parametrized policies. While the numbers can change by changing the problem parameters, this highlights again the importance of careful parametrization in our proposed parametric CFA approach.

7. Conclusion

We provide a hybrid policy of deterministic lookahead and cost function approximations (CFA), namely, the parametric CFA, to find the best policy for energy storage problems under the presence of rolling forecasts. While this approach can handle the complex stochastic models associated with rolling forecasts, it comes at the cost of tuning parameters (policies). The objective function in the parametric CFA model is likely to be nonconvex, and its unbiased gradient estimates are not easy to calculate. Hence, we present a new stochastic numerical derivative-based algorithm which only uses noisy function evaluations (obtained via simulations) to provide biased gradient estimates. By properly taking a weighted average of these biased gradient estimates, we reduce the variance associated with them, which enables us to control the accumulated bias errors. Furthermore, we establish the finite-time rate of convergence of this algorithm under different settings and show that it can practically find policies that perform better than the deterministic benchmark policy in optimizing an energy storage system under the presence of rolling forecasts.

References

Abhishek, K., Singh, M., Ghosh, S., & Anand, A. (2012). Weather forecasting model using artificial neural network. Procedia Technology, 4, 311–318.
Akarslan, E., Hocaoğlu, F. O., & Edizkan, R. (2014). A novel M-D (multi-dimensional) linear prediction filter approach for hourly solar radiation forecasting. Energy, 73(C), 978–986.
Almassalkhi, M. R., & Hiskens, I. A. (2015). Model-predictive cascade mitigation in electric power systems with storage and renewables–Part II: Case-study. IEEE Transactions on Power Systems, 30(1), 78–87.
Arbizu-Barrena, C., Ruiz-Arias, J. A., Rodríguez-Benítez, F. J., Pozo-Vázquez, D., & Tovar-Pescador, J. (2017). Short-term solar radiation forecasting by advecting and diffusing MSG cloud index. Solar Energy, 155, 1092–1103.
Dicorato, M., Forte, G., Pisani, M., & Trovato, M. (2012). Planning and operating combined wind-storage system in electricity market. IEEE Transactions on Sustainable Energy, 3(2), 209–217.
Dokka, T., & Frimpong, R. (2019). Approximate policy iteration using neural networks for storage problems. arXiv preprint arXiv:1910.01895.
Fu, M. C. (Ed.). (2015). Handbook of simulation optimization. Springer.
Ghadimi, S., & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341–2368.
Ghadimi, S., Ruszczynski, A., & Wang, M. (2020). A single timescale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 30(1), 960–979.
Graves, S. C., Meal, H. C., Dasu, S., & Qiu, Y. (1986). Two-stage production planning in a dynamic environment. In S. Axsäter, C. Schneeweiss, & E. Silver (Eds.), Multi-stage production planning and inventory control (pp. 9–43). Berlin, Heidelberg: Springer Berlin Heidelberg.
Heath, D. C., & Jackson, P. L. (1994). Modeling the evolution of demand forecasts with application to safety stock analysis in production/distribution systems. IIE Transactions, 26(3), 17–30.
Jiang, D. R., & Powell, W. B. (2015). Optimal hour-ahead bidding in the real-time electricity market with battery storage using approximate dynamic programming. INFORMS Journal on Computing, 27(3), 525–543.
Keerthisinghe, C., Chapman, A. C., & Verbič, G. (2019). Energy management of PV-storage systems: Policy approximations using machine learning. IEEE Transactions on Industrial Informatics, 15(1), 257–265.
Khotanzad, A., Davis, M. H., Abaye, A., & Maratukulam, D. J. (1996). An artificial neural network hourly temperature forecaster with applications in load forecasting. IEEE Transactions on Power Systems, 11(2), 870–876.
Kiaei, I., & Lotfifard, S. (2018). Tube-based model predictive control of energy storage systems for enhancing transient stability of power systems. IEEE Transactions on Smart Grid, 9(6), 6438–6447.
Kumar, R., Wenzel, M. J., Ellis, M. J., ElBsat, M. N., Drees, K. H., & Zavala, V. M. (2018). A stochastic model predictive control framework for stationary battery systems. IEEE Transactions on Power Systems, 33(4), 4397–4406.
Liu, H., Shi, J., & Erdem, E. (2010). Prediction of wind speed time series using modified Taylor Kriging method. Energy, 35(12), 4870–4879.
Liu, Y., Roberts, M. C., & Sioshansi, R. (2018). A vector autoregression weather model for electricity supply and demand modeling. Journal of Modern Power Systems and Clean Energy, 6(4), 763–776.
Nesterov, Y., & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 527–566. https://doi.org/10.1007/s10208-015-9296-2.
Powell, W. B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275(3), 795–821.
Powell, W. B., & Ghadimi, S. (2022). The parametric cost function approximation: A new approach for multistage stochastic programming. arXiv preprint arXiv:2201.00258.
Powell, W. B., & Meisel, S. (2016a). Tutorial on stochastic optimization in energy - Part I: Modeling and policies. IEEE Transactions on Power Systems, 31(2), 1459–1467.
Powell, W. B., & Meisel, S. (2016b). Tutorial on stochastic optimization in energy - Part II: An energy storage illustration. IEEE Transactions on Power Systems, 31(2), 1468–1475.
Šaltytė Benth, J., Benth, F. E., & Jalinskas, P. (2007). A spatial-temporal model for temperature with seasonal variance. Journal of Applied Statistics, 34(7), 823–841.
Sapra, A., & Jackson, P. L. (2004). The martingale evolution of price forecasts in a supply chain market for capacity. Technical report.
Sioshansi, R., Madaeni, S. H., & Denholm, P. (2014). A dynamic programming approach to estimate the capacity value of energy storage. IEEE Transactions on Power Systems, 29(1), 395–403.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332–341.
Strassen, V. (1965). The existence of probability measures with given marginals. Annals of Mathematical Statistics, 38, 423–439.
Taylor, J. W., & Buizza, R. (2003). A comparison of temperature density forecasts from GARCH and atmospheric models. Journal of Forecasting, 23(5), 337–355.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Traiteur, J. J., Callicutt, D. J., Smith, M., & Roy, S. B. (2012). A short-term ensemble wind speed forecasting system for wind power applications. Journal of Applied Meteorology and Climatology, 51(10), 1763–1774.
Xi, X., & Sioshansi, R. (2016). A dynamic programming model of energy storage and transformer deployments to relieve distribution constraints. Computational Management Science, 13, 119–146.
Xi, X., Sioshansi, R., & Marano, V. (2014). A stochastic dynamic programming model for co-optimization of distributed energy storage. Energy Systems, 5, 475–505.
Zafar, R., Ravishankar, J., Fletcher, J. E., & Pota, H. R. (2018). Multi-timescale model predictive control of battery energy storage system using conic relaxation in smart distribution grids. IEEE Transactions on Power Systems, 33(6), 7152–7161.
Zhang, Z., Zhang, Y., Huang, Q., & Lee, W. (2018). Market-oriented optimal dispatching strategy for a wind farm with a multiple stage hybrid energy storage system. CSEE Journal of Power and Energy Systems, 4(4), 417–424.