Pinball-Huber Boosted Extreme Learning Machine Regression: A Multiobjective Approach To Accurate Power Load Forecasting
https://doi.org/10.1007/s10489-024-05651-3
Abstract
Power load data frequently display outliers and an uneven distribution of noise. To tackle this issue, we present a forecasting model based on an improved extreme learning machine (ELM). Specifically, we introduce the novel Pinball-Huber robust loss function as the objective function in training. The loss function enhances precision by assigning distinct penalties to errors based on their directions. We employ a genetic algorithm, combined with a fast nondominated sorting technique, for multiobjective optimization in the ELM-Pinball-Huber context. This method simultaneously reduces training errors while streamlining the model structure. We practically apply the integrated model to forecast power load data in Taixing City, which is situated in the southern part of Jiangsu Province. The empirical findings confirm the method's effectiveness.
Keywords Load forecasting · Robust loss function · Multi-objective optimization · Neural networks · Extreme learning machine
the input weights, hidden layer thresholds, and hidden node numbers as input parameters, the optimized ELM model can achieve minimized training errors and a more streamlined network structure.

(3) The superiority of the proposed NSGA-II-ELM-Pinball-Huber model is validated by comparing it with benchmarks (LSTM, GRU, CNN-BiLSTM-Attention) using power load data from Taixing City. This validation has emphasized the effectiveness of the Pinball-Huber loss, highlighting the enhanced performance of the multiobjective optimization algorithm NSGA-II.

The rest of the current paper is organized as follows: Section 2 contains the literature review. Section 3 presents definitions of some terms. Section 4 gives the methodology. Section 5 goes over the results. Section 6 discusses the conclusions and future research areas.

2 Literature review

Over the past few decades, researchers have proposed numerous short-term load forecasting methods [6], which can be broadly categorized into physical, statistical, and intelligent methods [7]. Physical methods establish the mathematical relationships between historical data and physical characteristics to achieve power load forecasting. Statistical methods perform mathematical statistics on historical data, establishing the correlations between load and time to make predictions [8]. These models typically include linear regression (LR) [9], autoregressive integrated moving average (ARIMA) [10], gray models (GM) [11], and seasonal exponential smoothing (SES) [7]. However, these methods fail to capture the nonlinear characteristics present in load data.

Compared with traditional physical and statistical methods, intelligent methods exhibit greater potential in handling the nonlinear fluctuations and complex relationships within power load data, hence demonstrating higher accuracy in the field of power load forecasting [12, 13]. Intelligent methods such as artificial neural networks (ANN) [14], support vector regression (SVR) [15], and ELM [16] have found extensive applications in recent power forecasting studies. Among these, ANN is adept at modeling more intricate relationships between the power load and correlated variables compared with other methods, hence leading to its widespread usage in power load forecasting [2, 17]. ANN, which is akin to the structure of the human brain, can interpret vast amounts of data and transform it into actionable knowledge [18]. ELM, an enhanced single-hidden-layer feedforward neural network, has been widely employed in forecasting tasks [19, 20]. Unlike traditional artificial neural networks, ELM's input weights and biases in the hidden layer are randomly assigned. ELM derives hidden weights through the least squares method, eliminating the need for adjusting hidden layer weights through iterative backpropagation [21]. As a result, the ELM model demonstrates faster learning and more pronounced generalization with minimal preset parameters [22]. Numerous ELM-based predictive models have been proposed, showcasing their exceptional regression capabilities in forecasting. Ni et al. [23] employed an ensemble method using ELM and lower upper bound estimation (LUBE) for short-term power prediction. Han et al. [24] developed seasonal multimodels based on ELM by considering the seasonal distribution of power features. The effectiveness of the proposed methods was validated through a comparison with other approaches. Thus, compared with shallow learning systems, ELM exhibits higher efficiency, lower computational costs, and stronger generalization.

The loss function reflects the disparity between the predicted values and actual values during the optimization process, significantly impacting the learning model's generalization and accuracy [25]. Chen et al. [26] utilized ELM enhanced with an L2-norm loss function for feature selection. Most neural network methods adopt the mean squared error (MSE), or L2, loss function. Unfortunately, the MSE loss function relies on Gaussian assumptions, making it sensitive to outliers and challenging to precisely evaluate nonlinear errors. Yang et al. [27] suggested employing the Huber loss function as the model's training objective. The Huber loss treats errors of different magnitudes differently. However, it lacks consideration for the direction of errors. Power load data are nonlinear and often exhibit various asymmetric noise distributions [5], necessitating the development of a new loss function that comprehensively considers both error magnitude and direction.

In conventional ANNs, including ELM, certain parameters are set randomly, leading to a degree of error and variability in the predictive outcomes. Artificial intelligence also exhibits drawbacks such as slow convergence, susceptibility to local optima, and overfitting [11, 28]. Hence, several intelligent optimization algorithms have been proposed to alleviate these limitations. Optimization algorithms applied to machine learning algorithms have further improved their regression capabilities to some extent [22]. For instance, Niu et al. [29] utilized a cooperative search algorithm that can explore the optimal hyperparameters of support vector machines (SVM), using this algorithm to predict electricity consumption in four Chinese provinces. Niu et al. [29] optimized BPNN parameters using a genetic algorithm (GA). Shang et al. [30] established a prediction model combining least squares support vector machines (LSSVM)
with generalized regression neural networks and optimized the weight coefficients by using the whale optimization algorithm (WOA). Xie et al. [31] proposed a short-term power load forecasting method combining the Elman neural network (ENN) and particle swarm optimization (PSO). Differing from traditional random initialization, PSO was employed to search for the optimal learning rate of the ENN. Addressing the issue of model parameter determination, arithmetic optimization algorithms (AOA) [32], gene expression programming (GEP) [33], and the chimp optimization algorithm (ChOA) [34], among others, have been utilized. Many studies on power load forecasting solely employed single-objective algorithms to optimize one criterion. However, in practical applications, meeting multiple constraints is often necessary [7, 35].

The present paper introduces a novel power load forecasting model to address the aforementioned issues. Named NSGA-II-ELM-Pinball-Huber, this model is based on an enhanced Pinball-Huber loss function and the multiobjective optimization algorithm NSGA-II. To effectively handle errors and anomalies in power load data, we introduced an asymmetric and robust Pinball-Huber loss function. Within the ELM framework, this loss function is employed as the objective, and the iteratively reweighted least squares (IRLS) method is utilized to determine the output weight vector. The present paper conducts global multiobjective optimization of the ELM model by employing the NSGA-II algorithm to simultaneously optimize training errors and output weights. The experimental results demonstrate that the proposed load forecasting model significantly enhances predictive performance.

3 Preliminaries

3.1 Loss functions

Within various enhanced algorithms, the role of the loss function is to assess the merits and drawbacks of the improved model by computing its minimum value within the improved function. Yet during practical application, because of factors such as the loss function's objective, the nature of the application, data attributes, and the desired level of confidence in the forecasted values, a single loss function cannot be universally applied to all model experiments. Thus, a range of loss functions needs to be explored to optimize the treatment of target-type data and achieve optimal evaluation results [36].

3.1.1 L2-norm loss

The L2-norm loss is a smooth function that is derivable in the whole domain and simplifies the calculation. When the error increases, the error is squared because of the L2-norm loss, so the error obtained is amplified. The L2-norm loss function can be described as follows:

L_2(r) = \frac{1}{2} r^2,   (1)

where r = y − ŷ is the residual, y represents the expected results, and ŷ represents the forecasting results.

3.1.2 L1-norm loss

In the regression problem, the L1-norm loss measures the absolute value of the difference between the forecasting value and the true value. The L1-norm loss function can be described as follows:

L_1(r) = |r|.   (2)

The L1-norm loss function is commonly used in regression problems.

3.1.3 Huber loss

The Huber loss function was proposed in 1964. It absorbs the advantages of the L1-norm and L2-norm loss functions and makes up for their shortcomings. Concerning outliers in the data, the Huber loss performs more robustly. Not only is it more robust to outliers, but the Huber loss is also derivable in the whole domain, which greatly simplifies the calculation. The Huber loss function can be described as follows:

H_\delta(r) =
\begin{cases}
\frac{1}{2} r^2, & |r| \le \delta \\
|r|\delta - \frac{\delta^2}{2}, & |r| > \delta.
\end{cases}   (3)

3.1.4 Pinball loss

The Pinball loss function is asymmetric. It not only imposes certain penalties on outliers in the data, but it also imposes additional penalties according to the different situations of the outliers. In addition, because of the introduction of the quantile distance, the Pinball loss function improves the insensitivity to characteristic noise and resampling. The expression of the Pinball loss function is as follows:

P_\tau(r) =
\begin{cases}
\tau r, & r \ge 0 \\
(1 - \tau)|r|, & r < 0.
\end{cases}   (4)

The parameter τ ∈ [0, 1].
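For concreteness, the four basic losses above translate directly into code. The following is a minimal NumPy sketch (our illustrative implementation, not code from the original study); r denotes the residual vector y − ŷ as in (1)–(4), and the default parameter values are placeholders.

```python
import numpy as np

def l2_loss(r):
    """Squared (L2) loss, Eq. (1): 0.5 * r^2."""
    return 0.5 * r ** 2

def l1_loss(r):
    """Absolute (L1) loss, Eq. (2): |r|."""
    return np.abs(r)

def huber_loss(r, delta=1.345):
    """Huber loss, Eq. (3): quadratic near zero, linear in the tails."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    np.abs(r) * delta - 0.5 * delta ** 2)

def pinball_loss(r, tau=0.5):
    """Pinball (quantile) loss, Eq. (4): asymmetric penalty on the error sign."""
    return np.where(r >= 0, tau * r, (1.0 - tau) * np.abs(r))
```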
When the parameter τ = 1, the Pinball loss behaves like the L1-norm loss, so the Pinball loss function can be considered a generalized L1-norm loss function. In addition, the Pinball loss also absorbs the advantages of the L1-norm loss and can handle the deviation of outliers.

3.1.5 Biweight loss

Tukey's Biweight loss function is also a non-convex loss function, which can overcome the interference and influence caused by outlier samples and noise samples in regression tasks, hence showing strong robustness in regression tasks [38, 39]. The Biweight loss function is defined as follows:

B_c(r) =
\begin{cases}
\frac{c^2}{6}\left[1 - \left(1 - \left(\frac{r}{c}\right)^2\right)^3\right], & |r| \le c \\
\frac{c^2}{6}, & \text{otherwise},
\end{cases}   (5)

where c is a tuning constant, which is generally specified as 4.685. With this choice, Tukey's Biweight achieves a regression effect close to that of the L2-norm loss function (95% asymptotic efficiency) in minimizing the variance under the normal distribution [40]. Tukey's Biweight suppresses the influence of outliers during backpropagation by reducing the gradient size to near zero. Another interesting feature of this loss function is that it imposes a soft constraint between inliers and outliers without setting a hard threshold for the residual.

3.1.6 Lncosh loss

Lncosh is a loss function commonly used in regression tasks, with high smoothness. It is defined as [41, 42]:

L(r) = \ln(\cosh(r)).   (6)

For a smaller residual r, \ln(\cosh(r)) is approximately equal to \frac{r^2}{2}; for a larger residual r, it is roughly equal to |r| - \ln 2. This means that the working principle of Lncosh is very similar to that of the mean squared error to a large extent, but it is not greatly affected by the occasional wrong forecast. It has all the advantages of the Huber loss function, but unlike the Huber loss, it is twice differentiable everywhere.

3.2 Extreme learning machine

In ELM, the input weights are randomly generated along with the hidden layer thresholds, without any further adjustments during algorithm execution. With no need for additional parameter settings, ELM offers simplicity in its usage. As highlighted by Huang et al. [43], ELM commonly employs the Moore-Penrose generalized inverse to determine the key node weights. This methodology involves only a single calculation step (a linear equation operation) to establish the weight matrix between the hidden and output layers [44]. Unlike backpropagation, there is no gradient operation, significantly reducing computational demands and enhancing speed. Furthermore, ELM demonstrates superior generalization compared with alternative algorithms. The structural diagram of ELM is shown in Fig. 1.

A typical ELM network structure consists of an input layer, a hidden layer, and an output layer, with n, L, and m nodes, respectively. For a data set (x_i, y_i) (i = 1, 2, \ldots, N) with N samples, x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T is the input vector, y_i = [y_{i1}, y_{i2}, \ldots, y_{im}]^T is the output vector, and the output of ELM can be described as:

y_i = \sum_{j=1}^{L} \beta_j G(w_j \cdot x_i + b_j), \quad i = 1, 2, \ldots, N,   (7)

where w_j is the weight vector from the input layer to the j-th hidden layer node, b_j is the threshold of the j-th hidden layer node, \beta_j is the output layer weight connecting the j-th hidden layer node, and G(\cdot) represents the activation function.

Equation (7) can be simplified as H\beta = Y. The objective function of the ELM model can be written as follows:

\min \|H\beta - Y\|.   (8)

Using the least squares method to solve (8), the solution \beta is the following:

\beta = (H^T H)^{-1} H^T Y = H^+ Y,   (9)

where H^+ is the Moore-Penrose generalized inverse of the hidden layer output matrix H.
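Because training reduces to the single linear solve in (9), a working ELM regressor fits in a few lines. The sketch below is an illustrative implementation under stated assumptions (a sigmoid activation as one common choice of G, Gaussian random initialization), not the authors' code.

```python
import numpy as np

class SimpleELM:
    """Minimal ELM regressor: random hidden layer, least-squares output layer."""

    def __init__(self, n_hidden=50, seed=0):
        self.L = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # G(w_j . x_i + b_j) with a sigmoid activation, Eq. (7)
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        self.W = self.rng.normal(size=(n_features, self.L))  # random input weights
        self.b = self.rng.normal(size=self.L)                # random thresholds
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ y                    # beta = H^+ Y, Eq. (9)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```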
3.3 Multiobjective optimization

The concept behind a multiobjective optimization algorithm is to identify a collection of Pareto-optimal solutions, where each solution fulfills the fundamental criteria of the multiple optimization objectives and showcases an optimal state holistically. Within the Pareto-optimal solution set, no other solution surpasses a given solution in all optimization objectives [45]. Achieving this demands that the optimization algorithm extensively explore the solution set, guarantee a global optimization outcome, and avoid being trapped in local optima.

Derived from biological genetic theory, the genetic algorithm has evolved and found applications across diverse domains [46]. By incorporating the genetic algorithm, the drawbacks associated with traditional multiobjective optimization approaches, such as the risk of converging to local optima, are circumvented. This integration ensures that the solutions' diversity is effectively maintained.

The general multiobjective optimization problem can be described as follows:

\min F(x) = [f_1(x), f_2(x), \ldots, f_n(x)]
\text{s.t. } x \in C,   (10)

where f_i(x) is an optimization objective, x is the solution, and C is the constraint set.

NSGA-II is an advanced multiobjective optimization algorithm that was improved by Deb et al. [47]. NSGA-II introduces the concepts of fast nondominated sorting, crowding-distance sorting, and the elitist strategy, which greatly enhance its practical applicability. In NSGA-II, we can initialize a certain population P and use the genetic algorithm to select, cross, and mutate the parent population P to produce the offspring population Q. After fast nondominated sorting and crowding-distance sorting of the combined population R = P ∪ Q, the new population and its Pareto-optimal solution set are obtained by using the elitist strategy. The flow diagram of NSGA-II is shown in Fig. 2. The specific steps are as follows:

Step 1: Set the population size and the number of iterations, and initialize the parent population P.

Step 2: For the parent population P, conduct fast nondominated sorting and crowding-distance sorting, and assign each individual its rank.

Step 3: Generate the offspring population Q_0 through tournament selection, simulated binary crossover, and polynomial mutation.

Step 4: Combine the parent P_t and offspring Q_t to get the population R_t = P_t ∪ Q_t, where t is the number of iterations; a new parent population P_{t+1} is selected through the elitist strategy.

Step 5: When the iteration reaches the specified number or the termination condition is met, the final population and its Pareto solution set are obtained; otherwise, let t = t + 1 and go to Step 2.
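To make the sorting machinery of Steps 2 and 4 concrete, here is a compact, illustrative implementation of fast nondominated sorting (a sketch for two or more minimization objectives; crowding-distance sorting and the genetic operators are omitted). It is not the authors' code.

```python
import numpy as np

def fast_nondominated_sort(F):
    """F: (n, m) array of objective values (minimization).
    Returns an array of front ranks (0 = first Pareto front)."""
    n = F.shape[0]
    dominates = [[] for _ in range(n)]     # indices that solution i dominates
    n_dominators = np.zeros(n, dtype=int)  # how many solutions dominate i
    for i in range(n):
        for j in range(n):
            if np.all(F[i] <= F[j]) and np.any(F[i] < F[j]):
                dominates[i].append(j)
            elif np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                n_dominators[i] += 1
    rank = np.zeros(n, dtype=int)
    front = [i for i in range(n) if n_dominators[i] == 0]
    k = 0
    while front:
        nxt = []
        for i in front:
            rank[i] = k
            for j in dominates[i]:
                n_dominators[j] -= 1
                if n_dominators[j] == 0:
                    nxt.append(j)
        front, k = nxt, k + 1
    return rank
```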
4 The proposed method

4.1 Proposed Pinball-Huber loss

Section 3.1 provides an overview of six fundamental loss functions: L2-norm, L1-norm, Huber, Pinball, Biweight, and Lncosh. The strengths, weaknesses, and suitable application contexts of each loss function are analyzed. Upon examination and consolidation, it is evident that these loss functions often cannot combine robustness with accuracy across diverse evaluation models, measurement approaches, and forecasting experiments. Additionally, they tend to inadequately address standard positive and negative errors and outliers in machine learning challenges. As a result, this may lead to suboptimal evaluation levels and reduced accuracy in forecasting outcomes.

To address the aforementioned challenges, we propose a solution by merging the Pinball loss with the Huber loss. The Pinball loss function offers the ability to adapt to positive and negative errors during forecasting computations, displaying self-adjusting asymmetry. On the other hand, the Huber loss function demonstrates remarkable robustness and effectively handles outliers; however, it treats positive and negative errors identically in the algorithmic process, leading to a reduction in forecasting precision. Our innovation lies in the development of a novel loss function combining the attributes of Huber and Pinball. This allows distinct measures to be applied to diverse errors within the training procedure, significantly enhancing the model's performance. The proposed Pinball-Huber loss function is presented as follows:

PH_{\delta,\tau}(r) =
\begin{cases}
\frac{1}{2} r^2 \tau, & 0 \le r \le \delta \\
\frac{1}{2} r^2 (1 - \tau), & -\delta \le r \le 0 \\
\left(|r|\delta - \frac{\delta^2}{2}\right)\tau, & r > \delta \\
\left(|r|\delta - \frac{\delta^2}{2}\right)(1 - \tau), & r < -\delta.
\end{cases}   (11)

The newly introduced Pinball-Huber loss function comprises two adjustable parameters: δ and τ. Notably, these parameters originate from the Huber and Pinball loss functions and are merged to leverage their distinct roles. Their combined utilization allows for tailored actions based on the magnitude and direction of training errors. Beyond refining the accuracy of the foundational loss function, the Pinball-Huber approach introduces a novel perspective by categorizing training errors based on their directional attributes. This presents a fresh approach for addressing outliers. In the context of the power system, where power load data are influenced by variables like weather, season, and market demand, volatility and the presence of outliers and asymmetric noise are common. Our proposed Pinball-Huber loss function addresses these intricacies by meticulously dissecting errors and handling positive and negative scenarios in distinct ways.
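The piecewise definition in (11) translates directly into code. The sketch below is our illustrative version (the default δ and τ are placeholders, since in practice both are tuned, as described in Section 5):

```python
import numpy as np

def pinball_huber_loss(r, delta=1.345, tau=0.5):
    """Pinball-Huber loss, Eq. (11): Huber's quadratic/linear regimes,
    weighted by tau for positive errors and (1 - tau) for negative ones."""
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r ** 2                          # inner (quadratic) branch
    lin = np.abs(r) * delta - 0.5 * delta ** 2   # outer (linear) branch
    core = np.where(np.abs(r) <= delta, quad, lin)
    weight = np.where(r >= 0, tau, 1.0 - tau)    # direction-dependent penalty
    return weight * core
```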
4.2 NSGA-II-ELM-Pinball-Huber

In the practical application of ELM, there exists a fundamental trade-off between forecasting accuracy and network structure complexity. Achieving higher accuracy demands a network that can tailor its modeling to the data, which often results in an intricate network structure, particularly within the hidden layer, potentially harboring numerous unnecessary nodes. While pursuing a simplified neural network structure, it is not prudent to directly designate a minimal number of nodes and related parameters. Subjectively determining the appropriate count of neurons for the network to accurately capture the input-to-output relationship is not feasible. What is required is a rational and efficient algorithm to assist in identifying the optimal number of nodes for ELM.

Utilizing the minimization of the training error and of the output weight within the ELM-Pinball-Huber model as the dual optimization objectives, we employ the multiobjective optimization algorithm NSGA-II to enhance the ELM model. From the derived set of Pareto front solutions, we carefully choose the most suitable solution to execute the forecasting task. The objective function for this multiobjective optimization endeavor is presented as follows:

\min \sum_{i=1}^{N} PH(r_i), \quad \min \sum_{i=1}^{L} |\beta_i|,   (12)

where r_i is the training error and N represents the number of samples; we use the training error based on the newly proposed Pinball-Huber loss function as one optimization objective. β is the output weight vector of the output layer in the ELM model, and L is the number of hidden layer nodes; we take the L1 norm of β as the other optimization objective.

Following the multiobjective optimization process, we arrive at the set of Pareto solutions. By plotting the two optimization objectives along the horizontal and vertical axes, respectively, we observe that the solutions within the Pareto set form a U-shaped distribution. This pattern highlights the inherent trade-off between the training error and the output weight as optimization objectives. The solution situated at the inflection point simultaneously possesses a lower training error and a lower output weight norm, rendering it the optimal choice for our multiobjective optimization.

Furthermore, to validate the effectiveness of the Pinball-Huber loss function, we conducted a separate comparative test. Employing various loss functions in conjunction with ELM and integrating a lasso penalty term, the composite ELM-loss function models were employed for power load forecasting under identical experimental parameters. Elaborate insights into the model's objective function and its solution procedure are provided in Appendix A.

4.2.1 The overall steps

In this section, we provide the combined load forecasting model NSGA-II-ELM-Pinball-Huber. The model is an ELM based on the Pinball-Huber loss function, which is then optimized by the multiobjective optimization algorithm NSGA-II. The steps of the model can be found in pseudocode Algorithm 1.

Algorithm 1 The NSGA-II-ELM-Pinball-Huber model.
Require: Hidden layer nodes L, input weight ω, and hidden layer threshold b
Ensure: Training error and output weight of ELM
1: Set the population and iteration times
2: Initialize the population:
3: Randomly generate multiple groups of ELMs with different numbers of hidden layer nodes
4: for each ELM do
5:   Select the input weight ω and hidden layer threshold b randomly
6:   Take the proposed Pinball-Huber loss function as the objective function and solve the output weight vector β = (H^T W H)^{-1} H^T W Y by IRLS
7:   Get the training error Σ_{i=1}^{N} PH(r_i) based on the Pinball-Huber loss function and the output weight Σ_{i=1}^{L} |β_i| of the ELM as the two optimization objectives
8: end for
9: Fast nondominated sort and crowding-distance sort the population and take it as the Parent
10: for i = 1, 2, 3, ..., gen do
11:   Tournament selection, simulated binary crossover, and polynomial mutation to produce the Offspring
12:   Merge Parent and Offspring
13:   Fast nondominated sort and crowding-distance sort
14:   Generate the new Parent by the elitist strategy
15: end for
16: Get the optimized Pareto solution set of all the groups
17: Select the best sample from the set, that is, the ELM model with the best parameters L, ω, b
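Line 7 of Algorithm 1 scores every candidate ELM on the two objectives of (12). A minimal sketch of that evaluation (our illustration; loss_fn would be a pointwise loss such as the pinball_huber_loss sketched in Section 4.1) is:

```python
import numpy as np

def nsga2_objectives(H, y, beta, loss_fn):
    """Objective values of Eq. (12) for one candidate ELM.
    H: hidden-layer output matrix, y: targets, beta: output weights,
    loss_fn: pointwise loss, e.g. the Pinball-Huber loss."""
    r = y - H @ beta              # training residuals
    f1 = np.sum(loss_fn(r))       # total Pinball-Huber training error
    f2 = np.sum(np.abs(beta))     # L1 norm of the output weights
    return np.array([f1, f2])
```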
Table 1 The loss functions L(r), their derivatives ψ(r), IRLS weight functions w(r), and tuning parameters

- L2 loss: L(r) = r²/2; ψ(r) = r; w(r) = 1; parameter: none.
- L1 loss: L(r) = |r|; ψ(r) = sign(r); w(r) = 1/max(|r|, ε) with ε = 10⁻⁶; parameter: none.
- Huber loss: L(r) = r²/2 for |r| ≤ δ, |r|δ − δ²/2 for |r| > δ; ψ(r) = r for |r| ≤ δ, δ·sign(r) for |r| > δ; w(r) = min(1, δ/|r|); parameter: δ = 1.345.
- Biweight loss: L(r) = (c²/6)[1 − (1 − (r/c)²)³] for |r| ≤ c, c²/6 otherwise; ψ(r) = r(1 − (r/c)²)² for |r| ≤ c, 0 otherwise; w(r) = (1 − (r/c)²)² for |r| ≤ c, 0 otherwise; parameter: c = 4.685.
- Lncosh loss: L(r) = ln(cosh(r)); ψ(r) = tanh(r); w(r) = tanh(r)/r; parameter: none.
- Pinball-Huber loss: L(r) as in Eq. (11); ψ(r) = τr for 0 ≤ r ≤ δ, (1 − τ)r for −δ ≤ r ≤ 0, δτ·sign(r) for r > δ, δ(1 − τ)·sign(r) for r < −δ; w(r) = τ for 0 ≤ r ≤ δ, 1 − τ for −δ ≤ r ≤ 0, δτ/|r| for r > δ, δ(1 − τ)/|r| for r < −δ; parameters: δ and τ.
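The ψ(r) and w(r) entries of Table 1 are what the IRLS solver in Appendix A consumes. As an example, the Pinball-Huber weight function from the last row can be sketched as follows (illustrative code; the ε guard mirrors the ε = 10⁻⁶ used for the L1 loss):

```python
import numpy as np

def pinball_huber_weight(r, delta=1.345, tau=0.5, eps=1e-6):
    """IRLS weight w(r) for the Pinball-Huber loss (last row of Table 1)."""
    r = np.asarray(r, dtype=float)
    direction = np.where(r >= 0, tau, 1.0 - tau)   # tau or (1 - tau)
    inner = np.abs(r) <= delta
    # w(r) = direction inside [-delta, delta]; direction * delta / |r| outside
    return np.where(inner, direction,
                    direction * delta / np.maximum(np.abs(r), eps))
```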
5 Case study

This section employs the power load dataset from Taixing City in southern Jiangsu Province to validate the efficacy of the integrated NSGA-II-ELM-Pinball-Huber forecasting model within the power load system.

The hyperparameter τ in the Pinball loss is the target quantile, which is used to handle the directional errors in forecasting. For the robust Pinball-Huber loss function proposed by us, both parameters δ and τ need to be set. The proper values are determined using the time series cross-validation approach, with τ and δ selected from the intervals [0, 1] and [1, 2], respectively. Taking the training error in single-step forecasting as an example, as shown in Fig. 4, the training errors have deviations and are asymmetrically distributed.

Fig. 4 Training error distribution diagram of ELM-Pinball-Huber in the Taixing data set

The hyperparameters of the Pinball-Huber loss function obtained through the time series cross-validation method are shown in Table 3.

Table 3 The hyperparameters of the Pinball-Huber loss function

| Steps | Single-step | Three-step | Five-step | Seven-step |
|-------|-------------|------------|-----------|------------|
| δ     | 1.25        | 1.60       | 1.20      | 1.50       |
| τ     | 0.45        | 0.30       | 0.50      | 0.30       |
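A rolling-origin (time series) cross-validation of the kind used to obtain Table 3 might be sketched as follows. This is our illustrative code: fit_fn, score_fn, the candidate grids, and the number of splits are hypothetical placeholders rather than the authors' settings.

```python
import numpy as np
from itertools import product

def rolling_origin_select(X, y, fit_fn, score_fn,
                          deltas=(1.0, 1.25, 1.5, 1.75, 2.0),
                          taus=(0.3, 0.4, 0.5, 0.6, 0.7),
                          n_splits=5):
    """Pick (delta, tau) minimizing the average validation error over
    expanding-window (rolling-origin) splits of a time series."""
    n = len(y)
    fold = n // (n_splits + 1)
    best, best_err = None, np.inf
    for delta, tau in product(deltas, taus):
        errs = []
        for k in range(1, n_splits + 1):
            train = slice(0, k * fold)               # history up to the origin
            valid = slice(k * fold, (k + 1) * fold)  # next block as validation
            model = fit_fn(X[train], y[train], delta=delta, tau=tau)
            errs.append(score_fn(y[valid], model.predict(X[valid])))
        if np.mean(errs) < best_err:
            best, best_err = (delta, tau), np.mean(errs)
    return best
```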
The experimental analysis is as follows:

5.2.1 Comparisons among ELM-Pinball-Huber and ELM with other loss functions

This section presents the comparative experiments conducted on the ELM model using the six distinct loss functions, aimed at substantiating the benefits of the newly introduced Pinball-Huber loss function. Based on the three evaluation metrics (MAE, RMSE, and MAPE) outlined in Table 4, it can be deduced that the forecasting outcomes of ELM utilizing our Pinball-Huber loss function surpass those achieved with the other loss functions, in both single-step and multistep forecasting scenarios. Figure 5 provides a visual representation of the ELM's performance in multistep forecasting across the six different loss functions. Hence, adopting the proposed Pinball-Huber loss function as the objective function for ELM can lead to enhanced forecasting capabilities in power load prediction.
Table 4 Multistep forecasting results of the ELM-loss function models in the Taixing data set

Single-step and three-step results:

| Models | MAE | RMSE | MAPE | β(%) | MAE | RMSE | MAPE | β(%) |
|---|---|---|---|---|---|---|---|---|
| ELM-L2 | 115.337 | 157.701 | 0.055 | 48.0 | 130.577 | 183.425 | 0.063 | 56.0 |
| ELM-L1 | 274.207 | 329.957 | 0.125 | 52.5 | 233.836 | 269.626 | 0.107 | 60.5 |
| ELM-Huber | 274.858 | 330.423 | 0.126 | 34.0 | 222.722 | 285.020 | 0.103 | 26.5 |
| ELM-Biweight | 85.986 | 112.454 | 0.041 | 38.0 | 132.345 | 186.186 | 0.064 | 31.0 |
| ELM-Lncosh | 258.347 | 317.961 | 0.119 | 50.5 | 143.788 | 192.219 | 0.069 | 58.0 |
| ELM-Pinball-Huber | 73.946 | 96.879 | 0.036 | 73.0 | 121.818 | 167.292 | 0.060 | 72.0 |

Five-step and seven-step results:

| Models | MAE | RMSE | MAPE | β(%) | MAE | RMSE | MAPE | β(%) |
|---|---|---|---|---|---|---|---|---|
| ELM-L2 | 161.549 | 214.583 | 0.077 | 55.5 | 163.955 | 216.922 | 0.079 | 51.5 |
| ELM-L1 | 258.779 | 316.949 | 0.119 | 58.5 | 273.857 | 331.715 | 0.125 | 61.0 |
| ELM-Huber | 273.836 | 330.600 | 0.125 | 25.5 | 274.930 | 332.128 | 0.126 | 31.0 |
| ELM-Biweight | 162.277 | 208.288 | 0.077 | 27.0 | 168.125 | 221.927 | 0.081 | 27.0 |
| ELM-Lncosh | 273.831 | 331.458 | 0.125 | 61.0 | 234.569 | 296.230 | 0.108 | 57.5 |
| ELM-Pinball-Huber | 258.779 | 316.949 | 0.119 | 69.0 | 163.492 | 222.261 | 0.077 | 75.5 |
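For reference, the three error metrics reported in Table 4 are standard and can be computed as in the following sketch (y are the observed loads, y_hat the forecasts; this is illustrative code, not the authors' evaluation script):

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error; assumes y has no zeros,
    as is typical for power load series."""
    return np.mean(np.abs((y - y_hat) / y))
```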
Furthermore, an intriguing observation emerged from the comparison among the six loss functions. The L2-norm and L1-norm loss functions exhibited subpar and unstable performance in multistep forecasting. The Huber, Biweight, and Lncosh loss functions demonstrated favorable performance, but their stability in the multistep experiments displayed notable fluctuations. Conversely, ELM-Pinball-Huber consistently delivered the best forecasting results while maintaining a relatively stable performance throughout the experiments.

5.2.2 Comparisons of ELM with loss functions with/without multiobjective optimization

In the preceding section's comparative experiments, ELM utilizing the Pinball-Huber loss function exhibited consistent advantages in forecasting accuracy. Nonetheless, a noteworthy observation was that achieving higher precision often resulted in elevated output weights within the ELM model. This outcome could lead to intricate network structures and even overfitting issues. To ascertain the enhanced forecasting performance of ELM-Pinball-Huber through NSGA-II optimization, a comparative validation was conducted.

Table 5 presents the outcomes of multiobjective optimization for ELM using the diverse loss functions. Upon comparison with the results of the experiments before multiobjective optimization, illustrated in Table 4, the ELM after multiobjective optimization not only attains heightened forecasting precision, but it also substantially diminishes the output weight within the ELM model. The distribution of solutions within the Pareto solution set, along with the curve showing the change in the two optimization objectives with the number of iterations, is depicted in Fig. 6. Notably, as the number of iterations increases, the values of the two optimization objectives consistently decrease. Ultimately, a Pareto solution is discernible in the figure that maintains commendable values for both optimization objectives, thereby achieving elevated forecasting accuracy while concurrently preserving smaller output weights. This simplification considerably reduces the intricacy of the model network. The streamlined ELM necessitates fewer hidden layer nodes, enhancing its generalization capabilities. Moreover, we have also observed that NSGA-II can enhance the performance of the various ELM-loss function combinations, indicating its wide applicability for ELM. Notably, the combination of the Pinball-Huber loss function and ELM, following NSGA-II optimization, demonstrates the best performance. The multistep-ahead forecasting curves of the NSGA-II-ELM-Pinball-Huber model are displayed in Fig. 7.

5.2.3 Comparisons among NSGA-II-ELM-Pinball-Huber and comparative models

To assess the predictive performance of the NSGA-II-ELM-Pinball-Huber model, we conduct comparative experiments with three models: LSTM, GRU, and CNN-BiLSTM-Attention. Brief descriptions of these models are as follows:
(1) LSTM model: LSTM, an improved variant of the traditional RNN, effectively captures the semantic relationships in long sequences, mitigating gradient vanishing or exploding issues. LSTM features a more complex structure [48].

(2) GRU model: Introduced by Cho et al. [49] in 2014, the GRU neural network addresses the gradient vanishing problem in standard recurrent neural networks and shares a similar design philosophy with LSTM.

(3) CNN-BiLSTM-Attention model [50]: This model employs the convolutional and pooling layers of a convolutional neural network (CNN) to extract the spatial features of the input variables. The multi-head attention layer minimizes the impact of irrelevant features, enhancing the extracted features. The BiLSTM layer models trend information in the time series, generating a probability model for the prediction distribution.

We adjust the hyperparameters of each model to achieve optimal performance, as shown in Table 6.

Table 6 Parameters of the comparative models

| Model | Parameter name | Parameter value |
|---|---|---|
| LSTM | number of layers, units | [3, 64] |
| GRU | number of layers, units | [3, 64] |
| CNN-BiLSTM-Attention | number of layers, kernel size, units | [9, 1, 64] |
The evaluation metrics for forecasting performance are presented in Table 7. The predictive performance of LSTM and GRU is similar, showing close values for the RMSE and the output weight. Comparing the predictive performance evaluation metrics of LSTM, GRU, and CNN-BiLSTM-Attention with those of NSGA-II-ELM-Pinball-Huber, the RMSE of the proposed model was always lower than that of the three comparison models. Particularly in single-step forecasting, the prediction error of NSGA-II-ELM-Pinball-Huber (RMSE = 84.27) was significantly smaller than that of the three comparative models (RMSE = 184.87, 184.58, 273.91). These results indicate that the proposed model effectively captured the changing trends in the power load data in both the spatial and temporal dimensions. Furthermore, the proposed model maintained a stable structure in multistep forecasting.

Finally, to verify whether the NSGA-II-ELM-Pinball-Huber model significantly improves predictive accuracy in power load forecasting compared with the other models, we conducted the Wilcoxon signed-rank test [51]. The significance level for the one-tailed test was set at α = 0.05. The null hypothesis posited that there would be no significant difference in the predictive results between our model and the comparative models in power load forecasting. If the p-value is less than 0.05, the null hypothesis is rejected (h = 1). The predicted values of the proposed model and the three comparison models were submitted to Wilcoxon signed-rank tests separately for multistep forecasting from 1 to 7 steps. The results are shown in Table 8.
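The test itself is available in SciPy. The following sketch illustrates the one-tailed comparison described above with synthetic stand-in error arrays, since the real forecasts are not reproduced here.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical absolute forecast errors of the proposed and a baseline model
rng = np.random.default_rng(0)
err_proposed = np.abs(rng.normal(80, 20, size=200))
err_baseline = np.abs(rng.normal(180, 40, size=200))

# One-tailed paired test: are the proposed model's errors significantly smaller?
stat, p_value = wilcoxon(err_proposed, err_baseline, alternative="less")
h = int(p_value < 0.05)  # h = 1 rejects the null hypothesis at alpha = 0.05
```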
6 Conclusion and future work

In the current paper, we have introduced a robust Pinball-Huber loss function that demonstrates remarkable resistance to outliers and substantially reduces the likelihood of overfitting. This loss function effectively manages outliers and asymmetrical noise within the dataset, serving as the objective function for training the ELM model. Given the ELM's susceptibility to the influence of its preset parameters, and aiming to ensure forecasting accuracy while simplifying the ELM network structure, as well as preventing the squandering of training time and the emergence of overfitting because of an excessive number of hidden layer nodes, we employed the NSGA-II algorithm for the optimization of both training errors and output weights within the ELM model. The combined NSGA-II-ELM-Pinball-Huber model was then employed for power load forecasting in the context of Taixing City. By employing the multiobjective optimization algorithm NSGA-II, we acquired the Pareto optimal solution set for the number of hidden layer nodes in the ELM model, enabling an in-depth analysis of the forecasting outcomes. Our analysis of the experimental outcomes revealed that the performance of the suggested Pinball-Huber loss function within the ELM framework surpassed that of the other loss functions. Moreover, the NSGA-II algorithm effectively enhanced the performance of the diverse ELM-loss function combinations. The innovative combined NSGA-II-ELM-Pinball-Huber model can be seen as a promising and effective method for power load forecasting, offering a novel solution to this domain.
Multiobjective optimization greatly improves the predictive performance of the model, but it takes up a significant amount of computational resources. In the future, we aim to delve deeper into reducing the computational resources and time required for training the proposed method, which is crucial for the widespread applicability of the model. Furthermore, this forecasting model can only provide point predictions of future power loads, while recent research has focused on uncertainty-aware predictions. Future studies will delve deeper into probabilistic predictions of the model, which holds significant value for practical applications in power systems [52, 53].

Appendix

A The ELM-loss function model

The combined ELM-loss function is a single-objective model, and its mathematical model can be written as follows:

\min C \sum_{i=1}^{N} PH(r_i) + \sum_{j=1}^{L} |\beta_j|
\text{s.t. } h(x_i)\beta = y_i - r_i, \quad i = 1, 2, \ldots, N,   (16)

where r_i represents the training error of a sample and \sum_{i=1}^{N} PH(r_i) is the total error under the Pinball-Huber loss function over the N training samples, here representing the empirical risk. \sum_{j=1}^{L} |\beta_j| is a lasso penalty term, representing the complexity of the model. C > 0 is called the regularization parameter, or the penalty parameter, and is used to balance the empirical risk and the model complexity.

Lagrangian multipliers are introduced for each equality constraint in the model, and the Lagrangian function is constructed to transform it into an unconstrained optimization problem:

L(\beta, r, \alpha) = C \sum_{i=1}^{N} PH(r_i) + \sum_{j=1}^{L} |\beta_j| - \sum_{i=1}^{N} \alpha_i \left( h(x_i)\beta - y_i + r_i \right).   (17)

Setting the partial derivatives of (17) to zero and solving by IRLS yields the output weight vector of Eq. (18) (the update used in line 6 of Algorithm 1), where W_N is the sample weight matrix and W_L is the weight matrix of the hidden nodes. The details of the weights w_i for each loss function can be found in Table 1. Their specific forms are as follows:

W_N = \mathrm{diag}\left( w(r_1), w(r_2), \ldots, w(r_N) \right),
W_L = \mathrm{diag}\left( \frac{1}{\max(|\beta_1|, \epsilon)}, \frac{1}{\max(|\beta_2|, \epsilon)}, \ldots, \frac{1}{\max(|\beta_L|, \epsilon)} \right).

In general, the specific steps of the ELM-Pinball-Huber model are as follows:

Step 1: Initialize the relevant parameters w, b, and L of the ELM-Pinball-Huber model.

Step 2: Calculate the output weight vector β by (18).

Step 3: Update the sample weight matrix W_N and the hidden nodes' weight matrix W_L.

Step 4: Repeat Steps 2-3 until β converges; then, obtain the trained ELM-Pinball-Huber model.

Step 5: Substitute the test set into the trained model to get the forecasting results.

Similar to the above ELM-Pinball-Huber model, we can combine each loss function in Table 1 with ELM to compare their performance.
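Steps 1-4 amount to an IRLS loop. The following is an illustrative sketch of one plausible implementation; the exact update is the paper's Eq. (18), so here we form a reweighted, penalized normal-equation solve consistent with Eq. (16) and Algorithm 1, with the convergence tolerance and ε guard as our own assumptions.

```python
import numpy as np

def irls_fit(H, y, weight_fn, C=1.0, max_iter=100, tol=1e-6, eps=1e-6):
    """Sketch of Steps 1-4: IRLS for the penalized objective in Eq. (16).
    weight_fn maps residuals to the sample weights w(r) of Table 1."""
    beta = np.linalg.pinv(H) @ y                            # initial LS estimate
    for _ in range(max_iter):
        r = y - H @ beta                                    # current residuals
        W_N = np.diag(weight_fn(r))                         # sample weight matrix
        W_L = np.diag(1.0 / np.maximum(np.abs(beta), eps))  # lasso weight matrix
        # One plausible reweighted update balancing fit (C) and penalty:
        beta_new = np.linalg.solve(C * H.T @ W_N @ H + W_L,
                                   C * H.T @ W_N @ y)
        if np.linalg.norm(beta_new - beta) < tol:           # converged (Step 4)
            return beta_new
        beta = beta_new                                     # update and repeat
    return beta
```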
Acknowledgements The work is supported by the Australian Research Council project (grant number DP160104292), the National Natural Science Foundation of China under Grant 61873130 and Grant 61833011, the Natural Science Foundation of Jiangsu Province under Grant BK20191377, the 1311 Talent Project of Nanjing University of Posts and Telecommunications, and the "Chunhui Program" Collaborative Scientific Research Project (202202004).

Author Contributions Yang Yang: Writing - review & editing, Funding acquisition. Hao Lou: Software, Visualization, Formal analysis, Writing - original draft. Zijin Wang: Writing - review & editing. Jinran Wu: Supervision, Formal analysis, Writing - original draft, Writing - review & editing.

Funding Open Access funding enabled and organized by CAUL and its Member Institutions.

Data Availability The data are available with a reasonable request.

References

1. Li K, Huang W, Hu G, Li J (2023) Ultra-short term power load forecasting based on CEEMDAN-SE and LSTM neural network. Energy Build 279:112666
2. Wen L, Zhou K, Yang S, Lu X (2019) Optimal load dispatch of community microgrid with deep learning based solar power and load forecasting. Energy 171:1053–1065
3. Lebotsa ME, Sigauke C, Bere A, Fildes R, Boylan JE (2018) Short term electricity demand forecasting using partially linear additive quantile regression with an application to the unit commitment problem. Appl Energy 222:104–118
4. He F, Zhou J, Mo L, Feng K, Liu G, He Z (2020) Day-ahead short-term load probability density forecasting method with a decomposition-based quantile regression forest. Appl Energy 262:114396
5. Gupta D, Hazarika BB, Berlin M (2020) Robust regularized extreme learning machine with asymmetric Huber loss function. Neural Comput Appl 32(16):12971–12998
6. Zhang J, Siya W, Zhongfu T, Anli S (2023) An improved hybrid model for short term power load prediction. Energy 268:126561
7. Wang J, Zhang L, Li Z (2022) Interval forecasting system for electricity load based on data pre-processing strategy and multi-objective optimization algorithm. Appl Energy 305:117911
8. Wu F, Cattani C, Song W, Zio E (2020) Fractional ARIMA with an improved cuckoo search optimization for the efficient short-term power load forecasting. Alex Eng J 59(5):3111–3118
9. Dudek G (2016) Pattern-based local linear regression models for short-term load forecasting. Electr Power Syst Res 130:139–147
10. Lee CM, Ko CN (2011) Short-term load forecasting using lifting scheme and ARIMA models. Expert Syst Appl 38(5):5902–5911
11. Wang J, Xing Q, Zeng B, Zhao W (2022) An ensemble forecasting system for short-term power load based on multi-objective optimizer and fuzzy granulation. Appl Energy 327:120042
12. Voyant C, Notton G, Kalogirou S, Nivet ML, Paoli C, Motte F et al (2017) Machine learning methods for solar radiation forecasting: A review. Renew Energy 105:569–582
13. Kim MK, Kim YS, Srebric J (2020) Predictions of electricity consumption in a campus building using occupant rates and weather elements with sensitivity analysis: Artificial neural network vs. linear regression. Sustain Cities Soc 62:102385
14. Hopfield JJ (1988) Artificial neural networks. IEEE Circ Devices Mag 4(5):3–10
15. Awad M, Khanna R (2015) Support vector regression. In: Efficient learning machines: Theories, concepts, and applications for engineers and system designers, pp 67–80
16. Huang GB, Zhu QY, Siew CK (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE international joint conference on neural networks (IEEE Cat. No. 04CH37541), vol 2. IEEE, pp 985–990
17. Biswas MR, Robinson MD, Fumo N (2016) Prediction of residential building energy consumption: A neural network approach. Energy 117:84–92
18. Trairat P, Banjerdpongchai D (2022) Multi-objective optimal operation of building energy management systems with thermal and battery energy storage in the presence of load uncertainty. Sustainability 14(19):12717
19. Tian X, Zou Y, Wang X, Tseng M, Li H, Zhang H (2022) Improving the efficiency and sustainability of intelligent electricity inspection: IMFO-ELM algorithm for load forecasting. Sustainability 14(21):13942
20. Sajjadi S, Shamshirband S, Alizamir M, Yee L, Mansor Z, Manaf AA et al (2016) Extreme learning machine for prediction of heat load in district heating systems. Energy Build 122:222–227
21. Ding S, Xu X, Nie R (2014) Extreme learning machine and its applications. Neural Comput Appl 25(3):549–556
22. Zhou Y, Zhou N, Gong L, Jiang M (2020) Prediction of photovoltaic power output based on similar day analysis, genetic algorithm and extreme learning machine. Energy 204:117894
23. Ni Q, Zhuang S, Sheng H, Kang G, Xiao J (2017) An ensemble prediction intervals approach for short-term PV power forecasting. Solar Energy 155:1072–1083
24. Han Y, Wang N, Ma M, Zhou H, Dai S, Zhu H (2019) A PV power interval forecasting based on seasonal model and nonparametric estimation algorithm. Solar Energy 184:515–526
25. Chen X, Yu R, Ullah S, Wu D, Li Z, Li Q et al (2022) A novel loss function of deep learning in wind speed forecasting. Energy 238:121808
26. Chen J, Zeng Y, Li Y, Huang GB (2020) Unsupervised feature selection based extreme learning machine for clustering. Neurocomputing 386:198–207
27. Yang Y, Tao Z, Qian C, Gao Y, Zhou H, Ding Z et al (2022) A hybrid robust system considering outliers for electric load series forecasting. Applied Intelligence, pp 1–23
28. Wang J, Zhu H, Cheng F, Zhou C, Zhang Y, Xu H et al (2023) A novel wind power prediction model improved with feature enhancement and autoregressive error compensation. J Clean Prod 420:138386
29. Niu WJ, Feng ZK, Li SS, Wu HJ, Wang JY (2021) Short-term electricity load time series prediction by machine learning model via feature selection and parameter optimization using hybrid cooperation search algorithm. Environ Res Lett 16(5):055032
30. Shang Z, He Z, Song Y, Yang Y, Li L, Chen Y (2020) A novel combined model for short-term electric load forecasting based on whale optimization algorithm. Neural Process Lett 52:1207–1232
31. Xie K, Yi H, Hu G, Li L, Fan Z (2020) Short-term power load forecasting based on Elman neural network with particle swarm optimization. Neurocomputing 416:136–142
32. Abualigah L, Diabat A, Mirjalili S, Abd Elaziz M, Gandomi A (2020) The arithmetic optimization algorithm. Comput Methods Appl Mech Eng 376:113609
33. Kaboli SHA, Fallahpour A, Selvaraj J, Rahim N (2017) Long-term electrical energy consumption formulating and forecasting via optimized gene expression programming. Energy 126:144–164
34. Khishe M, Mosavi MR (2020) Chimp optimization algorithm. Expert Syst Appl 149:113338
35. Luo L, Li H, Wang J, Hu J (2021) Design of a combined wind speed forecasting system based on decomposition-ensemble and multi-objective optimization approach. Appl Math Model 89:49–72
36. Yang Y, Zhou H, Gao Y, Wu J, Wang YG, Fu L (2022) Robust penalized extreme learning machine regression with applications in wind speed forecasting. Neural Comput Appl 34(1):391–407
37. Huber PJ (1973) Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, pp 799–821
38. Wang K, Zhong P (2014) Robust non-convex least squares loss function for regression with outliers. Knowl-Based Syst 71:290–302
39. Yang X, Tan L, He L (2014) A robust least squares support vector machine for regression and classification with noise. Neurocomputing 140:41–52
40. Beaton AE, Tukey JW (1974) The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16(2):147–185
41. Wang X, Jiang Y, Huang M, Zhang H (2013) Robust variable selection with exponential squared loss. J Am Stat Assoc 108(502):632–643
42. Karal O (2017) Maximum likelihood optimal and robust Support Vector Regression with lncosh loss function. Neural Netw 94:1–12
43. Huang GB, Siew CK (2005) Extreme learning machine with randomly assigned RBF kernels. Int J Inf Technol 11(1):16–24
44. Huang G, Huang GB, Song S, You K (2015) Trends in extreme learning machines: A review. Neural Netw 61:32–48
45. Konak A, Coit DW, Smith AE (2006) Multi-objective optimization using genetic algorithms: A tutorial. Reliab Eng Syst Saf 91(9):992–1007
46. Sampson JR: Adaptation in natural and artificial systems (John H. Holland). Society for Industrial and Applied Mathematics
47. Deb K, Agrawal S, Pratap A, Meyarivan T (2000) A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In: International conference on parallel problem solving from nature. Springer, pp 849–858
48. Wang J, Zhu H, Zhang Y, Cheng F, Zhou C (2023) A novel prediction model for wind power based on improved long short-term memory neural network. Energy 265:126283
49. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078
50. Zhang YM, Wang H (2023) Multi-head attention-based probabilistic CNN-BiLSTM for day-ahead wind speed forecasting. Energy 278:127865
51. Li D, Jiang MR, Li MW, Hong WC, Xu RZ (2023) A floating offshore platform motion forecasting approach based on EEMD hybrid ConvLSTM and chaotic quantum ALO. Appl Soft Comput, p 110487
52. Hong T, Fan S (2016) Probabilistic electric load forecasting: A tutorial review. Int J Forecast 32(3):914–938
53. Zhang Y, Wang J, Wang X (2014) Review on probabilistic forecasting of wind power generation. Renew Sust Energ Rev 32:255–270