Article
A New Alternating Suboptimal Dynamic Programming
Algorithm with Applications for Feature Selection
David Podgorelec * , Borut Žalik , Domen Mongus and Dino Vlahek
Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46,
SI-2000 Maribor, Slovenia; [email protected] (B.Ž.); [email protected] (D.M.); [email protected] (D.V.)
* Correspondence: [email protected]
Abstract: Feature selection is predominantly used in machine learning tasks, such as classification,
regression, and clustering. It selects a subset of features (relevant attributes of data points) from a
larger set that contributes as optimally as possible to the informativeness of the model. There are
exponentially many subsets of a given set, and thus, the exhaustive search approach is only practical
for problems with at most a few dozen features. In the past, there have been attempts to reduce the
search space using dynamic programming. However, models that consider similarity in pairs of
features alongside the quality of individual features do not provide the required optimal substructure.
As a result, algorithms, which we will call suboptimal dynamic programming algorithms, find a
solution that may deviate significantly from the optimal one. In this paper, we propose an iterative
dynamic programming algorithm, which inverts the order of feature processing in each iteration.
Such an alternating approach allows for improving the optimization function by using the score from
the previous iteration to estimate the contribution of unprocessed features. The iterative process is
proven to converge and terminates when the solution does not change in three successive iterations
or when the number of iterations reaches the threshold. Results in more than 95% of tests align
with those of the exhaustive search approach, being competitive and often superior to the reference
greedy approach. Validation was carried out by comparing the scores of output feature subsets
and examining the accuracy of different classifiers learned on these features across nine real-world
applications, considering different scenarios with various numbers of features and samples. In the
context of feature selection, the proposed algorithm can be characterized as a robust filter method that
can improve machine learning models regardless of dataset size. However, we expect that the idea of
alternating suboptimal optimization will soon be generalized to tasks beyond feature selection.

Keywords: dynamic programming; suboptimal solution; feature selection; machine learning

MSC: 90C39; 90C35; 68W25

Citation: Podgorelec, D.; Žalik, B.; Mongus, D.; Vlahek, D. A New Alternating Suboptimal Dynamic Programming Algorithm with Applications for Feature Selection. Mathematics 2024, 12, 1987. https://doi.org/10.3390/math12131987
1. Introduction
Feature selection plays a vital role in model construction in statistical analysis, dimension-
ality reduction, signal processing, pattern recognition, data visualization, and, particularly,
in various machine learning tasks, such as classification, regression, and clustering. Its
aim is to improve the model’s performance, including its accuracy, generalizability, and
interpretability, and reduce overfitting and computational cost [2].
Feature selection methods can be grouped into three categories [1]. Filter methods
evaluate candidate subsets with independent criteria that exploit essential characteristics
of the training data. They are fast, but the solution may deviate significantly from the
optimal one. A wrapper approach uses a learning algorithm for subset evaluation, such
as a classifier or regressor. Its performance is usually better but also much slower than
the filter approach. Embedded methods interact with a learning algorithm but at a lower
computational cost than the wrapper approach. They use independent criteria to identify
optimal subsets for a known cardinality. The learning algorithm is then used to select the
final optimal subset across different cardinalities [3].
Regardless of the approach chosen, feature selection can be viewed as an optimization
problem as it searches for the best-evaluated feature subset [4]. Different search strategies
can be used, including sequential search (greedy approach), exponential search (exhaustive
search, beam search, or branch and bound), and random search [3]. Conversely, dynamic
programming (DP) is not as commonly applied to feature selection as other methods. This
popular optimization approach breaks a problem into smaller subproblems and uses their
solutions to construct the solution to the larger problem. An optimal solution can be
found if the problem exhibits optimal substructure. This means that an optimal solution
to the problem contains optimal solutions to subproblems [5,6]. However, DP is usually
computationally demanding, so for reasons of feasibility, acceptable speed, and availability
to handle problems with higher dimensionality, it is also required that the number of
subproblems is not too high and that the subproblems overlap, suggesting that it makes
sense to record their solutions in a table and reuse them [6].
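As a minimal, generic illustration of these two prerequisites (the sketch below uses the classic coin-change toy problem, not feature selection), memoization stores each subproblem's answer so overlapping subproblems are solved once, and each optimal answer is built from optimal sub-answers:

```python
from functools import lru_cache

def min_coins(amount, coins=(1, 3, 4)):
    """Fewest coins summing to `amount` (a generic DP toy problem).
    Optimal substructure: an optimal answer for `amount` extends an optimal
    answer for `amount - c`. Overlapping subproblems: the same sub-amounts
    recur, so lru_cache records each answer and reuses it instead of
    recomputing it."""
    @lru_cache(maxsize=None)
    def best(a):
        if a == 0:
            return 0
        return 1 + min(best(a - c) for c in coins if c <= a)
    return best(amount)

print(min_coins(10))   # -> 3, e.g., 3 + 3 + 4
```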
In this paper, we highlight the possibilities of using DP in feature selection, analyze the
difficulties of existing (rare) approaches, and propose alternative solutions. An evaluation
criterion based on feature quality, correlation, and/or statistics does not generally provide
an optimal substructure since, e.g., the union of two optimal subsets is not necessarily
optimal due to possible high correlations between pairs of features, one from each subset. It
is possible to achieve an optimal solution for specific problems by adapting the evaluation
criterion, but this spoils generality (e.g., wrappers or embedded selection methods are
tied to specific machine learning models and prone to overfitting [2]), which is among
our primary goals. We thus focused on finding the best possible suboptimal solution. We
studied approximate (ADP) [7] and iterative [8] dynamic programming (IDP) methods and
developed a solution that we called alternating suboptimal dynamic programming (ASDP).
It inverts the order of feature processing in each iteration and improves the optimization
criterion by using the score from the previous iteration to estimate the contribution of
unprocessed features. Its contributions are as follows:
1. A better or at least the same evaluation score of the final solution set compared to
the score after a single iteration. Furthermore, the solution found in each iteration is
never worse than the one found in the previous iteration.
2. Optimal solution according to the evaluation score found in more than 95% of cases.
3. Polynomial worst-case time complexity (O(n⁴)) allows significantly larger input
feature sets to be considered compared to the exhaustive search approach.
4. Comparable and, in some cases, better classification accuracy on the basis of the
feature set selected by the new method than when using our previous graph-based
greedy feature selection method. In this respect, we have already demonstrated the
competitiveness of the latter in our previous work [9] compared to state-of-the-art
classification approaches and applied feature selection methods.
The rest of the paper is structured as follows. In Section 2, we survey existing solutions
in feature evaluation and selection, the use of DP in feature selection, and suboptimal DP approaches.
2. Related Works
As the topic presented here combines several challenges, the state-of-the-art review
must address several areas. First, in Section 2.1, we address feature evaluation, i.e., proce-
dures and metrics to assess the contribution of individual features and/or a feature subset
to a machine learning model. Feature evaluation is the basis for feature selection, which
we review in Section 2.2. The goal is to optimally select a subset of the input features for solving a given machine learning task. We wanted to approach the problem using dynamic programming, so in Section 2.3, we review the use of this algorithm design strategy in feature
selection. However, such methods are rare, time-demanding, and practically always offer
partial solutions only. Consequently, the solution proposed in this paper is suboptimal;
thus, Section 2.4 briefly reviews the use of suboptimal DP for various problems, including
feature selection.
• Filters;
• Wrappers;
• Embedded methods.
Filtering is usually performed using a threshold value. Although such methods are
computationally very efficient, their classification power largely depends on the feature
evaluation techniques [2,10,25]. The latter often consider only pairwise dependencies
between feature values and the target variable, ignoring correlations between features [2,12].
As a result, the prediction efficiency is limited.
In [26,27], a feature is evaluated by calculating how efficiently it separates different classes in the local neighborhood of selected samples. This enables low
execution times because it does not use all of the samples contained in the dataset. However,
calculations are usually inaccurate due to the limited number of considered samples.
The method also does not consider the correlation between features. In [28], the authors
proposed an approach that selects features highly correlated with the class labels and weakly correlated with each other. A similar method is proposed in [29] but for regression
purposes. The only difference is that it selects features highly correlated with the target
variable. However, neither technique considers the interaction between features and only
considers the linear interdependence between feature values and target variables.
In [30], the authors propose a two-stage feature selection method in which the evalua-
tion is based on the calculation of mutual information while at the same time considering
the correlation between pairs of features. In [23], the adequacy of the mutual information
for regression is considered. However, in the case of a small number of training samples,
inaccurate estimates of mutual information may appear, and the method is biased towards
features with a large number of different values due to the use of this metric. In [31], a
feature selection framework for large datasets was proposed based on a cascade of methods
capable of detecting nonlinear relationships between two features and designed to achieve
a balance between accuracy and speed.
Conversely, wrapper methods select a subset of features that maximizes the per-
formance of a given classifier or regressor [2]. Wrapper-based selection is treated as a multi-criteria optimization problem that maximizes machine-learning-method performance while minimizing the number of selected features. This can be addressed with several optimization techniques [2,32], such as sequential selection algorithms or nature-inspired algorithms, including evolutionary and genetic algorithms [33,34], particle swarm optimization [33,35], and the bees algorithm [36].
Early wrappers were based on sequential selection [37]. This starts with an empty set, adds features one at a time, and evaluates the prediction performance. The feature that gives the
best results is permanently included in the set. The selection continues by adding features
one by one again and keeping those that contribute the most to improving the prediction
performance. The algorithm terminates when a predetermined threshold of acceptable
results is reached or when a sufficient number of features are selected. In [38], the authors
propose an inverse procedure where features are removed from the input set. The main
limitation of these algorithms is that they do not consider the correlation between features.
This limitation is eliminated in the adaptive version of the algorithm [39]. However, such a
search for an optimal set with a larger number of features soon grows into an
exponentially time-consuming process [12,37].
Therefore, methods that find a suboptimal solution were proposed [33,34]. Examples
of the latter are algorithms based on nature-inspired concepts. Similar to sequential feature
selection methods, the evaluation function of evolutionary algorithms represents the performance of a model, while candidate feature subsets form the population. The best-performing subsets
are combined to achieve the desired result [34,35]. The biggest problem with these methods
is their computational complexity, as the process involves the model evaluating features
for each specific subset over a large number of iterations to obtain useful results [2,40].
where µ_i denotes the mean, while the standard deviation σ_i of the feature values is defined as σ_i = sqrt( (1/M) ∑_{m=1}^{M} (f_{i,m} − µ_i)² ). Both functions, ∆ and P, are designed such that lower values (closer to 0) are more favorable for selection than higher values (closer to 1).
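For concreteness, the following sketch computes the per-feature statistics above and a possible pairwise weight matrix. The use of the absolute Pearson correlation for P here is an illustrative assumption only, since the paper's own definitions of ∆ and P precede this excerpt:

```python
import numpy as np

def feature_stats(X):
    """Per-feature mean mu_i and standard deviation sigma_i over the M samples
    (rows) of X, matching the definitions above (population standard deviation)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)               # sqrt of (1/M) * sum of squared deviations
    return mu, sigma

def pairwise_weights(X):
    """Hypothetical edge weights P(e_ij): absolute Pearson correlation between
    feature columns, so that values near 0 (weakly related features) are the
    favorable ones. This choice is an assumption made for illustration only."""
    return np.abs(np.corrcoef(X, rowvar=False))

X = np.random.default_rng(0).normal(size=(100, 5))   # M = 100 samples, n = 5 features
mu, sigma = feature_stats(X)
P = pairwise_weights(X)
print(mu.shape, sigma.shape, P.shape)                # (5,) (5,) (5, 5)
```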
According to the theoretical framework introduced in [9], we use the following defini-
tions of elementary properties:
• Vertices f_i ∈ F and f_j ∈ F are adjacent in a graph G if there exists an edge e_{i,j} ∈ E.
• A path from f_{i_0} to f_{i_N} is an ordered sequence of vertices Π_{i_0,i_N} = f_{i_0}, f_{i_1}, . . . , f_{i_N}, such that f_{i_j} and f_{i_{j+1}} are adjacent for all j ∈ [0, N − 1].
• A graph G is connected if ∀ f_i, f_j ∈ F there exists a path Π_{i,j}.
• A graph G′ = (F′, E′) is a subgraph of G if F′ ⊆ F and E′ ⊆ E.
• A neighbourhood Z(f_i) of a vertex f_i in graph G is the subset of vertices of F defined by all the adjacent vertices of f_i, namely, Z(f_i) = {f_j}; f_j ∈ F; ∃e_{i,j}, where i ≠ j.
We say that a set of vertices CUT(F) ⊆ F is a vertex-cut if its removal separates graph G into at least two non-empty, pairwise disconnected components. Obviously, Z(f_i) is a vertex-cut, as it separates a singleton {f_i} (i.e., an individual vertex) from the rest of the graph, thus creating a subgraph G′ = (F′, E′), whose vertex and edge sets are given formally by Equation (2).
F′ = F \ (Z(f_i) ∪ {f_i}),    (2)
E′ = {e_{h,l} ∈ E; f_h, f_l ∈ F′ and h ≠ l}.
Figure 1. Vertex-cut-based feature selection: (a) graph G where the feature of the highest quality (coloured green) is selected and its neighbourhood (red) is removed, (b) repeating the same procedure on subgraph G′, and (c) subgraph G′′. (d) The output result {f_1, f_5, f_6} (in green) is obtained.
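A compact sketch of this vertex-cut selection loop is given below; building the neighbourhood from a correlation threshold t_p is an assumption for illustration (the paper constructs graph G with its own thresholds T_∆ and T_P):

```python
import numpy as np

def graph_cut_feature_selection(delta, P, t_p):
    """Sketch of the vertex-cut loop illustrated in Figure 1: pick the remaining
    feature of the highest quality (lowest delta, since lower is better), then
    remove it together with its neighbourhood, i.e., the remaining features
    whose pairwise weight P exceeds the hypothetical threshold t_p."""
    remaining = set(range(len(delta)))
    selected = []
    while remaining:
        best = min(remaining, key=lambda i: delta[i])    # highest-quality vertex
        selected.append(best)
        neighbours = {j for j in remaining if j != best and P[best][j] > t_p}
        remaining -= neighbours | {best}                 # cut Z(f_best) and f_best away
    return selected

delta = np.array([0.2, 0.9, 0.4, 0.1, 0.7, 0.3])
P = np.abs(np.corrcoef(np.random.default_rng(1).normal(size=(50, 6)), rowvar=False))
print(graph_cut_feature_selection(delta, P, t_p=0.25))
```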
∆(f_0) = 0,
∆(f_{n+1}) = 0,
P(e_{0,n+1}) = ∞,    (3)
P(e_{0,i}) = 0,  0 < i ≤ n,
P(e_{i,n+1}) = 0,  0 < i ≤ n.
Each graph vertex f i contains, in addition to the weight ∆( f i ), a set Si that stores the
“optimal” subset (feature selection result) of the vertices already processed, and the score si
of this subset, which is obtained by the evaluation criterion. Their initialization is described
by Equation (4) and is important for the convergence proof in Section 3.3. The evaluation
criterion described in Equation (5) seeks a minimum for all vertices, except the guards, i.e.,
0 < i ≤ n.
S_i = ∅,  0 ≤ i ≤ n + 1,    (4)
s_i = 0,  0 ≤ i ≤ n + 1.
s_i = min_{0 ≤ j < i} ( s_j + ∆(f_i) + ∑_{k ∈ S_j} P(e_{k,i}) )    (5)
Let r be the value of j where the minimum was identified. The corresponding Si is
calculated by Equation (6).
S_i = S_r ∪ {i}    (6)
The final score score and feature selection result Solution are given by Equation (7).
score = min_{0 < i ≤ n} s_i,
solution = the index i at which score was found,    (7)
Solution = S_solution.
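The single-pass recurrence of Equations (4)-(7) can be sketched as follows (a sketch with 0-based arrays, where only the left guard f_0 is made explicit); as discussed further below, this plain form drifts towards a trivial solution, which Equations (8)-(11) later correct:

```python
import numpy as np

def sdp_single_pass(delta, P):
    """Single left-to-right pass over features f_1..f_n (sketch of Eqs. (4)-(7)).
    delta[i-1] is the quality of f_i (lower is better); P is an n x n matrix of
    pairwise weights. The guard f_0 has score 0 and an empty solution set."""
    n = len(delta)
    S = [set() for _ in range(n + 1)]           # Eq. (4)
    s = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # Eq. (5): try every predecessor j and count correlations of S_j with f_i
        cand = [s[j] + delta[i - 1] + sum(P[k - 1][i - 1] for k in S[j])
                for j in range(i)]
        r = int(np.argmin(cand))
        s[i] = cand[r]
        S[i] = S[r] | {i}                       # Eq. (6)
    best = min(range(1, n + 1), key=lambda i: s[i])   # Eq. (7)
    return s[best], S[best]

delta = [0.2, 0.9, 0.4, 0.1]
P = np.abs(np.corrcoef(np.random.default_rng(2).normal(size=(40, 4)), rowvar=False))
print(sdp_single_pass(delta, P))
```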
Figure 2a shows the situation immediately before Equations (5) and (6) are applied to
vertex f i , and Figure 2b shows the situation immediately after the equations are applied.
Green indicates the graph vertices that have already been processed, and white indicates
those that are being or will be processed. The red text indicates vertex attributes modified
during the observed f i processing.
Figure 2. The concept of feature selection based on dynamic programming: (a) partial solution to be
stored in f i considers the solutions stored in all its predecessors; (b) the situation after updating the
status of f i . Si and si are calculated with Equations (6) and (5), respectively.
So far, everything seems straightforward, but there are, in fact, three serious problems
in the process that need to be addressed. The first is that the importance of vertices and
edges might differ. For this reason, we introduce a weight w, 0 ≤ w ≤ 1. This modifies the
evaluation criterion Equation (5) into Equation (8).
s_i = min_{0 ≤ j < i} ( s_j + w · ∆(f_i) + (1 − w) · ∑_{k ∈ S_j} P(e_{k,i}) )    (8)
The second problem is that Equation (5) in its present form always leads to a trivial
solution from Equation (9). Since the weights of the graph vertices and edges are all
non-negative, the minimum consists of a single vertex (without incident edges) with the
lowest weight.
To prevent this, we first modified the model by replacing the decreasing vertex eval-
uation function ∆ with the increasing ∆′ ( f ) = 1 − ∆( f ). The idea was to award high
vertex weights and penalize high edge weights. This resulted in the optimization function
Equation (10):
s_i = max_{0 ≤ j < i} ( s_j + w · ∆′(f_i) − (1 − w) · ∑_{k ∈ S_j} P(e_{k,i}) ),    (10)
which does not tend towards the trivial solution. However, to retain complementarity
with the graph-cut-based method, we preferred to choose an alternative approach, which
decrements all vertex and edge weights (except those of the guards and their incident edges) by user-defined non-negative values shft_∆ and shft_P, respectively (see Equation (11)).
Furthermore, these two additional parameters provide new possibilities for tuning, as
demonstrated in Section 4.
s_i = min_{0 ≤ j < i} ( s_j + w · (∆(f_i) − shft_∆) + (1 − w) · ∑_{k ∈ S_j} (P(e_{k,i}) − shft_P) )    (11)
The third problem is the most demanding. Even if all partial solutions S j , 0 < j < i
were optimal, there is no guarantee that this will be the case after adding f i to any of these
solutions. It is enough that f i is over-correlated with a single feature from each S j , and the
optimum will likely be missed. In other words, optimization defined in this way does not
guarantee an optimal substructure, one of the two fundamental assumptions of dynamic
programming, along with overlapping subproblems [6]. Of course, when considering f i ,
we can no longer refresh its predecessors’ attributes S j and s j . We tried to mitigate this
problem by extending the evaluation criterion by predicting the contribution of vertices
not yet visited and, most importantly, considering the correlation between the visited and
predicted parts. The need to predict the contribution of unvisited vertices led us to a simple idea, which later turned out to be very successful, namely, to reverse the graph traversal direction after arriving at f_n. As G is an undirected graph, the status from the previous
traversal can simply be used to estimate the score si and partial solution Si . The updated
evaluation criterion is given by Equation (12).
s_fwd = min_{n+1 ≥ j > i} ( s_j + s_i + (1 − w) · ∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )    (12)
When the reverse traversal reaches f 1 , the direction of visiting the vertices is inverted
again. The evaluation criterion Equation (12) is slightly modified to Equation (13), corre-
sponding to the forward direction from f 1 towards f n . The only difference between the
two equations is, of course, the direction and boundaries of the vertices’ traversal, written
under the min function label.
s_fwd = min_{0 ≤ j < i} ( s_j + s_i + (1 − w) · ∑_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )    (13)
The modified evaluation criterion significantly impacts the choice of vertex f r (r is the
value of j, providing the minimum) and thus indirectly affects the calculation of si and Si .
Let r be the value of j in Equation (12) or (13) where the minimum was identified. The score
si is then calculated by using Equation (14), while Equation (6) representing the solution
subset Si remains applicable.
they can be used to predict the attributes of another vertex instead of f_i, namely f_{i_end}, which represents the last vertex in the set S_i (the one with the lowest index in the reverse-direction traversal or with the highest index in the forward traversal). However, we should not update s_{i_end} and S_{i_end} when we process f_i because we will need the values from the previous iteration when we process f_{i_end} later. As a consequence, we extend each vertex f_k with additional attributes pr(s_k) and pr(S_k) (pr stands for prediction), which store the aforementioned estimates of the score and the solution set. At the beginning of each iteration, the initialization pr(s_k) = ∞, 0 < k ≤ n, is performed. Algorithm 1 shows the processing of vertex f_i, which is further explained in Figure 3. For simplicity, we assume that all the variables in Algorithm 1 are global, except i and forward. The score s_i is determined as the minimum between the previously stored pr(s_i) and the s_i computed by Equation (14). In the former case, the set pr(S_i) is assigned to S_i, while in the latter case, S_i is determined by Equation (6). Note that pr(s_i) and pr(S_i) can be refreshed multiple times in the same iteration since multiple sequences S_i at different i can terminate with the same vertex f_{i_end}.
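Since Algorithm 1 itself is not reproduced in this excerpt, the following sketch reconstructs the described per-vertex step; the form of Equation (14) is inferred from Equation (15) with a general predecessor r, and the exact update order in the authors' pseudocode may differ:

```python
def process_vertex(i, forward, delta, P, S, s, prS, prs, w, shft_d, shft_p):
    """Sketch of the per-vertex step of Algorithm 1 as described in the text.
    Features are numbered 1..n; indices 0 and n+1 are the guards with s = 0 and
    S = empty set. For unprocessed vertices, S[.] and s[.] still hold the
    previous-iteration partial solutions; prS/prs hold the predictions."""
    n = len(delta) - 2
    S_i_prev, s_i_prev = set(S[i]), s[i]        # previous-iteration state of f_i
    js = range(0, i) if forward else range(n + 1, i, -1)
    best_fwd, r = float('inf'), None
    for j in js:                                # Eqs. (12)/(13): choose predecessor r
        cross = sum(P[k][h] - shft_p for k in S[j] for h in S_i_prev)
        cand = s[j] + s_i_prev + (1 - w) * cross
        if cand < best_fwd:
            best_fwd, r = cand, j
    # Eq. (14), in the form implied by Eq. (15): extend S_r by f_i itself
    s_new = s[r] + w * (delta[i] - shft_d) + \
        (1 - w) * sum(P[k][i] - shft_p for k in S[r])
    if prs[i] < s_new:                          # keep the better of prediction and Eq. (14)
        s[i], S[i] = prs[i], set(prS[i])
    else:
        s[i], S[i] = s_new, S[r] | {i}          # Eq. (6)
    # Predict the attributes of f_iend, the last vertex of the previous S_i in the
    # current direction, without overwriting its s/S from the previous iteration
    if S_i_prev:
        i_end = max(S_i_prev) if forward else min(S_i_prev)
        if best_fwd < prs[i_end]:
            prs[i_end], prS[i_end] = best_fwd, S[r] | S_i_prev
    return s[i]
```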
Figure 3a,b show the situation immediately before and after Equations (6), (12) and
(14) are applied to vertex f i , respectively. The graph traversal is performed in the reverse
direction. The obvious difference between the straightforward non-iterative solution from
Figure 2 is that here, Si does not contain the initial vertex f i only, but the partial solution
from the previous iteration instead. As a consequence, there is a double loop in the sum
calculation. The green color indicates the graph vertices that have already been processed
in the observed iteration, and the yellow color indicates those that were processed in
the previous iteration (and are or will be processed later in the current iteration). Note
that these yellow vertices contain the predictions (colored cyan), which might be updated
earlier in the ongoing iteration. The red text indicates vertex attributes modified during the
observed f i processing. Analogously, Figure 3c,d show the processing of vertex f i when the
graph is passed in the forward direction. Equation (13) replaces Equation (12) in this case.
Figure 3. The concept of feature selection based on alternating suboptimal dynamic programming:
the situation (a) before processing f i during the reverse direction traversal; (b) after processing f i
during the reverse direction traversal; (c) before processing f i during the forward direction traversal;
and (d) after processing f i during the forward direction traversal.
The pseudocode in Algorithm 2 describes the overall structure of the alternating sub-
optimal dynamic programming method for feature selection. As mentioned, 200 features
can still be processed relatively fast, but for larger input sets, it makes sense to preprocess
the features with graph-cut-based feature selection filtering (line 2). The initialization
in line 3 sets up the guard vertices using Equation (3). Partial solution set candidates and their scores are initialized using Equation (4), which is needed in lines 4, 7 and 11 of Algorithm 1 within the first-iteration calls of ProcessVertex (line 11 of Algorithm 2). The value finalScore is set to some high value (∞) to provide the first comparison in line 16, and MaxIterations is set to a user-defined value or the default of 100. In line 8, all predicted scores are set to a high value (∞) at the beginning of each iteration, which is needed in line 16.
The main work is done in the ProcessVertex function, which is called sequentially in line 11
for each feature f i except for the guard vertices. The direction of traversing the features
is inverted in each iteration (line 23). The process terminates when the identical score is
obtained three times in a row or the number of iterations reaches MaxIterations (line 24). If
there are two (or more) solutions with the same score, the algorithm may find one during
the forward direction traversal and a different one in the reverse direction traversal. In this
case, it will return the last of the two solutions found.
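The overall loop of Algorithm 2 can be sketched as follows; the per-vertex routine (for example, the sketch given after Algorithm 1's description) and the per-iteration reset of predictions are passed in as callables, and retrieval of the final solution set is omitted for brevity:

```python
def asdp(n, process_vertex_fn, reset_predictions_fn=lambda: None, max_iterations=100):
    """Skeleton of the alternating loop in Algorithm 2 (sketch). The callable
    process_vertex_fn(i, forward) is assumed to update the shared solution
    state and return the current score s_i of vertex f_i."""
    final_score = float('inf')              # provides the first comparison in line 16
    same_in_a_row = 0
    forward = True
    for _ in range(max_iterations):
        reset_predictions_fn()              # line 8: pr(s_k) <- infinity, 0 < k <= n
        order = range(1, n + 1) if forward else range(n, 0, -1)
        score = min(process_vertex_fn(i, forward) for i in order)   # Eq. (7)
        same_in_a_row = same_in_a_row + 1 if score == final_score else 1
        final_score = min(final_score, score)
        if same_in_a_row >= 3:              # line 24: identical score three times in a row
            break
        forward = not forward               # line 23: invert the traversal direction
    return final_score
```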
Proposition 1. The score in each iteration of the proposed alternating suboptimal dynamic pro-
gramming algorithm can only be lower (better) or equal to the score in the previous iteration but
never higher (worse).
Proof. The proof is conceptually straightforward since we will show that the score si from
the previous iteration is also considered a candidate for the minimum in the observed
iteration. Namely, this score is obtained in the evaluation criterion in line 7 of Algorithm 1
at j = 0 or in line 4 at j = n + 1. The algorithm does not modify the parameters of the
two guards, so s j = 0 and S j = ∅ in both cases. Consequently, only si remains from the
expression on the right of (13) or (12). If s_i is also the minimum in the current iteration, then s_fwd = s_i will be written first to pr(s_{i_end}) in line 17 of Algorithm 1, then to s_i in line 13 of Algorithm 1, to score in line 13 of Algorithm 2, and finally to finalScore in line 17 of Algorithm 2. Conversely, if s_i is not the minimum in the current iteration, then it can
only be replaced with a lower score in some of the aforementioned lines of Algorithm 1 or
Algorithm 2. This completes the proof.
Based on Proposition 1, it makes sense to modify the initialization (line 3 of Algorithm 2).
The proven convergence allows us to use the input feature set instead of the empty set as an
initial solution candidate. Equation (15) introduces a recursive definition of initial values,
which replaces Equation (4). Note that the last two lines of Equation (15) were derived
from Equations (6) and (14) by setting r = i − 1.
S_0 = S_{n+1} = ∅,  s_0 = s_{n+1} = 0,
S_1 = {1},  s_1 = ∆(f_1),
S_i = S_{i−1} ∪ {i},  2 ≤ i ≤ n,    (15)
s_i = s_{i−1} + w · (∆(f_i) − shft_∆) + (1 − w) · ∑_{k ∈ S_{i−1}} (P(e_{k,i}) − shft_P),  2 ≤ i ≤ n.
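A direct transcription of this warm-start initialization, with features numbered 1..n and guard slots 0 and n+1, might look as follows (a sketch only):

```python
def initialize_from_eq15(delta, P, w, shft_d, shft_p):
    """Warm-start initialization of Eq. (15) (sketch): the whole input feature
    set becomes the initial solution candidate. Features are numbered 1..n;
    slots 0 and n+1 belong to the guards and keep empty sets and zero scores."""
    n = len(delta) - 2
    S = [set() for _ in range(n + 2)]
    s = [0.0] * (n + 2)
    if n >= 1:
        S[1], s[1] = {1}, delta[1]
        for i in range(2, n + 1):
            S[i] = S[i - 1] | {i}
            s[i] = s[i - 1] + w * (delta[i] - shft_d) + \
                (1 - w) * sum(P[k][i] - shft_p for k in S[i - 1])
    return S, s
```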
Propositions 2–4 consider the time and space complexity of the graph-cut-based and
the alternating suboptimal dynamic programming feature selection approaches.
Proposition 2. The graph-cut-based feature selection method has the worst-case time complexity O(n²), where n is the number of features, i.e., graph vertices.
Proof. The algorithm gradually selects features f̂_r with the highest quality, which requires at most O(n) steps. In each step, a neighborhood Z(f̂_r) is considered, which contains at most O(n) features. This results in O(n) · O(n) = O(n²) worst-case time complexity. Note that the method removes the considered features and their highly correlated neighborhood from the graph G in each step; consequently, the expected time complexity is much closer to O(n · log n), which corresponds to sorting the vertices according to their qualities.
Proposition 3. The alternating suboptimal dynamic programming feature selection method has the worst-case time complexity O(n⁴), where n is the number of features.
Proof. A double sum in lines 4 and 7 of Algorithm 1 contributes O(n²) time. In both cases, it is performed within the min function, which considers O(n) values. The ProcessVertex function thus requires O(n) · O(n²) = O(n³) time. It is called O(n) times in line 11 of Algorithm 2, resulting in O(n⁴) time per single iteration. Although the number of iterations (loop of lines 6–24) is by default set to 100, it rarely exceeds ten and practically never 15, so its time consumption may be considered constant, i.e., O(1), and the overall worst-case time complexity is proven to be O(n⁴).
Proposition 4. Both considered approaches to feature selection, i.e., the graph-cut-based and the alternating suboptimal dynamic programming algorithm, require O(n²) space, where n is the number of graph vertices (features).
Proof. In the graph-cut-based approach, the graph contains n vertices and at most n·(n−1)/2 edges. Similarly, there are n + 2 vertices and (n+2)·(n+1)/2 − 1 edges in the ASDP approach. Furthermore, n + 2 sets S_i and pr(S_i), each with O(n) elements, also do not exceed O(n²) space. The overall space complexity is thus O(n²).
4. Results
4.1. Validation Setup
The proposed method based on alternating suboptimal dynamic programming (ASDP)
and the exhaustive search algorithm (brute force, BF) were implemented in C++, while
the graph-cut-based feature selection (Graph-FS) was implemented using Python 3.11.5
on the Microsoft® Windows 11 operating system. All experiments were conducted on a
workstation with an Intel® Core™ i5 CPU and 16 GB of main memory. The algorithms are
not yet integrated into a common application, but the results of the Graph-FS prefiltering are
imported into the ASDP and BF methods via text files. The reproducibility of classification
experiments is provided through the scikit-learn 1.4.1 implementation of machine learning
methods. Classifiers were implemented with the following settings (a sketch of this setup is given after the list):
• K-Nearest neighbors classifier (KNN) was assessed using default settings, where
K ∈ {2, 3, . . . , 8} were tested;
• Naive Bayes classifier (NBC) was used with the default settings;
• Random Forest (RF) was used with a maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30};
• XGBOOST was used with a maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30}.
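The sketch below reproduces this validation setup with scikit-learn and the ten-fold cross-validation used in the experiments; mapping the "maximal number of iterations" to n_estimators and using GaussianNB for the NBC are assumptions, as the exact classes are not named in the paper:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def best_cv_accuracies(X, y):
    """Ten-fold cross-validated accuracy for each classifier, maximized over the
    parameter grids stated in Section 4.1 (sketch)."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    setups = {
        'KNN': (KNeighborsClassifier(), {'n_neighbors': list(range(2, 9))}),
        'NBC': (GaussianNB(), {}),                          # default settings
        'RF': (RandomForestClassifier(random_state=0),
               {'max_depth': [2, 4, 8, 16, 20],
                'n_estimators': [5, 10, 15, 20, 25, 30]}),  # assumed mapping
        'XGBOOST': (XGBClassifier(eval_metric='logloss'),
                    {'max_depth': [2, 4, 8, 16, 20],
                     'n_estimators': [5, 10, 15, 20, 25, 30]}),
    }
    results = {}
    for name, (clf, grid) in setups.items():
        if grid:
            search = GridSearchCV(clf, grid, scoring='accuracy', cv=cv).fit(X, y)
            results[name] = search.best_score_
        else:
            results[name] = cross_val_score(clf, X, y, scoring='accuracy', cv=cv).mean()
    return results
```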
The ASDP and BF evaluation and the classification accuracy assessment were con-
ducted on nine well-known benchmark datasets, available at the UCI machine learning
repository [59]. Table 1 summarises the characteristics of each dataset, including its name
and the number of features, classes, and samples contained.
Table 2. Comparison of scores obtained by BF, SDP-1, and the ASDP method.
Dataset ID | # Tests | SDP-1 Score = BF Score [%] | ASDP Score = BF Score [%] | SDP-1 Score = ASDP Score [%] | Max. # Iterations | Avg. # Iterations
Ds1 | 125 | 58.4 | 100.0 | 58.4 | 5 | 3.4
Ds2 | 125 | 68.8 | 100.0 | 68.8 | 7 | 3.6
Ds3 | 125 | 76.8 | 100.0 | 76.8 | 7 | 3.3
Ds4 | 125 | / | / | 65.6 | 8 | 3.6
Ds5 | 125 | 54.4 | 94.4 | 54.4 | 6 | 3.4
Ds6 | 125 | / | / | 52.0 | 7 | 3.9
Ds7 | 125 | / | / | 50.4 | 11 | 4.4
Ds8 | 125 | 54.4 | 84.8 | 54.4 | 7 | 3.7
Ds9 | 125 | / | / | 72.0 | 8 | 3.7
Total * | 1125 | 62.6 | 95.8 | 61.4 (62.6) | | 3.7
* The Tests column contains the sum, and the others contain average values.
1. The third column shows that SDP-1 reaches the global optimum in 62.6% of the tests.
The fourth column then shows that ASDP significantly raises this percentage to 95.8%.
2. The degree of match (61.4%) between the SDP-1 and ASDP scores in the fifth column
should not be below that between SDP-1 and BF (62.6%) since ASDP never degrades
the score from the first iteration, according to Proposition 1. Indeed, if we ignore rows
Ds4, Ds6, Ds7, and Ds9, where we could not evaluate BF, we also obtain 62.6% for
ASDP (in brackets). Interestingly, at least for the tests performed, a conclusion can be
drawn that whenever ASDP fails to reach the global optimum in the first iteration, it
improves the score at least a little in subsequent iterations.
3. The last two columns confirm the empirical finding of the proof of Proposition 3 that
the number of iterations of ASDP is within O(1), since in the tests performed, it does
not exceed 11, and on average it is only 3.7, barely above the termination condition of
3 consecutive iterations with the unchanged score.
In order to further improve the results and, in particular, the feasibility in situations
with a larger number of features, we preprocessed ASDP with fast and highly accurate,
though still suboptimal, Graph-FS. The results are shown in Table 3, and the critical
observations are listed immediately below.
1. The second column confirms a significantly lower number of features than before the
use of Graph-FS (see Table 1).
2. The fourth column shows that BF did not change the Graph-FS results in 38.7% of
tests. In other words, it obtains a better score in 61.3% of cases.
3. The fifth column gives the first impression that ASDP performs significantly worse (34.8% vs. 38.7%) than BF. However, eliminating all tests on the Ds7 dataset,
where BF was not viable, made both scores equal. Since ASDP cannot, according to
Proposition 1 and the initialization from Equation (15), spoil the initial score, we may
also conclude here that the score was strictly improved in the remaining 61.3% of tests.
However, a better ASDP score obtained with Equation (14) does not necessarily imply
better results in practical applications. We will show this in Section 4.3 by matching
the ASDP score with the classification accuracy.
4. The sixth column shows that preprocessing of ASDP with Graph-FS raises the propor-
tion of solutions reaching the global optimum from 95.8% in Table 2 to 98%.
5. The last two columns show a maximum number of iterations of 12 and a lower average number of iterations of 3.4, compared to 3.7 in Table 2.
this purpose, we compared the classification performance of the selected features for both presented methods and their combination (Graph-FS + ASDP) with the performance of the same classifiers trained on the complete input feature set. The results are shown in
Tables 4–7 for each specific classifier used. All tests were conducted by ten-fold cross-
validation [60], using average accuracy acc to indicate the method’s efficiency. The accuracy
is defined by (16):
Note that the acc values in the tables represent the highest achieved classification
results. Namely, in all test cases, all combinations of the classifier’s parameter values
(see Section 4.1) were tested, except for the NBC. The latter has no tunable hyperparameters and was used with the default settings. We also report the number of features selected and the parameters T_∆ and T_P used in the Graph-FS and Graph-FS + ASDP methods while obtaining the listed highest results. Since identical results were typically obtained for different combinations, we do not list the ASDP parameters w, shft_∆, and shft_P. Table 1 gives
the number of input features. The highest accuracy for each dataset is emphasized in bold.
Here, we considered that the same accuracy can be achieved across different methods,
regardless of selected features.
Table 4. Accuracies for RF classifier after the feature selection with Graph-FS, ASDP, their combination,
or when using all input features.
Table 5. Accuracies for XGBOOST classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using
all input features.
Table 6. Accuracies for NBC after the feature selection with Graph-FS, ASDP, their combination, or
when using all input features.
Table 7. Accuracies for KNN classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using all
input features.
Analysis shows an improvement in accuracy on the original dataset for all test cases
except in the case of Ds1 for classifier RF. Furthermore, Graph-FS and ASDP achieved
similar classification scores. However, Graph-FS showed slightly higher accuracy for Ds2,
Ds3, Ds5, and Ds8 for classifier RF, while the same results as ASDP are shown in the case
of Ds4, Ds5, Ds6, and Ds9. For classifier XGBOOST, similar results are obtained, where
Graph-FS is slightly better in classification accuracy than ASDP in cases Ds1, Ds2, Ds3, Ds7,
and Ds8. In the case of classifier NBC, Graph-FS achieved the best results in cases Ds2, Ds3,
Ds5, and Ds8, while for Ds8, ASDP provides the most informative feature subset, achieving
the highest accuracy among those in the comparison. We observed different results for the
last classifier, KNN, with ASDP showing superior performance. It achieved the highest accuracy in cases Ds1, Ds3, Ds7, and Ds8.
Conversely, when comparing ASDP and Graph-FS + ASDP, we noticed improved
classification performance of selected classifiers in some cases. For example, in the case
of Ds4 for classifier RF, we achieved the highest classification accuracy with Graph-FS
+ ASDP for a selected feature subset that contains only two features, while Graph-FS
and ASDP achieved the same results when subsets of 10 and 14 features were selected,
respectively. Similar results can be found in the case of Ds2 and Ds6 across all classifiers,
Ds1 for NBC, and Ds2 and Ds3 for the KNN classifier, where the combination of Graph-FS
and ASDP achieved the highest measured accuracy but with a smaller number of features
than Graph-FS and ASDP individually. The most interesting result is that for Ds7 for
NBC, where Graph-FS and ASDP combined achieved the highest accuracy among all the
measured results.
Finally, the results demonstrate the robustness of both approaches, as no significant de-
viations regarding the improvements were displayed in experiments with various datasets
with different numbers of features or samples. Both ASDP and Graph-FS + ASDP achieved
comparable results regardless of the number of features, which can be low (e.g., Ds1 and
Ds3) or high (e.g., Ds7 and Ds9). In addition, both approaches showed improvements in
classification accuracy in datasets containing both small and large numbers of samples.
5. Discussion
This paper introduces an alternating suboptimal dynamic programming (ASDP) algo-
rithm, primarily aimed at improving feature selection, at least in some cases, and being
competitive in others. It iteratively considers individual features and inverts the processing
order in each iteration. This allows the optimization function to be improved by using the
score from the previous iteration to estimate the contribution of yet unprocessed features in
the current one. We proved that convergence is achieved and that the time complexity dis-
plays a polynomial (O(n4 )) relationship. Results on nine well-known benchmark datasets
for machine learning tasks demonstrated that single-iteration suboptimal dynamic programming found the global optimum in 62.6% of cases, which was significantly improved to 95.8% by ASDP in only 3.7 iterations on average (and never more than 12). Although ASDP is
relatively slow and thus limited to 200–300 input features, we have extended its usability
by preprocessing it with our fast and highly accurate graph-cut-based feature selection
(Graph-FS) method. This raised the proportion of solutions reaching the global optimum to
98% and reduced the average number of iterations to 3.4.
We have also shown the practicality of using ASDP and the Graph-FS + ASDP com-
bination in classification. The latter was slightly behind or equal to the Graph-FS alone
when using the RF or XGBOOST classifiers and sometimes slightly better when using
the NBC. The former seems contradictory to the proven convergence of ASDP, but the
optimization criterion of ASDP and the classification accuracy of the used classifiers do
not guarantee the perfect consistency of results. Surprisingly, the ASDP method without
Graph-FS prefiltering performed best when using the KNN classifier. Finally, in all but
one case for RF, the presented methods achieved better classification accuracy than the
classifiers learned from the complete input feature set. Note that the superior performance
of Graph-FS in comparison to state-of-the-art approaches was already demonstrated in [9].
We may thus conclude that ASDP and Graph-FS + ASDP are also entirely competitive.
The four contributions of the proposed method, listed in Section 1, were justified as
follows. The first was confirmed by the proof of Proposition 1 and by the results in Table 2.
Table 2 also confirmed the second promised contribution, which was further exceeded by
the results in Table 3. The third contribution was confirmed by the proof of Proposition 3,
as well as by the fact that the BF score in some cases in Table 2 could not be determined due
to excessive time complexity. The fourth contribution was confirmed by the experiments in
Section 4.3, in particular by the results in Tables 6 and 7.
A disadvantage of using ASDP without preprocessing is that a larger number of
features makes the method too slow or, depending on the implementation, even infeasible.
It processes 200 features in 5 s on a regular PC and becomes useless at 500 features. This
represents a significant improvement compared to the exhaustive search approach, which
achieves such performance at a very modest 25 and 30 features, respectively. However, for
larger input sets, it makes sense to preprocess ASDP with some faster filtering. Conversely, Graph-FS + ASDP restricts the solution search space to subsets of the Graph-FS solution.
We will try to achieve a compromise by cascading Graph-FS over 2–5 iterations, gradually lowering the thresholds T_∆ and T_P in each iteration and extending the selected set with features chosen from those not yet in the solution. We would also like to evaluate the use of ASDP in regression tasks in the future. In addition, we expect that the idea
of alternating suboptimal optimization will soon be generalized to tasks beyond feature
selection as well. In general, graph nodes can represent a wide variety of entities, and edges
can represent any bilateral operation, such as distance, similarity, or correlation.
Author Contributions: Conceptualization, D.P. and D.V.; methodology, D.M. and D.V.; software, D.P.
and D.V.; validation, D.P., D.V. and B.Ž.; formal analysis, D.V. and B.Ž.; investigation, D.P., D.V., D.M.
and B.Ž.; resources, D.V.; data curation, D.V.; writing—original draft preparation, D.P., D.V. and B.Ž.;
writing—review and editing, D.P. and D.M.; visualization, D.P. and D.V.; supervision, B.Ž. and D.M.;
project administration, B.Ž.; funding acquisition, B.Ž. and D.M. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by the Slovene Research and Innovation Agency under Research
Project J2-4458 and Research Programme P2-0041.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Liu, H.; Motoda, H. Feature Selection for Knowledge Discovery and Data Mining; Kluwer Academic Publishers: Dordrecht, The
Netherlands, 1998.
2. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
3. Kumar, V.; Minz, S. Feature selection: A literature Review. SmartCR 2014, 4, 211–229.
4. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324.
5. Bellman, R. Dynamic programming. Princet. Univ. Press 1957, 89, 92.
6. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2022.
7. Liu, D.R.; Li, H.L.; Wang, D. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey.
Int. J. Autom. Comput. 2015, 12, 229–242.
8. Kossmann, D.; Stocker, K. Iterative dynamic programming: A new class of query optimization algorithms. ACM Trans. Database
Syst. 2000, 25, 43–82.
9. Vlahek, D.; Mongus, D. An Efficient Iterative Approach to Explainable Feature Learning. IEEE Trans. Neural Netw. Learn. Syst.
2023, 34, 2606–2618.
10. Forman, G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res. 2003,
3, 1289–1305.
11. Fakhraei, S.; Soltanian-Zadeh, H.; Fotouhi, F. Bias and Stability of Single Variable Classifiers for Feature Ranking and Selection.
Expert Syst. Appl. 2014, 41, 6945–6958.
12. Liu, H.; Motoda, H. Computational Methods of Feature Selection; Chapman & Hall/CRC: Boca Raton, FL, USA, 2007; p. 440.
13. Gu, Q.; Li, Z.; Han, J. Generalized Fisher Score for Feature Selection. In Proceedings of the 27th Conference on Uncertainty in
Artificial Intelligence, UAI 2011, Barcelona, Spain, 14–17 July 2012; pp. 266–273.
14. Li, H.; Jiang, T.; Zhang, K. Efficient and robust feature extraction by maximum margin criterion. In Proceedings of the Advances
in Neural Information Processing Systems, Whistler, BC, Canada, 8–13 December 2003; Volume 16.
15. He, X.; Cai, D.; Niyogi, P. Laplacian Score for Feature Selection. In Proceedings of the 18th International Conference on Neural
Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 507–514.
16. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA,
2011; p. 744.
17. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006; p. 792.
18. Verleysen, M.; Rossi, F.; François, D. Advances in Feature Selection with Mutual Information. In Similarity-Based Clustering: Recent
Developments and Biomedical Applications; Biehl, M., Hammer, B., Verleysen, M., Villmann, T., Eds.; Springer: Berlin/Heidelberg,
Germany, 2009; pp. 52–69.
19. Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Wadsworth International Group: Belmont, CA,
USA, 1984.
20. Strobl, C.; Boulesteix, A.L.; Augustin, T. Unbiased split selection for classification trees based on the Gini Index. Comput. Stat.
Data Anal. 2007, 52, 483–501.
21. Raileanu, L.; Stoffel, K. Theoretical Comparison between the Gini Index and Information Gain Criteria. Ann. Math. Artif. Intell.
2004, 41, 77–93.
22. Krakovska, O.; Christie, G.; Sixsmith, A.; Ester, M.; Moreno, S. Performance comparison of linear and non-linear feature selection
methods for the analysis of large survey datasets. PLoS ONE 2019, 14, e0213584.
23. Frénay, B.; Doquire, G.; Verleysen, M. Is mutual information adequate for feature selection in regression? Neural Netw. 2013,
48, 1–7.
24. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany,
2006; p. 728.
25. Bell, D.; Wang, H. A Formalism for Relevance and Its Application in Feature Subset Selection. Mach. Learn. 2000, 41, 175–195.
26. Kira, K.; Rendell, L.A. A Practical Approach to Feature Selection. In Proceedings of the Ninth International Workshop on
Machine Learning, San Francisco, CA, USA, 1–3 July 1992; pp. 249–256.
27. Kononenko, I.; Šimec, E.; Robnik-Šikonja, M. Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell.
1997, 7, 39–55.
28. Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Ph.D. Thesis, The University of Waikato, Hamilton, New
Zealand, 1999.
29. Yu, L.; Liu, H. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the
Twentieth International Conference on International Conference on Machine Learning, Washington, DC, USA, 21–24 August
2003; pp. 856–863.
30. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and
min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
31. Garcia-Ramirez, I.A.; Calderon-Mora, A.; Mendez-Vazquez, A.; Ortega-Cisneros, S.; Reyes-Amezcua, I. A novel framework for
fast feature selection based on multi-stage correlation measures. Mach. Learn. Knowl. Extr. 2022, 4, 131–149.
32. Wang, L.; Zhou, N.; Chu, F. A General Wrapper Approach to Selection of Class-Dependent Features. IEEE Trans. Neural Netw.
2008, 19, 1267–1278.
33. Oliveira, L.S.; Sabourin, R.; Bortolozzi, F.; Suen, C.Y. A methodology for feature selection using multiobjective genetic algorithms
for handwritten digit string recognition. Int. J. Pattern Recognit. Artif. Intell. 2003, 17, 903–929.
34. Jesenko, D.; Mernik, M.; Žalik, B.; Mongus, D. Two-Level Evolutionary Algorithm for Discovering Relations between Nodes
Features in a Complex Network. Appl. Soft Comput. 2017, 56, 82–93.
35. Chuang, L.Y.; Chang, H.W.; Tu, C.J.; Yang, C.H. Improved binary PSO for feature selection using gene expression data. Comput.
Biol. Chem. 2008, 32, 29–38.
36. Schiezaro, M.; Pedrini, H. Data feature selection based on Artificial Bee Colony algorithm. EURASIP J. Image Video Process. 2013,
47, 1–8.
37. Narendra, P.M.; Fukunaga, K. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Trans. Comput. 1977, C-26, 917–922.
38. Gheyas, I.A.; Smith, L.S. Feature subset selection in large dimensionality domains. Pattern Recognit. 2010, 43, 5–13.
39. Somol, P.; Pudil, P.; Novovicová, J.; Paclík, P. Adaptive floating search methods in feature selection. Pattern Recognit. Lett. 1999,
20, 1157–1163.
40. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
41. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563.
42. Buteneers, P.; Caluwaerts, K.; Dambre, J.; Verstraeten, D.; Schrauwen, B. Optimized parameter search for large datasets of the
regularization parameter and feature selection for ridge regression. Neural Process. Lett. 2013, 38, 403–416.
43. Nelson, G.D.; Levy, D.M. A Dynamic Programming Approach to the Selection of Pattern Features. IEEE Trans. Syst. Sci. Cybern.
1968, 4, 145–151.
44. Acır, N. Classification of ECG beats by using a fast least square support vector machines with a dynamic programming feature
selection algorithm. Neural Comput. Appl. 2005, 14, 299–309.
45. Cheung, R.; Eisenstein, B. Feature selection via dynamic programming for text-independent speaker identification. IEEE Trans.
Acoust. Speech Signal Process. 1978, 26, 397–403.
46. Moudani, W.; Shahin, A.; Shakik, F.; Mora-Camino, F. Dynamic programming applied to rough sets attribute reduction. J. Inf.
Optim. Sci. 2013, 32, 1371–1397.
47. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Nashua, NH, USA, 1996.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.