1 Operational Research Center in Healthcare, Near East University, TRNC Mersin 10, Nicosia 99138, Turkey
2 Department of Biomedical Engineering, Near East University, TRNC Mersin 10, Nicosia 99138, Turkey
3 Department of Mathematics, Near East University, TRNC Mersin 10, Nicosia 99138, Turkey
* Correspondence: [email protected] (E.P.O.); [email protected] (N.S.)
Abstract: High-dimensional datasets, where the number of features far exceeds the number of
observations, present significant challenges in feature selection and model performance. This study
proposes a novel two-stage feature-selection approach that integrates Artificial Bee Colony (ABC)
optimization with Adaptive Least Absolute Shrinkage and Selection Operator (AD_LASSO). The
initial stage reduces dimensionality while effectively dealing with complex, high-dimensional search
spaces by using ABC to conduct a global search for the ideal subset of features. The second stage
applies AD_LASSO, refining the selected features by eliminating redundant features and enhancing
model interpretability. The proposed ABC-ADLASSO method was compared with the AD_LASSO,
LASSO, stepwise, and LARS methods under different simulation settings in high-dimensional data
and various real datasets. According to the results obtained from simulations and applications on
various real datasets, ABC-ADLASSO has shown significantly superior performance in terms of
accuracy, precision, and overall model performance, particularly in scenarios with high correlation
and a large number of features compared to the other methods evaluated. This two-stage approach
offers robust feature selection and improves predictive accuracy, making it an effective tool for
analyzing high-dimensional data.
Keywords: feature selection; artificial bee colony; adaptive LASSO; high-dimensional data
Given a dataset X = {x_1, x_2, ..., x_p}, where x_i represents the i-th feature, the aim of feature selection is to identify a subset of features X_S ⊂ X that maximizes the model's performance according to a certain evaluation criterion. Feature selection is the process of reducing the number of input features by eliminating the least significant or redundant ones, hence enhancing model interpretability and decreasing computational effort [5]. Depending on the way these features are employed, feature-selection methods can be broadly divided into three categories: filter, wrapper, and embedded [6,7]. Filter methods use statistical measures to rank each feature independently of the learning algorithm; they are simple and fast but ignore interactions between features. Wrapper methods apply a machine learning model to rate various feature subsets, which results in a more precise selection of features but comes at the cost of greater computation time [6–8]. Embedded methods incorporate feature selection into model training itself, as in LASSO or Elastic Net, balancing accuracy against model complexity. Despite the strengths of existing selection methods, the complexity of high-dimensional data still calls for hybrid approaches that leverage the advantages of multiple feature-selection techniques [9].
There are many methods available for optimizing the feature-selection process, and
most of them have their advantages as well as disadvantages. Of the swarm intelligence
algorithms, nature-inspired Artificial Bee Colony (ABC) has emerged as a powerful opti-
mization algorithm [10,11]. ABC was originally designed based on the foraging behavior of honeybees, and it mimics a bee colony's search to find optimal solutions to a wide range of optimization problems [12,13]. When applied to feature selection, it seeks the best subsets of features through a balance of exploitation and exploration. The algorithm stands out for its simplicity and its ability to avoid becoming trapped in local optima. Compared with conventional approaches that break down on high-dimensional data due to computational issues, ABC is more flexible [12,14,15]. This study
proposes the development of a two-stage feature-selection approach involving ABC and
Adaptive Least Absolute Shrinkage and Selection Operator (AD_LASSO) to increase model
accuracy. The motivation for using ABC in this framework stems from its demonstrated
effectiveness in global optimization tasks and its capability to handle the intricacies of
high-dimensional data.
2. Related Studies
Feature selection plays an important role in machine learning, especially in dealing
with high-dimensional datasets where the number of features is higher than the number
of observations. Different feature-selection techniques have been proposed in the existing
literature, broadly categorized into filter methods, wrapper methods, embedded methods,
and a combination of these methods, known as hybrid methods. In this section, the authors
review some studies and advances dedicated to the hybrid feature-selection techniques, as
well as present the potential of ABC relative to other types of metaheuristic approaches.
The LASSO method adds an L1-norm penalty to the objective function, shrinking the coefficients of features that are not useful for prediction to exactly zero and hence providing a binary selection of features [19,20]. Because of this, LASSO is especially helpful when the data
contain many unimportant factors. However, LASSO often chooses one predictor from a
set of correlated features, which may not always be desirable in scenarios where predictors
are highly correlated. Adaptive LASSO considers LASSO’s strengths and applies better
penalty strategies for feature selection. It overcomes LASSO’s limitations by offering better
feature selection, less bias, and better consistency in high-dimensional data [21,22].
The aim of this study is to develop a two-stage feature-selection technique using the ABC
optimization method alongside AD_LASSO in a high-dimensional dataset. This hybrid
framework seeks to achieve maximum feature-selection performance while at the same
time minimizing model complexity.
y = Xβ + ϵ                                                                   (1)

where X ∈ R^{n×p} is the data (design) matrix of independent features (predictors); the i-th row of X is the vector x_i = (x_{i1}, ..., x_{ip}). y ∈ R^n is the vector of observed dependent features, with n the number of observations and p the number of independent features. β = (β_1, ..., β_p) ∈ R^p is the vector of unknown coefficients, and ϵ ∈ R^n is the vector of random errors. The least squares estimate of β is obtained by minimizing the squared residual norm,

β̂ = argmin_β ∥y − Xβ∥₂²                                                      (2)

where ∥y − Xβ∥₂ represents the L2 norm (Euclidean norm) of the residual vector y − Xβ. This minimization problem has a closed-form solution, called the ordinary least squares (OLS) estimator, defined as

β̂ = (XᵀX)⁻¹ Xᵀ y                                                             (3)
Computation of the OLS estimator depends on certain assumptions, as with most statistical models [30]: the matrix XᵀX must have full rank, i.e., rank(X) = p. When p >> n, the matrix XᵀX becomes singular and non-invertible, making the OLS solution undefined. In addition, overfitting and high correlation among independent features are further challenges of high-dimensional data. Stepwise regression [31]
and LARS (Least Angle Regression) [32] are widely used methods for feature selection in
high-dimensional data. These methods can help manage the complexity of the model while
selecting important features. On the other hand, regularization methods in regression,
such as LASSO and Adaptive LASSO, apply penalties on the size of coefficients to avoid
overfitting and enhance model performance in high-dimensional data by shrinking less
relevant feature coefficients toward zero.
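To make the rank problem concrete, the following minimal R sketch (an illustrative example, not the authors' code) generates a p > n design, confirms that XᵀX cannot have full rank, and fits a penalized model instead; the glmnet package and the chosen dimensions are assumptions made only for illustration.

```r
# Illustrative sketch: with n = 50 and p = 100, X'X is singular, so OLS is
# undefined, while a penalized estimator such as LASSO still produces a fit.
set.seed(1)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- c(rep(1.5, 10), rep(0, p - 10))        # only the first 10 features are relevant
y <- drop(X %*% beta + rnorm(n, sd = 1.5))

qr(crossprod(X))$rank                          # at most n = 50 < p, so solve(t(X) %*% X) fails
library(glmnet)
lasso_path <- glmnet(X, y, alpha = 1)          # L1-penalized fits over a grid of lambda values
```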
L(β; ·) = L(β) + ϕ(β; ·)                                                     (4)

where L(β; ·) is the total loss function, L(β) = ∥y − Xβ∥₂² is the loss function for the linear model, and ϕ(β; ·) is the regularization penalty.
For LASSO, the penalty is the L1 norm of the coefficients, so that the LASSO estimator solves β̂_LASSO = argmin_β { ∥y − Xβ∥₂² + λ Σ_{j=1}^p |β_j| }, where λ ≥ 0 is a tuning parameter that controls the amount of shrinkage of the LASSO coefficients. LASSO stands out due to its ability to perform feature selection by shrinking some coefficients to exactly zero [33]. Because of this, LASSO is especially helpful when the data contain many unimportant features. However, LASSO often chooses only one predictor from a set of correlated features, which may not always be desirable in scenarios where predictors are highly correlated.
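As a hedged illustration of this penalty (continuing the sketch above; the glmnet interface is an assumed implementation, not the authors' code), cross-validation is commonly used to pick λ, and the resulting coefficient vector is sparse:

```r
# Sketch: LASSO with a cross-validated tuning parameter lambda.
library(glmnet)
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)           # alpha = 1 gives the L1 penalty
b_lasso  <- as.numeric(coef(cv_lasso, s = "lambda.min"))[-1]  # drop the intercept
which(b_lasso != 0)                                           # indices of the retained features
```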
Adaptive LASSO replaces the L1 penalty with a weighted version, λ Σ_{j=1}^p w_j |β_j|, where w_j is the weight for each coefficient. Generally, it is set to w_j = 1/|β̂_j| with the LASSO estimator β̂_j. These weights penalize small coefficients more and large coefficients less, thus keeping the necessary features in the model.
Adaptive LASSO handles high-dimensional settings better than LASSO because of this improved weighting. The weights w_j, being inversely proportional to the initial coefficient estimates β̂_j, reduce the bias that would otherwise fall on the larger coefficients while penalizing small ones more heavily. This technique promotes significant features with larger coefficient values and demotes or eliminates unimportant ones through higher penalties, improving feature-selection precision. However, Adaptive
LASSO may have limitations, especially when working with enormous data or data with
significant correlation or nonlinearity.
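A minimal sketch of this weighting, again using glmnet as an assumed implementation rather than the authors' exact code, builds the weights from an initial LASSO fit and passes them as penalty factors:

```r
# Sketch of Adaptive LASSO: weights w_j = 1 / |beta_hat_j| from an initial LASSO fit
# are supplied through penalty.factor; a small constant guards against division by zero.
library(glmnet)
init   <- cv.glmnet(X, y, alpha = 1)
b_init <- as.numeric(coef(init, s = "lambda.min"))[-1]
w      <- 1 / (abs(b_init) + 1e-6)
ada    <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
b_ada  <- as.numeric(coef(ada, s = "lambda.min"))[-1]
which(b_ada != 0)                                # features kept by Adaptive LASSO
```

Features with large initial estimates receive small penalty factors and are shrunk less, which is exactly the behavior described above.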
One-stage selection methods like Adaptive LASSO may perform poorly when there are complex interconnections or dependency structures among features. This is where two-stage techniques become crucial. A two-stage approach enhances one-stage methods by adding a refinement step, here combining Adaptive LASSO with a metaheuristic optimization algorithm. Two-stage techniques allow the importance of features to be reconsidered and fine-tuned, reducing variability and increasing stability. These methods first perform an initial selection, followed by a
secondary process to refine the feature subset, ensuring that only the most significant
features are retained in the model. This proposed approach enhances the reliability
of feature selection in high-dimensional data, where traditional methods may fall
short. In this study, a two-stage ABC-Adaptive LASSO-based hybrid variable-selection
method has been proposed.
Step 1: Initialization: each food source (candidate solution) is generated randomly within the search bounds by

X_i^j = X_min^j + rand(0, 1) · (X_max^j − X_min^j)                           (8)

where j = 1, 2, ..., D. X_max^j and X_min^j represent the upper and lower bounds of the j-th parameter, and rand(0, 1) is a random number between 0 and 1.
Step 2: This step involves evaluating the food sources by objective function. In
this context, we determine the nectar amount (or objective value) associated with each
food source.
Step 3: The process of worker bees: Upon initialization, each worker bee visits its
food source and searches for a neighboring food source with superior nectar quality. The
location of the closest food source for a worker bee Xi is Vi given by the following equation:
V_i^{j_rand} = X_i^{j_rand} + rand(−1, 1) · (X_i^{j_rand} − X_k^{j_rand})    (9)
where X_k is a randomly selected food source, with k ∈ {1, 2, ..., SN} chosen at random and different from i; j_rand ∈ {1, 2, ..., D} is a random integer index; and rand(−1, 1) is a random value between −1 and 1.
Step 4: Selection and assessment of quality: The quality of the new food source is
assessed after the identification of the new food source. Bees will abandon their current
food source in favor of a new one if the latter exhibits superior quality.
Step 5: The process for onlooker bees: The onlooker bees acquire knowledge regarding
the characteristics of the food sources from worker bees after all the worker bees complete
their foraging operations. The onlooker bee assesses the probability of locating a food
source Xi based on the quality of information obtained from all worker bees, represented
as π_i. For each food source X_i, the probability value π_i is determined based on the quality of food source i, as assessed by the worker bee, using Equation (10).
π_i = f_i / Σ_{n=1}^{SN} f_n                                                 (10)
where f_i is the objective (fitness) value associated with food source i. This probability value π_i is then compared with a randomly generated number between 0 and 1; if π_i exceeds the random value, an onlooker bee is assigned to food source X_i and searches for a new food source in its neighborhood.
Step 6: This stage involves preserving the best food source with the best quality.
Step 7: The scout bee: In the scout bee process, a bee substitutes an abandoned food
source with one it has discovered. Each bee in the swarm is assigned its own counter for this process. Upon reaching a specific threshold in its counter value, a bee abandons its food source (the solution) and commences the search for alternative food sources. According
to Equation (8), a scout bee seeks a new food source.
The procedure persists until a specified termination criterion is met by repeating steps
3 to 7.
Since the feature-selection problem is defined as a discrete optimization problem, a
binary version of ABC is needed. Generally, the sigmoid function is applied to convert
continuous values into binary.
S(V_i) = 1 / (1 + e^(−V_i))                                                  (11)

If S(V_i) > rand(0, 1), set V_i = 1; otherwise, set V_i = 0.
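The moves in Equations (8), (9) and (11) can be sketched in R as follows; this is a condensed illustration under assumed settings (continuous positions in [−1, 1], one food source per feature), not the authors' full implementation, and fitness evaluation is left to the ExBIC criterion introduced below.

```r
# Condensed sketch of the binary ABC moves (Equations (8), (9), (11)).
SN <- 60                 # number of food sources (assumed equal to the number of features)
D  <- 60                 # dimension of each solution = number of features
sigmoid <- function(v) 1 / (1 + exp(-v))

# Step 1 (Eq. 8): initialize continuous positions within assumed bounds [-1, 1],
# then binarize them with the sigmoid rule (Eq. 11).
V    <- matrix(runif(SN * D, min = -1, max = 1), nrow = SN, ncol = D)
food <- (sigmoid(V) > matrix(runif(SN * D), SN, D)) * 1     # 1 = feature selected

# Step 3 (Eq. 9): a worker bee perturbs one random dimension towards another source.
neighbour <- function(V, i) {
  k  <- sample(setdiff(seq_len(nrow(V)), i), 1)   # random partner k, different from i
  jr <- sample(seq_len(ncol(V)), 1)               # random dimension j_rand
  Vi <- V[i, ]
  Vi[jr] <- Vi[jr] + runif(1, -1, 1) * (Vi[jr] - V[k, jr])
  Vi                                              # re-binarized and compared greedily (Step 4)
}
# Onlooker probabilities (Eq. 10), the trial counters and the scout phase (Steps 5-7)
# wrap around these moves until the iteration limit is reached.
```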
As noted above, ABC's simplicity and its balance of exploration and exploitation give it a significant advantage over other algorithms and faster convergence, making it a suitable choice for high-dimensional data. To use the ABC algorithm efficiently and reap its benefits, a few crucial factors must be taken into account:
1. Representation of Bees
Each bee in the ABC algorithm represents a potential solution, which is a binary vector
corresponding to a subset of features. For example, given a dataset with 100 features,
a bee might be represented as a vector [1, 0, 0, 1, 0, 1, 0, . . ., 1], where a 1 indicates that the corresponding feature is selected.
2. Objective Function
Choosing the appropriate objective function in the optimization process is critical
to ensure the accuracy and effectiveness of the solution. The Extended-Bayesian
Information Criterion (ExBIC) was utilized as a fitness function for the proposed
feature-selection method. ExBIC is a model selection criterion developed especially
for high-dimensional data and is commonly used for feature selection [36]. ExBIC is
also effective in controlling false positives while balancing model fit and complexity
and is defined by the Equation (12):
where d denotes the number of selected features, n is the number of total observations,
p is the number of all features in the data matrix, and γ is a parameter ranging
between 0 and 1. A more optimal model will have a lower ExBIC value, reflecting
an improved trade-off between model accuracy and complexity. In this case, γ is a
fixed-value parameter, commonly assigned as 0.5, as suggested by [36]. logL is the
logarithm of the likelihood of the model (which is related to the residual sum of
squares in linear regression). In the proposed method, the fitness function ExBIC
will be minimized in the first step using the ABC algorithm for feature selection.
Then, in the second step, the remaining redundant features will be eliminated using Adaptive LASSO, which operates on the refined feature set produced by ABC in the first stage. This strategy seeks to integrate the advantages of both approaches and to provide more precise and efficient results; a sketch of the ExBIC fitness computation is given after this list.
3. The Control Parameters for ABC
By trial and error, the following parameters have been defined for the ABC-based
proposed method:
Number of food sources: SN, set equal to the number of features in the data.
Maximum number of iterations: 100.
Max Limit: 10, where the max limit is how many times a food source can be selected
without improvement before it is abandoned.
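Since Equation (12) is not reproduced above, the sketch below assumes the extended BIC of Chen and Chen [36] in the commonly used form ExBIC = −2 log L + d log n + 2γ log C(p, d) with γ = 0.5; it is an illustrative fitness function for the ABC stage, not the authors' exact code.

```r
# Sketch of an ExBIC fitness for a binary subset `s` (assumed form of Equation (12)).
exbic <- function(s, X, y, gamma = 0.5) {
  d <- sum(s); n <- nrow(X); p <- ncol(X)
  if (d == 0 || d >= n) return(Inf)                 # unusable subsets get the worst score
  fit <- lm(y ~ X[, s == 1, drop = FALSE])
  -2 * as.numeric(logLik(fit)) + d * log(n) + 2 * gamma * lchoose(p, d)
}
# The ABC search minimizes exbic() and keeps the food source with the smallest value.
```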
The flow chart of the proposed two-stage ABC-ADLASSO method is presented in
Figure 1.
Figure 1. Flow chart of the proposed method.
4. Simulation Study

The simulation study was conducted to show the feature-selection performance of the developed ABC-ADLASSO method by comparing it with the AD_LASSO, LASSO, stepwise, and LARS methods under different simulation settings in high-dimensional data. In the simulation study, R Studio was used for all processes. The linear model was used for data generation:

y = Xβ + ϵ,   ϵ ∼ N(0, σ²)                                                   (13)

Six simulation scenarios with high-dimensional settings were considered. The sample size n = 50 is used for each setting. The six scenarios are considered as follows [18]:

Scenario1: p = 60 and σ = 1.5. The rows of the data matrix X are independent. The first 10 features x_{j1}, ..., x_{j10} and the remaining 50 features x_{j11}, ..., x_{j60} are independent in the j-th row. The pairwise correlation among the r-th and d-th components in x_{j1}, ..., x_{j10} is ρ^|r−d| with ρ = 0.5 and r, d = 1, ..., 10. Also, the pairwise correlation among the r-th and d-th components in x_{j11}, ..., x_{j60} is ρ^|r−d| with ρ = 0.5 and r, d = 11, ..., 60.
Scenario2: This is identical to Scenario1, with the exception that ρ = 0.90.
Scenario3: This is identical to Scenario1, with the exception that p = 100.
Scenario4: This is identical to Scenario2, with the exception that p = 100.
Scenario5: p = 60 and σ = 1.5. The features are generated as x_{ji} = Z_{1i} + e_{ji} for j = 1, 2, ..., 5 and x_{ji} = Z_{2i} + e_{ji} for j = 6, 7, ..., 10, where Z_{ji} ∼ N(0, 1) and e_{ji} ∼ N(0, 1/100). The βs are 1.5 for the first 10 components and 0 for the rest of the components.
Scenario6: This is identical to Scenario1, with the exception that p = 100.

The confusion matrix was used to evaluate the performances of the developed and traditional Adaptive LASSO methods. In this matrix, True Positives (TP) are features correctly identified as relevant (correctly determining significant or non-zero coefficients), and False Positives (FP) are irrelevant features incorrectly identified as relevant (zero coefficients incorrectly determined as significant or non-zero). True Negatives (TN) are irrelevant features correctly identified as irrelevant (correctly determining zero coefficients), and False Negatives (FN) are relevant features incorrectly identified as irrelevant (non-zero coefficients incorrectly determined as zero).
Accuracy = (TP + TN) / (TP + TN + FN + FP)                                   (14)

Specificity = TN / (TN + FP)                                                 (15)

Sensitivity = TP / (TP + FN)                                                 (16)
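For illustration, the sketch below generates one dataset from the Scenario 1 design and evaluates a selector with Equations (14)–(16); the coefficient pattern (1.5 for the first 10 features, 0 elsewhere) is assumed to match the one stated for Scenario 5, and a plain LASSO fit stands in for any of the compared methods.

```r
# Sketch: Scenario 1 data generation plus the selection metrics of Equations (14)-(16).
library(MASS); library(glmnet)
set.seed(2024)
n <- 50; p <- 60; rho <- 0.5; sigma <- 1.5
ar1 <- function(k, rho) rho ^ abs(outer(1:k, 1:k, "-"))     # pairwise correlation rho^|r-d|
X <- cbind(mvrnorm(n, rep(0, 10), ar1(10, rho)),            # first 10 correlated features
           mvrnorm(n, rep(0, 50), ar1(50, rho)))            # remaining 50, independent block
beta_true <- c(rep(1.5, 10), rep(0, 50))                    # assumed coefficient pattern
y <- drop(X %*% beta_true + rnorm(n, sd = sigma))

sel <- as.numeric(coef(cv.glmnet(X, y, alpha = 1), s = "lambda.min"))[-1] != 0
rel <- beta_true != 0
TP <- sum(sel & rel);   FP <- sum(sel & !rel)
TN <- sum(!sel & !rel); FN <- sum(!sel & rel)
c(accuracy    = (TP + TN) / (TP + TN + FN + FP),            # Equation (14)
  specificity = TN / (TN + FP),                             # Equation (15)
  sensitivity = TP / (TP + FN))                             # Equation (16)
```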
A total of 300 random repetitions of the simulations are performed. Every simulated
dataset is split into a training set (80%) and a test set (20%) for each iteration of the
simulation. The proposed ABC-ADLASSO, AD_LASSO, LASSO, stepwise, and LARS
methods were implemented on the training set, and the performances of the methods were
analyzed on the testing set.
5. Simulation Results
The simulation study has been performed to demonstrate the impact of increasing
the number of features (p) and the correlation among the features on feature-selection
performance in high-dimensional data. As p increases, the complexity of the data grows,
making feature selection more challenging due to the higher likelihood of including ir-
relevant or redundant features. Among the feature-selection methods compared, the
standard AD_LASSO consistently outperforms traditional one-stage feature-selection meth-
ods, LASSO, LARS, and stepwise, across all scenarios, particularly when dimensionality
increases. The simulation study demonstrates that the proposed two-stage ABC-ADLASSO
method enhances AD_LASSO’s feature-selection performance, achieving superior sensitiv-
ity, specificity, and accuracy values, resulting in more successful outcomes than AD_LASSO
and other compared methods across all scenarios.
In scenarios with lower correlation and lower dimension (p = 60, ρ = 0.50), LASSO
performs similarly to stepwise and LARS, managing acceptable feature selection. As the
correlation among features rises (ρ = 0.90), stepwise demonstrates the most significant
decline in performance, struggling to handle multicollinearity effectively. LASSO and LARS also experience noticeable performance decreases, but LASSO generally outperforms LARS by providing slightly better feature selection in high-correlation settings. However, in these difficult scenarios, neither approach is as effective as AD_LASSO.
Our proposed ABC-ADLASSO feature-selection method consistently outperforms
AD_LASSO, LASSO, stepwise, and LARS across all simulation scenarios, particularly as p
increases and the correlation between features increases. This improvement is attributed to
the ABC’s ability to explore a broader solution space, enabling it to handle multicollinearity
more effectively and avoid local minima. As the correlation among features increases,
traditional one-stage methods like AD_LASSO, LASSO, LARS, and stepwise (BICP) struggle
with bias and selection accuracy, while the proposed two-stage approach achieves feature
selection more robustly. Table 1 shows the performance results for all methods across
different simulation scenarios.
Since the goal is to select important features on a high-dimensional dataset, a random index
was created to select 80 observations for the training set and 20 observations for the test set.
The Large-scale Wave Energy Farm dataset includes 99 WECs, or wave energy con-
verters, with 6300 observations based on Perth and Sydney wave scenarios as predictors
and total power output as the response variable. The main goal is to predict the total power
output of the wave farm based on the coordination of WECs. Since the goal is to select
significant features on a high-dimensional dataset, a random index was created to select
80 observations for the training set and 20 observations for the test set.
The Insurance Company Benchmark (COIL 2000) dataset includes 5000 customer
records, each with 86 features. Among these, 85 are independent variables: 43 sociode-
mographic features and 42 product ownership features. The target variable is number of
mobile home policies, which indicates the number of mobile home insurance policies. Since
the study aims to perform feature selection on a high-dimensional dataset, a random index
was used to partition the data into an 80-observation training set and a 20-observation
test set.
The FRED data used in this study consist of 115 macroeconomic variables obtained
from the Federal Reserve Economic Data (FRED) database of the St. Louis Federal Re-
serve Bank. For this analysis, we focus on the period between 2008 and 2016 to evaluate
high-dimensional regression models with 102 observations. The goal is to perform variable
selection on 114 predictors with one output variable, “Personal Consumption Expendi-
tures Price Index” (PCEPI), using Adaptive-LASSO and the proposed method and to
compare the performance of these approaches using 82 observations for the training set
and 20 observations for the test set.
Each method was applied to every dataset 10 times for feature selection, and the mean,
standard deviation, median, interquartile range (IQR), minimum, and maximum values for
Adjusted R2 , MAE, and RMSE are presented in Tables 2–5.
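The predictive summaries reported in Tables 2–5 can be computed as in the following sketch; `y_test`, `y_hat`, and the number of selected features `d` are placeholders for the outputs of each fitted method, not quantities taken from the paper.

```r
# Sketch of the test-set measures summarized in Tables 2-5: Adjusted R2, RMSE and MAE.
predictive_metrics <- function(y_test, y_hat, d) {
  n_test <- length(y_test)                       # assumes n_test > d + 1
  r2 <- 1 - sum((y_test - y_hat)^2) / sum((y_test - mean(y_test))^2)
  c(adj_R2 = 1 - (1 - r2) * (n_test - 1) / (n_test - d - 1),
    RMSE   = sqrt(mean((y_test - y_hat)^2)),
    MAE    = mean(abs(y_test - y_hat)))
}
```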
Table 4. Real dataset results on Insurance Company Benchmark (COIL 2000) dataset.

Method        Metric        Mean    Std. Dev.  Median  Min     Max     IQR (25th–75th Percentile)
ABC-ADLASSO   Adjusted R2   0.636   0.012      0.628   0.620   0.653   0.624–0.647
              RMSE          0.113   0.002      0.113   0.110   0.118   0.112–0.115
              MAE           0.087   0.002      0.087   0.084   0.092   0.085–0.089
AD_LASSO      Adjusted R2   0.611   0.005      0.612   0.600   0.619   0.608–0.615
              RMSE          0.170   0.003      0.171   0.165   0.176   0.168–0.172
              MAE           0.140   0.003      0.140   0.135   0.146   0.138–0.142
LASSO         Adjusted R2   0.566   0.004      0.568   0.559   0.574   0.563–0.570
              RMSE          0.228   0.003      0.228   0.223   0.235   0.225–0.230
              MAE           0.183   0.003      0.183   0.180   0.191   0.181–0.186
STEPWISE      Adjusted R2   0.476   0.004      0.477   0.470   0.484   0.473–0.479
              RMSE          0.522   0.003      0.523   0.514   0.527   0.520–0.525
              MAE           0.428   0.003      0.428   0.421   0.432   0.426–0.430
LARS          Adjusted R2   0.514   0.005      0.514   0.505   0.522   0.510–0.518
              RMSE          0.468   0.004      0.470   0.462   0.475   0.466–0.471
              MAE           0.378   0.004      0.379   0.372   0.385   0.376–0.381
Table 5. Real dataset results on Federal Reserve Economic Data (FRED) dataset.
6. Discussion
The findings from both the simulation study and the empirical data application vali-
date the benefits of the proposed two-stage feature-selection method utilizing ABC and
Adaptive LASSO. In high-dimensional contexts, feature selection is essential to prevent
overfitting and enhance model interpretability. Our results indicate that ABC-ADLASSO
provides enhanced feature selection and predictive accuracy relative to one-stage ap-
proaches such as AD_LASSO, LASSO, stepwise, and LARS. The initial stage utilizes the
ABC method to effectively reduce the search space, select the most promising features,
and address issues related to high multicollinearity. The second stage employs Adaptive
LASSO to further improve the selected features, guaranteeing that the final model is both
concise and precise. The proposed ABC-ADLASSO method offers advantages such as im-
proved feature-selection accuracy by combining global exploration (ABC) with AD_LASSO.
However, a potential drawback is the need for careful tuning of hyperparameters. The
performance of both ABC and AD_LASSO depends on the chosen hyperparameter settings,
and incorrect tuning may reduce the method’s effectiveness and impact the accuracy of the
results. In the simulation analysis, the suggested method performs well in every situation,
especially as the correlation rises. This approach reduces the complexity of the model and
increases its performance in high-dimensional data.
7. Conclusions
This study has introduced an innovative two-stage feature-selection method that integrates
the ABC metaheuristic optimization method with Adaptive LASSO for high-dimensional data.
The proposed ABC-ADLASSO method was compared with the AD_LASSO, LASSO, stepwise,
and LARS methods under different simulation settings in high-dimensional data and various
real datasets to show the feature-selection performance of the proposed method. The ABC-
ADLASSO method has overcome the shortcomings of single-stage feature-selection methods
by integrating a global optimization algorithm (ABC) in the first stage and enhancing feature
selection using a penalization technique (Adaptive LASSO) in the second stage. According to
the results obtained from simulations and applications on various real datasets, ABC-ADLASSO
has shown significantly superior performance in terms of accuracy, precision, and overall model
performance, particularly in scenarios with high correlation and a large number of features
compared to the other methods evaluated. This two-stage methodology offers a robust and
adaptable solution to handling high-dimensional data, rendering it particularly relevant in
domains such as genetics, bioinformatics, and intricate predictive modeling. Future research
may investigate the integration of this methodology with alternative machine learning classifiers
and its use across different datasets from various fields. Also, in future studies, a comprehensive
comparative analysis of the proposed method with other optimization-based feature-selection
techniques can be performed.
Author Contributions: Conceptualization, E.P.O. and N.S.; methodology, E.P.O. and N.S.; software,
E.P.O. and N.S.; validation, E.P.O. and N.S.; formal analysis, E.P.O. and N.S.; investigation, E.P.O. and
N.S.; resources, E.P.O. and N.S.; data curation, E.P.O. and N.S.; writing—original draft preparation;
writing—review and editing, E.P.O. and N.S.; visualization, E.P.O. and N.S.; supervision, E.P.O. and
N.S.; project administration, E.P.O. and N.S. All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not Applicable.
Informed Consent Statement: Not Applicable.
Data Availability Statement: The data that support the findings of this study are available on request
from the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Sancar, N.; Onakpojeruo, E.P.; Inan, D.; Uzun, O.D. Adaptive Elastic Net Based on Modified PSO for Variable Selection in Cox
Model with High-Dimensional Data: A Comprehensive Simulation Study. IEEE Access 2023, 11, 127302–127316. [CrossRef]
2. Jain, R.; Xu, W. HDSI: High dimensional selection with interactions algorithm on feature selection and testing. PLoS ONE 2021,
16, e0246159. [CrossRef] [PubMed]
3. Amini, A.A.; Wainwright, M.J. High-dimensional analysis of semidefinite relaxations for sparse principal components. In
Proceedings of the IEEE International Symposium on Information Theory ISIT 2008, Toronto, ON, Canada, 6–11 July 2008;
pp. 2454–2458.
4. Holtzman, G.; Soffer, A.; Vilenchik, D. A greedy anytime algorithm for sparse PCA. In Proceedings of the 33rd Conference on
Learning Theory (COLT 2020), Graz, Austria, 9–12 July 2020; pp. 1939–1956.
5. Rouhi, A.; Nezamabadi-Pour, H. Feature Selection in High-Dimensional Data. In Advances in Intelligent Systems and Computing;
Springer: Cham, Switzerland, 2020; Volume 1123, pp. 85–128. Available online: https://link.springer.com/chapter/10.1007/978-
3-030-34094-0_5 (accessed on 7 October 2024).
6. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A Review of Feature Selection Methods for Machine Learning-
Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312. Available online: www.frontiersin.org (accessed on 25 October
2024). [CrossRef] [PubMed]
7. Curreri, F.; Fiumara, G.; Xibilia, M.G. Input Selection Methods for Soft Sensor Design: A Survey. Future Internet 2020, 12, 97.
Available online: https://www.mdpi.com/1999-5903/12/6/97/htm (accessed on 25 October 2024). [CrossRef]
8. Maseno, E.M.; Wang, Z. Hybrid Wrapper Feature Selection Method Based on Genetic Algorithm and Extreme Learning Machine
for Intrusion Detection. J. Big Data 2024, 11, 24. [CrossRef]
9. Bohrer, J.S.; Dorn, M. Enhancing Classification with Hybrid Feature Selection: A Multi-Objective Genetic Algorithm for High-
Dimensional Data. Expert Syst. Appl. 2024, 255, 124518. [CrossRef]
10. Owoc, M.L. Usability of Honeybee Algorithms in Practice. In Towards Nature-Inspired Sustainable Development; IFIP Advances in
Information and Communication Technology; Springer: Cham, Switzerland, 2024; Volume 693, pp. 161–176. Available online:
https://link.springer.com/chapter/10.1007/978-3-031-61069-1_12 (accessed on 15 October 2024).
11. Stamadianos, T.; Taxidou, A.; Marinaki, M.; Marinakis, Y. Swarm Intelligence and Nature-Inspired Algorithms for Solving Vehicle
Routing Problems: A Survey. Oper. Res. 2024, 24, 47. Available online: https://link.springer.com/article/10.1007/s12351-024-008
62-5 (accessed on 15 October 2024). [CrossRef]
12. Karaboga, D. An Idea Based on Honey Bee Swarm for Numerical Optimization; Technical Report TR06; Computer Engineering
Department, Engineering Faculty, Erciyes University: Kayseri, Türkiye, 2005.
13. Karaboga, D.; Kaya, E. An Adaptive and Hybrid Artificial Bee Colony Algorithm (aABC) for ANFIS Training. Appl. Soft Comput.
2016, 49, 423–436. [CrossRef]
14. Nozohour-Leilabady, B.; Fazelabdolabadi, B. On the Application of Artificial Bee Colony (ABC) Algorithm for Optimization
of Well Placements in Fractured Reservoirs: Efficiency Comparison with the Particle Swarm Optimization (PSO) Methodology.
Petroleum 2016, 2, 79–89. [CrossRef]
15. Yarat, S.; Senan, S.; Orman, Z. A Comparative Study on PSO with Other Metaheuristic Methods. In International Series in
Operations Research and Management Science; Springer: Cham, Switzerland, 2021; Volume 306, pp. 49–72. Available online:
https://link.springer.com/chapter/10.1007/978-3-030-70281-6_4 (accessed on 15 October 2024).
16. Theng, D.; Bhoyar, K.K. Feature Selection Techniques for Machine Learning: A Survey of More Than Two Decades of Research.
Knowl. Inf. Syst. 2024, 66, 1575–1637. Available online: https://link.springer.com/article/10.1007/s10115-023-02010-5 (accessed
on 7 October 2024). [CrossRef]
17. Liu, X.Y.; Liang, Y.; Wang, S.; Yang, Z.Y.; Ye, H.S. A Hybrid Genetic Algorithm with Wrapper-Embedded Approaches for Feature
Selection. IEEE Access 2018, 6, 22863–22874. [CrossRef]
18. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
19. Yerlikaya-Özkurt, F.; Taylan, P. Enhancing Classification Modeling Through Feature Selection and Smoothness: A Conic-
Fused Lasso Approach Integrated with Mean Shift Outlier Modelling. J. Dyn. Games 2024, 12, 1–23. Available online: http:
//staging.xml2html.mdpi.lab/articles/appliedmath-3309531 (accessed on 7 October 2024). [CrossRef]
20. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. B 1996, 58, 267–288. [CrossRef]
21. Huang, J.; Ma, S.; Zhang, C.H. Adaptive Lasso for Sparse High-Dimensional Regression Models. Ann. Stat. 2008, 18, 1603–1618.
22. Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. Available online: https:
//www.tandfonline.com/doi/abs/10.1198/016214506000000735 (accessed on 25 October 2024). [CrossRef]
23. Zhang, Z.; Tong, T.; Fang, Y.; Zheng, J.; Zhang, X.; Niu, C.; Li, J.; Zhang, X.; Xue, D. Genome-Wide Identification of Barley ABC
Genes and Their Expression in Response to Abiotic Stress Treatment. Plants 2020, 9, 1281. [CrossRef]
24. Garg, S.; Kaur, K.; Batra, S.; Aujla, G.S.; Morgan, G.; Kumar, N.; Zomaya, A.Y.; Ranjan, R. En-ABC: An Ensemble Artificial Bee
Colony Based Anomaly Detection Scheme for Cloud Environment. J. Parallel Distrib. Comput. 2020, 135, 219–233. [CrossRef]
25. Hancer, E.; Xue, B.; Karaboga, D.; Zhang, M. A Binary ABC Algorithm Based on Advanced Similarity Scheme for Feature
Selection. Appl. Soft Comput. 2015, 36, 334–348. [CrossRef]
26. Chamchuen, S.; Siritaratiwat, A.; Fuangfoo, P.; Suthisopapan, P.; Khunkitti, P. High-Accuracy Power Quality Disturbance
Classification Using the Adaptive ABC-PSO as Optimal Feature Selection Algorithm. Energies 2021, 14, 1238. [CrossRef]
27. Guo, Y.; Zhang, C. A Hybrid Artificial Bee Colony Algorithm for Satisfiability Problems Based on Tabu Search. In Proceedings of
the 3rd IEEE International Conference on Computer and Communications (ICCC 2017), Chengdu, China, 13–16 October 2017;
IEEE: New York, NY, USA, 2018; pp. 2226–2230.
28. Gu, T.; Chen, H.; Chang, L.; Li, L. Intrusion Detection System Based on Improved ABC Algorithm with Tabu Search. IEEJ Trans.
Electr. Electron. Eng. 2019, 14, 1652–1660. Available online: https://onlinelibrary.wiley.com/doi/full/10.1002/tee.22987 (accessed
on 15 October 2024). [CrossRef]
29. Kiliçarslan, S.; Dönmez, E. Improved Multi-Layer Hybrid Adaptive Particle Swarm Optimization Based Artificial Bee Colony for
Optimizing Feature Selection and Classification of Microarray Data. Multimed. Tools Appl. 2024, 83, 67259–67281. Available online:
https://link.springer.com/article/10.1007/s11042-023-17234-4 (accessed on 7 October 2024). [CrossRef]
30. Kumar, H. Decision Making for Hotel Selection Using Rough Set Theory: A Case Study of Indian Hotels. Int. J. Appl. Eng. Res.
2018, 13, 3988–3998.
31. Kutner, M.H.; Nachtsheim, C.J.; Neter, J.; Li, W. Applied Linear Statistical Models, 5th ed.; McGraw-Hill: New York, NY, USA, 2005.
32. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least Angle Regression. Ann. Statist. 2004, 32, 407–499. [CrossRef]
33. Sirimongkolkasem, T.; Drikvandi, R. On Regularisation Methods for Analysis of High Dimensional Data. Ann. Data Sci. 2019,
6, 737–763. Available online: https://link.springer.com/article/10.1007/s40745-019-00209-4 (accessed on 26 November 2024).
[CrossRef]
34. Akay, B.; Karaboga, D.; Gorkemli, B.; Kaya, E. A Survey on the Artificial Bee Colony Algorithm Variants for Binary, Integer, and
Mixed Integer Programming Problems. Appl. Soft Comput. 2021, 106, 107351. [CrossRef]
35. Bansal, J.C.; Joshi, S.K.; Sharma, H. Modified Global Best Artificial Bee Colony for Constrained Optimization Problems. Comput.
Electr. Eng. 2018, 67, 365–382. [CrossRef]
36. Chen, J.; Chen, Z. Extended Bayesian Information Criteria for Model Selection with Large Model Spaces. Biometrika 2008, 95,
759–771. [CrossRef]
37. Communities and Crime—UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/dataset/183/
communities+and+crime (accessed on 24 October 2024).
38. Neshat, M.; Alexander, B.; Sergiienko, N.Y.; Wagner, M. Optimization of Large Wave Farms Using a Multi-Strategy Evolutionary
Framework. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, Cancún, Mexico, 8–12 July 2020.
39. Putten, P. Insurance Company Benchmark (COIL 2000) [Dataset]. UCI Machine Learning Repository. [CrossRef]
40. Federal Reserve Bank of St. Louis. Federal Reserve Economic Data (FRED). Available online: https://fred.stlouisfed.org (accessed
on 27 November 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.