ABSTRACT
Ensemble Effort Estimation (EEE) consists of predicting the software development effort by combining more than one single estimation technique. EEE has recently been investigated in software development effort estimation (SDEE) in order to improve estimation accuracy, and the overall results suggest that EEE yields better prediction accuracy than single techniques. On the other hand, feature selection (FS) methods have been used in SDEE to reduce the dimensionality of a dataset by eliminating irrelevant and redundant features. The SDEE techniques are thus trained on a dataset containing only relevant features, which can improve the accuracy of their estimations. This paper investigates the impact of two Filter feature selection methods, Correlation based Feature Selection (CFS) and RReliefF, on the estimation accuracy of Heterogeneous (HT) ensembles. Four machine learning techniques (K-Nearest Neighbor, Support Vector Regression, Multilayer Perceptron and Decision Trees) were used as base techniques for the HT ensembles of this study. We evaluate the accuracy of these HT ensembles when their base techniques are trained on datasets preprocessed by the two feature selection methods. The HT ensembles use three combination rules: average, median, and inverse ranked weighted mean. The evaluation was carried out by means of eight unbiased accuracy measures through leave-one-out cross validation (LOOCV) over six datasets. The overall results suggest that all the attributes of most of the datasets used are relevant for building an accurate predictive technique, since the ensembles constructed without feature selection generally outperformed the ones using feature selection. As for the combination rule, the median generally produces better results than the other two rules used in this empirical study.

CCS CONCEPTS
• General and reference → Empirical studies; Estimation;

KEYWORDS
Ensemble Effort Estimation, Machine Learning, Feature Selection, Filter, Accuracy

ACM Reference Format:
Mohamed Hosni, Ali Idri, and Alain Abran. 2017. Investigating Heterogeneous Ensembles with Filter Feature Selection for Software Effort Estimation. In 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement (IWSM/Mensura '17), October 25–27, 2017, Gothenburg, Sweden. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3143434.3143456

1 INTRODUCTION
Accurately estimating the effort required to develop a new software system plays a central role in the success of software project management. The use of accurate software development effort estimation (SDEE) techniques can improve the decision-making process when managing the budget and resources needed to develop a software system within the expected time. For that purpose, several SDEE techniques based on expert judgment, statistics and/or machine learning have been proposed [2, 7, 16, 31, 32, 36, 39, 41, 64, 79]. Although expert estimation remains the most frequently used estimation method in practice [40], statistical and machine learning estimation techniques can provide experts with a preliminary set of estimations from which to draw their final estimate.
However, it is well known that no SDEE technique performs better than the others under all circumstances [20, 35, 49]. Hence, using more than one single technique to estimate the effort of a new software project can help to significantly improve the estimation accuracy [36, 49, 73]. From this perspective, ensemble effort estimation (EEE) techniques have recently been investigated in SDEE; they combine more than one single technique by means of a combination rule to predict the effort of a new software project [36, 37]. There are two types of EEE techniques [20, 36]:
(1) Homogeneous EEE, which is divided into two subtypes: (a) ensembles that combine at least two configurations of the same single SDEE technique, and (b) ensembles that combine one meta-model [30] (such as Bagging [10], Boosting [22], Negative Correlation [53], or Random Subspace [28]) with one single SDEE technique; and
(2) Heterogeneous EEE, which combines at least two different single SDEE techniques [20].
According to the systematic review of Idri et al. [36], EEE approaches generally outperformed single predictive techniques. However, obtaining high accuracy when using ensembles depends on two criteria [12, 35]: accuracy and diversity of the ensemble
members (i.e. they make different errors on the same data instances), since using single techniques that generate the same prediction values is not beneficial for ensembles. The accuracy of single techniques mainly depends on the characteristics of the datasets, such as the number of instances, the number of features, noisy data (outliers and errors), and missing data. Hence, there is a need for a preprocessing stage which deals with these issues before building any SDEE technique. In practice, data preprocessing can include cleaning (filling missing data), transforming (scaling features) and/or reducing the data (feature selection or extraction). This paper deals with reducing data by performing a feature selection process for the single techniques used to construct HT ensembles.
In SDEE datasets, each historical software project is described by a set of attributes, known as cost/effort drivers; these attributes are used as inputs of the prediction techniques and consequently influence their estimation accuracy. Therefore, feature selection can eliminate the irrelevant and redundant attributes, which can lead to more accurate effort estimations [6, 13, 35, 66]. Commonly, feature selection approaches are grouped into three main categories [42]: Filters, Wrappers, and Embedded methods.
To the authors' knowledge, only two papers have investigated the impact of the preprocessing step on HT ensembles [37]. The study of Kocaguneli et al. [49] used 10 preprocessing options with 9 estimation techniques to construct the candidate single techniques. Among these 10 preprocessing options, two feature selection techniques were used: sequential forward selection and stepwise regression. However, the main motivation for using these 10 preprocessing options was to generate diverse single techniques, and their study did not compare ensembles with and without the preprocessing step. Note that this study evaluated both homogeneous and HT ensembles. The study of Azhar et al. [4] was a projection of [49] onto web effort estimation.
The main purpose of this paper is to evaluate whether or not Filter feature selection methods improve the accuracy of HT ensembles. Therefore, we evaluate HT ensembles whose four ML base techniques were preprocessed with two Filter methods, and we compare them with HT ensembles constructed without feature selection. Each ensemble uses three combination rules to generate the final estimation: average, median, and inverse ranked weighted mean. Unlike the work conducted by Kocaguneli et al. [49], which aimed to evaluate whether ensembles outperform single techniques, this paper aims to evaluate the impact of feature selection methods on the accuracy of HT ensembles. To this aim, two Filter methods were used: Correlation based Feature Selection (CFS) [27] and RReliefF [66]. The rationale behind choosing the four ML techniques (K-Nearest Neighbor (Knn), Support Vector Regression (SVR), Multilayer Perceptron (MLP) and Decision Trees (DTs)) as base methods for the HT ensembles is that they represented 85% of the ML techniques investigated by the 65 studies selected in the systematic review of Wen et al. [79]. In addition, this empirical study uses unbiased measures such as Standardized Accuracy, Mean Balanced Relative Error, Mean Inverted Balanced Relative Error, and Logarithmic Standard Deviation [5, 35, 59, 72] to assess the accuracy of the proposed HT ensembles.
In total, this study evaluates 54 HT ensembles: 3 ensembles (2 with feature selection, CFS and RReliefF, and one without feature selection) * 3 (combination rules) * 6 (datasets), and it aims at addressing two research questions (RQ):
• (RQ1): Does the use of feature selection methods lead HT ensembles to generate better estimation accuracy than ensembles constructed without feature selection?
• (RQ2): Among the three linear rules used, which one leads the HT ensembles to generate more accurate estimations?
The main contributions of this paper are:
• Evaluating the impact of feature selection methods on the predictive capability of HT ensembles over six datasets.
• Using two Filter feature selection methods to select the relevant attributes used by the base techniques of the ensembles.
• Using unbiased measures to assess the performance of the proposed HT ensembles.
• Setting the parameter values of the single techniques using a grid search.
The rest of the paper is organized as follows: Section 2 presents the two Filter feature selection methods used in this study. Section 3 presents an overview of related work on EEE techniques. Section 4 presents an overview of the four ML techniques used in this paper and the methodology used to set their parameter values. Section 5 presents the empirical design of this study. Section 6 presents and discusses the empirical results obtained. Section 7 presents the threats to validity of this study. Section 8 presents the conclusion and future work.

2 FEATURE SELECTION METHODS
This section presents an overview of feature selection methods, in particular those used in this paper. Feature selection is a data preprocessing step that reduces the data size by selecting the most informative and relevant features to be used as inputs of a prediction system [71]. This preprocessing has many advantages, such as improving the performance of a predictive technique, reducing the size of a dataset, which helps to better understand the impact of each attribute, and avoiding the overfitting problem [25, 68].
The literature defines three main types of feature selection methods [68]: Filter, Wrapper, and Embedded. Filter methods rely on the characteristics of a dataset to select the most relevant features [14, 75], whereas Wrapper methods select the best subset of features by evaluating the impact of different subsets of features on the performance of a predictive system [29, 70]. Embedded methods attempt to determine the optimal subset of features by taking this process into account while a given predictive technique is trained [24, 26].
A Filter approach performs the feature selection process with respect to a chosen performance measure. These performance measures can be grouped into five groups [42]: information, distance, consistency, similarity, and statistical tests. The output of Filter methods can be a ranking of individual features or a best feature subset. Two Filter feature selection methods were used in this study: Correlation based Feature Selection [27] and RReliefF [66]. These two methods use different performance measures [68] (information for CFS and distance for RReliefF) and provide different outputs, since CFS results in a subset of features while RReliefF provides a ranking of features.
These two techniques have already been used in SDEE [18, 56]. In fact, Deng et al. [18] used the ReliefF feature selection method on the Desharnais dataset, and the performance of the Knn technique was improved with log-transformed effort data. Minku et al. [56] reported that the performance of Bagging with RBF, RBF, and Negative Correlation Learning was improved when CFS with greedy stepwise search was used in the preprocessing stage. Moreover, these two techniques have been investigated in other fields such as software quality [44], pattern recognition [52], and bioinformatics [77], where they improved the accuracy results. A brief description of each selection method is given below.
Correlation based Feature Selection was proposed by Hall [27]. This method finds the best combination of features that are highly correlated with the target variable and uncorrelated with each other. It is a multivariate feature Filter, which means that it assesses different feature subsets and chooses the best one. However, the output of this method is highly dependent on the search strategy used, such as greedy forward selection, backward elimination, or best-first. Generally, search strategies can be grouped into three categories: sequential, exponential, and randomized. In this paper, the best-first algorithm was used [27], since it evaluates all the possible combinations and can update the subset of selected features during the evaluation process, unlike greedy forward selection and greedy backward elimination, which do not update the subset of features during the evaluation process.
Another challenge when using correlation based Filters is the starting point for feature subset generation. The literature defines four starting points [42]: forward selection, backward elimination, bidirectional selection, and heuristic feature selection. In this paper, bidirectional selection was used since it has two starting points at the same time, an empty set and the whole set, which is not the case for the remaining methods.
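For completeness, the subset-scoring heuristic behind CFS is the merit function of Hall [27] (not reproduced in the paper itself), which rewards features that correlate with the effort while penalizing redundancy within the subset:

Merit_S = (k · r̄_cf) / sqrt(k + k(k−1) · r̄_ff)

where S is a candidate subset of k features, r̄_cf is the mean feature–target correlation and r̄_ff is the mean feature–feature inter-correlation. The best-first search mentioned above explores candidate subsets while maximizing this merit.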
The first version of RReliefF-based feature selection, called Relief, was proposed by Kira and Rendell [47]. Relief estimates the relevance of features according to how well their values distinguish between instances that are close to each other. However, Relief was limited to two-class problems. In 1994, Kononenko proposed an improved version of Relief to deal with multi-class problems [50]. The new algorithm, termed ReliefF, is more robust and able to deal with incomplete and noisy data. However, both algorithms are applicable only to classification tasks. RReliefF, an update of the ReliefF algorithm, was proposed by Robnik-Šikonja et al. in 1997 in order to deal with both classification and regression tasks [66]. These algorithms have been successfully used for feature selection [19] and applied for many purposes, including attribute weighting [80] and selecting splits for regression trees [67]. They are univariate feature Filters, which means that they assess each attribute individually and provide a ranking of the attributes according to their relevance. However, the main issue that arises when using RReliefF is how many attributes to keep. This is still an open question, even if some studies [44, 46] proposed to select log2(N) attributes, where N is the number of features in the initial set; this study used this recommendation. Lastly, since software development effort estimation is a regression problem, the RReliefF algorithm was used.

3 RELATED WORK
This section presents an overview of EEE studies, in particular selected studies from the systematic review of Idri et al. [36] that investigated the use of feature selection methods in EEE.
Idri et al. performed a systematic map and review of EEE studies published between 2000 and 2016 [36, 37]. Their review included 24 papers and found that homogeneous ensembles were the most frequently used, since they were investigated by 17 out of the 24 papers. Furthermore, ML techniques were the most frequently used ensemble members, and ANNs and DTs were the two single ML techniques most often used (both were used by 12 of the selected studies). As for the combination rules, they found that 20 combination rules were used to generate the final estimation of the ensembles. With regard to estimation accuracy, the ensembles generally produced more accurate results than their single techniques, especially when linear combiners were used.
With regard to HT ensembles, Idri et al. [36, 37] found that 9 out of the 24 selected studies investigated HT ensembles. Further, HT ensembles used 12 combinations of single techniques, and the most frequent member techniques were DTs and Knn. The HT ensembles yielded better performance than their members (with MMRE=62.48%, Pred(25)=43.35%, and MdMRE=26.80%).
Since this paper concerns HT ensembles and feature selection, Table 1 summarizes the findings of the two papers dealing with both topics [4, 49].

4 BASE TECHNIQUES AND PARAMETERS SETTING
This section presents an overview of the four single ML techniques used in this paper, as well as how their parameter values were tuned.

4.1 K-nearest Neighbor
Knn is a non-parametric technique used for classification and regression tasks [3]. It is one of the simplest ML techniques, since its process is based on the analogy reasoning "similar cases have similar solutions". Knn, or Case-Based Reasoning, has been used in SDEE to derive the effort of a new project by aggregating the actual effort values of its k most similar projects [34]. The similarity measure used in this paper is the Euclidean distance, while the number of similar projects was varied from 1 to 12 (see Table 2).

4.2 Support Vector Regression
Support Vector Machine (SVM) is a set of machine learning methods used for classification and regression. The implementation of SVM for regression analysis is called Support Vector Regression (SVR), and it was proposed in 1996 by Cortes and Vapnik [74]. It is an ML technique based on statistical theory and it was first applied in SDEE by Oliveira [63]. Three parameters define an SVR model and can have a significant influence on its performance: the complexity parameter, denoted C, the extent to which deviations are tolerated, denoted Epsilon (ϵ), and the kernel [63].
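To make the analogy reasoning of Section 4.1 concrete, the sketch below derives an effort estimate as the mean actual effort of the k nearest historical projects. Aggregating with the mean and the toy figures used in the example are illustrative assumptions, not values from the paper.

```python
import numpy as np


def knn_effort(train_X, train_y, new_project, k=2):
    """Analogy-based estimate: mean actual effort of the k historical projects
    closest to the new one, with similarity measured by Euclidean distance."""
    X = np.asarray(train_X, dtype=float)
    distances = np.linalg.norm(X - np.asarray(new_project, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(np.asarray(train_y, dtype=float)[nearest]))


# Toy example: three historical projects described by two cost drivers
projects = [[10, 3], [12, 4], [40, 9]]
efforts = [120.0, 150.0, 700.0]
print(knn_effort(projects, efforts, [11, 3], k=2))   # -> 135.0
```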
deciding upon: (1) the minimum number of instances per leaf (I), (2) the minimum variance for a split, (3) the maximum tree depth, and (4) pruning. For the rest of this paper, we use the term DTs to refer to REPTrees.
4.5 Parameters settings
Prior work in SDEE has shown that the accuracy of ML SDEE techniques is not stable across different circumstances (i.e. datasets) [37, 49, 76]. Hence, using the same parameter values for a given technique across different datasets can affect its estimation accuracy. Therefore, this study determines the optimal parameter values of each ML technique for each dataset. For that purpose, a grid search was performed to select the best configuration of each ML technique through a large number of preliminary executions; the configuration that provided the best estimation accuracy in terms of mean absolute error for each ML technique was then chosen. Thereby, only the most accurate configurations of the four ML techniques were used to build the proposed HT ensembles. Table 2 lists the predefined search space for the optimal parameter values of each ML technique.

Table 2: Search spaces of parameters values of each ML technique for Grid Search.
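To make the tuning procedure concrete, the sketch below shows a plain grid search driven by leave-one-out MAE. It is an illustrative reconstruction only: the paper's prototype was built on the Weka API in Java, the actual search spaces of Table 2 are not reproduced here, and scikit-learn's KNeighborsRegressor appears purely as an example base learner.

```python
from itertools import product

import numpy as np


def loocv_mae(build_model, params, X, y):
    """Mean absolute error of one configuration under leave-one-out CV."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    abs_errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out project i
        model = build_model(**params)
        model.fit(X[mask], y[mask])
        abs_errors.append(abs(y[i] - model.predict(X[i:i + 1])[0]))
    return float(np.mean(abs_errors))


def grid_search(build_model, grid, X, y):
    """Return the configuration with the lowest LOOCV MAE."""
    best_params, best_mae = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mae = loocv_mae(build_model, params, X, y)
        if mae < best_mae:
            best_params, best_mae = params, mae
    return best_params, best_mae


# Example: tuning the number of neighbours of a Knn regressor (1 to 12, as in Section 4.1)
# from sklearn.neighbors import KNeighborsRegressor
# best, mae = grid_search(KNeighborsRegressor, {"n_neighbors": list(range(1, 13))}, X, y)
```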
5 EMPIRICAL DESIGN
This section presents the empirical design of this study, including: (1) the accuracy measures and statistical tests used to evaluate the proposed HT ensembles, (2) the descriptions of the datasets used, and (3) the experimental process pursued to construct and compare the ensembles.
5.1 Accuracy measures and statistical tests
According to the systematic literature review (SLR) conducted by Wen et al. [79], the most frequent accuracy measures used to evaluate ML techniques in SDEE are the predictor at level 25% (Pred(25), Eq. (4)) and the mean magnitude of relative error (MMRE, Eq. (3)). Moreover, Idri et al. confirmed in their SLR of EEE that the majority of the selected studies also used these two measures to assess EEE techniques. Indeed, the Pred and MMRE measures are derived from the magnitude of relative error (MRE) defined by Eq. (2). However, this local accuracy measure (MRE) is not reliable, since it has been criticized for being biased toward under-estimates [21, 58, 61], which may lead to inconsistent assessments of the estimation techniques. To avoid this limitation, Miyazaki et al. [58] proposed two accuracy measures, the Mean Balanced Relative Error (MBRE, Eq. (11)) and the Mean Inverted Balanced Relative Error (MIBRE, Eq. (12)), which are considered less vulnerable to bias and asymmetry. Another accuracy measure used in SDEE is the logarithmic standard deviation (LSD, Eq. (13)) [21, 55, 57].
This study also used the mean of absolute errors (MAE, Eq. (5)), which does not present any of the issues mentioned above; it averages the absolute errors (AE, Eq. (1)). Meanwhile, the interpretation of MAE is difficult since the residuals are not normally distributed. To avoid this limitation, Shepperd and MacDonell [72] suggested a new accuracy measure, Standardized Accuracy (SA), based on MAE. SA evaluates whether a given estimation technique outperforms the baseline of random guessing (P0), see Eq. (14). The interpretation of SA is that the ratio represents how much better a prediction technique (Pi) is than random guessing (P0): a high value means that Pi is much better than random guessing, a value near zero is discouraging, and a negative value would be worrisome [72]. Moreover, since the median is more robust than the mean and not sensitive to outliers, the median of absolute errors (MdAE, Eq. (6)), the median of Balanced Relative Error (MdBRE, Eq. (9)), and the median of Inverted Balanced Relative Error (MdIBRE, Eq. (10)) were also used.
Note that Shepperd and MacDonell [72] recommended using the 5% quantile of random guessing to evaluate the likelihood of non-random estimation. The interpretation of this 5% quantile is similar to the use of α in conventional statistical inference: any accuracy value that is better than this threshold has less than a one-in-twenty chance of having occurred randomly.
To verify whether the predictions of a technique are generated by chance and whether there is an improvement over random guessing, the effect size (∆) criterion defined by Eq. (15) was used. The absolute values of ∆ can be interpreted in terms of the categories proposed by Cohen [15]: small (≈ 0.2), medium (≈ 0.5) and large (≈ 0.8). A medium or large value of ∆ indicates an acceptable degree of confidence in the technique's predictions over random guessing.
To sum up, eight accuracy measures were used in this study (MAE, MdAE, MIBRE, MdIBRE, MBRE, MdBRE, Pred(25), and LSD). These criteria are not biased toward under-estimates, unlike MRE-based measures such as MMRE. The main reason behind using several accuracy measures is that the accuracy of an estimation technique behaves differently from one measure to another; so instead of relying on a single accuracy measure, we draw conclusions from various measures.

AE_i = |e_i − ê_i|   (1)
MRE_i = AE_i / e_i   (2)
MMRE = (1/N) · Σ_{i=1}^{N} MRE_i   (3)
Pred(0.25) = (100/N) · Σ_{i=1}^{N} [1 if MRE_i ≤ 0.25, 0 otherwise]   (4)
MAE = (1/N) · Σ_{i=1}^{N} AE_i   (5)
MdAE = Median(AE_1, AE_2, ..., AE_N)   (6)
BRE_i = AE_i / min(e_i, ê_i)   (7)
IBRE_i = AE_i / max(e_i, ê_i)   (8)
MdBRE = Median(BRE_1, BRE_2, ..., BRE_N)   (9)
MdIBRE = Median(IBRE_1, IBRE_2, ..., IBRE_N)   (10)
MBRE = (1/N) · Σ_{i=1}^{N} AE_i / min(e_i, ê_i)   (11)
MIBRE = (1/N) · Σ_{i=1}^{N} AE_i / max(e_i, ê_i)   (12)
LSD = sqrt( Σ_{i=1}^{N} (λ_i + s²/2)² / (N − 1) )   (13)
SA = 1 − MAE_{Pi} / MAE_{P0}   (14)
∆ = (MAE_{Pi} − MAE_{P0}) / s_{P0}   (15)

Where:
• e_i and ê_i are the actual and predicted effort for the i-th project.
• MAE_{P0} is the mean value of a large number of runs of random guessing. This is defined as: predict the effort of the target project i by randomly sampling (with equal probability) over all the remaining n−1 cases and taking ê_i = e_r, where r is drawn randomly from 1...n with r ≠ i. This randomization procedure is robust since it makes no assumptions and requires no knowledge about the population.
• MAE_{Pi} is the mean of absolute errors of prediction technique i.
• s_{P0} is the sample standard deviation of the random guessing strategy.
• λ_i = ln(e_i) − ln(ê_i).
• s² is an estimator of the variance of the residuals λ_i.
In this paper, the leave-one-out cross validation (LOOCV, or Jackknife) method was used [65]. At each step, one project is used for testing and the remaining instances for training. This process is performed n times, where n is the number of instances in the dataset.
As for the statistical test, since we evaluate multiple prediction techniques, the most suitable test is Scott-Knott (SK) [69]. Indeed, the SK test can deal with the multiple-comparison problem, which makes pairwise comparisons unnecessary. Moreover, the SK test accounts for Type I error [8] and identifies non-ambiguous groups, which cannot be obtained with other multiple-comparison tests such as the Tukey and Student-Newman-Keuls tests.
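Putting the measures of this subsection together, the listing below computes them for one technique, along with a Monte-Carlo approximation of the random-guessing baseline P0 used by SA and the effect size. It is a sketch under the assumption that the baseline is approximated by repeated random sampling, as suggested in [72]; the number of runs and the handling of the 5% quantile are illustrative choices.

```python
import numpy as np


def accuracy_report(actual, predicted, runs=1000, seed=1):
    """Unbiased measures of Section 5.1 plus the SA / effect-size comparison
    against random guessing (Eqs. (1)-(15))."""
    e, p = np.asarray(actual, float), np.asarray(predicted, float)
    ae = np.abs(e - p)                                   # Eq. (1)
    bre = ae / np.minimum(e, p)                          # Eq. (7)
    ibre = ae / np.maximum(e, p)                         # Eq. (8)
    lam = np.log(e) - np.log(p)
    s2 = np.var(lam, ddof=1)
    lsd = np.sqrt(np.sum((lam + s2 / 2.0) ** 2) / (len(e) - 1))   # Eq. (13)

    # Random-guessing baseline P0: predict project i with the actual effort
    # of another project drawn at random (Monte-Carlo approximation).
    rng = np.random.default_rng(seed)
    guess_maes = []
    for _ in range(runs):
        idx = [rng.choice(np.delete(np.arange(len(e)), i)) for i in range(len(e))]
        guess_maes.append(np.mean(np.abs(e - e[idx])))
    guess_maes = np.array(guess_maes)
    mae_p0, sd_p0 = guess_maes.mean(), guess_maes.std(ddof=1)

    return {
        "MAE": ae.mean(), "MdAE": np.median(ae),                 # Eqs. (5), (6)
        "MBRE": bre.mean(), "MdBRE": np.median(bre),             # Eqs. (11), (9)
        "MIBRE": ibre.mean(), "MdIBRE": np.median(ibre),         # Eqs. (12), (10)
        "Pred(25)": 100.0 * np.mean(ae / e <= 0.25),             # Eq. (4)
        "LSD": lsd,
        "SA": 1.0 - ae.mean() / mae_p0,                          # Eq. (14)
        "effect_size": (ae.mean() - mae_p0) / sd_p0,             # Eq. (15)
        # SA of the 5% quantile of random guessing, used as an acceptance threshold
        "SA_5pct": 1.0 - np.quantile(guess_maes, 0.05) / mae_p0,
    }
```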
5.2 Datasets
To evaluate the accuracy of the proposed single and ensemble techniques, six datasets were selected. Selecting a large number of datasets that are diverse in terms of size and number of features supports a rigorous analysis of the results. These datasets were collected from the PRedictOr Models In Software Engineering (PROMISE) repository, a publicly available online data repository [54]: Albrecht [2], China [54], COCOMO81 [7], Desharnais [17], Kemerer [45] and Miyazaki [59].
Table 3 presents the characteristics of the six selected datasets, including the size of the dataset, the unit of effort, the number of attributes, and the descriptive statistics of the effort (minimum, maximum, mean, median, skewness, and kurtosis). Note that the effort values of the historical data do not follow a normal distribution, based on the skewness and kurtosis coefficients [5, 11] presented in the last two columns of Table 3. Tables A.7–A.12 of Appendix A list the cost/effort drivers of the six datasets.
The COCOMO81 dataset used in this paper contains 252 projects instead of the 63 projects of the original one. In the original dataset, the features were measured on a nominal scale consisting of six linguistic values. For each pair of project and linguistic value, four numerical values were randomly generated according to the classical interval used to represent the linguistic value (63*4=252; see [33] for details on how the 252 projects were obtained from the 63 original ones).
Table 3: Characteristics of the six selected datasets (the Min to Kurtosis columns describe the effort).
Dataset     Size  Unit        Features  Min   Max     Mean     Median  Skewness  Kurtosis
Albrecht    24    Man/Months  6         0.5   105     21.87    11      2.30      4.7
COCOMO81    252   Man/Months  12        6     114000  683.44   98      4.39      20.5
China       499   Man/Hours   15        26    54620   3921.04  1829    3.92      19.3
Desharnais  77    Man/Hours   11        546   23940   4833.90  3542    2.03      5.3
Kemerer     15    Man/Months  6         23    1107    219.24   130     3.07      10.6
Miyazaki    48    Man/Months  7         5.6   1586    87.47    38      6.26      41.3
5.3 Methodology used
This section describes the process followed to build and evaluate the proposed ensembles. As stated in Section 4, this paper used two Filter feature selection methods to select the most appropriate features to be used as inputs of the four ML techniques (i.e. the ensemble members). Thereafter, for each feature selection method, the four ML techniques were built using a grid search to choose the optimal configuration of each one. The ensembles were then constructed using the best single ML techniques (i.e. with optimal configurations) and three combination rules: average, median, and inverse ranked weighted mean. These 36 ensembles (2 feature selection methods * 3 combination rules * 6 datasets) were compared to the 18 ensembles built without feature selection (18 ensembles = 3 rules * 6 datasets).
The methodology performed on each dataset is as follows:
Step 1: Apply the two feature selection methods (CFS and RReliefF) in order to select the most relevant features of the dataset.
Step 2: Build and evaluate, by means of the mean absolute error, the four single ML techniques using a grid search on:
• the original dataset,
• the dataset reduced by Correlation based Feature Selection, and
• the dataset reduced by RReliefF.
Step 3: Construct the HT ensembles using three combination rules (a small sketch of these rules is given after this list):
• average,
• median, or
• inverse ranked weighted mean (based on the rank of the four ML techniques with respect to MAE).
Step 4: Assess the nine HT ensembles according to SA and effect size, and only retain the ones that generate better predictions than the 5% quantile of random guessing.
Step 5: Cluster the selected HT ensembles using the SK test based on MAE.
Step 6: Rank the HT ensembles that belong to the best cluster of the SK test using the Borda Count based on 8 performance measures (MAE, MdAE, MIBRE, MdIBRE, MBRE, MdBRE, Pred(25), and LSD).
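As announced in Step 3, the sketch below shows how the three linear rules can combine the four base estimates for one project. The exact weighting of the inverse ranked weighted mean is not spelled out in the paper; the version used here, in which the most accurate member (by MAE) receives the largest weight, is one common formulation and should be read as an assumption, and the figures in the example are invented.

```python
import numpy as np


def combine(predictions, member_maes, rule="median"):
    """Combine the base estimates for one project (Step 3).

    predictions: effort estimates of the base techniques (Knn, SVR, MLP, DTs);
    member_maes: their MAEs, used only by the inverse ranked weighted mean."""
    p = np.asarray(predictions, float)
    if rule == "average":
        return float(p.mean())
    if rule == "median":
        return float(np.median(p))
    if rule == "irwm":
        # Best member (lowest MAE) gets weight M, ..., worst member gets weight 1.
        order = np.argsort(np.asarray(member_maes, float))   # best first
        weights = np.empty(len(p))
        weights[order] = np.arange(len(p), 0, -1)
        return float(np.sum(weights * p) / weights.sum())
    raise ValueError(rule)


# Illustrative figures only: four member estimates and their training MAEs
estimates = [420.0, 380.0, 510.0, 465.0]
maes = [55.0, 48.0, 70.0, 60.0]
print(combine(estimates, maes, "average"),   # 443.75
      combine(estimates, maes, "median"),    # 442.5
      combine(estimates, maes, "irwm"))      # 422.0
```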
Note that before performing the SK test (Step 5), we evaluate whether the absolute errors of the HT ensembles follow a normal distribution by means of the Kolmogorov-Smirnov statistical test [51], since the SK test requires normally distributed inputs. The Box-Cox transformation is applied to make the absolute errors follow a normal distribution.
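A minimal sketch of this normality check and transformation is given below; it assumes SciPy's kstest against a fitted normal as a stand-in for the Lilliefors variant of the KS test and for the R-based workflow actually used in the paper.

```python
import numpy as np
from scipy import stats


def normalize_abs_errors(abs_errors, alpha=0.05):
    """Check normality of the absolute errors and, if it is rejected,
    Box-Cox transform them before running the Scott-Knott test."""
    ae = np.asarray(abs_errors, float)
    _, p_value = stats.kstest(ae, "norm", args=(ae.mean(), ae.std(ddof=1)))
    if p_value >= alpha:
        return ae, None                           # already roughly normal
    transformed, lam = stats.boxcox(ae + 1e-9)    # small shift guards against zero errors
    return transformed, lam
```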
For the sake of clarity, the following abbreviations are used:
• Ensembles whose single techniques were trained on the original dataset are denoted OD.
• Ensembles whose single techniques were trained on the dataset reduced by CFS are denoted CFS.
• Ensembles whose single techniques were trained on the dataset reduced by RReliefF are denoted R.
As for the combination rules, they are abbreviated as follows: average (AV), median (ME), and inverse ranked weighted mean (IR). For example, RAV denotes an ensemble whose four single ML techniques were trained on a dataset reduced by RReliefF and combined using the average rule.

6 EMPIRICAL RESULTS
This section presents and discusses the results of the empirical experiments performed according to the methodology described in Section 5.3. Different tools were used to carry out these experiments: a software prototype based on the Weka API was developed in Java under a Microsoft environment to construct the proposed HT ensembles, while the feature selection methods, the statistical tests (Kolmogorov-Smirnov and SK tests) and the Box-Cox transformation were performed using the R software [1].

6.1 Feature Selection results
This subsection presents the results of Step 1 of our methodology. Table 4 shows the features selected by each feature selection method in each dataset. From Table 4 it can be observed that the two methods selected different subsets of features. Concerning the number of selected features, the CFS method selected at least 50% of the features available in each dataset. In contrast, the number of features selected by the RReliefF method was predefined, since we only kept the log2(N) top-ranked features (N being the number of features in a given dataset). Moreover, the subsets generated by the two methods for each dataset have at least one feature in common. Further, all the features selected by RReliefF in the China, COCOMO81, and Miyazaki datasets were also selected by CFS. The main conclusion that can be drawn from this step is that using different feature selection methods results in different subsets of features, which can lead to building diverse SDEE techniques.

Table 4: Selected features in each dataset (each number refers to a feature ID).
6.2 Evaluation of Heterogeneous Ensembles
The second step uses the original and the reduced datasets to determine the optimal configuration of each single ML technique; this configuration is determined by a grid search that minimizes the mean absolute error. In the third step, the HT ensembles were built by combining the four best single techniques on each dataset with the three combination rules (average, median, and inverse ranked weighted mean). This subsection presents the results of the fourth step of the methodology, which consists of assessing the nine HT ensembles in terms of SA and effect size.
The SA value of each HT ensemble is computed and compared to the SA of the 5% quantile of random guessing; only the ensembles that outperform this threshold are retained. The effect size values are used to verify whether the ensembles are truly predicting, and only the ensembles showing a large improvement over random guessing are selected (i.e. ∆ > 0.8). This process is conducted for each dataset through the LOOCV method.

Table 5: SA and effect size values of the nine ensembles over the six datasets.

Table 5 presents the SA and effect size values of the nine ensembles over the six datasets. The second row of Table 5 indicates the SA of the 5% quantile of random guessing (SA5%) for each dataset. From Table 5, all ensembles generated better results than the 5% quantile of random guessing. The main observations are the following. In the Albrecht dataset, the three CFS ensembles generated better estimations than the other ensembles (SA values ranged from 74.4% to 79.9% in this dataset). In the China dataset, the three OD ensembles were ranked first, followed by the CFS and R ensembles (SA values ranged from 52.2% to 88.4%). For the COCOMO81 and Desharnais datasets, the ODME ensemble generated the best estimates (SA=93.4% and 58.9% respectively). For the Kemerer and Miyazaki datasets, the ODIR ensemble provided the most accurate results with respect to the random guessing baseline, with SA values of 56.3% and 60.41% respectively. However, the lowest SA values in most datasets were obtained by the RReliefF ensembles, especially the ones using the average and median rules. In terms of effect size, the ∆ values of all ensembles over all datasets were greater than 0.8, which implies that the predictions were not generated by chance and that the improvement over random guessing was large. Hence, all nine HT ensembles were selected for the next experiments.
6.3 Ranking the Ensembles
This section presents and discusses the results of applying the SK test to cluster the nine ensembles into non-overlapping groups in order to identify the best ones (i.e. those with the lowest MAE). Techniques that belong to the same cluster show similar predictive capability. Before performing the SK test, we evaluated whether the absolute errors of the ensembles followed a normal distribution. We found that in general they did not; therefore, the absolute errors of all ensembles on each dataset were transformed using the Box-Cox transformation, and the SK test was performed on the transformed absolute errors.
Fig. 1 shows the results of the SK test on each dataset. The SK test identified 4 clusters for the China and COCOMO81 datasets and 2 clusters for the Miyazaki dataset, which implies that there is a significant difference between the ensembles in these three datasets. For the remaining datasets, only one cluster was identified, which implies that the nine ensembles have the same predictive capability. In summary, the best cluster of the COCOMO81 dataset contains only one ensemble (ODME), whose base techniques were trained on the original dataset and combined with the ME rule. For the China dataset, the best cluster contains the two OD ensembles that use the ME and IR combiners. These results show that, for the COCOMO81 and China datasets, ensembles based on single techniques without feature selection performed significantly better in terms of MAE than the others. As for the Miyazaki dataset, the best cluster contained 6 ensembles (the three CFS and the three OD ensembles), which shows that the use of RReliefF feature selection did not lead to accurate ensembles for this dataset.

Figure 1: SK test results on each dataset. The x-axis represents the selected techniques sorted so that the better positions start from the right side. Lines with the same color belong to the same cluster. The y-axis represents the transformed AEs; each vertical line shows the variation of the transformed AEs of one technique, and the small circle represents their mean.

In order to decide which ensemble is more accurate on each dataset, we ranked the ensembles of the best clusters using the Borda Count voting system based on the 8 performance measures. Table 6 lists the ranks of the best ensembles for each dataset. It can be observed that:
(1) The three CFS ensembles were ranked at the top three positions in the Albrecht dataset, the CFSME ensemble was ranked first in the Miyazaki dataset, and two CFS ensembles (CFSME and CFSIR) were ranked at the 2nd and 3rd positions in the Kemerer dataset.
(2) The ODME ensemble was ranked first in three datasets (China, COCOMO81 and Desharnais) and second in the Miyazaki dataset; the ODIR ensemble was ranked first in the Kemerer dataset and second in the COCOMO81 and Desharnais datasets; and the ODAV ensemble was ranked third in the Desharnais and Miyazaki datasets.
(3) None of the RReliefF ensembles was ranked in the top three positions in any dataset. Concerning the combination rule, four of the six first-ranked ensembles used the median combiner, while the two other first-ranked ensembles used the average and the inverse ranked weighted mean respectively. Moreover, the median and the inverse ranked weighted mean combiners were ranked second three times and two times respectively.
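To illustrate the ranking used in Step 6, the sketch below applies a standard Borda Count over several measures. The exact point scheme used by the authors is not detailed in the paper, so the usual convention of n−1 points for the best and 0 for the worst is assumed, and the figures in the example are invented.

```python
import numpy as np


def borda_rank(scores, lower_is_better):
    """Borda Count ranking of ensembles (Step 6).

    scores[m][j]: value of measure m for ensemble j;
    lower_is_better[m]: True for error measures (MAE, LSD, ...), False for Pred(25)."""
    scores = np.asarray(scores, float)
    n_measures, n_ens = scores.shape
    points = np.zeros(n_ens)
    for m in range(n_measures):
        order = np.argsort(scores[m]) if lower_is_better[m] else np.argsort(-scores[m])
        for place, ens in enumerate(order):
            points[ens] += n_ens - 1 - place      # best gets n-1 points, worst gets 0
    return np.argsort(-points)                    # ensemble indices, best first


# Three ensembles scored on two measures (MAE lower-is-better, Pred(25) higher-is-better)
print(borda_rank([[100, 120, 90], [40, 35, 45]], [True, False]))   # -> [2 0 1]
```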
Based on these empirical results, the main findings are as follows:
• Ensembles without feature selection generally outperformed ensembles using feature selection methods. This may be due to the fact that all the attributes of most of the datasets used are relevant for effort estimation.
• The use of CFS rather than RReliefF feature selection can lead to more accurate ensembles, since the RReliefF ensembles were ranked in the last positions in most of the datasets used. This may be due to the prefixed number of selected features (log2(N)).
• No ensemble performed better than the others in all contexts.
• CFS ensembles generated better results on the small datasets (Albrecht and Miyazaki).
• The use of the median combiner can in general improve the accuracy of ensembles.
7 THREATS TO VALIDITY
As this paper is an empirical study, this section presents the threats to validity of the conclusions drawn. Four types of threats to validity are considered:
Internal validity: the main threat to the validity of any empirical study comes from using a biased validation method to assess the performance of the estimation techniques. Many empirical studies in SDEE divide a dataset into two subsets, one used for training a technique and the other for testing it. The evaluation is then performed in one or a few rounds, which can result in a biased and unreliable assessment of the technique. To avoid this limitation, this study used LOOCV, since this method repeats the evaluation process as many times as there are instances in the dataset [36, 48]. Further, the same results can be reproduced when the experiment is replicated.
External validity: this threat concerns the perimeter of validity of the results obtained on the prediction accuracy of the ensembles of this study. The proposed ensembles were evaluated on six well-known datasets. These datasets were collected from different countries and organizations, and they are diverse in terms of number of instances and number of features.
Statistical conclusion validity: this threat concerns the statistical test used and its effect on the findings of this study. In this paper, the Scott-Knott statistical test was used to check for significant differences between prediction techniques. This statistical test belongs to the parametric test category, which makes certain assumptions about the population (i.e. the data); in particular, the SK test requires that the data follow a normal distribution. In our experiments, the AEs of the estimation techniques were transformed in order to follow a normal distribution. However, transforming the data may influence the final result of the statistical test, so using non-parametric tests such as the Wilcoxon Signed Rank test could lead to different conclusions.
Construct validity: this threat concerns the measurement criteria used to assess the performance of the predictive techniques. In this study, eight unbiased performance measures, in addition to SA and effect size, were used to assess the predictive capability of the developed ensembles. Moreover, the well-known MMRE criterion was not used, since it is considered biased towards underestimates. Concerning the feature selection methods, only two Filter-based methods were used to perform the preprocessing step; hence, the results obtained in this paper concern only these two methods, and other feature selection methods still need to be evaluated. As for parameter settings, a grid search was performed to set the appropriate parameter values of each single technique. This allowed building ensembles whose members each have an optimal configuration, which can lead to accurate ensembles.
8 CONCLUSION AND FUTURE WORK
This empirical study assessed the impact of feature selection methods on the accuracy of HT ensembles. For this purpose, four ML techniques (Knn, SVR, MLP, and DTs) were chosen as base techniques and two feature selection methods were used to pre-process six well-known datasets. The ensembles based on the pre-processed datasets were compared to those constructed without feature selection, and the ensemble members were tuned using a grid search. The study used, in addition to SA and effect size, eight accuracy measures to assess the performance of the ensembles through the LOOCV method. The findings for the research questions are as follows:
(RQ1): Does the use of feature selection methods lead HT ensembles to generate better estimation accuracy than ensembles constructed without feature selection?
We found that, in general, the ensembles without feature selection generated better accuracy than the ensembles built with the two feature selection methods CFS and RReliefF. Furthermore, the CFS ensembles generated better results than the RReliefF ensembles.
(RQ2): Among the three linear rules used, which one leads the HT ensembles to generate more accurate estimations?
There is no evidence concerning the best combination rule in all contexts. However, the empirical results suggest that the use of the ME rule can lead to accurate estimates.
Ongoing work aims to investigate other Filter feature selection methods with both HT and homogeneous ensembles. This paper only used one feature selection method per HT ensemble to generate diversity among the ensemble members; investigating more than one feature selection method may increase the diversity and therefore may yield more accurate HT ensembles. Moreover, another direction of research is to investigate the two other types of feature selection, wrappers and embedded methods, with ensembles to evaluate whether they lead to improved estimation accuracy.
ACKNOWLEDGMENTS
This work was conducted within the research project MPHR-PPR1-2015-2018. The authors would like to thank the Moroccan MESRSFC and CNRST for their support.

A ATTRIBUTES OF THE SIX DATASETS
Tables A.7 to A.12 present the attributes of the six selected datasets used in this paper.

Table A.7: Albrecht dataset attributes.
ID  Attribute    Description
1   Input        Function points of input
2   Output       Function points of external output
3   Inquiry      Function points of external enquiry
4   File         Function points of internal logical files or entity references
5   FPAdj        Adjusted function points
6   RawFPcounts  Total number of rows

Table A.8: China dataset attributes.
ID  Attribute  Description
1   AFP        Adjusted function points
2   Input      Function points of input
3   Output     Function points of external output
4   Enquiry    Function points of external enquiry
5   File       Function points of internal logical files or entity references
6   Interface  Points of external interface added
7   Added      Function points of new or added functions
8   Changed    Function points of changed functions
9   Deleted    Function points of deleted functions
10  PDR_UFP    Normalized level 1 productivity delivery rate
11  NPDR_AFP   Normalized productivity delivery rate
12  NPDU_UFP   Productivity delivery rate (adjusted function points)
13  Resource   Team type
14  Dev.Type   Development type
15  Duration   Total elapsed time for the project

Table A.9: COCOMO81 dataset attributes.
ID  Attribute         Description
1   SIZE              Software Size
2   DATA              Database Size
3   TIME              Execution Time Constraint
4   STOR              Main Storage Constraint
5   VIRTMIN, VIRTMAJ  Virtual Machine Volatility
6   TURN              Computer Turnaround
7   ACAP              Analyst Capability
8   AEXP              Applications Experience
9   PCAP              Programmer Capability
10  VEXP              Virtual Machine Experience
11  LEXP              Programming Language Experience
12  SCED              Required Development Schedule

Table A.10: Desharnais dataset attributes.
ID  Attribute       Description
1   TeamExp         Team experience measured in years
2   ManagerExp      Team manager experience measured in years
3   YearEnd         Year of completion
4   Length          Length of the project
5   Transactions    # of transactions processed
6   Entities        # of entities in the system's data model
7   PointsAdjust    Function point complexity adjustment factor
8   Envergure       Complex measure derived from other factors
9   PointsNonAjust  Unadjusted function points
10  Language        Category of programming language

Table A.11: Miyazaki dataset attributes.
ID  Attribute  Description
1   KSLOC      The number of COBOL source lines in thousands
2   SCRN       Number of different input or output screens
3   FORM       Number of different (report) forms
4   FILE       Number of different record formats
5   ESCRN      Total number of data elements in all the screens
6   EFORM      Total number of data elements in all the forms
7   EFILE      Total number of data elements in all the files

Table A.12: Kemerer dataset attributes.
ID  Attribute  Description
1   KSLOC      Kilo Lines of Code
2   AdjFP      Adjusted Function Points
3   RawFP      Unadjusted Function Points
4   Duration   Duration of project
5   Language   Programming language
6   Hardware   Hardware Resources
REFERENCES
[1] The R Project for Statistical Computing.
[2] A.J. Albrecht and J.E. Gaffney. 1983. Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering SE-9, 6 (1983), 639–648. https://doi.org/10.1109/TSE.1983.235271
[3] N. S. Altman. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46, 3 (1992), 175–185. https://doi.org/10.1080/00031305.1992.10475879
[4] Damir Azhar, Patricia Riddle, Emilia Mendes, Nikolaos Mittas, and Lefteris Angelis. 2013. Using Ensembles for Web Effort Estimation. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 173–182. https://doi.org/10.1109/ESEM.2013.25
[5] Mohammad Azzeh, Ali Bou Nassif, and Leandro L. Minku. 2015. An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation. The Journal of Systems and Software 103 (2015), 36–52. https://doi.org/10.1016/j.jss.2015.01.028
[6] Mohammad Azzeh, Daniel Neagu, and Peter Cowling. 2008. Improving Analogy Software Effort Estimation using Fuzzy Feature Subset Selection Algorithm. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering - PROMISE '08. 71–78. https://doi.org/10.1145/1370788.1370805
[7] Barry Boehm. 1984. Software Engineering Economics. IEEE Transactions on Software Engineering 10, 1 (1984), 4–21.
[8] L.C. Borges and D.F. Ferreira. 2003. Power and type I errors rate of Scott-Knott, Tukey and Newman-Keuls tests under normal and no-normal distributions of the residues. Revista de Matemática e Estatística 21, 1 (2003), 67–83.
[9] Petronio L. Braga, Adriano L. I. Oliveira, and Silvio R. L. Meira. 2007. Software Effort Estimation using Machine Learning Techniques with Robust Confidence Intervals. In 7th International Conference on Hybrid Intelligent Systems (HIS 2007). 352–357. https://doi.org/10.1109/HIS.2007.56
[10] Leo Breiman. 1996. Bagging Predictors. Machine Learning 26, 2 (1996), 123–140. https://doi.org/10.1023/A:1018054314350
[11] Barbara M. Byrne. 2009. Structural Equation Modeling with AMOS. New York. https://doi.org/10.4324/9781410600219
[12] Arjun Chandra and Xin Yao. 2006. Ensemble Learning Using Multi-Objective Evolutionary Algorithms. Journal of Mathematical Modelling and Algorithms 5, 4 (2006), 417–445. https://doi.org/10.1007/s10852-005-9020-3
[13] Zhihao Chen, Tim Menzies, Dan Port, and Barry Boehm. 2005. Feature Subset Selection Can Improve Software Cost Estimation Accuracy. In International Conference on Predictor Models in Software Engineering (PROMISE '05). 1–6. https://doi.org/10.1145/1082983.1083171
[14] Tommy W. S. Chow and Di Huang. 2005. Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Transactions on Neural Networks 16, 1 (2005), 213–224. https://doi.org/10.1109/TNN.2004.841414
[15] Jacob Cohen. 1992. A power primer. Psychological Bulletin 112, 1 (1992), 155–159. https://doi.org/10.1037/0033-2909.112.1.155
[16] Iris Fabiana de Barcelos Tronto, José Demísio Simões da Silva, and Nilson Sant'Anna. 2008. An investigation of artificial neural networks based prediction systems in software project management. Journal of Systems and Software 81, 3 (2008), 356–367. https://doi.org/10.1016/j.jss.2007.05.011
[17] J.M. Deharnais. 1989. Analyse statistique de la productivitie des projects de developement en informatique apartir de la technique des points des fontion. Master's Thesis. Quebec University.
[18] Jeremiah D. Deng and Martin K. Purvis. 2009. Software Effort Estimation: Harmonizing Algorithms and Domain Knowledge in an Integrated Data Mining Approach. Technical Report. University of Otago.
[19] T. G. Ditterrich. 1997. Machine Learning Research: Four Current Directions. Artificial Intelligence Magazine 4 (1997), 97–136.
[20] Mahmoud O. Elish, Tarek Helmy, and Muhammad Imtiaz Hussain. 2013. Empirical Study of Homogeneous and Heterogeneous Ensemble Models for Software Development Effort Estimation. Mathematical Problems in Engineering 2013 (2013). https://doi.org/10.1155/2013/312067
[21] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit. 2003. A simulation study of the model evaluation criterion MMRE. IEEE Transactions on Software Engineering 29, 11 (2003), 985–995. https://doi.org/10.1109/TSE.2003.1245300
[22] Yoav Freund and Robert E. Schapire. 1995. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Computational Learning Theory. Vol. 904. 23–37. https://doi.org/10.1007/3-540-59119-2
[23] Mihaela Göndör and Vasile Paul Bresfelean. 2012. REPTree and M5P for measuring fiscal policy influences on the Romanian capital market during 2003-2010. International Journal of Mathematics and Computers in Simulation 6, 4 (2012), 378–386.
[24] Isabelle Guyon and André Elisseeff. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 3 (2003), 1157–1182.
[25] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh (Eds.). 2006. Feature Extraction: Foundations and Applications. Heidelberg.
[26] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 1-3 (2002), 389–422. https://doi.org/10.1023/A:1012487302797
[27] Mark Hall. 1999. Correlation-based Feature Selection for Machine Learning. Ph.D. Dissertation.
[28] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software. ACM SIGKDD Explorations 11, 1 (2009), 10–18. https://doi.org/10.1145/1656274.1656278
[29] J. H. Holland. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press (1975), 183. https://doi.org/10.1137/1018105
[30] Mohamed Hosni and Ali Idri. 2017. Software Effort Estimation Using Classical Analogy Ensembles Based on Random Subspace. In Proceedings of the Symposium on Applied Computing (SAC '17). ACM, New York, NY, USA, 1251–1258. https://doi.org/10.1145/3019612.3019784
[31] Robert T. Hughes. 1996. Expert judgement as an estimating method. Information and Software Technology 38, 2 (1996), 67–75.
[32] Ali Idri, Ibtissam Abnane, and Alain Abran. 2016. Missing data techniques in analogy-based software development effort estimation. Journal of Systems and Software 117 (2016), 595–611. https://doi.org/10.1016/j.jss.2016.04.058
[33] Ali Idri, Alain Abran, and Laila Kjiri. 2000. COCOMO Cost Model Using Fuzzy Logic. In Proceedings of the 7th International Conference on Fuzzy Theory & Techniques. Atlantic, New Jersey, 1–4.
[34] Ali Idri, Fatima Azzahra Amazal, and Alain Abran. 2015. Analogy-based software development effort estimation: A systematic mapping and review. Information and Software Technology 58 (2015), 206–230. https://doi.org/10.1016/j.infsof.2014.07.013
[35] Ali Idri, Mohamed Hosni, and Alain Abran. 2016. Improved Estimation of Software Development Effort Using Classical and Fuzzy Analogy Ensembles. Applied Soft Computing (2016). https://doi.org/10.1016/j.asoc.2016.08.012
[36] Ali Idri, Mohamed Hosni, and Alain Abran. 2016. Systematic Literature Review of Ensemble Effort Estimation. Journal of Systems and Software 118 (2016), 151–175. https://doi.org/10.1016/j.jss.2016.05.016
[37] Ali Idri, Mohamed Hosni, and Alain Abran. 2016. Systematic Mapping Study of Ensemble Effort Estimation. In Proceedings of the 11th International Conference on Evaluation of Novel Software Approaches to Software Engineering. 132–139. https://doi.org/10.5220/0005822701320139
[38] Ali Idri, Taghi M. Khoshgoftaar, and Alain Abran. 2002. Can Neural Networks be easily Interpreted in Software Cost Estimation? World Congress on Computational Intelligence (2002), 1162–1167. https://doi.org/10.1109/FUZZ.2002.1006668
ID Attribute Description
1 TeamExp Team experience measured in years
2 ManagerExp Team manager experience measured in years
3 YearEnd Year of completion
4 Length Length of the project
5 Transactions # of transactions processed
6 Entities # of entities in the systems data model
7 PointsAdjust Function point complexity adjustment factor
8 Envergure Complex measure derived from other factors
9 PointsNonAjust Unadjusted function points
10 Language Category of Programming Language
ID Attribute Description
1 KSLOC The number of COBOL source lines in thousands
2 SCRN Number of different input or output
3 FORM Number of different (report) forms
4 FILE Number of different record formats
5 ESCRN Total number of data elements in all the screens
6 EFORM Total number of data elements in all the forms
7 EFILE Total number of data elements in all the files
Table A.12: Kemerer dataset attributes.
ID Attribute Description
1 KSLOC Kilo Line of Code
2 AdjFP Adjusted Function Points
3 RawFP Unadjusted Function points
4 Duration Duration of project
5 Language Programming language
6 Hardware Hardware Resources
[39] Ali Idri, Taghi M. Khoshgoftaar, and Alain Abran. 2002. Investigating soft computing in case-based reasoning for software cost estimation. Engineering Intelligent Systems for Electrical Engineering and Communications 10, 3 (2002), 147–157.
[40] M. Jørgensen. 2004. A review of studies on expert estimation of software development effort. Journal of Systems and Software 70, 1-2 (feb 2004), 37–60. https://doi.org/10.1016/S0164-1212(02)00156-5
[41] Magne Jorgensen and Martin Shepperd. 2007. A Systematic Review of Software Development Cost Estimation Studies. IEEE Transactions on Software Engineering 33, 1 (2007), 33–53. https://doi.org/10.1109/TSE.2007.256943
[42] A. Jovic, K. Brkic, and N. Bogunovic. 2015. A review of feature selection methods with applications. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 1200–1205. https://doi.org/10.1109/MIPRO.2015.7160458
[43] Sushilkumar Kalmegh. 2015. Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News. International Journal of Innovative Science, Engineering & Technology 2, 2 (2015), 438–446.
[44] Gao Kehan, Khoshgoftaar Taghi M., Wang Huanjing, and Seliya Naeem. 2011. Choosing software metrics for defect prediction: an investigation on feature selection techniques. Software - Practice and Experience 41 (2011), 579–606. https://doi.org/10.1002/spe.1043 arXiv:1008.1900
[45] Chris F Kemerer. 1987. An empirical validation of software cost estimation models. Commun. ACM 30, 5 (1987), 416–429. https://doi.org/10.1145/22899.22906
[46] T.M. Khoshgoftaar, M. Golawala, and J. Van Hulse. 2007. An Empirical Study of Learning from Imbalanced Data Using Random Forest. In 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Vol. 2. 310–317. https://doi.org/10.1109/ICTAI.2007.46
[47] Kenji Kira and Larry A. Rendell. 1992. A Practical Approach to Feature Selection. In Machine Learning Proceedings 1992. Morgan Kaufmann Publishers, Inc., 249–256. https://doi.org/10.1016/B978-1-55860-247-2.50037-1
[48] Ekrem Kocaguneli and Tim Menzies. 2013. Software effort models should be assessed via leave-one-out validation. Journal of Systems and Software 86, 7 (2013), 1879–1890. https://doi.org/10.1016/j.jss.2013.02.053
[49] Ekrem Kocaguneli, Tim Menzies, and Jacky W. Keung. 2012. On the Value of Ensemble Effort Estimation. IEEE Transactions on Software Engineering 38, 6 (2012), 1403–1416. https://doi.org/10.1109/TSE.2011.111
[50] Igor Kononenko. 1994. Estimating attributes: Analysis and extensions of RELIEF. Machine Learning: ECML-94 784 (1994), 171–182. https://doi.org/10.1007/3-540-57868-4
[51] Hubert W Lilliefors. 1967. On the Kolmogorov-Smirnov Test for Normality With Mean and Variance Unknown. J. Amer. Statist. Assoc. 62, 318 (1967), 399–402. https://doi.org/10.1080/01621459.1967.10482916
[52] Huawen Liu, Jigui Sun, Lei Liu, and Huijie Zhang. 2009. Feature selection with dynamic mutual information. Pattern Recognition 42, 7 (2009), 1330–1339. https://doi.org/10.1016/j.patcog.2008.10.028
[53] Y Liu and X Yao. 1999. Ensemble learning via negative correlation. Neural Networks 12, 10 (1999), 1399–1404. https://doi.org/10.1016/S0893-6080(99)00073-8
[54] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. 2012. The promise repository of empirical software engineering data. (2012). terapromise.csc.ncsu.edu
[55] Leandro L. Minku and Xin Yao. 2013. An analysis of multi-objective evolutionary algorithms for training ensemble models based on different performance measures in software effort estimation. In Proceedings of the 9th International Conference on Predictive Models in Software Engineering - PROMISE ’13. 1–10. https://doi.org/10.1145/2499393.2499396
[56] Leandro L. Minku and Xin Yao. 2013. Ensembles and locality: Insight on improving software effort estimation. Information and Software Technology 55, 8 (aug 2013), 1512–1528. https://doi.org/10.1016/j.infsof.2012.09.012
[57] Leandro L. Minku and Xin Yao. 2013. Software Effort Estimation As a Multiobjective Learning Problem. ACM Trans. Softw. Eng. Methodol. 22, 4 (2013), 35:1–35:32. http://doi.acm.org/10.1145/2522920.2522928
[58] Y Miyazaki. 1991. Method to estimate parameter values in software prediction models. Information and Software Technology 33, 3 (1991), 239–243. https://doi.org/10.1016/0950-5849(91)90139-3
[59] Y Miyazaki, M Terakado, and K Ozaki. 1994. Robust Regression for Developing Software Estimation Models. Journal of Systems and Software 27, 1 (1994), 3–16. https://doi.org/10.1016/0164-1212(94)90110-4
[60] W Nor Haizan W Mohamed, Mohd Najib, Mohd Salleh, and Abdul Halim Omar. 2012. A Comparative Study of Reduced Error Pruning Method in Decision Tree Algorithms. In IEEE International Conference on Control System, Computing and Engineering 2012. 23–25. https://doi.org/10.1109/ICCSCE.2012.6487177
[61] I. Myrtveit, E. Stensrud, and M. Shepperd. 2005. Reliability and validity in comparative studies of software prediction models. IEEE Transactions on Software Engineering 31, 5 (2005), 380–391. https://doi.org/10.1109/TSE.2005.58
[62] Ali Bou Nassif, Mohammad Azzeh, Luiz Fernando Capretz, and Danny Ho. 2015. Neural network models for software development effort estimation: a comparative study. Neural Computing and Applications (2015), 1–13. https://doi.org/10.1007/s00521-015-2127-1
[63] Adriano L.I. Oliveira. 2006. Estimation of software project effort with support vector regression. Neurocomputing 69, 13-15 (2006), 1749–1753. https://doi.org/10.1016/j.neucom.2005.12.119
[64] L.H. Putnam. 1978. A General Empirical Solution to the Macro Software Sizing and Estimating Problem. IEEE Transactions on Software Engineering SE-4, 4 (1978), 345–361. https://doi.org/10.1109/TSE.1978.231521
[65] Maurice H Quenouille. 1956. Notes on bias in estimation. Biometrika 43, 3/4 (1956), 353–360.
[66] M Robnik-Šikonja and I Kononenko. 1997. An adaptation of Relief for attribute estimation in regression. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML’97), Vol. 5. 296–304. https://doi.org/10.1119/1.880454
[67] Marko Robnik-Šikonja and Igor Kononenko. 1999. Attribute dependencies, understandability and split selection in tree based models. In Machine Learning: Proceedings of the Sixteenth International Conference (ICML’99). 344–353.
[68] N Sánchez-Maroño, A Alonso-Betanzos, and M Tombilla-Sanromán. 2007. Filter methods for feature selection – a comparative study. In Intelligent Data Engineering and Automated Learning - IDEAL 2007. Springer, 178–187. https://doi.org/10.1007/978-3-540-77226-2
[69] A. J. Scott and M. Knott. 1974. A Cluster Analysis Method for Grouping Means in the Analysis of Variance. Biometrics 30, 3 (1974), 507–512. http://www.jstor.org/stable/2529204
[70] Zahra Shahpar, Vahid Khatibi, Asma Tanavar, and Rahil Sarikhani. 2016. Improvement of effort estimation accuracy in software projects using a feature selection approach. Journal of Advances in Computer Engineering and Technology 2, 4 (2016), 31–38.
[71] Qiang Shen, Ren Diao, and Pan Su. 2012. Feature Selection Ensemble. Turing-100 10 (2012), 289–306.
[72] Martin Shepperd and Steve MacDonell. 2012. Evaluating prediction systems in software project estimation. Information and Software Technology 54, 8 (2012), 820–827. https://doi.org/10.1016/j.infsof.2011.12.008
[73] Martin J. Shepperd and Gada Kadoda. 2001. Comparing software prediction techniques using simulation. IEEE Transactions on Software Engineering 27, 11 (2001), 1014–1022. https://doi.org/10.1109/32.965341
[74] Simon Haykin. 1999. Neural networks: a comprehensive foundation (2 ed.). MacMillan Publishing Company. https://doi.org/10.1017/S0269888998214044
[75] Vikas Sindhwani, Subrata Rakshit, Dipti Deodhare, Deniz Erdogmus, Jose C Principe, and Partha Niyogi. 2004. Feature selection in MLPs and SVMs based on maximum output information. IEEE Transactions on Neural Networks 15, 4 (2004), 937–948. https://doi.org/10.1109/TNN.2004.828772
[76] L Song, Leandro L Minku, and X Yao. 2013. The impact of parameter tuning on software effort estimation using learning machines. In Proceedings of the 9th International Conference on Predictive Models in Software Engineering. http://dl.acm.org/citation.cfm?id=2499394
[77] Yu Wang, Igor V. Tetko, Mark A. Hall, Eibe Frank, Axel Facius, Klaus F.X. Mayer, and Hans W. Mewes. 2005. Gene selection from microarray data for cancer classification - A machine learning approach. Computational Biology and Chemistry 29, 1 (2005), 37–46. https://doi.org/10.1016/j.compbiolchem.2004.11.001
[78] Y. Wang and I. H. Witten. 1997. Inducing Model Trees for Continuous Classes. In European Conference on Machine Learning (ECML). 1–10. http://www.cs.waikato.ac.nz/
[79] Jianfeng Wen, Shixian Li, Zhiyong Lin, Yong Hu, and Changqin Huang. 2012. Systematic literature review of machine learning based software development effort estimation models. Information and Software Technology 54, 1 (jan 2012), 41–59. https://doi.org/10.1016/j.infsof.2011.09.002
[80] Dietrich Wettschereck, David W. Aha, and Takao Mohri. 1997. A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms. Artificial Intelligence Review (1997), 1–37. https://doi.org/10.1023/A:1006593614256
[81] Yongheng Zhao and Yanxia Zhang. 2008. Comparison of decision tree methods for finding active objects. Advances in Space Research 41, 12 (2008), 1955–1959. https://doi.org/10.1016/j.asr.2007.07.020 arXiv:0708.4274
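
As an illustrative complement to the dataset attribute tables above, the following minimal Python sketch chains a filter feature-selection step with a heterogeneous ensemble of the four base learners (K-Nearest Neighbor, Support Vector Regression, Multilayer Perceptron, and a decision tree) combined by the median rule and evaluated with leave-one-out cross-validation. It is a sketch only: the file name desharnais.csv, the Effort column name, and the simple correlation-based ranking (a stand-in for CFS and RReliefF, which scikit-learn does not provide) are assumptions, not artifacts of the study.

# Minimal sketch under assumed file and column names: filter feature selection,
# then a heterogeneous ensemble combined with the median rule, evaluated with LOOCV.
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv("desharnais.csv")                      # hypothetical file name
y = data["Effort"].values                                 # hypothetical effort column
X = data.drop(columns=["Effort"]).select_dtypes(include="number").values

# Filter step: rank features by absolute Pearson correlation with effort, keep the top k.
# (A simplified stand-in for CFS/RReliefF, used only to illustrate the pipeline.)
k = min(5, X.shape[1])
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
X = X[:, np.argsort(scores)[::-1][:k]]

base_learners = [
    KNeighborsRegressor(n_neighbors=3),                   # K-Nearest Neighbor
    SVR(),                                                # Support Vector Regression
    MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),  # Multilayer Perceptron
    DecisionTreeRegressor(random_state=0),                # Decision Tree
]

abs_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):        # leave-one-out cross-validation
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
    preds = [m.fit(X_tr, y[train_idx]).predict(X_te)[0] for m in base_learners]
    abs_errors.append(abs(np.median(preds) - y[test_idx][0]))  # median combination rule

print("LOOCV mean absolute error:", np.mean(abs_errors))

In the setting studied above, the filter step would be CFS or RReliefF and accuracy would be judged with the eight unbiased measures reported in the paper; the sketch only mirrors the overall pipeline of selecting features before training and combining the base techniques.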