Machine Learning Approaches To Estimating Software Development Effort
Abstract-Accurate estimation of software development effort is critical in software engineering. Underestimates lead to time pressures that may compromise full functional development and thorough testing of software. In contrast, overestimates can result in noncompetitive contract bids and/or over allocation of development resources and personnel. As a result, many models for estimating software development effort have been proposed. This article describes two methods of machine learning, which we use to build estimators of software development effort from historical data. Our experiments indicate that these techniques are competitive with traditional estimators on one dataset, but also illustrate that these methods are sensitive to the data on which they are trained. This cautionary note applies to any model-construction strategy that relies on historical data. All such models for software effort estimation should be evaluated by exploring model sensitivity on a variety of historical data.

Index Terms-Software development effort, machine learning, decision trees, regression trees, and neural networks.

... of delivered source lines of code (SLOC). In contrast, many methods of machine learning make no or minimal assumptions about the form of the function under study (e.g., development effort), but as with other approaches they depend on historical data. In particular, over a known set of training data, the learning algorithm constructs "rules" that fit the data, and which hopefully fit previously unseen data in a reasonable manner as well. This article illustrates machine learning approaches to estimating software development effort using an algorithm for building regression trees [4], and a neural-network learning approach known as BACKPROPAGATION [19]. Our experiments, using established case libraries [3], [11], indicate possible advantages of the approach relative to traditional models, but also point to limitations that motivate continued research.
... be straightforward, then the basic COCOMO model (COCOMO-basic) relates the nominal development effort (N) and DSI as follows:

    N = 3.2 x (KDSI)^1.05,

where KDSI is the DSI in 1000s. However, the prediction of the basic COCOMO model can be modified using cost drivers. Cost drivers are classified under four major headings relating to attributes of the product (e.g., required software reliability), computer platform (e.g., main memory limitations), personnel (e.g., analyst capability), and the project (e.g., use of modern programming practices). These factors serve to adjust the nominal effort up or down. These cost drivers and other considerations extend the basic model to intermediate and final forms.
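To make the arithmetic concrete, the following Python sketch computes the nominal estimate and multiplies in a set of cost drivers; the driver names and multiplier values here are hypothetical placeholders rather than Boehm's published ratings.

    def cocomo_basic_nominal(kdsi):
        """Nominal development effort (person-months) of the basic COCOMO model."""
        return 3.2 * kdsi ** 1.05

    def cocomo_adjusted(kdsi, drivers):
        """Adjust the nominal estimate by a product of cost-driver multipliers.

        `drivers` maps driver names to effort multipliers; the values used in the
        example below are illustrative placeholders, not Boehm's published tables.
        """
        effort = cocomo_basic_nominal(kdsi)
        for multiplier in drivers.values():
            effort *= multiplier
        return effort

    # Example: a 32 KDSI project with two hypothetical driver ratings.
    print(cocomo_adjusted(32.0, {"RELY": 1.15, "ACAP": 0.86}))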
The Function Point method was developed by Albrecht [2]. Function points are based on characteristics of the project that are at a higher descriptive level than SLOC, such as the number of input transaction types and number of reports. A notable advantage of this approach is that it does not rely on SLOC, which facilitates estimation early in the project life cycle (i.e., during requirements definition), and by nontechnical personnel. To count function points requires that one count user functions and then make adjustments for processing complexity. There are five types of user function that are included in the function point calculation: external input types, external output types, logical internal file types, external interface file types, and external inquiry types. In addition, there are 14 processing complexity characteristics such as transaction rates and online updating. A function point is calculated based on the number of transactions and complexity characteristics. The development effort estimate given the function point, F, is: N = 54 x F - 13390.
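A minimal sketch of this calculation is given below; the per-type weights are invented stand-ins (Albrecht's method additionally rates each function by complexity), and only the effort equation is taken from the text.

    # Hypothetical weights for the five user-function types; Albrecht's method
    # actually weights each type according to its processing complexity.
    WEIGHTS = {"external_input": 4, "external_output": 5, "internal_file": 10,
               "external_interface_file": 7, "external_inquiry": 4}

    def function_points(counts):
        """Unadjusted function-point count from per-type counts."""
        return sum(WEIGHTS[t] * n for t, n in counts.items())

    def effort_estimate(fp):
        """Development effort estimate from the equation quoted in the text."""
        return 54 * fp - 13390

    counts = {"external_input": 30, "external_output": 25, "internal_file": 10,
              "external_interface_file": 5, "external_inquiry": 12}
    print(effort_estimate(function_points(counts)))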
Recently, a case-based approach called ESTOR was developed for software effort estimation. This model was developed by Vicinanza et al. [23] by obtaining protocols from a human expert. From a library of cases developed from expert-supplied protocols, an instance called the source is retrieved that is most "similar" to the target problem to be solved. The solution of the most similar problem retrieved from the case library is adapted to account for differences between the source problem and the target problem using rules inferred from analysis of the human expert's protocols. An example of an adjustment rule is:

    IF   staff size of Source project is small, AND
         staff size of Target is large
    THEN increase effort estimate of Target by 20%.

Vicinanza et al. have shown that ESTOR performs better than COCOMO and FUNCTION POINTS on restricted samples of problems.
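The retrieve-and-adjust cycle can be sketched as follows; the similarity measure and the single adjustment rule are illustrative inventions patterned on the example above, not ESTOR's actual knowledge base.

    def most_similar(target, library, features):
        """Retrieve the library case closest to the target (Euclidean distance over
        a few numeric features; ESTOR's actual similarity measure differs)."""
        def dist(case):
            return sum((case[f] - target[f]) ** 2 for f in features) ** 0.5
        return min(library, key=dist)

    def adjust(estimate, source, target):
        """Apply expert-derived adjustment rules; this one mimics the example rule
        in the text (small source staff, large target staff -> increase by 20%)."""
        if source["staff"] == "small" and target["staff"] == "large":
            estimate *= 1.20
        return estimate

    library = [{"kdsi": 50, "staff": "small", "effort": 120.0},
               {"kdsi": 300, "staff": "large", "effort": 900.0}]
    target = {"kdsi": 70, "staff": "large"}
    source = most_similar(target, library, features=["kdsi"])
    print(adjust(source["effort"], source, target))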
In sum, there have been a variety of models developed for estimating development effort. With the exception of ESTOR these are parametric approaches that assume that an initial estimate can be provided by a formula that has been fit to historical data.

III. MACHINE LEARNING APPROACHES TO ESTIMATING DEVELOPMENT EFFORT

This section describes two machine learning strategies that we use to estimate software development effort, which we assume is measured in development months (M). In many respects this work stems from a more general methodology for developing expert systems. Traditionally, expert systems have been developed by extracting the rules that experts apparently use by an interview process or protocol analysis (e.g., ESTOR), but an alternate approach is to allow machine learning programs to formulate rulebases from historical data. This methodology requires historical data on which to apply learning strategies.

There are several aspects of software development effort estimation that make it amenable to machine learning analysis. Most important, previous researchers have identified at least some of the attributes relevant to software development effort estimation, and historical databases defined over these relevant attributes have been accumulated. The following sections describe two very different learning algorithms that we use to test the machine learning approach. Other research using machine learning techniques for software resource estimation is found in [5], [14], [15], [22], which we will discuss throughout the paper. In short, our work adds to the collection of machine learning techniques available to software engineers, and our analysis stresses the sensitivity of these approaches to the nature of historical data and other factors.

A. Learning Decision and Regression Trees

Many learning approaches have been developed that construct decision trees for classifying data [4], [17]. Fig. 1 illustrates a partial decision tree over Boehm's original 63 projects from which COCOMO was developed. Each project is described over dimensions such as AKDSI (i.e., adjusted delivered source instructions), TIME (i.e., the required system response time), and STOR (i.e., main memory limitations). The complete set of attributes used to describe these data is given in Appendix A. The mean of actual project development months labels each leaf of the tree. Predicting development effort for a project requires that one descend the decision tree along an appropriate path, and the leaf value along that path gives the estimate of development effort of the new project. The decision tree in Fig. 1 is referred to as a regression tree, because the intent of categorization is to generate a prediction along a continuous dependent dimension (here, software development effort).
[Fig. 1. A regression tree over Boehm's 63 software project descriptions. Internal nodes test attributes such as AKDSI, TIME, STOR, and AAF; leaves are labeled by mean development months, and numbers in square brackets represent the number of projects classified under a node.]

There are many automatic methods for constructing decision and regression trees from data, but these techniques are typically variations on one simple strategy. A "top-down" strategy examines the data and selects an attribute that best divides the data into disjoint subpopulations. The most important aspect of decision and regression tree learners is the criterion used to select a "divisive" attribute during tree construction. In one variation the system selects the attribute with values that maximally reduce the mean squared error (MSE) of the dependent dimension (e.g., software development effort) observed in the training data. The MSE of any set, S, of training examples taking on values y_k in the continuous dependent dimension is:

    MSE(S) = Σ_{k∈S} (y_k - ȳ)² / |S|

where ȳ is the mean of the y_k values exhibited in S.
The values of each attribute, A_i, partition the entire training data set, T, into subsets, T_ij, where every example in T_ij takes on the same value, say V_j, for attribute A_i. The attribute, A_i, that maximizes the difference:

    ΔMSE = MSE(T) - Σ_j MSE(T_ij)

is selected to divide the tree. Intuitively, the attribute that minimizes the error over the dependent dimension is used. While MSE values are computed over the training data, the inductive assumption is that selected attributes will similarly reduce error over future cases as well.
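The criterion is easy to state in code; the sketch below follows the unweighted ΔMSE definition given above, and the effort values and candidate partition are fabricated.

    def mse(values):
        """Mean squared deviation from the mean, as in the definition of MSE(S)."""
        mean = sum(values) / len(values)
        return sum((y - mean) ** 2 for y in values) / len(values)

    def delta_mse(parent, subsets):
        """Delta-MSE = MSE(T) minus the sum of MSE(T_ij) over an attribute's subsets."""
        return mse(parent) - sum(mse(s) for s in subsets if s)

    # Hypothetical development-month values split by a binary attribute.
    efforts = [8.4, 10.8, 23.2, 69.9, 336.3, 1107.3]
    left, right = [8.4, 10.8, 23.2], [69.9, 336.3, 1107.3]
    print(delta_mse(efforts, [left, right]))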
This basic procedure of attribute selection is easily extended to allow continuously-valued attributes: all ordered 2-partitions of the observed values in the training data are examined. In essence, the dimension is split around each observed value. The effect is to 2-partition the dimension in k - 1 alternate ways (where k is the number of observed values), and the binary "split" that is best according to ΔMSE is considered along with other possible attributes to divide a regression-tree node. Such "splitting" is common in the tree of Fig. 1; see AKDSI, for example. Approaches have also been developed that split a continuous dimension into more than two ranges [9], [15], though we will assume binary splits here. Methods of 2-partitioning attribute domains, for both continuous and nominally-valued (i.e., finite, unordered) attributes, have been explored (e.g., [24]). For continuous attributes this bisection process operates as we have just described, but for a nominally-valued attribute all ways to group values of the attribute into two disjoint sets are considered. Suffice it to say that treating all attributes as though they had the same number of values (e.g., 2) for purposes of attribute selection mitigates certain biases that are present in some attribute selection measures (e.g., ΔMSE). As we will note again in Section IV, we ensure that all attributes are either continuous or binary-valued at the outset of regression-tree construction.

The basic regression-tree learning algorithm is summarized in Fig. 2. The data set is first tested to see whether tree construction is worthwhile; if all the data are classified identically or some other statistically-based criterion is satisfied, then expansion ceases. In this case, the algorithm simply returns a leaf labeled by the mean value of the dependent dimension found in the training data. If the data are not sufficiently distinguished, then the best divisive attribute according to ΔMSE is selected, the attribute's values are used to partition the data into subsets, and the procedure is recursively called on these subsets to expand the tree. When used to construct predictors along continuous dimensions, this general procedure is referred to as recursive-partitioning regression. Our experiments use a partial reimplementation of a system known as CART [4]. We refer to our reimplementation as CARTX.

    FUNCTION CARTX (Instances)
        IF termination-condition(Instances)
        THEN RETURN mean among Instances
        ELSE Set Best-Attribute to most informative attribute among the Instances.
             For each value V_i of Best-Attribute, recursively call
             CARTX({I | I is an Instance with value V_i}).

Fig. 2. Decision/regression-tree learning algorithm.
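A compact Python rendering of the procedure in Fig. 2, under the assumptions of continuous attributes, binary splits chosen by the ΔMSE criterion, and a minimum-leaf-size stopping rule of our own choosing, might look as follows; the toy project data are fabricated.

    def mse(ys):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys) / len(ys)

    def best_split(rows, target):
        """Best (attribute, threshold) bisection according to the Delta-MSE criterion."""
        parent = mse([r[target] for r in rows])
        best = None
        for attr in rows[0]:
            if attr == target:
                continue
            for value in sorted({r[attr] for r in rows}):
                left = [r[target] for r in rows if r[attr] <= value]
                right = [r[target] for r in rows if r[attr] > value]
                if not left or not right:
                    continue
                gain = parent - (mse(left) + mse(right))
                if best is None or gain > best[0]:
                    best = (gain, attr, value)
        return best

    def cartx(rows, target, min_leaf=2):
        """Recursive-partitioning regression: return a leaf mean or an internal node."""
        ys = [r[target] for r in rows]
        split = best_split(rows, target) if len(rows) > min_leaf else None
        if split is None:
            return {"mean": sum(ys) / len(ys)}   # leaf labeled by the mean of the dependent dimension
        _, attr, value = split
        return {"attr": attr, "value": value,
                "le": cartx([r for r in rows if r[attr] <= value], target, min_leaf),
                "gt": cartx([r for r in rows if r[attr] > value], target, min_leaf)}

    def predict(tree, row):
        while "mean" not in tree:
            tree = tree["le"] if row[tree["attr"]] <= tree["value"] else tree["gt"]
        return tree["mean"]

    # Fabricated project descriptions (KDSI, required response time) with effort in months.
    data = [{"KDSI": 10, "TIME": 1.0, "MM": 24}, {"KDSI": 46, "TIME": 1.1, "MM": 120},
            {"KDSI": 120, "TIME": 1.3, "MM": 540}, {"KDSI": 320, "TIME": 1.5, "MM": 2400}]
    tree = cartx(data, target="MM")
    print(predict(tree, {"KDSI": 60, "TIME": 1.2}))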
Previously, Porter and Selby [14], [15], [22], have investigated the use of decision-tree induction for estimating development effort and other resource-related dimensions. Their work assumes that if predictions over a continuous dependent dimension are required, then the continuous dimension is "discretized" by breaking it into mutually-exclusive ranges. More commonly used decision-tree induction algorithms, which assume discrete-valued dependent dimensions, are then applied to the appropriately classified data. In many cases this preprocessing of a continuous dependent dimension may be profitable, though regression-tree induction demonstrates that the general tree-construction approach can be adapted for direct manipulation of a continuous dependent dimension. This is also the case with the learning approach we describe next.

B. A Neural Network Approach to Learning

A learning approach that is very different from that outlined above is BACKPROPAGATION, which operates on a network of simple processing elements as illustrated in Fig. 3. This basic architecture is inspired by biological nerve nets, and is thus called an artificial neural network. Each line between processing elements has a corresponding and distinct weight. Each processing unit in this network computes a nonlinear function of its inputs and passes the resultant value along as its output. The favored function is

    1 / (1 + e^(-Σ_i w_i I_i))

where Σ_i w_i I_i is a weighted sum of the inputs, I_i, to a processing element [19], [25].

[Fig. 3. A network architecture for software development effort estimation. Project attributes such as AKDSI form the input layer; a single output unit gives the estimated effort.]
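A single processing element of this kind, and a forward pass through one hidden layer to a single output unit, can be sketched as below; the weights are arbitrary illustrative numbers.

    import math

    def unit(inputs, weights):
        """Output of one processing element: the logistic function of the weighted input sum."""
        return 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, inputs))))

    def forward(inputs, hidden_weights, output_weights):
        """Propagate inputs through one hidden layer to a single output unit."""
        hidden = [unit(inputs, w) for w in hidden_weights]
        return unit(hidden, output_weights)

    # Three project attributes feeding two hidden units and one output unit.
    print(forward([0.3, 0.7, 1.0],
                  hidden_weights=[[0.5, -0.2, 0.1], [0.4, 0.3, -0.6]],
                  output_weights=[1.2, -0.8]))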
The network generates output by propagating the initial inputs, shown on the lefthand side of Fig. 3, through subsequent layers of processing elements to the final output layer. This net illustrates the kind of mapping that we will use for estimating software development effort, with inputs corresponding to various project attributes, and the output line corresponding to the estimated development effort. The inputs and output are restricted to numeric values. For numerically-valued attributes this mapping is natural, but for nominal data such as LANG (implementation language), a numeric representation must be found. In this domain, each value of a nominal attribute is given its own input line. If the value is present in an observation then the input line is set to 1.0, and if the value is absent then it is set to 0.0. Thus, for a given observation the input line corresponding to an observed nominal value (e.g., COB) will be 1.0, and the others (e.g., FTN) will be 0.0. Our application requires only one network output, but other applications may require more than one.
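This 1-of-k treatment of a nominal attribute amounts to the following; the value set for LANG is abbreviated for illustration.

    LANG_VALUES = ["COB", "FTN", "PL1"]   # observed values of the nominal attribute LANG

    def encode_lang(value):
        """One input line per nominal value: 1.0 if present in the observation, else 0.0."""
        return [1.0 if value == v else 0.0 for v in LANG_VALUES]

    print(encode_lang("COB"))   # -> [1.0, 0.0, 0.0]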
Details of the BACKPROPAGATION learning procedure are beyond the scope of this article, but intuitively the goal of learning is to train the network to generate appropriate output patterns for corresponding input patterns. To accomplish this, comparisons are made between a network's actual output pattern and an a priori known correct output pattern. The difference or error between each output line and its correct corresponding value is "backpropagated" through the net and guides the modification of weights in a manner that will tend to reduce the collective error between actual and correct outputs on training patterns. This procedure has been shown to converge on accurate mappings between input and output patterns in a variety of domains [21], [25].
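For concreteness, a minimal gradient-descent version of this error-driven weight adjustment is sketched below (one hidden layer, squared error, numpy); the architecture, learning rate, and data are arbitrary and do not reproduce the configuration used in our experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((8, 3))                    # 8 training patterns, 3 input lines
    t = rng.random((8, 1))                    # target outputs in [0, 1]

    W1 = rng.normal(scale=0.1, size=(3, 4))   # small random initial weights
    W2 = rng.normal(scale=0.1, size=(4, 1))

    sig = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(2000):                     # repeated presentations of the training data
        h = sig(X @ W1)                       # hidden-layer outputs
        y = sig(h @ W2)                       # actual output pattern
        err = y - t                           # difference from the correct output pattern
        # Backpropagate the error and adjust weights to reduce the collective error.
        d2 = err * y * (1 - y)
        d1 = (d2 @ W2.T) * h * (1 - h)
        W2 -= 0.5 * h.T @ d2
        W1 -= 0.5 * X.T @ d1

    print(float(np.mean((sig(sig(X @ W1) @ W2) - t) ** 2)))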
C. Approximating Arbitrary Functions

In trying to approximate an arbitrary function like development effort, regression trees approximate a function with a "staircase" function. Fig. 4 illustrates a function of one continuous, independent variable. A regression tree decomposes this function's domain so that the mean at each leaf reflects the function's range within a local region. The "hidden" processing elements that reside between the input and output layers of a neural network do roughly the same thing, though the approximating function is generally smoothed. The granularity of this partitioning of the function is modulated by the depth of a regression tree or the number of hidden units in a network.

[Fig. 4. An example of function approximation by a regression tree: a staircase approximation of f(X) over a single continuous independent variable X.]

Each learning approach is nonparametric, since it makes no a priori assumptions about the form of the function being approximated. There are a wide variety of parametric methods for function approximation such as regression methods of statistics and polynomial interpolation methods of numerical analysis [10]. Other nonparametric methods include genetic algorithms [7] and nearest neighbor approaches [1], though we will not elaborate on any of these alternatives here.

D. Sensitivity to Configuration Choices

Both BACKPROPAGATION and CARTX require that the analyst make certain decisions about algorithm implementation. For example, BACKPROPAGATION can be used to train networks with differing numbers of hidden units. Too few hidden units can compromise the ability of the network to approximate a desired function. In contrast, too many hidden units can lead to "overfitting," whereby the learning system fits the "noise"
present in the training data, as well as the meaningful trends that we would like to capture. BACKPROPAGATION is also typically trained by iterating through the training data many times. In general, the greater the number of iterations, the greater the reduction in error over the training sample, though there is no general guarantee of this. Finally, BACKPROPAGATION assumes that weights in the neural network are initialized to small, random values prior to training. The initial random weight settings can also impact learning success, though in many applications this is not a significant factor. There are other parameters that can affect BACKPROPAGATION's performance, but we will not explore these here.

In CARTX, the primary dimension under control by the experimenter is the depth to which the regression tree is allowed to grow. Growth to too great a depth can lead to overfitting, and too little growth can lead to underfitting. Experimental results of Section IV-B illustrate the sensitivity of each learning system to certain configuration choices.

IV. OVERVIEW OF EXPERIMENTAL STUDIES

We conducted several experiments with CARTX and BACKPROPAGATION for the task of estimating software development effort. In general, each of our experiments partitions historical data into samples used to train our learning systems, and disjoint samples used to test the accuracy of the trained classifier in predicting development effort.

For purposes of comparison, we refer to previous experimental results by Kemerer [11]. He conducted comparative analyses between SLIM, COCOMO, and FUNCTION POINTS on a database of 15 projects.¹ These projects consist mainly of business applications with a dominant proportion of them (12/15) written in the COBOL language. In contrast, the COCOMO database includes instances of business, scientific, and system software projects, written in a variety of languages including COBOL, PL1, HMI, and FORTRAN. For comparisons involving COCOMO, Kemerer coded his 15 projects using the same attributes used by Boehm.

¹We thank Professor Chris Kemerer for supplying this dataset.

One way that Kemerer characterized the fit between the predicted (M_est) and actual (M_act) development person-months was by the magnitude of relative error (MRE):

    MRE = |M_est - M_act| / M_act.

This measure normalizes the difference between actual and predicted development months, and supplies an analyst with a measure of the reliability of estimates by different models. However, when using a model developed at one site for estimation at another site, there may be local factors that are not modeled, but which nonetheless impact development effort in a systematic way. Thus, following earlier work by Albrecht [2], Kemerer did a linear regression/correlation analysis to "calibrate" the predictions, with M_est treated as the independent variable and M_act treated as the dependent variable. The R² value indicates the amount of variation in the actual values accounted for by a linear relationship with the estimated values. R² values close to 1.0 suggest a strong linear relationship and those close to 0.0 suggest no such relationship. Our experiments will characterize the abilities of BACKPROPAGATION and CARTX using the same dimensions as Kemerer: MRE and R².
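Both measures are simple to compute; the sketch below uses fabricated actual and estimated person-months and obtains R² from the calibration regression of M_act on M_est.

    def mre(est, act):
        """Magnitude of relative error for one project."""
        return abs(est - act) / act

    def r_squared(est, act):
        """R^2 of the linear regression of actual on estimated person-months."""
        n = len(est)
        mx, my = sum(est) / n, sum(act) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(est, act))
        sxx = sum((x - mx) ** 2 for x in est)
        syy = sum((y - my) ** 2 for y in act)
        return (sxy * sxy) / (sxx * syy)

    estimated = [86.0, 160.0, 1000.0]     # hypothetical model outputs
    actual = [287.0, 82.5, 1107.3]        # hypothetical actual person-months
    print([round(mre(e, a), 2) for e, a in zip(estimated, actual)],
          round(r_squared(estimated, actual), 2))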
As we noted, each system imposes certain constraints on the representation of data. There are a number of nominally-valued attributes in the project databases, including implementation language. BACKPROPAGATION requires that each value of such an attribute be treated as a binary-valued attribute that is either present (1) or absent (0) in each project. Thus, each value of a nominal attribute corresponded to a unique input to the neural network as noted in Section III-B. We represent each nominal attribute as a set of binary-valued attributes for CARTX as well. As we noted in Section III-A this mitigates certain biases in attribute selection measures such as ΔMSE. In contrast, each continuous attribute identified by Boehm corresponded to one input to the neural network. There was one output unit, which reflected a prediction of development effort and was also continuous. Preprocessing for the neural network normalized these values between 0.0 and 1.0. A simple scheme was used where each value was divided by the maximum of the values for that attribute in the training data. It has been shown empirically that neural networks converge relatively quickly if all the values for the attributes are between zero and one [12]. No such normalization was done for CARTX, since it would have no effect on CARTX's performance.
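The divide-by-maximum preprocessing amounts to a few lines; note that the maximum is taken from the training data only and then reused for new values (the numbers are illustrative).

    def fit_max(training_values):
        return max(training_values)

    def normalize(value, train_max):
        """Scale an attribute value into [0, 1] using the training-set maximum."""
        return value / train_max

    kdsi_train = [10.0, 46.0, 120.0, 320.0]
    m = fit_max(kdsi_train)
    print([normalize(v, m) for v in kdsi_train], normalize(64.0, m))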
A. Experiment I: Comparison with Kemerer's Results

Our first experiment compares the performance of machine learning algorithms with standard models of software development estimation using Kemerer's data as a test sample. To test CARTX and BACKPROPAGATION, we trained each system on COCOMO's database of 63 projects and tested on Kemerer's 15 projects. For BACKPROPAGATION we initially configured the network with 33 input units, 10 hidden units, and 1 output unit, and required that the training set error reach 0.00001 or continue for a maximum of 12 000 presentations of the training data. Training ceased after 12 000 presentations without converging to the required error criterion. The experiment was done on an AT&T PC 386 under DOS. It required about 6-7 hours for 12 000 presentations of the training patterns. We actually repeated this experiment 10 times, though we only report the results of one run here; we summarize the complete set of experiments in Section IV-B.

In our initial configuration of CARTX, we allowed the regression tree to grow to a "maximum" depth, where each leaf represented a single software project description from the COCOMO data. We were motivated initially to extend the tree to singleton leaves, because the data is very sparse relative to the number of dimensions used to describe each data point; our concern is not so much with overfitting, as it is with underfitting the data. Experiments with the regression tree learner were performed on a SUN 3/60 under UNIX, and required about a minute. The predictions obtained from the learning algorithms (after training on the COCOMO data) are shown in Table I with the actual person-months of Kemerer's projects.
TABLE I
CARTX AND BACKPROPAGATION ESTIMATES ON KEMERER'S DATA

    Actual      CARTX       BACKPROP
    287.00      1893.30       86.45
     82.50       162.03       14.14
    1107.31    11400.00     1000.43
    ...

TABLE II
A COMPARISON OF LEARNING AND ALGORITHMIC APPROACHES. THE REGRESSION EQUATIONS GIVE M_act AS A FUNCTION OF M_est (x)

                MRE(%)   R-Square   Regress. Eq.
    CARTX         364      0.83     102.5 + 0.075x
    BACKPROP       70      0.80     78.13 + 0.88x
[TABLE IV: CARTX results with varying training error thresholds; values not recovered.]

[TABLE VI: Sensitivity over 20 randomized trials on Kemerer's data; values not recovered.]

... decisions. A more complete treatment of resampling and other strategies for making configuration choices can be found in Weiss and Kulikowski [24].
... a "case library" and the remaining 5 cases were used to test the model's predictions; the particular cases used for test were not reported, but ESTOR outperformed COCOMO and FUNCTION POINTS on this set.

We do not know the robustness of ESTOR in the face of the kind of variation experienced in our 20 randomized trials (Table VI), but we might guess that rules inferred from expert problem solving, which ideally stem from human learning over a larger set of historical data, would render ESTOR more robust along this dimension. However, our experiments and those of Kemerer with selected subsets of his 15 cases suggest that care must be taken in evaluating the robustness of any model with such sparse data. In defense of Vicinanza et al.'s methodology, we should note that the creation of a case library depended on an analysis of expert protocols and the derivation of expert-like rules for modifying the predictions of best matching cases, thus increasing the "cost" of model construction to a point that precluded more complete randomized trials. Vicinanza et al. also point out that their study is best viewed as indicating ESTOR's "plausibility" as a good estimator, while broader claims require further study.

In addition to experiments with the combined COCOMO and Kemerer data, and the Kemerer data alone, we experimented with the COCOMO data alone for completeness. When experimenting with Kemerer's data alone, our intent was to weakly explore the kind of variation faced by ESTOR. Using the COCOMO data we have no such goal in mind. Thus, this analysis uses an N-fold cross validation or "leave-one-out" methodology, which is another form of resampling. In particular, if a data sample is relatively sparse, as ours is, then for each of N (i.e., 63) projects, we remove it from the sample set, train the learning system with the remaining N - 1 samples, and then test on the removed project. MRE and R² are computed over the N tests. CARTX's R² value was 0.56 (144.48 + 0.74x, t = 8.82) and MRE was 125.2%. In this experiment we only report results obtained with CARTX, since a fair and comprehensive exploration of BACKPROPAGATION across possible network configurations is computationally expensive and of limited relevance. Suffice it to say that over the COCOMO data alone, which probably reflects a more uniform sample than the mixed COCOMO/Kemerer data, CARTX provides a significant linear fit to the data with markedly smaller MRE than its performance on Kemerer's data.
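The leave-one-out procedure is straightforward to express; in the sketch below the learner is a stand-in that predicts the training-sample mean rather than CARTX, and the effort values are fabricated.

    def leave_one_out(projects, train, predict):
        """For each of the N projects, train on the other N-1 and test on the one held out."""
        errors = []
        for i, held_out in enumerate(projects):
            rest = projects[:i] + projects[i + 1:]
            model = train(rest)
            est = predict(model, held_out)
            errors.append(abs(est - held_out["MM"]) / held_out["MM"])   # MRE per test
        return sum(errors) / len(errors)

    # Stand-in learner: always predict the mean effort of the training sample.
    train = lambda rows: sum(r["MM"] for r in rows) / len(rows)
    predict = lambda model, row: model

    projects = [{"MM": 24.0}, {"MM": 120.0}, {"MM": 540.0}, {"MM": 2400.0}]
    print(leave_one_out(projects, train, predict))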
In sum, our initial results indicating the relative merits of a learning approach to software development effort estimation must be tempered. In fact, a variety of randomized experiments reveal that there is considerable variation in the performance of these systems as the nature of historical training data changes. This variation probably stems from a number of factors. Notably, there are many projects in both the COCOMO and Kemerer datasets that differ greatly in their actual development effort, but are very similar in other respects, including SLOC. Other characteristics, which are currently unmeasured in the COCOMO scheme, are probably responsible for this variation.

V. GENERAL DISCUSSION

Our experimental comparisons of CARTX and BACKPROPAGATION with traditional approaches to development effort estimation suggest the promise of an automated learning approach to the task. Both learning techniques performed well on the R² and MRE dimensions relative to some other approaches on the same data. Beyond this cursory summary, our experimental results and the previous literature suggest several issues that merit discussion.

A. Limitations of Learning from Historical Data

There are well-known limitations of models constructed using historical data. In particular, attributes used to predict software development effort can change over time and/or differ between software development environments. Mohanty [13] makes this point in comparisons between the predictions of a wide variety of models on a single hypothetical software project. In particular, Mohanty surveyed approximately 15 models and methods for predicting software development effort. These models were used to predict software development effort of a single hypothetical software project. Mohanty's main finding was that estimated effort on this single project varied significantly over models. Mohanty points out that each model was developed and calibrated with data collected within a unique software environment. The predictions of these models, in part, reflect underlying assumptions that are not explicitly represented in the data. For example, software development sites may use different development tools. These tools are constant within a facility, and thus not represented explicitly in data collected by that facility, but this environmental factor is not constant across facilities.

Differing environmental factors not reflected in data are undoubtedly responsible for much of the unexplained variance in our experiments. To some extent, the R² derived from linear regression is intended to provide a better measure of a model's "fit" to arbitrary new data than MRE in cases where the environment from which a model was derived is different from the environment from which new data was drawn. Even so, these environmental differences may not be systematic in a way that is well accounted for by a linear model. In sum, great care must be taken when using a model constructed from the data from one environment to make predictions about data from another environment. Even within a site, the environment may evolve over time, thus compromising the benefits of previously-derived models. Machine learning research has recently focussed on the problem of tracking the accuracy of a learned model over time, which triggers relearning when experience with new data suggests that the environment has changed [6]. However, in an application such as software development effort estimation, there are probably explicit indicators that an environmental change is occurring or will occur (e.g., when new development tools or quality control practices are implemented).

B. Engineering the Definition of Data

If environmental factors are relatively constant, then there is little need to explicitly represent these in the description of data. However, when the environment exhibits variance along some dimension, it often becomes critical that this variance be codified and included in data description. In this way,
differences across data points can be observed and used in model construction. For example, Mohanty argues that the desired quality of the finished product should be taken into account when estimating development effort. A comprehensive survey by Scacchi [20] of previous software production studies leads to considerable discussion on the pros and cons of many attributes for software project representation.

Thus, one of the major tasks is deciding upon the proper codification of factors judged to be relevant. Consider the dimension of response time requirements (i.e., TIME) which was included by Boehm in project descriptions. This attribute was selected by CARTX during regression-tree construction. However, is TIME an "optimal" codification of some aspect of software projects that impacts development effort? Consider that strict response time requirements may motivate greater coupling of software modules, thereby necessitating greater communication among developers and in general increasing development effort. If predictions of development effort must be made at the time of requirements analysis, then perhaps TIME is a realistic dimension of measurement, but better predictive models might be obtained and used given some measure of software component coupling.

In sum, when building models via machine learning or statistical methods, it is rarely the case that the set of descriptive attributes is static. Rather, in real-world success stories involving machine learning tools the set of descriptive attributes evolves over time as attributes are identified as relevant or irrelevant, the reasons for relevance are analyzed, and additional or replacement attributes are added in response to this analysis [8]. This "model" for using learning systems in the real world is consistent with a long-term goal of Scacchi [20], which is to develop a knowledge-based "corporate memory" of software production practices that is used for both estimating and controlling software development. The machine-learning tools that we have described, and other tools such as ESTOR, might be added to the repertoire of knowledge-acquisition strategies that Scacchi suggests. In fact, Porter and Selby [14] make a similar proposal by outlining the use of decision-tree induction methods as tools for software development.

C. The Limitations of Selected Learning Methods

Despite the promising results on Kemerer's common database, there are some important limitations of CARTX and BACKPROPAGATION. We have touched upon the sensitivity to certain configuration choices. In addition to these practical limitations, there are also some important theoretical limitations, primarily concerning CARTX. Perhaps the most important of these is that CARTX cannot estimate a value along a dimension (e.g., software development effort) that is outside the range of values encountered in the training data. Similar limitations apply to a variety of other techniques as well (e.g., nearest neighbor approaches of machine learning and statistics). In part, this limitation appears responsible for a sizable amount of error on test data. For example, in the experiment illustrating CARTX's sensitivity to training data using 10/5 splits of Kemerer's projects (Section IV-C), CARTX is doomed to being at least a factor of 3 off the mark when estimating the person-month effort required for the project requiring 23.20 M or the project requiring 1107.31 M; the projects closest to each among the remaining 14 projects are 69.90 M and 336.30 M, respectively.

The root of CARTX's difficulties lies in its labeling of each leaf by the mean of development months of projects classified at the leaf. An alternative approach that would enable CARTX to extrapolate beyond the training data would label each leaf by an equation derived through regression, e.g., a linear regression. After classifying a project to a leaf, the regression equation labeling that leaf would then be used to predict development effort given the object's values along the independent variables. In addition, the criterion for selecting divisive attributes would be changed as well. To illustrate, consider only two independent attributes, development team experience and KDSI, and the dependent variable of software development effort. CARTX would undoubtedly select KDSI, since lower (higher) values of KDSI tend to imply lower (higher) means of development effort. In contrast, development team experience might not provide as good a fit using CARTX's error criterion. However, consider a CART-like system that divides data up by an independent variable, finds a best fitting linear equation that predicts development effort given development team experience and KDSI, and assesses error in terms of the differences between predictions using this best fitting equation and actual development months. Using this strategy, development team experience might actually be preferred; even though lesser (greater) experience does not imply lesser (greater) development effort, development team experience does imply subpopulations for which strong linear relationships might exist between independent and dependent variables. For example, teams with lesser experience may not adjust as well to larger projects as do teams with greater experience; that is, as KDSI increases, development effort increases are larger for less experienced teams than more experienced teams. Recently, machine learning systems have been developed that have this flavor [18]. We have not yet experimented with these systems, but the approach appears promising.
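The alternative just described, labeling each leaf with a fitted linear equation rather than a mean, can be sketched as follows; the two-attribute data and the single fixed split on team experience are fabricated, and the systems cited in [18] make such choices automatically.

    def fit_line(xs, ys):
        """Least-squares line y = a + b*x."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
        return my - b * mx, b

    def leaf_model(rows):
        """Label a leaf with a regression of effort on KDSI instead of a mean."""
        a, b = fit_line([r["KDSI"] for r in rows], [r["MM"] for r in rows])
        return lambda r: a + b * r["KDSI"]

    # Fabricated projects: split on team experience, then fit a line within each leaf.
    data = [{"EXP": "low", "KDSI": 10, "MM": 40}, {"EXP": "low", "KDSI": 50, "MM": 400},
            {"EXP": "high", "KDSI": 10, "MM": 25}, {"EXP": "high", "KDSI": 50, "MM": 180}]
    leaves = {exp: leaf_model([r for r in data if r["EXP"] == exp]) for exp in ("low", "high")}
    new_project = {"EXP": "high", "KDSI": 80}
    print(leaves[new_project["EXP"]](new_project))   # can extrapolate beyond the training efforts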
The success of CARTX, and decision/regression-tree learners generally, may also be limited by two other processing characteristics. First, CARTX uses a greedy attribute selection strategy: tree construction assesses the informativeness of a single attribute at a time. This greedy strategy might overlook attributes that participate in more accurate regression trees, particularly when attributes interact in subtle ways. Second, CARTX builds one classifier over a training set of software projects. This classifier is static relative to the test projects; any subsequent test project description will match exactly one conjunctive pattern, which is represented by a path in the regression tree. If there is noise in the data (e.g., an error in the recording of an attribute value), then the prediction stemming from the regression-tree path matching a particular test project may be very misleading. It is possible that other conjunctive patterns of attribute values matching a particular test project, but which are not represented in the regression tree, could ameliorate CARTX's sensitivity to errorful or otherwise noisy project descriptions.