Efficient Software Cost Estimation Using Machine Learning Techniques
A SYNOPSIS
Submitted
in partial fulfillment of the requirements for the award of
the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE AND ENGINEERING
Under the Guidance of
Research Director
Dr. L.S.S. Reddy
Professor & Vice Chancellor
Department of Computer Science & Engineering
KL University, Vaddeswaram
Department of Computer Science and Engineering
Acharya Nagarjuna University
Nagarjuna Nagar – 522 510
Andhra Pradesh, India
May 2014
CONTENTS

Abstract
1. Introduction
1.2 Limitations
2. Literature Survey
3. Research Contributions
4. Results
References
ABSTRACT
To solve the cost estimation problems in the product life cycle, we propose different machine learning techniques: holdback, k-fold, excluded rows, decision tree, boosted tree, and bootstrap forest. Good interpretability is the major advantage of our approach, which combines cross-validation techniques (holdback, k-fold, excluded rows) and partition-based methods (decision tree, boosted tree, bootstrap forest). Because machine learning techniques are applied to software estimation, the approach attempts to simulate human lines of thought. The other benefit of this research work is that expert knowledge, project data, and traditional algorithmic models can be put together into one general framework and used in applications such as software cost estimation, software quality estimation, and risk analysis. The performance of these cost estimation models is compared with statistical regression analysis. The validation outcomes and performance results reveal that these models provide more accurate estimates than conventional models.
1. Introduction
the models which are developed by integrating cost factors. Additionally, statistical techniques also play a major role in developing such models. A non-algorithmic model does not make use of mathematical functions to estimate the cost of the software product.
1.1.1.5 Parkinson’s Law
1.1.2.3 COCOMO II Model
It can be expressed as

E = 5.5 + 0.73 * (KLOC)^1.16

The above model was developed by Bailey and Basili, and the metric used is lines of source code [12].
1.1.2.7 Halstead Model

T = S * X^(1/3) * Y^(4/3)

X = (1/Y^4) * (T/S)^3
1.2 Limitations
2. Literature Survey
When effort is overestimated, the cost of the software product increases, the project is delivered later than expected, and the use of those resources on the next project is delayed. Four steps are followed to estimate a software project [15]: 1) estimate the size of the software product; 2) estimate the effort in person-months or person-hours; 3) estimate the schedule in calendar months; 4) estimate the cost of the project in dollars or local currency.
To get a reliable software cost estimate, we need to do much
more than just put numbers into a formula and accept the results.
The software cost estimation process involves seven steps, which shows that cost estimation is itself a mini project requiring planning, reviews, and so on.
The seven basic steps are:
1. Establish Objectives
2. Plan for Required Data and Resources
3. Pin Down Software Requirements
4. Work Out as Much Detail as Feasible
5. Use Several Independent Techniques and Sources
6. Compare and Iterate Estimates
7. Follow-up
2.1 MACHINE LEARNING TECHNIQUES
2.1.1 Fuzzy method
The systems using this Fuzzy logic can simulate human behavior
and reasoning. Fuzzy system tools efficiently solve the challenges in
the situations like decision making is hard and huge number of
constraints. These systems always takes the facts that ignored by
other approaches [16].
and cost, and so on. When we inspect the proposed product, if any new feature is identified, it is possible to recall the stored features [17].
2.1.3 Rule-Based Systems
Rule-based systems are very useful for estimating software effort in some cases. A rule-based system stores a set of rules in its knowledge base, which are compared against the features of the proposed project. When a matching rule is found, that rule fires, which can generate a new fact and form a chaining effect. The chaining continues until there are no more rules to fire, and the result is displayed to the user [17].
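As an illustration only, the following Python sketch shows such a chaining loop; the facts and rules are invented for the example and are not taken from the synopsis.

    # Minimal forward-chaining sketch for a rule-based effort estimator.
    # The facts and rules below are hypothetical examples.
    facts = {"size": "large", "team": "experienced"}

    # Each rule pairs a condition on the known facts with the fact it derives.
    rules = [
        (lambda f: f.get("size") == "large", ("effort", "high")),
        (lambda f: f.get("effort") == "high" and f.get("team") == "experienced",
         ("schedule", "moderate")),
    ]

    fired = True
    while fired:  # chain until no rule can fire any more
        fired = False
        for condition, (key, value) in rules:
            if condition(facts) and facts.get(key) != value:
                facts[key] = value  # firing a rule stores the derived fact
                fired = True

    print(facts)  # the derived facts form the displayed result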
2.1.4 Regression Trees
Regression trees are widely used in software engineering to estimate the cost of software. A regression tree is constructed by investigating the attributes and selecting the most informative ones. An algorithm is applied to these informative attributes to construct the tree, splitting on attribute values where appropriate. Each split divides a node into exactly two branches. The resulting readable structure can also be checked for logical errors [17].
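As a minimal sketch of this idea, the fragment below fits a binary regression tree with scikit-learn; the feature names (kloc, team_size) and the synthetic effort data are assumptions made for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 100, size=(200, 2))                     # e.g. [KLOC, team size]
    y = 2.5 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 5, 200)  # effort (person-months)

    # The tree splits on the most informative attribute values; every
    # split is binary, and the printed form can be inspected for errors.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["kloc", "team_size"]))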
2.1.5 Artificial Neural Networks (ANNs)
ANNs are constructed based on the architecture of biological neural networks, and they are also known as parallel systems. An artificial neural network is a collection of interconnected neurons. Each neuron applies a step function to the weighted sum of its inputs, and the derived output, which may be positive or negative, is given as input to other neurons in the network. The feed-forward multilayer perceptron is one of the most widely used ANN models. Its neurons are initialized with random weights, and the network learns the associations between inputs and outputs by altering the weights as the training set of input-output pairs is presented. Most studies use ANNs to estimate development effort and compare their accuracy to algorithmic approaches, rather than examining the suitability of this approach to the estimation process [18][19].
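A minimal sketch of such a feed-forward network, using a scikit-learn multilayer perceptron (which uses a smooth activation rather than a hard step); the cost drivers and their relationship to effort are invented for illustration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    X = rng.uniform(1, 100, size=(300, 3))                     # e.g. size, complexity, team
    y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 300)  # effort

    # Weights start random; training adjusts them to learn the
    # associations between the inputs and the output.
    ann = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=1),
    )
    ann.fit(X, y)
    print("Training RSquare:", ann.score(X, y))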
3. Research Contributions
3.1 Cross Validation
Cross-validation is a statistical method in which the data are separated into two or more segments in order to evaluate and compare the performance of learning algorithms: one segment is used as the training set to learn the model, and the other segment is used to validate the model. In cross-validation, the training and validation sets must cross over in successive rounds so that each data point gets a chance to be validated. K-fold cross-validation is the basic form of cross-validation; other types of cross-validation involve repeated rounds and are special cases of k-fold cross-validation [20]. Cross-validation is used to assess and compare the performance of diverse learning algorithms. In each round, the learning algorithm uses k-1 folds of data to learn the model, and the learned model makes predictions about the data in the remaining validation fold. After the model has been built, it is applied to each validation fold in turn to measure performance metrics such as accuracy. Once the metrics have been measured, k samples of each metric are available; averaging the k samples gives the aggregate measure, and the samples can also be used in a statistical hypothesis test to show that one algorithm is superior to another.
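The following sketch mirrors this procedure with scikit-learn's k-fold utilities: two learning algorithms are scored on the same five folds and the per-fold statistics are averaged. The data are synthetic placeholders.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)
    X = rng.uniform(0, 10, size=(100, 2))
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 1, 100)

    # Every data point lands in the validation fold exactly once; the k
    # per-fold scores are averaged into one aggregate measure per model.
    cv = KFold(n_splits=5, shuffle=True, random_state=2)
    for model in (LinearRegression(), DecisionTreeRegressor(max_depth=4)):
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        print(type(model).__name__, "mean RSquare:", scores.mean())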
3.1.1 K Fold
The original data is divided into k subsets. Each time, one of the k subsets is used as the validation set and the other k-1 subsets are put together to form a training set to construct the model. The model that gives the best validation statistic is chosen as the best model. Cross-validation is the best choice for small datasets, because it makes efficient use of a limited amount of data.
The steps to use k-fold cross-validation are:
• To use k-fold cross-validation, the validation column should consist of four or more distinct values.
3.1.2 Hold Back
The holdback method divides the original data into a training data set and a validation data set. The validation portion specifies the proportion of the original data to use as the validation (holdback) data set.
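A minimal holdback sketch; the data are synthetic and the 30% holdback proportion is an arbitrary example.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(100, 2))
    y = X.sum(axis=1) + rng.normal(0, 1, 100)

    # Hold back 30% of the rows as the validation set.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.3, random_state=3
    )
    model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
    print("Validation RSquare:", model.score(X_valid, y_valid))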
3.1.3 Excluded Rows Method
Validation techniques are used to discover the variables for a classifier and also to assess the accuracy of the model: 1) the training set is used to estimate the parameters of the model; 2) the test set is used to assess or confirm the predictive accuracy of the model; 3) the validation set is likewise used to assess or confirm the predictive capability of the classifier [21]. The excluded rows method uses row states to partition the data: records that are excluded in this way are treated as the validation data, and the remaining records act as the learning data set [22].
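In code, the excluded-rows idea can be imitated with a boolean row mask, as in this sketch; choosing every fifth row as excluded is an arbitrary example.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 10, size=(100, 2))
    y = X.sum(axis=1) + rng.normal(0, 1, 100)

    # The row state: True marks a record as excluded (validation data).
    excluded = np.zeros(100, dtype=bool)
    excluded[::5] = True

    # Learn on the remaining rows, validate on the excluded ones.
    model = DecisionTreeRegressor(max_depth=4).fit(X[~excluded], y[~excluded])
    print("RSquare on excluded rows:", model.score(X[excluded], y[excluded]))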
3.2.1 Decision Tree
There are two types of learning techniques: supervised and unsupervised. The decision tree is a successful supervised learning technique. In this section, all the features representing the data have finite discrete domains, and the single target feature is called the classification [23]. Each element of the classification domain is called a class. In a decision tree, every non-leaf node is labeled with a test feature, the arcs leaving that node are labeled with the possible values of the feature, and every leaf node is labeled with a class name.
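This labeling is easy to see in a printed tree. The sketch below uses the standard Iris data set purely for illustration; its features are continuous, so the arcs are threshold comparisons rather than discrete values.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    # Non-leaf nodes are labeled with test features and split values;
    # the leaves report the predicted class.
    print(export_text(clf, feature_names=list(iris.feature_names)))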
3.2.2 Boosted Tree
Boosting plays a major role in building a large, efficient final tree by fitting a series of small trees: each small tree is fit on the residuals of the previous tree, and all the small trees together form the larger final tree. The boosting process can use the validation set to estimate the number of stages required to fit the model, and a threshold number of stages can also be specified.

There are one to five splits at each stage of the tree, so each stage stays short. After the initial tree is constructed, each subsequent stage is fit on the residuals of the previous stage. The process iterates until it reaches the threshold number of stages or, when a validation set is used, until fitting an additional stage no longer improves the validation statistic. The final prediction is the sum of the terminal-node estimates of all the stages. If the response variable is categorical, the stages fit the residuals as offsets on a linear-logistic scale, and the sum of the linear logits across all stages is converted by the logistic transformation to predict the final target value.
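A sketch of the same idea using scikit-learn's gradient boosting, an analogous rather than identical implementation; the data are synthetic and the parameter values are arbitrary examples.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 10, size=(500, 3))
    y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 1, 500)

    boost = GradientBoostingRegressor(
        max_depth=2,              # short trees at every stage
        n_estimators=500,         # threshold on the number of stages
        validation_fraction=0.2,  # holdback used to decide when to stop
        n_iter_no_change=5,       # stop once validation stops improving
        random_state=5,
    ).fit(X, y)
    print("Stages actually fit:", boost.n_estimators_)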
3.2.3 Bootstrap Forest Tree
The bootstrap technique has very good features that are useful for determining an optimal model: classification models such as decision trees are built on the data and on chosen subsets of the features. Decision tree partitioning methods can be generated using the bootstrapping methodology. Because the bootstrap forest method generally uses decision trees, it is a popular method across industries. The bootstrap forest develops many decision trees, which are used together to obtain the final predicted value. Each decision tree is constructed with random sampling with replacement. The bootstrap forest methodology can place a threshold not only on the splitting criteria but also on the randomly selected samples. The bootstrap forest method can be considered an option wherever decision trees are applicable.
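A minimal random-forest sketch in the same spirit; the data are synthetic and the parameter values are arbitrary examples.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(6)
    X = rng.uniform(0, 10, size=(500, 4))
    y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(0, 1, 500)

    forest = RandomForestRegressor(
        n_estimators=100,     # number of bootstrapped trees
        max_features="sqrt",  # random feature subset tried at each split
        bootstrap=True,       # rows sampled with replacement per tree
        random_state=6,
    ).fit(X, y)

    # The final prediction averages the predictions of all the trees.
    print("Predicted value:", forest.predict(X[:1])[0])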
4. Results
The cross-validation technique uses the analysis results to characterize an individual data set, and it comprises only two stages: the learning stage and the estimation stage [24].
Table 4.1: Measures of fit

S.No  Measure         Value
1     RSquare         0.9991672
2     RMSE            0.3027337
3     Mean Abs Dev    0.2233284
4     Log Likelihood  1826.7952
5     SSE             747.29514
6     Sum Freq        8154
Table 4.2: Different datasets with different parameters using Bootstrap Forest

S.No  Dataset     Features  Instances  RSquare  RMSE       N
1     AILERONS    35        13750      0.843    0.0001615  13750
2     CALIFORNIA  9         20640      0.797    51955.496  20640
3     ELEVATORS   19        16599      0.749    0.0033665  16599
4     HOUSE       17        22786      0.651    31238.959  22784
5     TIC         86        9822       0.275    0.2017361  9822
[Graph: RMSE values of the effort estimation approaches (K-Fold, Holdback, Excluded Rows Holdback).]
The following graph 4.4 depicts the differences among the various effort estimation approaches using their RSquare (coefficient of determination) values.
[Graph 4.4: RSquare values of the effort estimation approaches (K-Fold, Holdback, Excluded Rows Holdback, Decision Trees, Bootstrap Forest, Boosted Tree).]
REFERENCES
[15] http://www.icstars.org/files/estbasics.pdf
[20] http://www.jmp.com/support/help/Using_Validation.shtml
[23] http://www.mwsug.org/proceedings/2012/JM/MWSUG-2012-JM04.pdf