
EFFICIENT SOFTWARE COST ESTIMATION

USING MACHINE LEARNING TECHNIQUES

A SYNOPSIS
Submitted
in partial fulfillment of the requirements for the award of
the degree of

DOCTOR OF PHILOSOPHY
in

FACULTY OF COMPUTER SCIENCE AND ENGINEERING


By
KATTA SUBBA RAO

Under the Guidance of
Research Director
Dr. L.S.S. Reddy
Professor & Vice Chancellor
Department of Computer Science & Engineering
KL University, Vaddeswaram

Department of Computer Science and Engineering
Acharya Nagarjuna University
Nagarjuna Nagar – 522 510
Andhra Pradesh, India
May 2014
CONTENTS
Content Page No

Abstract 2

1. Introduction 4

1.1 Existing Systems 4

1.2 Limitations 9

2. Literature Survey 10

3. Research Contribution 12

3.1 Cross Validation 12

3.2 Partition Techniques 13

4. Results 15

Publications out of this work 17

References 18


 
ABSTRACT

In software development, the major issues for successful project planning and management are cost estimation and effort allocation. The cost estimation process involves various steps, software tools, algorithms, and assumptions. Research in software engineering has focused on providing timely estimates of probable software development cost, and many researchers have worked on different models and algorithms to predict that cost accurately. Researchers have begun to argue that prediction performance depends on the structure of the data rather than on the models. Industry competitiveness depends on the cost, performance, and timely delivery of the product; thus an accurate, adaptable, and robust cost estimation model covering the whole product life cycle is essential. The major issues in software project estimation are quality estimation, cost estimation, and risk analysis. Software project attributes are commonly rated as very high, high, low, and very low, and the nature of such attributes introduces inaccuracy, uncertainty, and vagueness into their interpretation. Because imprecision and uncertainty are associated with these attribute values, we believe that software cost estimation models should be able to deal with them. However, models that describe software projects only through classical intervals and numeric values cannot directly tolerate imprecision and uncertainty. In this thesis we present a machine learning framework to tackle this challenging problem. More recently, the focus in predicting software development effort has turned to machine learning and soft computing techniques. Soft computing is the composition of methodologies such as fuzzy logic (FL), case-based reasoning, rule-based systems, and artificial neural networks (ANNs); these are the existing methods, and they address some of the problems of earlier models. To solve cost estimation problems across the product life cycle, we propose different machine learning techniques: the holdback, k-fold, and excluded-rows cross-validation techniques and the decision tree, boosted tree, and bootstrap forest partition-based methods (the proposed methods). Good interpretability is a major advantage of our approach, which applies machine learning to software estimation and thereby tries to simulate human lines of reasoning. A further benefit of this research is that expert knowledge, project data, and conventional algorithmic models can be put together into one general framework and applied to software cost estimation, software quality estimation, and risk analysis. The performance of the cost estimation models is compared against statistical regression analysis, and the evaluation outcomes reveal that these models perform more accurately than conventional models.


 
1. Introduction

The major issues in software project management are the cost of the software and the estimation of effort. The quality of the software that can be achieved depends on how well cost and effort are calculated. In software project management, developers must put in effort to decide what resources are to be used and how the software can use those resources efficiently, for both cost estimation and effort estimation. Several parameters are required to develop software accurately and efficiently: risk analysis, development-time estimation, estimation of software development cost, estimation of team size, and effort estimation. Parameters such as effort and development cost must be estimated at the very beginning of software development; hence the need for a sound model to calculate them precisely. The research community has worked on software effort estimation for the past few decades, and researchers have developed different conventional models to estimate the size, effort, and cost of software. The conventional models demand inputs, but extracting those inputs is a challenging issue in the early stages. Additionally, conventional models take their parameter values from experience and have no reasoning capability over the estimation factors of software development. Because of these pitfalls in conventional algorithmic techniques, non-algorithmic models based on soft computing techniques were introduced, including genetic algorithms, neural networks, and fuzzy logic. Non-algorithmic models were introduced because real-life software development demands a great deal of flexibility.

1.1 Existing Systems

Cost estimation models are divided into two types: algorithmic models and non-algorithmic models. Algorithmic models make use of functions to estimate the cost of a product; these functions are derived from models built by integrating cost factors, and statistical techniques also play a major role in developing them. A non-algorithmic model does not make use of functions to estimate the cost of the software product.

1.1.1 Non Algorithmic Based Estimation Methods

1.1.1.1 Expert Judgment Method

The expert judgment method makes use of an expert or a group of experts, relying on their experience and knowledge of the product at hand to estimate its cost [1, 2, 3]. It is among the more popular methods for estimating the cost of software.

1.1.1.2 Estimating by analogy

Estimating by analogy compares the proposed product to similar, already completed projects whose information is known. It makes use of the actual data from previous projects to estimate the proposed project. This technique can be applied at either the system level or the component level [4].

1.1.1.3 Top-Down Estimating Method

Top-down estimating is also known as the macro model. In this method, the cost of the software is estimated from known global properties of the software project; after the cost is estimated, the project is fragmented into different low-level components. The Putnam model is a popular top-down model [1, 2].

1.1.1.4 Bottom-Up Estimating Method

This method estimates the overall cost of the project by taking the cost of each segmented software component and combining the outputs. The main goal of this model is to bring knowledge of the small components into the estimation process, analyse them, and then combine the results. A popular technique using this model is the COCOMO detailed model [2].


 
1.1.1.5 Parkinson’s Law

Parkinson's Law says, "Work expands to fill the available volume". This method estimates the cost from the resources at hand rather than from an assessment of objectives. For example, if the software must be delivered within 12 months and only 5 programmers are available, the estimated effort is 60 person-months. In some cases this gives a fine estimate, but it can also produce quite impractical ones, so it is not recommended for real-world software engineering [5].

1.1.2 Algorithmic Method

1.1.2.1 Basic Cost Model

E = X * S^Y * m(Z), where S is the size of the software, X is a software constant factor, Y lies in the range 1 to 1.5, Z is the cost factor, and m is the multiplier adjustment. Algorithmic models make use of functions consisting of mathematical equations to estimate the cost of the software. These functions are derived from research, historical project databases, source lines of code (SLOC), the number of functions, and other cost drivers such as language, design methodologies, skill levels, and risk assessments. Researchers have studied these algorithmic models and developed methods based on them, such as the COCOMO model, the Putnam technique, and function point techniques [1][6].

1.1.2.2 COCOMO Model

One of the most popular algorithmic cost estimation models is the constructive cost model (COCOMO) [7][8]. Its hierarchy levels are:

Model 1 (Basic COCOMO Model)
Model 2 (Intermediate COCOMO Model)
Model 3 (Detailed COCOMO Model)


 
1.1.2.3 COCOMO II Model

This method incorporates the features of the application composition model, the early design model, and the post-architecture model; COCOMO II is an extension of the intermediate COCOMO model [9].

It can be expressed as

EFFORT = 2.9 (KLOC)^1.10

1.1.2.4 SEL – Model

This model was developed by the Software Engineering Laboratory at the University of Maryland to estimate software effort [10]. It can be expressed as:

Effort = 1.4 * (KLOC)^0.93

Duration DU = 4.6 (KLOC)^0.26

Effort is measured in person-months; KLOC (thousands of lines of code) is used as the predictor.

1.1.2.5 Walston-Felix Model

This model was developed by Walston and Felix in 1977 from a database of 60 projects at the IBM Federal Systems Division, analysing their different features. It expresses its metric in delivered lines of source code, and it accounts for factors such as user participation, customer-originated changes, and memory constraints. The functions for estimating effort and duration are [9][11]:

Effort = 5.2 (KLOC)^0.91

Duration d = 4.1 (KLOC)^0.36

1.1.2.6 Bailey-Basili Model

This model was developed by Bailey and Basili; its metric is lines of source code [12]:

Effort = 5.5 (KLOC)^1.16


 
1.1.2.7 Halstead Model

This model was developed by Halstead, based on delivered lines of code [10]:

Effort = 0.7 (KLOC)^1.50

1.1.2.8 Doty Model (KLOC > 9)

The Doty technique's expression for estimating effort is [10]:

Effort = 5.288 (KLOC)^1.07
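The power-law models above differ only in their two constants, so they can be compared side by side. The sketch below uses the coefficients quoted in the text; the 10-KLOC project size is an assumed example.

```python
# Effort (person-months) = a * KLOC^b for the models listed above.
MODELS = {
    "COCOMO II":     (2.9,   1.10),
    "SEL":           (1.4,   0.93),
    "Walston-Felix": (5.2,   0.91),
    "Bailey-Basili": (5.5,   1.16),
    "Halstead":      (0.7,   1.50),
    "Doty":          (5.288, 1.07),
}

def effort(a, b, kloc):
    return a * kloc ** b

for name, (a, b) in MODELS.items():
    print(f"{name:14s} {effort(a, b, 10.0):7.2f} PM")   # assumed 10-KLOC project
```

Note how widely the estimates diverge for the same project size, which is one reason this thesis turns to data-driven techniques.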

1.1.2.9 Putnam Model

A practical effort estimation model used in software engineering is the Putnam model. Putnam inspected many results about products and then formed the expression:

S = T * X^(1/3) * Y^(4/3)

which gives X = (S / T)^3 / Y^4

where S is the size estimated in LOC, T is the technology constant, X is the total effort in person-months, and Y is the development time in years. T is independent of the parameters used in the development process and is derived from historical project data [13].

Rating: T = 2,000 (poor), T = 8,000 (good), T = 12,000 (excellent).

The Putnam model is very sensitive to development time: effort falls off as a high power of the schedule, so if the model tries to decrease the development time, the person-months automatically increase [2][14].

The Putnam model estimates the size of the software from the historical data stored in a database. Its main drawback is that it estimates software size inaccurately, and the values are also uncertain.
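Under the relationship as reconstructed above (S = T · X^(1/3) · Y^(4/3)), total effort follows as X = (S/T)^3 / Y^4. The sketch below illustrates this; the 50,000-LOC, two-year scenario is an assumed example.

```python
# Putnam model: effort implied by size, technology constant and schedule.
# X = (S / T)^3 / Y^4, per the reconstructed relationship above.

def putnam_effort(size_loc, tech_constant, dev_years):
    return (size_loc / tech_constant) ** 3 / dev_years ** 4

x = putnam_effort(50_000, 8_000, 2.0)   # "good" environment, T = 8000
print(round(x, 2))

# The Y^4 term is why the model is so sensitive to development time:
# halving the schedule multiplies the effort by 16.
print(round(putnam_effort(50_000, 8_000, 1.0) / x))
```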


 
1.2 Limitations

Cost estimation is the key challenge for meaningful software planning and management in software development. It is used to reason about the cost and time needed for a piece of software. The basic inputs for software cost estimation are the code size and a set of cost drivers; the output is the estimated effort in person-months (PM). Cost estimation is an exacting and difficult task, and various techniques are presently used to estimate the cost of a proposed project. The major issues in software project estimation are quality estimation, cost estimation, and risk analysis. Software project attributes are rated as very high, high, low, and very low, and the nature of such attributes introduces inaccuracy, uncertainty, and vagueness into their interpretation. Because imprecision and uncertainty are associated with these attribute values, we believe that software cost estimation models should be able to deal with them. However, models that describe software projects only through classical intervals and numeric values cannot directly tolerate imprecision and uncertainty.


 
2. Literature Survey
Poor estimates increase the cost of the software product, delay delivery beyond what was expected, and postpone the use of those resources on the next project. Four steps are followed to estimate a software project [15]: 1) estimate the size of the software product; 2) estimate the effort in person-months or person-hours; 3) estimate the schedule in calendar time; 4) estimate the cost of the project in dollars or local currency.
To get a reliable software cost estimate, we need to do much more than just put numbers into a formula and accept the results. The software cost estimation process involves seven steps, which shows that cost estimation is a mini-project in itself, requiring planning, reviews, and so on.
The seven basic steps are
1. Establish Objectives
2. Plan for Required Data and Resources
3. Pin Down Software Requirements
4. Work Out as Much Detail as Feasible
5. Use Several Independent Techniques and Sources
6. Compare and Iterate Estimates
7. Follow-up
2.1 MACHINE LEARNING TECHNIQUES
2.1.1 Fuzzy method
Systems using fuzzy logic can simulate human behavior and reasoning. Fuzzy system tools efficiently solve challenges in situations where decision making is hard and the constraints are numerous. These systems take account of facts that other approaches ignore [16].

2.1.2 Case Based Reasoning


Case-based reasoning is a technique that collects and stores the results of inspections of past products, such as effort, language used, cost, and so on. When the proposed product is inspected and a familiar feature is identified, the stored feature can be recalled [17].
2.1.3 Rule-Based Systems
Rule-based systems are useful for estimating software effort in some cases. A rule-based system stores a set of rules in its knowledge base, and these are compared against the features of the proposed project. When a matching rule is found, it fires, which may in turn trigger new rules in a chaining effect; the chaining continues until no more rules fire, and the result is displayed to the user [17].
2.1.4 Regression Trees
Regression trees are widely used in software engineering to estimate cost. A regression tree is constructed by investigating the attributes and selecting the most informative ones; an algorithm then builds the tree from these informative attributes, splitting on attribute values where appropriate. Each split produces exactly two child nodes. The resulting explicit format also makes it possible to check for logical errors [17].
2.1.5 Artificial Neural Networks (ANNs)
ANNs, also known as parallel systems, are constructed on the architecture of biological neural networks. An artificial neural network is a collection of interconnected neurons. Each neuron applies a step function to the weighted sum of its inputs, and the derived output, positive or negative, is given as input to other neurons in the network. The feed-forward multilayer perceptron is one of the most widely used ANN models. Neurons are initialised with random weights, and the network learns the associations between inputs and outputs by altering the weights as the training set is presented. Work on ANNs for estimating development effort has compared their accuracy to algorithmic approaches rather than examined the suitability of the approach to the estimation process [18][19].
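A single neuron of the kind described, a step function applied to a weighted input sum, can be sketched as follows; the weights and bias are assumed toy values.

```python
# One artificial neuron: step activation over the weighted input sum.

def step(total):
    return 1 if total >= 0 else 0

def neuron(inputs, weights, bias):
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

# Toy weights making the neuron fire only when both inputs are active.
print(neuron([1, 1], [0.5, 0.5], -0.7))   # fires
print(neuron([1, 0], [0.5, 0.5], -0.7))   # stays silent
```

In a full network, many such neurons are wired in layers and the weights are adjusted during training rather than fixed by hand.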

3. Research Contributions
3.1 Cross Validation
Cross-validation is a statistical method in which the data are separated into segments for evaluating and comparing the performance of learning algorithms: one segment is used as the training set to learn the model, and the other segment is used to validate the model. The training and validation sets pass through a sequence of rounds such that each data point gets a chance to be validated. K-fold cross-validation is the fundamental form of cross-validation; other types involve repeated rounds and are subtypes of k-fold cross-validation [20]. Cross-validation is used to assess and differentiate the performance of diverse learning algorithms. The learning algorithm uses k−1 folds of the data to learn the model, and the learned model then makes predictions on the data in the held-out validation fold. Performance metrics such as accuracy are measured on each fold in turn, so after all rounds k samples are available; these are averaged to obtain an aggregate measure, and they can also be used in a statistical hypothesis test to show that one algorithm is superior to another.
3.1.1 K Fold
The original data is divided into k subsets. Each time, one of the k subsets is used as the validation set and the other k−1 subsets are put together to form the training set used to construct the model; the model with the best validation statistic is chosen as the best model. Cross-validation is preferred for small datasets because it makes efficient use of a limited amount of data. To use k-fold cross-validation, the validation column should consist of four or more distinct values.
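The k-fold procedure can be sketched in a few lines of pure Python; the interleaved fold assignment below is one simple choice for illustration, not the assignment used by the thesis tooling.

```python
# k-fold partitioning: every row serves as validation exactly once.

def k_fold_indices(n_rows, k):
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j in range(k) if j != i for r in folds[j]]
        yield training, validation

seen = []
for training, validation in k_fold_indices(12, 4):
    assert len(training) + len(validation) == 12
    seen.extend(validation)
assert sorted(seen) == list(range(12))   # each row validated once
```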

3.1.2 Hold Back
The hold-back method divides the original data into training and validation data sets. The validation portion specifies the proportion of the original data to use as the validation (held-back) data set.
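A hold-back split of this kind can be sketched as below; the 25% validation portion is an assumed example.

```python
import random

# Hold back a fixed proportion of shuffled rows as the validation set.
def holdback_split(rows, validation_portion, seed=0):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * validation_portion)
    return shuffled[cut:], shuffled[:cut]   # (training, validation)

training, validation = holdback_split(list(range(100)), 0.25)
print(len(training), len(validation))   # 75 25
```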
3.1.3 Excluded Rows Method
Validation techniques are used both to discover the variables for a classifier and to assess the accuracy of the model: 1) the training set is used to fit the parameters of the model; 2) the test set is used to estimate or confirm the predictive accuracy of the model; 3) the validation set is likewise used to assess the predictive capability of the classifier [21]. The excluded-rows method uses row states to partition the data: records excluded in this way are treated as the validation data set, and the remaining records act as the learning data set [22].

3.2 Partition Techniques


The partition technique considers the relations between the X and Y values and recursively partitions the data into a tree. It finds, among all possible subdivisions and clusterings of the X values, the one that best forecasts the Y value, and keeps partitioning the data until the desired model fit is reached, choosing the optimum splits from a vast number of possible ones. The platform offers three methods for growing the final predictive tree: decision tree, bootstrap forest, and boosted tree.
3.2.1 Decision Tree
The decision tree is the most common learning technique in data mining. The main aim is to develop a model that predicts the value of a target variable from several input variables. Every interior node tests one of the input variables, the edges to its children correspond to the possible values of that variable, and each leaf represents a value of the target variable; the path from root to leaf spells out an assignment of input-variable values. A decision tree is thus a simple method for classifying examples. Of the two kinds of techniques, supervised and unsupervised, the decision tree is a successful supervised learning technique. In this section, all the features representing the data have finite discrete domains, and the single target feature is called the classification [23]; an element of its domain is called a class. In a decision tree, every non-leaf node is labelled with a test feature, the arcs coming from a labelled node are labelled with the possible values of that feature, and all leaf nodes of the tree are labelled with class names.
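One partitioning step, choosing the binary split that minimises the summed squared error of the two resulting groups, can be sketched as follows; the size/effort numbers are assumed toy data.

```python
# Find the best binary split of a single predictor by squared error.

def sse(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    best = (None, float("inf"))
    for t in sorted(set(xs))[1:]:            # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = sse(left) + sse(right)
        if score < best[1]:
            best = (t, score)
    return best

# Effort jumps once size passes 30 KLOC, so the split lands there.
sizes = [10, 20, 25, 30, 40, 50]
efforts = [5, 6, 6, 20, 22, 25]
print(best_split(sizes, efforts))
```

A full tree applies the same step recursively to each side of the split until a stopping rule is met.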
3.2.2 Boosted Tree
Boosting plays the major role in building a large, efficient decision tree as a series of small trees: each smaller tree is fitted to the residuals of the trees before it, and all of the constructed trees together form the larger final tree. The boosting process can use the validation set to estimate the number of stages required to fit the model, and a threshold can also be specified for the number of stages. There are 1 to 5 splits at each stage of the tree, so each stage is short. After the initial tree is constructed, each subsequent stage is fitted to the residuals of the previous stage, and the process iterates until it reaches the stage threshold; when validation is used, fitting stops once an additional stage no longer improves the validation statistic. The terminal-node estimates of all the stages are summed to form the final prediction. If the response variable is a categorical attribute, linear-logistic offsets are used to fit the residuals of each stage, and the sum of the linear logits over all stages is converted with the logistic transformation function to predict the final target value.
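The staged residual fitting described above can be sketched with one-split stumps standing in for the small trees. The data, the stage count, and the use of full leaf means (real boosted trees shrink each stage's contribution) are assumptions of this sketch.

```python
# Boosting sketch: each stage fits the residuals of the stages before it,
# and the final prediction is the sum of all stage outputs.

def fit_stump(xs, residuals):
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, stages=20):
    stumps, preds = [], [0.0] * len(ys)
    for _ in range(stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(s(x) for s in stumps)

model = boost([10, 20, 30, 40], [5, 6, 20, 22])
print(round(model(20), 2), round(model(40), 2))
```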
3.2.3 Bootstrap Forest
The bootstrap technique has features that are useful for determining an optimal model; it builds classification models such as decision trees from the data and a chosen subset of the features. Decision-tree partitioning can be generated using the bootstrapping methodology, and the bootstrap-forest method is popular across industries. The bootstrap forest develops many decision trees, and these are combined to obtain the final predicted value; each tree is constructed from a random sample drawn with replacement. The bootstrap-forest methodology can put thresholds not only on the splitting criteria but also on the randomly selected samples. The bootstrap forest is considered an optional method within the applicability of decision trees.
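The bootstrap-forest procedure — many trees, each grown on a sample drawn with replacement, averaged at the end — can be sketched with one-split stumps standing in for full trees; the data and tree count are assumptions of the sketch.

```python
import random

# Best single split of a (possibly repeated) bootstrap sample.
def fit_stump(xs, ys):
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:                        # degenerate sample: one x value
        mean = sum(ys) / len(ys)
        return (min(xs), mean, mean)
    return best[1:]

def bootstrap_forest(xs, ys, n_trees=100, seed=1):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]   # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(lm if x < t else rm for t, lm, rm in stumps) / len(stumps)

model = bootstrap_forest([10, 20, 30, 40, 50], [5, 6, 20, 22, 25])
print(round(model(10), 1), round(model(50), 1))
```

Averaging over the bootstrap replicates is what stabilises the prediction relative to a single tree.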
4. Results
The cross-validation technique uses demographic analysis results to characterise an individual data set. It comprises only two stages: a learning stage and an estimation stage [24].

Table-4.1 Validation report for the MV dataset using the random holdback method

S.No  Measure         Validation Y
1     RSquare         0.9991672
2     RMSE            0.3027337
3     Mean Abs Dev    0.2233284
4     Log likelihood  1826.7952
5     SSE             747.29514
6     Sum Freq        8154
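The measures reported in Table 4.1 (RSquare, RMSE, mean absolute deviation, SSE) follow directly from actual and predicted responses; the toy numbers in the sketch below are assumed for illustration, not taken from the MV dataset.

```python
# Validation measures computed from actual vs. predicted responses.

def validation_measures(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e ** 2 for e in errors)
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "RSquare": 1 - sse / sst,        # share of variance explained
        "RMSE": (sse / n) ** 0.5,        # root mean squared error
        "Mean Abs Dev": sum(abs(e) for e in errors) / n,
        "SSE": sse,
    }

m = validation_measures([5, 6, 20, 22], [5.2, 6.1, 19.5, 22.4])
print({k: round(v, 4) for k, v in m.items()})
```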

Figure-4.1 Residual by Predicted plot for the MV dataset using the random holdback method

Table-4.2 Different datasets with different parameters using Bootstrap Forest

S.No  Dataset     Features  Instances  RSquare  RMSE       N
1     AILERONS    35        13750      0.843    0.0001615  13750
2     CALIFORNIA  9         20640      0.797    51955.496  20640
3     ELEVATORS   19        16599      0.749    0.0033665  16599
4     HOUSE       17        22786      0.651    31238.959  22784
5     TIC         86        9822       0.275    0.2017361  9822

Figure-4.2 Model validation set summaries for the MV dataset using the boosted tree

[Bar chart of RMSE values (scale 0 to 3) for K-Fold, Holdback, and Excluded-Rows Holdback]

Figure-4.3 Comparison of RMSE using different cost estimation techniques


Table-4.3 Performance evaluation of cross-validation and partitioning techniques with the RSquare parameter

Technique                RSquare
K-FOLD                   0.5712923
HOLDBACK                 0.6129649
Excluded Rows Holdback   0.6026876
Decision trees           0.375
Bootstrap forest         0.729
Boosted tree             0.545
The following graph (Figure 4.4) depicts the discrepancies among the various effort estimation approaches using their RSquare values.

[Bar chart of RSquare values (scale 0 to 0.8) for K-Fold, Holdback, Excluded-Rows Holdback, Decision trees, Bootstrap forest, and Boosted tree]

Figure-4.4 Comparison of RSquare using different cost estimation techniques

Publications out of this work

Peer-reviewed publications related to this thesis

1. K. Subba Rao, L.S.S. Reddy, "Efficient Software Cost Estimation Using Partitioning Techniques", International Journal of Engineering Research and Technology, Volume 2(11), November 2013, ISSN 2278-0181.
2. K. Subba Rao, L.S.S. Reddy, "Software Cost Estimation in Multilayer Feed-forward Network using Random Holdback Method", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3(10), October 2013, ISSN 2277-128X.

REFERENCES

[1] Liming Wu, "The Comparison of the Software Cost Estimating Methods", University of Calgary.

[2] Capers Jones, "Estimating Software Costs", Tata McGraw-Hill Edition, 2007.

[3] COCOMO II Model Definition Manual, version 1.4, University of Southern California.

[4] Murali Chemuturi, "Analogy based Software Estimation", Chemuturi Consultants.

[5] G.N. Parkinson, "Parkinson's Law and Other Studies in Administration", Houghton-Mifflin, Boston, 1957.

[6] G. Wittig and G. Finnie, "Estimating software development effort with connectionist models", Information & Software Technology, vol. 39, pp. 469–476, 1997.

[7] R.K.D. Black, R.P. Curnow, R. Katz, and M.D. Gray, "BCS Software Production Data", Final Technical Report, RADC-TR-77-116, Boeing Computer Services, Inc., March 1977.

[8] B.W. Boehm, "Software Engineering Economics", Prentice-Hall, Englewood Cliffs, NJ, USA, 1981.

[9] Y. Singh and K.K. Aggarwal, "Software Engineering", Third Edition, New Age International Publishers, New Delhi.

[10] O. Benediktsson and D. Dalcher, "Effort Estimation in Incremental Software Development", IEE Proc. Software, vol. 150, no. 6, pp. 351–357, December 2003.

[11] R. Pressman, "Software Engineering – A Practitioner's Approach", 6th Edition, McGraw-Hill International Edition, Pearson Education, ISBN 007-124083-7.

[12] S. Devnani-Chulani, "Bayesian Analysis of Software Cost and Quality Models", Faculty of the Graduate School, University of Southern California, May 1999.

[13] Sweta Kumari and Shashank Pushkar, "Performance Analysis of the Software Cost Estimation Methods: A Review", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 7, July 2013, ISSN 2277-128X.

[14] Magne Jørgensen and Martin Shepperd, "A Systematic Review of Software Development Cost Estimation Studies", IEEE Transactions on Software Engineering, Vol. 33, No. 1, pp. 33–53, January 2007.

[15] http://www.icstars.org/files/estbasics.pdf

[16] Vahid Khatibi and Dayang N.A. Jawawi, "Software Cost Estimation Methods: A Review", Journal of Emerging Trends in Computing and Information Sciences, Volume 2, No. 1, ISSN 2079-8407.

[17] Abdulbasit S. Banga, "Software Estimation Techniques", Proceedings of National Conference INDIACom-2011: Computing for Nation Development, March 10–11, 2011, Bharati Vidyapeeth University Institute of Computer Applications and Management, New Delhi.

[18] G. Wittig and G. Finnie, "Estimating software development effort with connectionist models", Information & Software Technology, vol. 39, pp. 469–476, 1997.

[19] M.J. Shepperd and C. Schofield, "Estimating software project effort using analogies", IEEE Transactions on Software Engineering, vol. 23, pp. 736–743, 1997.

[20] http://www.jmp.com/support/help/Using_Validation.shtml

[21] S. Vicinanza, M.J. Prietula, and T. Mukhopadhyay, "Case-based reasoning in effort estimation", presented at the 11th Intl. Conf. on Information Systems, 1990.

[22] R. Bisio and F. Malabocchia, "Cost estimation of software projects through case-based reasoning", presented at the 1st Intl. Conf. on Case-Based Reasoning Research & Development, 1995.

[23] http://www.mwsug.org/proceedings/2012/JM/MWSUG-2012-JM04.pdf

[24] S.D. Chulani, "Bayesian Analysis of the Software Cost and Quality Models", PhD thesis, Faculty of the Graduate School, University of Southern California, 1999.
