
Expert Systems with Applications 16 (1999) 283–296

Evaluating the efficiency of system integration projects using data envelopment analysis (DEA) and machine learning

Han Kook Hong a, Sung Ho Ha b, Chung Kwan Shin b, Sang Chan Park b,*, Soung Hie Kim a

a Graduate School of Management, Korea Advanced Institute of Science and Technology (KAIST), 207-43 Chongyang-ni, Dongdaemoon-gu, Seoul 130-012, South Korea
b Department of Industrial Engineering, Korea Advanced Institute of Science and Technology (KAIST), 371-1 Kusong-dong, Yusong-gu, Taejon 305-701, South Korea

Abstract
Data envelopment analysis (DEA), a non-parametric productivity analysis, has become an accepted approach for assessing efficiency in a wide range of fields. Despite its extensive applications, some features of DEA remain unexploited. We aim to show that DEA can be used to evaluate the efficiency of system integration (SI) projects, and we suggest a methodology that overcomes the limitations of DEA through a hybrid analysis utilizing DEA along with machine learning. In this methodology, we generate rules for classifying new decision-making units (DMUs) into tiers and measure the degree to which each variable affects the efficiencies of the DMUs. Finally, we determine the stepwise path for improving the efficiency of each inefficient DMU. © 1999 Elsevier Science Ltd. All rights reserved.

Keywords: Data envelopment analysis; System integration; Self-organized map; C4.5; Machine learning

1. Introduction

Data envelopment analysis (DEA) is a linear programming-based technique for measuring the relative performance of organizations or decision-making units (DMUs) where the presence of multiple inputs and outputs makes comparisons difficult. It provides a means for assessing the relative efficiencies of DMUs with minimal prior assumptions on input–output relations in these units. After initially being explored by Charnes et al. (1978), DEA methods have subsequently been developed and extended (for a review, see Cooper et al., 1996).

There have been numerous applications of DEA to various fields. Some examples are given as follows: schools or universities (Beasley, 1990; Ahn, 1987), courts (Lewin et al., 1982), banks (Thompson et al., 1996, 1997; Athanassopoulos, 1997; Brockett et al., 1997; Drake and Howcroft, 1994; Oral et al., 1992; Schaffnit et al., 1997; Sherman and Ladino, 1995), airlines (Schefczyk, 1993), software projects (Mahmood et al., 1996), equipment maintenance (Clark, 1992; Hjalmarsson and Odeck, 1996), health care or hospitals (Pina and Torres, 1992; Rutledge et al., 1995), agricultural production units (Haag et al., 1992), police forces (Thanassoulis, 1995), railroads (Adolphson et al., 1989), highway patrols (Clark, 1992; Cook et al., 1990) and the brewing industry (Day et al., 1995).

There are a number of reasons why DEA is used in numerous applications. First, it does not require any underlying assumptions for inputs and outputs. Second, it allows managers to consider multiple inputs and multiple outputs of a DMU simultaneously. Third, it provides managers with a procedure to differentiate between efficient and inefficient DMUs. Fourth, it pinpoints the sources and the amount of deficiency for each of the inefficient DMUs. Finally, it can be used to detect specific inefficiencies that may not be detectable through other techniques such as linear regression or ratio analyses.

Despite its extensive applications, some features of DEA remain bothersome. First, although DEA is good at estimating the "relative" efficiency of a DMU, it only tells us how well we are doing compared with our peers, not compared with a "theoretical maximum". Thus, in order to measure the efficiency of a new DMU, we have to develop an entirely new DEA model with the data of the previously used DMUs. Also, we cannot predict the efficiency level of the new DMU without another DEA analysis. Second, because DMUs are directly compared with a peer or a combination of peers, DEA offers no guide as to where relatively inefficient DMUs should improve. Finally, it does not provide a stepwise path for improving the efficiency of each inefficient DMU by considering the difference in efficiency.

* Corresponding author. Tel.: +82-42-869-2920; fax: +82-42-869-2990. E-mail address: [email protected] (Sang Chan Park)

0957-4174/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0957-4174(98)00077-3

We think that DEA can be used to evaluate the efficiency
of system integration (SI) projects, considering these characteristics of DEA. DEA has rarely been applied to evaluate the performance of SI projects, apart from some applications measuring the productivity of software projects. By SI project we mean all the activities that are necessary to build and maintain various kinds of information systems in response to customers' needs. SI companies mainly carry out their work on a project basis. Upon receiving a project request from a customer, the SI company organizes a team for the project. But a project is often over budget, late in delivery, and fails to satisfy the requesting user. Precise evaluation of projects therefore becomes an important issue for SI companies, and the quality of the company is determined by the efficiency of its projects. The evaluation results of the proposed projects are considered critical in gaining a competitive strategy for the SI company. The project evaluation results also influence the level of incentives for the project members.

In this paper, we aim to show that DEA can be used to evaluate the efficiency of SI projects and suggest a methodology to overcome the limitations of DEA. We present our research framework, which is divided into two phases. In the first phase, we generate the rules for classifying new DMUs into each tier, and discriminate among the input and output variables by the degree to which they affect the efficiencies of the DMUs (discriminant descriptor). In the second phase, we determine the stepwise path for improving the efficiency of each inefficient DMU.

The remainder of the paper is structured as follows. Section 2 presents a review of literature on DEA and SI project measures. This is followed by a description of the research methodology in Section 3. Subsequently, Section 4 presents results. The concluding remarks are presented in Section 5.

2. Literature review

2.1. DEA

DEA was developed in operational research and economic studies as a method for assessing the efficiency of activity units, making the minimum possible assumptions regarding the functional form of the underlying production function. DEA is a method based on linear programming that has been used extensively for assessing the relative efficiency of activity units of non-profit (e.g., schools, hospitals) and profit-making (e.g., banks, airlines) organizations.

Despite its extensive applications, longitudinal studies in DEA are still very rare. Most DEA analyses compare the performance of DMUs in the same time period. One approach to performing a longitudinal analysis is to compare cross-sectional runs across the time periods in the study. This approach introduces variability into the analysis, however, because it treats the performance of a DMU in each time period as being independent of the performance in the previous period. Also, with this approach it is not feasible to ascertain trends in performance or to observe the persistence of efficiency or inefficiency. The window analysis approach corrects for some of these problems. For cross-section/time-series/panel data one could employ window analysis or a Malmquist index to examine changes across time periods.

Window analysis, dubbed by Charnes et al. (1985) and recently applied by Charnes et al. (1994) and Day et al. (1994, 1995), uses DEA to analyze panel data by converting a panel into an overlapping sequence of windows, which are then treated as separate cross-sections. The Malmquist index, developed by Färe et al. (1995), uses DEA to analyze panel data by constructing a Malmquist (1953) type of index of productivity change. One advantage of this approach is that it provides efficiency evaluations for each DMU between each successive pair of periods. Another advantage is that it provides a decomposition of productivity change into two mutually exclusive and exhaustive components: efficiency change and technical change.

2.2. Efficiency evaluation factors of SI projects

Within an efficiency measurement framework, one is interested in assessing how well a DMU uses its resources to obtain a desired outcome; alternatively, one may want to assess how good an outcome one is producing with the given resources. Thus, one is intuitively interested in defining the main resources (inputs) and the relevant products (outputs) of the process, and in finding appropriate measures for these attributes. But measuring SI projects has not been easy, primarily because most researchers and practitioners have difficulty in agreeing on what to measure and how to measure it.

Christopher et al. (1996) used software quality (Customer Satisfaction Index), meeting targets (which includes schedules and budgets), and rework after delivery as output variables of software development projects.

Software quality (CSI) covers the extent to which the software system meets the actual needs of the intended users. These needs can be diverse. For example, a major software developer conducted a customer satisfaction survey that included seven aspects of software quality: reliability, capability, usability, installability, maintainability, performance and documentation.

Meeting targets centers around the fact that, to be successful, a project should be "on time and on budget". Projects that are behind schedule or over budget have a number of consequences. Anticipated benefits of the completed project may be lost or delayed. People on the project generally must stay with the project, instead of moving on to a new one, which would likely delay the next project. A late project could lead to an embarrassing post hoc review of the original decision to start the project or
to select a particular group of people to carry it out. An important issue is the level of realism in the initial schedule and budget. These are often negotiated between the developer and user organizations. Since the users' goal is to minimize costs and completion time, and the developers' goal is to gain the users' agreement to the project, there is a tendency to set unrealistically low targets.

Rework has been observed on many SI projects. Rework may occur as a result of poor understanding of the requirements or poor technical design. When not planned for, as in a prototyping strategy, rework can play havoc with schedules and budgets.

Also, Banker and Kemerer (1992) used Budget Performance, Schedule Performance, User Satisfaction, and Maintenance Complexity as output variables of information systems development projects.

All the models have labor, measured either in labor hours or cost, as their main input representing the effort.

3. Methodology

In this section we present our research framework, which is divided into two phases. In the first phase, we generate the rules for classifying new DMUs into each tier and determine the input and output variables that discriminate best between the tiers by the degree to which they affect the efficiencies of the DMUs (discriminant descriptor). In the second phase, we determine the stepwise path for improving the efficiency of each inefficient DMU (refer to Fig. 1).

Fig. 1. Framework of our analysis.

3.1. Classification rule generation for each tier

In the first phase, we evaluate the efficiencies of the DMUs via DEA and cluster the DMUs through the tier analysis, which recursively applies the DEA analysis to the remaining inefficient DMUs, and then generate the DMU classification rules using C4.5, a decision tree classifier, with the DMU tiers identified by the tier analysis.

3.1.1. The efficiency evaluation of SI projects — input/output data set

In this paper, we propose an SI project management model with four inputs and four outputs, as shown in Fig. 2. The basic resources (inputs) used by each DMU are material and equipment resources, such as software and hardware tools, and total labor hours. Total labor hours are the amount of total person-months, taking career level into account. The outputs are the customer satisfaction index, schedule performance, budget performance, and rework hours after delivery. These variables are summarized in Table 1.

Fig. 2. Model of SI project management.

There are other criteria for judging the performance of SI projects that are not covered in this paper. One such criterion could be productivity measures such as source lines of code (SLOC) and function points (FP). These variables are not included in this paper because it does not seem feasible to reliably measure them across projects.

3.1.2. Evaluating the efficiencies of DMUs using DEA

A DEA involves an alternative principle for extracting information about a population of observations such as those shown in Fig. 3.

In contrast to parametric approaches, whose object is to optimize a single regression plane through the data, DEA optimizes on each individual observation with the objective of calculating a discrete piecewise frontier determined by the set of Pareto-efficient DMUs.
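In the simplest case of one input and one output under constant returns to scale, the DEA frontier reduces to the single best output/input ratio, and each DMU's relative efficiency is its own ratio divided by that best ratio, so frontier DMUs score 1 and all others score below 1. A minimal sketch of this special case (the project names and numbers below are illustrative, not the paper's data):

```python
def crs_efficiency(data):
    # Relative efficiency for DMUs with one input and one output under
    # constant returns to scale: each DMU's output/input ratio, scaled by
    # the best observed ratio so that frontier DMUs score exactly 1.0.
    best = max(y / x for x, y in data.values())
    return {name: (y / x) / best for name, (x, y) in data.items()}

# Hypothetical projects: (input, output)
projects = {"A": (2.0, 4.0), "B": (4.0, 4.0), "C": (5.0, 5.0)}
scores = crs_efficiency(projects)
print(scores["A"], scores["B"])  # 1.0 0.5  (A defines the frontier)
```

Each DMU below the frontier is benchmarked against a frontier point, mirroring how DEA scales an inefficient DMU against its referent; a full multi-input, multi-output model instead solves one linear program per DMU.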
Both the parametric and non-parametric (mathematical programming) approaches use all the information contained in the data.

Table 1
Summary of variables

Variable — Measurement

Input factors:
Labor hours A (La) — Amount of total person-months for staff with a career over 10 years
Labor hours B (Lb) — Amount of total person-months for staff with a career between 6 and 10 years
Labor hours C (Lc) — Amount of total person-months for staff with a career below 5 years
Material and equipment resource (Mr) — Total monetary amount of hardware, software and other materials

Output factors:
Customer Satisfaction Index (CSI) — Customer questionnaire
Schedule performance (Sp) — Ratio of planned period to real period
Budget performance (Bp) — Difference between real development cost and planned budget
Rework after delivery (Rew) — 5 person-months minus person-months of rework (additional service)

In parametric analysis, the single optimized regression equation is assumed to apply to each DMU. DEA, in contrast, optimizes the performance measure of each DMU. This results in a revealed understanding about each DMU instead of the depiction of a mythical "average" DMU. In other words, the focus of DEA is on the individual observations, as represented by the n optimizations (one for each observation) required in DEA analysis, in contrast to the focus on averages and the estimation of parameters associated with single-optimization statistical approaches. DEA calculates a maximal performance measure for each DMU relative to all the DMUs in the observed population, with the sole requirement that each DMU lie on or below the extreme frontier. Each DMU not on the frontier is scaled against a convex combination of the DMUs on the frontier facet closest to it.

The solid line in Fig. 3 represents a frontier derived by applying DEA to data on a population of DMUs, each utilizing different amounts of a single input to produce various amounts of a single output. It is important to note that DEA calculations, because they are generated from actual observed data for each DMU, produce only relative efficiency measures. The relative efficiency of each DMU is calculated in relation to all the other DMUs, using the actual observed values for the outputs and inputs of each DMU. The DEA calculations are designed to maximize the relative efficiency score of each DMU, subject to the condition that the set of weights obtained in this manner must also be feasible for all the other DMUs included in the calculation. For each inefficient DMU (one that lies below the frontier), DEA identifies the sources and level of inefficiency for each of the inputs and outputs. The level of inefficiency is determined by comparison to a single referent DMU or a convex combination of other referent DMUs located on the efficient frontier that utilize the same level of inputs and produce the same or a higher level of outputs. Details of the methodology as well as descriptions of data envelopment analysis can be found in Charnes et al. (1978).

Fig. 3. Comparison of DEA with regression analysis.

3.1.3. Clustering the DMUs through the tier analysis

In the preceding section, we used DEA to evaluate the efficiencies of SI projects. The DEA determines the most productive group of DMUs and the group of less-productive DMUs. That is, the DMUs are clustered into an efficient group or an inefficient one by DEA. A similar approach to clustering DMUs by DEA was presented by Thanassoulis (1996). However, the clusters in that study were grouped not by their efficiency levels but by the characteristics of the input resource mix. The tier analysis that we propose is a technique that clusters DMUs according to their efficiency levels.

In the first step of the tier analysis, we obtain the efficiency scores of the entire set of DMUs. The result of the first step should reveal the most efficient group of DMUs by indicating that their scores are equal to 1. We call this group "tier 1". In the second step, we run DEA again with only the inefficient DMUs that are not part of tier 1. DMUs whose efficiency scores in the second step are equal to 1 form tier 2. The same procedure can be repeated while the number of remaining DMUs is at least three times greater (8 × 3 = 24) than that of inputs along with outputs (4 + 4 = 8), as Banker et al. (1984) have proposed. This makes it possible to appropriately discriminate between efficient DMUs and inefficient DMUs. We call this procedure the tier analysis because the DMUs that belong to the efficient group in each step form the efficient production frontier in that step, as shown in Fig. 4.

Fig. 4 shows that DMUs on tier 1 are superior to those in tier 2, and DMUs in tier 2 are superior to those in tier 3.
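The peeling procedure of the tier analysis can be sketched in a few lines. To stay self-contained, this sketch scores DMUs with the single-input, single-output constant-returns ratio rather than a full multi-input, multi-output DEA model, and it omits the stopping rule described above (peeling while at least three times as many DMUs remain as there are inputs plus outputs); the data are illustrative:

```python
def relative_efficiency(data):
    # One-input, one-output stand-in for a DEA run: output/input ratio
    # scaled by the best ratio among the DMUs passed in.
    best = max(y / x for x, y in data.values())
    return {name: (y / x) / best for name, (x, y) in data.items()}

def tier_analysis(data):
    # Re-evaluate the DMUs left over after each round; the DMUs scoring
    # 1.0 in a round form that round's tier and are removed ("peeled").
    tiers, remaining = [], dict(data)
    while remaining:
        scores = relative_efficiency(remaining)
        tier = sorted(name for name, s in scores.items() if s >= 1.0)
        tiers.append(tier)
        for name in tier:
            del remaining[name]
    return tiers

projects = {"P1": (2.0, 4.0), "P2": (4.0, 4.0), "P3": (6.0, 3.0)}
print(tier_analysis(projects))  # [['P1'], ['P2'], ['P3']]
```

Because each round's frontier is recomputed only over the remaining DMUs, every DMU eventually lands on some tier, which is exactly what makes the tiers usable later as class labels for C4.5.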
Fig. 4. The procedure of tier analysis.

We use these DMU tiers as input data for C4.5 to generate DMU classification rules, and also use them to determine the stepwise improvement path for any of the inefficient DMUs.

3.1.4. Generating classification rules for each tier using C4.5

A typical decision tree learning system, C4.5, which is used here to generate the rule set for classifying SI projects, adopts a supervised learning scheme that constructs decision trees from a set of examples. A decision tree is a directed graph showing the various possible sequences of questions (tests), answers and classifications. The method first chooses a subset of the training examples (the window) to form a decision tree. If the tree does not give the correct answer for all the objects, a selection of the exceptions (incorrectly classified examples) is added to the window, and the process continues until the correct decision set is found. The eventual outcome is a tree in which each leaf carries a class name, and each interior node specifies an attribute with a branch corresponding to each possible value of that attribute.

C4.5 uses an information-theoretic approach aimed at minimizing the expected number of tests to classify an object. The attribute selection part of C4.5 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information. An information-based heuristic selects the attribute providing the highest information gain ratio, i.e., the ratio of the total information gain due to a proposed split to the information gain attributable solely to the number of subsets created, as the criterion for evaluating proposed splits.

The C4.5 system uses the information gain ratio as the evaluation function for classification, with the following equation (Quinlan, 1993):

gain ratio(X) = gain(X) / split info(X)

where

split info(X) = − Σ_{i=1..n} (|Ti| / |T|) × log2(|Ti| / |T|),

gain(X) = info(T) − info_X(T)

and gain(X) measures the information that is gained by partitioning T in accordance with the test X.

We generate the rules for classifying new DMUs into each tier and determine the input and output variables that discriminate best between the tiers by the degree to which they affect the efficiencies of the DMUs (discriminant descriptor).

3.2. Determination of the improvement path

In the second phase of our analysis, we use a self-organized map (SOM), a clustering tool, together with the DMU tiers to suggest improvement paths for inefficient projects based on the features of the projects.

3.2.1. Clustering the DMUs using SOM

DEA offers no guidelines on where relatively inefficient DMUs should improve, since a reference set of an inefficient DMU consists of several efficient DMUs. Hence, we build an SOM to group similar DMUs according to the characteristics of the inputs, so that an inefficient DMU can select an efficient DMU in its reference set as a benchmarking target (refer to Fig. 5).
Fig. 5. DMU clusters.

The SOM uses an unsupervised learning scheme to train the neural network (Sestito and Dillon, 1994; Berry and Linoff, 1997). Unsupervised learning comprises those techniques for which the resulting actions or desired outputs for the training sequences are not known. The network is only given the input vectors, and it self-organizes these inputs into categories.

Each link between a node in the input layer and a node in the output layer has an associated weight. The net input into each node in the output layer is equal to the weighted sum of the inputs. Learning proceeds by modifying these weights from an assumed initial distribution with the presentation of each input pattern vector. This process identifies groups of nodes in the output layer that are close to each other and respond in a similar manner. A particular group of units together forms an output cluster. The topology-preserving mappings from the inputs to the clusters reflect the existing similarities in the inputs, capture any regularities and statistical features, and model the probability distributions that are present in the input data.

The SOM uses competitive learning. When an input pattern is imposed on the network, one output node is selected from among all the output nodes as having the smallest Euclidean distance between the presented input pattern vector and its weight vector. This output unit is declared the winner in the competition among all the neurons in the output layer. Only the winning neuron generates an output signal from the output layer; all the other neurons have a zero output signal (see Fig. 6).

Fig. 6. Winning neuron and its path.

The input and weight vectors are usually normalized in an SOM. If the dot products between the normalized input vector X̂ and a normalized set of weight vectors Ŵj are determined, the neuron with the largest dot product (the one with the smallest Euclidean distance) is declared to be the winner.
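The winner selection just described can be sketched as follows. For unit-length vectors the two criteria coincide, since ‖x̂ − ŵ‖² = 2 − 2(x̂ · ŵ): the weight vector with the largest dot product is also the one at the smallest Euclidean distance. The input and weight vectors here are illustrative:

```python
from math import dist, sqrt

def normalize(v):
    # Scale a vector to unit length, as the SOM does for inputs and weights.
    norm = sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def winner(x, weights):
    # Winning output neuron: the weight vector with the largest dot product
    # with the input (equivalently, the smallest Euclidean distance).
    x = normalize(x)
    dots = [sum(a * b for a, b in zip(x, normalize(w))) for w in weights]
    return max(range(len(weights)), key=dots.__getitem__)

x = (0.9, 0.1)
ws = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
i = winner(x, ws)
print(i)  # 1: the weight vector closest in angle to x
# The dot-product winner matches the nearest-distance winner:
assert i == min(range(len(ws)), key=lambda j: dist(normalize(x), normalize(ws[j])))
```

A full SOM would then pull the winner's weight vector (and its neighbors') toward the input; only the selection step is shown here.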


Thus the winner is the vector obtained from the expression

max_j (X̂ᵀ Ŵj)

Fig. 7. Competitive learning between neurons.

As learning involves the adjustment of weight vectors, learning with a particular input pattern is restricted to the lateral interconnections with the immediately neighboring units of the winning neuron in the output layer. Adjusting their weights closer to the input vector carries out learning for the nodes within the neighborhood. The size of the neighborhood is initially chosen to be large enough to include all units in the output layer. However, as learning proceeds, the size of the neighborhood is progressively reduced to a pre-defined limit. Thus, during these stages, fewer neurons have their weights adjusted closer to the input vector. Lateral inhibition of weight vectors that are distant from a particular input pattern may also be carried out, as shown in Fig. 7.

Fig. 8. The reference set of the inefficient DMUs on each tier.

A general algorithm for the SOM may be summarized as follows.

1. Initialize weights to small random values and set the initial neighborhood to be large. One approach is to set each weight vector equal to an input vector pattern when there are more training input patterns than output units. This approach performs best with very large networks and training sets.
2. Stimulate the net with a given input vector.
3. Calculate the Euclidean distance between the input and each output node, and select the output node with the minimum distance.
4. Update the weights of the selected node and of the nodes within its neighborhood.
5. Repeat from step 2 until a stopping criterion is met.

3.2.2. Determining the reference DMU of the inefficient DMUs on each tier

Efficient DMUs in the upper tier become the reference set of inefficient DMUs in the lower tier. How can we select a target reference DMU among the DMUs in the reference set? We use the SOM in advance to find a target reference DMU in the upper tier. DMUs which the SOM finds to have similar characteristics lie on the same tier, the upper tier or the lower tier. We choose a DMU in the upper tier with similar characteristics as the benchmarking DMU for the inefficient DMU on the lower tier. Therefore, the SOM is a machine learning tool for clustering DMUs, and the tier analysis is a method for clustering DMUs by using DEA recursively (refer to Fig. 8).

3.2.3. Determining the improvement path

Once the tiers and the DMU clusters by SOM have been identified, we determine the stepwise path for improving the efficiency of each inefficient DMU (refer to Fig. 9).

Fig. 9. The improvement path of the inefficient DMU.

4. Results of analysis

4.1. Classification rule generation for each tier

4.1.1. Evaluating the efficiency of the DMUs using DEA

The data we used are those of 50 SI projects that were carried out during 1995 and 1996. We used the Charnes–Cooper–Rhodes (CCR) ratio model of DEA (refer to Appendix A). We also followed the application procedure suggested by Golany and Roll (1989). The results identified the relatively efficient best-practice projects and the less-productive projects (refer to Table 2).

Table 2
SI project efficiency ratings

SI project (DMU)  Efficiency rating  Reference set
P1    64%   P16 P29 P40 P41 P45
P2    100%
P3    89%   P40 P41 P45
P4    36%   P14 P16 P29 P47
P5    98%   P16 P40 P46
P6    90%   P21 P29 P40
P7    100%
P8    100%
P9    44%   P21 P40
P10   2%    P40 P45
P11   19%   P7 P8 P16 P32
P12   2%    P16 P21 P32
P13   25%   P21 P29 P40
P14   100%
P15   15%   P8 P21 P40 P45
P16   100%
P17   27%   P8 P40 P45
P18   57%   P7 P8 P16 P32
P19   39%   P29 P40
P20   54%   P16 P21 P32
P21   100%
P22   28%   P8 P16 P40
P23   9%    P8 P40 P45
P24   85%   P7 P14 P16 P29
P25   49%   P8 P21 P40 P46
P26   50%   P7 P32
P27   6%    P8 P21 P40
P28   83%   P7 P16 P29 P41
P29   100%
P30   38%   P40
P31   29%   P8 P32 P45
P32   100%
P33   12%   P8 P46
P34   32%   P21 P40
P35   27%   P16 P29 P45
P36   43%   P16 P36 P40 P46
P37   93%   P8 P21 P32 P40
P38   75%   P21 P46
P39   76%   P8 P16 P39
P40   100%
P41   100%
P42   7%    P16 P40 P45
P43   53%   P21 P45
P44   29%   P8 P32 P40
P45   100%
P46   100%
P47   100%
P48   63%   P7 P32
P49   14%   P8 P16 P21 P32
P50   44%   P8 P21 P32 P40 P45

The analysis showed that 13 projects, including projects such as P2 and P7, were best-practice projects, indicated by a DEA productivity rating of 100%. Project P1 was less productive, with a DEA productivity rating of 64%, suggesting that it could provide its current mix and volume of services with only about 64% of the resources it actually consumed. Project P3 had a DEA productivity rating of 89%, indicating that it was using about 11% excess resources. In fact, 37 of the 50 projects were using excess resources. These findings indicated that the SI company could make substantial productivity improvements and cost reductions.

4.1.2. Clustering the DMUs through the tier analysis

We grouped the 50 SI projects into four groups by the tier analysis. The efficiency score itself is not important at this stage; what matters is only which tier each project belongs to.

1. After the first tier analysis, the efficient DMUs by DEA form tier 1, and the remaining inefficient DMUs become the candidates for the second application of DEA. The results of the first tier analysis are summarized in Table 3. Note that the entries in the column DMUs have the form PnN, where n indicates the tier that each SI project belongs to and N indicates the SI project number.
2. For the second tier analysis, we run DEA again with only the inefficient DMUs that are not part of tier 1. DMUs whose efficiency scores in the second phase are 1 form tier 2 (refer to Table 4). The same procedure should be repeated while the number of remaining DMUs is at least three times greater than that of inputs plus outputs.
3. After the third tier analysis, the results are as summarized in Tables 5 and 6. In our application of 50 projects, the fourth tier is the last one in the tier analysis.

Table 3
Clustering of SI projects by the tier analysis — tier 1

Group (tier)  DMUs (SI projects)                                  Reference set
1             P12 P17 P18 P114 P116 P121 P129 P132                No reference set
              P140 P141 P145 P146 P147

Table 4
Clustering of SI projects by the tier analysis — tier 2

Group (tier)  DMUs (SI projects)  Reference set in tier 1
2             P23                 P140 P141 P145
              P25                 P116 P141 P145
              P26                 P121 P129 P140
              P224                P17 P114 P116 P129
              P228                P17 P116 P129 P141
              P237                P18 P121 P132 P140
              P238                P121 P146
              P239                P18 P116 P146
              P243                P121 P145
              P248                P17 P132

4.1.3. Generating classification rules for each tier using C4.5

A classification using C4.5 starts by preparing a training set of cases, each described in terms of the given attributes (eight input and output factors) and a known class (tier number). These cases come from a source such as the SI project tiers resulting from the tier analysis. The induction process of C4.5 will attempt to find a method of classifying a case, expressed as a function of the attributes, that explains the training cases and that may also be used to classify unseen cases.

There are four classes (1, 2, 3 and 4), which are the tiers identified by the tier analysis, and eight factors which influence the decision (class) in our example:

• labor hours A (La);
• labor hours B (Lb);
• labor hours C (Lc);
292 H.K. Hong et al. / Expert Systems with Applications 16 (1999) 283–296
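The recursive procedure of Section 4.1.2 can be sketched in a few lines. Since the CCR scores require a linear-programming solver, this sketch substitutes a simple Pareto-dominance test for DEA efficiency (a DEA-efficient unit is, in particular, non-dominated), and the project names and data are invented, not the paper's 50 SI projects.

```python
# Sketch of the recursive tier analysis of Section 4.1.2, with a Pareto-
# dominance test standing in for the CCR efficiency scores of Appendix A.

def dominates(a, b):
    """True if DMU a dominates DMU b: inputs no larger, outputs no smaller,
    and different in at least one dimension."""
    (in_a, out_a), (in_b, out_b) = a, b
    return (all(x <= y for x, y in zip(in_a, in_b))
            and all(x >= y for x, y in zip(out_a, out_b))
            and (in_a, out_a) != (in_b, out_b))

def tier_analysis(dmus):
    """Partition DMUs into tiers: the non-dominated units form tier 1, the
    analysis is then reapplied to the remainder, and so on. (The paper stops
    once fewer DMUs remain than three times the number of inputs plus
    outputs; this sketch simply runs to exhaustion.)"""
    tiers, remaining = [], dict(dmus)
    while remaining:
        tier = [n for n, d in remaining.items()
                if not any(dominates(o, d)
                           for m, o in remaining.items() if m != n)]
        tiers.append(tier)
        for n in tier:
            del remaining[n]
    return tiers

# Three toy projects as (inputs, outputs): P_a dominates P_b, which dominates P_c.
projects = {"P_a": ((1.0, 1.0), (10.0,)),
            "P_b": ((2.0, 2.0), (10.0,)),
            "P_c": ((3.0, 3.0), (9.0,))}
print(tier_analysis(projects))  # [['P_a'], ['P_b'], ['P_c']]
```

Each pass peels off the current efficient frontier, which is exactly how the tiers in Tables 4–6 are produced.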

There are four classes (1, 2, 3 and 4), which are the tiers identified by the tier analysis, and eight factors that influence the decision (class) in our example:

• labor hours A (La);
• labor hours B (Lb);
• labor hours C (Lc);
• material and equipment resources (Mr);
• customer satisfaction index (CSI);
• schedule performance (Sp);
• budget performance (Bp); and
• rework hours after delivery (Rew).

The values of these factors and the decision (class) are shown in Table 7. Moving from left to right, the eight factor values and the tier number are arranged in rows. Fifty cases are used to train C4.5.

Table 7
Training cases for C4.5

DMUs  La    Lb    Lc    Mr    CSI   Sp    Bp    Rew   Tier
P1    2.0   7.0   7.0   14.5  90.8  0.96  8.1   4.11  3
P2    1.1   3.3   2.0   4.6   89.5  1.2   4.1   4.87  1
P3    1.0   2.0   4.0   5.8   86.6  0.99  2.7   4.91  2
P4    10.6  31.4  16.4  30.9  79.6  0.84  12.1  3.29  4
P5    —     —     —     —     —     —     —     —     —

Table 5
Clustering of SI projects by the tier analysis — tier 3

Group (tier)  DMUs (SI projects)  Reference set in tier 2
3             P31                 P26 P224
              P39                 P25 P26 P237
              P317                P23 P26 P237 P243
              P318                P228 P237 P248
              P319                P26
              P320                P25 P224 P237
              P325                P25 P26 P237
              P326                P228 P237 P248
              P330                P26
              P336                P25 P26 P237 P243
              P350                P25 P26 P237

Table 6
Clustering of SI projects by the tier analysis — tier 4

Group (tier)  DMUs (SI projects)  Reference set in tier 3
4             P44                 P31 P318
              P410                P317 P350
              P411                P39 P320 P350
              P412                P31 P318 P325 P336
              P413                P31 P39 P330 P336 P350
              P415                P325 P336 P350
              P422                P320 P325 P350
              P423                P31 P320 P325
              P427                P39 P330 P336 P350
              P431                P31 P350
              P433                P325 P350
              P434                P39 P325 P330 P336
              P435                P31 P318
              P442                P31 P325 P350
              P444                P31 P325 P350
              P449                P31 P318 P320 P325 P350

The output of the decision tree generator in our instance appears in Fig. 10. Note the numbers at the leaves, of the form (N) or (N/E), where N is the sum of the fractional cases that reach the leaf and E is the number of cases that belong to classes other than the nominated class.

Fig. 10. Induced decision tree by C4.5.

Decision trees are usually simplified by discarding one or more subtrees and replacing them with leaves; as when building trees, the class associated with a leaf is found by examining the training cases covered by the leaf and choosing the most frequent class. C4.5 also allows the replacement of a subtree by one of its branches. Fig. 11 shows a decision tree after a pruning operation. Pruning a decision tree might cause it to misclassify more of the training cases, but it is done to produce more comprehensible tree structures and, ultimately, simpler production rules without compromising accuracy on unseen cases. From the decision tree, we can extract classification rules (refer to Fig. 12).

In addition to generating classification rules, C4.5 is able to discover the major input and output variables affecting the efficiency of the DMUs. It can also find the order of their influence, such as the sequence La, Lb, Sp, Rew, Mr, Lc, Bp and CSI shown in Fig. 12; that is, the effect of labor resource A on the efficiency of the DMUs is greater than that of labor resource B.
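The way C4.5 scores a split on a continuous attribute such as La can be sketched briefly. Real C4.5 uses the gain ratio and weighted cases; this simplified version uses plain information gain over the midpoint thresholds between consecutive observed values, applied to the four complete cases (P1–P4) of Table 7.

```python
# Minimal C4.5-style threshold search on one continuous attribute,
# scored by information gain (C4.5 proper refines this to gain ratio).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) of the best binary split."""
    base, best_gain, best_t = entropy(labels), 0.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate midpoints between consecutive values
        left = [c for v, c in zip(values, labels) if v <= t]
        right = [c for v, c in zip(values, labels) if v > t]
        gain = (base - len(left) / len(labels) * entropy(left)
                     - len(right) / len(labels) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

la = [2.0, 1.1, 1.0, 10.6]   # La values for P1..P4 (Table 7)
tier = [3, 1, 2, 4]          # their tier labels
gain, threshold = best_threshold(la, tier)
print(round(gain, 3), threshold)  # prints: 1.0 1.55
```

Running this over every attribute and picking the largest gain is, in miniature, how the root test of the tree in Fig. 10 is chosen.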

Fig. 11. Simplified decision tree after pruning.

4.2. Determining the improvement path of an inefficient DMU

4.2.1. Clustering the DMUs using SOM

Clustering the DMUs using the SOM is divided into two steps: the first is to train the SOM against the DMUs as a training data set; the second is to map input DMUs to output DMU clusters.

1. To train the SOM — the training algorithms of an SOM adjust the weights and thresholds using sets of training patterns. Iterative gradient-descent training algorithms, including the SOM's, attempt to reduce the training error on each epoch. Different stopping conditions can be specified for when training should stop; the simplest, and the most commonly used, is that training stops after a set number of epochs, or iterations.
2. To map input DMUs to output DMU clusters — an SOM performs unsupervised clustering of data; i.e., given training patterns that contain inputs only, the SOM assigns output units, which represent clusters, to the inputs. Once an SOM has been trained, it forms a topological map in the output layer. The mapping of input DMU patterns to output DMU clusters reflects the existing similarities in the inputs. The self-organizing algorithm not only assigns cluster centers to units, but also tends to place similar centers on units that are close together. Fig. 13 shows the result of clustering the 50 DMUs into four clusters. The numbers in each cluster indicate the SI projects that belong to it; cluster 1 has a single member, P10, and cluster 2 has 22 DMUs.

Fig. 13. The clustering result of SOM.

Table 8 summarizes the characteristics of each cluster in detail, showing, for each cluster, the number of DMUs it contains and the average values of the inputs and outputs.

4.2.2. Determining the reference DMU of the inefficient DMUs on each tier

By the tier analysis, the 50 SI projects are divided into four tiers according to their efficiency level, and by the SOM, DMUs on the lower tiers can find a way to improve their efficiencies. How can this be done? A DMU on a lower tier improves its efficiency by finding the one reference DMU on the tier immediately above it that shares similar characteristics, i.e. the member of its reference set that falls in the same SOM cluster.

For example, P320 on tier 3 has a reference set consisting of the efficient DMUs P25, P224 and P237 on the upper efficient frontier (tier 2). Among them, we choose P25 as the benchmarking target, since it belongs to the same SOM cluster as P320.

4.2.3. Determining the improvement path

Based on the previous results, which identify the benchmarking reference of each DMU on each tier, we can finally determine the stepwise improvement path for the DMUs on every tier except tier 1. For example, looking at Fig. 14, we can determine the path P423 → P320 → P25 → P146 as the improvement path for P423.

As shown in Table 9, P320, the first benchmarking target on the improvement path toward P146, has much higher schedule performance (Sp) than P423. P25, the second benchmarking target, has a similar level of output factors to P320 but consumes fewer input resources. Finally, P146 has a much higher CSI score than P25. Therefore, the important managerial points to keep in mind in the stepwise improvement of P423 are schedule performance first, followed by resource reduction, followed by CSI improvement.
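The two SOM steps of Section 4.2.1 can be sketched as a minimal one-dimensional map: train the unit weights against the patterns with a decaying learning rate and a shrinking neighborhood, then assign each pattern to its best-matching unit. The grid size, schedules and data below are illustrative choices, not the settings or projects used in the paper.

```python
# Minimal 1-D self-organizing map: (1) train unit weights, (2) map inputs
# to the cluster of their best-matching unit (BMU).
import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(patterns, n_units=2, epochs=300, seed=1):
    rng = random.Random(seed)
    dim = len(patterns[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                     # decaying learning rate
        radius = max(n_units * (1 - epoch / epochs), 0.01)  # shrinking neighborhood
        for x in patterns:
            bmu = min(range(n_units), key=lambda u: dist2(units[u], x))
            for u in range(n_units):
                # Gaussian neighborhood: units near the BMU move toward x too.
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                units[u] = [w + lr * h * (xi - w) for w, xi in zip(units[u], x)]
    return units

def map_to_clusters(patterns, units):
    """Step 2: assign each input pattern to the cluster of its BMU."""
    return [min(range(len(units)), key=lambda u: dist2(units[u], p))
            for p in patterns]

# Two well-separated groups of 2-D "projects"; each group should share one unit.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
units = train_som(data)
clusters = map_to_clusters(data, units)
```

After training, the unit weights play the role of the cluster centers behind Fig. 13 and Table 8.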

Fig. 12. Rule induced from the decision tree.

5. Conclusion

DEA is good at estimating the "relative" efficiency of a

Table 8
The characteristics of each cluster

Cluster Count La (avg) Lb (avg) Lc (avg) Mr (avg) CSI (avg) Sp (avg) Bp (avg) Rew (avg)

1 1 56.0 43.3 51.0 61.2 64.8 0.51 0.0001 0.11


2 22 3.9 8.1 4.5 5.7 71.0 0.93 0.63 3.0
3 16 1.6 4.8 3.9 5.2 90.1 1.1 4.2 4.4
4 11 10.5 18.3 11.8 19.3 81.5 1.12 11.2 3.3

DMU: it can tell us how well we are doing compared with our peers, but not compared with a "theoretical maximum". Thus, to measure the efficiency of a new DMU, we have to rerun the DEA with the same method on the data of the DMUs already used, and we cannot predict the efficiency level of a new DMU. Second, because DMUs are directly compared against a peer or a combination of peers, DEA offers no guide as to where a relatively inefficient DMU should improve. Third, DEA identifies peer DMUs and targets for inefficient DMUs, but it does not provide a stepwise path for improving the efficiency of each inefficient DMU that takes the differences in efficiency into account. In order to overcome these limitations of DEA, we suggest a new methodology. The methodology we propose is a hybrid, longitudinal analysis using machine learning, and it can be summarized as follows.

1. We apply DEA to evaluate the efficiency of the DMUs with their multidimensional inputs and outputs. After that, we cluster the DMUs through the tier analysis, which recursively applies DEA to the remaining inefficient DMUs, and then generate the DMU classification rules using C4.5, the decision tree classifier, with the DMU tiers identified by the tier analysis.
2. DEA offers no guidelines on where relatively inefficient DMUs can improve, since the reference set of an inefficient DMU consists of several efficient DMUs. Hence, we utilized a self-organizing map (SOM), a clustering tool that groups similar DMUs according to their inputs, so that each inefficient DMU can select one efficient DMU in its reference set as a benchmarking target. Combined with the tiers identified by the tier analysis, this provides guidelines for the stepwise improvement of inefficient DMUs.

Conventional DEA can only (1) identify inefficiencies, (2) identify comparable efficient units and (3) locate slack resources. We provide more information: the discriminant descriptors among the input and output variables that affect the efficiency of DMUs, rules for classifying new SI projects, and a stepwise improvement path.

We resolved the limitations of DEA that are listed in Section 1.

1. To evaluate a new SI project, conventional DEA has to be reapplied to all SI projects, including the new one, because it measures the relative efficiencies of the SI projects. Instead, after grouping the 50 SI projects into four groups by the tier analysis, we can generate classification rules using C4.5 in order to classify any new SI project without perturbing the previously existing evaluation structures.

Fig. 14. Improvement path for a DMU, P423, on tier 4.
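The path construction of Section 4.2.3 — step from an inefficient DMU to the member of its reference set (one tier up) that lies in the same SOM cluster, and repeat until an efficient DMU is reached — can be sketched as follows. The cluster labels, and the reference set assumed for P25, are hypothetical stand-ins chosen so that the example reproduces the P423 path of Fig. 14; the actual sets and clusters are in Tables 4–6 and Fig. 13.

```python
# Sketch of the stepwise improvement-path construction of Section 4.2.3.

def improvement_path(start, reference_sets, cluster_of):
    """Follow same-cluster references tier by tier until an efficient DMU
    (one with no reference set of its own) is reached."""
    path, current = [start], start
    while current in reference_sets:
        mates = [d for d in reference_sets[current]
                 if cluster_of[d] == cluster_of[current]]
        # Fall back to the first reference if no cluster-mate exists.
        current = mates[0] if mates else reference_sets[current][0]
        path.append(current)
    return path

# Reference sets for P423 and P320 follow Tables 6 and 5; the set for P25 and
# all cluster labels are illustrative assumptions, not the paper's data.
refs = {"P423": ["P31", "P320", "P325"],
        "P320": ["P25", "P224", "P237"],
        "P25":  ["P146", "P141"]}
cluster = {"P423": 2, "P320": 2, "P25": 2, "P146": 2,
           "P31": 1, "P325": 3, "P224": 1, "P237": 3, "P141": 1}
print(improvement_path("P423", refs, cluster))  # ['P423', 'P320', 'P25', 'P146']
```

Each hop lands on a unit that is both more efficient (one tier up) and similar (same cluster), which is what makes the intermediate targets realistic milestones rather than a single distant frontier point.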



Table 9
Input/output factors of DMUs on the improvement path

Group (tier) DMU (SI project) Input factors Output factors

La Lb Lc Mr CSI Sp Bp Rew

1 P146 2.5 2.1 0.2 1.85 94.48 1.01 0.03 2.16


2 P25 0.9 2.6 0.8 1.9 78.88 0.96 0.9 1.95
3 P320 2.0 6.0 2.3 4.05 70.0 1.05 1.1 1.74
4 P423 15.0 14.0 9.1 15.99 68.62 0.75 1.11 1.64

2. Conventional DEA provides a reference set (multiple efficient SI projects) for each inefficient SI project, but it cannot give a hint on where relatively inefficient DMUs should improve. Since we utilize SOM as a tool for clustering SI projects according to the similarity of their inputs, we can choose one reference project in the reference set as a benchmarking target for each inefficient SI project.
3. Conventional DEA cannot provide information about a continuous improvement path; it simply identifies inefficient SI projects and slack variables via the reference set. We resolve this problem and provide information about continuous improvement paths by combining the SI project clusters from the SOM with the reference projects from the tier analysis.

However, the present research has a number of limitations, which are also topics for further research.

1. Obviously, environmental factors such as project complexity and the quality of available hardware and software tools may also affect the efficiency of SI projects. Unfortunately, due to the unavailability of data, those factors could not be included in this research. In future DEA analyses, these factors may be incorporated into the production model as exogenous (uncontrollable) inputs or outputs, or as categorical variables.
2. The present model does not include project management indexes (process indexes), such as the observance of company-internal project procedure guidelines and of design review meetings, which are determined as the project progresses. Experienced managers know that these variables are important for managing an SI project, yet they are not sure how these characteristics determine the overall quality of projects. In our next study, we will apply the DEA technique to find out how these process index variables influence performance within the limited input resources.
3. The data used in this study are from one large SI company in Korea. Caution should be taken in generalizing the results of this study to other firms.

Acknowledgements

The authors would like to thank Dr Hyun Seok Jeong of the Quality Control Group at Samsung Data System (SDS) for providing data and valuable advice.

Appendix A. The DEA model

The CCR ratio model was proposed by Charnes et al. (1978). In this model, the efficiency measure of any DMU is obtained as the maximum of a ratio of weighted outputs to weighted inputs, subject to the condition that the corresponding ratios for every DMU be less than or equal to unity. That is, the model is as follows:

$$\max\; e_0 = \sum_{r \in R} u_r Y_{r0} \Big/ \sum_{i \in I} v_i X_{i0}$$

$$\text{s.t.}\quad \sum_{r \in R} u_r Y_{rj} \Big/ \sum_{i \in I} v_i X_{ij} \le 1 \quad \forall j \in N$$

$$u_r \Big/ \sum_{i \in I} v_i X_{i0} \ge \varepsilon \quad \forall r \in R$$

$$v_i \Big/ \sum_{i \in I} v_i X_{i0} \ge \varepsilon \quad \forall i \in I$$

where

i      index for inputs, i ∈ I = {1, 2, …, I}
j      index for DMUs, j ∈ N = {1, 2, …, n}
r      index for outputs, r ∈ R = {1, 2, …, R}
v_i    virtual multiplier (weight) of the ith input
u_r    virtual multiplier (weight) of the rth output
X_ij   the value (≥ 0) of input i for the jth DMU (j = 1, …, n)
Y_rj   the value (≥ 0) of output r for the jth DMU (j = 1, …, n)
ε      a non-Archimedean infinitesimal

References

Adolphson, et al. (1989). Railroad property valuation using data envelopment analysis. Interfaces, 19, 18–26.
Ahn, T. S. (1987). Efficiency and related issues in higher education: a data envelopment analysis approach. Ph.D. dissertation, The University of Texas at Austin.
Athanassopoulos, A. D. (1997). Service quality and operating efficiency synergies for management control in the provision of financial services: evidence from Greek bank branches. European Journal of Operational Research, 87, 300–313.
Banker, R. D., & Kemerer, C. F. (1992). Performance evaluation metrics for information systems development: a principal-agent model. Information Systems Research, 379–398.
Banker, R. D., Charnes, A., & Cooper, W. W. (1984). Some models for estimating technical and scale inefficiencies in data envelopment analysis. Management Science, 30, 29–40.
Beasley, J. E. (1990). Comparing university departments. OMEGA, 18 (2), 171–183.
Berry, M. J. A., & Linoff, G. (1997). Data mining techniques for marketing, sales and customer support. New York: Wiley.
Brockett, P. L., Charnes, A., Cooper, W. W., & Sun, D. B. (1997). Data transformations in DEA cone ratio envelopment approaches for monitoring bank performances. European Journal of Operational Research, 98, 250–268.
Charnes, A., Cooper, W. W., & Rhodes, E. (1978). Measuring the efficiency of decision making units. European Journal of Operational Research, 2, 429–444.
Charnes, A., Clark, C. T., Cooper, W. W., & Golany, B. (1985). A developmental study of data envelopment analysis in measuring the efficiency of maintenance units in the U.S. Air Force. Annals of Operations Research, 2, 95–112.
Charnes, A., Cooper, W. W., Golany, B., Phillips, F. Y., & Rousseau, J. J. (1994). A multi-period analysis of market segments and branch efficiency in the competitive carbonated beverage industry. In A. Charnes, W. W. Cooper, A. Y. Lewin, & L. M. Seiford (Eds.), Data envelopment analysis: theory, methodology, and application. Boston: Kluwer Academic.
Christopher, et al. (1996). Software processes and project performance. Journal of Management Information Systems, 12 (3), 187–205.
Clark, R. L. (1992). Evaluating USAF vehicle maintenance productivity over time: an application of data envelopment analysis. Decision Sciences, 24, 376–384.
Cook, W. D., Roll, Y., & Kazakov, A. (1990). A DEA model for measuring the relative efficiency of highway maintenance patrols. INFOR, 28 (2), 113–124.
Cooper, W. W., Thompson, R. G., & Thrall, R. M. (1996). Introduction: extensions and new developments in DEA. Annals of Operations Research, 66, 3–45.
Day, D. L., Lewin, A. Y., Salazar, R., & Li, H. (1994). Strategic leaders in the U.S. brewing industry: a longitudinal analysis of outliers. In A. Charnes, W. W. Cooper, A. Y. Lewin, & L. M. Seiford (Eds.), Data envelopment analysis: theory, methodology, and application. Boston: Kluwer Academic.
Day, D. L., Lewin, A. Y., & Li, H. (1995). Strategic leaders or strategic groups: a longitudinal data envelopment analysis of the U.S. brewing industry. European Journal of Operational Research, 80, 619–638.
Drake, L., & Howcroft, B. (1994). Relative efficiency in the branch network of a UK bank: an empirical study. OMEGA, 22 (1), 83–90.
Färe, R., Grosskopf, S., Lindgren, B., & Roos, P. (1995). Productivity developments in Swedish hospitals: a Malmquist output index approach. In A. Charnes, W. W. Cooper, A. Y. Lewin, & L. M. Seiford (Eds.), Data envelopment analysis: theory, methodology, and application. Boston: Kluwer Academic.
Golany, B., & Roll, Y. (1989). An application procedure for DEA. OMEGA, 17 (3), 237–250.
Haag, S., Jaska, P., & Semple, J. (1992). Assessing the relative efficiency of agricultural production units in the Blackland Prairie, Texas. Applied Economics, 24, 559–565.
Hjalmarsson, L., & Odeck, J. (1996). Efficiency of trucks in road construction and maintenance: an evaluation with data envelopment analysis. Computers & Operations Research, 23 (4), 393–404.
Lewin, A. Y., Morey, R., & Cook, T. (1982). Evaluating the administrative efficiency of courts. OMEGA, 10, 404–411.
Mahmood, M. A., Pettingell, K. J., & Shaskevich, A. I. (1996). Measuring productivity of software projects: a data envelopment analysis approach. Decision Sciences, 27 (1), 56–77.
Malmquist, S. (1953). Index numbers and indifference surfaces. Trabajos de Estadistica, 4, 209–242.
Oral, M., Kettani, O., & Yolalan, R. (1992). An empirical study on analyzing productivity of bank branches. IIE Transactions, 24, 166–176.
Pina, V., & Torres, L. (1992). Evaluating the efficiency of nonprofit organizations: an application of data envelopment analysis to public health service. Financial Accountability and Management, 8, 213–224.
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Rutledge, R., Parsons, S., & Knaebel, R. (1995). Assessing hospital efficiency over time: an empirical application of data envelopment analysis. Journal of Information Technology Management, 6 (1), 13–23.
Schaffnit, C., Rosen, D., & Paradi, J. C. (1997). Best practice analysis of bank branches: an application of DEA in a large Canadian bank. European Journal of Operational Research, 87, 269–289.
Schefczyk, M. (1993). Operational performance of airlines: an extension of traditional measurement paradigms. Strategic Management Journal, 14, 301–317.
Sestito, S., & Dillon, T. S. (1994). Automated knowledge acquisition. Prentice Hall.
Sherman, H. D., & Ladino, G. (1995). Managing bank productivity using data envelopment analysis (DEA). Interfaces, 25 (2), 60–73.
Thanassoulis, E. (1995). Assessing police forces in England and Wales using data envelopment analysis. European Journal of Operational Research, 87, 641–657.
Thanassoulis, E. (1996). A data envelopment analysis approach to clustering operating units for resource allocation purposes. Omega International Journal of Management Science, 24 (4), 463–476.
Thompson, R. G., Dharmapala, P. S., & Rothenberg, L. J. (1996). DEA/AR efficiency and profitability of 14 major oil companies in U.S. exploration and production. Computers & Operations Research, 23 (4), 357–373.
Thompson, R. G., Brinkmann, E. J., Dharmapala, P. S., & Gonzalez-Lima, M. D. (1997). DEA/AR profit ratio and sensitivity of 100 large U.S. banks. European Journal of Operational Research, 98, 213–229.
