Performance Measurement and Program Effectiveness: A Structural Equation Modeling Approach
To cite this article: Christopher Mausolff & John Spence (2008) Performance Measurement and Program Effectiveness: A Structural Equation Modeling Approach, International Journal of Public Administration, 31:6, 595-615, DOI: 10.1080/01900690701640929
Christopher Mausolff

John Spence
John D. Spence Consulting, Inc., Louisville, Kentucky, USA
INTRODUCTION
[Figure 1. Hypothesized structural equation model: the observed variables objectives, indicators, and data load on the latent performance measurement construct; clients and resources load on the latent competence construct; paths lead to learning and results.]
tasks to perform, such as drawing a line or pulling a lever. The subjects in the
control group are not given feedback about their performance, while those in
the treatment group are given feedback for a short period. The superior performance of the treatment group that persists after feedback is withdrawn is considered evidence of a learning effect.
It is reasonable to hypothesize that feedback could also stimulate learning
at the organizational level. Feedback is portrayed as central to learning in a
number of conceptual, theoretical works on organizational learning.[30–32]
The role of feedback in organizational learning is also supported in case
studies,[33] organization simulations,[34,35] and in longitudinal research.[36]
In addition to the literature discussing feedback in general, there is also
research on the role of performance measurement, specifically, in stimulating
learning. Mausolff's research on organizational learning from performance
measurement in employment services agencies provides detailed descriptions
of this learning process.[26] In the studied agencies, performance measurement
triggered awareness of performance gaps. In response, organization members initiated problem-solving activities, including developing interpretations of the causes of the performance gaps, sharing these interpretations, searching
for additional information to better understand the presenting problems,
selecting solutions, and implementing solutions. At times, the problem was cor-
rected with a solution that involved a new theory-of-action. In such instances,
organization members were learning from performance measurement.[37]
Overall, research at the individual and organizational levels supports the
hypothesis that feedback contributes to learning. Since performance measurement
is a form of feedback, the following hypothesis is plausible.
Organizational Learning-Effectiveness
learning practices of five large organizations.[40] Those that were closest to the
ideal of being learning organizations, with respect to strategy, structure, and
culture, had the highest performance. Yeung, Ulrich, Nason, and Von Glinow
studied over 400 business firms and identified four learning styles: experimen-
tation, competency acquisition, benchmarking, and continuous improve-
ment.[41] Of these, the experimentation approach had the highest performance
(as measured by competitiveness, innovativeness, and new product innova-
tion). If the style of learning matters, then by extension, learning matters. These
findings all help to establish the plausibility of the following hypothesis.
Performance Measurement-Effectiveness
There is also the potential for negative impacts from performance mea-
surement. The literature on management control systems in business firms
suggests a number of distortions that can arise from performance measurement,
including tunnel vision, suboptimization, myopia, ossification, gaming, and
misrepresentation.[18–21] Many of these distorting effects have also been observed
in government agencies.[17,43,44] In Poister and Streib's survey research, 21.1
percent of municipalities report modest goal displacement from performance
measurement and 1.5 percent report substantial goal displacement.[15]
Despite the possible negative impacts of performance measurement, the potential for improved focus and motivation suggests the possibility of improved effectiveness.
up to date (12.1%); and compiling and distributing data from the performance
measurement system in a timely manner (10.8%).[15]
The competence-performance measurement relationship was directly
tested in Berman and Wang’s research on performance measurement in
counties.[16] The researchers identified a number of competencies that are
significantly correlated with the use of performance measurement: the ability
to develop outcome measures, collect performance measurement data in a
timely way, and assess the validity of performance measures. These findings
all support the following hypothesis.
Competence-Results
RESEARCH METHODS
The organizations used to test the model hypotheses are nonprofits that
receive financial support from Metro United Way in Louisville, Kentucky.
Metro United Way is a large United Way affiliate with donations of over $30 million annually. Its 80-person staff serves over 100 health and human services agencies in a seven-county region of Kentucky and Indiana. These agencies
manage approximately 170 programs in the following four sectors:
Data Collection
The data used to test the model were gathered by 13 volunteer evaluation teams. Each team was led by a volunteer and a Metro United Way staff member and comprised between 12 and 25 members. Since 2000, Metro United Way has had evaluation teams assess all of the programs in its portfolio every two years. The data used in this study are based on assessments conducted
in 2000 and 2002.
The program scoring process involves just a few simple steps. Prior to the
evaluation, the focal agency develops a program funding proposal consisting of:
After studying the funding proposals, the evaluation team meets with the man-
agement and the Board representatives. At this point, the evaluation team can
seek clarification on any issues or gaps in the funding proposal. The evaluation
team then scores the program using a standardized evaluation form. There are
variations in the way each team determines these scores. Some teams simply
average each evaluator's scores, while others attempt to reach a consensus
score for each category. The scores in each category are weighted and then
added together to yield a single program score. The validity and reliability of
the scoring is enhanced by training the evaluators.
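To make the aggregation arithmetic concrete, the sketch below averages evaluator scores by category, weights the averages, and sums them into one program score; the category names and weights are hypothetical illustrations, not Metro United Way's actual evaluation form.

```python
# A minimal sketch of the scoring arithmetic described above. The
# category names and weights are hypothetical, not the actual form.

def program_score(evaluator_scores, weights):
    """Average each category across evaluators, then weight and sum
    the category averages into a single program score."""
    averaged = {
        category: sum(s[category] for s in evaluator_scores) / len(evaluator_scores)
        for category in weights
    }
    return sum(weights[c] * averaged[c] for c in weights)

# Example: three evaluators scoring four categories on a 0-4 scale.
scores = [
    {"objectives": 3, "indicators": 4, "data": 3, "results": 3},
    {"objectives": 4, "indicators": 3, "data": 3, "results": 4},
    {"objectives": 3, "indicators": 3, "data": 2, "results": 3},
]
weights = {"objectives": 0.2, "indicators": 0.2, "data": 0.2, "results": 0.4}
print(round(program_score(scores, weights), 2))  # 3.2
```

A consensus-style team would simply replace the averaging step with the agreed value for each category.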
The criteria Metro United Way uses to evaluate the programs are quite different from those of other United Way affiliates. Other affiliates that we have worked with use evaluation criteria based on management best practices, e.g., the use of strategic planning, job descriptions for employees, and appropriate by-laws.
Metro United Way has taken a very different approach. Instead of adherence
to a wide range of specific practices, the evaluation criteria focus primarily on
performance measurement, learning, and achieving results. Thus, rather than
measuring numerous processes, Metro United Way is more focused on learning
and results.
The scores assigned by the evaluation teams in each category are used to
test the proposed structural equation model. These scores should have higher
validity and reliability than those ordinarily obtained in survey research. In
much survey research, the scores for an entire organization are filled out by
just one of its members. By comparison, in this research, each score reflects
the input of 12 to 25 trained evaluators and, therefore, benefits from the
consideration of multiple perspectives. These evaluators also have the benefit
of being able to compare the program being evaluated with the others they
have previously assessed. Finally, the data collection method does not require
heroic assumptions about the cognitive abilities of respondents. In existing
survey research, the respondents answer questions such as whether performance
measurement “improved service quality” or enhanced the organization’s “ability
to achieve improvements…” Such questions require respondents to make
complex attributions about the causes of observed improvements in their
programs. Is it reasonable to assume that respondents have the data and the
training for judging causality? The present study uses a different approach.
The external evaluators merely assess how well each program has performed
on each dimension of the evaluation instrument. They do not need to make
attributions about the causal factors contributing to the results. Instead, possible
relationships between better quality performance measurement and program
results are assessed statistically. If performance measurement is not important
for effectiveness, then doing it better should not make any difference for orga-
nizational learning and results. Conversely, if performance measurement is
important, then when it is performed well, there should be a corresponding
improvement in organizational learning and program results.
A weakness of the data collection method is that the resulting data do not
have normal distributions. The average score on a 0–4 scale ranged from 2.7
to 3.5 for year 2000, and from 2.9 to 3.7 for year 2002. (See Tables 1 and 2).
As a result, the data are negatively skewed with statistically significant skewness
scores ranging from −0.55 to −1.78 (year 2000) and −0.61 to −2.28 (year 2002).
Although high, these deviations from normality are, with the exception of the
clients variable in year 2002, below the recommended maximum threshold for
maximum likelihood estimation techniques.[49] As the level of skewness increases:
1. the chi-square goodness-of-fit test can become less accurate and reject too
many true models, and
2. the parameter estimates can become biased and provide too many significant
results.[49]
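As an illustration of this screening step, the following sketch computes the skewness statistic and its significance test for a simulated, negatively skewed 0-4 rating variable; the simulated data and the |skewness| < 2 cutoff used as a flag are assumptions for the example, not the study's values.

```python
# A minimal sketch of the skewness screening described above; the
# simulated ratings and the |skew| < 2 cutoff are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulate a negatively skewed variable on the 0-4 rating scale.
ratings = np.clip(4.0 - rng.gamma(shape=2.0, scale=0.5, size=170), 0.0, 4.0)

skewness = stats.skew(ratings)
z, p = stats.skewtest(ratings)  # is the skewness significantly nonzero?
print(f"skewness = {skewness:.2f}, z = {z:.2f}, p = {p:.4f}")

if abs(skewness) >= 2:
    print("warning: skew this severe may bias ML chi-square and estimates")
```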
The constructs in the model consist of both observed and unobserved vari-
ables. The observed variables are operationalized based on their description in
the evaluation instrument. The evaluators use these descriptions when assigning
scores to each program. The unobserved (or latent) variables are constructed
from these observed variables (See Figure 1 above). A description of each
construct in the model is provided below.
Performance Measurement
The score for “indicators” is based on the extent to which the chosen measures
match the objectives and signal that the objective has been achieved. The
“data collection” score depends on how well the data collection instrument
provides credible data for all indicators, is usable for measuring program
effectiveness, and is usable for program improvement.
Learning
Results
Results and effectiveness are treated as equivalent terms in this study. Results are
operationalized primarily by the extent of improvement in client conditions attrib-
utable to the program. The evaluation instrument also includes, as one consider-
ation in scoring, the quality of the explanation of the factors affecting the results.
Competence
Data Analysis
The model relationships were tested using the maximum likelihood estimation
method in Amos 4.0 software. Maximum likelihood estimation is recom-
mended when the variable “distributions are not substantially nonnormal.”[49]
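For readers who want to reproduce the analysis with current tools, here is a hedged sketch of how a model with the shape of Figure 1 might be specified and estimated using the open-source semopy package in place of Amos 4.0; the variable names follow Figure 1, the paths follow the stated hypotheses, and the input file name is hypothetical.

```python
# A hedged sketch of the model specification and maximum likelihood
# fit, using semopy as a stand-in for Amos 4.0. Variable names follow
# Figure 1; the path structure follows the stated hypotheses.
import pandas as pd
import semopy

# Latent constructs (=~) built from observed evaluation scores,
# plus structural paths (~) to learning and results.
MODEL_DESC = """
perf_measurement =~ objectives + indicators + data
competence =~ clients + resources
learning ~ perf_measurement + competence
results ~ perf_measurement + learning + competence
"""

df = pd.read_csv("program_scores.csv")  # one row per program (hypothetical file)

model = semopy.Model(MODEL_DESC)
model.fit(df)                    # maximum likelihood estimation
print(model.inspect())           # loadings, path estimates, p-values
print(semopy.calc_stats(model))  # chi-square, GFI, and other fit indexes
```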
FINDINGS
The test of the model is especially stringent in that it involves two tests: one
with year 2000 data and a second with year 2002 data.
No one fit index is sufficient for evaluating overall model fit.[50,51] Researchers
therefore encourage reporting a range of fit indexes.[52] Statistics from both
absolute and incremental fit indexes are reported here. Most of the overall fit
indexes are normed so that scores close to 1 are ideal and scores above .90 are
considered reasonable.[53]
Absolute fit refers to the degree to which the covariances implied by the
model match the observed covariances.[53] The scores for two absolute fit
indexes, the relative chi-square and the GFI, are reported in Table 3. There is
no precise significance limit for the relative chi-square. Researchers have sug-
gested, as maximum acceptable limits, scores ranging from three to five.[54]
The relative chi-square scores of 3.226 (year 2000 data) and 2.451 (year 2002 data) are therefore borderline: they fall within the suggested limits but indicate some loss of fit arising from the restrictions placed on the model.
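The screening logic is simple enough to state directly. The sketch below applies the suggested three-to-five limits to the two reported ratios; the verdict labels are our own shorthand, not terms from the fit literature.

```python
# A minimal sketch of the relative chi-square screening described
# above, using the suggested maximum limits of three to five.[54]
def judge_relative_chi_square(ratio: float) -> str:
    if ratio <= 3.0:
        return "acceptable"
    if ratio <= 5.0:
        return "borderline"
    return "poor fit"

for year, ratio in [(2000, 3.226), (2002, 2.451)]:
    print(year, ratio, judge_relative_chi_square(ratio))
# 2000 3.226 borderline
# 2002 2.451 acceptable
```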
The goodness-of-fit index (GFI) measures the amount of variances and
covariances explained by the model.[53] It is similar to the r-square widely
used in multiple regression analysis.[53] The GFI scores above .95 for both the
year 2000 and year 2002 data indicate excellent model fit.
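For reference, the GFI under maximum likelihood estimation is commonly written as follows (after Jöreskog and Sörbom), where $S$ is the sample covariance matrix, $\hat{\Sigma}$ is the model-implied covariance matrix, and $I$ is the identity matrix; the formula is supplied here for clarity and is not reproduced from the article.

$$\mathrm{GFI} \;=\; 1 \;-\; \frac{\operatorname{tr}\!\left[\left(\hat{\Sigma}^{-1} S - I\right)^{2}\right]}{\operatorname{tr}\!\left[\left(\hat{\Sigma}^{-1} S\right)^{2}\right]}$$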
Parameter Estimates
For the individual parameter estimates, the results are mixed. The learning
pathway is supported only for the year 2002 data, while the direct, perfor-
mance measurement-results relationship is supported for both years (See
Tables 5 and 6 below).
We hypothesized that performance measurement would improve results
by enhancing organizational learning (See Figures 2 and 3 below). This
hypothesis was not supported with the year 2000 data. However, if the model
is analyzed without the latent variable for organizational competence, perfor-
mance measurement contributes to learning in both 2000 and 2002. Therefore,
the organizations that do better at performance measurement also do better at
learning. However, for year 2000, this correlation is due entirely to underlying
competence, rather than an independent benefit of performance measurement.
For 2002, even after controlling for competence, there is still an independent,
incremental learning benefit from performance measurement.
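As a sketch of the re-analysis just described, the reduced model below drops the competence latent and refits, so the performance measurement to learning path can be compared across the two specifications; this continues the earlier semopy sketch and is an assumption about the procedure, not the authors' actual Amos runs.

```python
# A hedged sketch of the re-analysis described above: drop the
# competence latent, refit, and check whether the performance
# measurement -> learning path stays significant.
import pandas as pd
import semopy

REDUCED_DESC = """
perf_measurement =~ objectives + indicators + data
learning ~ perf_measurement
results ~ perf_measurement + learning
"""

df = pd.read_csv("program_scores.csv")  # same hypothetical file as before

reduced = semopy.Model(REDUCED_DESC)
reduced.fit(df)
# Look for the 'learning ~ perf_measurement' row and its p-value, and
# compare it with the same path in the full model that includes competence.
print(reduced.inspect())
```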
The performance measurement-results path tests explanations, other than
learning, by which performance measurement could contribute to results.
These other mechanisms could include improved focus and motivation.[15,16]
[Figure 2. Path diagram with parameter estimates for the year 2000 data: objectives, indicators, and data load on performance measurement; clients and resources load on competence; paths run to learning and results, with error terms on each observed and endogenous variable.]
[Figure 3. Path diagram with parameter estimates for the year 2002 data, for the same model structure.]
In the present sample, these residual mechanisms are important. There
is a positive relationship between performance measurement and results for
both 2000 and 2002 (significant at the .01 level).
DISCUSSION
CONCLUSION
A structural equation model was used to test the impact of performance mea-
surement on effectiveness and the importance of organizational learning in
this process. The relatively large sample of approximately 170 health and human service programs, the independent scoring of programs, and the replication of the test with two datasets provided a relatively rigorous test of the hypotheses.
The results were mixed. The organizational learning hypothesis was only partially supported: it is statistically significant in 2002 but not in 2000. One
explanation for this result is that the organizations needed additional time for the
incremental benefits of data-driven learning to add up to significant program
impacts.
The direct performance measurement-effectiveness relationship is strongly supported in this study. The significant, positive correlations in the model results for both the 2000 and 2002 data sets provide strong evidence for the efficacy of performance measurement. These results support the
promotion of performance measurement by the United Way of America and
other nonprofit organizations. Since this study is based entirely on nonprofit
health and human service organizations, caution should be exercised in gener-
alizing these results to government and private business.
The results also support the hypothesis that factors other than learning can
be important in explaining the impact of performance measurement on results.
These other mechanisms could include greater motivation and improved
focus. It would therefore be useful to conduct additional research on these other mechanisms to better understand their impact.
REFERENCES