SEVENTH EDITION
PETER H. ROSSI
University of Massachusetts, Amherst
MARK W. LIPSEY
Vanderbilt University, Nashville, TN
HOWARD E. FREEMAN
All rights reserved. No part of this book may be reproduced or utilized in any form or
by any means, electronic or mechanical, including photocopying, recording, or by any
information storage and retrieval system, without permission in writing from the
publisher.
Preface
CHAPTER 1 AN OVERVIEW OF PROGRAM EVALUATION
What Is Program Evaluation?
A Brief History of Evaluation
The Defining Characteristics of Program Evaluation
Evaluation Research in Practice
Who Can Do Evaluations?
Summary
CHAPTER 2 TAILORING EVALUATIONS
What Aspects of the Evaluation Plan Must Be Tailored?
What Features of the Situation Should the Evaluation Plan Take Into
Account?
The Nature of the Evaluator-Stakeholder Relationship
Evaluation Questions and Evaluation Methods
Summary
CHAPTER 3 IDENTIFYING ISSUES AND FORMULATING QUESTIONS
What Makes a Good Evaluation Question?
Determining the Specific Questions the Evaluation Should Answer
Collating Evaluation Questions and Setting Priorities
Summary
CHAPTER 4 ASSESSING THE NEED FOR A PROGRAM
The Role of Evaluators in Diagnosing Social Conditions and Service
Needs
Defining the Problem to Be Addressed
Specifying the Extent of the Problem: When, Where, and How Big?
Defining and Identifying the Targets of Interventions
Describing Target Populations
This seventh edition contains some new material and extensive revisions of topics that
appeared in previous editions. The amendments include an extended treatment of
outcome measurement and monitoring, a better exposition of impact assessment designs,
a fuller treatment of some key statistical issues in evaluation research, and a more
detailed description of meta-analysis. We believe that these changes bring the volume
more completely in line with the current leading edge of the field.
However, the central theme of providing an introduction to the field of program
evaluation has not been changed. We cover the full range of evaluation research
activities used in appraising the design, implementation, effectiveness, and efficiency of
social programs. Throughout the many revisions of this book, we retain the ambition to
communicate the technical knowledge and collective experiences of practicing
evaluators to those who might consider evaluation as a calling and to those who need to
know what evaluation is all about. Our intended readers are students, practitioners,
sponsors of social programs, social commentators, and anyone concerned with how to
measure the successes and failures of attempts to improve social conditions.
We believe that reading this book will provide enough knowledge to understand and
assess evaluations. However, it is not intended to be a cookbook of procedures for
conducting evaluations, although we identify sources in which such procedures are
described in detail, including references to advanced literature for the adventurous.
Ultimately, nothing teaches how to do evaluations as well as direct experience in
designing and running actual evaluations. We urge all those considering entering the
field of evaluation research to seek hands-on experience.
In the 1970s when the first edition of this book was published, evaluation was not
yet fully established as a way of assessing social programs. It is quite different now. In
the 21st century, evaluation research has become solidly incorporated into the routine
activities of all levels of government throughout the world, into the operations of
nongovernmental organizations, and into the public discussions of social issues. Hardly
a week goes by when the media do not report the results of some evaluation. We believe
that evaluation research makes an important contribution to the formation and
improvement of social policies. Being an evaluator can be an exciting professional role
providing opportunities to participate in the advancement of social well-being along
with the exercise of technical and interpersonal skills.
We dedicate this edition to the memory of Daniel Patrick Moynihan, who died
recently. Over the last half century, Pat Moynihan held an astonishing array of key
positions in academia (Harvard), in federal agencies (Assistant Secretary of Labor in
the Kennedy and Johnson administrations), as White House staff adviser on urban issues
in the Nixon administration, and in the U.S. Senate, where he served four terms representing New York. He
published several influential books on social policy and decision making in the federal
government. His presence in the Senate measurably raised the intellectual level of
Senate deliberations on social policy. In all the positions he held, the improvement of
social policy was his central concern. In addition, he was a firm and eloquent advocate
of social research and of evaluation research in particular. (For an example of his
advocacy, see Exhibit 1-A in Chapter 1.) Pat Moynihan played a critical role in building
and supporting federal evaluation activities as well as advancing the social well-being
of our society.
—P.H.R. and M.W.L.
Chapter Outline
What Is Program Evaluation?
A Brief History of Evaluation
Evaluation Research as a Social Science Activity
The Boom Period in Evaluation Research
Social Policy and Public Administration Movements
Development of Policy and Public Administration Specialists
The Evaluation Enterprise From the Great Society to the Present Day
The Defining Characteristics of Program Evaluation
Application of Social Research Methods
The Effectiveness of Social Programs
Adapting Evaluation to the Political and Organizational Context
Informing Social Action to Improve Social Conditions
Evaluation Research in Practice
Evaluation and the Volatility of Social Programs
Scientific Versus Pragmatic Evaluation Postures
Diversity in Evaluation Outlooks and Approaches
Who Can Do Evaluations?
Since antiquity, organized efforts have been undertaken to describe, understand, and
ameliorate defects in the human condition. This book is rooted in the tradition of
scientific study of social problems—a tradition that has aspired to improve the quality
of our physical and social environments and enhance our individual and collective
well-being through the systematic creation and application of knowledge. Although the
terms program evaluation and evaluation research are relatively recent inventions, the
activities that we will consider under these rubrics are not. They can be traced to the
very beginnings of modern science. Three centuries ago, as Cronbach and colleagues
(1980) point out, Thomas Hobbes and his contemporaries tried to calculate numerical
measures to assess social conditions and identify the causes of mortality, morbidity, and
social disorganization.
Even social experiments, the most technically challenging form of contemporary
evaluation research, are hardly a recent invention. One of the earliest “social
experiments” took place in the 1700s when a British naval captain observed the lack of
scurvy among sailors serving on the ships of Mediterranean countries where citrus fruit
was part of the rations. Thereupon he made half his crew consume limes while the other
half continued with their regular diet. The good captain probably did not know that he
was evaluating a demonstration project nor did he likely have an explicit “program
theory”(a term we will discuss later), namely, that scurvy is a consequence of a vitamin
C deficiency and that limes are rich in vitamin C. Nevertheless, the intervention worked
and British seamen eventually were compelled to consume citrus fruit regularly, a
practice that gave rise to the still-popular label limeys. Incidentally, it took about 50
years before the captain’s “social program” was widely adopted. Then, as now,
diffusion and acceptance of evaluation findings did not come easily.
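The logic of that shipboard trial is the same two-group comparison that anchors modern impact assessment: deliver the intervention to one group, withhold it from an otherwise similar group, and compare outcomes. The few lines of Python below sketch that arithmetic; the counts are entirely hypothetical, since the historical figures are not given here.

# Minimal sketch of the two-group comparison logic behind the shipboard
# "social experiment"; the counts are invented for illustration only.
lime_group_size = 30          # half the crew given limes
lime_group_scurvy = 2
regular_diet_size = 30        # half the crew kept on the regular rations
regular_diet_scurvy = 14

lime_rate = lime_group_scurvy / lime_group_size
regular_rate = regular_diet_scurvy / regular_diet_size

print(f"Scurvy rate with limes:    {lime_rate:.0%}")
print(f"Scurvy rate without limes: {regular_rate:.0%}")
print(f"Estimated reduction:       {regular_rate - lime_rate:.0%}")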
What are the nature and scope of the problem? Where is it located, whom does it
affect, how many are affected, and how does the problem affect them?
What is it about the problem or its effects that justifies new, expanded, or modified
social programs?
What feasible interventions are likely to significantly ameliorate the problem?
What are the appropriate target populations for intervention?
Is a particular intervention reaching its target population?
Is the intervention being implemented well? Are the intended services being
provided?
Is the intervention effective in attaining the desired goals or benefits?
Is the program cost reasonable in relation to its effectiveness and benefits?
Answers to such questions are necessary for local or specialized programs, such as
job training in a small town, a new mathematics curriculum for elementary schools, or
the outpatient services of a community mental health clinic, as well as for broad
national or state programs in such areas as health care, welfare, and educational reform.
Providing those answers is the work of persons in the program evaluation field.
Evaluators use social research methods to study, appraise, and help improve social
programs, including the soundness of the programs’ diagnoses of the social problems
they address, the way the programs are conceptualized and implemented, the outcomes
they achieve, and their efficiency. (Exhibit 1-A conveys the views of one feisty senator
about the need for evaluation evidence on the effectiveness of programs.)
EXHIBIT 1-A
A Veteran Policymaker Wants to See the Evaluation Results
But all the while we were taking on this large—and, as we can now say, hugely
successful—effort [deficit reduction], we were constantly besieged by
administration officials wanting us to add money for this social program or that
social program…. My favorite in this miscellany was something called “family
preservation,” yet another categorical aid program (there were a dozen in place
already) which amounted to a dollop of social services and a press release for some
subcommittee chairman. The program was to cost $930 million over five years,
starting at $60 million in fiscal year 1994. For three decades I had been watching
families come apart in our society; now I was being told by seemingly everyone on
the new team that one more program would do the trick…. At the risk of
indiscretion, let me include in the record at this point a letter I wrote on July 28,
1993, to Dr. Laura D’Andrea Tyson, then the distinguished chairman of the Council
of Economic Advisors, regarding the Family Preservation program:
You will recall that last Thursday when you so kindly joined us at a meeting of
the Democratic Policy Committee you and I discussed the President’s family
preservation proposal. You indicated how much he supports the measure. I assured
you I, too, support it, but went on to ask what evidence was there that it would have
any effect. You assured me there were such data. Just for fun, I asked for two
citations.
The next day we received a fax from Sharon Glied of your staff with a number of
citations and a paper, “Evaluating the Results,” that appears to have been written by
Frank Farrow of the Center for the Study of Social Policy here in Washington and
Harold Richman at the Chapin Hall Center at the University of Chicago. The paper
is quite direct: “Solid proof that family preservation services can affect a state’s
overall placement rates is still lacking.”
Just yesterday, the same Chapin Hall Center released an “Evaluation of the
Illinois Family First Placement Prevention Program: Final Report.” This was a
large scale study of the Illinois Family First initiative authorized by the Illinois
Family Preservation Act of 1987. It was “designed to test effects of this program on
out-of-home placement and other outcomes, such as subsequent child maltreatment.”
Data on case and service characteristics were provided by Family First
caseworkers on approximately 4,500 cases: approximately 1,600 families
participated in the randomized experiment. The findings are clear enough.
This is nothing new. Here is Peter Rossi’s conclusion in his 1992 paper,
“Assessing Family Preservation Programs.” Evaluations conducted to date “do not
form a sufficient basis upon which to firmly decide whether family preservation
programs are either effective or not.” May I say to you that there is nothing in the
least surprising in either of these findings? From the mid-60s on this has been the
repeated, I almost want to say consistent, pattern of evaluation studies. Either few
effects or negative effects. Thus the negative income tax experiments of the 1970s
appeared to produce an increase in family breakup.
I write you at such length for what I believe to be an important purpose. In the
last six months I have been repeatedly impressed by the number of members of the
Clinton administration who have assured me with great vigor that something or other
is known in an area of social policy which, to the best of my understanding, is not
known at all. This seems to me perilous. It is quite possible to live with uncertainty,
with the possibility, even the likelihood that one is wrong. But beware of certainty
where none exists. Ideological certainty easily degenerates into an insistence upon
ignorance.
The great strength of political conservatives at this time (and for a generation) is
that they are open to the thought that matters are complex. Liberals got into a
reflexive pattern of denying this. I had hoped twelve years in the wilderness might
have changed this; it may be it has only reinforced it. If this is so, current revival of
liberalism will be brief and inconsequential.
Respectfully,
Although this text emphasizes the evaluation of social programs, especially human
service programs, program evaluation is not restricted to that arena. The broad scope of
program evaluation can be seen in the evaluations of the U.S. General Accounting
Office (GAO), which have covered the procurement and testing of military hardware,
quality control for drinking water, the maintenance of major highways, the use of
hormones to stimulate growth in beef cattle, and other organized activities far afield
from human services.
Indeed, the techniques described in this text are useful in virtually all spheres of
activity in which issues are raised about the effectiveness of organized social action.
For example, the mass communication and advertising industries use essentially the
same approaches in developing media programs and marketing products. Commercial
and industrial corporations evaluate the procedures they use in selecting, training, and
promoting employees and organizing their workforces. Political candidates develop
their campaigns by evaluating the voter appeal of different strategies. Consumer
products are tested for performance, durability, and safety. Administrators in both the
public and private sectors often assess the managerial, fiscal, and personnel practices of
their organizations. This list of examples could be extended indefinitely.
These various applications of evaluation are distinguished primarily by the nature
and goals of the endeavors being evaluated. In this text, we have chosen to emphasize
the evaluation of social programs—programs designed to benefit the human condition—
rather than efforts that have such purposes as increasing profits or amassing influence
and power. This choice stems from a desire to concentrate on a particularly significant
and active area of evaluation as well as from a practical need to limit the scope of the
book. Note that throughout this book we use the terms evaluation, program evaluation,
and evaluation research interchangeably.
To illustrate the evaluation of social programs more concretely, we offer below
examples of social programs that have been evaluated under the sponsorship of local,
state, and federal government agencies, international organizations, private foundations
and philanthropies, and both nonprofit and for-profit associations and corporations.
These examples illustrate the diversity of social interventions that have been
systematically evaluated. However, all of them involve one particular evaluation
activity: assessing the outcomes of programs. As we will discuss later, evaluation may
also focus on the need for a program, its design, operation and service delivery, or
efficiency.
Following World War II, numerous major federal and privately funded programs
were launched to provide urban development and housing, technological and cultural
education, occupational training, and preventive health activities. It was also during this
time that federal agencies and private foundations made major commitments to
international programs for family planning, health and nutrition, and rural development.
Expenditures were very large and consequently were accompanied by demands for
“knowledge of results.”
By the end of the 1950s, program evaluation was commonplace. Social scientists
engaged in assessments of delinquency prevention programs, psychotherapeutic and
psychopharmacological treatments, public housing programs, educational activities,
community organization initiatives, and numerous other initiatives. Studies were
undertaken not only in the United States, Europe, and other industrialized countries but
also in less developed nations. Increasingly, evaluation components were included in
programs for family planning in Asia, nutrition and health care in Latin America, and
agricultural and community development in Africa (Freeman, Rossi, and Wright, 1980;
Levine, Solomon, and Hellstern, 1981). Expanding knowledge of the methods of social
research, including sample surveys and advanced statistical procedures, and increased
funding and administrative know-how, made possible large-scale, multisite evaluation
studies.
During the 1960s, the number of articles and books about evaluation research grew
dramatically. Hayes’s (1959) monograph on evaluation research in less developed
countries, Suchman’s (1967) review of evaluation research methods, and Campbell’s
(1969) call for social experimentation are a few illustrations. In the United States, a key
impetus for the spurt of interest in evaluation research was the federal war on poverty,
initiated under Lyndon Johnson’s presidency. By the late 1960s, evaluation research had
become a growth industry.
In the early 1970s, evaluation research emerged as a distinct specialty field in the
social sciences. A variety of books appeared, including the first texts (Rossi and
Williams, 1972; Weiss, 1972), critiques of the methodological quality of evaluation
studies (Bernstein and Freeman, 1975), and discussions of the organizational and
structural constraints on evaluation research (Riecken and Boruch, 1974). The first
journal in evaluation, Evaluation Review, was launched in 1976 by Sage Publications.
Other journals followed in rapid succession, and today there are at least a dozen
devoted primarily to evaluation. During this period, special sessions on evaluation
studies at meetings of academic and practitioner groups became commonplace, and
professional associations specifically for evaluation researchers were founded (see
Exhibit 1-B for a listing of the major journals and professional organizations). By 1980,
Cronbach and his associates were able to state that “evaluation has become the liveliest
frontier of American social science” (pp. 12-13).
As evaluation research matured, a qualitative change occurred. In its early years,
evaluation was shaped mainly by the interests of social researchers. In later stages,
however, the consumers of evaluation research exercised a significant influence on the
field. Evaluation is now sustained primarily by funding from policymakers, program
planners, and administrators who use the findings and by the interests of the general
public and the clients of the programs evaluated. Evaluation results may not make front-
page headlines, but they are often matters of intense concern to informed citizens,
program sponsors and decisionmakers, and to those whose lives are affected, directly or
indirectly, by the programs at issue.
EXHIBIT 1-B
Major Evaluation Journals and Professional Organizations
Incorporation of the consumer perspective into evaluation research has moved the
field beyond academic social science. Evaluation has now become a political and
managerial activity that makes significant input into the complex mosaic from which
emerge policy decisions and resources for starting, enlarging, changing, or sustaining
programs to better the human condition. In this regard, evaluation research must be seen
as an integral part of the social policy and public administration movements.
Social programs and the associated evaluation activities have emerged from the
relatively recent transfer of responsibility for the nation’s social and environmental
conditions, and the quality of life of its citizens, to government bodies. As Bremner
(1956) has described, before World War I, except for war veterans, the provision of
human services was seen primarily as the obligation of individuals and voluntary
associations. Poor people, physically and mentally disabled persons, and troubled
families were the clients of local charities staffed mainly by volunteers drawn from the
ranks of the more fortunate. Our image of these volunteers as wealthy matrons toting
baskets of food and hand-me-down clothing to give to the poor and unfortunate is only
somewhat exaggerated. Along with civic associations, charity hospitals, county and
state asylums, locally supported public schools, state normal schools, and sectarian old-
age homes, volunteers were the bulwark of our human service “system.”
The Evaluation Enterprise From the Great Society to the Present Day
EXHIBIT 1-C
The Rise of Policy Analysis
The steady growth in the number, variety, complexity, and social importance of
policy issues confronting government is making increasing intellectual demands on
public officials and their staffs. What should be done about nuclear safety, teenage
pregnancies, urban decline, rising hospital costs, unemployment among black youth,
violence toward spouses and children, and the disposal of toxic wastes? Many of
these subjects were not on the public agenda 20 years ago. They are priority issues
now, and new ones of a similar character emerge virtually every year. For most
elected and appointed officials and their staffs, such complicated and controversial
questions are outside the scope of their judgment and previous experience. Yet the
questions cannot be sidestepped; government executives are expected to deal with
them responsibly and effectively.
To aid them in thinking about and deciding on such matters, public officials have
been depending to an increasing extent on knowledge derived from research, policy
analysis, program evaluations, and statistics to inform or buttress their views. More
often than in the past, elected and appointed officials in the various branches and
levels of government, from federal judges to town selectmen, are citing studies,
official data, and expert opinion in at least partial justification for their actions.
Their staffs, which have been increasing in size and responsibility in recent
decades, include growing numbers of people trained in or familiar with analytic
techniques to gather and evaluate information. Increasing amounts of research,
analysis, and data gathering are being done.
SOURCE: Adapted, with permission, from Laurence E. Lynn, Jr., Designing Public
Policy (Santa Monica, CA: Scott, Foresman, 1980).
EXHIBIT 1-D
The 1960s Growth in Policy Analysis and Evaluation Research
The year 1965 was an important one in the evolution of “policy analysis and
evaluation research” as an independent branch of study. Two developments at the
federal government level—the War on Poverty-Great Society initiative and the
Executive Order establishing the Planning-Programming-Budgeting (PPB) system—
were of signal importance in this regard. Both offered standing, legitimacy, and
financial support to scholars who would turn their skills and interests toward
examining the efficiency with which public measures allocate resources, their
impacts on individual behavior, their effectiveness in attaining the objectives for
which they were designed, and their effects on the well-being of rich versus poor,
minority versus majority, and North versus South.
EXHIBIT 1-E
The Two Arms of Evaluation
Evaluation is the process of determining the merit, worth, and value of things, and
evaluations are the products of that process…. Evaluation is not the mere
accumulation and summarizing of data that are clearly relevant for decision making,
although there are still evaluation theorists who take that to be its definition…. In all
contexts, gathering and analyzing the data that are needed for decision making—
difficult though that often is—comprises only one of the two key components in
evaluation; absent the other component, and absent a procedure for combining them,
we simply lack anything that qualifies as an evaluation. Consumer Reports does not
just test products and report the test scores; it (i) rates or ranks by (ii) merit or
cost-effectiveness. To get to that kind of conclusion requires an input of something
besides data, in the usual sense of that term. The second element is required to get to
conclusions about merit or net benefits, and it consists of evaluative premises or
standards…. A more straightforward approach is just to say that evaluation has two
arms, only one of which is engaged in data-gathering. The other arm collects,
clarifies, and verifies relevant values and standards.
SOURCE: Quoted, with permission, from Michael Scriven, Evaluation Thesaurus, 4th
ed. (Newbury Park, CA: Sage, 1991), pp. 1, 4-5.
Finally, this view does not imply that methodological quality is necessarily the most
important aspect of an evaluation nor that only the highest technical standards, without
compromise, are always appropriate for evaluation research. As Carol Weiss (1972)
once observed, social programs are inherently inhospitable environments for research
purposes. The circumstances surrounding specific programs, and the particular issues
the evaluator is called on to address, frequently compel evaluators to compromise and
adapt textbook methodological standards. The challenges to the evaluator are to match
the research procedures to the evaluation questions and circumstances as well as
possible and, whatever procedures are used, to apply them at the highest standard feasible given those questions and circumstances.
By definition, social programs are activities whose principal reason for existing is
to “do good,” that is, to ameliorate a social problem or improve social conditions. It
follows that it is appropriate for the parties who invest in social programs to hold them
accountable for their contribution to the social good. Correspondingly, any evaluation of
such programs that is worthy of the name must evaluate—that is, judge—the quality of a
program’s performance as it relates to some aspect of its effectiveness in producing
social benefits. More specifically, the evaluation of a program generally involves
assessing one or more of five domains: (1) the need for the program, (2) the program’s
design, (3) its implementation and service delivery, (4) its impact, or outcomes, and (5)
its efficiency. Subsequent chapters will address how evaluators make these
assessments.
EXHIBIT 1-F
Where Politics and Evaluation Meet
First, the policies and programs with which evaluation deals are the creatures of
political decisions. They were proposed, defined, debated, enacted, and funded
through political processes, and in implementation they remain subject to pressures
—both supportive and hostile—that arise out of the play of politics.
Second, because evaluation is undertaken in order to feed into decision making, its
reports enter the political arena. There evaluative evidence of program outcomes
has to compete for attention with other factors that carry weight in the political
process.
Third, and perhaps least recognized, evaluation itself has a political stance. By its
very nature, it makes implicit political statements about such issues as the
problematic nature of some programs and the unchallengeability of others, the
legitimacy of program goals and program strategies, the utility of strategies of
incremental reform, and even the appropriate role of the social scientist in policy
and program formation.
Knowing that political constraints and resistance exist is not a reason for
abandoning evaluation research; rather, it is a precondition for usable evaluation
research. Only when the evaluator has insight into the interests and motivations of
other actors in the system, into the roles that he himself is consciously or
inadvertently playing, the obstacles and opportunities that impinge upon the
evaluative effort, and the limitations and possibilities for putting the results of
evaluation to work—only with sensitivity to the politics of evaluation research—
can the evaluator be as creative and strategically useful as he should be.
SOURCE: From Carol H. Weiss, “Where Politics and Evaluation Research Meet,”
Evaluation Practice, 1993, 14(1):94, where the original 1973 version was reprinted as
one of the classics in the evaluation field.
These assertions assume that an evaluation would not be undertaken unless there
was an audience interested in receiving and, at least potentially, using the findings.
Unfortunately, sponsors sometimes commission evaluations with little intention of using
the findings. For example, an evaluation may be conducted because it is mandated by
program funders and then used only to demonstrate compliance with that requirement.
Responsible evaluators try to avoid being drawn into such situations of “ritualistic”
evaluation. An early step in planning an evaluation, therefore, is a thorough inquiry into
the motivation of the evaluation sponsors, the intended purposes of the evaluation, and
the uses to be made of the findings.
As a practical matter, an evaluation must also be tailored to the organizational
makeup of the program. In designing the evaluation, the evaluator must take into account
any number of organizational factors, such as the availability of administrative
cooperation and support; the ways in which program files and data are kept and access
permitted to them; the character of the services provided; and the nature, frequency,
duration, and location of the contact between the program and its clients. In addition,
once an evaluation is launched, it is common for changes and “in-flight” corrections to
be required. Modifications, perhaps even compromises, may be necessary in the types,
quantity, or quality of the data collected as a result of unanticipated practical or
political obstacles, changes in the operation of the program, or shifts in the interests of
the stakeholders.
Perhaps the single most influential article in the evaluation field was written by the
late Donald Campbell and published in 1969. This article outlined a perspective that
Campbell advanced over several decades: Policy and program decisions should emerge
from continual social experimentation that tests ways to improve social conditions.
Campbell asserted that the technology of social research made it feasible to extend the
experimental model to evaluation research to create an “experimenting society.”
Although he tempered his position in later writing, it is fair to characterize him as fitting
evaluation research into the scientific research paradigm (see Exhibit 1-G).
Campbell’s position was challenged by Lee Cronbach, another giant in the
evaluation field. While acknowledging that scientific investigation and evaluation may
use some of the same research procedures, Cronbach (1982) argued that the purpose of
evaluation sharply differentiates it from scientific research. In his view, evaluation is
more art than science and should be oriented toward meeting the needs of program
decisionmakers and stakeholders. Whereas scientific studies strive principally to meet
research standards, Cronbach thought evaluations should be dedicated to providing the
maximally useful information that the political circumstances, program constraints, and
available resources allow (see Exhibit 1-H).
One might be inclined to agree with both these views—that evaluations should meet
high standards of scientific research and at the same time be dedicated to serving the
information needs of program decisionmakers. The problem is that in practice these two
goals often are not especially compatible. Conducting social research at a high
scientific standard generally requires resources that exceed what is available for
evaluation projects. These resources include time, because high-quality research cannot
be done quickly, whereas program decisions often have to be made on short notice.
They also include the funding needed for the expertise and level of effort required for
high-quality scientific research. Moreover, research within the scientific framework
may require structuring the inquiry in ways that do not mesh well with the perspectives
of those who make decisions about the program. For example, specifying variables so
that they are well defined and measurable under scientific standards may appear to
trivialize what policymakers see as complex and dynamic facets of the program.
Similarly, scientific standards for inferring causality, as when investigating program
outcomes (was the program the cause of an observed change?), may require such
elaborate experimental controls that what is studied is no longer the program’s typical
services, but some constrained version of uncertain relevance to the actual program.
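To make the causal-inference issue concrete, the following sketch (in Python, with invented outcome scores) shows the minimal comparison a randomized design supports: outcomes for randomly assigned program and control groups are compared, and the uncertainty of the difference is estimated. It illustrates the general logic only; it is not a procedure prescribed in this text.

# Illustrative only: estimating a program effect from a simple randomized
# design. The outcome scores below are invented for this sketch.
import math
import statistics

program_group = [68, 72, 75, 71, 69, 74, 70, 73]   # outcomes for program participants
control_group = [65, 66, 70, 64, 68, 67, 66, 69]   # outcomes for randomized controls

effect = statistics.mean(program_group) - statistics.mean(control_group)

# Standard error of the difference between two independent group means.
standard_error = math.sqrt(
    statistics.variance(program_group) / len(program_group)
    + statistics.variance(control_group) / len(control_group)
)

print(f"Estimated program effect: {effect:.2f} points")
print(f"Approximate 95% confidence interval: "
      f"{effect - 1.96 * standard_error:.2f} to {effect + 1.96 * standard_error:.2f}")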
EXHIBIT 1-G
Reforms as Experiments
The United States and other modern nations should be ready for an experimental
approach to social reform, an approach in which we try out new programs designed
to cure specific social problems, in which we learn whether or not these programs
are effective, and in which we retain, imitate, modify, or discard them on the basis
of apparent effectiveness on the multiple imperfect criteria available.
EXHIBIT 1-H
Evaluators as Teachers
An evaluative study of a social program is justified to the extent that it facilitates the
work of the polity. It therefore is to be judged primarily by its contribution to public
thinking and to the quality of service provided subsequent to the evaluation…. An
evaluation pays off to the extent that it offers ideas pertinent to pending actions and
people think more clearly as a result. To enlighten, it must do more than amass good
data. Timely communications—generally not “final” ones—should distribute
information to the persons rightfully concerned, and those hearers should take the
information into their thinking. To speak broadly, an evaluation ought to inform and
improve the operations of the social system.
SOURCE: Quoted from Lee J. Cronbach and Associates, Toward Reform of Program
Evaluation (San Francisco: Jossey-Bass, 1980), pp. 65-66.
Nor can one blithely dismiss scientific concerns in evaluation. Properly understood,
the scientific approach is a very considered attempt to produce conclusions that are
valid and credible. Even when an evaluation falls short of this ideal—as all do to some
extent—science-based findings make an important contribution to a decision-making
context that is otherwise rife with self-interested perceptions and assertions, ideological
biases, and undocumented claims. But this statement, in turn, assumes that the evaluation
conclusions meaningfully address the situation of concern to decisionmakers; if not, they
may be praiseworthy for their validity and credibility, but still be irrelevant.
In practice, therefore, the evaluator must struggle to find a workable balance
between the emphasis placed on procedures that ensure the validity of the findings and
those that make the findings timely, meaningful, and useful to the consumers. Where that
balance point should be will depend on the purposes of the evaluation, the nature of the
program, and the political or decision-making context. In many cases, evaluations will
justifiably be undertaken that are “good enough” for answering relevant policy and
program questions even though program conditions or available resources prevent them
from being the best possible designs from a scientific standpoint.
A further complication is that it is often unclear who the ultimate users of the
evaluation will be and which of the potential users should be given priority in the
design. An evaluation generally has various audiences, some with very immediate
interests in particular aspects of the program under investigation, some with broader
interests in the type of intervention the particular program represents, and others falling
somewhere in between. Occasionally, the purposes and priority users of an evaluation
are defined so clearly and explicitly in advance that the evaluator has relatively little
difficulty in balancing scientific and pragmatic considerations. However, many
evaluation situations are not so clear-cut. Evaluation may be routinely required as part
of funding or contract arrangements with the presumption that it will be generally
informative to a program’s managers, the evaluation sponsors, and other interested
parties. Or it may evolve from a collaboration between a service agency with a need for
information for management purposes and a researcher with broader interests in the type
of intervention that a particular program provides. Under such circumstances, the trade-
offs between utility for program decisionmakers and scientific rigor are such that it is
rarely possible to design an evaluation that serves both interests well.
Some evaluation theorists champion utilization of evaluation as the overriding
concern and advocate evaluation that is designed around the information needs of
specific stakeholding consumers with whom the evaluator collaborates very closely
(e.g., Patton, 1997). A contrary view is advanced by the authors of review articles in
applied research journals who attempt to synthesize available research on the
effectiveness of various interventions. These scholars generally deplore the poor
methodological quality of evaluation studies and urge a higher standard. Some
commentators want to have it both ways and press the view that evaluations should
strive to have utility to program stakeholders and contribute to cumulative knowledge
about social intervention (Lipsey, 1997). Our outlook, for the didactic purposes of this
book, is that all these options are defensible, but not necessarily equally defensible in
any given evaluation situation. This, then, presents yet another issue for which the
evaluator must make a judgment call and attempt to tailor the evaluation design to the
particular purposes and circumstances.
EXHIBIT 1-I
The Ideal Evaluation Theory
The ideal (never achievable) evaluation theory would describe and justify why
certain evaluation practices lead to particular kinds of results across situations that
evaluators confront. It would (a) clarify the activities, processes, and goals of
evaluation; (b) explicate relationships among evaluative activities and the processes
and goals they facilitate; and (c) empirically test propositions to identify and
address those that conflict with research and other critically appraised knowledge
about evaluation.
SOURCE: Quoted from William R. Shadish, Thomas D. Cook, and Laura C. Leviton,
Foundations of Program Evaluation: Theories of Practice (Newbury Park, CA: Sage,
1991), pp. 30-31.
EXHIBIT 1-J
Diversity of the Members of the American Evaluation Association (in percentages)
Summary
Program evaluation is the use of social research methods to systematically
investigate the effectiveness of social intervention programs. It draws on the techniques
and concepts of social science disciplines and is intended to be useful for improving
programs and informing social action aimed at ameliorating social problems.
Modern evaluation research grew from pioneering efforts in the 1930s and
burgeoned in the years after World War II as new methodologies were developed that
could be applied to the rapidly growing social program arena. The social policy and
public administration movements have contributed to the professionalization of the field
and to the sophistication of the consumers of evaluation research.
The need for program evaluation is undiminished in the current era and may even
be expected to grow. Indeed, contemporary concern over the allocation of scarce
resources makes it more essential than ever to evaluate the effectiveness of social
interventions.
Most evaluators are trained either in one of the social sciences or in professional
schools that offer applied social research courses. Highly specialized, technical, or
complex evaluations may require specialized evaluation staffs. A basic knowledge of
the evaluation field, however, is relevant not only to those who will perform
evaluations but also to the consumers of evaluation research.
KEY CONCEPTS
Evaluation sponsor
The person, group, or organization that requests or requires the evaluation and provides
the resources to conduct it.
Program evaluation
The use of social research methods to systematically investigate the effectiveness of
social intervention programs in ways that are adapted to their political and
organizational environments and are designed to inform social action in ways that
improve social conditions.
Stakeholders
Individuals, groups, or organizations having a significant interest in how well a program
functions, for instance, those with decision-making authority over the program, funders
and sponsors, administrators and personnel, and clients or intended beneficiaries.
Utilization of evaluation
The use of the concepts and findings of an evaluation by decisionmakers and other
stakeholders whether at the day-to-day management level or at broader funding or
policy levels.
1. Terms in boldface are defined in the Key Concepts list at the end of the chapter and in
the Glossary.
Tailoring Evaluations
Chapter Outline
What Aspects of the Evaluation Plan Must Be Tailored?
What Features of the Situation Should the Evaluation Plan Take Into Account?
The Purposes of the Evaluation
Program Improvement
Accountability
Knowledge Generation
Hidden Agendas
The Program Structure and Circumstances
Stage of Program Development
Administrative and Political Context of the Program
Conceptual and Organizational Structure of the Program
The Resources Available for the Evaluation
The Nature of the Evaluator-Stakeholder Relationship
Evaluation Questions and Evaluation Methods
Needs Assessment
Assessment of Program Theory
Assessment of Program Process
Impact Assessment
Efficiency Assessment
Every evaluation must be tailored to a specific set of circumstances. The tasks that
evaluators undertake depend on the purposes of the evaluation, the conceptual and
organizational structure of the program being evaluated, and the resources available.
Formulating an evaluation plan requires the evaluator to first explore these aspects
of the evaluation situation with the evaluation sponsor and other key stakeholders.
Based on this reconnaissance, the evaluator can then develop a plan that identifies
the evaluation questions to be answered, the methods for answering them, and the
relationships to be developed with the stakeholders during the course of the
evaluation.
No hard-and-fast guidelines direct the process of designing an evaluation.
Nonetheless, achieving a good fit between the evaluation plan and the program
circumstances involves attention to certain critical themes. It is essential that the
evaluation plan be responsive to the purposes of the evaluation as understood by the
evaluation sponsor and key stakeholders. An evaluation intended to provide feedback
to program decisionmakers for improving a program will take a different approach
than one intended to help funders determine whether a program should be
terminated. In addition, the evaluation plan must reflect an understanding of how the
program is designed and organized so that the questions asked and the data collected
will be appropriate to the circumstances. Finally, of course, any evaluation will have
to be designed within the constraints of available time, personnel, and funding.
Although the particulars are diverse, the situations confronting the evaluator
typically present one of a small number of variations. In practice, therefore, tailoring
an evaluation is usually a matter of selecting and adapting one or another of a set of
familiar evaluation schemes to the circumstances at hand. One set of evaluation
schemes centers around the nature of the evaluator-stakeholder relations. Another
distinct set of approaches is organized around common combinations of evaluation
questions and the usual methods for answering them. This chapter provides an
overview of the issues and considerations the evaluator should take into account
when tailoring an evaluation plan.
One of the most challenging aspects of evaluation is that there is no “one size fits all”
approach. Every evaluation situation has a different and sometimes unique profile of
characteristics. The evaluation design must, therefore, involve an interplay between the
nature of the evaluation situation and the evaluator’s repertoire of approaches,
techniques, and concepts. A good evaluation design is one that fits the circumstances
while yielding credible and useful answers to the questions that motivate it. We begin
our discussion of how to accomplish this goal by taking inventory of the aspects of the
evaluation plan that need to be tailored to the program and the context of the evaluation.
The questions the evaluation is to answer. A large number of questions might be raised
about any social program by interested parties. There may be concerns about such
matters as the needs of the targets (persons, families, or social groups) to which a
program is directed and whether they are being adequately served, the management and
operation of the program, whether the program is having the desired impact, and its
costs and efficiency. No evaluation can, nor generally should, attempt to address all
such concerns. A central feature of an evaluation design, therefore, is specification of its
guiding purpose and the corresponding questions on which it will focus.
The methods and procedures the evaluation will use to answer the questions. A
critical skill for the evaluator is knowing how to obtain useful, timely, and credible
information about the various aspects of program performance. A large repertoire of
social research techniques and conceptual tools is available for this task. An evaluation
design must identify the methods that will be used to answer each of the questions at
issue and organize them into a feasible work plan. Moreover, the methods selected must
be practical as well as capable of providing meaningful answers to the questions with
the degree of scientific rigor appropriate to the evaluation circumstances.
Evaluations are initiated for many reasons. They may be intended to help
management improve a program; support advocacy by proponents or critics; gain
knowledge about the program’s effects; provide input to decisions about the program’s
funding, structure, or administration; or respond to political pressures. One of the first
determinations the evaluator must make is just what the purposes of a specific
evaluation are. This is not always a simple matter. A statement of the purposes generally
accompanies the initial request for an evaluation, but these announced purposes rarely
tell the whole story and sometimes are only rhetorical. Furthermore, an evaluation may be routinely required in a program situation or sought simply because it is presumed to be a good idea, without any distinct articulation of the sponsor’s intent (see Exhibit 2-A).
The prospective evaluator must attempt to determine who wants the evaluation, what
they want, and why they want it. There is no cut-and-dried method for doing this, but it
is usually best to approach this task the way a journalist would dig out a story. The
evaluator should examine source documents, interview key informants with different
vantage points, and uncover pertinent history and background. Generally, the purposes
of the evaluation will relate mainly to program improvement, accountability, or
knowledge generation (Chelimsky, 1997), but sometimes quite different motivations are
in play.
EXHIBIT 2-A
Does Anybody Want This Evaluation?
SOURCE: Quoted from Dennis J. Palumbo and Michael A. Hallett, “Conflict Versus
Consensus Models in Policy Evaluation and Implementation,” Evaluation and Program
Planning, 1993, 16(1):11-23.
EXHIBIT 2-B
A Stop-Smoking Telephone Line That Nobody Called
Formative evaluation procedures were used to help design a “stop smoking” hotline
for adult smokers in a cancer control project sponsored by a health maintenance
organization (HMO). Phone scripts for use by the hotline counselors and other
aspects of the planned services were discussed with focus groups of smokers and
reviewed in telephone interviews with a representative sample of HMO members
who smoked. Feedback from these informants led to refinement of the scripts, hours
of operation arranged around the times participants said they were most likely to
call, and advertising of the service through newsletters and “quit kits” routinely
distributed to all project participants. Despite these efforts, an average of less than
three calls per month was made during the 33 months the hotline was in operation.
To further assess this disappointing response, comparisons were made with similar
services around the country. This revealed that low use rates were typical but the
other hotlines served much larger populations and therefore received more calls.
The program sponsors concluded that to be successful, the smoker’s hotline would
have to be offered to a larger population and be intensively publicized.
Accountability
Knowledge Generation
Some evaluations are undertaken to describe the nature and effects of an intervention
as a contribution to knowledge. For instance, an academic researcher might initiate an
evaluation to test whether a program designed on the basis of theory, say, an innovative
science curriculum, is workable and effective (see Exhibit 2-D for an example).
Similarly, a government agency or private foundation may mount and evaluate a
demonstration program to investigate a new approach to a social problem, which, if
successful, could then be implemented more widely. Because evaluations of this sort are
intended to make contributions to the social science knowledge base or be a basis for
significant program innovation, they are usually conducted using the most rigorous
methods feasible. The audience for the findings will include the sponsors of the
research as well as a broader audience of interested scholars and policymakers. In these
situations, the findings of the evaluation are most likely to be disseminated through
scholarly journals, research monographs, conference papers, and other professional
outlets.
EXHIBIT 2-C
U.S. General Accounting Office Assesses Early Effects of the Mammography Quality
Standards Act
The Mammography Quality Standards Act of 1992 required the Food and Drug
Administration (FDA) to administer a code of standards for mammogram-screening procedures in all the states. When the act was passed, Congress was concerned that access to mammography services might decrease if providers chose to drop them
rather than upgrade to comply with the new standards. The U.S. General Accounting
Office (GAO) was asked to assess the early effects of implementing the act and
report back to Congress. It found that the FDA had taken a gradual approach to
implementing the requirements, which had helped to minimize adverse effects on
access. The FDA inspectors had not closed many facilities that failed to meet the
standards but, instead, had allowed additional time to correct the problems found
during inspections. Only a relatively small number of facilities had terminated their
mammography services and those were generally small providers located within 25
miles of another certified facility. The GAO concluded that the Mammography
Quality Standards Act was having a positive effect on the quality of mammography
services, as Congress had intended.
Hidden Agendas
Sometimes the true purpose of the evaluation, at least for those who initiate it, has
little to do with actually obtaining information about the program’s performance.
Program administrators or boards may launch an evaluation because they believe it will
be good public relations and might impress funders or political decisionmakers.
Occasionally, an evaluation is commissioned to provide a rationale for a decision that
has already been made behind the scenes to terminate a program, fire an administrator,
and the like. Or the evaluation may be commissioned as a delaying tactic to appease
critics and defer difficult decisions.
Virtually all evaluations involve some political maneuvering and public relations,
but when these are the principal purposes, the prospective evaluator is presented with a
difficult dilemma. The evaluation must either be guided by the political or public
relations purposes, which will likely compromise its integrity, or focus on program
performance issues that are of no real interest to those commissioning the evaluation and
may even be threatening to them. In either case, the evaluator is well advised to try to
avoid such situations. If a lack of serious intent becomes evident during the initial
exploration of the evaluation context, the prospective evaluator may wish to decline to
participate. Alternatively, the evaluator might assume a consultant role at that point to
help the parties clarify the nature of evaluation and redirect their efforts toward
approaches more appropriate to their purposes.
EXHIBIT 2-D
Testing an Innovative Treatment Concept for Pathological Gambling
EXHIBIT 2-E
Stages of Program Development and Related Evaluation Functions
SOURCE: Adapted from S. Mark Pancer and Anne Westhues, “A Developmental Stage
Approach to Program Planning and Evaluation,” Evaluation Review, 1989, 13(1):56-77.
Except possibly for academic researchers who conduct evaluation studies on their
own initiative for the purpose of generating knowledge, evaluators are not free to
establish their own definitions of what the program is about, its goals and objectives,
and what evaluation questions should be addressed. The evaluator works with the
evaluation sponsor, program management, and other stakeholders to develop this
essential background. Different perspectives from these various groups are to be
expected. In most instances, the evaluator will solicit input from all the major
stakeholders and attempt to incorporate their concerns so that the evaluation will be as
inclusive and informative as possible.
If significant stakeholders are not in substantial agreement about the mission, goals,
or other critical issues for the program, evaluation design becomes very difficult (see
Exhibit 2-F). The evaluator can attempt to incorporate the conflicting perspectives into
the design, but this may not be easy. The evaluation sponsors may not be willing to
embrace the inclusion of issues and perspectives from groups they view as adversaries.
Furthermore, these perspectives may be so different that they cannot be readily
incorporated into a single evaluation plan with the time and resources available.
Alternatively, the evaluator can plan the evaluation from the perspective of only one
of the stakeholders, typically the evaluation sponsor. This, of course, will not be greeted
with enthusiasm by stakeholders with conflicting perspectives and they will likely
oppose the evaluation and criticize the evaluator. The challenge to the evaluator is to be
clear and straightforward about the perspective represented in the evaluation and the
reasons for it, despite the objections. It is not necessarily inappropriate for an
evaluation sponsor to insist that the evaluation emphasize its perspective, nor is it
necessarily wrong for an evaluator to conduct an evaluation from that perspective
without giving strong representation to conflicting views.
EXHIBIT 2-F
Stakeholder Conflict Over Home Arrest Program
There were numerous conflicting goals that were considered important by different
agencies, including lowering costs and prison diversion, control and public safety,
intermediate punishment and increased options for corrections, and treatment and
rehabilitation. Different stakeholders emphasized different goals. Some legislators
stressed reduced costs, others emphasized public safety, and still others were
mainly concerned with diverting offenders from prison. Some implementers stressed
the need for control and discipline for certain “dysfunctional” individuals, whereas
others focused on rehabilitation and helping offenders become reintegrated into
society. Thus, there was no common ground for enabling “key policy-makers,
managers, and staff” to come to an agreement about which goals should have
priority or about what might constitute program improvement.
Suppose, for instance, that the funders of a job training program for unemployed
persons have concerns about whether the program is mainly taking cases that are easy to
work with and, additionally, is providing only vocational counseling services rather
than training in marketable job skills. The sponsors may appropriately commission an
evaluation to examine these questions. Program managers, in contrast, will likely have a
sharply conflicting perspective that justifies their selection of clients, program
activities, and management practices. A conscientious evaluator will listen to the
managers’ perspective and encourage their input so that the evaluation can be as
sensitive as possible to their legitimate concerns about what the program is doing and
why. But the evaluation design should, nonetheless, be developed primarily from the
perspective of the program funders and the issues that concern them. The evaluator’s
primary obligations are to be forthright about the perspective the evaluation takes, so
there is no misunderstanding, and to treat the program personnel fairly and honestly.
Another approach to situations of stakeholder conflict is for the evaluator to design
an evaluation that attempts to facilitate better understanding among the conflicting
parties about the aspects of the program at issue. This might be done through efforts to
clarify the nature of the different concerns, assumptions, and perspectives of the parties.
For instance, parents of special education children may believe that their children are
stigmatized and discriminated against when mainstreamed in regular classrooms.
Teachers may feel equally strongly that this is not true. A careful observational study by
an evaluator of the interaction of regular and special education children may reveal that
there is a problem, as the parents claim, but that it occurs outside the classroom on the
playground and during other informal interactions among the children, thus accounting
for the teachers’ perspective.
Where stakeholder conflict is deep and hostile, it may be based on such profound
differences in political values or ideology that an evaluation, no matter how
comprehensive and ecumenical, cannot reconcile them. One school of thought in the
evaluation field holds that many program situations are of this sort and that differences
in values and ideology are the central matter to which the evaluator must attend. In this
view, the social problems that programs address, the programs themselves, and the
meaning and importance of those programs are all social constructions that will
inevitably differ for different individuals and groups. Thus, rather than focus on program
objectives, decisions, outcomes, and the like, evaluators are advised to directly engage
the diverse claims, concerns, issues, and values put forth by the various stakeholders.
Guba and Lincoln (1987, 1989, 1994), the leading proponents of this particular
construction of evaluation, have argued that the proper role of the evaluator is to
encourage interpretive dialogue among the program stakeholders. From this perspective,
the primary purpose of an evaluation is to facilitate negotiations among the
stakeholders from which a more shared construction of the value and social significance of
the program can emerge that still respects the various ideologies and concerns of the
different stakeholders.
Finally, evaluators must realize that, despite their best efforts to communicate
effectively and develop appropriate, responsive evaluation plans, program stakeholders
owe primary allegiance to their own positions and political alignments. This means that
sponsors of evaluation and other stakeholders may turn on the evaluator and harshly
criticize the evaluation if the results contradict the policies and perspectives they
advocate. Thus, even those evaluators who do a superb job of working with
stakeholders and incorporating their views and concerns in the evaluation plan should not
expect to be acclaimed as heroes by all when the results are in. The multiplicity of
stakeholder perspectives makes it likely that no matter how the results come out,
someone will be unhappy. It may matter little that everyone agreed in advance on the
evaluation questions and the plan for answering them, or that each stakeholder group
understood that honest results might not favor its position. Nonetheless, it is highly
advisable for the evaluator to give early attention to identifying stakeholders, devising
strategies for minimizing discord due to their different perspectives, and conditioning
their expectations about the evaluation results.
It is a simple truism that if stakeholders do not have a clear idea about what a
program is supposed to be doing, it will be difficult to evaluate how well it is doing it.
One factor that shapes the evaluation design, therefore, is the conceptualization of the
program, or the program theory, that is, its plan of operation, the logic that connects its
activities to the intended outcomes, and the rationale for why it does what it does. As
we will discuss later in this chapter, this conceptual structure can itself be a focus of
evaluation. The more explicit and cogent the program conceptualization, the easier it
will be for the evaluator to identify the program functions and effects on which the
evaluation should focus. If there is significant uncertainty about whether the program
conceptualization is appropriate for the social problem the program addresses, it may
make little sense for the evaluation design to focus on how well the conceptualization
has been implemented. In such cases, the evaluation may be more usefully devoted to
assessing and better developing the program plan. In the planning stages of a new
program, an evaluator can often help sharpen and shape the program design to make it
both more explicit and more likely to effectively achieve its objectives.
When a program is well established, everyday practice and routine operating
procedures tend to dominate, and key stakeholders may find it difficult to articulate the
underlying program rationale or agree on any single version of it. For instance, the
administrators of a counseling agency under contract to a school district to work with
children having academic problems may be quite articulate about their counseling
theories, goals for clients, and therapeutic techniques. But they may have difficulty
expressing a clear view of how their focus on improving family communication is
supposed to translate into better grades. It may then become the task of the evaluator to
help program personnel to formulate the implicit but unarticulated rationale for program
activities.
At a more concrete level, evaluators also need to take into consideration the
organizational structure of the program when planning an evaluation. Such program
characteristics as multiple services or multiple target populations, distributed service
sites or facilities, or extensive collaboration with other organizational entities have
powerful implications for evaluation. In general, organizational structures that are
larger, more complex, more decentralized, and more geographically dispersed will
present greater practical difficulties than their simpler counterparts. In such cases, a
team of evaluators is often needed, with resources and time proportionate to the size and
complexity of the program. The challenges of evaluating complex, multisite programs
are sufficiently daunting that they are distinct topics of discussion in the evaluation
literature (see Exhibit 2-G; Turpin and Sinacore, 1991).
Equally important are the nature and structure of the particular intervention or
service the program provides. The easiest interventions to evaluate are those that
involve discrete, concrete activities (e.g., serving meals to homeless persons) expected
to have relatively immediate and observable effects (the beneficiaries of the program
are not hungry). The organizational activities and delivery systems for such
interventions are usually straightforward (soup kitchen), the service itself is
uncomplicated (hand out meals), and the outcomes are direct (people eat). These
features greatly simplify the evaluation questions likely to be raised, the data collection
required to address them, and the interpretation of the findings.
EXHIBIT 2-G
Multisite Evaluations in Criminal Justice: Structural Obstacles to Success
The most difficult interventions to evaluate are those that are diffuse in nature (e.g.,
community organizing), extend over long time periods (an elementary school math
curriculum), vary widely across applications (psychotherapy), or have expected
outcomes that are long term (preschool compensatory education) or indistinct (improved
quality of life). For such interventions, many evaluation questions dealing with a
program’s process and outcome can arise because of the ambiguity of the services and
their potential effects. The evaluator may also have difficulty developing measures that
capture the critical aspects of the program’s implementation and outcomes. Actual data
collection, too, may be challenging if it must take place over extended time periods or
involve many different variables and observations. All these factors have implications
for the evaluation plan and, especially, for the effort and resources required to complete
the plan.
All these groups or only a few may be involved in any given evaluation. But,
whatever the assortment of stakeholders, the evaluator must be aware of their concerns
and include in the evaluation planning appropriate means for interacting with at least the
major stakeholders (Exhibit 2-H provides suggestions about how to do that).
At the top of the list of stakeholders is the evaluation sponsor. The sponsor is the
agent who initiates the evaluation, usually provides the funding, and makes the decisions
about how and when it will be done and who should do it. Various relationships with
the evaluation sponsor are possible and will largely depend on the sponsor’s
preferences and whatever negotiation takes place with the evaluator. A common
situation is one in which the sponsor expects the evaluator to function as an independent
professional practitioner who will receive guidance from the sponsor, especially at the
beginning, but otherwise take full responsibility for planning, conducting, and reporting
the evaluation. For instance, program funders often commission evaluations by
publishing a request for proposals (RFP) or applications (RFA) to which evaluators
respond with statements of their capability, proposed design, budget, and time line, as
requested. The evaluation sponsor then selects an evaluator from among those
responding and establishes a contractual arrangement for the agreed-on work.
Other situations call for the evaluator to work more collaboratively with the
evaluation sponsor. The sponsor may want to be involved in the planning,
implementation, and analysis of results, either to react step by step as the evaluator
develops the project or to actually participate with the evaluator in each step. Variations
on this form of relationship are typical for internal evaluators who are part of the
organization whose program is being evaluated. In such cases, the evaluator generally
works closely with management in planning and conducting the evaluation, whether
management of the evaluation unit, the program being evaluated, someone higher up in
the organization, or some combination.
EXHIBIT 2-H
Stakeholder Involvement in Evaluation: Suggestions for Practice
Based on experience working with school district staff, one evaluator offers the
following advice for bolstering evaluation use through stakeholder involvement:
Identify stakeholders: At the outset, define the specific stakeholders who will
be involved with emphasis on those closest to the program and who hold high
stakes in it.
Involve stakeholders early: Engage stakeholders in the evaluation process as
soon as they have been identified because many critical decisions that affect
the evaluation occur early in the process.
Involve stakeholders continuously: The input of key stakeholders should be
part of virtually all phases of the evaluation; if possible, schedule regular
group meetings.
Involve stakeholders actively: The essential element of stakeholder
involvement is that it be active; stakeholders should be asked to address design
issues, help draft survey questions, provide input into the final report, and
deliberate about all important aspects of the project.
Establish a structure: Develop and use a conceptual framework based in
content familiar to stakeholders that can help keep dialogue focused. This
framework should highlight key issues within the local setting as topics for
discussion so that stakeholders can share concerns and ideas, identify
information needs, and interpret evaluation results.
In some instances, the evaluation sponsor will ask that the evaluator work
collaboratively but stipulate that the collaboration be with another stakeholder group.
For instance, private foundations often want evaluations to be developed in
collaboration with the local stakeholders of the programs they fund. An especially
interesting variant of this approach is when it is required that the recipients of program
services take the primary role in planning, setting priorities, collecting information, and
interpreting the results of the evaluation.
The evaluator’s relationship to the evaluation sponsor and other stakeholders is so
central to the evaluation context and planning process that a special vocabulary has
arisen to describe various circumstances. The major recognized forms of evaluator-
stakeholder relationships are as follows:
EXHIBIT 2-I
Successful Communication With Stakeholders
Torres, Preskill, and Piontek (1996) surveyed and interviewed members of the
American Evaluation Association about their experiences communicating with
stakeholders and reporting evaluation findings. The respondents identified the
following elements of effective communication:
SOURCE: Adapted from Rosalie T. Torres, Hallie S. Preskill, and Mary E. Piontek,
Evaluation Strategies for Communicating and Reporting: Enhancing Learning in
Organizations (Thousand Oaks, CA: Sage, 1996), pp. 4-6.
These forms of evaluation are discussed in detail in Chapters 4-11. Here we will only
provide some guidance regarding the circumstances for which each is most appropriate.
Needs Assessment
The primary rationale for a social program is to alleviate a social problem. The
impetus for a new program to increase literacy, for example, is likely to be recognition
that a significant proportion of persons in a given population are deficient in reading
skills. Similarly, an ongoing program may be justified by the persistence of a social
problem: Driver education in high schools receives public support because of the
continuing high rates of automobile accidents among adolescent drivers.
One important form of evaluation, therefore, assesses the nature, magnitude, and
distribution of a social problem; the extent to which there is a need for intervention; and
the implications of these circumstances for the design of the intervention. These
diagnostic activities are referred to as needs assessment in the evaluation field but
overlap what is called social epidemiology and social indicators research in other
fields (McKillip, 1987; Reviere et al., 1996; Soriano, 1995; Witkin and Altschuld,
1995). Needs assessment is often a first step in planning a new program or restructuring
an established one to provide information about what services are needed and how they
might best be delivered. Needs assessment may also be appropriate to examine whether
established programs are responsive to the current needs of the target participants and
provide guidance for improvement. Exhibit 2-J provides an example of one of the
several approaches that can be taken. Chapter 4 discusses the various aspects of needs
assessment in detail.
Assessment of Program Theory
Given a recognized problem and need for intervention, it does not follow that any
program, willy-nilly, will be appropriate for the job. The conceptualization and design
of the program must reflect valid assumptions about the nature of the problem and
represent a feasible approach to resolving it. Put another way, every social program is
based on some plan or blueprint that represents the way it is “supposed to work.” This
plan is rarely written out in complete detail but exists nonetheless as a shared
conceptualization among the principal stakeholders. Because this program plan consists
essentially of assumptions and expectations about how the program should conduct its
business in order to attain its goals, we will refer to it as the program theory. If this
theory is faulty, the intervention will fail no matter how elegantly it is conceived or how
well it is implemented (Chen, 1990; Weiss, 1972).
EXHIBIT 2-J
Needs for Help Among Homeless Men and Women
EXHIBIT 2-K
A Flaw in the Design of Family Preservation Programs
SOURCE: Adapted from Joseph S. Wholey, “Assessing the Feasibility and Likely
Usefulness of Evaluation,” in Handbook of Practical Program Evaluation, eds. J. S.
Wholey, H. P. Hatry, and K. E. Newcomer (San Francisco: Jossey-Bass, 1994), pp. 29-
31. Wholey’s account, in turn, is based on Kaye and Bell (1993).
EXHIBIT 2-L
Failure on the Front Lines: Implementing Welfare Reform
SOURCE: Adapted from Marcia K. Meyers, Bonnie Glaser, and Karin MacDonald,
“On the Front Lines of Welfare Delivery: Are Workers Implementing Policy Reforms?”
Journal of Policy Analysis and Management, 1998, 17(1):1-22.
Assessment of Program Process
Process evaluation is the most frequent form of program evaluation. It is used both
as a freestanding evaluation and in conjunction with impact assessment (discussed
below) as part of a more comprehensive evaluation. As a freestanding evaluation, it
yields quality assurance information, assessing the extent to which a program is
implemented as intended and operating up to the standards established for it. When the
program model employed is one of established effectiveness, a demonstration that the
program is well implemented can be presumptive evidence that the expected outcomes
are produced as well. When the program is new, a process evaluation provides
valuable feedback to administrators and other stakeholders about the progress that has
been made implementing the program plan. From a management perspective, process
evaluation provides the feedback that allows a program to be managed for high
performance (Wholey and Hatry, 1992), and the associated data collection and reporting
of key indicators may be institutionalized in the form of a management information
system (MIS) to provide routine, ongoing performance feedback.
In its other common application, process evaluation is an indispensable adjunct to
impact assessment. The information about program outcomes that evaluations of impact
provide is incomplete and ambiguous without knowledge of the program activities and
services that produced those outcomes. When no impact is found, process evaluation
has significant diagnostic value by indicating whether this was because of
implementation failure, that is, the intended services were not provided and hence the
expected benefits could not have occurred, or theory failure, that is, the program was
implemented as intended but failed to produce the expected effects. On the other hand,
when program effects are found, process evaluation helps confirm that they resulted
from program activities, rather than spurious sources, and identify the aspects of service
most instrumental to producing the effects. Process evaluation is described in more
detail in Chapter 6.
Impact Assessment
An impact assessment, sometimes called an impact evaluation or outcome
evaluation, gauges the extent to which a program produces the intended improvements in
the social conditions it addresses. Impact assessment asks whether the desired outcomes
were attained and whether the program also produced unintended side effects.
The major difficulty in assessing the impact of a program is that usually the desired
outcomes can also be caused by factors unrelated to the program. Accordingly, impact
assessment involves producing an estimate of the net effects of a program—the changes
brought about by the intervention above and beyond those resulting from other processes
and events affecting the targeted social conditions. To conduct an impact assessment, the
evaluator must thus design a study capable of establishing the status of program
recipients on relevant outcome measures and also estimating what their status would be
had they not received the intervention. Much of the complexity of impact assessment is
associated with obtaining a valid estimate of the latter status, known as the
counterfactual because it describes a condition contrary to what actually happened to
program recipients (Exhibit 2-M presents an example of impact evaluation).
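To make the arithmetic of a net effect concrete, the following minimal sketch (in Python, with entirely hypothetical outcome scores) treats a comparison group as the estimate of the counterfactual and computes the net effect as the observed outcome for program recipients minus that estimate. It is an illustration of the logic only, not a recommended analysis.

```python
# Illustrative sketch: estimating a program's net effect as the difference
# between observed outcomes for program recipients and a counterfactual
# estimate (here, a comparison group assumed to be similar).
# Group names and outcome values are hypothetical.

from statistics import mean

program_group = [72, 68, 75, 80, 66, 74, 71]      # outcome scores for recipients
comparison_group = [65, 70, 63, 68, 64, 66, 69]   # stand-in for the counterfactual

observed_outcome = mean(program_group)             # recipients' status on the outcome
counterfactual_estimate = mean(comparison_group)   # estimated status without the program

net_effect = observed_outcome - counterfactual_estimate
print(f"Observed outcome:        {observed_outcome:.1f}")
print(f"Counterfactual estimate: {counterfactual_estimate:.1f}")
print(f"Estimated net effect:    {net_effect:.1f}")
```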
Determining when an impact assessment is appropriate and what evaluation design
to use presents considerable challenges to the evaluator. Evaluation sponsors often
believe that they need an impact evaluation and, indeed, it is the only way to determine
if the program is having the intended effects. However, an impact assessment is
characteristically very demanding of expertise, time, and resources and is often difficult
to set up properly within the constraints of routine program operation. If the need for
outcome information is sufficient to justify an impact assessment, there is still a question
of whether the program circumstances are suitable for conducting such an evaluation.
For instance, it makes little sense to establish the impact of a program that is not well
structured or cannot be adequately described. Impact assessment, therefore, is most
appropriate for mature, stable programs with a well-defined program model and a clear
use for the results that justifies the effort required. Chapters 7-10 discuss impact
assessment and the various ways in which it can be designed and conducted.
EXHIBIT 2-M
No Impact on Garbage
The impact assessment was conducted by obtaining records of the daily volume of
garbage for Nei-fu and the similar, adjacent suburb of Nan-kan for a period
beginning four months prior to the program onset and continuing four months after.
Analysis showed no reduction in the volume of garbage collected in Nei-fu during
the program period relative to the preprogram volume or that in the comparison
community. The evidence indicated that residents simply saved their customary
volume of Tuesday garbage and disposed of it on Wednesday, with no carryover
effects on the volume for the remainder of each week. Interviews with residents
revealed that the program theory was wrong—they did not report the inconvenience
or unpleasantness expected to be associated with storing garbage in their homes.
SOURCE: Adapted from Huey-Tsyh Chen, Juju C. S. Wang, and Lung-Ho Lin,
“Evaluating the Process and Outcome of a Garbage Reduction Program in Taiwan,”
Evaluation Review, 1997, 21(1):27-42.
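The logic of the design described in Exhibit 2-M can be expressed as a simple difference-in-differences calculation: the change in the program community from the preprogram period to the program period, minus the corresponding change in the comparison community. The sketch below illustrates that calculation with hypothetical daily volumes, not the actual Nei-fu or Nan-kan data.

```python
# Illustrative difference-in-differences sketch for a pre/post,
# program-vs-comparison design. All volumes are hypothetical.

pre_program = 100.0     # avg daily garbage volume, program community, before
post_program = 99.5     # avg daily volume, program community, after
pre_comparison = 98.0   # avg daily volume, comparison community, before
post_comparison = 97.8  # avg daily volume, comparison community, after

change_program = post_program - pre_program
change_comparison = post_comparison - pre_comparison

# Program effect estimate: change in the program community beyond the change
# observed in the comparison community over the same period.
effect_estimate = change_program - change_comparison
print(f"Estimated program effect on daily volume: {effect_estimate:+.2f}")
# A value near zero, as in the actual study, indicates no reduction in garbage.
```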
Efficiency Assessment
Finding that a program has positive effects on the target problem is often insufficient
for assessing its social value. Resources for social programs are limited so their
accomplishments must also be judged against their costs. Some effective programs may
not be attractive because their costs are high relative to their impact in comparison to
other program alternatives (Exhibit 2-N presents an example).
EXHIBIT 2-N
The Cost-Effectiveness of Community Treatment for Persons With Mental Disabilities
If provided with supportive services, persons with mental disabilities can often be
maintained in community settings rather than state mental hospitals. But is such
community treatment more costly than residential hospital care? A team of
researchers in Ohio compared the costs of a community program that provides
housing subsidies and case management for state-certified severely mentally
disabled clients with the costs of residential patients at the regional psychiatric
hospital. Program clients were interviewed monthly for more than two years to
determine their consumption of mental health services, medical and dental services,
housing services, and other personal consumption. Information on the cost of those
services was obtained from the respective service providers and combined with the
direct cost of the community program itself. Costs for wards where patients resided
90 or more days were gathered from the Ohio Department of Mental Health budget
data and subdivided into categories that corresponded as closely as possible to
those tabulated for the community program participants. Mental health care
comprised the largest component of service cost for both program and hospital
clients. Overall, however, the total cost for all services was estimated at $1,730 per
month for the most intensive version of community program services and about
$6,250 per month for residential hospital care. Community care, therefore, was
much less costly than hospital care, not more costly.
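The core comparison in an efficiency assessment like the one in Exhibit 2-N is straightforward once per-client cost estimates have been assembled. The sketch below uses the two monthly figures cited above; the per-client saving and cost ratio it prints are illustrative summaries of that comparison, not additional findings from the study.

```python
# Sketch of the cost comparison in Exhibit 2-N, using the monthly per-client
# estimates cited in the text. The derived quantities are illustrative only.

community_cost_per_month = 1730  # most intensive community program services
hospital_cost_per_month = 6250   # residential hospital care

monthly_saving = hospital_cost_per_month - community_cost_per_month
ratio = hospital_cost_per_month / community_cost_per_month

print(f"Monthly saving per client: ${monthly_saving:,}")
print(f"Hospital care costs about {ratio:.1f} times as much as community care")
```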
Key aspects of the evaluation plan that must be tailored include the questions the
evaluation is to answer, the methods and procedures to be used in answering those
questions, and the nature of the evaluator-stakeholder relationship.
Three principal features of the evaluation context must be taken into account in an
evaluation plan: the purpose of the evaluation, the structure and circumstances of the
program being evaluated, and the resources available for the evaluation.
The overall purpose of the evaluation necessarily shapes its focus, scope, and
construction. Evaluation is generally intended to provide feedback to program managers
and sponsors, establish accountability to decisionmakers, or contribute to knowledge
about social intervention.
An often neglected but critical aspect of an evaluation plan involves spelling out
the appropriate relationship between the evaluator and the evaluation sponsor and other
major stakeholders. The three major types of evaluator-stakeholder relationships are (1)
independent evaluation, in which the evaluator takes primary responsibility for
designing and conducting the evaluation; (2) participatory or collaborative evaluation,
in which the evaluation is conducted as a team project involving stakeholders; and (3)
empowerment evaluation, in which the evaluation is designed to help develop the
capabilities of the participating stakeholders in ways that enhance their skills or
influence over the program.
KEY CONCEPTS
Assessment of program process
An evaluative study that answers questions about program operations, implementation,
and service delivery. Also known as a process evaluation or an implementation
assessment.
Cost-benefit analysis
Analytical procedure for determining the economic efficiency of a program, expressed
as the relationship between costs and outcomes, usually measured in monetary terms.
Cost-effectiveness analysis
Analytical procedure for determining the efficacy of a program in achieving given
intervention outcomes in relation to the program costs.
Efficiency assessment
An evaluative study that answers questions about program costs in comparison to either
the monetary value of its benefits or its effectiveness in terms of the changes brought
about in the social conditions it addresses.
Empowerment evaluation
A participatory or collaborative evaluation in which the evaluator’s role includes
consultation and facilitation directed toward the development of the capabilities of the
participating stakeholders to conduct evaluation on their own, to use it effectively for
advocacy and change, and to have some influence on a program that affects their lives.
Evaluation questions
A set of questions developed by the evaluator, evaluation sponsor, and other
stakeholders; the questions define the issues the evaluation will investigate and are
stated in terms such that they can be answered using methods available to the evaluator
in a way useful to stakeholders.
Formative evaluation
Evaluative activities undertaken to furnish information that will guide program
improvement.
Impact assessment
An evaluative study that answers questions about program outcomes and impact on the
social conditions it is intended to ameliorate. Also known as an impact evaluation or an
outcome evaluation.
Independent evaluation
An evaluation in which the evaluator has the primary responsibility for developing the
evaluation plan, conducting the evaluation, and disseminating the results.
Needs assessment
An evaluative study that answers questions about the social conditions a program is
intended to address and the need for the program.
Process evaluation
A form of program monitoring designed to determine whether the program is delivered
as intended to the target recipients. Also known as implementation assessment.
Program monitoring
The systematic documentation of aspects of program performance that are indicative of
whether the program is functioning as intended or according to some appropriate
standard. Monitoring generally involves program performance related to program
process, program outcomes, or both.
Program theory
The set of assumptions about the manner in which a program relates to the social
benefits it is expected to produce and the strategy and tactics the program has adopted to
achieve its goals and objectives. Within program theory we can distinguish impact
theory, relating to the nature of the change in social conditions brought about by
program action, and process theory, which depicts the program’s organizational plan
and service utilization plan.
Summative evaluation
Evaluative activities undertaken to render a summary judgment on certain critical
aspects of the program’s performance, for instance, to determine if specific goals and
objectives were met.
Target
The unit (individual, family, community, etc.) to which a program intervention is
directed. All such units within the area served by a program comprise its target
population.
Chapter Outline
What Makes a Good Evaluation Question?
Dimensions of Program Performance
Evaluation Questions Must Be Reasonable and Appropriate
Evaluation Questions Must Be Answerable
Criteria for Program Performance
Typical Evaluation Questions
The Evaluation Hierarchy
Determining the Specific Questions the Evaluation Should Answer
Representing the Concerns of the Evaluation Sponsor and Major Stakeholders
Obtaining Input From Stakeholders
Topics for Discussion With Stakeholders
Analysis of Program Assumptions and Theory
Collating Evaluation Questions and Setting Priorities
The previous chapter presented an overview of the many considerations that go into
tailoring an evaluation. Although all those matters are important to evaluation
design, the essence of evaluation is generating credible answers to questions about
the performance of a social program. Good evaluation questions must address issues
that are meaningful in relation to the nature of the program and also of concern to
key stakeholders. They must be answerable with the research techniques available to
the evaluator and formulated so that the criteria by which the corresponding
program performance will be judged are explicit or can be determined in a
straightforward way.
A set of carefully crafted evaluation questions, therefore, is the hub around which
evaluation revolves. It follows that a careful, explicit formulation of those questions
greatly facilitates the design of the evaluation and the use of its findings. Evaluation
questions may take various forms, some of which are more useful and meaningful
than others for stakeholders and program decisionmakers. Furthermore, some forms
of evaluation questions are more amenable to the evaluator’s task of providing
credible answers, and some address critical program effectiveness issues more
directly than others.
This chapter discusses practical ways in which evaluators can fashion effective
evaluation questions. An essential step is identification of the decisionmakers who
will use the evaluation results, what information they need, and how they expect to
use it. The evaluator’s own analysis of the program is also important. One approach
that is particularly useful for this purpose is articulation of the program theory, a
detailed account of how and why the program is supposed to work. Consideration of
program theory focuses attention on critical events and premises that may be
appropriate topics of inquiry in the evaluation.
EXHIBIT 3-A
What It Means to Evaluate Something
There are different kinds of inquiry across practice areas, such as that which is
found in law, medicine, and science. Common to each kind of inquiry is a general
pattern of reasoning or basic logic that guides and informs the practice… .
Evaluation is one kind of inquiry, and it, too, has a basic logic or general pattern of
reasoning [that has been put forth by Michael Scriven]… . This general logic of
evaluation is as follows:
Good evaluation questions must first of all be reasonable and appropriate. That is,
they must identify performance dimensions that are relevant to the expectations
stakeholders hold for the program and that represent domains in which the program can
realistically hope to have accomplishments. It would hardly be fair or sensible, for
instance, to ask if a low-income housing weatherization program reduced the prevalence
of drug dealing in a neighborhood. Nor would it generally be useful to ask a question as
narrow as whether the program got a bargain in its purchase of file cabinets for its
office. Furthermore, evaluation questions must be answerable; that is, they must involve
performance dimensions that are sufficiently specific, concrete, practical, and
measurable that meaningful information can be obtained about their status. An evaluator
would have great difficulty determining whether an adult literacy program improved a
community’s competitiveness in the global economy or whether the counselors in a drug
prevention program were sufficiently caring in their relations with clients.
Evaluation Questions Must Be Reasonable and Appropriate
Program advocates often proclaim grandiose goals (e.g., improve the quality of life
for children), expect unrealistically large effects, or believe the program to have
accomplishments that are clearly beyond its actual capabilities. Good evaluation
questions deal with performance dimensions that are appropriate and realistic for the
program. This means that the evaluator must often work with relevant stakeholders to
scale down and focus the evaluation questions. The manager of a community health
program, for instance, might initially ask, “Are our education and outreach services
successful in informing the public about the risk of AIDS?” In practice, however, those
services may consist of little more than occasional presentations by program staff at
civic club meetings and health fairs. With this rather modest level of activity, it may be
unrealistic to expect the public at large to receive much AIDS information. If a question
about this service is deemed important for the evaluation, a better version might be
something such as “Do our education and outreach services raise awareness of AIDS
issues among the audiences addressed?” and “Do those audiences represent community
leaders who are likely to influence the level of awareness of AIDS issues among other
people?”
There are two complementary ways for an evaluator, in collaboration with pertinent
stakeholders, to assess how appropriate and realistic a candidate evaluation question is.
The first is to examine the question in the context of the actual program activities related
to it. In the example above, for instance, the low-key nature of the education and
outreach services was clearly not up to the task of “informing the public about the risk
of AIDS,” and there would be little point in having the evaluation attempt to determine if
this was the actual outcome. The evaluator and relevant stakeholders should identify and
scrutinize the program components, activities, and personnel assignments that relate to
program performance and formulate the evaluation question in a way that is reasonable
given those characteristics.
The second way to assess whether candidate evaluation questions are reasonable
and appropriate is to analyze them in relationship to the findings reported in applicable
social science and social service literature. For instance, the sponsor of an evaluation
of a program for juvenile delinquents might initially ask if the program increases the
self-esteem of the delinquents, in the belief that inadequate self-esteem is a problem for
these juveniles and improvements will lead to better behavior. Examination of the
applicable social science research, however, will reveal that juvenile delinquents do
not generally have problems with self-esteem and, moreover, that increases in self-
esteem are not generally associated with reductions in delinquency. In light of this
information, the evaluator and the evaluation sponsor may well agree that the question
of the program’s impact on self-esteem is not appropriate.
The foundation for formulating appropriate and realistic evaluation questions is
detailed and complete program description. Early in the process, the evaluator should
become thoroughly acquainted with the program—how it is structured, what activities
take place, the roles and tasks of the various personnel, the nature of the participants,
and the assumptions inherent in its principal functions. The stakeholder groups with
whom the evaluator collaborates (especially program managers and staff) will also
have knowledge about the program, of course. Evaluation questions that are inspired by
close consideration of actual program activities and assumptions will almost
automatically be appropriate and realistic.
Evaluation Questions Must Be Answerable
It is obvious that the evaluation questions around which an evaluation plan is
developed should be answerable. Questions that cannot be answered may be intriguing
to philosophers but do not serve the needs of evaluators and the decisionmakers who
intend to use the evaluation results. What is not so obvious, perhaps, is how easy it is to
formulate an unanswerable evaluation question without realizing it. This may occur
because the terms used in the question, although seemingly commonsensical, are actually
ambiguous or vague when the time comes for a concrete interpretation (“Does this
program enhance family values?”). Or sensible-sounding questions may invoke issues
for which there are so few observable indicators that little can be learned about them
(“Are the case managers sensitive to the social circumstances of their clients?”). Also,
some questions lack sufficient indication of the relevant criteria to permit a meaningful
answer (“Is this program successful?”). Finally, some questions may be answerable
only with more expertise, data, time, or resources than are available to the evaluation
(“Do the prenatal services this program provides to high-risk women increase the
chances that their children will complete college?”).
For an evaluation question to be answerable, it must be possible to identify some
evidence or “observables” that can realistically be obtained and that will be credible as
the basis for an answer. This generally means developing questions that involve
measurable performance dimensions stated in terms that have unambiguous and
noncontroversial definitions. In addition, the relevant standards or criteria must be
specified with equal clarity. Suppose, for instance, that a proposed evaluation question
for a compensatory education program like Head Start is, “Are we reaching the children
most in need of this program?” To affirm that this is an answerable question, the
evaluator should be able to do the following:
1. Define the group of children at issue (e.g., those in census tract such and such, four
or five years old, living in households with annual income under 150% of the
federal poverty level).
2. Identify the specific measurable characteristics and cutoff values that represent the
greatest need (e.g., annual income below the federal poverty level, single parent in
the household with educational attainment of less than high school).
3. Give an example of the evaluation finding that might result (e.g., 60% of the
children currently served fall in the high-need category; 75% of the high-need
children in the catchment area—the geographic area being served by the program
— are not enrolled in the program).
4. Stipulate the evaluative criteria (e.g., to be satisfactory, at least 90% of the
children in the program should be high need and at least 50% of the high-need
children in the catchment area should be in the program).
5. Have the evaluation sponsors and other pertinent stakeholders (who should be
involved in the whole process) agree that a finding meeting these criteria would,
indeed, answer the question.
If such conditions can be met and, in addition, the resources are available to collect,
analyze, and report the applicable data, then the evaluation question can be considered
answerable.
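As an illustration of how such criteria translate into a concrete check once the data are collected, the following sketch applies the two criterion levels from the example above (at least 90% high-need among enrolled children and at least 50% of high-need children in the catchment area served) to hypothetical enrollment counts.

```python
# Sketch of checking the evaluative criteria in the Head Start example.
# The counts are hypothetical; the 90% and 50% criterion levels come from
# the example in the text.

enrolled_children = 200        # children currently served (hypothetical)
enrolled_high_need = 120       # of those, how many meet the high-need definition
high_need_in_catchment = 480   # all high-need children in the catchment area

pct_enrolled_high_need = enrolled_high_need / enrolled_children * 100
pct_high_need_served = enrolled_high_need / high_need_in_catchment * 100

meets_criteria = pct_enrolled_high_need >= 90 and pct_high_need_served >= 50
print(f"{pct_enrolled_high_need:.0f}% of enrolled children are high need (criterion: 90%)")
print(f"{pct_high_need_served:.0f}% of high-need children in the catchment are served (criterion: 50%)")
print("Criteria met" if meets_criteria else "Criteria not met")
```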
EXHIBIT 3-B
Many Criteria May Be Relevant to Program Performance
Some program objectives, on the other hand, may be very specific. These often
come in the form of administrative objectives adopted as targets for routine program
functions. The target levels may be set according to past experience, the experience of
comparable programs, a judgment of what is reasonable and desirable, or maybe only a
“best guess.” Examples of administrative objectives may be to complete intake actions
for 90% of the referrals within 30 days, to have 75% of the clients complete the full
term of service, to have 85% “good” or “outstanding” ratings on a client satisfaction
questionnaire, to provide at least three appropriate services to each person under case
management, and the like. There is typically a certain amount of arbitrariness in these
criterion levels. But, if they are administratively stipulated or can be established
through stakeholder consensus, and if they are reasonable, they are quite serviceable in
the formulation of evaluation questions and the interpretation of the subsequent findings.
However, it is not generally wise for the evaluator to press for specific statements of
target performance levels if the program does not have them or cannot readily and
confidently develop them. Setting such targets with little justification only creates a
situation in which they are arbitrarily revised when the evaluation results are in.
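When administrative targets of this kind do exist, checking monitoring data against them is a mechanical matter. The sketch below compares hypothetical observed values with the sorts of stipulated target levels mentioned above; both the targets and the observed values are illustrative, not prescriptive.

```python
# Minimal sketch: comparing observed program indicators with administratively
# stipulated targets of the kind described in the text. All values hypothetical.

targets = {
    "intake_within_30_days_pct": 90,
    "completed_full_service_pct": 75,
    "satisfaction_good_or_better_pct": 85,
}

observed = {
    "intake_within_30_days_pct": 93,
    "completed_full_service_pct": 68,
    "satisfaction_good_or_better_pct": 88,
}

for indicator, target in targets.items():
    value = observed[indicator]
    status = "meets target" if value >= target else "below target"
    print(f"{indicator}: {value}% (target {target}%) -> {status}")
```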
In some instances, there are established professional standards that can be invoked
as performance criteria. This is particularly likely in medical and health programs,
where practice guidelines and managed care standards have developed that may be
relevant for setting desirable performance levels. Much more common, however, is the
situation where there are no established criteria or even arbitrary administrative
objectives to invoke. A typical situation is one in which the performance dimension is
clearly recognized but there is ambiguity about the criteria for good performance on that
dimension. For instance, stakeholders may agree that the program should have a low
drop-out rate, a high proportion of clients completing service, a high level of client
satisfaction, and the like, but only nebulous ideas as to what level constitutes “low” or
“high.” Sometimes the evaluator can make use of prior experience or find information in
the evaluation and program literature that provides a reasonable basis for setting a
criterion level. Another approach is to collect judgment ratings from relevant
stakeholders to establish the criterion levels or, perhaps, criterion ranges, that can be
accepted to distinguish, say, high, medium, and low performance.
Establishing a criterion level can be particularly difficult when the performance
dimension in an evaluation question involves outcome or impact issues. Program
stakeholders and evaluators alike may have little idea about how much change on a
given outcome variable (e.g., a scale of attitude toward drug use) is large and how much
is small. By default, these judgments are often made on the basis of statistical criteria;
for example, a program is judged to be effective solely because the measured effects are
statistically significant. This is a poor practice for reasons that will be more fully
examined when impact evaluation is discussed later in this volume. Statistical criteria
have no intrinsic relationship to the practical significance of a change on an outcome
dimension and can be misleading. A juvenile delinquency program that is found to have
the statistically significant effect of lowering delinquency recidivism by 2% may not
make a large enough difference to be worth continuing. Thus, as much as possible,
the evaluator should use the techniques suggested above to attempt to determine and
specify in practical terms what “success” level is appropriate for judging the nature and
magnitude of the program’s effects.
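A small worked example shows why statistical significance is a weak criterion on its own. In the sketch below, a hypothetical 2-percentage-point reduction in recidivism is statistically significant simply because the samples are large, even though stakeholders might judge the change too small to matter. The sample sizes and rates are assumptions chosen for illustration.

```python
# Illustration of statistical vs. practical significance using a
# two-proportion z-test computed from the standard formula.
# Sample sizes and recidivism rates are hypothetical.

from math import sqrt

n_program, n_control = 5000, 5000
p_program, p_control = 0.38, 0.40   # a 2-percentage-point reduction

p_pooled = (p_program * n_program + p_control * n_control) / (n_program + n_control)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_program + 1 / n_control))
z = (p_control - p_program) / se

print(f"Difference: {100 * (p_control - p_program):.1f} percentage points")
print(f"z = {z:.2f} (|z| above roughly 1.96 corresponds to p < .05, two-sided)")
# Statistical significance says the difference is unlikely to be chance alone;
# it says nothing about whether a 2-point drop is practically worthwhile.
```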
The relationships among the different types of evaluation questions just described
define a hierarchy of evaluation issues that has implications beyond simply organizing
categories of questions. As mentioned in the overview of kinds of evaluation studies in
Chapter 2, the types of evaluation questions and the methods for answering them are
sufficiently distinct that each constitutes a form of evaluation in its own right. These
groupings of evaluation questions and methods constitute the building blocks of
evaluation research and, individually or in combination, are recognizable in virtually all
evaluation studies. As shown in Exhibit 3-C, we can think of these evaluation building
blocks in the form of a hierarchy in which each rests on those below it.
EXHIBIT 3-C
The Evaluation Hierarchy
Determining the Specific Questions the Evaluation Should Answer
To develop an appropriate evaluation plan, the many possible questions that might be asked
about the program must be narrowed down to those specific ones that are most relevant
to the program’s circumstances. (See Exhibit 3-D for an example of specific evaluation
questions for an actual program.)
As we emphasized in Chapter 2, the concerns of the evaluation sponsor and other
major stakeholders should play a central role in shaping the questions that structure the
evaluation plan. In the discussion that follows, therefore, we first examine the matter of
obtaining appropriate input from the evaluation sponsor and relevant stakeholders prior
to and during the design stage of the evaluation.
However, it is rarely appropriate for the evaluator to rely only on input from the
evaluation sponsor and stakeholders to determine the questions on which the evaluation
should focus. Because of their close familiarity with the program, stakeholders may
overlook critical, but relatively routine, aspects of program performance. Also, the
experience and knowledge of the evaluator may yield distinctive insights into program
issues and their interrelations that are important for identifying relevant evaluation
questions. Generally, therefore, it is desirable for the evaluator to make an independent
analysis of the areas of program performance that may be pertinent for investigation.
Accordingly, the second topic addressed in the discussion that follows is how the
evaluator can analyze a program in a way that will uncover potentially important
evaluation questions for consideration in designing the evaluation. The evaluation
hierarchy, described above, is one tool that can be used for this purpose. Especially
useful is the concept of program theory. By depicting the significant assumptions and
expectations on which the program depends for its success, the program theory can
highlight critical issues that the evaluation should look into.
EXHIBIT 3-D
Evaluation Questions for a Neighborhood Afterschool Program
EXHIBIT 3-E
Diverse Stakeholder Perspectives on an Evaluation of a Multiagency Program for the
Homeless
The Joint Program was initiated to improve the accessibility of health and social
services for the homeless population of Montreal through coordinated activities
involving provincial, regional, and municipal authorities and more than 20 nonprofit
and public agencies. The services developed through the program included walk-in
and referral services, mobile drop-in centers, an outreach team in a community
health center, medical and nursing care in shelters, and case management. To ensure
stakeholder participation in the evaluation, an evaluation steering committee was set
up with representatives of the different types of agencies involved in the program
and which, in turn, coordinated with two other stakeholder committees charged with
program responsibilities.
Even though all the stakeholders shared a common cause to which they were firmly
committed—the welfare of the homeless—they had quite varied perspectives on the
evaluation. Some of these were described by the evaluators as follows:
The starting point, of course, is the evaluation sponsors. Those who have
commissioned and funded the evaluation rightfully have priority in defining the issues it
should address. Sometimes evaluation sponsors have stipulated the evaluation questions
and methods completely and want the evaluator only to manage the practical details. In
such circumstances, the evaluator should assess which, if any, stakeholder perspectives
are represented in those stipulations and whether important concerns have been left out.
The major stakeholders, by definition, have a significant interest in the program and
the evaluation. It is thus generally straightforward to identify them and obtain their
views about the issues and questions to which the evaluation should attend. The
evaluation sponsor, program administrators (who may also be the evaluation sponsor),
and intended program beneficiaries are virtually always major stakeholders.
Identification of other important stakeholders can usually be accomplished by analyzing
the network of relationships surrounding a program. The most revealing relationships
involve the flow of money to or from the program, political influence on and by the
program, those whose actions affect or are affected by the program, and the set of direct
interactions between the program and its various boards, patrons, collaborators,
competitors, clients, and the like.
A snowball approach may be helpful in identifying the various stakeholder groups
and persons involved in relationships with the program. As each such representative is
identified and contacted, the evaluator asks for nominations of other persons or groups
with a significant interest in the program. Those representatives, in turn, are asked the
same question. When this process no longer produces important new nominations, the
evaluator can be reasonably assured that all major stakeholders have been identified.
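The snowball procedure is easy to represent as a simple traversal that stops when no new names are nominated. The sketch below uses hypothetical stakeholder groups and nominations purely to illustrate the stopping rule; the function and group names are not drawn from any actual evaluation.

```python
# Minimal sketch of the snowball approach to identifying stakeholders:
# each contact nominates others, and contacting stops when no new names
# appear. The nomination data below are hypothetical.

from collections import deque

nominations = {
    "evaluation sponsor": ["program director", "funder liaison"],
    "program director": ["line staff", "client advisory board"],
    "funder liaison": ["program director"],
    "line staff": ["client advisory board"],
    "client advisory board": [],
}

def snowball(starting_contacts):
    identified = set(starting_contacts)
    to_contact = deque(starting_contacts)
    while to_contact:                      # stop when no new nominations remain
        person = to_contact.popleft()
        for nominee in nominations.get(person, []):
            if nominee not in identified:
                identified.add(nominee)
                to_contact.append(nominee)
    return identified

print(snowball(["evaluation sponsor"]))
```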
If the evaluation is structured as a collaborative or participatory endeavor with
certain stakeholders directly involved in designing and conducting the evaluation (as
described in Chapter 2), the participating stakeholders will, of course, have a firsthand
role in shaping the evaluation questions. Similarly, an internal evaluator who is part of
the organization that administers the program will likely receive forthright counsel from
program personnel. Even when such stakeholder involvement is built into the
evaluation, however, this arrangement is usually not sufficient to represent the full range
of pertinent stakeholder perspectives. There may be important stakeholder groups that
are not involved in the participatory structure but have distinct and significant
perspectives on the program and the evaluation. Moreover, there may be a range of
viewpoints among the members of groups that are represented in the evaluation process
so that a broader sampling of opinion is needed than that brought by the designated
participant on the evaluation team.
Generally, therefore, formulating responsive evaluation questions requires
discussion with members of stakeholder groups who are not directly represented on the
evaluation team. Fewer such contacts may be needed by evaluation teams that represent
many stakeholders and more by those on which few or no stakeholders are represented.
In cases where the evaluation has not initially been organized as a collaborative
endeavor with stakeholders, the evaluator may wish to consider configuring such an
arrangement to ensure that key stakeholders are engaged and their views fully
represented in the design and implementation of the evaluation. Participatory
arrangements might be made through stakeholder advisory boards, steering committees,
or simply regular consultations between the evaluator and key stakeholder
representatives. More information about the procedures and benefits of such approaches
can be found in Fetterman, Kaftarian, and Wandersman (1996), Greene (1988), Mark
and Shotland (1985), and Patton (1997).
Outside of organized arrangements, evaluators generally obtain stakeholder views
about the important evaluation issues through interviews. Because early contacts with
stakeholders are primarily for orientation and reconnaissance, interviews at this stage
typically are unstructured or, perhaps, semistructured around a small set of themes of
interest to the evaluator. Input from individuals representing stakeholder groups might
also be obtained through focus groups (Krueger, 1988). Focus groups have the
advantages of efficiency in getting information from a number of people and the
facilitative effect of group interaction in stimulating ideas and observations. They also
may have some disadvantages for this purpose, notably the potential for conflict in
politically volatile situations and the lack of confidentiality in group settings. In some
cases, stakeholders may speak more frankly about the program and the evaluation in
one-on-one conversations with the evaluator.
The evaluator will rarely be able to obtain input from every member of every
stakeholder group, nor will that ordinarily be necessary to identify the major issues and
questions with which the evaluation should be concerned. A modest number of carefully
selected stakeholder informants who are representative of significant groups or
distinctly positioned in relation to the program is usually sufficient to identify the
principal issues. When the evaluator no longer hears new themes in discussions with
diverse stakeholders, the most significant prevailing issues have probably all been
discovered.
Topics for Discussion With Stakeholders
The issues identified by the evaluation sponsor when the evaluation is requested
usually need further discussion with the sponsor and other stakeholders to clarify what
these issues mean to the various parties and what sort of information would usefully
bear on them. The topics that should be addressed in these discussions will depend in
large part on the particulars of the evaluation situation. Here we will review some of
the general topics that are often relevant.
Why is an evaluation needed? It is usually worthwhile for the evaluator to probe the
reasons an evaluation is desired. The evaluation may be motivated by an external
requirement, in which case it is important to know the nature of that requirement and
what use is to be made of the results. The evaluation may be desired by program
managers to determine whether the program is effective, to find ways to improve it, or
to “prove” its value to potential funders, donors, critics, or the like. Sometimes the
evaluation is politically motivated, for example, as a stalling tactic to avoid some
unpleasant decision. Whatever the reasons, they provide an important starting point for
determining what questions will be most important for the evaluation to answer and for
whom.
What are the program goals and objectives? Inevitably, whether a program achieves
certain of the goals and objectives ascribed to it will be pivotal questions for the
evaluation to answer. The distinction between goals and objectives is critical. A
program goal relates to the overall mission of the program and typically is stated in
broad and rather abstract terms. For example, a program for the homeless may have as
its goal “the reduction of homelessness” in its urban catchment area. Although easily
understood, such a goal is too vague to determine whether it has been met. Is a
“reduction of homelessness” 5%, 10%, or 100%? Does “homelessness” mean only those
living on the streets or does it include those in shelters or temporary housing? For
evaluation purposes, broad goals must be translated into concrete statements that specify
the condition to be dealt with together with one or more measurable criteria of success.
Evaluators generally refer to specific statements of measurable attainments as program
objectives. Related sets of objectives identify the particular accomplishments presumed
necessary to attain the program goals. Exhibit 3-F presents helpful suggestions for
specifying objectives.
An important task for the evaluator, therefore, is to collaborate with relevant
stakeholders to identify the program goals and transform overly broad, ambiguous, or
idealized representations of them into clear, explicit, concrete statements of objectives.
The more closely the objectives describe situations that can be directly and reliably
observed, the more likely it is that a meaningful evaluation will result. Furthermore, it is
essential that the evaluator and stakeholders achieve a workable agreement on which
program objectives are most central to the evaluation and the criteria to be used in
assessing whether those objectives have been met. For instance, if one stated objective
of a job training program is to maintain a low drop-out rate, the key stakeholders should
agree to its importance before it is accepted as one of the focal issues around which the
evaluation will be designed.
If consensus about the important objectives is not attained, one solution is to include
all those put forward by the various stakeholders and, perhaps, additional objectives
drawn from current viewpoints and theories in the relevant substantive field (Chen,
1990). For example, the sponsors of a job training program may be interested solely in
the frequency and duration of postprogram employment. But the evaluator may propose
that stability of living arrangements, competence in handling finances, and efforts to
obtain additional education be considered as program outcomes because these lifestyle
features also may undergo positive change with increased employment and job-related
skills.
What are the most important questions for the evaluation to answer? We echo
Patton’s (1997) view that priority should be given to evaluation questions that will yield
information most likely to be used. Evaluation results are rarely intended by evaluators
or evaluation sponsors to be “knowledge for knowledge’s sake.” Rather, they are
intended to be useful, and to be used, by those with responsibility for making decisions
about the program, whether at the day-to-day management level or at broader funding or
policy levels (see Exhibit 3-G for an evaluation manager’s view of this process).
EXHIBIT 3-F
Specifying Objectives
A second suggestion for writing a clear objective is to state only a single aim or
purpose. Most programs will have multiple objectives, but within each objective
only a single purpose should be delineated. An objective that states two or more
purposes or desired outcomes may require different implementation and assessment
strategies, making achievement of the objective difficult to determine. For example,
the statement “to begin three prenatal classes for pregnant women and provide
outreach transportation services to accommodate twenty-five women per class”
creates difficulties. This objective contains two aims—to provide prenatal classes
and to provide outreach services. If one aim is accomplished but not the other, to
what extent has the objective been met?
A clearly written objective must have both a single aim and a single end-product or
result. For example, the statement “to establish communication with the Health
Systems Agency” indicates the aim but not the desired end-product or result. What
constitutes evidence of communication—telephone calls, meetings, reports? Failure
to specify a clear end-product makes it difficult for assessment to take place.
Those involved in writing and evaluating objectives need to keep two questions in
mind. First, would anyone reading the objective find the same purpose as the one
intended? Second, what visible, measurable, or tangible results are present as
evidence that the objective has been met? Purpose or aim describes what will be
done; end-product or result describes evidence that will exist when it has been
done. This is assurance that you “know one when you see one.”
Finally, it is useful to specify the time of expected achievement of the objective. The
statement “to establish a walk-in clinic as soon as possible” is not a useful
objective because of the vagueness of “as soon as possible.” It is far more useful to
specify a target date, or in cases where some uncertainty exists about some specific
date, a range of target dates—for example, “sometime between March 1 and March
30.”
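These suggestions can be restated in a more structured form. The brief Python sketch below records an objective as a single aim, a measurable end-product, and a target date range, and checks that all three elements are present; the class, its field names, and the example objective are illustrative assumptions rather than material drawn from Exhibit 3-F.

# A minimal sketch, not drawn from the text: recording an objective with a single aim,
# a measurable end-product, and a target date range, per the suggestions above. The
# class, field names, and example values are illustrative assumptions.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ProgramObjective:
    aim: str                      # the single purpose to be accomplished
    end_product: str              # visible, measurable evidence of accomplishment
    target_start: Optional[date]  # earliest date in the target range
    target_end: Optional[date]    # latest acceptable completion date

    def is_well_specified(self) -> bool:
        # The three properties suggested above: a stated aim, a stated end-product,
        # and a bounded target date.
        return bool(self.aim) and bool(self.end_product) and self.target_end is not None

# Example: the prenatal-class objective restated with one aim and one end-product.
objective = ProgramObjective(
    aim="Provide prenatal classes for pregnant women",
    end_product="Three classes running with 25 women enrolled in each",
    target_start=date(2003, 3, 1),
    target_end=date(2003, 3, 30),
)
print(objective.is_well_specified())  # True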
1. The utilization of evaluation or research does not take care of itself. Evaluation
reports are inanimate objects, and it takes human interest and personal action to
use and implement evaluation findings and recommendations. The implications
of evaluation must be transferred from the written page to the agenda of
program managers.
2. Utilization of evaluation, through which program lessons are identified, usually
demands changed behaviors or policies. This requires the shifting of priorities
and the development of new action plans for the operational manager.
3. Utilization of evaluation research involves political activity. It is based on a
recognition and focus on who in the organization has what authority to make x,
y, or z happen. To change programs or organizations as a result of some
evaluation requires support from the highest levels of management.
4. Ongoing systems to engender evaluation use are necessary to legitimate and
formalize the organizational learning process. Otherwise, utilization can
become a personalized issue and evaluation advocates just another self-serving
group vying for power and control.
In each case, the evaluator should work with the respective evaluation users to
describe the range of potential decisions or actions that they might consider taking and
the form and nature of information that they would find pertinent in their deliberation. To
press this exercise to the greatest level of specificity, the evaluator might even generate
dummy information of the sort that the evaluation will produce (e.g., “20% of the clients
who complete the program relapse within 30 days”) and discuss with the prospective
users what this would mean to them and how they would use such information.
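The following short Python fragment illustrates one way to produce such dummy information for discussion with prospective users. The outcome labels and the generated percentages are invented placeholders, not findings from any evaluation.

# A minimal sketch of generating dummy findings for discussion; the outcome labels and
# percentages are invented placeholders, not evaluation results.
import random

random.seed(1)  # fixed seed so the dummy figures are reproducible in later discussions

outcomes = [
    "relapse within 30 days of completing the program",
    "complete the full course of services",
    "re-enroll in an education program",
]
dummy_findings = {outcome: round(random.uniform(10, 60)) for outcome in outcomes}

for outcome, pct in dummy_findings.items():
    print(f"{pct}% of clients {outcome} (dummy figure for discussion only)")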
A careful specification of the intended use of the evaluation results and the nature of
the information expected to be useful leads directly to the formulation of questions the
evaluation must attempt to answer (e.g., “What proportion of the clients who complete
the program relapse within 30 days?”).
SOURCE: Adapted from United Way of America Task Force on Impact, Measuring
Program Outcomes: A Practical Approach (Alexandria, VA: Author, 1996), p. 42. Used
by permission, United Way of America.
Good evaluation questions must be reasonable and appropriate, and they must be
answerable. That is, they must identify clear, observable dimensions of program
performance that are relevant to the program’s goals and represent domains in which the
program can realistically be expected to have accomplishments.
To ensure that the matters of greatest significance are covered in the evaluation
design, the evaluation questions are best formulated through interaction and negotiation
with the evaluation sponsors and other stakeholders representative of significant groups
or distinctly positioned in relation to program decision making.
One useful way to reveal aspects of program performance that may be important
is to make the program theory explicit. Program theory describes the assumptions
inherent in a program about the activities it undertakes and how those relate to the social
benefits it is expected to produce. Critical analysis of program theory can surface
important evaluation questions that might otherwise have been overlooked.
When these various procedures have generated a full set of candidate evaluation
questions, the evaluator must organize them into related clusters and draw on input
from the evaluation sponsor and other stakeholders to set priorities among them.
KEY CONCEPTS
Catchment area
The geographic area served by a program.
Implementation failure
The program does not adequately perform the activities specified in the program design
that are assumed to be necessary for bringing about the intended social improvements. It
includes situations in which no service, not enough service, or the wrong service is
delivered, or the service varies excessively across the target population.
Performance criterion
The standard against which a dimension of program performance is compared so that it
can be evaluated.
Program goal
A statement, usually general and abstract, of a desired state toward which a program is
directed. Compare program objectives.
Program objectives
Specific statements detailing the desired accomplishments of a program together with
one or more measurable criteria of success.
Theory failure
The program is implemented as planned but its services do not produce the immediate
effects on the participants that are expected or the ultimate social benefits that are
intended, or both.
Chapter Outline
The Role of Evaluators in Diagnosing Social Conditions and Service Needs
Defining the Problem to Be Addressed
Specifying the Extent of the Problem: When, Where, and How Big?
Using Existing Data Sources to Develop Estimates
Using Social Indicators to Identify Trends
Estimating Problem Parameters Through Social Research
Agency Records
Surveys and Censuses
Key Informant Surveys
Forecasting Needs
Defining and Identifying the Targets of Interventions
What Is a Target?
Direct and Indirect Targets
Specifying Targets
Target Boundaries
Varying Perspectives on Target Specification
Describing Target Populations
Risk, Need, and Demand
Incidence and Prevalence
EXHIBIT 4-A
Steps in Analyzing Need
1. Identification of users and uses. The users of the analysis are those who will
act on the basis of the results and the audiences who may be affected by it. The
involvement of both groups will usually facilitate the analysis and
implementation of its recommendations. Knowing the uses of the need
assessment helps the researcher focus on the problems and solutions that can be
entertained, but also may limit the problems and solutions identified in Step 3,
below.
2. Description of the target population and service environment. Geographic
dispersion, transportation, demographic characteristics (including strengths) of
the target population, eligibility restrictions, and service capacity are
important. Social indicators are often used to describe the target population
either directly or by projection. Resource inventories detailing services
available can identify gaps in services and complementary and competing
programs. Comparison of those who use services with the target population can
reveal unmet needs or barriers to solution implementation.
3. Need identification. Here problems of the target population(s) and possible
solutions are described. Usually, more than one source of information is used.
Identification should include information on expectations for outcomes; on
current outcomes; and on the efficacy, feasibility, and utilization of solutions.
Social indicators, surveys, community forums, and direct observation are
frequently used.
4. Need assessment. Once problems and solutions have been identified, this
information is integrated to produce recommendations for action. Both
quantitative and qualitative integration algorithms can be used. The more
explicit and open the process, the greater the likelihood that results will be
accepted and implemented.
5. Communication. Finally, the results of the need analysis must be communicated
to decisionmakers, users, and other relevant audiences. The effort that goes into
this communication should equal that given the other steps of the need analysis.
SOURCE: Adapted from Jack McKillip, “Need Analysis: Process and Techniques,” in
Handbook of Applied Social Research Methods, eds. L. Bickman and D. J. Rog
(Thousand Oaks, CA: Sage, 1998), pp. 261-284.
It is generally agreed, for example, that poverty is a social problem. The observable
facts are the statistics on the distribution of income and assets. However, those statistics
do not define poverty; they merely permit one to determine how many are poor when a
definition is given. Nor do they establish poverty as a social problem; they only
characterize a situation that individuals and social agents may view as problematic.
Moreover, both the definition of poverty and the goals of programs to improve the lot of
the poor can vary over time, between communities, and among stakeholders. Initiatives
to reduce poverty, therefore, may range widely—for example, from increasing
employment opportunities to simply lowering the expectations of persons with low
income.
Defining a social problem and specifying the goals of intervention are thus
ultimately political processes that do not follow automatically from the inherent
characteristics of the situation. This circumstance is illustrated nicely in an analysis of
legislation designed to reduce adolescent pregnancy that was conducted by the U.S.
General Accounting Office (GAO, 1986). The GAO found that none of the pending
legislative proposals defined the problem as involving the fathers of the children in
question; every one addressed adolescent pregnancy as an issue of young mothers.
Although this view of adolescent pregnancy may lead to effective programs, it
nonetheless clearly represents arguable assumptions about the nature of the problem and
how a solution should be approached.
The social definition of a problem is so central to the political response that the
preamble to proposed legislation usually shows some effort to specify the conditions for
which the proposal is designed as a remedy. For example, two contending legislative
proposals may both be addressed to the problem of homelessness, but one may identify
the homeless as needy persons who have no kin on whom to be dependent, whereas the
other defines homelessness as the lack of access to conventional shelter. The first
definition centers attention primarily on the social isolation of potential clients; the
second focuses on housing arrangements. The ameliorative actions that are justified in
terms of these definitions will likely be different as well. The first definition, for
instance, would support programs that attempt to reconcile homeless persons with
alienated relatives; the second, subsidized housing programs.
It is usually informative, therefore, for an evaluator to determine what the major
political actors think the problem is. The evaluator might, for instance, study the
definitions given in policy and program proposals or in enabling legislation. Such
information may also be found in legislative proceedings, program documents,
newspaper and magazine articles, and other sources in which discussions of the
problem or the program appear. Such materials may explicitly describe the nature of the
problem and the program’s plan of attack, as in funding proposals, or implicitly define
the problem through the assumptions that underlie statements about program activities,
successes, and plans.
This inquiry will almost certainly turn up information useful for a preliminary
description of the social need to which the program is presumably designed to respond.
As such, it can guide a more probing needs assessment, both with regard to how the
problem is defined and what alternative perspectives might be applicable.
An important role evaluators may play at this stage is to provide policymakers and
program managers with a critique of the problem definition inherent in their policies
and programs and propose alternative definitions that may be more serviceable. For
example, evaluators could point out that a definition of the problem of teenage
pregnancies as primarily one of illegitimate births ignores the large number of births
that occur to married teenagers and suggest program implications that follow from that
definition.
EXHIBIT 4-B
Estimating the Frequency of Domestic Violence Against Pregnant Women
All women are at risk of being battered; however, pregnancy places a woman at
increased risk for severe injury and adverse health consequences, both for herself
and her unborn infant. Local and exploratory studies have found as many as
40%-60% of battered women to have been abused during pregnancy. Among 542
women in a Dallas shelter, for example, 42% had been battered when pregnant.
Most of the women reported that the violence became more acute during the
pregnancy and the child’s infancy. In another study, interviews of 270 battered
women across the United States found that 44% had been abused during pregnancy.
But most reports on battering during pregnancy have come from samples of battered
women, usually women in shelters. To establish the prevalence of battering during
pregnancy in a representative obstetric population, McFarlane and associates
randomly sampled and interviewed 290 healthy pregnant women from public and
private clinics in a large metropolitan area with a population exceeding three
million. The 290 Black, White, and Latina women ranged in age from 18 to 43
years; most were married, and 80% were at least five months pregnant. Nine
questions relating to abuse were asked of the women, for example, whether they
were in a relationship with a male partner who had hit, slapped, kicked, or
otherwise physically hurt them during the current pregnancy and, if yes, had the
abuse increased. Of the 290 women, 8% reported battering during the current
pregnancy (one out of every twelve women interviewed). An additional 15%
reported battering before the current pregnancy. The frequency of battering did not
vary as a function of demographic variables.
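As a worked illustration of the kind of estimate reported in Exhibit 4-B, the Python sketch below computes the sample proportion and an approximate 95% confidence interval using the conventional normal-approximation formula. The sample size and percentage are taken from the exhibit; the interval is an added illustration, not a result reported by McFarlane and associates.

# A minimal sketch using the figures from Exhibit 4-B; the confidence interval is an
# added illustration based on the standard normal approximation.
import math

n = 290    # pregnant women randomly sampled from public and private clinics
p = 0.08   # proportion reporting battering during the current pregnancy

se = math.sqrt(p * (1 - p) / n)              # standard error of the proportion
lower, upper = p - 1.96 * se, p + 1.96 * se  # approximate 95% confidence limits

print(f"Estimated prevalence: {p:.1%} (95% CI roughly {lower:.1%} to {upper:.1%})")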
For some social issues, existing data sources, such as surveys and censuses, may be
of sufficient quality to be used with confidence for assessing certain aspects of a social
problem. For example, accurate and trustworthy information can usually be obtained
about issues on which information is collected by the Current Population Survey of the
U.S. Bureau of the Census or the decennial U.S. Census. The decennial Census volumes
contain data on census tracts (small areas containing about 4,000 households) that can
be aggregated to get neighborhood and community data. As an illustration, Exhibit 4-C
describes the use of vital statistics records and census data to assess the nature and
magnitude of the problem of poor birth outcomes in a Florida county. This needs
assessment was aimed at estimating child and maternal health needs so that appropriate
services could be planned. Even when such direct information about the problem of
interest is not available from existing records, indirect estimates may be possible if the
empirical relationships between available information and problem indicators are
known (e.g., Ciarlo et al., 1992). For example, the proportion of schoolchildren in a
given neighborhood who are eligible for free lunches is often used as an indicator of the
prevalence of poverty in that neighborhood.
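The logic of such an indirect estimate can be sketched in a few lines of Python. The neighborhood names, free-lunch shares, and the calibration factor linking them to poverty rates are hypothetical assumptions; in practice, the calibration would come from prior research on the empirical relationship.

# A minimal sketch of an indirect estimate; the neighborhoods, free-lunch shares, and
# calibration factor are hypothetical assumptions, not actual data.
free_lunch_share = {      # share of pupils eligible for free lunch, by neighborhood
    "Northside": 0.42,
    "Riverview": 0.18,
    "Elm Park": 0.61,
}

# Assumed calibration: suppose prior research indicates that household poverty rates
# run at roughly 0.7 times the school free-lunch share in comparable neighborhoods.
CALIBRATION = 0.7

for neighborhood, share in free_lunch_share.items():
    estimated_poverty = CALIBRATION * share
    print(f"{neighborhood}: estimated poverty rate about {estimated_poverty:.0%}")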
When evaluators use sources whose validity is not as widely recognized as that of
the census, they must assess the validity of the data by examining carefully how they
were collected. A good rule of thumb is to anticipate that, on any issue, different data
sources will provide disparate or even contradictory estimates.
On some topics, existing data sources provide periodic measures that chart
historical trends in the society. For example, the Current Population Survey of the
Bureau of the Census collects annual data on the characteristics of the U.S. population
using a large household sample. The data include measures of the composition of
households, individual and household income, and household members’ age, sex, and
race. The regular Survey of Income and Program Participation provides data on the
extent to which the U.S. population participates in various social programs:
unemployment benefits, food stamps, job training programs, and so on.
A regularly occurring measure such as those mentioned above, called a social
indicator, can provide important information for assessing social problems and needs in
several ways. First, when properly analyzed, the data can often be used to estimate the
size and distribution of the social problem whose course is being tracked over time.
Second, the trends shown can be used to alert decisionmakers to whether certain social
conditions are improving, remaining the same, or deteriorating. Finally, the social
indicator trends can be used to provide a first, if crude, estimate of the effects of social
programs that have been in place. For example, the Survey of Income and Program
Participation can be used to estimate the coverage of such national programs as food
stamps or job training.
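The simplest use of a social indicator series, comparing values over time to judge whether a condition is improving, stable, or deteriorating, can be illustrated with the short Python sketch below. The yearly values and the threshold for flagging change are invented for illustration.

# A minimal sketch of reading a trend from a social indicator series; the yearly values
# and the half-point threshold are invented for illustration.
poverty_rate_by_year = {1996: 13.7, 1997: 13.3, 1998: 12.7, 1999: 11.9, 2000: 11.3}

years = sorted(poverty_rate_by_year)
first, last = poverty_rate_by_year[years[0]], poverty_rate_by_year[years[-1]]
change = last - first

if change < -0.5:
    verdict = "improving (rate falling)"
elif change > 0.5:
    verdict = "deteriorating (rate rising)"
else:
    verdict = "roughly unchanged"

print(f"Poverty rate moved from {first}% to {last}%: {verdict}")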
EXHIBIT 4-C
Using Vital Statistics and Census Data to Assess Child and Maternal Health Needs
Infant mortality. The county’s rate was far higher than national or state rates.
Fetal mortality. The rate for the county was higher than the state goal, and the
rate for African American mothers was higher than for white mothers.
Neonatal mortality. The rates were higher than the state goal for white mothers
but below the state goal for African American mothers.
Postneonatal mortality. The rates were below state goals.
Low birth weight babies. There was a higher incidence for adolescents and
women over age 35.
Very low birth weight births. The overall rate was twice that for the whole
state and exceeded state goals for both African American and white mothers.
Adolescent pregnancy. The proportion of births to teens was over twice the
state average; the rate for African American teens was more than twice that for
white teens.
Age of mother. The infant mortality and low birth weight rates were highest among
children born to mothers 16-18 years of age.
Education of mother. Mothers with less than a high school education were
slightly more likely to have low birth weight newborns but almost eight times
more likely to have newborns identified as high risk on infant screening
measures.
Based on these findings, three groups were identified as being at high risk for poor
birth outcomes:
U.S. Census data were then used to identify the number of women of childbearing
age in each of these risk categories, the proportions in low-income strata, and their
geographic concentrations within the county. This information was used by the
coalition to identify the major problem areas in the county, set goals, and plan
services.
SOURCE: Adapted from E. Walter Terrie, “Assessing Child and Maternal Health: The
First Step in the Design of Community-Based Interventions,” in Needs Assessment: A
Creative and Practical Guide for Social Scientists, eds. R. Reviere, S. Berkowitz, C.
C. Carter, and C. G. Ferguson (Washington, DC: Taylor & Francis, 1996), pp. 121-146.
Social indicator data are often used to monitor changes in social conditions that may
be affected by social programs. Considerable effort has gone into the collection of
social indicator data on poor households in an effort to judge whether their
circumstances have worsened or improved after the radical reforms in welfare enacted
in the Personal Responsibility and Work Opportunity Reconciliation Act of 1996.
Repeated special surveys concentrating on the well-being of children are being
conducted by the Urban Institute and the Manpower Development Research
Corporation. In addition, the Bureau of the Census has extended the Survey of Income
and Program Participation to constitute a panel of households repeatedly interviewed
before and after the welfare reforms were instituted (Rossi, 2001).
Unfortunately, the social indicators currently available are limited in their coverage
of social problems, focusing mainly on issues of poverty and employment, national
program participation, and household composition. For many social problems, no social
indicators exist. In addition, those that do exist support only analysis of national and
regional trends and cannot be broken down to provide useful indicators of trends in
states or smaller jurisdictions.
In many instances, no existing data source will provide estimates of the extent and
distribution of a problem of interest. For example, there are no ready sources of
information about household pesticide misuse that would indicate whether it is a
problem, say, in households with children. In other instances, good information about a
problem may be available for a national or regional sample that cannot be disaggregated
to a relevant local level. The National Survey of Household Drug Use, for instance,
uses a nationally representative sample to track the nature and extent of substance abuse.
However, the number of respondents from most states is not large enough to provide
good state-level estimates of drug abuse, and no valid city-level estimates can be
derived at all.
When pertinent data are nonexistent or insufficient, the evaluator must consider
collecting new data. There are several ways of making estimates of the extent and
distribution of social problems, ranging from expert opinion to a large-scale sample
survey. Decisions about the kind of research effort to undertake must be based in part
on the funds available and how important it is to have precise estimates. If, for
legislative or program design purposes, it is critical to know the precise number of
malnourished infants in a political jurisdiction, a carefully planned health interview
survey may be necessary. In contrast, if the need is simply to determine whether there is
any malnutrition among infants, input from knowledgeable informants may be all that is
required. This section reviews three types of sources that evaluators can mine for
pertinent data.
Surveys and Censuses
When it is necessary to get very accurate information on the extent and distribution
of a problem and there are no existing credible data, the evaluator may need to
undertake original research using sample surveys or censuses (complete enumerations).
Because they come in a variety of sizes and degrees of technical complexity, either of
these techniques can involve considerable effort and skill, not to mention a substantial
commitment of resources.
To illustrate one extreme, Exhibit 4-D describes a needs assessment survey
undertaken to estimate the size and composition of the homeless population of Chicago.
The survey covered both persons in emergency shelters and homeless persons who did
not use shelters. Surveying the latter involved searching Chicago streets in the middle of
the night. The survey was undertaken because the Robert Wood Johnson Foundation and
the Pew Memorial Trust were planning a program for increasing the access of homeless
persons to medical care. Although there was ample evidence that serious medical
conditions existed among the homeless populations in urban centers, no reliable
information was available about either the size of the homeless population or the extent
of their medical problems. Hence, the foundations funded a research project to collect
that information.
Usually, however, needs assessment research is not as elaborate as that described in
Exhibit 4-D. In many cases, conventional sample surveys can provide adequate information. If,
for example, reliable information is required about the number and distribution of
children needing child care so that new facilities can be planned, it will usually be
feasible to obtain it from sample surveys conducted on the telephone. Exhibit 4-E
describes a telephone survey conducted with more than 1,100 residents of Los Angeles
County to ascertain the extent of public knowledge about the effectiveness of AIDS
prevention behaviors. For mass media educational programs aimed at increasing
awareness of ways to prevent AIDS, a survey such as this identifies both the extent and
the nature of the gaps in public knowledge.
Many survey organizations have the capability to plan, carry out, and analyze
sample surveys for needs assessment. In addition, it is often possible to add questions to
regularly conducted studies in which different organizations buy time, thereby reducing
costs. Whatever the approach, it must be recognized that designing and implementing
sample surveys can be a complicated endeavor requiring high skill levels. Indeed, for
many evaluators, the most sensible approach may be to contract with a reputable survey
organization. For further discussion of the various aspects of sample survey
methodology, see Fowler (1993) and Henry (1990).
Key Informant Surveys
Perhaps the easiest, though by no means most reliable, approach to estimating the
extent of a social problem is to ask key informants, persons whose position or
experience should provide them with some knowledge of the magnitude and distribution
of the problem. Key informants can often provide very useful information about the
characteristics of a target population and the nature of service needs. Unfortunately,
few are likely to have a vantage point or information sources that permit very good
estimation of the actual number of persons affected by a social condition or the
demographic and geographic distribution of those persons. Well-placed key informants,
for instance, may have experience with the homeless, but it will be difficult for them to
extrapolate from that experience to an estimate of the size of the total population.
Indeed, it has been shown that selected informants’ guesses about the numbers of
homeless in their localities vary widely and are generally erroneous (see Exhibit 4-F).
EXHIBIT 4-D
Using Sample Surveys to Study the Chicago Homeless
A person was classified as homeless at the time of the survey if that person was a
resident of a shelter for homeless persons or was encountered in the block searches
and found not to rent, own, or be a member of a household renting or owning a
conventional dwelling unit. Conventional dwelling units included apartments,
houses, rooms in hotels or other structures, and mobile homes.
SOURCE: Adapted from P. H. Rossi, Down and Out in America: The Origins of
Homelessness (Chicago: University of Chicago Press, 1989).
EXHIBIT 4-E
Assessing the Extent of Knowledge About AIDS Prevention
To gauge the extent of knowledge about how to avoid HIV infection, a sample of Los
Angeles County residents was interviewed on the telephone. The residents were
asked to rate the effectiveness of four methods that “some people use to avoid
On the grounds that key informants’ reports of the extent of a problem are better than
no information at all, evaluators may wish to use a key informant survey when a better
approach is not feasible. Under such circumstances, the survey must be conducted as
cautiously as possible. The evaluator should choose persons to be surveyed who have
the necessary expertise and ensure that they are questioned in a careful manner (Averch,
1994).
EXHIBIT 4-F
Using Key Informant Estimates of the Homeless Population
Clearly, the estimates were all over the map. Two providers (4 and 5) came fairly
close to what the researchers estimated as the most likely number, based on shelter,
SRO, and street counts.
SOURCE: Adapted from Hamilton, Rabinowitz, and Alschuler, Inc., The Changing
Face of Misery: Los Angeles’ Skid Row Area in Transition—Housing and Social
Services Needs of Central City East (Los Angeles: Community Redevelopment Agency,
July 1987).
Forecasting Needs
What Is a Target?
The targets of a social program are usually individuals. But they also may be groups
(families, work teams, organizations), geographically and politically related areas (such
as communities), or physical units (houses, road systems, factories). Whatever the
target, it is imperative at the outset of a needs assessment to clearly define the relevant
units.
In the case of individuals, targets are usually identified in terms of their social and
demographic characteristics or their problems, difficulties, and conditions. Thus, targets
of an educational program may be designated as children aged 10 to 14 who are one to
three years below their normal grade in school. The targets of a maternal and infant care
program may be defined as pregnant women and mothers of infants with annual incomes
less than 150% of the poverty line.
When aggregates (groups or organizations) are targets, they are often defined in
terms of the characteristics of the individuals that constitute them: their informal and
formal collective properties and their shared problems. For example, an organizational-
level target for an educational intervention might be elementary schools (kindergarten to
eighth grade) with at least 300 pupils in which at least 30% of the pupils qualify for the
federal free lunch program.
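Such target definitions can be restated as explicit eligibility rules, as in the brief Python sketch below. The thresholds mirror the examples in the text, but the function names and data layout are assumptions introduced for illustration.

# A minimal sketch of target definitions expressed as eligibility rules; thresholds
# mirror the examples in the text, while function names and arguments are assumptions.

def is_target_child(age: int, grade_lag_years: float) -> bool:
    # Educational program example: children aged 10 to 14 who are one to three
    # years below their normal grade in school.
    return 10 <= age <= 14 and 1 <= grade_lag_years <= 3

def is_target_school(enrollment: int, free_lunch_share: float) -> bool:
    # Organizational-level example: elementary schools with at least 300 pupils
    # in which at least 30% of pupils qualify for the federal free lunch program.
    return enrollment >= 300 and free_lunch_share >= 0.30

print(is_target_child(age=12, grade_lag_years=2))              # True
print(is_target_school(enrollment=250, free_lunch_share=0.50)) # False: too few pupils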
Specifying Targets
At first glance, specifying the target population for a program may seem simple.
However, although target definitions are easy to write, the results often fall short when
the program or the evaluator attempts to use them to identify who is properly included
or excluded from program services. There are few social problems that can be easily
and convincingly described in terms of simple, unambiguous characteristics of the
individuals experiencing the problem.
Take a single illustration: What is a resident with cancer in a given community? The
answer depends on the meanings of both “resident” and “cancer.” Does “resident”
include only permanent residents, or does it also include temporary ones (a decision
that would be especially important in a community with a large number of vacationers,
such as Orlando, Florida)? As for “cancer,” are “recovered” cases to be included, and,
whether they are in or out, how long without a relapse constitutes recovery? Are cases
of cancer to be defined only as diagnosed cases, or do they also include persons whose
cancer had not yet been detected? Are all cancers to be included regardless of type or
severity? While it should be possible to formulate answers to these and similar
questions for a given program, this illustration shows that the evaluator can expect a
certain amount of difficulty in determining exactly how a program’s target population is
defined.
Target Boundaries
Adequate target specification establishes boundaries, that is, rules determining who
or what is included and excluded when the specification is applied. One risk in
specifying target populations is to make a definition too broad or overinclusive. For
example, specifying that a criminal is anyone who has ever violated a law is useless;
only saints have not at one time or another violated some law, wittingly or otherwise.
This definition of criminal is too inclusive, lumping together in one category trivial and
serious offenses and infrequent violators with habitual felons.
Definitions may also prove too restrictive, or underinclusive, sometimes to the point
that almost no one falls into the target population. Suppose that the designers of a
program to rehabilitate released felons decide to include only those who have never
been drug or alcohol abusers. The extent of substance abuse is so great among released
prisoners that few would be eligible given this exclusion. In addition, because persons
with long arrest and conviction histories are more likely to be substance abusers, this
definition eliminates those most in need of rehabilitation as targets of the proposed
program.
Useful target definitions must also be feasible to apply. A specification that hinges
on a characteristic that is difficult to observe or for which existing records contain no
data may be virtually impossible to put into practice. Consider, for example, the
difficulty of identifying the targets of a job training program if they are defined as
persons who hold favorable attitudes toward accepting job training. Complex
definitions requiring detailed information may be similarly difficult to apply. The data
required to identify targets defined as “farmer members of producers’ cooperatives who
have planted barley for at least two seasons and have an adolescent son” would be
difficult, if not impossible, to gather.
A useful distinction for describing the conditions a program aims to improve is the
difference between incidence and prevalence. Incidence refers to the number of new
cases of a particular problem that are identified or arise in a specified area or context
during a specified period of time. Prevalence refers to the total number of existing
cases in that area at a specified time. These concepts come from the field of public
health, where they are sharply distinguished. To illustrate, the incidence of influenza
during a particular month would be defined as the number of new cases reported during
that month. Its prevalence during that month would be the total number of people
afflicted, regardless of when they were first stricken. In the health sector, programs
generally are interested in incidence when dealing with disorders of short duration, such
as upper-respiratory infections and minor accidents. They are more interested in
prevalence when dealing with problems that require long-term management and
treatment efforts, such as chronic conditions and long-term illnesses.
The concepts of incidence and prevalence also apply to social problems. In studying
the impact of crime, for instance, a critical measure is the incidence of victimization—
the number of new victims in a given area per interval of time. Similarly, in programs
aimed at lowering drunken-driver accidents, the incidence of such accidents may be the
best measure of the need for intervention. But for chronic conditions such as low
educational attainment, criminality, or poverty, prevalence is generally the appropriate
measure. In the case of poverty, for instance, prevalence may be defined as the number
of poor individuals or families in a community at a given time, regardless of when they
became poor.
For other social problems, both prevalence and incidence may be relevant
characteristics of the target population. In dealing with unemployment, for instance, it is
important to know its prevalence, the proportion of the population unemployed at a
particular time. If the program’s concern is with providing short-term financial support
for the newly unemployed, however, it is the incidence rate that defines the target
population.
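The distinction can be made concrete with a small Python sketch that counts incidence and prevalence from dated case records. The records are invented; the point is only that incidence counts cases arising during the period, whereas prevalence counts all cases active at some time within it.

# A minimal sketch with invented case records: each case has a date of onset and a
# date of resolution (None if still ongoing).
from datetime import date

cases = [
    (date(2002, 1, 10), date(2002, 2, 1)),
    (date(2002, 3, 5), None),
    (date(2002, 3, 20), date(2002, 4, 15)),
    (date(2001, 11, 2), None),
]

period_start, period_end = date(2002, 3, 1), date(2002, 3, 31)

# Incidence: cases whose onset falls within the period.
incidence = sum(period_start <= onset <= period_end for onset, _ in cases)

# Prevalence: cases active at some time during the period, regardless of onset date.
prevalence = sum(
    onset <= period_end and (resolved is None or resolved >= period_start)
    for onset, resolved in cases
)

print(f"Incidence in March 2002: {incidence} new cases")    # 2
print(f"Prevalence during March 2002: {prevalence} cases")  # 3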
Rates
In some circumstances, it is useful to be able to express incidence or prevalence as
a rate within an area or population. Thus, the number of new crime victims in a
community during a given period (incidence) might be described in terms of the rate per
1,000 persons in that community (e.g., 23 new victims per 1,000 residents). Rates are
especially handy for comparing problem conditions across areas or groups. For
example, in describing crime victims, it is informative to have estimates by sex and age
group. Although almost every age group is subject to some kind of crime victimization,
young people are much more likely to be the victims of robbery and assault, whereas
older persons are more likely to be the victims of burglary and larceny; men are
considerably less likely than women to be the victims of sexual abuse, and so on. The
ability to identify program targets with different profiles of problems and risks allows a
needs assessment to examine the way a program is tailored (or not) to those different
groups.
In most cases, it is customary and useful to specify rates by age and sex. In
communities with cultural diversity, differences among racial, ethnic, and religious
groups may also be important aspects of a program’s target population. Other variables
that may be relevant for identifying characteristics of the target population include
socioeconomic status, geographic location, and residential mobility. (See Exhibit 4-G
for an example of crime victimization rates broken down by sex, age, and race.)
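The arithmetic of rate computation is simple, as the Python sketch below shows for hypothetical victimization counts expressed per 1,000 persons in each subgroup. The counts and population figures are invented and are not taken from Exhibit 4-G.

# A minimal sketch with invented counts and population sizes; rates are expressed
# per 1,000 persons in each subgroup.
victims = {"men 16-24": 180, "women 16-24": 95, "men 65+": 12, "women 65+": 9}
population = {"men 16-24": 4200, "women 16-24": 4000, "men 65+": 3100, "women 65+": 3900}

for group in victims:
    rate_per_1000 = 1000 * victims[group] / population[group]
    print(f"{group}: {rate_per_1000:.1f} victims per 1,000 persons")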
EXHIBIT 4-G
Rates of Violent Crime Victimization by Sex, Age, and Race
Community influentials:
People are on waiting lists for services for several months.
There are not enough professionals or volunteers here.
There is inadequate provider knowledge about specialized services.
SOURCE: Adapted from Holly W. Halvorson, Donna K. Pike, Frank M. Reed, Maureen
W. McClatchey, and Carol A. Gosselink, “Using Qualitative Methods to Evaluate
Health Service Delivery in Three Rural Colorado Communities,” Evaluation & the
Health Professions, 1993, 16(4):434-447.
EXHIBIT 4-I
Sample Protocol for a Needs Assessment Focus Group
A focus group protocol is a list of topics that is used to guide discussion in a focus
group session. The protocol should (1) cover topics in a logical, developmental
order so that they build on one another; (2) raise open-ended issues that are
engaging and relevant to the participants and that invite the group to make a
collective response; and (3) carve out manageable “chunks” of topics to be
examined one at a time in a delimited period. For example, the following protocol is
for use in a focus group with low-income women to explore the barriers to
receiving family support services:
Introduction: greetings; explain purpose of the session; fill out name cards;
introduce observers, ground rules, and how the focus group works (10
minutes).
Participant introductions: first names only; where participants live, age of
children; which family support services are received and for how long; other
services received (10 minutes).
Introduce idea of barriers to services: ask participants for their views on the
most important barriers to receipt of family support services (probe regarding
transportation, treatment by agency personnel, regulations, waiting lists); have
they discontinued any services or been unable to get ones they want? (30
minutes).
Probe for reasons behind their choices of most important barriers (20 minutes).
Ask for ideas on what could be done to overcome barriers: what would make it
easier to enter and remain in the service loop? (30 minutes).
Debrief and wrap up: moderator summary, clarifications, and additional
comments or questions (10 minutes).
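One simple aid when working with a protocol such as the one above is to record its segments and allotted minutes and confirm that the total fits the available session time, as in the Python sketch below; the two-hour budget used in the check is an assumption.

# A minimal sketch: segment names and minutes follow the example protocol above; the
# two-hour session budget is an assumption.
protocol = [
    ("Introduction and ground rules", 10),
    ("Participant introductions", 10),
    ("Most important barriers to services", 30),
    ("Reasons behind the choices of barriers", 20),
    ("Ideas for overcoming barriers", 30),
    ("Debrief and wrap-up", 10),
]

total_minutes = sum(minutes for _, minutes in protocol)
print(f"Planned session length: {total_minutes} minutes")  # 110 minutes

if total_minutes > 120:
    print("Warning: the protocol exceeds the assumed two-hour session")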
EXHIBIT 4-J
Homeless Men and Women Report Their Needs for Help
As efforts to help the homeless move beyond the provision of temporary shelter, it is
important to understand homeless individuals’ perspectives on their needs for
assistance. Responses from a representative sample of 1,260 homeless men and
women interviewed in New York City shelters revealed that they had multiple needs
not easily met by a single service. The percentage reporting a need for help on each
of 20 items was as follows:
Finding a place to live 87.1
Having a steady income 71.0
Finding a job 63.3
Improving my job skills 57.0
Learning how to get what I have coming from agencies 45.4
Getting on public assistance 42.1
SOURCE: Adapted with permission from Daniel B. Herman, Elmer L. Struening, and
Susan M. Barrow, “Self-Reported Needs for Help Among Homeless Men and Women,”
Evaluation and Program Planning, 1994, 17(3):249-256.
Summary
Needs assessment attempts to answer questions about the social conditions a
program is intended to address and the need for the program, or to determine whether a
new program is needed. More generally, it may be used to identify, compare, and
prioritize needs within and across program areas.
To specify the size and distribution of a problem, evaluators may gather and
analyze data from existing sources, such as the U.S. Census, or use ongoing social
indicators to identify trends. Because the needed information often cannot be obtained
from such sources, evaluators frequently conduct their own research on a social
problem. Useful approaches include studies of agency records, surveys, censuses, and
key informant surveys. Each of these has its uses and limitations; for example, key
informant surveys may be relatively easy to conduct but of doubtful reliability; agency
records generally represent persons in need of services but may be incomplete; surveys
and censuses can provide valid, representative information but can be expensive and
technically demanding.
Forecasts of future needs are often very relevant to needs assessment but are
complex technical activities ordinarily performed by specialists. In using forecasts,
evaluators must take care to examine the assumptions on which the forecasts are based.
KEY CONCEPTS
Focus group
A small panel of persons selected for their knowledge or perspective on a topic of
interest that is convened to discuss the topic with the assistance of a facilitator. The
discussion is used to identify important themes or to construct descriptive summaries of
views and experiences on the focal topic.
Incidence
The number of new cases of a particular problem or condition that arise in a specified
area during a specified period of time. Compare prevalence.
Key informants
Persons whose personal or professional position gives them a knowledgeable
perspective on the nature and scope of a social problem or a target population and
whose views are obtained during a needs assessment.
Population at risk
The individuals or units in a specified area with characteristics indicating that they have
a significant probability of having or developing a particular condition.
Population in need
The individuals or units in a specified area that currently manifest a particular
problematic condition.
Prevalence
The total number of existing cases with a particular condition in a specified area at a
specified time. Compare incidence.
Rate
The occurrence or existence of a particular condition expressed as a proportion of units
in the relevant population (e.g., deaths per 1,000 adults).
Sample survey
A survey of a sample drawn from a larger population, typically conducted with
interviews or questionnaires, that is used to estimate the characteristics of that
population.
Snowball sampling
A nonprobability sampling method in which each person interviewed is asked to suggest
additional knowledgeable people for interviewing. The process continues until no new
names are suggested.
Social indicator
Periodic measurements designed to track the course of a social condition over time.
Chapter Outline
The Evaluability Assessment Perspective
Describing Program Theory
Program Impact Theory
The Service Utilization Plan
The Program’s Organizational Plan
Eliciting Program Theory
Defining the Boundaries of the Program
Explicating the Program Theory
Program Goals and Objectives
Program Functions, Components, and Activities
The Logic or Sequence Linking Program Functions, Activities,
and Components
Corroborating the Description of the Program Theory
Assessing Program Theory
Assessment in Relation to Social Needs
Assessment of Logic and Plausibility
Assessment Through Comparison With Research and Practice
Assessment Via Preliminary Observation
Possible Outcomes of Program Theory Assessment
Mario Cuomo, former governor of New York, once described his mother’s rules for
success as (1) figure out what you want to do and (2) do it. These are pretty much the
same rules that social programs must follow if they are to be effective. Given an
identified need, program decisionmakers must (1) conceptualize a program capable of
alleviating that need and (2) implement it. In this chapter, we review the concepts and
procedures an evaluator can apply to the task of assessing the quality of the program
conceptualization, which we have called the program theory. In the next chapter, we
describe how the evaluator can assess the quality of the program’s implementation.
Whether it is expressed in a detailed program plan and rationale or only implicit in
the program’s structure and activities, the program theory explains why the program
does what it does and provides the rationale for expecting that doing so will achieve the
desired results. When examining a program’s theory, evaluators often find that it is not
very convincing. There are many poorly designed social programs with faults that
reflect deficiencies in their underlying conception of how the desired social benefits can
be attained. This happens in large part because insufficient attention is given during the
planning of new programs to careful, explicit conceptualization of a program’s
objectives and how they are supposed to be achieved. Sometimes the political context
within which programs originate does not permit extensive planning but, even when that
is not the case, conventional practices for designing programs pay little attention to the
underlying theory. The human service professions operate with repertoires of
established services and types of intervention associated with their respective specialty
areas. As a result, program design is often a matter of configuring a variation of familiar
“off the shelf” services into a package that seems appropriate for a social problem
without a close analysis of the match between those services and the specific nature of
the problem.
For example, many social problems that involve deviant behavior, such as alcohol
and drug abuse, criminal behavior, early sexual activity, or teen pregnancy, are
addressed by programs that provide the target population with some mix of counseling
and educational services. This approach is based on an assumption that is rarely made
explicit during the planning of the program, namely, that people will change their
problem behavior if given information and interpersonal support for doing so. While
this assumption may seem reasonable, experience and research provide ample evidence
that such behaviors are resistant to change even when participants are provided with
knowledge about how to change and receive strong encouragement from loved ones to
do so. Thus, the theory that education and supportive counseling will reduce deviant
behavior may not be a sound basis for program design.
A program’s rationale and conceptualization, therefore, are just as subject to critical
scrutiny within an evaluation as any other important aspect of the program. If the
program’s goals and objectives do not relate in a reasonable way to the social
conditions the program is intended to improve, or the assumptions and expectations
embodied in a program’s functioning do not represent a credible approach to bringing
about that improvement, there is little prospect that the program will be effective.
The first step in assessing program theory is to articulate it, that is, to produce an
explicit description of the conceptions, assumptions, and expectations that constitute the
rationale for the way the program is structured and operated. Only rarely can a program
immediately provide the evaluator with a full statement of its underlying theory.
Although the program theory is always implicit in the program’s structure and
operations, a detailed account of it is seldom written down in program documents.
Moreover, even when some write-up of program theory is available, it is often in
material that has been prepared for funding proposals or public relations purposes and
may not correspond well with actual program practice.
Assessment of program theory, therefore, almost always requires that the evaluator
synthesize and articulate the theory in a form amenable to analysis. Accordingly, the
discussion in this chapter is organized around two themes: (1) how the evaluator can
explicate and express program theory in a form that will be representative of key
stakeholders’ actual understanding of the program and workable for purposes of evaluation,
and (2) how the evaluator can assess the quality of the program theory that has been thus
articulated. We begin with a brief description of a perspective that has provided the
most fully developed approaches to evaluating program theory.
EXHIBIT 5-A
A Rationale for Evaluability Assessment
These four problems, which characterize many public and private programs, can be
reduced and often overcome by a qualitative evaluation process, evaluability
assessment, that documents the breadth of the four problems and helps programs—
and subsequent program evaluation work—to meet the following criteria:
SOURCE: Quoted from Joseph S. Wholey, “Assessing the Feasibility and Likely
Usefulness of Evaluation,” in Handbook of Practical Program Evaluation, eds. J. S.
Wholey, H. P. Hatry, and K. E. Newcomer (San Francisco: Jossey-Bass, 1994), p. 16.
EXHIBIT 5-B
Evaluability Assessment for the Appalachian Regional Commission
Evaluators from the Urban Institute worked with managers and policymakers in the
Appalachian Regional Commission (ARC) on the design of their health and child
development program. In this evaluability assessment, the evaluators:
Reviewed existing data on each of the 13 state ARC-funded health and child
development programs
Made visits to five states and then selected two states to participate in
evaluation design and implementation
Reviewed documentation related to congressional, commission, state, and
project objectives and activities (including the authorizing legislation,
congressional hearings and committee reports, state planning documents,
project grant applications, ARC contract reports, local planning documents,
project materials, and research projects)
Interviewed approximately 75 people on congressional staffs and in
commission headquarters, state ARC and health and child development staffs,
local planning units, and local projects
Participated in workshops with approximately 60 additional health and child
development practitioners, ARC state personnel, and outside analysts
Analysis and synthesis of the resulting data yielded a logic model that presented
program activities, program objectives, and the assumed causal links between them.
The measurability and plausibility of program objectives were then analyzed and
new program designs more likely to lead to demonstrably effective performance
were presented. These included both an overall ARC program model and a series of
individual models, each concerned with an identified objective of the program.
EXHIBIT 5-C
Overview of Program Theory
EXHIBIT 5-D
Diagrams Illustrating Program Impact Theories
EXHIBIT 5-F
Organizational Schematic for an Aftercare Program for Psychiatric Patients
A crucial early step in articulating program theory is to define the boundaries of the
program at issue (Smith, 1989). A human service agency may have many programs and
provide multiple services; a regional program may have many agencies and sites. There
is usually no one correct definition of a program, and the boundaries the evaluator
applies will depend, in large part, on the scope of the evaluation sponsor’s concerns
and the program domains to which they apply.
EXHIBIT 5-G
A Logic Model for a Teen Mother Parenting Education Program
One way to define the boundaries of a program for the purpose of articulating the
program theory is to work from the perspective of the decisionmakers who are expected
to act on the findings of the evaluation. The evaluator’s definition of the program should
at minimum represent the relevant jurisdiction of those decisionmakers and the
organizational structures and activities about which decisions are likely to be made. If,
for instance, the sponsor of the evaluation is the director of a local community mental
health agency, then the evaluator may define the boundaries of the program around one
of the distinct service packages administered by that director, such as outpatient
counseling for eating disorders. If the evaluation sponsor is the state director of mental
health, however, the relevant program boundaries may be defined around effectiveness
questions that relate to the outpatient counseling component of all the local mental health
agencies in the state.
Because program theory deals mainly with means-ends relations, the most critical
aspect of defining program boundaries is to ensure that they encompass all the important
activities, events, and resources linked to one or more outcomes recognized as central
to the endeavor. This can be accomplished by starting with the benefits the program
intends to produce and working backward to identify all the activities and resources
under relevant organizational auspices that are presumed to contribute to attaining those
objectives. From this perspective, the eating disorders program at either the local or
state level would be defined as the set of activities organized by the respective mental
health agency that has an identifiable role in attempting to alleviate eating disorders for
the eligible population.
Although these approaches are straightforward in concept, they can be problematic
in practice. Not only can programs be complex, with crosscutting resources, activities,
and goals, but the characteristics described above as linchpins for program definition
can themselves be difficult to establish. Thus, in this matter, as with so many other
aspects of evaluation, the evaluator must be prepared to negotiate a program definition
agreeable to the evaluation sponsor and key stakeholders and be flexible about
modifying the definition as the evaluation progresses.
For a program in the early planning stage, program theory might be built by the
planners from prior practice and research. At this stage, an evaluator may be able to
help develop a plausible and well-articulated theory. For an existing program, however,
the appropriate task is to describe the theory that is actually embodied in the program’s
structure and operation. To accomplish this, the evaluator must work with stakeholders
to draw out the theory represented in their actions and assumptions. The general
procedure for this involves successive approximation. Draft descriptions of the program
theory are generated, usually by the evaluator, and discussed with knowledgeable
stakeholder informants to get feedback. The draft is then refined on the basis of their
input and shown again to appropriate stakeholders. The theory description developed in
this fashion may involve impact theory, process theory, or any components or
combination that are deemed relevant to the purposes of the evaluation. Exhibit 5-H
presents one evaluator’s account of how a program process theory was elicited.
The primary sources of information for developing and differentiating descriptions
of program theory are (1) review of program documents; (2) interviews with program
stakeholders and other selected informants; (3) site visits and observation of program
functions and circumstances; and (4) the social science literature. Three types of
information the evaluator may be able to extract from those sources will be especially
useful.
Program Goals and Objectives
Perhaps the most important matter to be determined from program sources relates to
the goals and objectives of the program, which are necessarily an integral part of the
program theory, especially its impact theory. The goals and objectives that must be
represented in program theory, however, are not necessarily the same as those identified
in a program’s mission statements or in responses to questions asked of stakeholders
about the program’s goals. To be meaningful for an evaluation, program goals must
identify a state of affairs that could realistically be attained as a result of program
actions; that is, there must be some reasonable connection between what the program
does and what it intends to accomplish. Smith (1989) suggests that, to keep the
discussion concrete and specific, the evaluator should use a line of questioning that does
not ask about goals directly but asks instead about consequences. For instance, in a
review of major program activities, the evaluator might ask about each, “Why do it?
What are the expected results? How could you tell if those results actually occurred?”
The resulting set of goal statements must then be integrated into the description of
program theory. Goals and objectives that describe the changes the program aims to
bring about in social conditions relate to program impact theory. A program goal of
reducing unemployment, for instance, identifies a distal outcome in the impact theory.
Program goals and objectives related to program activities and service delivery, in turn,
help reveal the program process theory. If the program aims to offer afterschool care for the latchkey children of working parents, a portion of the service utilization plan is
revealed. Similarly, if an objective is to offer literacy classes four times a week, an
important element of the organizational plan is identified.
To properly describe the program process theory, the evaluator must identify each
distinct program component, its functions, and the particular activities and operations
associated with those functions. Program functions include such operations as “assess
client need,” “complete intake,” “assign case manager,” “recruit referral agencies,”
“train field workers,” and the like. The evaluator can generally identify such functions
by determining the activities and job descriptions of the various program personnel.
When clustered into thematic groups, these functions represent the constituent elements
of the program process theory.
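As a purely illustrative aside (nothing in this chapter requires software), the short Python sketch below shows one way an evaluator might record elicited program functions once they have been clustered into thematic components; the component labels are hypothetical, and the functions are those listed above.

```python
# Illustrative only: recording elicited program functions, grouped into hypothetical
# components, so the outline can be printed and circulated to stakeholders for feedback.

process_theory = {
    "Client intake": ["assess client need", "complete intake", "assign case manager"],
    "Referral network": ["recruit referral agencies"],
    "Staff development": ["train field workers"],
}

# Print a simple outline of the draft process theory.
for component, functions in process_theory.items():
    print(component)
    for function in functions:
        print(f"  - {function}")
```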
EXHIBIT 5-H
Formulating Program Process Theory for Adapted Work Service
The creation of the operational model of the AWS program involved using Post-
it notes and butcher paper to provide a wall-size depiction of the program. The
first session involved only the researcher and the program director. The first
question asked was, “What happens when a prospective participant calls the
center for information?” The response was recorded on a Post-it note and
placed on the butcher paper. The next step was then identified, and this too was
recorded and placed on the butcher paper. The process repeated itself until all
(known) activities were identified and placed on the paper. Once the program
director could not identify any more activities, the Post-it notes were combined
into clusters. The clusters were discussed until potential component labels
began to emerge. Since this exercise was the product of only two people, the
work was left in an unused room for two weeks so that the executive director
and all other members of the management team could react to the work. They
were to identify missing, incorrect, or misplaced activities as well as comment
on the proposed components. After several feedback sessions from the staff
members and discussions with the executive director, the work was typed and
prepared for presentation to the Advisory Board. The board members were
able to reflect on the content, provide further discussion, and suggest additional
changes. Several times during monthly board meetings, the executive director
asked that the model be revisited for planning purposes. This helped further
clarify the activities as well as sharpen the group’s thinking about the program.
The description of program theory that results from the procedures we have
described will generally represent the program as it was intended more than as it
actually is. Program managers and policymakers think of the idealized program as the
“real” one with various shortfalls from that ideal as glitches that do not represent what
the program is really about. Those further away from the day-to-day operations, on the
other hand, may be unaware of such shortfalls and will naturally describe what they
presume the program to be even if in actuality it does not quite live up to that image.
Some discrepancy between program theory and reality is therefore natural. Indeed,
examination of the nature and magnitude of that discrepancy is the task of process or
implementation evaluation, as discussed in the next chapter. However, if the theory is so
overblown that it cannot realistically be held up as a depiction of what is supposed to
happen, it needs to be revised. Suppose, for instance, that a job training program’s
service utilization plan calls for monthly contacts between each client and a case
manager. If the program resources are insufficient to support case managers, and none
are employed, this part of the theory is fanciful and should be restated to more
realistically depict what the program might actually be able to accomplish.
Given that the program theory depicts a realistic scenario, confirming it is a matter
of demonstrating that pertinent program personnel and stakeholders endorse it as a
meaningful account of how the program is intended to work. If it is not possible to
generate a theory description that all relevant stakeholders accept as reasonable, this
indicates that the program is poorly defined or that it embodies competing philosophies.
In such cases, the most appropriate response for the evaluator may be to take on a
consultant role and assist the program in clarifying its assumptions and intentions to
yield a theory description that will be acceptable to all key stakeholders.
For the evaluator, the end result of the theory description exercise is a detailed and
complete statement of the program as intended that can then be analyzed and assessed as
a distinct aspect of the evaluation. Note that the agreement of stakeholders serves only
to confirm that the theory description does, in fact, represent their understanding of how
the program is supposed to work. It does not necessarily mean that the theory is a good
one. To determine the soundness of a program theory, the evaluator must not only
describe the theory but evaluate it. The procedures evaluators use for that purpose are
described in the next section.
The most important framework for assessing program theory builds on the results of needs assessment, as discussed in Chapter 4, or, more generally, on a thorough understanding of the social problem the program is intended to address and the
service needs of the target population. A program theory that does not relate in an
appropriate manner to the actual nature and circumstances of the social conditions at
issue will result in an ineffective program no matter how well the program is
implemented and administered. It is fundamental, therefore, to assess program theory in
relationship to the needs of the target population the program is intended to serve.
There is no push-button procedure an evaluator can use to assess whether program
theory describes a suitable conceptualization of how social needs should be met.
Inevitably, this assessment requires judgment calls. When the assessment is especially
critical, its validity is strengthened if those judgments are made collaboratively with
relevant experts and stakeholders to broaden the range of perspectives and expertise on
which they are based. Such collaborators, for instance, might include social scientists
knowledgeable about research and theory related to the intervention, administrators
with long experience managing such programs, representatives of advocacy groups
associated with the target population, and policymakers or policy advisers highly
familiar with the program and problem area.
Whatever the nature of the group that contributes to the assessment, the crucial
aspect of the process is specificity. When program theory and social needs are
described in general terms, there often appears to be more correspondence than is
evident when the details are examined. To illustrate, consider a curfew program, initiated in a metropolitan area to address the problem of skyrocketing juvenile crime, that prohibits juveniles under age 18 from being outside their homes after midnight.
The program theory, in general terms, is that the curfew will keep the youths home at
night and, if they are at home, they are unlikely to commit crimes. Because the general
social problem the program addresses is juvenile crime, the program theory does seem
responsive to the social need.
A more detailed problem diagnosis and service needs assessment, however, might
show that the bulk of juvenile crimes are residential burglaries committed in the late
afternoon when school lets out. Moreover, it might reveal that the offenders represent a
relatively small proportion of the juvenile population who have a disproportionately
large impact because of their high rates of offending. Furthermore, it might be found that
these juveniles are predominantly latchkey youths who have no supervision during
afterschool hours. When the program theory is then examined in some detail, it is
apparent that it assumes that significant juvenile crime occurs late at night and that
potential offenders will both know about and obey the curfew. Furthermore, it depends
on enforcement by parents or the police if compliance does not occur voluntarily.
Although even more specificity than this would be desirable, this much detail
illustrates how a program theory can be compared with need to discover shortcomings
in the theory. In this example, examining the particulars of the program theory and the
social problem it is intended to address reveals a large disconnect. The program
blankets the whole city rather than targeting the small group of problem juveniles and
focuses on activity late at night rather than during the late afternoon, when most of the
crimes actually occur. In addition, it makes the questionable assumptions that youths
already engaged in more serious lawbreaking will comply with a curfew, that parents
who leave their delinquent children unsupervised during the early part of the day will
be able to supervise their later behavior, and that the overburdened police force will
invest sufficient effort in arresting juveniles who violate the curfew to enforce
compliance. Careful review of these particulars alone would raise serious doubts about
the validity of this program theory (Exhibit 5-I presents another example).
One useful approach to comparing program theory with what is known (or assumed)
about the relevant social needs is to separately assess impact theory and program
process theory. Each of these relates to the social problem in a different way and, as
each is elaborated, specific questions can be asked about how compatible the
assumptions of the theory are with the nature of the social circumstances to which it
applies. We will briefly describe the main points of comparison for each of these theory
components.
Program impact theory involves the sequence of causal links between program
services and outcomes that improve the targeted social conditions. The key point of
comparison between program impact theory and social needs, therefore, relates to
whether the effects the program is expected to have on the social conditions correspond
to what is required to improve those conditions, as revealed by the needs assessment.
Consider, for instance, a school-based educational program aimed at getting elementary
school children to learn and practice good eating habits. The problem this program
attempts to ameliorate is poor nutritional choices among school-age children, especially
those in economically disadvantaged areas. The program impact theory would show a
sequence of links between the planned instructional exercises and the children’s
awareness of the nutritional value of foods, culminating in healthier selections and
therefore improved nutrition.
EXHIBIT 5-I
The Needs of the Homeless as a Basis for Assessing Program Theory
These findings offer two lines of analysis for assessment of program theory. First,
any program that intends to alleviate homelessness must provide services that
address the major problems that homeless persons experience. That is, the expected
outcomes of those services (impact theory) must represent improvements in the most
problematic domains if the conditions of the homeless are to be appreciably
improved. Second, the design of the service delivery system (program process
theory) must be such that multiple services can be readily and flexibly provided to
homeless individuals in ways that will be accessible to them despite their limited
resources and difficult circumstances. Careful, detailed comparison of the program
theory embodied in any program for this homeless population with the respective
needs assessment data, therefore, will reveal how sound that theory is as a design
for effective intervention.
Now, suppose a thorough needs assessment shows that the children’s eating habits
are, indeed, poor but that their nutritional knowledge is not especially deficient. The
needs assessment further shows that the foods served at home and even those offered in
the school cafeterias provide limited opportunity for healthy selections. Against this
background, it is evident that the program impact theory is flawed. Even if the program
successfully imparts additional information about healthy eating, the children will not be
able to act on it because they have little control over the selection of foods available to
them. Thus, the proximal outcomes the program impact theory describes may be
achieved, but they are not what is needed to ameliorate the problem at issue.
Program process theory, on the other hand, represents assumptions about the
capability of the program to provide services that are accessible to the target population
and compatible with their needs. These assumptions, in turn, can be compared with
information about the target population’s opportunities to obtain service and the barriers
that inhibit them from using the service. The process theory for an adult literacy program
that offers evening classes at the local high school, for instance, may incorporate
instructional and advertising functions and an appropriate selection of courses for the
target population. The details of this scheme can be compared with needs assessment
data that show what logistical and psychological support the target population requires
to make effective use of the program. Child care and transportation may be critical for
some potential participants. Also, illiterate adults may be reluctant to enroll in courses
without more personal encouragement than they would receive from advertising.
Cultural and personal affinity with the instructors may be important factors in attracting
and maintaining participation from the target population as well. The intended program
process can thus be assessed in terms of how responsive it is to these dimensions of the
needs of the target population.
A thorough job of articulating program theory should reveal the critical assumptions
and expectations inherent in the program’s design. One essential form of assessment is
simply a critical review of the logic and plausibility of these aspects of the program
theory. Commentators familiar with assessing program theory suggest that a panel of
reviewers be organized for that purpose (Chen, 1990; Rutman, 1980; Smith, 1989;
Wholey, 1994). Such an expert review panel should include representatives of the
program staff and other major stakeholders as well as the evaluator. By definition,
however, stakeholders have some direct stake in the program. To balance the assessment
and expand the available expertise, it may be advisable to bring in informed persons
with no direct relationship to the program. Such outside experts might include
experienced administrators of similar programs, social researchers with relevant
specialties, representatives of advocacy groups or client organizations, and the like.
A review of the logic and plausibility of program theory will necessarily be a
relatively unstructured and open-ended process. Nonetheless, there are some general
issues such reviews should address. These are described below in the form of questions
reviewers can ask. Additional useful detail can be found in Rutman (1980), Smith
(1989), and Wholey (1994). Also see Exhibit 5-J for an example.
Are the program goals and objectives well defined? The outcomes for which the
program is accountable should be stated in sufficiently clear and concrete terms to
permit a determination of whether they have been attained. Goals such as
“introducing students to computer technology” are not well defined in this sense,
whereas “increasing student knowledge of the ways computers can be used” is
well defined and measurable.
Are the program goals and objectives feasible? That is, is it realistic to assume
that they can actually be attained as a result of the services the program delivers? A
program theory should specify expected outcomes that are of a nature and scope
that might reasonably follow from a successful program and that do not represent
unrealistically high expectations. Moreover, the stated goals and objectives should
involve conditions the program might actually be able to affect in some meaningful
fashion, not those largely beyond its influence. “Eliminating poverty” is grandiose
for any program, whereas “decreasing the unemployment rate” is not. But even the
latter goal might be unrealistic for a program located in a chronically depressed
labor market.
Is the change process presumed in the program theory plausible? The presumption
that a program will create benefits for the intended target population depends on
the occurrence of some cause-and-effect chain that begins with the targets’
interaction with the program and ends with the improved circumstances in the
target population that the program expects to bring about. Every step of this causal
chain should be plausible. Because the validity of this impact theory is the key to
the program’s ability to produce the intended effects, it is best if the theory is
supported by evidence that the assumed links and relationships actually occur. For
example, suppose a program is based on the presumption that exposure to literature
about the health hazards of drug abuse will motivate long-term heroin addicts to
renounce drug use. In this case, the program theory does not present a plausible
change process, nor is it supported by any research evidence.
Are the procedures for identifying members of the target population, delivering
service to them, and sustaining that service through completion well defined and
sufficient? The program theory should specify procedures and functions that are
both well defined and adequate for the purpose, viewed both from the perspective
of the program’s ability to perform them and the target population’s likelihood of
being engaged by them. Consider, for example, a program to test for high blood
pressure among poor and elderly populations to identify those needing medical
care. It is relevant to ask whether this service is provided in locations accessible
to members of these groups and whether there is an effective means of locating
those with uncertain addresses. Absent these characteristics, it is unlikely that
many persons from the target groups will receive the intended service.
EXHIBIT 5-J
Assessing the Clarity and Plausibility of the Program Theory for Maryland’s 4-H
Program
Although every program is distinctive in some ways, few are based entirely on
unique assumptions about how to engender change, deliver service, and perform major
program functions. Some information applicable to assessing the various components of
program theory is likely to exist in the social science and human services research
literature. One useful approach to assessing program theory, therefore, is to find out
whether it is congruent with research evidence and practical experience elsewhere
(Exhibit 5-K summarizes one example of this approach).
There are several ways in which evaluators might compare a program theory with
findings from research and practice. The most straightforward is to examine evaluations
of programs based on similar concepts. The results will give some indication of the
likelihood that a program will be successful and perhaps identify critical problem
areas. Evaluations of very similar programs, of course, will be the most informative in
this regard. However, evaluation results for programs that are similar only in terms of
general theory, even if different in other regards, might also be instructive.
EXHIBIT 5-K
GREAT Program Theory Is Consistent With Criminological Research
The program has no officially stated theoretical grounding other than Glasser’s
(1975) reality therapy, but GREAT training officers and others associated with the
program make reference to sociological and psychological concepts as they train
GREAT instructors. As part of an analysis of the program’s impact theory, a team of
criminal justice researchers identified two well-researched criminological theories
relevant to gang participation: Gottfredson and Hirschi’s self-control theory (SCT)
and Akers’s social learning theory (SLT). They then reviewed the GREAT lesson
plans to assess their consistency with the most pertinent aspects of these theories. To
illustrate their findings, a summary of Lesson 4 is provided below with the
researchers’ analysis in italics after the lesson description:
SOURCE: Adapted from L. Thomas Winfree, Jr., Finn-Aage Esbensen, and D. Wayne
Osgood, “Evaluating a School-Based Gang-Prevention Program: A Theoretical
Perspective,” Evaluation Review, 1996, 20(2):181-203.
Consider a mass media campaign in a metropolitan area to encourage women to
have mammogram screening for early detection of breast cancer. The impact theory for
this program presumes that exposure to TV, radio, and newspaper messages will
stimulate a reaction that will eventuate in increased rates of mammogram screening. The
credibility of the impact theory assumed to link exposure and increases in testing is
enhanced by evidence that similar media campaigns in other cities have resulted in
increased mammogram testing. Moreover, the program’s process theory also gains some
support if the evaluations of other campaigns show that their program functions and schemes for delivering messages to the target population were similar to those intended for
the program at issue. Suppose, however, that no evaluation results are available about
media campaigns promoting mammogram screening in other cities. It might still be
informative to examine information about analogous media campaigns. For instance,
reports may be available about media campaigns to promote immunizations, dental
checkups, or other such actions that are health related and require a visit to a provider.
So long as these campaigns involve similar principles, their success might well be
relevant to assessing the program theory on which the mammogram campaign is based.
In some instances, basic research on the social and psychological processes central
to the program may be available as a framework for assessing the program theory,
particularly impact theory. Unfortunately for the evaluation field, relatively little basic
research has been done on the social dynamics that are common and important to
intervention programs. Where such research exists, however, it can be very useful. For
instance, a mass media campaign to encourage mammogram screening involves
messages intended to change attitudes and behavior. The large body of basic research in
social psychology on attitude change and its relationship to behavior provides some
basis for assessing the impact theory for such a media campaign. One established
finding is that messages designed to raise fears are generally less effective than those
providing positive reasons for a behavior. Thus, an impact theory based on the
presumption that increasing awareness of the dangers of breast cancer will prompt
increased mammogram screening may not be a good one.
There is also a large applied research literature on media campaigns and related
approaches in the field of advertising and marketing. Although this literature largely has
to do with selling products and services, it too may provide some basis for assessing
the program theory for the breast cancer media campaign. Market segmentation studies,
for instance, may show what media and what times of the day are best for reaching
women with various demographic profiles. The evaluator can then use this information
to examine whether the program’s service utilization plan is optimal for communicating
with women whose age and circumstances put them at risk for breast cancer.
Use of the research literature to help with assessment of program theory is not
limited to situations of good overall correspondence between the programs or processes
the evaluator is investigating and those represented in the research. An alternate
approach is to break the theory down into its component parts and linkages and search
for research evidence relevant to each component. Much of program theory can be
stated as “if-then” propositions: If case managers are assigned, then more services will
be provided; if school performance improves, then delinquent behavior will decrease;
if teacher-to-student ratios are higher, then students will receive more individual
attention. Research may be available that indicates the plausibility of individual
propositions of this sort. The results, in turn, can provide a basis for a broader
assessment of the theory with the added advantage of identifying any especially weak
links. This approach was pioneered by the Program Evaluation and Methodology
Division of the U.S. General Accounting Office as a way to provide rapid review of
program proposals arising in the Congress (Cordray, 1993; U.S. General Accounting
Office, 1990).
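To make the if-then form of this approach concrete, the brief Python sketch below lists a few propositions of the kind quoted above and flags those lacking strong research support; the evidence ratings are hypothetical and would, in practice, come from the literature review.

```python
# Illustrative only: a program theory expressed as "if-then" propositions, each tagged
# with a (hypothetical) rating of the research evidence behind it, so weak links stand out.

from dataclasses import dataclass

@dataclass
class Link:
    if_condition: str   # the "if" side of the proposition
    then_outcome: str   # the "then" side of the proposition
    evidence: str       # reviewer's judgment: "strong", "mixed", or "none"

program_theory = [
    Link("case managers are assigned", "more services will be provided", "strong"),
    Link("school performance improves", "delinquent behavior will decrease", "mixed"),
    Link("teacher-to-student ratios are higher",
         "students will receive more individual attention", "none"),
]

# Flag the links that deserve the closest scrutiny in the theory assessment.
for link in program_theory:
    if link.evidence != "strong":
        print(f"Weak link: if {link.if_condition}, "
              f"then {link.then_outcome} (evidence: {link.evidence})")
```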
EXHIBIT 5-L
Testing a Model of Patient Education for Diabetes
Summary
Program theory is an aspect of a program that can be evaluated in its own right.
Such assessment is important because a program based on weak or faulty
conceptualization has little prospect of achieving the intended results.
The most fully developed approaches to evaluating program theory have been
described in the context of evaluability assessment, an appraisal of whether a program’s
performance can be evaluated and, if so, whether it should be. Evaluability assessment
involves describing program goals and objectives, assessing whether the program is
well enough conceptualized to be evaluable, and identifying stakeholder interest in
evaluation findings. Evaluability assessment may result in efforts by program managers
to better conceptualize their program. It may indicate that the program is too poorly
defined for evaluation or that there is little likelihood that the findings will be used.
Alternatively, it could find that the program theory is well defined and plausible, that
evaluation findings will likely be used, and that a meaningful evaluation could be done.
To assess program theory, it is first necessary for the evaluator to describe the
theory in a clear, explicit form acceptable to stakeholders. The aim of this effort is to
describe the “program as intended” and its rationale, not the program as it actually is.
Three key components that should be included in this description are the program
impact theory, the service utilization plan, and the program’s organizational plan.
The assumptions and expectations that make up a program theory may be well
formulated and explicitly stated (thus constituting an articulated program theory), or they
may be inherent in the program but not overtly stated (thus constituting an implicit
program theory). When a program theory is implicit, the evaluator must extract and
articulate the theory by collating and integrating information from program documents,
interviews with program personnel and other stakeholders, and observations of program
activities. It is especially important to formulate clear, concrete statements of the
program’s goals and objectives as well as an account of how the desired outcomes are
expected to result from program action. The evaluator should seek corroboration from
stakeholders that the resulting description meaningfully and accurately describes the
“program as intended.”
There are several approaches to assessing program theory. The most important
assessment the evaluator can make is based on a comparison of the intervention
specified in the program theory with the social needs the program is expected to
address. Examining critical details of the program conceptualization in relation to the
social problem indicates whether the program represents a reasonable plan for
ameliorating that problem. This analysis is facilitated when a needs assessment has
been conducted to systematically diagnose the problematic social conditions (Chapter
4).
Program theory also can be assessed in relation to the support for its critical
assumptions found in research or documented practice elsewhere. Sometimes findings
are available for similar programs, or programs based on similar theory, so that the
evaluator can make an overall comparison between a program’s theory and relevant
evidence. If the research and practice literature does not support overall comparisons,
however, evidence bearing on specific key relationships assumed in the program theory
may still be obtainable.
Assessment of program theory may indicate that the program is not evaluable
because of basic flaws in its theory. Such findings are an important evaluation product
in their own right and can be informative for program stakeholders. In such cases, one
appropriate response is to redesign the program, a process in which the evaluator may
serve as a consultant. If evaluation proceeds without articulation of a credible program
theory, the results will be ambiguous. In contrast, a sound program theory provides a
basis for evaluation of how well that theory is implemented, what outcomes are
produced, and how efficiently they are produced, topics to be discussed in subsequent
chapters.
KEY CONCEPTS
Articulated program theory
An explicitly stated version of program theory that is spelled out in some detail as part
of a program’s documentation and identity or as a result of efforts by the evaluator and
stakeholders to formulate the theory.
Black box evaluation
Evaluation of program outcomes without the benefit of an articulated program theory to
provide insight into what is presumed to be causing those outcomes and why.
Evaluability assessment
Negotiation and investigation undertaken jointly by the evaluator, the evaluation
sponsor, and possibly other stakeholders to determine whether a program meets the
preconditions for evaluation and, if so, how the evaluation should be designed to ensure
maximum utility.
Impact theory
A causal theory describing cause-and-effect sequences in which certain program
activities are the instigating causes and certain social benefits are the effects they
eventually produce.
Organizational plan
Assumptions and expectations about what the program must do to bring about the
transactions between the target population and the program that will produce the
intended changes in social conditions. The program’s organizational plan is articulated
from the perspective of program management and encompasses both the functions and
activities the program is expected to perform and the human, financial, and physical
resources required for that performance.
Process theory
The combination of the program’s organizational plan and its service utilization plan
into an overall description of the assumptions and expectations about how the program
is supposed to operate.
Chapter Outline
What Is Program Process Evaluation and Monitoring?
Setting Criteria for Judging Program Process
Common Forms of Program Process Evaluations
Process or Implementation Evaluation
Continuous Program Process Evaluation (Monitoring) and
Management Information Systems
Perspectives on Program Process Monitoring
Process Monitoring From the Evaluator’s Perspective
Process Monitoring From an Accountability Perspective
Process Monitoring From a Management Perspective
Monitoring Service Utilization
Coverage and Bias
Measuring and Monitoring Coverage
Program Records
Surveys
Assessing Bias: Program Users, Eligibles, and Dropouts
Monitoring Organizational Functions
Service Delivery Is Fundamental
After signing a new bill, President John F. Kennedy is reputed to have said to his
aides, “Now that this bill is the law of the land, let’s hope we can get our government to
carry it out.” Both those in high places and those on the front lines are often justified in
being skeptical about the chances that a social program will be appropriately
implemented. Many steps are required to take a program from concept to full operation,
and much effort is needed to keep it true to its original design and purposes. Thus,
whether any program is fully carried out as envisioned by its sponsors and managers is
always problematic.
Ascertaining how well a program is operating, therefore, is an important and useful
form of evaluation, known as program process evaluation. (A widely used alternative
label is implementation evaluation.) It does not represent a single distinct evaluation
procedure but, rather, a family of approaches, concepts, and methods. The defining
theme of program process evaluation (or simply process evaluation) is a focus on the
enacted program itself—its operations, activities, functions, performance, component
parts, resources, and so forth. When process evaluation involves an ongoing effort to
measure and record information about the program’s operation, we will refer to it as
program process monitoring.
EXHIBIT 6-A
Process Evaluation to Assess Integrated Services for Children
Many analysts have observed that the traditional system of categorical funding for
children’s services, with funds allocated to respond to specific problems under
strict rules regarding eligibility and expenditures, has not served children’s needs
well. The critics argue that this system fragments services and inhibits collaboration
between programs that might otherwise lead to more effective services. In 1991, the
Robert Wood Johnson Foundation launched the Child Health Initiative to test the
feasibility of achieving systemic changes through the integration of children’s
services and finances. Specifically, the initiative called for the development of the
following components:
Nine sites across the country were selected to launch demonstration programs. The
Institute for Health Policy Studies, University of California, San Francisco,
conducted an evaluation of these programs with two major goals: (1) to gauge the
degree to which the implementation of the projects was consistent with the original
planning objectives (fidelity to the model) and (2) to assess the extent to which each
of the major program components was implemented. In the first year, the evaluation
focused on the political, organizational, and design phase of program development.
During subsequent years, the focus turned to implementation and preliminary
outcomes. A combination of methods was used, including site visits, written surveys
completed by the program managers, in-depth interviews of key participants, focus
groups of service providers and clients, and reviews of project-related documents.
The evaluation found that most of the nine sites experienced some degree of success
in implementing the monitoring and care coordination components, but none was
able to implement decategorization. The general findings for each component were
as follows:
SOURCE: Adapted from Claire Brindis, Dana C. Hughes, Neal Halfon, and Paul W.
Newacheck, “The Use of Formative Evaluation to Assess Integrated Services for
Children,” Evaluation & the Health Professions, 1998, 21(1):66-90.
EXHIBIT 6-B
An Integrated Information System for a Family and Marriage Counseling Agency in
Israel
The Marital and Family Counselling Agency is run under the joint auspices of the
Tel Aviv Welfare Department and the School of Social Work at Tel Aviv University.
The agency provides marital and family counseling and community services for the
Jewish, Muslim, and Christian residents of one of the poorest sections of Tel Aviv.
The integrated information system developed for the agency is designed to follow up
clients from the moment they request help to the end of treatment. It is intended to
serve the agency and the individual counselors by monitoring the process and
outcomes of treatment and providing the data needed to make organizational and
clinical decisions. To accomplish this, data are collected on three forms and then
programmed into the computerized information system. The data elements include
the following:
The counselors can enter and retrieve data from this system whenever they wish and
are given a graph of each client’s status every three months to support clinical
decisions. Also, reports are generated for the clinic’s management. For example, a
report of the distribution of clients by ethnic group led to the development of a
program located within Arab community centers to better reach that population.
Other management reports describe the ways and times at which treatment is
terminated, the problems that brought clients to the agency, and the percentage of
people who applied for treatment but did not show up for the first session. The
information system has also been used for research purposes. For example, studies
were conducted on the predictors of treatment success, the comparative perceptions
by clients and counselors of the treatment process and outcomes, and gender
differences in presenting problems.
SOURCE: Adapted from Rivka Savaya, “The Potential and Utilization of an Integrated
Information System at a Family and Marriage Counselling Agency in Israel,”
Evaluation and Program Planning, 1998, 21(1):11-20.
EXHIBIT 6-C
Program and Service Utilization Studies
Government sponsors and funding groups, including Congress, operate in the glare
of the mass media. Their actions are also visible to the legislative groups who authorize
programs and to government “watchdog” organizations. For example, at the federal
level, the Office of Management and Budget, part of the executive branch, wields
considerable authority over program development, funding, and expenditures. The U.S.
General Accounting Office, an arm of Congress, advises members of the House and
Senate on the utility of programs and in some cases conducts evaluations. Both state
governments and those of large cities have analogous oversight groups. No social
program that receives outside funding, whether public or private, can expect to avoid
scrutiny and escape demand for accountability.
In addition to funders and sponsors, other stakeholders may press for program
accountability. In the face of taxpayers’ reservations about spending for social
programs, together with the increased competition for resources often resulting from
cuts in available funding, all stakeholders are scrutinizing both the programs they
support and those they do not. Concerned parties use process monitoring information to
lobby for the expansion of programs they advocate or find congenial with their self-
interests and the curtailment or abandonment of those programs they disdain.
Stakeholders, it should be noted, include the targets themselves. A dramatic illustration
of their perspective occurred when President Ronald Reagan telephoned an artificial
heart recipient to wish him well and, with all of the country listening, the patient
complained about not receiving his Social Security check.
Clearly, social programs operate in a political world. It could hardly be otherwise,
given the stakes involved. The human and social service industry is not only huge in
dollar volume and number of persons employed but is also laden with ideological and
emotional baggage. Programs are often supported or opposed by armies of vocal
community members; indeed, the social program sector is comparable only to the
defense industry in its lobbying efforts, and the stands that politicians take with respect
to particular programs often determine their fates in elections. Accountability
information is a major weapon that stakeholders use in their battles as advocates and
antagonists.
Service utilization issues typically break down into questions about coverage and
bias. Whereas coverage refers to the extent to which participation by the target
population achieves the levels specified in the program design, bias is the degree to
which some subgroups participate in greater proportions than others. Clearly, coverage
and bias are related. A program that reaches all projected participants and no others is
obviously not biased in its coverage. But because few social programs ever achieve
total coverage, bias is typically an issue.
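As a simple numerical illustration of these two concepts, the Python sketch below computes an overall coverage rate and compares each subgroup’s share of participants with its share of the eligible population; the counts and subgroup labels are hypothetical, and this ratio is only one of several ways bias might be expressed.

```python
# Illustrative only: hypothetical counts of eligible persons and persons served, by subgroup.
eligible = {"men": 400, "women": 600}   # estimated eligible target population
served = {"men": 120, "women": 80}      # participants actually served

total_eligible = sum(eligible.values())
total_served = sum(served.values())

# Coverage: the proportion of the target population the program reaches.
print(f"Overall coverage: {total_served / total_eligible:.0%}")

# Bias: a subgroup served in greater (or lesser) proportion than its share of eligibles.
for group in eligible:
    ratio = (served[group] / total_served) / (eligible[group] / total_eligible)
    print(f"{group}: {ratio:.2f} times their share of the eligible population")
```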
Bias can arise out of self-selection; that is, some subgroups may voluntarily
participate more frequently than others. It can also derive from program actions. For
instance, a program’s personnel may react favorably to some clients while rejecting or
discouraging others. One temptation commonly faced by programs is to select the most
“success prone” targets. Such “creaming” frequently occurs because of the self-interests
of one or more stakeholders (a dramatic example is described in Exhibit 6-D). Finally, bias may
result from such unforeseen influences as the location of a program office, which may
encourage greater participation by a subgroup that enjoys more convenient access to
program activities.
Although there are many social programs, such as food stamps, that aspire to serve
all or a very large proportion of a defined target population, typically programs do not
have the resources to provide services to more than a portion of potential targets. In the
latter case, the target definition established during the planning and development of the
program frequently is not specific enough. Program staff and sponsors can correct this
problem by defining the characteristics of the target population more sharply and by
using resources more effectively. For example, establishing a health center to provide
medical services to persons in a defined community who do not have regular sources of
care may result in such an overwhelming demand that many of those who want services
cannot be accommodated. The solution might be to add eligibility criteria that weight
such factors as severity of the health problem, family size, age, and income to reduce the
size of the target population to manageable proportions while still serving the neediest
persons. In some programs, such as WIC (the Women, Infants, and Children nutrition program) or housing vouchers for the poor, undercoverage is a systemic problem;
Congress has never provided sufficient funding to cover all who were eligible, perhaps
hoping that budgets could be expanded in the future.
The opposite effect, overcoverage, also occurs. For instance, the TV program
Sesame Street has consistently captured audiences far exceeding the intended targets
(disadvantaged preschoolers), including children who are not at all disadvantaged and
even adults. Because these additional audiences are reached at no additional cost, this
overcoverage is not a financial drain. It does, however, thwart one of Sesame Street’s
original goals, which was to lessen the gap in learning between advantaged and
disadvantaged children.
In other instances, overcoverage can be costly and problematic. Bilingual programs
in schools, for instance, have often been found to include many students whose primary
language is English. Some school systems whose funding from the program depends on
the number of children enrolled in bilingual classes have inflated attendance figures by
registering inappropriate students. In other cases, schools have used assignment to
bilingual instruction as a means of ridding classes of “problem children,” thus saturating
bilingual classes with disciplinary cases.
EXHIBIT 6-D
“Creaming” the Unemployed
The most common coverage problem in social interventions, however, is the failure
to achieve high target participation, either because of bias in the way targets are
recruited or retained or because potential clients are unaware of the program, are
unable to use it, or reject it. For example, in most employment training programs only
small minorities of those eligible by reason of unemployment ever attempt to
participate. Similar situations occur in mental health, substance abuse, and numerous
other programs (see Exhibit 6-E). We turn now to the question of how program coverage
and bias might be measured as a part of program process monitoring.
EXHIBIT 6-E
The Coverage of the Food Stamp Program for the Homeless
Because most homeless persons are eligible by income for food stamps, their
participation rates in that program should be high. But they are not: Burt and Cohen
reported that only 18% of the persons sampled were receiving food stamps and
almost half had never used them. This is largely because certification for food
stamps requires passing a means test, a procedure that requires some documentation.
This is not easy for many homeless, who may not have the required documents, an
address to receive the stamps, or the capability to fill out the forms.
Legislation passed in 1986 allowed homeless persons to exchange food stamps for
meals offered by nonprofit organizations and made shelter residents in places where
meals were served eligible for food stamps. By surveying food providers, shelters,
and food kitchens, however, Burt and Cohen found that few meal providers had
applied for certification as receivers of food stamps. Of the roughly 3,000 food
providers in the sample, only 40 had become authorized.
Furthermore, among those authorized to receive food stamps, the majority had never
started to collect food stamps or had started and then abandoned the practice. It
made little sense to collect food stamps as payment for meals that otherwise were provided free: on the same food lines, food stamp participants were asked to
pay for their food with stamps while nonparticipants paid nothing. The only food
provider who was able to use the system was one that required either cash payment
or labor for meals; for this program, food stamps became a substitute for these
payments.
SOURCE: Based on Martha Burt and Barbara Cohen, Feeding the Homeless: Does the
Prepared Meals Provision Help? Report to Congress on the Prepared Meal Provision,
vols. I and II (Washington, DC: Urban Institute, 1988). Reprinted with permission.
The problem in measuring coverage is almost always the inability to specify the
number in need, that is, the magnitude of the target population. The needs assessment
procedures described in Chapter 4, if carried out as an integral part of program planning,
usually minimize this problem. In addition, three sources of information can be used to
assess the extent to which a program is serving the appropriate target population:
program records, surveys of program participants, and community surveys.
Program Records
Almost all programs keep records on targets served. Data from well-maintained
record systems—particularly from MISs—can often be used to estimate program bias or
overcoverage. For instance, information on the various screening criteria for program
intake may be tabulated to determine whether the units served are the ones specified in
the program’s design. Suppose the targets of a family planning program are women less
than 50 years of age who have been residents of the community for at least six months
and who have two or more children under age ten. Records of program participants can
be examined to see whether the women actually served are within the eligibility limits
and the degree to which particular age or parity groups are under- or overrepresented.
Such an analysis might also disclose bias in program participation in terms of the
eligibility characteristics or combinations of them. Another example, involving public
shelter utilization by the homeless, is described in Exhibit 6-F.
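For illustration only, the Python sketch below applies the eligibility limits from the family planning example to a few invented participant records; the field names and the records themselves are hypothetical.

```python
# Illustrative only: checking (invented) program records against the stated eligibility
# limits -- women under 50, resident at least six months, two or more children under ten.

participants = [
    {"age": 34, "months_resident": 24, "children_under_10": 2},
    {"age": 52, "months_resident": 36, "children_under_10": 3},  # over the age limit
    {"age": 29, "months_resident": 3, "children_under_10": 2},   # too recently arrived
]

def within_eligibility_limits(record):
    return (record["age"] < 50
            and record["months_resident"] >= 6
            and record["children_under_10"] >= 2)

outside = [r for r in participants if not within_eligibility_limits(r)]
print(f"{len(outside)} of {len(participants)} participants fall outside the eligibility limits")
```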
However, programs differ widely in the quality and extensiveness of their records
and in the sophistication involved in storing and maintaining them. Moreover, the
feasibility of maintaining complete, ongoing record systems for all program participants
varies with the nature of the intervention and the available resources. In the case of
medical and mental health systems, for example, sophisticated, computerized
management and client information systems have been developed for managed care
purposes that would be impractical for many other types of programs.
In measuring target participation, the main concerns are that the data are accurate
and reliable. It should be noted that all record systems are subject to some degree of
error. Some records will contain incorrect or outdated information, and others will be
incomplete. The extent to which unreliable records can be used for decision making
depends on the kind and degree of their unreliability and the nature of the decisions in
question. Clearly, critical decisions involving significant outcomes require better
records than do less weighty decisions. Whereas a decision on whether to continue a
project should not be made on the basis of data derived from partly unreliable records,
data from the same records may suffice for a decision to change an administrative
procedure.
EXHIBIT 6-F
Public Shelter Utilization Among Homeless Adults in New York and Philadelphia
The cities of Philadelphia and New York have standardized admission procedures
for persons requesting services from city-funded or -operated shelters. All persons
admitted to the public shelter system must provide intake information for a
computerized registry that includes the client’s name, race, date of birth, and gender,
and must also be assessed for substance abuse and mental health problems, medical
conditions, and disabilities. A service utilization study conducted by researchers
from the University of Pennsylvania analyzed data from this registry for New York
City for 1987-1994 (110,604 men and 26,053 women) and Philadelphia for 1991-
1994 (12,843 men and 3,592 women).
They found three predominant types of users: (1) the chronically homeless,
characterized by very few shelter episodes, but episodes that might last as long as
several years; (2) the episodically homeless, characterized by multiple, increasingly
shorter stays over a long period; and (3) the transitionally homeless, who had one or
two stays of short duration within a relatively brief period of time.
The most notable finding was the size and relative resource consumption of the
chronically homeless. In New York, for instance, 18% of the shelter users stayed
180 days or more in their first year and consumed 53% of the total system days for
first-time shelter users, roughly triple their proportionate share of the shelter
population. These long-stay users tended to be older people and to
have mental health, substance abuse, and, in some cases, medical problems.
Surveys
As mentioned earlier in this chapter, for many programs that fail to show impacts,
the problem is a failure to deliver the interventions specified in the program design, a
problem generally known as implementation failure. There are three kinds of
implementation failures: First, no intervention, or not enough, is delivered; second, the
wrong intervention is delivered; and third, the intervention is unstandardized or
uncontrolled and varies excessively across the target population.
“Nonprograms” and Incomplete Intervention
Consider first the problem of the “nonprogram” (Rossi, 1978). McLaughlin (1975)
reviewed the evidence on the implementation of Title I of the Elementary and Secondary
Education Act, which allocated billions of dollars yearly to aid local schools in
overcoming students’ poverty-associated educational deprivations. Even though schools
had expended the funds, local school authorities were unable to describe their Title I
activities in any detail, and few activities could even be identified as educational
services delivered to schoolchildren. In short, little evidence could be found that school
programs existed that were directed toward the goal of helping disadvantaged children.
The failure of numerous other programs to deliver services has been documented as
well. Datta (1977), for example, reviewed the evaluations on career education
programs and found that the designated targets rarely participated in the planned
program activities. Similarly, an attempt to evaluate PUSH-EXCEL, a program designed
to motivate disadvantaged high school students toward higher levels of academic
achievement, disclosed that the program consisted mainly of the distribution of buttons
and hortative literature and little else (Murray, 1980).
A delivery system may dilute the intervention so that an insufficient amount reaches
the target population. Here the problem may be a lack of commitment on the part of a
front-line delivery system, resulting in minimal delivery or “ritual compliance,” to the
point that the program does not exist. Exhibit 6-G, for instance, expands on an exhibit presented
in Chapter 2 to describe the implementation of welfare reform in which welfare
workers communicated little to clients about the new policies.
Wrong Intervention
The second category of program failure—namely, delivery of the wrong intervention
—can occur in several ways. One is that the mode of delivery negates the intervention.
An example is the Performance Contracting experiment, in which private firms that
contracted to teach mathematics and reading were paid in proportion to pupils’ gains in
achievement. The companies faced extensive difficulties in delivering the program at
school sites. In some sites the school system sabotaged the experiments, and in others
EXHIBIT 6-G
On the Front Lines: Are Welfare Workers Implementing Policy Reforms?
In the early 1990s, the state of California initiated the Work Pays demonstration
project, which expanded the state job preparation program (JOBS) and modified
Aid to Families with Dependent Children (AFDC) welfare policies to increase the
incentives and support for finding employment. The Work Pays demonstration was
designed to “substantially change the focus of the AFDC program to promote work
over welfare and self-sufficiency over welfare dependence.”
The workers in the local welfare offices were a vital link in the implementation of
Work Pays. The intake and redetermination interviews they conducted represented
virtually the only in-person contact that most clients had with the welfare system.
This fact prompted a team of evaluators to study how welfare workers were
communicating the Work Pays policies during their interactions with clients.
To assess the welfare workers’ implementation of the new policies, the evaluators
observed and analyzed the content of 66 intake or redetermination interviews
between workers and clients in four counties included in the Work Pays
demonstration. A structured observation form was used to record the frequency with
which various topics were discussed and to collect information about the
characteristics of the case. These observations were coded on the two dimensions
of interest: (1) information content and (2) positive discretion.
In over 80% of intake and redetermination interviews workers did not provide
and interpret information about welfare reforms. Most workers continued a
pattern of instrumental transactions that emphasized workers’ needs to collect
and verify eligibility information. Some workers coped with the new demand
by providing information about work-related policies, but routinizing the
information and adding it to their standardized, scripted recitations of welfare
rules. Others were coping by particularizing their interactions, giving some of
their clients some information some of the time, on an ad hoc basis.
These findings suggest that welfare reforms were not fully implemented at the street
level in these California counties. Worker-client transactions were consistent with
the processing of welfare claims, the enforcement of eligibility rules, and the
rationing of scarce resources such as JOBS services; they were poorly aligned with
new program objectives emphasizing transitional assistance, work, and self-
sufficiency outside the welfare system. (pp. 18-19)
SOURCE: Adapted by permission from Marcia K. Meyers, Bonnie Glaser, and Karin
MacDonald, “On the Front Lines of Welfare Delivery: Are Workers Implementing Policy
Reforms?”
The third issue is the one with which we began: the degree of conformity between a
program’s design and its implementation. Shortfalls may occur because the program is
not performing the functions it is expected to perform or because it is not performing
them as well as expected. Such discrepancies may lead to efforts to move the implementation of a
project closer to the original design or to a respecification of the design itself. Such
analysis also provides an opportunity to judge the appropriateness of performing an
impact evaluation and, if necessary, to opt for more formative evaluation to develop the
desired convergence of design and implementation.
Summary
Program process evaluation is a form of evaluation designed to describe how a
program is operating and assess how well it performs its intended functions. It builds on
program process theory, which identifies the critical components, functions, and
relationships assumed necessary for the program to be effective. Where process
evaluation is an ongoing function involving repeated measurements over time, it is
referred to as program process monitoring.
The criteria for assessing program process performance may include stipulations
of the program theory, administrative standards, applicable legal, ethical, or
professional standards, and after-the-fact judgment calls.
Program process monitoring takes somewhat different forms and serves different
purposes when undertaken from the perspectives of evaluation, accountability, and
program management, but the types of data required and the data collection procedures
used generally are the same or overlap considerably. In particular, program process
monitoring generally involves one or both of two domains of program performance:
service utilization and organizational functions.
Service utilization issues typically break down into questions about coverage and
bias. The sources of data useful for assessing coverage are program records, surveys of
program participants, and community surveys. Bias in program coverage can be
revealed through comparisons of program users, eligible nonparticipants, and dropouts.
KEY CONCEPTS
Accessibility
The extent to which the structural and organizational arrangements facilitate
participation in the program.
Accountability
The responsibility of program staff to provide evidence to stakeholders and sponsors
that a program is effective and in conformity with its coverage, service, legal, and fiscal
requirements.
Administrative standards
Stipulated achievement levels set by program administrators or other responsible
parties, for example, intake for 90% of the referrals within one month. These levels may
be set on the basis of past experience, the performance of comparable programs, or
professional judgment.
Bias
As applied to program coverage, the extent to which subgroups of a target population
are reached unequally by a program.
Coverage
The extent to which a program reaches its intended target population.
Outcome monitoring
The continual measurement and reporting of indicators of the status of the social
conditions a program is accountable for improving.
Chapter Outline
Program Outcomes
Outcome Level, Outcome Change, and Net Effect
Identifying Relevant Outcomes
Stakeholder Perspectives
Program Impact Theory
Prior Research
Unintended Outcomes
Measuring Program Outcomes
Measurement Procedures and Properties
Reliability
Validity
Sensitivity
Choice of Outcome Measures
Monitoring Program Outcomes
Indicators for Outcome Monitoring
Pitfalls in Outcome Monitoring
Interpreting Outcome Data
Assessing a program’s effects on the clients it serves and the social conditions it aims
to improve is the most critical evaluation task because it deals with the “bottom line”
issue for social programs. No matter how well a program addresses target needs,
embodies a good plan of attack, reaches its target population and delivers apparently
appropriate services, it cannot be judged successful unless it actually brings about some
measure of beneficial change in its given social arena. Measuring that beneficial change,
therefore, is not only a core evaluation function but also a high-stakes activity for the
program. For these reasons, it is a function that evaluators must accomplish with great
care to ensure that the findings are valid and properly interpreted. For these same
reasons, it is one of the most difficult and, often, politically charged tasks the evaluator
undertakes.
Beginning in this chapter and continuing through Chapter 10, we consider how best
to identify the changes a program should be expected to produce, how to devise
measures of these changes, and how to interpret such measures. Consideration of
program effects begins with the concept of a program outcome, so we first discuss that
pivotal concept.
Program Outcomes
An outcome is the state of the target population or the social conditions that a program
is expected to have changed. For example, the amount of smoking among teenagers after
exposure to an antismoking campaign in their high school is an outcome. The attitudes
toward smoking of those who had not yet started to smoke are also an outcome. Similarly,
the “school readiness” of children after attending a preschool program would be an
outcome, as would the body weight of people who completed a weight-loss program,
the management skills of business personnel after a management training program, and
the amount of pollutants in the local river after a crackdown by the local environmental
protection agency.
Notice two things about these examples. First, outcomes are observed
characteristics of the target population or social conditions, not of the program, and the
definition of an outcome makes no direct reference to program actions. Although the
services delivered to program participants are often described as program “outputs,”
outcomes, as defined here, must relate to the benefits those products or services might
have for the participants, not simply their receipt. Thus, “receiving supportive family
therapy” is not a program outcome in our terms but, rather, the delivery of a program
service. Similarly, providing meals to 100 housebound elderly persons is not a program
outcome; it is service delivery, an aspect of program process. The nutritional benefits of
those meals for the health of the elderly, on the other hand, are outcomes, as are any
improvements in their morale, perceived quality of life, and risk of injury from
attempting to cook for themselves. Put another way, outcomes always refer to
characteristics that, in principle, could be observed for individuals or situations that
have not received program services. For instance, we could assess the amount of
smoking, the school readiness, the body weight, the management skills, and the water
pollution in relevant situations where there was no program intervention. Indeed, as we
will discuss later, we might measure outcomes in these situations to compare with those
where the program was delivered.
Second, the concept of an outcome, as we define it, does not necessarily mean that
the program targets have actually changed or that the program has caused them to change
in any way. The amount of smoking by the high school teenagers may not have changed
since the antismoking campaign began, and nobody may have lost any weight during
their participation in the weight-loss program. Alternatively, there may be change but in
the opposite of the expected direction—the teenagers may have increased their smoking,
and program participants may have gained weight. Furthermore, whatever happened
may have resulted from something other than the influence of the program. Perhaps the
weight-loss program ran during a holiday season when people were prone to
overindulge in sweets. Or perhaps the teenagers decreased their smoking in reaction to
news of the smoking-related death of a popular rock music celebrity. The challenge for
evaluators, then, is to assess not only the outcomes that actually obtain but also the
degree to which any change in outcomes is attributable to the program itself.
Outcome Level, Outcome Change, and Net Effect
The foregoing considerations lead to important distinctions in the use of the term
outcome:
Outcome level is the status of an outcome at some point in time (e.g., the amount
of smoking among teenagers).
Outcome change is the difference between outcome levels at different points in
time.
Program effect is that portion of an outcome change that can be attributed
uniquely to a program as opposed to the influence of some other factor.
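These distinctions can be illustrated with simple arithmetic; the figures below are invented solely for illustration, and the counterfactual value stands in for whatever estimate a comparison group would supply:

# Invented figures for a teen antismoking program.
outcome_before = 0.30        # outcome level before the program: 30% of teens smoking
outcome_after = 0.24         # outcome level after the program
counterfactual_after = 0.28  # estimated level had the program been absent

outcome_change = outcome_after - outcome_before         # observed before-after change: -0.06
program_effect = outcome_after - counterfactual_after   # portion attributable to the program: -0.04
print(f"Outcome change: {outcome_change:+.2f}; program effect: {program_effect:+.2f}")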
Consider the graph in Exhibit 7-A, which plots the levels of an outcome measure over time.
The vertical axis represents an outcome variable relevant to a program we wish to
evaluate. An outcome variable is a measurable characteristic or condition of a
program’s target population that could be affected by the actions of the program. It might
be amount of smoking, body weight, school readiness, extent of water pollution, or any
other outcome falling under the definition above. The horizontal axis represents time,
specifically, a period ranging from before the program was delivered to its target
population until some time afterward. The solid line in the graph shows the average
outcome level of a group of individuals who received program services. Note that their
status over time is not depicted as a straight horizontal line but, rather, as a line that
wiggles around. This is to indicate that smoking, school readiness, management skills,
and other such outcome dimensions are not expected to stay constant—they change as a
result of many natural causes and circumstances quite extraneous to the program.
Smoking, for instance, tends to increase from the preteen to the teenage years. Water
pollution levels may fluctuate according to the industrial activity in the region and
weather conditions, for example, heavy rain that dilutes the concentrations.
If we measure the outcome variable (more on this shortly), we can determine how
high or low the target group is with respect to that variable, for example, how much
smoking or school readiness they display. This tells us the outcome level, often simply
called the outcome. When measured after the target population has received program
services, it tells us something about how that population is doing—how many teenagers
are smoking, the average level of school readiness among the preschool children, how
many pollutants there are in the water. If all the teenagers are smoking, we may be
disappointed, and, conversely, if none are smoking, we may be pleased. All by
themselves, however, these outcome levels do not tell us much about how effective the
program was, though they may constrain the possibilities. If all the teens are smoking,
for instance, we can be fairly sure that the antismoking program was not a great success
and possibly was even counterproductive. If none of the teenagers are smoking, that
finding is a strong hint that the program has worked because we would not expect them
all to spontaneously stop on their own. Of course, such extreme outcomes are rarely
found and, in most cases, outcome levels alone cannot be interpreted with any
confidence as indicators of a program’s success or failure.
EXHIBIT 7-A
Outcome Level, Outcome Change, and Program Effect
If we measure outcomes on our target population before and after they participate in
the program, we can describe more than the outcome level; we can also discern
outcome change. If the graph in Exhibit 7-A plotted the school readiness of children in a
preschool program, it would show that the children display less readiness before
participating in the program and greater readiness afterward, a positive change. Even if
their school readiness after the program was not as high as the preschool teachers hoped
it would be, the direction of before-after change shows that there was improvement. Of
course, from this information alone, we do not actually know that the preschool program
had anything to do with the children’s improvement in school readiness. Preschool-aged
children are in a developmental period when their cognitive and motor skills increase
rather rapidly through normal maturational processes. Other factors may also be at
work; for example, their parents may be reading to them and otherwise supporting their
intellectual development and preparation for entering school, and that may account for at
least part of their gain.
The dashed line in Exhibit 7-A shows the trajectory on the outcome variable that
would have been observed if the program participants had not received the program.
For the preschool children, for example, the dashed line shows how their school
Various program stakeholders have their own understanding of what the program is
supposed to accomplish and, correspondingly, what outcomes they expect it to affect.
The most direct sources of information about these expected outcomes usually are the
stated objectives, goals, and mission of the program. Funding proposals and grants or
contracts for services from outside sponsors also often identify outcomes that the
program is expected to influence.
A common difficulty with information from these sources is a lack of the specificity
and concreteness necessary to clearly identify suitable outcome measures. It thus often
falls to the evaluator to translate input from stakeholders into workable form and
negotiate with the stakeholders to ensure that the resulting outcome measures capture
their expectations.
For the evaluator’s purposes, an outcome description must indicate the pertinent
characteristic, behavior, or condition that the program is expected to change. However,
as we discuss shortly, further specification and differentiation may be required as the
evaluator moves from this description to selecting or developing measures of this
outcome. Exhibit 7-B presents examples of outcome descriptions that would usually be
serviceable for evaluation purposes.
EXHIBIT 7-B
Examples of Outcomes Described Specifically Enough to Be Measured
Juvenile delinquency
Behavior of youths under the age of 18 that constitutes chargeable offenses under
applicable laws irrespective of whether the offenses are detected by authorities or
the youth is apprehended for the offense.
Water quality
The absence of substances in the water that are harmful to people and other living
organisms that drink the water or have contact with it.
Cognitive ability
Performance on tasks that involve thinking, problem solving, information
processing, language, mental imagery, memory, and overall intelligence.
School readiness
Children’s ability to learn at the time they enter school; specifically, the health and
physical development, social and emotional development, language and
communication skills, and cognitive skills and general knowledge that enable a
child to benefit from participation in formal schooling.
Proximal outcomes are rarely the ultimate outcomes the program intends to generate,
as can be seen in the examples in Exhibit 7-C. In this regard, they are not the most important
outcomes from a social or policy perspective. This does not mean, however, that they
should be overlooked in the evaluation. These outcomes are the ones the program has
the greatest capability to affect, so it can be very informative to know whether they are
attained. If the program fails to produce these most immediate and direct outcomes, and
the program theory is correct, then the more distal outcomes in the sequence are unlikely
to occur. In addition, the proximal outcomes are generally the easiest to measure and to
attribute to the program’s efforts. If the program is successful at generating these
outcomes, it is appropriate for it to receive credit for doing so. The more distal
outcomes, which are more difficult to measure and attribute, may yield ambiguous
results. Such results will be more balanced and interpretable if information is available
about whether the proximal outcomes were attained.
EXHIBIT 7-C
Examples of Program Impact Theories Showing Expected Program Effects on Proximal
and Distal Outcomes
Prior Research
In identifying and defining outcomes, the evaluator should thoroughly examine prior
research on issues related to the program being evaluated, especially evaluation
research on similar programs. Learning which outcomes have been examined in other
studies may call attention to relevant outcomes that might otherwise have been
overlooked. It will also be useful to determine how various outcomes have been defined
and measured in prior research. In some cases, there are relatively standard definitions
and measures that have an established policy significance. In other cases, there may be
known problems with certain definitions or measures that the evaluator will need to
know about.
Unintended Outcomes
So far, we have been considering how to identify and define those outcomes the
stakeholders expect the program to produce and those that are evident in the program’s
impact theory. There may be significant unintended outcomes of a program, however,
that will not be identified through these means. These outcomes may be positive or
negative, but their distinctive character is that they emerge through some process that is
not part of the program’s design and direct intent. That feature, of course, makes them
very difficult to anticipate. Accordingly, the evaluator must often make a special effort
to identify any potential unintended outcomes that could be significant for assessing the
program’s effects on the social conditions it addresses.
Prior research can often be especially useful on this topic. There may be outcomes
that other researchers have discovered in similar circumstances that can alert the
evaluator to possible unanticipated program effects. In this regard, it is not only other
evaluation research that is relevant but also any research on the dynamics of the social
conditions in which the program intervenes. Research about the development of drug
use and the lives of users, for instance, may provide clues about possible responses to a
program intervention that the program plan has not taken into consideration.
Often, good information about possible unintended outcomes can be found in the
firsthand accounts of persons in a position to observe those outcomes. For this reason,
as well as others we have mentioned elsewhere in this text, it is important for the
evaluator to have substantial contact with program personnel at all levels, program
participants, and other key informants with a perspective on the program and its effects.
If unintended outcomes are at all consequential, there should be someone in the system
who is aware of them and who, if asked, can alert the evaluator to them. These
individuals may not present this information in the language of unintended outcomes, but
their descriptions of what they see and experience in relation to the program will be
interpretable if the evaluator is alert to the possibility that there could be important
program effects not articulated in the program logic or intended by the core
stakeholders.
EXHIBIT 7-D
Examples of the Multiple Dimensions and Aspects That Constitute Outcomes
Diversifying measures can also safeguard against the possibility that poorly
performing measures will underrepresent outcomes and, by not measuring the aspects of
the outcome a program most affects, make the program look less effective than it
actually is. For outcomes that depend on observation, for instance, having more than one
observer may be useful to avoid the biases associated with any one of them. For
instance, an evaluator who was assessing children’s aggressive behavior with their
peers might want the parents’ observations, the teacher’s observations, and those of any
other person in a position to see a significant portion of the child’s behavior. An
example of multiple measures is presented in Exhibit 7-E.
EXHIBIT 7-E
Multiple Measures of Outcomes
Multiple measurement of important outcomes thus can provide for broader coverage
of the concept and allow the strengths of one measure to compensate for the weaknesses
of another. It may also be possible to statistically combine multiple measures into a
single, more robust and valid composite measure that is better than any of the individual
measures taken alone. In a program to reduce family fertility, for instance, changes in
desired family size, adoption of contraceptive practices, and average desired number of
children might all be measured and used in combination to assess the program outcome.
Even when measures must be limited to a smaller number than comprehensive coverage
might require, it is useful for the evaluator to elaborate all the dimensions and variations
in order to make a thoughtful selection from the feasible alternatives.
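One common way to form such a composite, sketched below with invented scores, is to standardize each measure and average the standardized values; this is offered only as an illustration, not as a required procedure, and in practice each measure would first be scored in a consistent direction:

import statistics

def standardize(values):
    # Convert raw scores to z-scores so measures on different scales can be combined.
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Invented scores on three fertility-related measures for five participants.
desired_family_size = [3, 2, 4, 2, 3]
contraceptive_adoption = [1, 1, 0, 1, 0]
desired_number_of_children = [2, 2, 3, 1, 3]

measures = [desired_family_size, contraceptive_adoption, desired_number_of_children]
z_scores = [standardize(m) for m in measures]
composite = [sum(person) / len(person) for person in zip(*z_scores)]
print([round(score, 2) for score in composite])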
Reliability
The reliability of a measure is the extent to which the measure produces the same
results when used repeatedly to measure the same thing. Variation in those results
constitutes measurement error. So, for example, a postal scale is reliable to the extent
that it reports the same “score” (weight) for the same envelope on different occasions.
No measuring instrument, classification scheme, or counting procedure is perfectly
reliable, but different types of measures have reliability problems to varying degrees.
Measurements of physical characteristics for which standard measurement devices are
available, such as height and weight, will generally be more consistent than
measurements of psychological characteristics, such as intelligence measured with an
IQ test. Performance measures, such as standardized IQ tests, in turn, have been found to
be more reliable than measures relying on recall, such as reports of household
expenditures for consumer goods. For evaluators, a major source of unreliability lies in
the nature of measurement instruments that are based on participants’ responses to
written or oral questions posed by researchers. Differences in the testing or measuring
situation, observer or interviewer differences in the administration of the measure, and
even respondents’ mood swings contribute to unreliability.
The effect of unreliability in measures is to dilute and obscure real differences. A
truly effective intervention, the outcome of which is measured unreliably, will appear to
be less effective than it actually is. The most straightforward way for the evaluator to
check the reliability of a candidate outcome measure is to administer it at least twice
under circumstances when the outcome being measured should not change between
administrations of the measure. Technically, the conventional index of this test-retest
reliability is a statistic known as the product moment correlation between the two sets
of scores, which varies between .00 and 1.00. For many outcomes, however, this check
is difficult to make because the outcome may change between measurement applications
that are not closely spaced. For example, questionnaire items asking students how well
they like school may be answered differently a month later, not because the measurement
is unreliable but because intervening events have made the students feel differently
about school. When the measure involves responses from people, on the other hand,
closely spaced measures are contaminated because respondents remember their prior
response rather than generating it anew. When the measurement cannot be repeated
before the outcome can change, reliability is usually checked by examining the
consistency among similar items in a multi-item measure administered at the same time
(referred to as internal consistency reliability).
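A rough sketch of both reliability checks, computed from invented scores; the scales, items, and respondents below are hypothetical:

import statistics

def test_retest_reliability(time1, time2):
    # Product moment correlation between two administrations of the same measure.
    m1, m2 = statistics.mean(time1), statistics.mean(time2)
    covariance = sum((a - m1) * (b - m2) for a, b in zip(time1, time2)) / len(time1)
    return covariance / (statistics.pstdev(time1) * statistics.pstdev(time2))

def cronbach_alpha(items):
    # Internal consistency reliability; items is a list of per-item score lists.
    k = len(items)
    item_variance_sum = sum(statistics.pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - item_variance_sum / statistics.pvariance(totals))

# Invented: the same attitude scale given twice to six respondents.
print(round(test_retest_reliability([12, 15, 9, 14, 11, 13], [13, 14, 10, 15, 10, 12]), 2))
# Invented: three items from a single administration to the same six respondents.
print(round(cronbach_alpha([[4, 5, 3, 5, 4, 4], [3, 5, 2, 4, 4, 3], [4, 4, 3, 5, 3, 4]]), 2))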
For many of the ready-made measures that evaluators use, reliability information
will already be available from other research or from reports of the original
development of the measure. Reliability can vary according to the sample of
respondents and the circumstances of measurement, however, so it is not always safe to
assume that a measure that has been shown to be reliable in other applications will be
reliable when used in the evaluation.
There are no hard-and-fast rules about acceptable levels of reliability. The extent to
which measurement error can obscure a meaningful program outcome depends in large
part on the magnitude of that outcome. We will discuss this issue further in Chapter 10.
As a rule of thumb, however, researchers generally prefer that their measures have
reliability coefficients of .90 or above, a range that keeps measurement error small
relative to all but the smallest outcomes. For many outcome measures applied under the
circumstances characteristic of program evaluation, however, this is a relatively high
standard.
Validity
The issue of measurement validity is more difficult than the problem of reliability.
The validity of a measure is the extent to which it measures what it is intended to
measure. For example, juvenile arrest records provide a valid measure of delinquency
only to the extent that they accurately reflect how much the juveniles have engaged in
chargeable offenses. To the extent that they also reflect police arrest practices, they are
not valid measures of the delinquent behavior of the juveniles subject to arrest.
Although the concept of validity and its importance are easy to comprehend, it is
usually difficult to test whether a particular measure is valid for the characteristic of
interest. With outcome measures used for evaluation, validity turns out to depend very
much on whether a measure is accepted as valid by the appropriate stakeholders.
Confirming that it represents the outcome intended by the program when that outcome is
fully and carefully described (as discussed earlier) can provide some assurance of
validity for the purposes of the evaluation. Using multiple measures of the outcome in
combination can also provide some protection against the possibility that any one of
those measures does not tap into the actual outcome of interest.
Empirical demonstrations of the validity of a measure depend on some comparison
that shows that the measure yields the results that would be expected if it were, indeed,
valid. For instance, when the measure is applied along with alternative measures of the
same outcome, such as those used by other evaluators, the results should be roughly the
same. Similarly, when the measure is applied to situations recognized to differ on the
outcome at issue, the results should differ. Thus, a measure of environmental attitudes
should sharply differentiate members of the local Sierra Club from members of an off-
road dirt bike association. Validity is also demonstrated by showing that results on the
measure relate to or “predict” other characteristics expected to be related to the
outcome. For example, a measure of environmental attitudes should be related to how
favorably respondents feel toward political candidates with different positions on
environmental issues.
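As one illustration of such a known-groups check, the sketch below compares invented scale scores for two hypothetical groups expected to differ sharply on the attitude in question:

import statistics

# Invented scores on a hypothetical environmental-attitudes scale.
conservation_club_members = [42, 45, 39, 47, 44]
dirt_bike_club_members = [28, 31, 25, 33, 30]

difference = (statistics.mean(conservation_club_members)
              - statistics.mean(dirt_bike_club_members))
# A measure that failed to separate groups known to differ on the outcome
# would cast doubt on its validity for that outcome.
print(f"Mean difference between the two groups: {difference:.1f} scale points")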
Sensitivity
As the discussion so far has implied, selecting the best measures for assessing
outcomes is a critical measurement problem in evaluations (Rossi, 1997). We
recommend that evaluators invest the necessary time and resources to develop and test
appropriate outcome measures (Exhibit 7-F provides an instructive example). A poorly
conceptualized outcome measure may not properly represent the goals and objectives of
the program being evaluated, leading to questions about the validity of the measure. An
unreliable or insufficiently sensitive outcome measure is likely to underestimate the
effectiveness of a program and could lead to incorrect inferences about the program’s
impact. In short, a measure that is poorly chosen or poorly conceived can completely
undermine the worth of an impact assessment by producing misleading estimates. Only
if outcome measures are valid, reliable, and appropriately sensitive can impact
estimates be regarded as credible.
EXHIBIT 7-F
Reliability and Validity of Self-Report Measures With Homeless Mentally Ill Persons
Psychiatric symptoms. Self-report on the Brief Symptom Inventory (BSI) was the
primary measure used in the evaluation to assess psychiatric symptoms. Internal
consistency reliability was examined for five waves of data collection and showed
generally high reliabilities (.76-.86) on the scales for anxiety, depression, hostility,
and somatization but lower reliability for psychoticism (.65-.67). To obtain
evidence for the validity of these scales, correlations were obtained between them
and comparable scales from the Brief Psychiatric Rating Schedule (BPRS), rated
for clients by master’s-level psychologists and social workers. Across the five
waves of data collection, these correlations showed modest agreement (.40-.60) for
anxiety, depression, hostility, and somatization. However, there was little agreement
regarding psychotic symptoms (–.01 to .22).
Substance abuse. The evaluation measure was clients’ estimation of how much they
needed treatment for alcohol and other substance abuse using scales from the
Addiction Severity Index (ASI). For validation, interviewers rated the clients’ need
for alcohol and other substance abuse treatment on the same ASI scales. The
correlations over the five waves of measurement showed moderate agreement,
ranging from .44 to .66 for alcohol and .47 to .63 for drugs. Clients generally
reported less need for service than the interviewers.
Program contact and service utilization. Clients reported how often they had
contact with their assigned program and whether they had received any of 14
specific services. The validity of these reports was tested by comparing them with
case managers’ reports at two of the waves of measurement. Agreement varied
substantially with content area. The highest correlations (.40-.70) were found for
contact with the program, supportive services, and specific resource areas (legal,
housing, financial, employment, health care, medication). Agreement was
considerably lower for mental health, substance abuse, and life skills training
services. The majority of the disagreements involved a case manager reporting
service and the client reporting none.
The evaluators concluded that the use of self-report measures with homeless
mentally ill persons was justified but with caveats: Evaluators should not rely
solely on self-report measures for assessing psychotic symptoms, nor for
information concerning the utilization of mental health and substance abuse services,
since clients provide significant underestimates in these areas.
SOURCE: Adapted from Robert J. Calsyn, Gary A. Morse, W. Dean Klinkenberg, and
Michael L. Trusty, “Reliability and Validity of Self-Report Data of Homeless Mentally
Ill Individuals,” Evaluation and Program Planning, 1997, 20(1): 47-54.
EXHIBIT 7-G
Client Satisfaction Survey Items That Relate to Specific Benefits
SOURCE: Adapted from Lawrence L. Martin and Peter M. Kettner, Measuring the
Performance of Human Service Programs (Thousand Oaks, CA: Sage, 1996), p. 97.
Because of the dynamic nature of the social conditions that typical programs attempt
to affect, the limitations of outcome indicators, and the pressures on program agencies,
there are many pitfalls associated with program outcome monitoring. Thus, while
outcome indicators can be a valuable source of information for program
decisionmakers, they must be developed and used carefully.
EXHIBIT 7-H
A Convincing Pre-Post Outcome Design for a Program to Reduce Residential Lead
Levels in Low-Income Housing
To reduce lead dust levels in low-income urban housing, the Community Lead
Education and Reduction Corps (CLEARCorps) was initiated in Baltimore as a
joint public-private effort. CLEARCorps members clean, repair, and make homes
lead safe, educate residents on lead-poisoning prevention techniques, and encourage
the residents to maintain low levels of lead dust through specialized cleaning
efforts. To determine the extent to which CLEARCorps was successful in reducing
the lead dust levels in treated urban housing units, CLEARCorps members collected
lead dust wipe samples immediately before, immediately after, and six months
following their lead hazard control efforts. In each of 43 treated houses, four
samples were collected from each of four locations—floors, window sills, window
wells, and carpets—and sent to laboratories for analysis.
Statistically significant differences were found between pre and post lead dust
levels for floors, window sills, and window wells. At the six-month follow-up,
further significant declines were found for floors and window wells, with a
marginally significant decrease for window sills.
Since no control group was used, it is possible that factors other than the
CLEARCorps program contributed to the decline in lead dust levels found in the
evaluation. Aside from modest seasonal effects related to the follow-up period and the
small possibility, for which no evidence was available, that another intervention
program treated these same households, there are few plausible alternative
explanations for the decline. The evaluators concluded, therefore, that
the CLEARCorps program was effective in reducing residential lead levels.
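For designs like the one in Exhibit 7-H, the before-after comparison is typically tested with a paired procedure. A minimal sketch using invented readings (not the CLEARCorps data) and the scipy library, assuming it is available:

from scipy import stats

# Invented pre- and post-intervention lead dust readings for ten housing units.
pre = [480, 350, 610, 420, 530, 390, 450, 500, 370, 440]
post = [210, 190, 300, 180, 260, 170, 220, 240, 160, 200]

t_statistic, p_value = stats.ttest_rel(pre, post)
# A small p-value indicates the decline is unlikely to be chance alone, although
# without a control group it cannot be attributed to the program with certainty.
print(f"Paired t = {t_statistic:.2f}, p = {p_value:.4f}")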
Summary
Programs are designed to affect some problem or need in positive ways.
Evaluators assess the extent to which a program produces a particular improvement by
measuring the outcome, the state of the target population or social condition that the
program is expected to have changed.
Because outcomes are affected by events and experiences that are independent of
a program, changes in the levels of outcomes cannot be directly interpreted as program
effects.
KEY CONCEPTS
Impact
See program effect.
Outcome
The state of the target population or the social conditions that a program is expected to
have changed.
Outcome change
The difference between outcome levels at different points in time. See also outcome
level.
Outcome level
The status of an outcome at some point in time. See also outcome.
Program effect
That portion of an outcome change that can be attributed uniquely to a program, that is,
with the influence of other sources controlled or removed; also termed the program’s
impact. See also outcome change.
Reliability
The extent to which a measure produces the same results when used repeatedly to
measure the same thing.
Sensitivity
The extent to which the values on a measure change when there is a change or difference
in the thing being measured.
Chapter Outline
When Is an Impact Assessment Appropriate?
Key Concepts in Impact Assessment
Experimental Versus Quasi-Experimental Research Designs
“Perfect” Versus “Good Enough” Impact Assessments
Randomized Field Experiments
Using Randomization to Establish Equivalence
Units of Analysis
The Logic of Randomized Experiments
Examples of Randomized Experiments in Impact Assessment
Prerequisites for Conducting Randomized Field Experiments
Approximations to Random Assignment
Data Collection Strategies for Randomized Experiments
Complex Randomized Experiments
Analyzing Randomized Experiments
Limitations on the Use of Randomized Experiments
Programs in Early Stages of Implementation
Ethical Considerations
Differences Between Experimental and Actual Intervention Delivery
Time and Cost
Integrity of Experiments
Impact assessments are undertaken to find out whether programs actually produce
the intended effects. Such assessments cannot be made with certainty but only with
varying degrees of confidence. A general principle applies: The more rigorous the
research design, the more confident we can be about the validity of the resulting
estimate of intervention effects.
The design of impact evaluations needs to take into account two competing
pressures. On one hand, evaluations should be undertaken with sufficient rigor that
relatively firm conclusions can be reached. On the other hand, practical
considerations of time, money, cooperation, and protection of human subjects limit
the design options and methodological procedures that can be employed.
Evaluators assess the effects of social programs by comparing information about
outcomes for program participants with estimates of what their outcomes would have
been had they not participated. This chapter discusses the strongest research design
for accomplishing this objective—the randomized field experiment. Randomized
experiments compare groups of targets that have been randomly assigned to either
experience some intervention or not. Although practical considerations may limit the
use of randomized field experiments in some program situations, evaluators need to
be familiar with them. The logic of the randomized experiment is the basis for the
design of all types of impact assessments and the analysis of the data from them.
Impact assessments are designed to determine what effects programs have on their
intended outcomes and whether perhaps there are important unintended effects. As
described in Chapter 7, a program effect, or impact, refers to a change in the target
population or social conditions that has been brought about by the program, that is, a
change that would not have occurred had the program been absent. The problem of
establishing a program’s impact, therefore, is identical to the problem of establishing
that the program is a cause of some specified effect.
In the social sciences, causal relationships are ordinarily stated in terms of
probabilities. Thus, the statement “A causes B” usually means that if we introduce A, B is
more likely to result than if we do not introduce A. This statement does not imply that B
always results from A, nor does it mean that B occurs only if A happens first. To
illustrate, consider a job training program designed to reduce unemployment. If
successful, it will increase the probability that participants will subsequently be
employed. Even a very successful program, however, will not result in employment for
every participant. The likelihood of finding a job is related to many factors that have
nothing to do with the effectiveness of the training program, such as economic
conditions in the community. Correspondingly, some of the program participants would
have found jobs even without the assistance of the program.
The critical issue in impact evaluation, therefore, is whether a program produces
desired effects over and above what would have occurred without the intervention or, in
some cases, with an alternative intervention. In this chapter, we consider the strongest
research design available for addressing this issue, the randomized field experiment.
We begin with some general considerations about doing impact assessments.
Our discussion of the available options for impact assessment is rooted in the view
that the most valid way to establish the effects of an intervention is a randomized field
experiment, often called the “gold standard” research design for assessing causal
effects. The basic laboratory version of a randomized experiment is no doubt familiar.
Participants are randomly sorted into at least two groups. One group is designated the
control group and receives no intervention or an innocuous one; the other group, called
the intervention group, is given the intervention being tested. Outcomes are then
observed for both the intervention and the control groups, with any differences being
attributed to the intervention.
The control conditions for a randomized field experiment are established in similar
fashion. Targets are randomly assigned to an intervention group, to which the
intervention is administered, and a control group, from which the intervention is
withheld. There may be several intervention groups, each receiving a different
intervention or variation of an intervention, and sometimes several control groups, each
also receiving a different variant, for instance, no intervention, a placebo intervention,
and the treatment normally available to targets in the circumstances to which the
program intervention applies.
All the remaining impact assessment designs consist of nonrandomized quasi-
experiments in which targets who participate in a program (the “intervention” group)
are compared with nonparticipants (the “controls”) who are presumed to be similar to
participants in critical ways. These techniques are called quasi-experimental because
they lack the random assignment to conditions that is essential for true experiments. The
main approaches to establishing nonrandomized control groups in impact assessment
designs are discussed in the next chapter.
Designs using nonrandomized controls universally yield less convincing results than
well-executed randomized field experiments. From the standpoint of validity in the
estimation of program effects, therefore, the randomized field experiment is always the
optimal choice for impact assessment. Nevertheless, quasi-experiments are useful for
impact assessment when it is impractical or impossible to conduct a true randomized
experiment.
The strengths and weaknesses of different research designs for assessing program
effects, and the technical details of implementing them and analyzing the resulting data,
are major topics in evaluation. The classic texts are Campbell and Stanley (1966) and
Cook and Campbell (1979). More recent accounts that evaluators may find useful are
Shadish, Cook, and Campbell (2002) and Mohr (1995).
For several reasons, evaluators are confronted all too frequently with situations
where it is difficult to implement the “very best” impact evaluation design. First, the
designs that are best in technical terms sometimes cannot be applied because the
intervention or target coverage does not lend itself to that sort of design. For example,
the circumstances in which randomized experiments can be ethically and practicably
carried out with human subjects are limited, and evaluators must often use less rigorous
designs. Second, time and resource constraints always limit design options. Third, the
justification for using the best design, which often is the most costly one, varies with the
importance of the intervention being tested and the intended use of the results. Other
things being equal, an important program—one that is of interest because it attempts to
remedy a very serious condition or employs a controversial intervention—should be
evaluated more rigorously than other programs. At the other extreme, some trivial
programs probably should not have impact assessments at all.
Our position is that evaluators must review the range of design options in order to
determine the most appropriate one for a particular evaluation. The choice always
involves trade-offs; there is no single, always-best design that can be used universally
in all impact assessments. Rather, we advocate using what we call the “good enough”
rule in formulating research designs. Stated simply, the evaluator should choose the
strongest possible design from a methodological standpoint after having taken into
account the potential importance of the results, the practicality and feasibility of each
design, and the probability that the design chosen will produce useful and credible
results. For the remainder of this chapter, we will focus on randomized field
experiments as the most methodologically rigorous design and, therefore, the starting
point for considering the best possible design that can be applied for impact assessment.
Identical composition. Intervention and control groups contain the same mixes of
persons or other units in terms of their program-related and outcome-related
characteristics.
Identical predispositions. Intervention and control groups are equally disposed
toward the project and equally likely, without intervention, to attain any given
outcome status.
Identical experiences. Over the time of observation, intervention and control
groups experience the same time-related processes—maturation, secular drifts,
interfering events, and so forth.
The best way to achieve equivalence between intervention and control groups is to
use randomization to allocate members of a target population to the two groups.
Randomization is a procedure that allows chance to decide whether a person (or other
unit) receives the program or the control condition alternative. It is important to note
that “random” in this sense does not mean haphazard or capricious. On the contrary,
randomly allocating targets to intervention and control groups requires considerable
care to ensure that every unit in a target population has the same probability as any other
to be selected for either group.
To create a true random assignment, an evaluator must use an explicit chance-based
procedure such as a random number table, roulette wheel, roll of dice, or the like. For
convenience, researchers typically use random number sequences. Tables of random
numbers are included in most elementary statistics or sampling textbooks, and many
computer statistical packages contain subroutines that generate random numbers. The
essential step is that the decision about the group assignment for each participant in the
impact evaluation is made solely on the basis of the next random result, for instance, the
next number in the random number table (e.g., odd or even). (See Boruch, 1997, and
Boruch and Wothke, 1985, for discussions of how to implement randomization.)
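In current practice the random number table is usually replaced by a computer’s pseudorandom number generator. The sketch below, using hypothetical unit identifiers, shows one way the allocation might be carried out so that every unit has the same probability of ending up in either group:

import random

def randomize(units, seed=12345):
    # A fixed seed lets the allocation be documented and reproduced.
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

unit_ids = [f"unit_{i:03d}" for i in range(1, 201)]  # hypothetical identifiers for 200 target units
intervention_group, control_group = randomize(unit_ids)
print(len(intervention_group), len(control_group))   # 100 100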
Because the resulting intervention and control groups differ from one another only
by chance, whatever influences may be competing with an intervention to produce
outcomes are present in both groups to the same extent, except for chance fluctuations.
This follows from the same chance processes that tend to produce equal numbers of
heads and tails when a handful of coins is tossed into the air. For example, with
randomization, persons whose characteristics make them more responsive to program
services are as likely to be in the intervention as the control group. Hence, both groups
should have the same proportion of persons favorably predisposed to benefit from the
intervention.
Of course, even though target units are assigned randomly, the intervention and
control groups will never be exactly equivalent. For example, more women may end up
in the control group than in the intervention group simply by chance. But if the random
assignment were made over and over, those fluctuations would average out to zero. The
expected proportion of times that a difference of any given size on any given
characteristic will be found in a series of randomizations can be calculated from
statistical probability models. Any given difference in outcome among randomized
intervention and control groups, therefore, can be compared to what is expected on the
basis of chance (i.e., the randomization process). Statistical significance testing can then
be used to guide a judgment about whether a specific difference is likely to have
occurred simply by chance or more likely represents the effect of the intervention. Since
the intervention in a well-run experiment is the only difference other than chance
between intervention and control groups, such judgments become the basis for
discerning the existence of a program effect. The statistical procedures for making such
calculations are quite straightforward and may be found in any text dealing with
statistical inference in experimental design.
One implication of the role of chance and statistical significance testing is that
impact assessments require more than just a few cases. The larger the number of units
randomly assigned to intervention and control groups, the more likely those groups are
to be statistically equivalent. This occurs for the same reason that tossing 1,000 coins is
less likely to deviate proportionately from a 50-50 split between heads and tails than tossing 2 coins.
Studies in which only one or a few units are in each group rarely, if ever, suffice for
impact assessments, since the odds are that any division of a small number of units will
result in differences between them. This and related matters are discussed more fully in
Chapter 10.
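The role of chance can also be made visible by repeating the random assignment many times on the same set of invented outcome scores and observing how the group difference fluctuates; this is only an illustrative simulation, not a substitute for a formal significance test:

import random
import statistics

# Invented outcome scores for 20 target units.
scores = [14, 18, 15, 17, 16, 19, 15, 18, 17, 16,
          13, 15, 12, 14, 16, 13, 14, 15, 12, 14]

rng = random.Random(7)
chance_differences = []
for _ in range(5000):                 # repeat the random assignment many times
    shuffled = scores[:]
    rng.shuffle(shuffled)
    group_a, group_b = shuffled[:10], shuffled[10:]
    chance_differences.append(statistics.mean(group_a) - statistics.mean(group_b))

print(f"Average chance difference: {statistics.mean(chance_differences):+.3f}")  # close to zero
large = sum(abs(d) >= 1.5 for d in chance_differences) / len(chance_differences)
print(f"Share of chance differences of 1.5 points or more: {large:.3f}")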
Units of Analysis
The units on which outcome measures are taken in an impact assessment are called
the units of analysis. The units of analysis in an experimental impact assessment are not
necessarily persons. Social programs may be designed to affect a wide variety of
targets, including individuals, families, neighborhoods and communities, organizations
such as schools and business firms, and political jurisdictions from counties to whole
nations. The logic of impact assessment remains constant as one moves from one kind of
unit to another, although the costs and difficulties of conducting a field experiment may
increase with the size and complexity of units. Implementing a field experiment and
gathering data on 200 students, for instance, will almost certainly be easier and less
costly than conducting a comparable evaluation with 200 classrooms or 200 schools.
The choice of the units of analysis should be based on the nature of the intervention
and the target units to which it is delivered. A program designed to affect communities
through block grants to local municipalities requires that the units studied be
municipalities. Notice that, in this case, each municipality would constitute one unit for
the purposes of the analysis. Thus, an impact assessment of block grants that is
conducted by contrasting two municipalities has a sample size of two—quite inadequate
for statistical analysis even though observations may be made on large numbers of
individuals within each of the two communities.
The evaluator attempting to design an impact assessment should begin by identifying
the units that are designated as the targets of the intervention in question and that,
therefore, should be specified as the units of analysis. In most cases, defining the units
of analysis presents no ambiguity; in other cases, the evaluator may need to carefully
appraise the intentions of the program’s designers. In still other cases, interventions may
be addressed to more than one type of target: A housing subsidy program, for example,
may be designed to upgrade both the dwellings of individual poor families and the
housing stocks of local communities. Here the evaluator may wish to design an impact
assessment that consists of samples of individual households within samples of local
communities. Such a design would incorporate two types of units of analysis in order to
estimate the impact of the program on individual households and also on the housing
stocks of local communities. Such multilevel designs follow the same logic as field
experiments with a single type of unit but involve more complex statistical analysis
(Murray, 1998; Raudenbush and Bryk, 2002).
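The statistical side of such a multilevel design is commonly handled with a mixed (hierarchical) model. The following is a minimal sketch, assuming a pandas DataFrame named df with hypothetical columns housing_quality (a household outcome), treated (1 if the household's community received the program), and community (a community identifier); it illustrates one common approach rather than a prescribed analysis.

import statsmodels.formula.api as smf

def fit_multilevel_sketch(df):
    # A random intercept for each community accounts for households being
    # clustered within communities when estimating the program effect.
    model = smf.mixedlm("housing_quality ~ treated", data=df, groups=df["community"])
    result = model.fit()
    return result.summary()
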
EXHIBIT 8-A
Schematic Representation of a Randomized Experiment
Examples of Randomized Experiments in Impact Assessment
Several examples can serve to illustrate the logic of randomized field experiments
as applied to actual impact assessments as well as some of the difficulties encountered
in real-life evaluations. Exhibit 8-B describes a randomized experiment to test the
effectiveness of an intervention to improve the nutritional composition of the food eaten
by schoolchildren. Several of the experiment’s features are relevant here. First, note that
the units of analysis were schools and not, for example, individual students.
Correspondingly, entire schools were assigned to either the intervention or control
condition. Second, note that a number of outcome measures were employed, covering
the multiple nutritional objectives of the intervention. It is also appropriate that
statistical tests were used to judge whether the effects (the intervention group’s lower
intake of overall calories and calories from fat) were simply chance differences.
CATCH was a randomized controlled field trial in which the basic units were 96
elementary schools in California, Louisiana, Minnesota, and Texas, with 56
randomly assigned to be intervention sites and 40 to be controls. The intervention
program included training sessions for the food service staffs informing them of the
rationale for nutritionally balanced school menus and providing recipes and menus
that would achieve that goal. Training sessions on nutrition and exercise were given
to teachers, and school administrations were persuaded to make changes in the
physical education curriculum for students. In addition, efforts were made to reach
the parents of participating students with nutritional information.
EXHIBIT 8-C
Assessing the Effects of a Service Innovation
In light of recent trends toward consumer-delivered mental health services, that is,
services provided by persons who have themselves been mentally ill and received
treatment, a community mental health center became interested in the possibility
that consumers might be more effective case managers than nonconsumers. Former
patients might have a deeper understanding of mental illness because of their own
experience and may establish a better empathic bond with patients, both of which
could result in more appropriate service plans.
Data were collected through interviews and standardized scales at baseline and one
month and then one year after assignment to case management. The measures
included social outcomes (housing, arrests, income, employment, social networks)
and clinical outcomes (symptoms, level of functioning, hospitalizations, emergency
room visits, medication attitudes and compliance, satisfaction with treatment,
quality of life). The sample size and statistical analysis were planned to have
sufficient statistical power to detect meaningful differences, with special attention to
the possibility that there would be no meaningful differences, which would be an
important finding for a comparison of this sort. Of the 96 participants, 94 continued
receiving services for the duration of the study and 91 of them were located and
interviewed at the one-year follow-up.
SOURCE: Adapted from Phyllis Solomon and Jeffrey Draine, “One-Year Outcomes of a
Randomized Trial of Consumer Case Management.” Evaluation and Program
Planning, 1995, 18(2):117-127.
Exhibit 8-D describes one of the largest and best-known field experiments relating
to national policy ever conducted. It was designed to determine whether income support
payments to poor, intact (i.e., two-spouse) families would cause them to reduce the
amount of their paid employment, that is, create a work disincentive. The study was the
first of a series of five sponsored by government agencies, each varying slightly from
the others, to test different forms of guaranteed income and their effects on the work
efforts of poor and near-poor persons. All five experiments were run over relatively
long periods, the longest for more than five years, and all had difficulties maintaining
the cooperation of the initial groups of families involved. The results showed that
income payments created a slight work disincentive, especially for teenagers and
mothers with young children—those in the secondary labor force (Mathematica Policy
Research, 1983; Robins et al., 1980; Rossi and Lyall, 1976; SRI International, 1983).
EXHIBIT 8-D
The New Jersey-Pennsylvania Income Maintenance Experiment
In the late 1960s, when federal officials concerned with poverty began to consider
shifting welfare policy to provide some sort of guaranteed annual income for all
families, the Office of Economic Opportunity (OEO) launched a large-scale field
experiment to test one of the crucial issues in such a program: the prediction of
economic theory that such supplementary income payments to poor families would
be a work disincentive.
The experiment was started in 1968 and carried on for three years, administered by
Mathematica, Inc., a research firm in Princeton, New Jersey, and the Institute for
Research on Poverty of the University of Wisconsin. The target population was two-
parent families with income below 150% of the poverty level and male heads
whose age was between 18 and 58. The eight intervention conditions consisted of
various combinations of income guarantees and the rates at which payments were
taxed in relation to the earnings received by the families. For example, in one of the
conditions a family received a guaranteed income of 125% of the then-current
poverty level, if no one in the family had any earnings. Their plan then had a tax rate
of 50% so that if someone in the family earned income, their payments were reduced
50 cents for each dollar earned. Other conditions consisted of tax rates that ranged
from 30% to 70% and guarantee levels that varied from 50% to 125% of the poverty
line. A control group consisted of families who did not receive any payments.
The experiment was conducted in four communities in New Jersey and one in
Pennsylvania. A large household survey was first undertaken to identify eligible
families, then those families were invited to participate. If they agreed, the families
were randomly allocated to one of the intervention groups or to the control group.
The participating families were interviewed prior to enrollment in the program and
at the end of each quarter over the three years of the experiment. Among other things,
these interviews collected data on employment, earnings, consumption, health, and
various social-psychological indicators. The researchers then analyzed the data
along with the monthly earnings reports to determine whether those receiving
payments diminished their work efforts (as measured in hours of work) in relation to
the comparable families in the control groups.
Although about 1,300 families were initially recruited, by the end of the experiment
22% had discontinued their cooperation. Others had missed one or more interviews
or had dropped out of the experiment for varying periods. Fewer than 700 remained
for analysis of the continuous participants. The overall finding was that families in
the intervention groups decreased their work effort by about 5%.
SOURCE: Summary based on D. Kershaw and J. Fair, The New Jersey Income-
Maintenance Experiment, vol. 1. New York: Academic Press, 1976.
The desirable feature of randomization is that the allocation of eligible targets to the
intervention and control groups is unbiased; that is, the probabilities of ending up in the
intervention or control groups are identical for all participants in the study. There are
several alternatives to randomization as a way of obtaining intervention and control
groups that may also be relatively unbiased under favorable circumstances and thus
constitute acceptable approximations to randomization. In addition, in some cases it can
be argued that, although the groups differ, those differences do not produce bias in
relation to the outcomes of interest. For instance, a relatively common substitute for
randomization is systematic assignment from serialized lists, a procedure that can
accomplish the same end as randomization if the lists are not ordered in some way that
results in bias. To allocate high school students to intervention and control groups, it
might be convenient to place all those with odd ID numbers into the intervention group
and all those with even ID numbers into a control group. Under circumstances where the
odd and even numbers do not differentiate students on some relevant characteristic, such
as odd numbers being assigned to female students and even ones to males, the result
will be statistically the same as random assignment. Before using such procedures,
therefore, the evaluator must establish how the list was generated and whether the
numbering process could bias any allocation that uses it.
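As an illustrative sketch (hypothetical ID numbers, in Python), systematic assignment by ID parity looks like the following. The check at the end only confirms that the groups come out about equal in size; as noted above, the real question is whether parity is related to any outcome-relevant characteristic.

def assign_by_id_parity(student_ids):
    intervention = [sid for sid in student_ids if sid % 2 == 1]  # odd IDs
    control = [sid for sid in student_ids if sid % 2 == 0]       # even IDs
    return intervention, control

student_ids = list(range(1001, 1201))  # 200 hypothetical serialized ID numbers
intervention, control = assign_by_id_parity(student_ids)
print(len(intervention), len(control))  # roughly equal group sizes
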
Sometimes ordered lists of targets have subtle biases that are difficult to detect. An
alphabetized list might tempt an evaluator to assign, say, all persons whose last names
begin with “D” to the intervention group and those whose last names begin with “H” to
the control group. In a New England city, this procedure would result in an ethnically
biased selection, because many names of French Canadian origin begin with “D” (e.g.,
DeFleur), while very few Hispanic names begin with “H.” Similarly, numbered lists
may contain age biases if numbers are assigned sequentially. The federal government
assigns Social Security numbers sequentially, for instance, so that individuals with
lower numbers are generally older than those with higher numbers.
There are also circumstances in which biased allocation may be judged as
“ignorable” (Rog, 1994; Rosenbaum and Rubin, 1983). For example, in a Minneapolis
test of the effectiveness of a family counseling program to keep children who might be
placed in foster care in their families, those children who could not be served by the
program because the agency was at full capacity at the time of referral were used as a
control group (AuClaire and Schwartz, 1986). The assumption made was that the time at
which a child was referred had little or nothing to do with the outcome of interest, namely, a
child’s prospects for reconciliation with his or her family. Thus, if the circumstances
that allocate a target to service or denial of service (or, perhaps, a waiting list for
service) are unrelated to the characteristics of the target, the result may be an acceptable
approximation to randomization.
Whether events that divide targets into those receiving and not receiving program
services operate to make an unbiased allocation, or have biases that can be safely
ignored, must be judged through close scrutiny of the circumstances. If there is any
reason to suspect that the events in question affect targets with certain characteristics
more than others, then the results will not be an acceptable approximation to
randomization unless those characteristics can be confidently declared irrelevant to the
outcomes at issue. For example, communities that have fluoridated their water supplies
cannot be regarded as an intervention group to be contrasted with those who have not
for purposes of assessing the effects of fluoridation on dental health. Those communities
that adopt fluoridation are quite likely to have distinctive characteristics (e.g., lower
average age and more service-oriented government) that cannot be regarded as
irrelevant to dental health and thus represent bias in the sense used here.
Two strategies for data collection can improve the estimates of program effects that
result from randomized experiments. The first is to make multiple measurements of the
outcome variable, preferably both before and after the intervention that is being
assessed. As mentioned earlier, sometimes the outcome variable can be measured only
after the intervention, so that no pretest is possible. Such cases aside, the general rule is
that the more measurements of the outcome variables made before and after the
intervention, the better the estimates of program effect. Measures taken before an
intervention indicate the preintervention states of the intervention and control groups
and are useful for making statistical adjustments for any preexisting differences that are
not fully balanced by the randomization. They are also helpful for determining just how
much gain an intervention produced. For example, in the assessment of a vocational
retraining project, preintervention measures of earnings for individuals in intervention
and control groups would enable the researchers to better estimate the amount by which
earnings improved as a result of the training.
The second strategy is to collect data periodically during the course of an
intervention. Such periodic measurements allow evaluators to construct useful accounts
of how an intervention works over time. For instance, if the vocational retraining effort
is found to produce most of its effects during the first four weeks of a six-week program,
shortening the training period might be a reasonable option for cutting costs without
seriously impairing the program’s effectiveness. Likewise, periodic measurements can
lead to a fuller understanding of how targets react to services. Some reactions may start
slowly and then accelerate; others may be strong initially but trail off as time goes on.
EXHIBIT 8-E
Making Welfare Work and Work Pay: The Minnesota Family Investment Program
A frequent criticism of the Aid to Families with Dependent Children (AFDC)
program was that it did not encourage recipients to leave the welfare rolls and seek
employment because AFDC payments were typically more than could be earned in
low-wage employment. The state of Minnesota received a waiver from the federal
Department of Health and Human Services to conduct an experiment that would
encourage AFDC clients to seek employment and allow them to receive greater
income than AFDC would allow if they succeeded. The main modification
embodied in the Minnesota Family Investment Program (MFIP) increased AFDC
benefits by 20% if participants became employed and reduced their benefits by only
$1 for every $3 earned through employment. A child care allowance was also
provided so that those employed could obtain child care while working. This meant
that AFDC recipients who became employed under this program had more income
than they would have received under AFDC.
Over the period 1991 to 1994, some 15,000 AFDC recipients in a number of
Minnesota counties were randomly assigned to one of three conditions: (1) an MFIP
intervention group receiving more generous benefits and mandatory participation in
employment and training activities; (2) an MFIP intervention group receiving only
the more generous benefits and not the mandatory employment and training; and (3) a
control group that continued to receive the old AFDC benefits and services. All
three groups were monitored through administrative data and repeated surveys. The
outcome measures included employment, earnings, and satisfaction with the
program.
An analysis covering 18 months and the first 9,000 participants in the experiment
found that the demonstration was successful. MFIP intervention families were more
likely to be employed and, when employed, had larger incomes than control
families. Furthermore, those in the intervention group receiving both MFIP benefits
and mandatory employment and training activities were more often employed and
earned more than the intervention group receiving only the MFIP benefits.
SOURCE: Adapted from Cynthia Miller, Virginia Knox, Patricia Auspos, Jo Anna
Hunter-Manns, and Alan Prenstein, Making Welfare Work and Work Pay:
Implementation and 18 Month Impacts of the Minnesota Family Investment Program.
New York: Manpower Demonstration Research Corporation, 1997.
EXHIBIT 8-F
Analysis of Randomized Experiments: The Baltimore LIFE Program
The Baltimore LIFE experiment was funded by the Department of Labor to test
whether small amounts of financial aid to persons released from prison would help
them make the transition to civilian life and reduce the probability of their being
arrested and returned to prison. The financial aid was configured to simulate
unemployment insurance payments, for which most prisoners are ineligible since
they cannot accumulate work credits while imprisoned.
Persons released from Maryland state prisons to return to Baltimore were randomly
assigned to either an intervention or control group. Those in the intervention group
were eligible for 13 weekly payments of $60 as long as they were unemployed.
Those in the control group were told that they were participating in a research
project but were not offered payment. Researchers periodically interviewed the
participants and monitored their arrest records for a year beyond each prisoner’s
release date. The arrest records yielded the results over the postrelease year shown
in Table 8-F1.
The findings shown in the table are known as main effects and constitute the
simplest representation of experimental results. Since randomization has made the
intervention and control groups statistically equivalent except for the intervention,
the arrest rate differences between them are assumed to be due only to the
intervention plus any chance variability.
The substantive import of the findings is summarized in the last column on the right
of the table, where the differences between the intervention and control groups in
arrest rates are shown for various types of crimes. For theft crimes in the
postrelease year the difference of –8.4 percentage points indicated a potential
intervention effect in the desired direction. The issue then became whether 8.4 was
within the range of expected chance differences, given the sample sizes (n). A
variety of statistical tests are applicable to this situation, including chi-square, t-
tests, and analysis of variance. The researchers used a one-tailed t-test, because the
direction of the differences between the groups was given by the expected effects of
the intervention. The results showed that a difference of –8.4 percentage points or
larger would occur by chance less than five times in every hundred experiments of
the same sample size (statistically significant at p ≤ .05). The researchers concluded
that the difference was large enough to be taken seriously as an indication that the
intervention had its desired effect, at least for theft crimes.
The remaining types of crimes did not show differences large enough to survive the
t-test criterion. In other words, the differences between the intervention and control
groups were within the range where chance fluctuations were sufficient to explain
them according to the conventional statistical standards (p > .05).
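A significance test of this general kind can be sketched as follows (Python; the counts are hypothetical stand-ins, since the exhibit's table is not reproduced here, and this is not the authors' actual computation). The one-tailed test asks how often a difference this large, in the expected direction, would arise by chance.

from statsmodels.stats.proportion import proportions_ztest

arrested = [48, 62]        # hypothetical theft arrests: intervention, control
group_sizes = [216, 216]   # hypothetical group sizes

# alternative='smaller' tests whether the intervention group's arrest
# proportion is lower than the control group's (the expected direction).
z_stat, p_value = proportions_ztest(arrested, group_sizes, alternative="smaller")
print(f"z = {z_stat:.2f}, one-tailed p = {p_value:.3f}")
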
Given these results, the next question is a practical one: Are the differences large
enough in a policy sense? In other words, would a reduction of 8.4 percentage
points in theft crimes justify the costs of the program? To answer this last question,
the Department of Labor conducted a cost-benefit analysis (an approach discussed
in Chapter 11) that showed that the benefits far outweighed the costs.
A more complex and informative way of analyzing the theft crime data using
multiple regression is shown in Table 8-F2. The question posed is exactly the same
as in the previous analysis, but in addition, the multiple regression model takes into
account some of the factors other than the payments that might also affect arrests.
The multiple regression analysis statistically controls those other factors while
comparing the proportions arrested in the control and intervention groups.
In effect, comparisons are made between intervention and control groups within
each level of the other variables used in the analysis. For example, the
unemployment rate in Baltimore fluctuated over the two years of the experiment:
Some prisoners were released at times when it was easy to get jobs, whereas others
were released at less fortunate times. Adding the unemployment rate at time of
release to the analysis reduces the variation among individuals due to that factor and
thereby purifies estimates of the intervention effect.
Note that all the variables added to the multiple regression analysis of Table 8-F2
were ones that were known from previous research to affect recidivism or chances
of finding employment. The addition of these variables strengthened the findings
considerably. Each coefficient indicates the change in the probability of postrelease
arrest associated with each unit of the independent variable in question. Thus, the
–.083 associated with being in the intervention group means that the intervention
reduced the arrest rate for theft crimes by 8.3 percentage points. This corresponds
closely to what was shown in Table 8-F1, above. However, because of the
statistical control of the other variables in the analysis, the chance expectation of a
SOURCE: Adapted from P. H. Rossi, R. A. Berk, and K. J. Lenihan, Money, Work and
Crime: Some Experimental Evidence. New York: Academic Press, 1980.
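A regression of the general kind described in the exhibit can be sketched as follows (a linear probability model in Python using statsmodels; the DataFrame df and its column names are hypothetical, and the sketch is illustrative rather than a reconstruction of the published model).

import statsmodels.formula.api as smf

def estimate_adjusted_effect(df):
    # 'arrested_theft' is a 0/1 outcome; 'payments' marks the intervention group;
    # the remaining covariates stand in for factors known to affect recidivism
    # or the chances of finding employment.
    model = smf.ols(
        "arrested_theft ~ payments + age + prior_arrests + unemployment_at_release",
        data=df,
    )
    result = model.fit()
    # The 'payments' coefficient estimates the change in arrest probability
    # attributable to the intervention, holding the other variables constant.
    return result.params["payments"], result.summary()
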
EXHIBIT 8-G
Analyzing a Complex Randomized Experiment: The TARP Study
The main effects of the interventions are shown in the analyses of variance in Table
8-G. (For the sake of simplicity, only results from the Texas TARP experiment are
shown.) The interventions had no effect on property arrests: The intervention and
control groups differed by no more than would be expected by chance. However, the
interventions had a very strong effect on the number of weeks worked during the
postrelease year: Ex-felons receiving payments worked fewer weeks on the average
than those in the control groups, and the differences were statistically significant. In
short, it seems that the payments did not compete well with crime but competed
quite successfully with employment.
In short, these results seem to indicate that the experimental interventions did not
work in the ways expected and indeed produced undesirable effects. However, an
analysis of this sort is only the beginning. The results suggested to the evaluators that
a set of counterbalancing processes may have been at work. It is known from the
criminological literature that unemployment for ex-felons is related to an increased
probability of rearrest. Hence, the researchers postulated that the unemployment
benefits created a work disincentive represented in the fewer weeks worked by
participants receiving more weeks of benefits or a lower “tax rate” and that this
should have the effect of increasing criminal behavior. On the other hand, the
payments should have reduced the need to engage in criminal behavior to produce
income. Thus, a positive effect of payments in reducing criminal activity may have
been offset by the negative effects of less employment over the period of the
payments so that the total effect on arrests was virtually zero.
To examine the plausibility of this “counterbalancing effects” interpretation, a causal
model was constructed, as shown in Figure 8-G. In that model, negative coefficients
are expected for the effects of payments on employment (the work disincentive) and
for their effects on arrests (the expected intervention effect). The counterbalancing
effect of unemployment, in turn, should show up as a negative coefficient between
employment and arrest, indicating that fewer weeks of employment are associated
with more arrests. The coefficients shown in Figure 8-G were derived empirically
from the data using a statistical technique known as structural equation modeling. As
shown there, the hypothesized relationships appear in both the Texas and Georgia
data.
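The counterbalancing logic can be approximated with two regression equations (an illustrative Python sketch only; the authors estimated a full structural equation model, and the DataFrame df with hypothetical columns weeks_worked, arrests, and payments is assumed for illustration).

import statsmodels.formula.api as smf

def counterbalancing_sketch(df):
    # Work-disincentive path: payments are expected to reduce weeks worked.
    work_eq = smf.ols("weeks_worked ~ payments", data=df).fit()
    # Arrest equation: payments are expected to reduce arrests directly, while
    # fewer weeks worked are expected to increase arrests, offsetting that gain.
    arrest_eq = smf.ols("arrests ~ payments + weeks_worked", data=df).fit()
    return work_eq.params, arrest_eq.params
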
Ethical Considerations
A major obstacle to randomized field experiments is that they are usually costly and
time-consuming, especially large-scale multisite experiments. For this reason, they
should ordinarily not be undertaken to assess program concepts that are very unlikely to
be adopted by decisionmakers or to assess established programs when there is not
significant stakeholder interest in evidence about impact. Moreover, experiments should
not be undertaken when information is needed in a hurry. To underscore this last point, it
should be noted that the New Jersey-Pennsylvania Income Maintenance Experiment
(Exhibit 8-D) cost $34 million (in 1968 dollars) and took more than seven years from
design to published findings. The Seattle and Denver income maintenance experiments
took even longer, with their results appearing in final form long after income
maintenance as a policy had disappeared from the national agenda (Mathematica Policy
Research, 1983; Office of Income Security, 1983; SRI International, 1983).
Integrity of Experiments
Summary
The purpose of impact assessments is to determine the effects that programs have
on their intended outcomes. Randomized field experiments are the flagships of impact
assessment because, when well conducted, they provide the most credible conclusions
about program effects.
Impact assessments may be conducted at various stages in the life of a program.
But because rigorous impact assessments involve significant resources, evaluators
should consider whether a requested impact assessment is justified by the
circumstances.
The methodological concepts that underlie all research designs for impact
assessment are based on the logic of the randomized experiment. An essential feature of
this logic is the division of the targets under study into intervention and control groups
by random assignment. In quasi-experiments, assignment to groups is accomplished by
some means other than true randomization. Evaluators must judge in each set of
circumstances what constitutes a “good enough” research design.
The principal advantage of the randomized experiment is that it isolates the effect
of the intervention being evaluated by ensuring that intervention and control groups are
statistically equivalent except for the intervention received. Strictly equivalent groups
are identical in composition, experiences over the period of observation, and
predispositions toward the program under study. In practice, it is sufficient that the
groups, as aggregates, are comparable with respect to any characteristics that could be
relevant to the outcome.
Although chance fluctuations will create some differences between any two
groups formed through randomization, statistical significance tests allow researchers to
estimate the likelihood that observed outcome differences are due to chance rather than
the intervention being evaluated.
KEY CONCEPTS
Control group
A group of targets that do not receive the program intervention and that is compared on
outcome measures with one or more groups that do receive the intervention. Compare
intervention group.
Intervention group
A group of targets that receive an intervention and whose outcome measures are
compared with those of one or more control groups. Compare control group.
Quasi-experiment
An impact research design in which intervention and control groups are formed by a
procedure other than random assignment.
Randomization
Assignment of potential targets to intervention and control groups on the basis of chance
so that every unit in a target population has the same probability as any other of being
selected for either group.
Units of analysis
The units on which outcome measures are taken in an impact assessment and,
correspondingly, the units on which data are available for analysis. The units of analysis
may be individual persons but can also be families, neighborhoods, communities,
organizations, political jurisdictions, geographic areas, or any other such entities.
Assessing Program Impact: Alternative Designs
Chapter Outline
Bias in Estimation of Program Effects
Selection Bias
Other Sources of Bias
Secular Trends
Interfering Events
Maturation
Bias in Nonrandomized Designs
Quasi-Experimental Impact Assessment
Constructing Control Groups by Matching
Choosing Variables to Match
Matching Procedures
Equating Groups by Statistical Procedures
Multivariate Statistical Techniques
Modeling the Determinants of Outcome
Modeling the Determinants of Selection
Regression-Discontinuity Designs
Reflexive Controls
Simple Pre-Post Studies
Time-Series Designs
Some Cautions About Using Quasi-Experiments for Impact Assessment
Exhibit 9-A
Illustration of Bias in the Estimate of a Program Effect
The bias in this estimate of the program effect comes in because the vocabulary of
young children is not, in fact, static but rather tends to increase over time (and virtually
never decreases under ordinary circumstances). This means that, had the children not
been in the program, their vocabulary would have increased anyway, though perhaps not
as much as with the help of the program. The amount by which it would have increased
on its own is included in our estimate along with the actual program effect. This is the
source of the bias in our estimate (as shown in Exhibit 9-B). It is because of such
natural changes over time in so many aspects of human behavior that before-after
measures almost always produce biased estimates of program effects. Unfortunately for
the evaluator engaged in impact assessment, not all forms of bias that may compromise
impact assessment are as obvious as this one, as we are about to discuss.
Selection Bias
Exhibit 9-B
Bias in the Estimate of the Effect of a Reading Program on Children’s Vocabulary Based
on Before-After Change
If the groups have not been randomly assigned, however, this critical assumption
will be questionable. A group comparison design for which the groups have not been
formed through randomization is known as a nonequivalent comparison design
irrespective of how equivalent the groups may appear. This label emphasizes the fact
that equivalence on outcome, absent program exposure, cannot necessarily be assumed.
When the equivalence assumption does not hold, the difference in outcome between
the groups that would have occurred anyway produces a form of bias in the estimate of
program effects that is known as selection bias. This type of bias is an inherent threat to
the validity of the program effect estimate in any impact assessment using a
nonequivalent (i.e., nonrandomized) group comparison design.
Selection bias gets its name because it arises in situations where some process whose
influences are not fully known selects which individuals will be in which group, as
opposed to assignment to groups being determined by pure chance, whose properties are
known. Imagine, for instance, that we administer the program to a group
of individuals who volunteer to participate and use those who do not volunteer as the
control group. By volunteering, individuals have self-selected which group they will be
in. The selection bias is any difference between volunteers and nonvolunteers that
would show up on the outcome measure if neither group got the program. Because we
are unlikely to know what all the relevant differences are between volunteers and
nonvolunteers, we have limited ability to determine the nature and extent of that bias.
Selection bias, however, does not refer only to bias that results from a deliberate
selection into the program or not, as with the volunteers and nonvolunteers. It often has
much more subtle forms. Suppose, for example, that an evaluator assessing the impact of
a schoolwide drug prevention program finds another nearby school that does not have
the program but is otherwise similar. The evaluator could use the children in that school
as a control group for the children in the school with the program by comparing the two
groups’ levels of drug use at the end of the school year. Even if drug use was the same
for the children in both schools at the beginning of the school year, however, how does
the evaluator know that it would be the same at the end of the year if neither received
the program? There are many personal, cultural, and economic factors that influence
where a child lives and what school he or she attends. These factors operate to “select”
some children to attend the one school and some to attend the other. Whatever
differences these children have that influence their school attendance may also influence
the likelihood that they will use drugs over the course of the school year at issue. To the
extent that happens, there will be selection bias in any estimate of the effects of the drug
prevention program made by comparing drug use at the two schools.
Selection bias can also occur through natural or deliberate processes that cause a
loss of outcome data for members of intervention and control groups that have already
been formed, a circumstance known as attrition. Attrition can occur in two ways: (1)
targets drop out of the intervention or control group and cannot be reached, or (2)
targets refuse to cooperate in outcome measurement. Because the critical issue is
whether the groups would be equivalent except for program effects at the time of the
postprogram outcome measurement, any missing outcome data from cases originally
assigned to intervention or control groups select individuals out of the research design.
Whenever attrition occurs as a result of something other than an explicit chance process
(e.g., using a random number table or coin flip), which is virtually always, differential
attrition has to be assumed. That is, those from the intervention group whose outcome
data are missing cannot be assumed to have the same outcome-relevant characteristics
as those from the control group whose outcome data are missing. It follows that the
comparability of those left in the groups after any attrition will have changed as well,
with all the implications for selection bias.
It should be apparent that random assignment designs are not immune from selection
bias induced by attrition. Random assignment produces groups that are statistically
equivalent at the time of the initial assignment, but it is equivalence at the time of
postprogram outcome measurement that protects against selection bias. Consequently,
selection bias is produced by any attrition from either the intervention or the control
group, or both, such that outcome data are not collected for every unit that was initially
randomized. To maintain the validity of a randomized field experiment, therefore, the
evaluator must prevent or, at least, minimize attrition on outcome measures, or else the
design quickly degrades into a nonequivalent comparison design.
Note that the kind of attrition that degrades the research design is a loss of cases
from outcome measurement. Targets that drop out of the program do not create selection
bias if they still cooperate in outcome measurement. When these targets do not complete
the program, it degrades the program implementation but not the research design for
assessing the impact of the program at whatever degree of implementation results. The
evaluator should thus attempt to obtain outcome measures for everyone in the
intervention group whether or not they actually received the full program. Similarly,
outcome data should be obtained for everyone in the control group even if some ended
up receiving the program or some other relevant service. If full outcome data are
obtained, the validity of the design for comparing the two groups is retained. What
suffers when the intervention group does not receive the full program, or the control
group does not go entirely without service, is the sharpness of the comparison and the
meaning of the resulting estimates of program effect. Whatever program effects are found
represent the
effects of the program as delivered to the intended group, which may be less than the
full program. If the control group receives services, the effect estimates represent what
is gained by providing whatever fuller implementation the intervention group receives
relative to the control group.
In sum, selection bias applies not only to the initial assignment of targets to
intervention and control groups but also to the data available for the groups at the time
of the outcome measurement. That is, selection bias includes all situations in which the
units that contribute outcome measures to a comparison between those receiving and not
receiving the program differ on some inherent characteristics that influence their status
on those outcome measures, aside from those related directly to program participation.
Apart from selection bias, the other factors that can bias the results of an impact
assessment generally have to do with events or experiences other than receiving the
program that occur during the period of the intervention. Estimation of program effects
by comparing outcomes for intervention and control groups requires not only that the
units in both groups be equivalent on outcome-related characteristics but that their
outcome-related experiences during the course of the study be equivalent except for the
difference in program exposure. To the extent that one group has experiences other than
program participation that the other group does not have and that also affect the
outcome, the difference between the outcomes for the two groups will reflect the
program effect plus the effects of that other experience. Those latter effects, of course,
will make the program effect appear larger or smaller than it actually is and thus
constitute bias.
Intervening events or experiences are potential problems even when the units that
receive the program are compared with those same units prior to receiving the program
in a before-and-after design. To the extent that the units have experiences or are
subjected to extraneous events during the program period that are unrelated to the
program but influence the outcomes, before-and-after comparisons yield biased
estimates of program effects. We have already seen an example in the case of the natural
growth in vocabulary that would occur for young children over the period of
participation in a reading program (Exhibit 9-B).
The difficulty for evaluators is that social programs operate in environments in
which ordinary or natural sequences of events inevitably influence the outcomes of
interest. For example, many persons who recover from acute illnesses do so naturally
because ordinary body defenses are typically sufficient to overcome such illnesses.
Thus, medical experiments testing a treatment for some pathological condition—
influenza, say—must distinguish the effects of the intervention from the changes that
would have occurred without the treatment or the estimates of treatment effects will be
quite biased. The situation is similar for social interventions. A program for training
young people in particular occupational skills must contend with the fact that some
people will obtain the same skills in ways that do not involve the program. Likewise,
assessments of a program to reduce poverty must consider that some families and
individuals will become better off economically without outside help.
The experiences and events that may produce bias in impact assessments generally
fall into three categories: secular trends, interfering events, and maturation.
Secular Trends
Relatively long-term trends in the community, region, or country, sometimes termed
secular drift, may produce changes that enhance or mask the apparent effects of a
program. In a period when a community’s birth rate is declining, a program to reduce
fertility may appear effective because of bias stemming from that downward trend.
Similarly, a program to upgrade the quality of housing occupied by poor families may
appear to be more effective than it actually is because of upward national trends in real
income that enable everyone to put more resources into their housing. Secular trends can
also produce bias that masks the real impact of programs. An effective project to
increase crop yields, for example, may appear to have no impact if the estimates of
program effects are biased by the influence of unfavorable weather during the program
period. Similarly, the effects of a program to provide employment opportunities to
released prisoners may be obscured if the program coincides with a depressed period
in the labor market.
Interfering Events
Like secular trends, short-term events can produce changes that may introduce bias
into estimates of program effect. A power outage that disrupts communications and
hampers the delivery of food supplements may interfere with a nutritional program. A
natural disaster may make it appear that a program to increase community cooperation
has been effective, when in reality it is the crisis situation that has brought community
members together.
Maturation
As noted earlier, impact evaluations must often cope with the fact that natural
maturational and developmental processes can produce considerable change
independently of the program. If those changes are included in estimates of program
effects, then those estimates will be biased. For example, the effectiveness of a
campaign to increase interest in sports among young adults may be masked by a general
decline in such interest that occurs when they enter the labor force. Maturational trends
can affect older adults as well: A program to improve preventive health practices
among adults may seem ineffective because health generally declines with age.
This discussion of bias in impact assessment has been motivated by the fact that it is
the pivotal issue in the design and analysis of all impact assessments that are not
conducted as well-implemented randomized field experiments. Indeed, in some
circumstances bias can be highly relevant to randomized experiments as well. In the
randomized experiment, a proper randomization with no attrition from outcome
measurement should prevent selection bias. Careful maintenance of comparable
circumstances for program and control groups between random assignment and outcome
measurement should prevent bias from the influence of other differential experiences or
events on the groups. If either of these conditions is absent from the design, however,
there is potential for bias in the estimates of program effect.
Minimizing bias in these estimates is the crucial research design issue with which
the evaluator must contend when using any nonrandomized impact assessment design,
and any randomized one with relatively high attrition or differential extraneous events
or experiences between intervention and control groups. For this reason, we will
organize our discussion of the alternatives to the randomized field experiment around
consideration of their potential for bias in the resulting estimates of program effect and
the ways the evaluator can attempt to reduce it.
Exhibit 9-C
Studying the Effects of Inclusive Education Using Individually Matched Controls
Each of the eight disabled students in regular junior high classrooms was matched
with a student in a special education classroom on age, gender, level of disability,
adaptive communication behavior, and adaptive social behavior. Statistical analyses
for matched pairs revealed no significant differences between students in the two
groups. These groups were then compared on outcome measures relating to the
students’ friendship networks and the character of their interaction with peers
without disabilities. The results showed that the students in general education
classrooms interacted more frequently with peers without disabilities across a
greater range of activities and settings, received and provided higher levels of
social support, and had larger friendship networks and more durable relationships
with peers without disabilities.
SOURCE: Adapted from Craig H. Kennedy, Smita Shukla, and Dale Fryxell,
“Comparing the Effects of Educational Placement on the Social Relationships of
Intermediate School Students With Severe Disabilities.” Exceptional Children, 1997,
64(1):31-47.
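For a design of this sort, the analysis compares each matched pair rather than two independent groups. A minimal sketch of that generic paired-test logic (Python; the scores are hypothetical and this is not the study's own analysis):

from scipy import stats

# Hypothetical outcome scores, one per matched pair, in the same pair order.
general_ed_scores = [14, 11, 16, 9, 13, 15, 12, 10]
special_ed_scores = [10, 9, 12, 8, 11, 10, 9, 8]

t_stat, p_value = stats.ttest_rel(general_ed_scores, special_ed_scores)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
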
Exhibit 9-D
Evaluation of a Family Development Program Using Aggregate-Matched Controls
A program was started in Baltimore to serve poor families living in public housing
by providing integrated services with the hope of helping families escape from long-
term poverty. Services included access to special educational programs for children
and adults, job training programs, teenage programs, special health care access, and
child care facilities. To the extent possible, these services were delivered within the
LaFayette Courts public housing project. Case managers assigned to the housing
project helped families choose services appropriate to them. The special feature of
this program was its emphasis on serving families rather than individuals. In all,
125 families were enrolled.
To constitute a control group, 125 families were chosen from a comparable public
housing project, Murphy Homes. The impact of the Family Development program
was then assessed by contrasting the enrolled families with the Murphy Homes
sample. After a year of enrollment, the participating families were shown to be
higher in self-esteem and sense of control over their fates, but positive impacts on
employment and earnings had not yet occurred.
SOURCE: Adapted from Anne B. Shlay and C. Scott Holupka, Steps Toward
Independence: The Early Effects of the LaFayette Courts Family Development
Center. Baltimore, MD: Institute for Policy Studies, Johns Hopkins University, 1991.
Exhibit 9-E
Simple Statistical Controls in an Evaluation of the Impact of a Hypothetical
Employment Training Project
I. Outcome comparison between men 35-40 who completed the training program
and a sample of men 35-40 who did not attend the program
III. Comparison adjusting for educational attainment and employment at the start
of the training program (or equivalent data for nonparticipants)
Exhibit 9-F
Estimating the Effect of AA Attendance Using Regression Modeling
Regression-Discontinuity Designs
As the discussion of selection modeling above should make clear, complete and
valid data on the variables that are the basis for selection into nonequivalent
comparison groups provide the makings for an effective statistical control variable.
Suppose, now, that instead of trying to figure out what variables were related to
selection, the evaluator was given the selection variable up front and could apply it
case-by-case to allocate individuals into the intervention or control group according to
their scores on that variable. In this circumstance, selection modeling should be a sure
thing because there would be no uncertainty about how selection was done and the
evaluator would have in hand the measured values that determined it.
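That is the idea behind the regression-discontinuity design. In a minimal sketch (Python; the DataFrame df, the column names need_score and outcome, and the cutoff value are hypothetical), units at or above a known cutoff on the assignment variable receive the program, and the program effect is estimated as the jump in the outcome at the cutoff after controlling for the assignment variable itself.

import statsmodels.formula.api as smf

CUTOFF = 50.0  # hypothetical eligibility cutoff on the assignment variable

def estimate_rd_effect(df):
    df = df.copy()
    df["treated"] = (df["need_score"] >= CUTOFF).astype(int)
    df["centered_score"] = df["need_score"] - CUTOFF
    # Controlling for the (centered) assignment variable models the selection
    # rule exactly; the 'treated' coefficient estimates the discontinuity in the
    # outcome at the cutoff, i.e., the program effect for units near the cutoff.
    model = smf.ols(
        "outcome ~ treated + centered_score + treated:centered_score", data=df
    )
    return model.fit().params["treated"]
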
Exhibit 9-G
Estimating the Effect of AA Attendance Using Two-Stage Selection Modeling
This selection model was then used to produce a new variable, Lambda, which
estimates the probability that each individual will be in the intervention versus the
control group. Lambda is then entered as a control variable in a second-stage
regression analysis that attempts to predict the outcome variable, amount of drinking
measured on the Drinking Pattern scale. Two outcome-related control variables
were also included at this stage—baseline drinking scores and marital status.
Finally, inclusion of the intervention variable, AA attendance (0 = no, 1 = yes),
allowed assessment of its relation to the outcome when the other predictor
variables, including the selection variable, were statistically controlled.
SOURCE: Adapted with permission from Keith Humphreys, Ciaran S. Phibbs, and
Rudolf H. Moos, “Addressing Self-Selection Effects in Evaluations of Mutual Help
Groups and Professional Mental Health Services: An Introduction to Two-Stage Sample
Selection Models.” Evaluation and Program Planning, 1996, 19(4):301-308.
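The two-stage procedure described in the exhibit can be sketched schematically as follows (Python; the DataFrame df and every column name are hypothetical stand-ins, and this is a generic Heckman-style construction rather than the authors' code).

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def two_stage_selection_sketch(df):
    # Stage 1: probit model of selection into AA attendance.
    X1 = sm.add_constant(df[["baseline_drinking", "prior_treatment", "social_support"]])
    probit = sm.Probit(df["aa_attendance"], X1).fit(disp=False)
    xb = np.asarray(X1) @ np.asarray(probit.params)  # linear predictor
    # Lambda: the generalized residual (inverse Mills ratio) summarizing selection.
    lam = np.where(df["aa_attendance"] == 1,
                   norm.pdf(xb) / norm.cdf(xb),
                   -norm.pdf(xb) / (1 - norm.cdf(xb)))
    # Stage 2: outcome regression with Lambda entered as a control variable,
    # along with baseline drinking and marital status, as in the exhibit.
    X2 = df[["aa_attendance", "baseline_drinking", "married"]].copy()
    X2["lambda_hat"] = lam
    X2 = sm.add_constant(X2)
    return sm.OLS(df["drinking_pattern"], X2).fit().params
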
Reflexive Controls
In studies using reflexive controls, the estimation of program effects comes entirely
from information on the targets at two or more points in time, at least one of which is
before exposure to the program. When reflexive controls are used, the presumption must
be made that the targets have not changed on the outcome variable during the time
between observations except for any change induced by the intervention. Under this
assumption, any difference between preintervention and postintervention outcome status
is deemed a program effect. For example, suppose that pensioners from a large
corporation previously received their checks in the mail but now have them
automatically deposited in their bank accounts. Comparison of complaints about late or
missing payments before and after this procedure was implemented could be construed
as evidence of impact, provided that it was plausible that the rate of burglaries from
mailboxes, the level of postal service, and so on had not also changed. This is an
example of a simple pre-post study, the procedure we describe next. Then we turn to the
strongest type of reflexive control design, time-series designs.
Simple Pre-Post Studies
A simple pre-post design (or before-and-after study) is one in which outcomes are
measured on the same targets before program participation and again after sufficiently
long participation for effects to be expected. Comparing the two sets of measurements
produces an estimate of the program effect. As we have noted, the main drawback to
this design is that the estimate will be biased if it includes the effects of other influences
that occur during the period between the before and after measurements. For example, it
might be tempting to assess the effects of Medicare by comparing the health status of
persons before they became eligible with the same measures taken a few years after
participation in Medicare. However, such comparisons would be quite misleading. The
effects of aging generally lead to poorer health on their own, which would bias the
program effect estimate downward. Other life changes that affect health status, such as
retirement and reduced income, may also occur around the time individuals become
eligible for Medicare and create additional bias.
Sometimes time-related changes are subtle. For example, reflexive controls will be
questionable in studies of the effects of clinical treatment for depression. People tend to
seek treatment when they are at a low point, after which some remission of their
symptoms is likely to occur naturally so that they feel less depressed. Measures of their
depression before and after treatment, therefore, will almost automatically show
improvement even if the treatment has no positive effects.
In general, simple pre-post reflexive designs provide biased estimates of program
effects that have little value for purposes of impact assessment. This is particularly the
case when the time elapsed between the two measurements is appreciable—say, a year
or more—because over time it becomes more likely that other processes will obscure
any effects of the program. The simple pre-post design, therefore, is appropriate mainly
for short-term impact assessments of programs attempting to affect conditions that are
unlikely to change much on their own. As described in Chapter 7, such designs may also be
useful for purposes of routine outcome monitoring where the purpose is mainly to
provide feedback to program administrators, not to generate credible estimates of
program effects.
Simple pre-post designs can often be strengthened if it is possible to obtain multiple
measures of the outcome that span the preprogram to postprogram periods. The repeated
measures in such a series may make it possible to describe ongoing trends that would
bias a pre-post effect estimate and adjust them out of that estimate. This is the premise
of time-series designs, which we will discuss next.
Time-Series Designs
The strongest reflexive control design is a time-series design consisting of a
number of observations over a time period spanning the intervention. For example,
suppose that instead of just a pre- and postmeasure of pensioners’ complaints about late
or missing payments, we had monthly information for, say, two years before and one year
after the change in payment procedures. In this case, our degree of certainty about the
program effects would be higher because we would have more information upon which
to base our estimates about what would have happened had there been no change in the
mode of check delivery. A second procedure often used is to disaggregate the outcome
data by various characteristics of the targets. For example, examining time-series data
about pensioners’ complaints regarding receipt of checks in high and low crime areas
and in rural and urban areas would provide additional insight into the impact of the
change in procedure.
Time-series designs may or may not include the same respondents at each time of
measurement. Studies using these designs most often draw their data from existing
databases that compile periodic information related to the outcomes of interest (e.g.,
fertility, mortality, and crime). Available databases typically involve aggregated data
such as averages or rates computed for one or more political jurisdictions. For example,
the Department of Labor maintains an excellent time series that has tracked
unemployment rates monthly for the whole country and for major regions since 1948.
When a relatively long time series of preintervention observations exists, it is often
possible to model long-standing trends in the target group, projecting those trends
through and beyond the time of the intervention and observing whether or not the
postintervention period shows significant deviations from the projections. The use of
such general time-trend modeling procedures as ARIMA (autoregressive integrated
moving average; see Hamilton, 1994; McCleary and Hay, 1980) can identify the best-
fitting statistical models by taking into account long-term secular trends and seasonal
variations. They also allow for the degree to which any value or score obtained at one
point in time is necessarily related to previous ones (technically referred to as
autocorrelation). The procedures involved are technical and require a fairly high level
of statistical sophistication.
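A minimal interrupted time-series sketch of this general approach (Python with statsmodels; the series, the intervention month, and the ARIMA order are hypothetical placeholders, and a real analysis would involve careful model identification):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def interrupted_series_sketch(series, intervention_index, order=(1, 0, 0)):
    """Estimate the level shift at `intervention_index` in a monthly series."""
    step = np.zeros(len(series))
    step[intervention_index:] = 1.0  # 0 before the intervention, 1 afterward
    result = ARIMA(series, exog=step, order=order).fit()
    # The coefficient on the step variable estimates the post-intervention shift,
    # net of the autocorrelation structure captured by the ARIMA terms.
    return result.params
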
Exhibit 9-H illustrates the use of time-series data for assessing the effects of raising
the legal drinking age on alcohol-related traffic accidents. This evaluation was made
possible by the existence of relatively long series of measures on the outcome variable
(more than 200). The analysis used information collected over the eight to ten years
prior to the policy changes of interest to establish the expected trends for alcohol-
related accident rates for different age groups legally entitled to drink. Comparison of
the age-stratified rates experienced after the drinking ages were raised with the
expected rates based on the prior trends provided a measure of the program effect.
As noted earlier, the units of analysis in time-series data relevant to social programs
are usually highly aggregated. Exhibit 9-H deals essentially with one case, the state of
Wisconsin, where accident measures are constructed by aggregating the pertinent data
over the entire state and expressing them as accident rates per 1,000 licensed drivers.
The statistical models developed to fit such data are vulnerable to bias just like all the
other such models we have discussed. For example, if there were significant influences
on the alcohol-related accident rates in Wisconsin that were not represented in the trend
lines estimated by the model, then the results of the analysis would not be valid.
Simple graphic methods of examining time-series data before and after an
intervention can provide crude but useful clues to impact. Indeed, if the confounding
influences on an intervention are known and there is considerable certainty that their
effects are minimal, simple examination of a time-series plot may identify obvious
program effects. Exhibit 9-I presents the primary data for one of the classic applications
of time series in program evaluation—the British Breathalyzer crackdown (Ross,
Campbell, and Glass, 1970). The graph in that exhibit shows the auto accident rates in
Great Britain before and after the enactment and enforcement of drastically changed
penalties for driving while under the influence of alcohol. The accompanying chart
indicates that the legislation had a discernible impact: Accidents declined after it went
into effect, and the decline was especially dramatic for accidents occurring over the
weekend, when we would expect higher levels of alcohol consumption. Though the
effects are rather evident in the graph, it is wise to confirm them with statistical
analysis; the reductions in accidents visible in Exhibit 9-I are, in fact, statistically
significant.
Time-series approaches are not necessarily restricted to single cases. When time-
series data exist for interventions at different times and in different places, more
complex analyses can be undertaken. Parker and Rebhun (1995), for instance, examined
the relationship of changes in state laws governing the minimum age of purchase of
alcohol with homicide rates using time series covering 1976-1983 for each of the 50
states plus the District of Columbia. They used a pooled cross-section time-series
analysis with a dummy code (0 or 1) to identify the years before and after the drinking
age was raised. Other variables in the model included alcohol consumption (beer sales
in barrels per capita), infant mortality (as a poverty index), an index of inequality, racial
composition, region, and total state population. This model was applied to homicide
rates for different age groups. Raising the minimum age-of-purchase law was found to
be significantly related to reductions in homicide for victims in the age 21-24 category.
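An analysis of this general kind can be approximated with an ordinary panel regression. The sketch below is illustrative only; the file and variable names are hypothetical, and it simply regresses an age-specific homicide rate on a 0/1 dummy for whether the higher purchase age was in effect, plus covariates and state and year fixed effects, with standard errors clustered by state.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical state-by-year panel, one row per state per year.
panel = pd.read_csv("state_year_panel.csv")

# law_raised = 1 for state-years after the minimum purchase age was raised,
# 0 before; C(state) and C(year) absorb stable state differences and
# common year-to-year shocks.
model = smf.ols(
    "homicide_rate_21_24 ~ law_raised + beer_per_capita + infant_mortality"
    " + inequality_index + C(state) + C(year)",
    data=panel,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": panel["state"]})
print(result.params["law_raised"], result.bse["law_raised"])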
Exhibit 9-H
Estimating the Effects of Raising the Drinking Age From Time-Series Data
During the early 1980s, many states raised the minimum drinking age from 18 to 21,
especially after passage of the federal Uniform Drinking Age Act of 1984, which
reduced highway construction funds to states that maintained a drinking age less than
21. The general reason for this was the widespread perception that lower drinking
ages had led to dramatic increases in the rate of alcohol-related traffic accidents
among teenagers. Assessing the impact of raising the drinking age, however, is
complicated by downward trends in accidents stemming from the introduction of
new automobile safety factors and increased public awareness of the dangers of
drinking and driving.
Wisconsin raised its drinking age to 19 in 1984 then to 21 in 1986. To assess the
impact of these changes, David Figlio examined an 18-year time series of monthly
observations on alcohol-related traffic accidents, stratified by age, that was
available from the Wisconsin Department of Transportation for the period from
1976 to 1993. Statistical time-series models were fit to the data for 18-year-olds
(who could legally drink prior to 1984), for 19- and 20-year-olds (who could
legally drink prior to 1986), and for over-21-year-olds (who could legally drink
over the whole time period). The outcome variable in these analyses was the rate of
alcohol-related crashes per thousand licensed drivers in the respective age group.
The results showed that, for 18-year-olds, raising the minimum drinking age to 19
reduced the alcohol-related crashes by an estimated 26% from the prior average of
2.2 per month per 1,000 drivers. For 19- and 20-year-olds, raising the minimum to
age 21 reduced the monthly crash rate by an estimated 19% from an average of 1.8
per month per 1,000 drivers. By comparison, the estimated effect of the legal
changes for the 21-and-older group was only 2.5% and statistically nonsignificant.
The evaluator’s conclusion was that the imposition of increased minimum drinking
ages in Wisconsin had immediate and conclusive effects on the number of teenagers
involved in alcohol-related crashes, resulting in substantially fewer crashes than the
prelegislation trends would have predicted.
SOURCE: Adapted from David N. Figlio, “The Effect of Drinking Age Laws and
Alcohol-Related Crashes: Time-Series Evidence From Wisconsin.” Journal of Policy
Analysis and Management, 1995, 14(4):555-566.
Exhibit 9-I
An Analysis of the Impact of Compulsory Breathalyzer Tests on Traffic Accidents
In 1967, the British government enacted a new policy that allowed police to give
Breathalyzer tests at the scenes of accidents. The test measured the presence of
alcohol in the blood of suspects. At the same time, heavier penalties were instituted
for drunken driving convictions. Considerable publicity was given to the provisions
of the new law, which went into effect in October 1967.
The chart below plots vehicular accident rates by various periods of the week
before and after the new legislation went into effect. Visual inspection of the chart
clearly indicates that a decline in accidents occurred after the legislation, a decline
that was especially marked during weekend periods, when alcohol consumption tends to be
highest.
Although the time-series analyses we have discussed all use aggregated data, the
logic of time-series analyses is also applicable to disaggregated data. An example is the
analysis of interventions administered to small groups of persons whose behavior is
measured a number of times before, after, and perhaps during program participation.
Therapists, for example, have used time-series designs to assess the impact of
treatments on individual clients. Thus, a child’s performance on some achievement test
may be measured periodically before and after a new teaching method is used with the
child, or an adult’s drinking behavior may be measured before and after therapy for
alcohol abuse. The logic of time-series analyses remains the same when applied to a
single case, although the statistical methods applied are different because the issues of
long-term trends and seasonality usually are not as serious for individual cases (Kazdin,
1982).
Summary
Impact assessment aims to determine what changes in outcomes can be attributed
to the intervention being assessed. While the strongest research design for this purpose is
the randomized experiment, there are several potentially valid quasi-experimental
impact assessment strategies that can be used when it is not feasible to randomly assign
targets to intervention and control conditions.
When it is possible to assign targets to the intervention and control groups on the
basis of their scores on a quantitative measure of need, merit, or the like, estimates of
program effect from the regression-discontinuity design are generally less susceptible to
bias than those from other quasi-experimental designs.
KEY CONCEPTS
Attrition
The loss of outcome data measured on targets assigned to control or intervention groups,
usually because targets cannot be located or refuse to contribute data.
Matching
Constructing a control group by selecting targets (individually or as aggregates) that are
identical on specified characteristics to those in an intervention group except for receipt
of the intervention.
Pre-post design
A reflexive control design in which outcome measures taken on targets before the
intervention are compared with the same measures taken after the intervention to estimate
program effects. See also reflexive controls.
Reflexive controls
Measures of an outcome variable taken on participating targets before intervention and
used as control observations. See also pre-post design; time-series design.
Regression-discontinuity design
A quasi-experimental design in which selection into the intervention or control group is
based on the observed value on an appropriate quantitative scale, with targets scoring
above a designated cutting point on that scale assigned to one group and those scoring
below assigned to the other. Also called a cutting-point design.
Selection bias
Systematic under- or overestimation of program effects that results from uncontrolled
differences between the intervention and control groups that would result in differences
on the outcome if neither group received the intervention.
Selection modeling
Creation of a multivariate statistical model to “predict” the probability of selection into
intervention or control groups in a nonequivalent comparison design. The results of this
analysis are used to configure a control variable for selection bias to be incorporated
into a second-stage statistical model that estimates the effect of intervention on an
outcome.
Statistical controls
The use of statistical techniques to adjust estimates of program effects for bias resulting
from differences between intervention and control groups that are related to the
outcome. The differences to be controlled by these techniques must be represented in
measured variables that can be included in the statistical analysis.
Time-series design
A reflexive control design that relies on a number of repeated measurements of the
outcome variable taken before and after an intervention.
10
Detecting, Interpreting, and Analyzing Program Effects
Chapter Outline
The Magnitude of a Program Effect
Detecting Program Effects
Statistical Significance
Type I and Type II Errors
Statistical Power
What Statistical Power Is Appropriate for a Given Impact
Assessment?
Assessing the Practical Significance of Program Effects
Examining Variations in Program Effects
Moderator Variables
Mediator Variables
The Role of Meta-Analysis
Informing an Impact Assessment
Informing the Evaluation Field
The three previous chapters focused on outcome measurement and research design
for the purpose of obtaining valid estimates of program effects. Despite good
measurement and design, however, the actual effects produced by a program will not
necessarily appear in a form that allows the evaluator to be confident about their
magnitude or even their existence.
The end product of an impact assessment is a set of estimates of the effects of the
program. Evaluators arrive at these estimates by contrasting outcomes for program
participants to estimates of the outcomes that would have resulted if the targets had not
participated in the program. As discussed in Chapters 8 and 9, research designs vary in
the credibility with which they estimate outcomes absent program participation.
However, all effect estimates, including those obtained through randomized
experiments, need to be examined carefully to ascertain their significance. How to make
such assessments is the major theme of this chapter. We will consider how evaluators
can characterize the magnitude of a program effect, how they can detect program effects
in a set of data, and how they can assess the practical significance of those effects. We
then discuss the more complex issue of analyzing variations in program effects for
different subgroups in the target population. At the end of the chapter, we briefly
consider how meta-analyses of the effects found in previous impact assessments can
help improve the design and analysis of specific evaluations and contribute to the body
of knowledge in the evaluation field.
Exhibit 10-A
Common Effect Size Statistics
The standardized mean difference effect size statistic is especially appropriate for
representing intervention effects found on continuous outcome measures, that is,
measures producing values that range over some continuum. Continuous measures
include age, income, days of hospitalization, blood pressure readings, scores on
achievement tests and other such standardized measurement instruments, and the
like. The outcomes on such measures are typically presented in the form of mean
values for the intervention and control groups, with the difference between those
means indicating the size of the intervention effect. Correspondingly, the
standardized mean difference effect size statistic is defined as:
ES = (mean_i − mean_c) / sd_p
where mean_i and mean_c are the mean outcome scores for the intervention and control
groups and sd_p is the pooled standard deviation of the intervention (sd_i) and control
(sd_c) group scores, specifically
sd_p = sqrt{[(n_i − 1)sd_i^2 + (n_c − 1)sd_c^2] / (n_i + n_c − 2)}
with n_i and n_c the sample sizes of the intervention and control groups, respectively.
For binary outcomes, such as success versus failure, the corresponding effect size
statistic is the odds ratio, defined as:
OR = [p / (1 − p)] / [q / (1 − q)]
where
p = the proportion of the individuals in the intervention group with a positive outcome,
1 − p = the proportion with a negative outcome,
q = the proportion of the individuals in the control group with a positive outcome,
1 − q = the proportion with a negative outcome,
p/(1 − p) = the odds of a positive outcome for an individual in the intervention group, and
q/(1 − q) = the odds of a positive outcome for an individual in the control group.
The odds ratio thus represents an intervention effect in terms of how much greater the
odds of a positive outcome are for the intervention group than for the control group. If,
for example, the computed odds ratio for being free of clinical levels of depression is
1.75, then the odds of being free of clinical levels of depression for those in the
intervention group are 1.75 times greater than those for individuals in the control
group.
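For readers who want to compute these statistics directly, the short sketch below implements the two definitions just given; the numerical values are illustrative only.

import math

def standardized_mean_difference(mean_i, sd_i, n_i, mean_c, sd_c, n_c):
    # Group difference divided by the pooled standard deviation.
    pooled_sd = math.sqrt(((n_i - 1) * sd_i**2 + (n_c - 1) * sd_c**2) / (n_i + n_c - 2))
    return (mean_i - mean_c) / pooled_sd

def odds_ratio(p, q):
    # Odds of a positive outcome in the intervention group relative to the control group.
    return (p / (1 - p)) / (q / (1 - q))

# Illustrative values: intervention mean 52 (SD 10, n = 100) versus control
# mean 48 (SD 10, n = 100) yields an effect size of 0.40.
print(standardized_mean_difference(52, 10, 100, 48, 10, 100))

# Illustrative proportions free of clinical depression: 0.70 in the
# intervention group versus 0.57 in the control group yields an odds
# ratio of about 1.76.
print(odds_ratio(0.70, 0.57))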
Statistical Significance
If we think of the actual program effect as a signal that we are trying to detect in an
impact assessment, the problem of apparent effects that result from statistical noise is
one of a low signal-to-noise ratio. Fortunately, statistics provide tools for assessing the
level of noise to be expected in the type of data we are working with. If the “signal”—
the estimate of program effect that we observe in the data—is large relative to the
expected level of statistical noise, we will be relatively confident that we have detected
a real effect and not a chance pattern of noise. On the other hand, if the program effect
estimate is small relative to the pseudo-effects likely to result from statistical noise, we
will have little confidence that we have observed a real program effect.
To assess the signal-to-noise ratio, we must estimate both the program effect signal
and the background statistical noise. The best estimate of the program effect is simply the
measured mean difference between the outcomes for an intervention and control group,
often expressed as an effect size of the sort described in Exhibit 10-A. An estimate of
the magnitude of the pseudo-effects likely to result from statistical noise is derived by
applying an appropriate statistical probability theory to the data. That estimate is mainly
a function of the size of the sample (the number of units in the intervention and control
groups being compared) and how widely those units vary on the outcome measure at
issue.
This signal-to-noise comparison is routinely accomplished through statistical
significance testing. If the difference between the mean outcomes for an intervention and
control group is statistically significant, the significance test is telling us that the signal-
to-noise ratio, under its assumptions, is such that statistical noise is unlikely to have
produced an effect as large as the one observed in the data when the real effect is zero.
Conventionally, statistical significance is set at the .05 alpha level. This means that the
chance of a pseudo-effect produced by noise being as large as the observed program
effect is 5% or less. Given that, we have a 95% confidence level that the observed
effect is not simply the result of statistical noise.
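In practice this comparison is carried out with a conventional significance test. The sketch below uses simulated outcome scores purely for illustration and applies a two-sample t-test, reporting the p value that is judged against the chosen alpha level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated outcome scores: the intervention group's true mean is 4 points
# higher than the control group's, with equal spread in both groups.
intervention = rng.normal(loc=52, scale=10, size=120)
control = rng.normal(loc=48, scale=10, size=120)

result = stats.ttest_ind(intervention, control)

# With alpha = .05, a p value below .05 indicates that statistical noise
# alone is unlikely to have produced a difference this large.
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}, "
      f"significant at .05: {result.pvalue < 0.05}")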
Although the .05 significance level has become conventional in the sense that it is
used most frequently, there may be good reasons to use a higher or lower level in
specific instances. When it is very important for substantive reasons to have very high
confidence in the judgment that a program is effective, the evaluator might set a higher
threshold for accepting that judgment, say, a significance level of .01, corresponding to
a 99% level of confidence that the effect estimate is not purely the result of chance. In
other circumstances,for instance,in exploratory work seeking leads to promising
interventions, the evaluator might use a lower threshold, such as .10 (corresponding to a
90% level of confidence).
Notice that statistical significance does not mean practical significance or
importance. A statistically significant finding may or may not be significant theoretically
or practically; it is simply a result that is unlikely to be due to chance. Statistical
significance is thus a minimum requirement for a meaningful result (we discuss later in
this chapter how to assess the practical meaning of a given effect estimate). If a
measured program effect is not statistically significant, this means that, by conventional
standards, the signal-to-noise ratio is too low for the effect to be accepted as an
indication of something that is likely to be a real program effect.
Statistical significance testing is thus the evaluator’s first assessment of the
magnitude of a measured program effect in an impact assessment. Moreover, it is
basically an all-or-nothing test. If the observed effect is statistically significant, it is
large enough to be discussed as a program effect. If it is not statistically significant, then
no claim that it is a program effect and not simply statistical noise will have credibility
in the court of scientific opinion.
Exhibit 10-B
Type I and Type II Statistical Inference Errors
Statistical Power
Evaluators, of course, should not design impact assessments that are likely to
produce erroneous conclusions about program effects, especially at the fundamental
level of conclusions about statistical significance. To avoid such mistakes, evaluators
must give careful attention to ensuring that the research design has low risks for Type I
and Type II errors.
The risk of Type I error (finding statistical significance when there is no program
effect) is relatively easy to control. The maximum acceptable chance of that error is set
by the researcher when an alpha level for statistical significance is selected for the
statistical test to be applied. The conventional alpha level of .05 means that the
probability of a Type I error is being held to 5% or less.
Controlling the risk of Type II error (not obtaining statistical significance when there
is a program effect) is more difficult. It requires configuring the research design so that
it has adequate statistical power. Statistical power is the probability that an estimate of
the program effect will be statistically significant when, in fact, it represents a real
effect of a given magnitude. The likelihood of Type II error is the complementary
probability of not obtaining statistical significance under these circumstances, or one
minus statistical power. So, for example, if statistical power is .80, then the likelihood
of Type II error is 1 – .80, or .20 (20%). An impact assessment design with high
statistical power is one that can be counted on to show statistical significance for
program effect estimates that are above some threshold the evaluator judges to be too
large to overlook.
Statistical power is a function of (1) the effect size to be detected, (2) the sample
size, (3) the type of statistical significance test used, and (4) the alpha level set to
control Type I error. The alpha probability level is conventionally set at .05 and thus is
usually treated as a given. The other three factors require more careful consideration. To
design for adequate statistical power, the evaluator must first determine the smallest
effect size the design should reliably detect. For this purpose, effect sizes will be
represented using an effect size statistic such as the standardized mean difference
described in Exhibit 10-A. For instance, the evaluator might select an effect size of .20
in standard deviation units as the threshold for important program effects the research
design should detect at a statistically significant level. Determining what numerical
effect size corresponds to the minimal meaningful program effect the evaluator wants to
detect is rarely straightforward.We will discuss this matter when we take up the topic of
the practical significance of program effects later in this chapter.
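The interplay of these factors can be illustrated with a rough normal-approximation power calculation for a two-group comparison; the function below is a sketch rather than an exact t-based computation, and the example values are arbitrary.

from scipy.stats import norm

def approximate_power(effect_size, n_per_group, alpha=0.05):
    # Normal approximation to the power of a two-tailed, two-sample test.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(z_effect - z_alpha)

# An effect size of .20 with 100 targets per group yields power of only
# about .29, so a real effect of that size would usually go undetected.
print(round(approximate_power(0.20, 100), 2))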
What Statistical Power Is Appropriate for a Given Impact
Assessment?
With a threshold effect size for detection selected, the evaluator must then decide
how much risk of Type II error to accept. For instance, the evaluator could decide that the
risk of failing to attain statistical significance when an actual effect at the threshold
level or higher was present should be held to 5%. This would hold Type II error to the
same .05 probability level that is customary for Type I error. Because statistical power
is one minus the probability of Type II error, this means that the evaluator wants a
research design that has a power of .95 for detecting an effect size at the selected
threshold level or larger. Similarly, setting the risk of Type II error at .20 would
correspond to a statistical power of .80.
What remains, then, is to design the impact evaluation with a sample size and type of
statistical test that will yield the desired level of statistical power. The sample size
factor is fairly straightforward—the larger the sample, the higher the power. Planning
for the best statistical testing approach is not so straightforward. The most important
consideration involves the use of control variables in the statistical model being applied
in the analysis. Control variables that are correlated with the outcome measure have the
effect of extracting the associated variability in that outcome measure from the analysis
of the program effect. Control variables representing nuisance factors can thus reduce
the statistical noise and increase the signal-to-noise ratio in ways that increase statistical
power. The most useful control variable for this purpose is generally the preintervention
measure of the outcome variable itself. A pretest of this sort taps into preexisting
individual differences on the outcome variable that create variation in scores unrelated
to the effects of the program. Because any source of irrelevant variation in the scores
contributes to the statistical noise, use of well-chosen control variables can greatly
enhance statistical power.
To achieve this favorable result, the control variable(s) must have a relatively large
correlation with the outcome variable and be integrated into the analysis that assesses
the statistical significance of the program effect estimate. The forms of statistical
analysis that involve control variables in this way include analysis of covariance,
multiple regression, structural equation modeling, and repeated measures analysis of
variance.
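The payoff from a strong control variable can be seen by comparing the precision of the program effect estimate with and without a pretest covariate. The sketch below uses simulated data with hypothetical variable names; the point is only that the standard error of the treatment coefficient shrinks when the pretest is added to the model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200

# Simulated impact data: a pretest that strongly predicts the outcome, a
# randomized 0/1 treatment indicator, and a true program effect of 2 points.
pretest = rng.normal(50, 10, n)
treat = rng.integers(0, 2, n)
outcome = 0.8 * pretest + 2.0 * treat + rng.normal(0, 5, n)
data = pd.DataFrame({"pretest": pretest, "treat": treat, "outcome": outcome})

without_pretest = smf.ols("outcome ~ treat", data=data).fit()
with_pretest = smf.ols("outcome ~ treat + pretest", data=data).fit()

# The standard error of the treat coefficient reflects the residual noise
# in each analysis; the pretest covariate removes much of that noise.
print("SE without pretest:", round(without_pretest.bse["treat"], 2))
print("SE with pretest:   ", round(with_pretest.bse["treat"], 2))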
Deciding about the statistical power of an impact assessment is a substantive issue.
If the evaluator expects that the program’s effects will be small and that such small
effects are worthwhile, then a design powerful enough to detect such small effects will
be needed. For example, an intervention that would lower automobile accident deaths
by as little as 1% might be judged worthwhile because saving lives is so important. In
contrast, when the evaluator judges that an intervention is worthwhile only when its
effects are large, then it may be quite acceptable if the design lacks power for detecting
smaller effects. An expensive computer programming retraining program may be
considered worth implementing, for instance, only if at least half of the trainees
subsequently obtain relevant employment, a relatively large effect that may be all the
design needs to be able to detect with high power.
It is beyond the scope of this text to discuss the technical details of statistical power
estimation, sample size, and statistical analysis with and without control variables.
Proficiency in these areas is critical for competent impact assessment, however, and
should be represented on any evaluation team undertaking such work. More detailed
information on these topics can be found in Cohen (1988), Kraemer and Thiemann
(1987), and Lipsey (1990, 1998).
Exhibit 10-C presents a representative example of the relationships among the
factors that have the greatest influence on statistical power. It shows statistical power
for various combinations of effect sizes and sample sizes for the most common
statistical test of the difference between the means of two groups (a t-test or,
equivalently, a one-way analysis of variance with no control variables and alpha = .05).
Close examination of the chart in Exhibit 10-C will reveal how difficult it can be to
achieve adequate statistical power in an impact evaluation. Relatively high power is
attained only when either the sample size or the threshold effect size is rather large.
Both of these conditions often are unrealistic for impact evaluation.
Suppose, for instance, that the evaluator wants to hold the risk of Type II error to the
same 5% level that is customary for Type I error, corresponding to a .95 power level.
This is a quite reasonable objective in light of the unjustified damage that might be done
to a program if it produces meaningful effects that the impact evaluation fails to detect at
a statistically significant level. Suppose, further, that the evaluator determines that a
statistical effect size of .20 on the outcome at issue would represent a positive program
accomplishment and should be detected. The chart in Exhibit 10-C shows that the usual
statistical significance test at the alpha = .05 standard and no control variables would
require a sample size of somewhat more than 650 in each group (intervention and
control), for a total of more than 1,300. While such numbers may be attainable in some
evaluation situations, they are far larger than the sample sizes usually reported in impact
evaluation studies.
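That sample size requirement can be checked with a power routine. The sketch below assumes the power utilities in the statsmodels package (details may vary by version) and solves for the per-group sample size needed to detect a standardized effect of .20 with power .95 at alpha = .05; the answer comes out slightly above 650 per group.

from statsmodels.stats.power import TTestIndPower

# Per-group sample size for a two-sided, two-sample t-test to detect an
# effect size of .20 with power .95 at alpha = .05.
n_per_group = TTestIndPower().solve_power(
    effect_size=0.20, alpha=0.05, power=0.95, ratio=1.0, alternative="two-sided"
)
print(round(n_per_group))  # roughly 651 in each group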
Exhibit 10-C
Statistical Power as a Function of Sample Size and Effect Size for a t-Test With Alpha =
.05
Exhibit 10-D
Statistical Significance of the Effects of Delinquency Interventions
The shaded part of the graph in this exhibit indicates the proportion of the effect
estimates at each magnitude level that were found to be statistically significant in the
source evaluation study. Note that it is only for the very largest effect sizes that an
acceptably high proportion were detected at a statistically significant level. Many effects
of a magnitude that might well represent important program benefits were not found to
be statistically significant. This is a direct result of low statistical power. Even though
the effect estimates were relatively large, the amount of statistical noise in the studies
was also large, mainly because of small sample sizes and underutilization of control
variables in the statistical analysis.
Consider the portion of the graph in Exhibit 10-D that shows effect size estimates in
the range of .30. Many of the outcomes in these studies are the reoffense (recidivism)
rates of the juvenile participants.
Moderator Variables
Exhibit 10-E
Some Ways to Describe Statistical Effect Sizes in Practical Terms
When the original outcome measure has inherent practical meaning, the effect size
may be stated directly as the difference between the outcome for the intervention and
control groups on that measure. For example, the dollar value of health services
used after a prevention program or the number of days of hospitalization after a
program aimed at decreasing time to discharge would generally have inherent
practical meaning in their respective contexts.
For programs that aim to raise the outcomes for a target population to mainstream
levels, program effects may be stated in terms of the extent to which the program
effect reduced the gap between the preintervention outcomes and the mainstream
level. For example, the effects of a program for children who do not read well might
be described in terms of how much closer their reading skills at outcome are to the
norms for their grade level. Grade-level norms might come from the published test
norms, or they might be determined by the reading scores of the other children who
are in the same grade and school as the program participants.
When data on relevant outcome measures are available for groups of recognized
differences in the program context, program effects can be compared to their
differences on the respective outcome measures. Suppose, for instance, that a mental
health facility routinely uses a depression scale at intake to distinguish between
patients who can be treated on an outpatient basis and more severe cases that
require inpatient treatment. Program effects measured on that depression scale could
be compared with the difference between inpatient and outpatient intake scores to
reveal if they are small or large relative to that well-understood difference.
When a value on an outcome measure can be set as the threshold for success, the
proportion of the intervention group with successful outcomes can be compared to
the proportion of the control group with such outcomes. For example, the effects of
an employment program on income might be expressed in terms of the proportion of
the intervention group with household income above the federal poverty level in
contrast to the proportion of the control group with income above that level.
Expressing a program effect in terms of success rate may help depict its practical
significance even if the success rate threshold is relatively arbitrary. For example,
the mean outcome value for the control group could be used as a threshold value.
Generally, 50% of the control group will be above that mean. The proportion of the
intervention group above that same value will give some indication of the magnitude
of the program effect. If, for instance, 55% of the intervention group is above the
control group outcome mean, the program has not affected as many individuals as
when 75% are above that mean.
The evaluation literature may provide information about the statistical effects for
similar programs on similar outcomes that can be compiled to identify effects that
are small and large relative to what other programs achieve. Meta-analyses that
systematically compile and report such effects are especially useful for this purpose.
Thus, a standardized mean difference effect size of .22 on the number of consecutive
days without smoking after a smoking cessation program could be viewed as having
larger practical effects if the average effect size for other programs was around .10
on that outcome measure than if it was .50.
Conventional Guidelines
Cohen (1988) provided guidelines for what are generally “small,” “medium,” and
“large” effect sizes in social science research. Though these were put forward in the
context of conducting power analysis, they are widely used as rules of thumb for
judging the magnitude of intervention effects. For the standardized mean difference
effect size, for instance, Cohen suggested that .20 was a small effect, .50 a medium
one, and .80 a large one.
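One of the translations described above, the proportion of the intervention group expected to score above the control group mean, follows directly from the normal distribution when the effect is expressed as a standardized mean difference. The short sketch below applies that conversion to Cohen's benchmark values, assuming roughly normal outcome distributions.

from scipy.stats import norm

# Expected share of the intervention group scoring above the control group
# mean for a given standardized mean difference, assuming normal outcomes.
for label, effect_size in [("small", 0.20), ("medium", 0.50), ("large", 0.80)]:
    share_above = norm.cdf(effect_size)
    print(f"{label} effect (d = {effect_size}): {share_above:.0%} above the control mean")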
Evaluators can most confidently and clearly detect variations in program effects for
different subgroups when they define the subgroups at the start of the impact
assessment. In that case, there are no selection biases involved. For example, a target
obviously does not become a male or a female as a result of selection processes at work
during the period of the intervention. However, selection biases can come into play
when subgroups are defined that emerge during the course of the intervention. For
example, if some members of the control and intervention groups moved away after
being assigned to intervention or control conditions, then whatever forces influenced
that behavior may also be affecting outcomes. Consequently, the analysis needs to take
into account any selection biases in the formation of such emergent subgroups.
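Differential effects for predefined subgroups are commonly examined by adding a treatment-by-subgroup interaction term to the analysis model. The sketch below uses simulated data with hypothetical variable names; the interaction coefficient estimates how much the program effect for one subgroup differs from the effect for the other.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400

# Simulated impact data with a subgroup defined at intake (female = 0/1)
# and a program effect that is 2 points larger for the female subgroup.
female = rng.integers(0, 2, n)
treat = rng.integers(0, 2, n)
outcome = 10 + 3.0 * treat + 2.0 * treat * female + rng.normal(0, 5, n)
data = pd.DataFrame({"outcome": outcome, "treat": treat, "female": female})

model = smf.ols("outcome ~ treat * female", data=data).fit()

# treat is the estimated effect for the reference subgroup; treat:female is
# the estimated difference in effect for the female subgroup.
print(model.params[["treat", "treat:female"]])
print("interaction p value:", round(model.pvalues["treat:female"], 3))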
If the evaluator has measured relevant moderator variables, it can be particularly
informative to examine differential program effects for those targets most in need of the
benefits the program attempts to provide. It is not unusual to find that program effects
are smallest for those who were most in need when they were recruited into the impact
study. An employment training program, for instance, will typically show better job
placement outcomes for participants with recent employment experience and some job-related
skills than for those who lack such experience and skills.
Mediator Variables
Exhibit 10-F
An Example of a Program Impact Theory Showing the Expected Proximal and Distal
Outcomes
Any meta-analyses conducted and reported for interventions of the same general
type as one for which an evaluator is planning an impact assessment will generally
provide useful information for the design of that study. Consequently, the evaluator
should pay particular attention to locating relevant meta-analysis work as part of the
general review of the relevant literature that should precede an impact assessment.
Exhibit 10-G summarizes results from a meta-analysis of school-based programs to
prevent aggressive behavior that illustrate the kind of information often available.
Meta-analysis focuses mainly on the statistical effect sizes generated by intervention
studies and thus can be particularly informative with regard to that aspect of an impact
assessment. To give proper consideration to statistical power, for instance, an evaluator
must have some idea of the magnitude of the effect size a program might produce and
what minimal effect size is worth trying to detect. Meta-analyses will typically provide
information about the overall mean effect size for a program area and, often,
breakdowns for different program variations. With information on the standard deviation
of the effect sizes, the evaluator will also have some idea of the breadth of the effect
size distribution and, hence, some estimate of the likely lower and upper range that
might be expected from the program to be evaluated.
Program effect sizes, of course, may well be different for different outcomes. Many
meta-analyses examine the different categories of outcome variables represented in the
available evaluation studies. This information can give an evaluator an idea of what
effects other studies have considered and what they found. Of course, the meta-analysis
will be of less use if the program to be evaluated is concerned about an outcome that
has not been examined in evaluations of other similar programs. Even then, however,
results for similar types of variables—attitudes, behavior, achievement, and so forth—
may help the evaluator anticipate both the likelihood of effects and the expected
magnitude of those effects.
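The core computation behind such summaries is an inverse-variance weighted mean of the study effect sizes. The sketch below uses illustrative effect sizes and standard errors only and computes a simple fixed-effect weighted mean along with the spread of the study effects.

import numpy as np

# Illustrative standardized mean difference effect sizes from a set of
# impact studies, with the standard error of each estimate.
effect_sizes = np.array([0.10, 0.25, 0.30, 0.05, 0.40, 0.20])
standard_errors = np.array([0.12, 0.08, 0.15, 0.10, 0.20, 0.09])

# Fixed-effect meta-analysis: weight each study by the inverse of its
# sampling variance so that more precise studies count more heavily.
weights = 1.0 / standard_errors**2
mean_effect = np.sum(weights * effect_sizes) / np.sum(weights)
mean_effect_se = np.sqrt(1.0 / np.sum(weights))

print(f"weighted mean effect size: {mean_effect:.2f} (SE {mean_effect_se:.2f})")
print(f"spread of study effect sizes (SD): {effect_sizes.std(ddof=1):.2f}")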
Exhibit 10-G
An Example of Meta-Analysis Results: Effects of School-Based Intervention Programs
on Aggressive Behavior
Many schools have programs aimed at preventing or reducing aggressive and
disruptive behavior. To investigate the effects of these programs, a meta-analysis of
the findings of 221 impact evaluation studies of such programs was conducted.
A thorough search was made for published and unpublished study reports that
involved school-based programs implemented in one or more grades from
preschool through the last year of high school. To be eligible for inclusion in the
meta-analysis, the study had to report outcome measures of aggressive behavior
(e.g., fighting, bullying, person crimes, behavior problems, conduct disorder, and
acting out) and meet specified methodological standards.
Standardized mean difference effect sizes were computed for the aggressive
behavior outcomes of each study. Mean effect sizes were computed for the most common
types of programs. In addition, a moderator analysis of the effect sizes showed that
program effects were larger when the programs were delivered by teachers rather than by
laypersons and when they used a one-on-one rather than a group format.
SOURCE: Adapted from Sandra J. Wilson, Mark W. Lipsey, and James H. Derzon, “The
Effects of School-Based Intervention Programs on Aggressive Behavior: A Meta-
Analysis.” Journal of Consulting and Clinical Psychology, 2003, 71(1):136-149.
Reprinted with permission from the American Psychological Association.
Similarly, after completing an impact assessment, the evaluator may be able to use
relevant meta-analysis results in appraising the magnitude of the program effects that
have been found in the assessment. The effect size data presented by a thorough meta-
analysis of impact assessments in a program area constitute a set of norms that describe
both typical program effects and the range over which they vary.An evaluator can use
this information as a basis for judging whether the various effects discovered for the
program being evaluated are representative of what similar programs attain. Of course,
this judgment must take into consideration any differences in intervention
characteristics, clientele, and circumstances between the program at hand and those
represented in the meta-analysis results.
A meta-analysis that systematically explores the relationship between program
characteristics and effects on different outcomes not only will make it easier for the
evaluator to compare effects but may offer some clues about what features of the
program may be most critical to its effectiveness. The meta-analysis summarized in
Exhibit 10-G, for instance, found that programs were much less effective if they were
delivered by laypersons (parents, volunteers) than by teachers and that better results
were produced by a one-on-one than by a group format. An evaluator conducting an
impact assessment of a school-based aggression prevention program might, therefore,
want to pay particular attention to these characteristics of the program.
Aside from supporting the evaluation of specific programs, a major function of the
evaluation field is to summarize what evaluations have found generally about the
characteristics of effective programs. Though every program is unique in some ways,
this does not mean that we should not aspire to discover some patterns in our evaluation
findings that will broaden our understanding of what works, for whom, and under what
circumstances. Reliable knowledge of this sort not only will help evaluators to better
focus and design each program evaluation they conduct, but it will provide a basis for
informing decisionmakers about the best approaches to ameliorating social problems.
Meta-analysis has become one of the principal means for synthesizing what
evaluators and other researchers have found about the effects of social intervention in
general. To be sure, generalization is difficult because of the complexity of social
programs and the variability in the results they produce. Nonetheless, steady progress is
being made in many program areas to identify more and less effective intervention
models, the nature and magnitude of their effects on different outcomes, and the most
critical determinants of their success. As a side benefit, much is also being learned
about the role of the methods used for impact assessment in shaping the results obtained.
One important implication for evaluators of the ongoing efforts to synthesize impact
evaluation results is the necessity to fully report each impact evaluation so that it will
be available for inclusion in meta-analysis studies. In this regard, the evaluation field
itself becomes a stakeholder in every evaluation. Like all stakeholders, it has distinctive
information needs that the evaluator must take into consideration when designing and
reporting an evaluation.
Summary
The ability of an impact assessment to detect program effects, and the importance
of those effects, will depend in large part on their magnitude. The evaluator must,
therefore, be familiar with the considerations relevant to describing both the statistical
magnitude and the practical magnitude of program effects.
In attempting to statistically detect program effects, the evaluator may draw the
wrong conclusion from the outcome data. An apparent effect may be statistically
significant when there is no actual program effect (a Type I error), or statistical
significance may not be attained when there really is a program effect (a Type II error).
Whatever the overall mean program effect, there are usually variations in effects
for different subgroups of the target population. Investigating moderator variables,
which characterize distinct subgroups, is an important aspect of impact assessment. The
investigation may reveal that program effects are especially large or small for some
subgroups, and it allows the evaluator to probe the outcome data in ways that can
strengthen the overall conclusions about a program’s effectiveness.
In addition, meta-analysis has become one of the principal means for synthesizing
what evaluators and other researchers have found about the effects of social
intervention. In this role, it informs the evaluation field about what has been learned
collectively from the thousands of impact evaluations that have been conducted over the
years.
KEY CONCEPTS
Effect size statistic
A statistical formulation of an estimate of program effect that expresses its magnitude in
a standardized form that is comparable across outcome measures using different units or
scales. Two of the most commonly used effect size statistics are the standardized mean
difference and the odds ratio.
Mediator variable
In an impact assessment, a proximal outcome that changes as a result of exposure to the
program and then, in turn, influences a more distal outcome. The mediator is thus an
intervening variable that provides a link in the causal sequence through which the
program brings about change in the distal outcome.
Meta-analysis
An analysis of effect size statistics derived from the quantitative results of multiple
studies of the same or similar interventions for the purpose of summarizing and
comparing the findings of that set of studies.
Moderator variable
In an impact assessment, a variable, such as gender or age, that characterizes subgroups
for which program effects may differ.
Odds ratio
An effect size statistic that expresses the odds of a successful outcome for the
intervention group relative to that of the control group.
Statistical power
The probability that an observed program effect will be statistically significant when, in
fact, it represents a real effect. If a real effect is not found to be statistically significant,
a Type II error results. Thus, statistical power is one minus the probability of a Type II
error. See also Type II error.
Type II error
A statistical conclusion error in which a program effect estimate is not found to be
statistically significant when, in fact, the program does have an effect on the target
population.
11
Measuring Efficiency
Chapter Outline
Key Concepts in Efficiency Analysis
Ex Ante and Ex Post Efficiency Analyses
Cost-Benefit and Cost-Effectiveness Analyses
The Uses of Efficiency Analyses
Conducting Cost-Benefit Analyses
Assembling Cost Data
Accounting Perspectives
Measuring Costs and Benefits
Monetizing Outcomes
Shadow Prices
Opportunity Costs
Secondary Effects (Externalities)
Distributional Considerations
Discounting
Comparing Costs to Benefits
When to Do Ex Post Cost-Benefit Analysis
Conducting Cost-Effectiveness Analyses
Whether programs have been implemented successfully and the degree to which they
are effective are at the heart of evaluation. However, it is just as critical to be
informed about the cost of program outcomes and whether the benefits achieved justify
those costs.
Efficiency issues arise frequently in decision making about social interventions, as the
following examples illustrate.
In January 1987, Union Bank opened a new profit center in Los Angeles. This one,
however, doesn’t lend money. It doesn’t manage money. It takes care of children.
The profit center is a day-care facility at the bank’s Monterey Park operations
center. Union Bank provided the facility with a $105,000 subsidy [in 1987]. In
return, it saved the bank as much as $232,000. There is, of course, nothing
extraordinary about a day-care center. What is extraordinary is the $232,000. That
number is part of a growing body of research that tries to tell companies what they
are getting—on the bottom line—for the dollars they invest in such benefits and
policies as day-care assistance, wellness plans, maternity leaves, and flexible work
schedules.
The Union Bank study, designed to cover many questions left out of other
evaluations, offers one of the more revealing glimpses of the savings from corporate
day-care centers. For one thing, the study was begun a year before the center
opened, giving researchers more control over the comparison statistics. Union Bank
approved spending $430,000 to build its day-care center only after seeing the
savings projections.
Using data provided by the bank’s human resource department, Sandra Burud, a
child-care consultant in Pasadena, California, compared absenteeism, turnover, and
maternity leave time the first year of operation and the year before. She looked at the
results for 87 users of the center, a control group of 105 employees with children of
similar ages who used other day-care options, and employees as a whole.
Her conclusion: The day-care center saves the bank $138,000 to $232,000 a year—
numbers she calls “very conservative.” Ms. Burud says savings on turnover total
$63,000 to $157,000, based mostly on the fact that turnover among center users was
2.2 percent compared with 9.5 percent in the control group and 18 percent
throughout the bank.
She also counted $35,000 in savings on lost days’ work. Users of the center were
absent an average of 1.7 fewer days than the control group, and their maternity
leaves were 1.2 weeks shorter than for other employees. Ms. Burud also added a
bonus of $40,000 in free publicity, based on estimates of media coverage of the
center.
Despite the complexities of measurement, she says, the study succeeds in
contradicting the “simplistic view of child care. This isn’t a touchy-feely kind of
program. It’s as much a management tool as it is an employee benefit.”
SOURCE: J. Solomon, “Companies Try Measuring Cost Savings From New Types of
Corporate Benefits,” Wall Street Journal, December 29, 1988, p. B1. Reprinted by
permission of The Wall Street Journal, Dow Jones & Company, Inc. All rights reserved
worldwide.
In spite of their value, however, it bears emphasis that in many evaluations formal,
complete efficiency analyses are either impractical or unwise for several reasons. First,
efficiency analysis may be unnecessary if the efficacy of the program is either very
minimal or extremely high. Conducting an efficiency analysis makes sense primarily
when a program is effective but not perfectly so. Second, the required technical
procedures may call for methodological sophistication not available to the project’s
staff. Third, political or moral controversies may result from placing economic values
on particular input or outcome measures, controversies that could obscure the relevance
and minimize the potential utility of an otherwise useful and rigorous evaluation. Fourth,
expressing the results of evaluation studies in efficiency terms may require selectively
taking different costs and outcomes into account, depending on the perspectives and
values of sponsors, stakeholders, targets, and evaluators themselves (what are referred
to as accounting perspectives). The dependence of results on the accounting
perspective employed may be difficult for at least some of the stakeholders to
comprehend, again obscuring the relevance and utility of evaluations. (We discuss
accounting perspectives in more detail later in this chapter.)
Furthermore, efficiency analysis may be heavily dependent on untested assumptions,
or the requisite data for undertaking cost-benefit or cost-effectiveness calculations may
not be fully available. Even the strongest advocates of efficiency analyses acknowledge
that there often is no single “right” analysis. Moreover, in some applications, the results
may show unacceptable levels of sensitivity to reasonable variations in the analytic and
conceptual models used and their underlying assumptions.
Although we want to emphasize that the results of all cost-benefit and cost-
effectiveness analyses should be treated with caution, and sometimes with a fair degree
of skepticism, such analyses can provide a reproducible and rational way of estimating
the efficiency of programs. Even strong advocates of efficiency analyses rarely argue
that such studies should be the sole determinant of decisions about programs.
Nonetheless, they are a valuable input into the complex mosaic from which decisions
emerge.
Efficiency analyses are most commonly undertaken either (1) prospectively during
the planning and design phase of an initiative (ex ante efficiency analysis) or (2)
retrospectively, after a program has been in place for a time and has been demonstrated
to be effective by an impact evaluation, and there is interest in making the program
permanent or possibly expanding it (ex post efficiency analysis).
In the planning and design phases, ex ante efficiency analyses may be undertaken on
the basis of a program’s anticipated costs and outcomes. Such analyses, of course, must
assume a given magnitude of positive impact even if this value is only a
conjecture. Likewise, the costs of providing and delivering the intervention must be
estimated. In some cases, estimates of both the inputs and the magnitude of impact can
be made with considerable confidence, either because there has been a pilot program
(or a similar program in another location) or because the program is fairly simple in its
implementation. Nevertheless, because ex ante analyses cannot be based entirely on
empirical information, they run the risk of seriously under- or overestimating net
benefits (which may be understood for now as the total benefits minus the total costs).
Indeed, the issue of the accuracy of the estimates of both inputs and outputs is one of the
controversial areas in ex ante analyses.
Ex ante cost-benefit analyses are most important for those programs that will be
difficult to abandon once they have been put into place or that require extensive
commitments in funding and time to be realized. For example, the decision to increase
ocean beach recreational facilities by putting in new jetties along the New Jersey ocean
shore would be difficult to overturn once the jetties had been constructed; thus, there is
a need to estimate the costs and outcomes of such a program compared with other ways
of increasing recreational opportunities, or to judge the wisdom of increasing
recreational opportunities compared with the costs and outcomes of allocating the
resources to another social program area.
Thus, when a proposed program would require heavy expenditures, decisions
whether to proceed can be influenced by an ex ante cost-benefit analysis. Exhibit 11-B
illustrates such a situation with regard to the testing of health care workers for HIV.
Even though the possibility of, say, a surgeon or dentist transmitting HIV/AIDS to a
patient is a matter of serious consequences and concern, testing the vast number of
health care workers in this country for HIV would surely be quite expensive. Before
embarking on such a program, it is wise to develop some estimate, even if crude, of
how expensive it is likely to be in relation to the number of patient infections averted.
The analysis summarized in Exhibit 11-B showed that under most risk scenarios any
reasonable policy option would likely be quite expensive. Moreover, there was
considerable uncertainty in the estimates possible from available information. Given the
high, but uncertain, cost estimates, policymakers would be wise to move cautiously on
this issue until better information could be developed.
Most often, however, ex ante efficiency analyses for social programs are not
undertaken. As a consequence, many social programs are initiated or markedly modified
without attention to the practicality of the action in cost-benefit or cost-effectiveness
terms. For example, it might seem worthwhile to expand dental health services for
children in Head Start to include a particular dental treatment that has been shown to
prevent cavities. However, suppose that, while the treatment can be expected to reduce
cavities by an average of one-half cavity per child per year, its annual cost per child is
four times what dentists would charge on average for filling a single cavity. An
efficiency analysis in such a case might easily dissuade decisionmakers from
implementing the program.
Exhibit 11-B
Ex Ante Analysis of the Cost-Effectiveness of HIV Testing for Health Care Workers
The derivation of costs in this study was based on data obtained from reviewing the
pertinent literature and consulting with experts. The cost estimates included three
components: (a) counseling and testing costs, (b) additional treatment costs because
of early detection of HIV-positive cases, and (c) medical care costs averted per
patient infection averted. Costs were estimated by subtracting (c) from (a) + (b).
Analyzing all options under high, medium, and low HIV prevalence and
transmission risk scenarios, the study concluded that one-time mandatory testing
with mandatory restriction of practice for a health care worker found HIV positive
was more cost-effective than the other options. While showing the lowest cost of the
policies considered, that option nonetheless was estimated to cost $291,000 per
infection averted for surgeons and $500,000 for dentists. Given these high costs and
the political difficulties associated with adopting and implementing mandatory
restrictions on practice, this was not considered a viable policy option.
The analysts also found that the cost-effectiveness estimates were highly sensitive to
variations in prevalence and transmission risk and to the different patterns of
practice for physicians in contrast to dentists. The incremental cost per infection
averted ranged from $447 million for dentists under low prevalence/transmission
risk conditions to a savings of $81,000 for surgeons under high
prevalence/transmission risk conditions.
Given the high costs estimated for many of the options and the uncertainty of the
results, the authors concluded as follows: “Given the ethical, social, and public
health implications, mandatory testing policies should not be implemented without
greater certainty as to their cost-effectiveness.”
SOURCE: Adapted from Tevfik F. Nas, Cost-Benefit Analysis: Theory and Application
(Thousand Oaks, CA: Sage, 1996), pp. 191-192. Original study was K. A. Phillips, R.
A. Lowe, J. G. Kahn, P. Lurie, A. L. Avins, and D. Ciccarone, “The Cost Effectiveness
of HIV Testing of Physicians and Dentists in the United States,” Journal of the
American Medical Association, 1994, 271:851-858.
Most commonly, efficiency analyses in the social program field take place after the
completion of an impact evaluation, when the impact of a program is known. In such ex
post cost-benefit and cost-effectiveness assessments, the analysis is undertaken to
assess whether the costs of the intervention can be justified by the magnitude of the
program effects.
The focus of such assessments may be on examining the efficiency of a program in
absolute or comparative terms, or both. In absolute terms, the idea is to judge whether
the program is worth what it costs either by comparing costs to the monetary value of
benefits or by calculating the money expended to produce some unit of outcome. For
example, a cost-benefit analysis may reveal that for each dollar spent to reduce
shoplifting in a department store, $2 are saved in terms of stolen goods, an outcome that
clearly indicates that the shoplifting program would be economically beneficial.
Alternatively, a cost-effectiveness study might show that the program expends $50 to
avert each shoplifting.
In comparative terms, the issue is to determine the differential “payoff” of one
program versus another—for example, comparing the costs of elevating the reading
achievement scores of schoolchildren by one grade level produced by a computerized
instruction program with the costs of achieving the same increase through a peer tutorial
program. In ex post analyses, estimates of costs and outcomes are based on studies of
the types described in previous chapters on impact evaluations.
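The arithmetic behind both framings is simple once costs and effects have been estimated; the sketch below uses entirely hypothetical numbers to compute a benefit-cost ratio for the shoplifting example and a cost-effectiveness ratio of the kind used to compare the two reading programs.

# Hypothetical ex post figures for a shoplifting prevention program.
program_cost = 50_000.0          # dollars spent on the program
monetized_benefits = 100_000.0   # value of averted thefts

net_benefits = monetized_benefits - program_cost
benefit_cost_ratio = monetized_benefits / program_cost
print(f"net benefits: ${net_benefits:,.0f}; benefit-cost ratio: {benefit_cost_ratio:.1f}")

# Hypothetical cost-effectiveness comparison: dollars per one grade-level
# gain in reading achievement under two alternative programs.
computer_assisted = 120_000.0 / 400   # total cost / number of one-grade gains
peer_tutoring = 90_000.0 / 250
print(f"computer-assisted instruction: ${computer_assisted:.0f} per grade-level gain")
print(f"peer tutoring:                 ${peer_tutoring:.0f} per grade-level gain")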
Exhibit 11-C
Cost-Effectiveness of Computer-Assisted Instruction
Cost data are obviously essential to the calculation of efficiency measures. In the
case of ex ante analyses, program costs must be estimated, based on costs incurred in
similar programs or on knowledge of the costs of program processes. For ex post
efficiency analyses, it is necessary to analyze program financial budgets, segregating out
the funds used to finance program processes as well as to collect the costs incurred by
targets or other agencies.
Useful sources of cost data include the following:
Agency fiscal records: These include salaries of program personnel, space rental,
stipends paid to clients, supplies, maintenance costs, business services, and so on.
Target cost estimates: These include imputed costs of time spent by clients in
program activities, client transportation costs, and so on. (Typically these costs
have to be estimated.)
Cooperating agencies: If a program includes activities of a cooperating agency,
such as a school, health clinic, or another government agency, the costs borne can
be obtained from the cooperating agency.
Fiscal records, it should be noted, are not always easily comprehended. The evaluator
may have to seek help from an accounting professional.
It is often useful to draw up a list of the cost data needed for a program. Exhibit 11-
D shows a worksheet representing the various costs for a program that provided high
school students with exposure to working academic scientists to heighten students’
interest in pursuing scientific careers. Note that the worksheet identifies the several
parties to the program who bear program costs.
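As a minimal illustration, costs of this kind might be tallied in a few lines of code, grouped by the party that bears them; the categories and dollar amounts below are hypothetical and are not taken from Exhibit 11-D.

# Hypothetical annualized costs for a science-mentoring program, grouped by
# the party that bears them (categories and figures are illustrative only).
costs = {
    "program agency": {"personnel": 120_000, "space rental": 18_000, "supplies": 6_500},
    "participants": {"imputed time": 22_000, "transportation": 4_000},
    "cooperating school": {"staff time": 15_000, "facilities": 9_000},
}

# Subtotal by cost-bearing party, then the grand total across all parties.
for party, items in costs.items():
    print(f"{party}: {sum(items.values()):,}")
print("total annualized cost:", f"{sum(sum(i.values()) for i in costs.values()):,}")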
Exhibit 11-D
Worksheet for Estimating Annualized Costs for a Hypothetical Program
Accounting Perspectives
To carry out a cost-benefit analysis, one must first decide which perspective to take
in calculating costs and benefits. What point of view should be the basis for specifying,
measuring, and monetizing benefits and costs? In short, costs to and benefits for whom?
Benefits and costs must be defined from a single perspective because mixing points of
view results in confused specifications and overlapping or double counting. Of course,
several cost-benefit analyses for a single program may be undertaken, each from a
different perspective. Separate analyses based on different perspectives often provide
information on how benefits compare to costs as they affect relevant stakeholders.
Generally, three accounting perspectives may be used for the analysis of social projects,
those of (1) individual participants or targets, (2) program sponsors, and (3) the
communal social unit involved in the program (e.g., municipality, county, state, or
nation).
The individual-target accounting perspective takes the point of view of the units
that are the program targets, that is, the persons, groups, or organizations receiving the
intervention or services. Cost-benefit analyses using the individual-target perspective
often produce higher benefit-to-cost results (net benefits) than those using other
perspectives. In other words, if the sponsor or society bears the cost and subsidizes a
successful intervention, then the individual program participant benefits the most. For
example, an educational project may impose relatively few costs on participants.
Indeed, the cost to targets may primarily be the time spent in participating in the project,
since books and materials usually are furnished. Furthermore, if the time required is
primarily in the afternoons and evenings, there may be no loss of income involved. The
benefits to the participants, meanwhile, may include improvements in earnings as a
result of increased education, greater job satisfaction, and increased occupational
options, as well as transfer payments (stipends) received while participating in the
project.
The program sponsor accounting perspective takes the point of view of the funding
source in valuing benefits and specifying cost factors. The funding source may be a
private agency or foundation, a government agency, or a for-profit firm. From this
perspective, the cost-benefit analysis most closely resembles what frequently is termed
private profitability analysis. That is, analysis from this perspective is designed to
reveal what the sponsor pays to provide a program and what benefits (or “profits”)
should accrue to the sponsor.
The program sponsor accounting perspective is most appropriate when the sponsor
is confronted with a fixed budget (i.e., there is no possibility of generating additional
funds) and must make decisive choices between alternative programs. A county
government, for example, may favor a vocational education initiative that includes
student stipends over other programs because this type of program would reduce the
costs of public assistance and similar subsidies (since some of the persons in the
vocational education program would have been supported by income maintenance
funds). Also, if the future incomes of the participants were to increase because of the
training received, their direct and indirect tax payments would increase, and these also
could be included in calculating benefits from a program sponsor perspective. The costs
to the government sponsor include the costs of
operation, administration, instruction, supplies, facilities, and any additional subsidies or
transfers paid to the participants during the training. Another illustration, Exhibit 11-E,
shows a cost-benefit calculation involving the savings to the mental health system that
result from providing specialized services to patients with co-occurring mental
disorders and substance abuse problems.
The communal accounting perspective takes the point of view of the community or
society as a whole, usually in terms of total income. It is, therefore, the most
comprehensive perspective but also usually the most complex and thus the most difficult
to apply. Taking the point of view of society as a whole implies that special efforts are
being made to account for secondary effects, or externalities—indirect project effects,
whether beneficial or detrimental, on groups not directly involved with the
intervention. A secondary effect of a training program, for example, might be the
spillover of the training to relatives, neighbors, and friends of the participants. Among
the more commonly discussed negative external effects of industrial and technical
projects are pollution, noise, traffic, and destruction of plant and animal life. Moreover,
in the current literature, communal cost-benefit analysis has been expanded to include
equity considerations, that is, the distributional effects of programs among different
subgroups. Such effects result in a redistribution of resources in the general population.
From a communal standpoint, for example, every dollar earned by a minority member
who had been unemployed for six months or more may be seen as a “double benefit”
and so entered into the analyses.
Exhibit 11-F illustrates the benefits that need to be taken into account from a
communal perspective. In this exhibit, Gray and associates (1991) report on an effort to
integrate several quasi-experimental studies to produce a reasonable cost-benefit
analysis of the efficiency of different correctional approaches. As shown in the
table in Exhibit 11-F, benefits are of several different types. Although, as the article
carefully notes, there are serious uncertainties about the precision of the estimates, the
results are important to judges and other criminal justice experts concerned with the
costs to society of different types of sentences.
The components of a cost-benefit analysis conducted from a communal perspective
include most of the costs and benefits that also appear in calculations made from the
individual and program sponsor perspectives, but the items are in a sense valued and
monetized differently. For example, communal costs for a project include opportunity
costs, that is, alternative investments forgone by the community to fund the project in
question. These are obviously not the same as opportunity costs incurred by an
individual as a consequence of participating in the project. Communal costs also
include outlays for facilities, equipment, and personnel, usually valued differently than
they would be from the program sponsor perspective. Finally, these costs do not include
transfer payments because they would also be entered as benefits to the community and
the two entries would simply cancel each other out.
Exhibit 11-E
Costs and Savings to the Mental Health System of Providing Specialized Dual
Diagnosis Programs
The behavioral skills model produced the largest positive effects on measures of
client functioning and symptoms but was also the most expensive program to
deliver. To further explore the cost considerations, the evaluators examined service
utilization and cost data for the clients in each of the three programs for four time
periods: the six months before the dual diagnosis programs began (baseline), the six
months after, the 12 months after, and the 18 months after.
Mental health service costs were divided into two categories: supportive services
and intensive services. Supportive services included case management, outpatient
visits, medication visits, day services, and other such routine services for mental
health patients. Intensive services included the more costly treatments for serious
episodes, for instance, inpatient services, skilled nursing care, residential treatment,
and emergency visits.
The costs of supportive services were expected to show an increase for all of the
specialized dual diagnosis programs, corresponding to the extra resources required
to provide them. Any significant savings to the mental health system were expected
to appear as a result of decreased use of expensive intensive services. Thus, the cost
analysis focused on the amount by which the costs of supportive services increased
from baseline in comparison to the amount by which the costs of intensive services
decreased. The table shows the results for the change in service utilization costs
between the six-month baseline period and the 18 months after the program began.
Also, as hoped, the costs for intensive services were reduced from baseline for all
of the specialized programs. The greater impacts of the behavioral skills program on
client functioning and symptoms, however, did not translate into corresponding
decreases in service utilization and associated cost savings. Indeed, the usual-care
condition of the 12-step program produced the greatest decreases in subsequent
costs for intensive services. However, while the case management program did not
yield such large decreases, its lower support costs resulted in a savings-to-costs
ratio that was comparable to that of the 12-step program. Additional analyses
showed that these programs also generally resulted in savings to the medical system,
the criminal justice system, and the families of the clients.
In terms of costs and savings directly to the mental health system, therefore, both the
12-step and the case management programs produced considerably more savings
than they cost. Indeed, the cost analysis estimated that for every $1 invested in
providing these programs there were about $9 in savings that would accrue over the
subsequent 18 months. Moreover, the case management program could actually be
implemented with a net reduction in support service costs, thus requiring no
additional investment. The behavioral skills program, on the other hand, produced a
net loss to the mental health system. For every $1 invested in it, there was only a
$0.53 savings to the mental health system.
Average per Client Change in Costs of Services Used From Baseline to 18 Months
Later, in Dollars
Obviously, the decision about which accounting perspective to use depends on the
stakeholders who constitute the audience for the analysis, or who have sponsored it. In
this sense, the selection of the accounting perspective is a political choice. An analyst
employed by a private foundation interested primarily in containing the costs of hospital
care, for example, likely will take the program sponsor’s accounting perspective,
emphasizing the perspectives of hospitals. The analyst might ignore the issue of whether
the cost-containment program that has the highest net benefits from a sponsor accounting
perspective might actually show a negative cost-to-benefit value when viewed from the
standpoint of the individual. This could be the case if the individual accounting
perspective included the opportunity costs involved in having family members stay
home from work because the early discharge of patients required them to provide the
bedside care ordinarily received in the hospital.
Generally, the communal accounting perspective is the most politically neutral. If
analyses using this perspective are done properly, the information gained from an
individual or a program sponsor perspective will be included as data about the
distribution of costs and benefits. Another approach is to undertake cost-benefit analyses
from more than one accounting perspective. The important point, however, is that cost-
benefit analyses, like other evaluation activities, have political features.
Exhibit 11-F
Costs to Benefits of Correctional Sentences
The control of crime by appropriate sentencing of convicted offenders must take into
account not only the costs of implementing each of the three choices typically
available to judges—prison, jail, or probation sentences—but also the benefits
derived. The major correctional approaches are incapacitation through removing
the offender from the community by incarceration in a prison or jail, deterrence by
making visible the consequences of criminal behavior to discourage potential
offenders, and rehabilitation by resocialization and redirection of criminals’
behavior. Each approach generates different types of “benefits” for society. Since
jail sentences are usually short, for instance, the incapacitation benefit is very small
compared with the benefit from prison sentences, although, since no one likes being
in jail, the deterrence benefit of jail is estimated to be about five-sixths that of
prison.
Gray and associates attempted to estimate ex ante the monetary value of these
different social benefits for each sentencing option (see table). On average,
probation sentences showed greater net benefits than jail, which, in turn, showed a
smaller negative benefit than prison. However, the relative weight given to each
benefit varied according to the type and circumstances of the offense. For example,
the costs of a burglary (loss to the victim plus costs of the police investigation,
arrest, and court proceedings) come to about $5,000, suggesting that perhaps long
prison sentences are called for in the case of recidivist burglars to maximize the
incapacitation benefit. In contrast, the cost of apprehending and trying persons for
receiving stolen property is less than $2,000, and a short jail sentence or even
probation may be the most efficient response.
Estimated Annual Social Costs and Benefits per Offender, in Dollars, for Different
Correctional Sentences (average across all offenses)
Exhibit 11-G
Components of Cost-Benefit Analyses From Different Perspectives for a Hypothetical
SOURCE: Adapted from Jeanette M. Jerrell and Teh-Wei Hu, “Estimating the Cost
Impact of Three Dual Diagnosis Treatment Programs,” Evaluation Review, 1996,
20(2):160-180.
Exhibit 11-H
Hypothetical Example of Cost-Benefit Calculation From Different Accounting
Perspectives for a Typical Employment Training Program
a. Note that net social (communal) benefit can be split into net benefit for trainees plus
net benefit for the government; in this case, the latter is negative: 83,000 + (– 39,000) =
44,000.
Monetizing Outcomes
Because of the advantages of expressing benefits in monetary terms, especially in
cost-benefit analysis, a number of approaches have been specified for monetizing
outcomes or benefits (Thompson, 1980). Five frequently used ones are as follows.
Exhibit 11-I
Discounting Costs and Benefits to Their Present Values
The present value of an amount received t years in the future equals that amount divided by (1 + r)^t, where r is the discount rate (e.g., .05) and t is the number of years. The total stream
of benefits (and costs) of a program expressed in present values is obtained by
adding up the discounted values for each year in the period chosen for study. An
example of such a computation follows.
A training program is known to produce increases of $1,000 per year in earnings for
each participant. The earnings improvements are discounted to their present values
at a 10% discount rate for five years.
Over the five years, total discounted benefits equal $909.09 + $826.45 + … +
$620.92, or $3,790.79. Thus, increases of $1,000 per year for the next five years are
not currently worth $5,000 but only $3,790.79. At a 5% discount rate, the total
present value would be $4,329.48. In general, all else being equal, benefits
calculated using low discount rates will appear greater than those calculated with
high rates.
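The arithmetic in this example can be checked with a short computational sketch; the figures below are those used in Exhibit 11-I (a $1,000 annual benefit over five years, discounted at 10% and at 5%), and the function names are purely illustrative.

def present_value(amount, rate, year):
    # Discount a single amount received 'year' years from now.
    return amount / (1 + rate) ** year

def total_discounted(amount_per_year, rate, years):
    # Sum the discounted values of a constant annual stream.
    return sum(present_value(amount_per_year, rate, t) for t in range(1, years + 1))

print(round(total_discounted(1000, 0.10, 5), 2))  # 3790.79, as in the exhibit
print(round(total_discounted(1000, 0.05, 5), 2))  # 4329.48, as in the exhibit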
The choice of time period on which to base the analysis depends on the nature of the
program and whether the analysis is ex ante or ex post. All else being equal, a program
will appear more beneficial the longer the time horizon chosen.
There is no authoritative approach for fixing the discount rate. One choice is to fix
the rate on the basis of the opportunity costs of capital, that is, the rate of return that
could be earned if the funds were invested elsewhere. But there are considerable
differences in opportunity costs depending on whether the funds are invested in the
private sector, as an individual might do, or in the public sector, as a quasi-government
body may decide it must. The length of time involved and the degree of risk associated
with the investment are additional considerations.
The results of a cost-benefit analysis are thus particularly sensitive to the choice of
discount rate. In practice, evaluators usually resolve this complex and controversial
issue by carrying out discounting calculations based on several different rates.
Furthermore, instead of applying what may seem to be an arbitrary discount rate or
rates, the evaluator may calculate the program’s internal rate of return, or the value
that the discount rate would have to be for program benefits to equal program costs.
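The same kind of sketch can illustrate sensitivity to the discount rate and the search for an internal rate of return; the program figures assumed below (an up-front cost of $3,500 and a $1,000 annual benefit for five years) are hypothetical and serve only to show the logic of the calculation.

def npv(cost, annual_benefit, years, rate):
    # Net present value: the discounted benefit stream minus the up-front cost.
    benefits = sum(annual_benefit / (1 + rate) ** t for t in range(1, years + 1))
    return benefits - cost

# Net benefits shrink as the discount rate rises (hypothetical figures).
for r in (0.03, 0.05, 0.10):
    print(f"rate {r:.0%}: net benefit = {npv(3500, 1000, 5, r):,.2f}")

# Internal rate of return: the rate at which discounted benefits just equal costs,
# located here by bisection between 0% and 100% (about 13.2% with these figures).
low, high = 0.0, 1.0
for _ in range(60):
    mid = (low + high) / 2
    if npv(3500, 1000, 5, mid) > 0:
        low = mid
    else:
        high = mid
print(f"internal rate of return: {low:.2%}")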
A related technique, inflation adjustment, is used when changes over time in asset
prices should be taken into account in cost-benefit calculations. For example, the prices
of houses and equipment may change considerably because of the increased or
decreased value of the dollar at different times.
Earlier we referred to the net benefits of a program as the total benefits minus the total
costs. The necessity of discounting means that net benefits are more precisely defined
as the total discounted benefits minus the total discounted costs. This total is also
referred to as the net rate of return.
It is clear that with the many considerations involved there can be considerable
disagreement on the monetary values to be placed on benefits. The disputes that arise in
setting these values underlie much of the conflict over whether cost-benefit analysis is a
legitimate way of estimating the efficiency of programs.
The final step in cost-benefit analysis consists of comparing total costs to total
benefits. How this comparison is made depends to some extent on the purpose of the
analysis and the conventions in the particular program sector. The most direct
comparison can be made simply by subtracting costs from benefits, after appropriate
discounting. For example, a program may have costs of $185,000 and calculated
benefits of $300,000. In this case, the net benefit (or profit, to use the business analogy)
is $115,000. Although generally more problematic, sometimes the ratio of benefits to
costs is used rather than the net benefit. This measure is generally regarded as more
difficult to interpret and should be avoided (Mishan, 1988).
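Using the hypothetical figures just cited, the two summary measures would be computed as follows; the calculation is shown only to make the contrast concrete.

# Hypothetical totals from the text, already discounted.
costs, benefits = 185_000, 300_000

net_benefit = benefits - costs        # 115,000: the "profit" analogy
benefit_to_cost = benefits / costs    # roughly 1.62: a ratio, harder to interpret on its own

print(net_benefit, round(benefit_to_cost, 2))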
In discussing the comparison of benefits to costs, we have noted the similarity to
decision making in business. The analogy is real. In particular, in deciding which
programs to support, some large private foundations actually phrase their decisions in
investment terms. They may want to balance a high-risk venture (i.e., one that might
show a high rate of return but has a low probability of success) with a low-risk program
(one that probably has a much lower rate of return but a much higher probability of
success). Thus, foundations, community organizations, or government bodies might wish
to spread their “investment risks”by developing a portfolio of projects with different
likelihoods and prospective amounts of benefit.
Sometimes, of course, the costs of a program are greater than its benefits. In Exhibit
11-J, a cost-to-benefit analysis is presented that documents the negative results of a
federal initiative to control noise. In this analysis, the costs of regulatory efforts to
control the noise from motorcycles, trucks, and buses were estimated to be considerably
higher than the benefits of the program. In the exhibit’s table, the findings for truck and
bus regulations are reported; note the negative values when costs are subtracted from
benefits and the values less than 1.0 that result when benefits are divided by costs. Of
course, one can quarrel over the measure of benefits, which was simply the increase in
property values resulting from a decline in decibels (dBAs) of noise. Nevertheless,
according to Broder (1988), the analysis was a major reason why the Reagan
administration abandoned the program.
It bears noting that sometimes programs that yield negative values are nevertheless
important and should be continued. For example, there is a communal responsibility to
provide for severely retarded persons, and it is unlikely that any procedure designed to
do so will have a positive value (subtracting costs from benefits). In such cases, one
may still want to use cost-benefit analysis to compare the efficiency of different
programs, such as institutional care and home care.
Exhibit 11-J
A Study of the Birth and Death of a Regulatory Agenda
It has long been the case that, once funded, government programs are almost
impossible to eliminate. Most organizations build up constituencies over the years
that can be called on to protect them if threatened. Thus, it was particularly
remarkable that the federal Office of Noise Abatement and Control (ONAC) at the
Environmental Protection Agency (EPA) was disbanded during the Reagan
administration, thus terminating a major social regulatory program without a public
outcry.
Although the halt in the spread of inefficient noise regulation is one of few examples
of lasting relief from social regulation provided by the Reagan administration, a
further irony is that much of the economic analysis that was at least partly
instrumental in motivating the change in policy was produced by the prior
administration. Specifically, President Carter’s Council of Economic Advisors and
the Council on Wage and Price Stability, an agency disbanded by the Reagan
administration, had produced several economic analyses for the public docket that
were highly critical of the regulations, although it was the Reagan administration
that acted on these analyses.
NOTE: dBAs = decibels. Costs and benefits are in millions of 1978 dollars except for
ratios.
SOURCE: Adapted from I. E. Broder, “A Study of the Birth and Death of a Regulatory
Agenda: The Case of the EPA Noise Program,” Evaluation Review, 1988, 12(3):291-
309.
Optimal prerequisites of an ex post cost-benefit analysis of a program include the
following:
The program has independent or separable funding. This means that its costs can be
separated from those incurred by other activities.
The program is beyond the development stage, and it is certain that its effects are
significant.
The program’s impact and the magnitude of that impact are known or can be
validly estimated.
Benefits can be translated into monetary terms.
Decisionmakers are considering alternative programs, rather than simply whether
or not to continue the existing project.
Exhibit 11-K
Cotton Dust Regulation: An OSHA Success Story
In the late 1970s, the Occupational Safety and Health Administration (OSHA) took a
major step in attempting to promote the health of workers in the textile industry,
tightening its standard on cotton dust levels in textile plants. Because the OSHA
cotton dust standard was widely believed to be ineffective, it became the target of a
major political debate and a fundamental U.S. Supreme Court decision. However,
the evidence indicates that the standard has had a significant beneficial effect on
worker health, and at a cost much lower than originally anticipated. For instance,
data on the relationship between exposure to cotton dust and disease incidence, as
well as the disability data and the evidence based on worker turnover, suggest that
the risks of byssinosis (lung disease) have been reduced dramatically. The cost of
eliminating even cases classified as “totally disabled” is less than $1,500, and thus
there is a strong economic basis for the enforcement of OSHA standards.
Exhibit 11-L
Cost Analysis of Training and Employment Services in Methadone Treatment
Prior evaluation research has shown that vocational and employment counseling for
drug users has positive effects not only on employment but also on drug use and
criminality. Despite these encouraging signs, many drug treatment programs have
reduced or eliminated vocational services due to changes in program emphasis or
financial pressures. Against this background, a team of evaluators at Research
Triangle Institute conducted cost analysis on four methadone maintenance programs
with employment services components to help decisionmakers explore the
feasibility of a renewed emphasis on vocational services in substance abuse
treatment.
Given these positive findings, the critical practical question is how much the TEP
component added to the cost of the standard treatment program. To assess this, the
evaluators examined the total costs and cost per client of TEP in comparison to the
analogous costs of the standard program without TEP for each of the four program
sites. The main results are summarized in the table.
The results of this analysis indicated that the cost per client of the TEP component
ranged from $1,648 to $2,215, amounts corresponding to between 42% and 50% of
the cost of the standard methadone treatment without TEP.
Annual Total and per Client Costs of Adding Training and Employment Program
(TEP) Services
Although some sponsors and program staff are prejudiced against efficiency
analyses because they deal chiefly with “dollars” and not “people,” the approach that
underlies them is no different from that of any stakeholder who needs to assess the
utility of implementing or maintaining a program. Our world of limited resources,
though often decried, nevertheless requires setting one program against another and
deciding on resource allocation. Competent efficiency analysis can provide valuable
information about a program’s economic potential or actual payoff and thus is important
for program planning, implementation, and policy decisions, as well as for gaining and
maintaining the support of stakeholders.
Summary
Efficiency analyses can require considerable technical sophistication and the use
of consultants. As a way of thinking about program results, however, they direct
attention to costs as well as benefits and have great value for the evaluation field.
In estimating costs, the concept of opportunity costs allows for a truer estimate but
can be complex and controversial in application.
The true outcomes of projects include secondary and distributional effects, both
of which should be taken into account in full cost-benefit analyses.
In cost-benefit analysis, both costs and benefits must be projected into the future
to reflect the long-term effects of a program. In addition, future benefits and costs must
be discounted to reflect their present values.
KEY CONCEPTS
Accounting perspectives
Perspectives underlying decisions on which categories of goods and services to include
as costs or benefits in an efficiency analysis.
Benefits
Positive program outcomes, usually translated into monetary terms in cost-benefit
analysis or compared with costs in cost-effectiveness analysis. Benefits may include
both direct and indirect outcomes.
Costs
Inputs, both direct and indirect, required to produce an intervention.
Discounting
The treatment of time in valuing costs and benefits of a program in efficiency analyses,
that is, the adjustment of costs and benefits to their present values, requiring a choice of
discount rate and time frame.
Distributional effects
Effects of programs that result in a redistribution of resources in the general population.
Net benefits
The total discounted benefits minus the total discounted costs. Also called net rate of
return.
Opportunity costs
The value of opportunities forgone because of an intervention program.
Secondary effects
Effects of a program that impose costs on persons or groups who are not targets.
Shadow prices
Imputed or estimated costs of goods and services not valued accurately in the
marketplace. Shadow prices also are used when market prices are inappropriate due to
regulation or externalities. Also known as accounting prices.
Chapter Outline
The Social Ecology of Evaluations
Multiple Stakeholders
The Range of Stakeholders
Consequences of Multiple Stakeholders
Disseminating Evaluation Results
Evaluation as a Political Process
Political Time and Evaluation Time
Issues of Policy Significance
Evaluating Evaluations
The Profession of Evaluation
Intellectual Diversity and Its Consequences
The Education of Evaluators
Consequences of Diversity in Origins
Diversity in Working Arrangements
Inside Versus Outside Evaluations
Organizational Roles
The Leadership Role of Evaluation “Elite” Organizations
Evaluation Standards, Guidelines, and Ethics
Utilization of Evaluation Results
Do Evaluations Have Direct Utility?
This chapter is concerned with the social and political context of evaluation
activities. Evaluation involves more than simply using appropriate research
procedures. It is a purposeful activity, undertaken to affect the development of policy,
to shape the design and implementation of social interventions, and to improve the
management of social programs. In the broadest sense of politics, evaluation is a
political activity.
There are, of course, intrinsic rewards for evaluators, who may derive great
pleasure from satisfying themselves that they have done as good a technical job as
possible—like artists whose paintings hang in their attics and never see the light of
day, and poets whose penciled foolscap is hidden from sight in their desk drawers.
But that is not really what it is all about. Evaluations are a real-world activity. In the
end, what counts is not the critical acclaim with which an evaluation is judged by
peers in the field but the extent to which it leads to modified policies, programs, and
practices—ones that, in the short or long term, improve the conditions of human life.
Multiple Stakeholders
Exhibit 12-A
The Consequences of Contrary Results
In the middle 1980s, the Robert Wood Johnson Foundation and the Pew Memorial
Trust provided a grant to the Social and Demographic Institute at the University of
Massachusetts to develop practical methods of undertaking credible enumerations of
the homeless. The two foundations had just launched a program funding medical
clinics for homeless persons, and an accurate count of the homeless was needed to
assess how well the clinics were covering their clients.
Our findings concerning how many homeless were in Chicago quickly became the
center of a controversy. The interests of the Chicago homeless were defended and
advanced by the Chicago Coalition for the Homeless and by the Mayor’s Committee
on the Homeless, both composed of persons professionally and ideologically
devoted to these ends. These two groups were consistently called on by the media
and by public officials to make assessments of the status of the Chicago homeless.
Their views about homelessness in essence defined the conventional wisdom and
knowledge on this topic. In particular, a widely quoted estimate that between 20,000
and 25,000 persons were homeless in Chicago came from statements made by the
Coalition and the Mayor’s Committee.
At the outset, the Chicago Coalition for the Homeless maintained a neutral position
toward our study. The study, its purposes, and its funding sources were explained to
the coalition, and we asked for their cooperation, especially in connection with
obtaining consent from shelter operators to interview their clients. The coalition
neither endorsed our study nor condemned it, expressing some skepticism
concerning our approach and especially about the operational definition of
homelessness, arguing for a broader definition of homelessness that would
encompass persons in precarious housing situations, persons living doubled up
When the data from Phase I were processed, we were shocked by the findings. The
estimate of the size of the homeless population was many magnitudes smaller than
the numbers used by the coalition: 2,344, compared to 20,000-25,000. Because we
had anticipated a much larger homeless population, our sample of streets was too
small to achieve much precision for such small numbers. We began to question
whether we had made some egregious error in sample design or execution. Adding
to our sense of self-doubt, the two foundations that had supported most of the project
also began to have doubts, their queries fueled in part by direct complaints from the
advocates for the homeless. To add to our troubles, the Phase I survey had consumed
all the funds that our sponsors had provided, which were originally intended to
support three surveys spread over a year. After checking over our Phase I findings,
we were convinced that they were derived correctly but that they would be more
convincing to outsiders if the study were replicated. We managed to convince our
funding sponsors to provide more funds for a second survey that was designed with
a larger sample of Chicago blocks than Phase I. The street sample was also
supplemented by special purposive samples in places known to contain large
numbers of homeless persons (bus, elevated, and subway stations; hospital waiting
rooms; etc.) to test whether our dead-of-the-night survey time missed significant
numbers of homeless persons who were on the streets during the early evening hours
but had found sleeping accommodations by the time our interviewing teams searched
sample blocks.
When the data were in from Phase II, our calculated estimate of the average size of
the nightly homeless in Chicago was 2,020 with a standard error of 275. Phase II
certainly had increased the precision of our estimates but had not resulted in
substantially different ones. Using data from our interviews, we also attempted to
estimate the numbers of homeless persons we may have missed because they were
temporarily housed, in jail, in a hospital, or in prison. In addition, we estimated the
number of homeless children accompanying parents (we found no homeless children
in our street searches). Adding these additional numbers of homeless persons to the
average number who were nightly homeless as estimated from our Phase I and Phase
II surveys, we arrived at a total of 2,722. This last estimate was still very far from
the 20,000- to 25,000-person estimates of the Chicago Coalition.
Although the final report was distributed to the Chicago newspapers, television
stations, and interested parties on the same date, somehow copies of the report had
managed to get into the hands of the coalition. Both major Chicago newspapers ran
stories on the report, followed the next day by denunciatory comments from
Almost overnight, I had become persona non grata in circles of homeless advocates.
When I was invited by the Johnson Foundation to give a talk at a Los Angeles
meeting of staff members from the medical clinics the foundation financed, no one
present would talk to me except for a few outsiders. I became a nonperson
wandering through the conference, literally shunned by all.
SOURCE: Adapted from Peter H. Rossi, “No Good Applied Research Goes
Unpunished!” Social Science and Modern Society, 1987, 25(1):74-79.
Obviously, results must be communicated in ways that make them intelligible to the
various stakeholder groups. External evaluators generally provide sponsors with
technical reports that include detailed and complete (not to mention honest) descriptions
of the evaluation’s design, data collection methods, analysis procedures, results,
suggestions for further research, and recommendations regarding the program (in the
case of monitoring or impact evaluations), as well as a discussion of the limitations of
the data and analysis. Technical reports usually are read only by peers, rarely by the
stakeholders who count. Many of these stakeholders simply are not accustomed to
reading voluminous documents, do not have the time to do so, and might not be able to
understand them.
For this reason, every evaluator must learn to be a “secondary disseminator.”
Secondary dissemination refers to the communication of results and recommendations
that emerge from evaluations in ways that meet the needs of stakeholders (as opposed to
primary dissemination to sponsors and technical audiences, which in most cases is the
technical report). Secondary dissemination may take many different forms, including
abbreviated versions of technical reports (often called executive summaries), special
reports in more attractive and accessible formats, oral reports complete with slides, and
sometimes even movies and videotapes.
The objective of secondary dissemination is simple: to provide results in ways that
can be comprehended by the legendary “intelligent layperson,” admittedly a figure
sometimes as elusive as Bigfoot. Proper preparation of secondary dissemination
documents is an art form unknown to most in the field, because few opportunities for
learning are available during one’s academic training. The important tactic in secondary
communication is to find the appropriate style for presenting research findings, using
language and form understandable to audiences who are intelligent but unschooled in the
vocabulary and conventions of the field. Language implies a reasonable vocabulary
level that is as free as possible from esoteric jargon; form means that secondary
dissemination documents should be succinct and short enough not to be formidable.
Useful advice for this process can be found in Torres, Preskill, and Piontek (1996). If
the evaluator does not have the talents to disseminate his or her findings in ways that
maximize utilization—and few of us do—an investment in expert help is justified. After
all, as we have stressed, evaluations are undertaken as purposeful activities; they are
useless unless they can get attention from stakeholders.
Throughout this book, we have stressed that evaluation results can be useful in the
decision-making process at every point during a program’s evolution and operations. In
the earliest phases of program design, evaluations can provide basic data about social
problems so that sensitive and appropriate services can be designed. While prototype
programs are being tested, evaluations of pilot demonstrations may provide estimates of
the effects to be expected when the program is fully implemented. After programs have
been in operation, evaluations can provide considerable knowledge about
accountability issues. But this is not to say that what is useful in principle will
automatically be understood, accepted, and used. At every stage, evaluation is only one
ingredient in an inherently political process. And this is as it should be: Decisions with
important social consequences should be determined in a democratic society by
political processes.
In some cases, project sponsors may contract for an evaluation with the strong
anticipation that it will critically influence the decision to continue, modify, or terminate
a project. In those cases, the evaluator may be under pressure to produce information
quickly, so that decisions can be made expeditiously. In short, evaluators may have a
receptive audience. In other situations, evaluators may complete their assessments of an
intervention only to discover that decisionmakers react slowly to their findings. Even
more disconcerting are the occasions when a program is continued, modified, or
terminated without regard to an evaluation’s valuable and often expensively obtained
information.
Although in such circumstances evaluators may feel that their labors have been in
vain, they should remember that the results of an evaluation are only one of the elements
in a complex decision-making process. This point was clearly illustrated as long ago as
1915 in the controversy over the evaluation of the Gary plan in New York City public
schools, described in Exhibit 12-B. The many parties involved in a human service
program, including sponsors, managers, operators, and targets, often have very high
stakes in the program’s continuation, and their frequently unsupportable but enthusiastic
claims may count more heavily than the coolly objective results of an evaluation.
Moreover, whereas the outcome of an evaluation is simply a single argument on one
side or another, the outcome of typical American political processes may be viewed as
a balancing of a variety of interests.
In any political system that is sensitive to weighing, assessing, and balancing the
conflicting claims and interests of a number of constituencies, the evaluator’s role is that
of an expert witness, testifying to the degree of a program’s effectiveness and bolstering
that testimony with empirically based information. A jury of decisionmakers and other
stakeholders may give such testimony more weight than uninformed opinion or shrewd
guessing, but they, not the expert witness, are the ones who must reach a verdict. There are
other considerations to be taken into account.
To imagine otherwise would be to see evaluators as having the power of veto in the
political decision-making process, a power that would strip decisionmakers of their
prerogatives. Under such circumstances, evaluators would become philosopher-kings
whose pronouncements on particular programs would override those of all the other
parties involved.
In short, the proper role of evaluation is to contribute the best possible knowledge
on evaluation issues to the political process and not to attempt to supplant that process.
Exhibit 12-C contains an excerpt from an article by one of the founders of modern
evaluation theory, Donald T. Campbell, expounding a view of evaluators as servants of
“the Experimenting Society.”
Exhibit 12-B
Politics and Evaluation
This exhibit concerns the introduction of a new plan of school organization into the
New York City schools in the period around World War I. The so-called Gary plan
modeled schools after the new mass production factories, with children being
placed on shifts and moved in platoons from subject matter to subject matter. The
following account is a description of how evaluation results entered into the
political struggle between the new school board and the existing school system
administration.
The Gary plan was introduced into the schools by a new school board appointed by
a reform mayor, initially on a pilot basis. School Superintendent Maxwell, resentful
of interference in his professional domain and suspicious of the intent of the mayor’s
administration, had already expressed his feelings about the Gary plan as it was
operating in one of the pilot schools: “Well, I visited that school the other day, and
the only thing I saw was a lot of children digging in a lot.” Despite the
superintendent’s views, the Gary system had been extended to 12 schools in the
Bronx, and there were plans to extend it further. The cry for more research before
extending the plan was raised by a school board member.
Buckingham’s report was highly critical of the eager proponents of the Gary system
for making premature statements concerning its superiority. No sooner had the
Buckingham report appeared than a veritable storm of rebuttal followed, both in the
press and in professional journals. Howard W. Nudd, executive director of the
Public Education Association, wrote a detailed critique of the Buckingham report,
which was published in the New York Globe, the New York Times, School and
Society, and the Journal of Education. Nudd argued that at the time Buckingham
conducted his tests, the Gary plan had been in operation in one school for only four
SOURCE: Adapted from A. Levine and M. Levine, “The Social Context of Evaluation
Research: A Case Study,” Evaluation Quarterly, 1977, 1(4):515-542.
Political Time and Evaluation Time
There are two additional strains involved in doing evaluations, compared with
academic social research, that are consequences of the fact that the evaluator is engaged
in a political process involving multiple stakeholders. One is the need for evaluations to
be relevant and significant in a policy sense, a topic we will take up momentarily; the
other is the difference between political time and evaluation time.
Evaluations, especially those directed at assessing program impact, take time.
Usually, the tighter and more elegant the study design, the longer the time period
required to perform the evaluation. Large-scale social experiments that estimate the net
effects of major innovative programs may require anywhere from four to eight years to
complete and document. The political and program worlds often move at a much faster
pace. Policymakers and project sponsors usually are impatient to know whether or not a
program is achieving its goals, and often their time frame is a matter of months, not
years.
Exhibit 12-C
Social Scientists as Servants of the Experimenting Society
SOURCE: Quoted from Donald T. Campbell, “Methods for the Experimenting Society,”
Evaluation Practice, 1991, 12(3):228-229.
Exhibit 12-D
Using Evaluative Activities in the Analysis of Proposed New Programs
Many of us spend much of our time doing retrospective studies; these are and will
continue to be the meat and potatoes of evaluation research. Congress asks us for
them and asks the executive branch to do them, and they are needed, but these
studies are not the easiest ones to insert into the political process, and they may well
be the least propitious from the viewpoint of use… . By contrast, before a program
has started, evaluators can have an enormous effect in improving the reasoning
behind program purposes or goals, in identifying the problems to be addressed, and
in selecting the best point of intervention and the type of intervention most likely to
succeed. The tempo at which new programs are sometimes introduced presents
some difficulty… . The pace often becomes so frantic that the lead time necessary to
gear up for evaluative work is simply impossible to obtain if results are to be ready
soon enough to be useful.
At the General Accounting Office (GAO) we are developing a method I call the
Evaluation Planning Review which is specifically intended to be useful in the
formulation of new programs. We have just given it a first try by looking at a
proposed program focusing on teenage pregnancy. Essentially, the method seeks to
gather information on what is known about past, similar programs and apply the
experience to the architecture of the new one. Senator Chaffee asked us to look at
the bill he was introducing; we managed to secure four good months to do the work,
and it has been a major success from both the legislative point of view and our own.
From a more general, political perspective, providing understanding ahead of time
of how a program might work can render a valuable public service—either by
helping to shore up a poorly thought-out program or by validating the basic
soundness of what is to be undertaken. True, there are questions that decisionmakers
do not pose to evaluators that could usefully be posed, which seems a priori to be a
problem for the framework; however, even when evaluators have been free to
choose the questions, this particular type of question has not often been asked. Also,
evaluators can always influence the next round of policy questions through their
products.
Policy significance. The fact that evaluations are conducted according to the canons of
social research may make them superior to other modes of judging social programs. But
evaluations provide only superfluous information unless they directly address the value
issues of persons engaged in policy making, program planning, and management, that is,
unless there is policy significance. The weaknesses of evaluations, in this regard, tend to
center on how research questions are stated and how findings are interpreted (Datta,
1980). The issues here involve considerations that go beyond methodology. To
maximize the utility of evaluation findings, evaluators must be sensitive to two levels of
policy considerations.
First, programs that address problems perceived as critical require better (i.e., more
Basic science models versus policy-oriented models. Social scientists often do not
grasp the difference in emphasis required in formulating a model purposefully to alter a
phenomenon as opposed to developing a causal model to explain the phenomenon. For
example, much of the criminal behavior of young men can be explained by the extent of
such behavior among males in their social network—fathers, brothers, other male
relatives, friends, neighbors, schoolmates, and so on. This is a fascinating finding that
affords many insights into the geographic and ethnic distributions of crime rates.
However, it is not a useful finding in terms of altering the crime rate because it is
difficult to envisage an acceptable public policy that would alter the social networks of
young men. Short of yanking young males out of their settings and putting them into other
environments, it is not at all clear that anything can be done to affect their social
networks. Policy space will likely never (we hope) include population redistribution
for these purposes.
In contrast, it is easier to envisage a public policy that would attempt to alter the
perceived costs of engaging in criminal activities, even though they are a weaker
determinant of crime. The willingness to engage in crime is sluggishly and weakly
related to subjective probabilities: The more that individuals believe they likely will be
caught if they commit a crime, convicted if caught, and imprisoned if convicted, the
lower the probability of criminal behavior. Thus, to some extent the incidence of
criminal acts will be reduced if the police are effective in arresting criminals, if the
prosecution is diligent in obtaining convictions, and if the courts have a harsh sentencing
policy. None of these relationships is especially strong, yet these findings are much
more significant for public policy that attempts to control crime than the social network
explanation of criminal behavior. Mayors and police chiefs can implement programs
Evaluating Evaluations
Evaluation has a richly diverse intellectual heritage. All the social science
disciplines—economics, psychology, sociology, political science, and anthropology—
have contributed to the development of the field. Individuals trained in each of these
disciplines have made contributions to the conceptual base of evaluation research and to
its methodological repertoire. Persons trained in the various human service professions
with close ties to the social sciences, medicine, public health, social welfare, urban
planning, public administration, education, and so on have made important
methodological contributions and have undertaken landmark evaluations. In addition, the
applied mathematics fields of statistics, biometrics, econometrics, and psychometrics
have contributed important ideas on measurement and analysis.
Cross-disciplinary borrowing has been extensive. Take the following examples:
Although economics traditionally has not been an experimentally based social science,
economists have designed and implemented a significant proportion of the federally
sponsored large-scale, randomized field experiments of the past several decades,
including the highly visible experiments in public welfare, employment training, income
maintenance, housing allowance, and national health insurance. Sociologists and
psychologists have borrowed heavily from the econometricians, notably in their use of
time-series analysis methods and simultaneous equation modeling. Sociologists have
contributed many of the conceptual and data collection procedures used in monitoring
organizational performance, and psychologists have contributed the idea of regression-
discontinuity designs to time-series analyses. Psychometricians have provided some of
the basic ideas underlying theories of measurement applicable to all fields, and
anthropologists have provided some of the basic approaches used in qualitative
fieldwork. Indeed, the vocabulary of evaluation is a mix from all of these disciplines.
The list of references at the back of this book is testimony to the multidisciplinary
character of the evaluation field.
In the abstract, the diverse roots of the field are one of its attractions. In practice,
however, they confront evaluators with the need to be general social scientists and
lifelong students if they are even to keep up, let alone broaden their knowledge base.
Furthermore, the diversity in the field accounts to a considerable extent for the
“improper” selection of research approaches for which evaluators are sometimes
criticized. Clearly, it is impossible for every evaluator to be a scholar in all of the
social sciences and to be an expert in every methodological procedure.
There is no ready solution to the need to have the broad knowledge base and range
of competencies ideally required by the “universal” evaluator. This situation means that
evaluators must at times forsake opportunities to undertake work because their
knowledge bases may be too narrow, that they may have to use an “almost good enough”
method rather than the appropriate one they are unfamiliar with, and that sponsors of
evaluations and managers of evaluation staffs must be highly selective in deciding on
contractors and in making work assignments. It also means that at times evaluators will
need to make heavy use of consultants and solicit advice from peers.
In a profession, a range of opportunities is provided for keeping up with the state of
the art and expanding one’s repertoire of competencies—for example, the peer learning
that occurs at regional and national meetings and the didactic courses provided by these
professional associations. At present, only a fraction of the many thousands of
evaluation practitioners participate in professional evaluation organizations and can
take advantage of the opportunities they provide.
The Education of Evaluators
The diffuse character of the evaluation field is exacerbated by the different ways in
which evaluators are educated. Few people in evaluation have achieved responsible
posts and rewards by working their way up from lowly jobs within evaluation
units. Most evaluators have some sort of formal graduate training either in social science
departments or in professional schools. One of the important consequences of the
multidisciplinary character of evaluation is that appropriate training for full
participation in it cannot be adequately undertaken within any single discipline. In a few
universities, interdisciplinary programs have been set up that include graduate
instruction across a number of departments. In these programs, a graduate student might
be directed to take courses in test construction and measurement in a department of
psychology, econometrics in a department of economics, survey design and analysis in a
department of sociology, policy analysis in a political science department, and so on.
Interdisciplinary training programs, however, are neither common nor very stable. In
the typical research-oriented university where graduate training is usually obtained, the
powerful units are the traditional departments. The interdepartmental coalitions of
faculty that form interdisciplinary programs tend to have short lives, because
departments typically do not reward participation in such ventures very highly and
faculty drift back into their departments as a consequence. The result is that too often the graduate training of evaluators is primarily unidisciplinary despite the clear need for it
to be multidisciplinary.
Moreover, within academic departments, applied work is often regarded less highly
than “pure” or “basic” research. As a consequence, training in evaluation-related
competencies is often limited. Psychology departments may provide fine courses on
experimental design but give little attention to the special problems of
implementing field experiments in comparison with laboratory studies; sociology
departments may teach survey research courses but not deal at all with the special data
collection problems involved in interviewing the unique populations that are typically
the targets of social programs. Then, too, the low status accorded applied work in
graduate departments often is a barrier to undertaking evaluations as dissertations and
theses.
If there is any advice to be given in this regard, it is that students who are interested
in an evaluation career must be assertive. Often the student must take the lead in hand-
tailoring an individual study program that includes course offerings in a range of
departments, be insistent about undertaking an applied dissertation or thesis, and seize on
any opportunities within university research institutes and in the community to
supplement formal instruction with relevant apprenticeship learning.
The other training route is the professional school. Schools of education train
evaluators for positions in that field, programs in schools of public health and medical
care produce persons who engage in health service evaluations, and so on. In fact, over
time these professional schools, as well as MBA programs, have become the training
sites for many evaluators.
These programs have their limitations as well. One criticism raised about them is
that they are too “trade school” oriented in outlook. Consequently, some of them fail to
provide the conceptual breadth and depth that allows graduates to move back and forth
across social program areas, and to grasp technical innovations when they occur.
Moreover, particularly at a master’s level, many professional schools are required to
have a number of mandatory courses, because their standing and sometimes their funding
depend on accreditation by professional bodies who see the need for common training if
graduates are going to leave as MSWs, MPHs, MBAs, and the like. Because many
programs therefore leave little time for electives, the amount of technical training that
can be obtained in courses is limited. Increasingly, the training of evaluators in
professional schools therefore has moved from the master’s to the doctoral level.
Also, in many universities both faculty and students in professional schools are
viewed as second-class citizens by those located in social science departments. This
elitism often isolates students so that they cannot take advantage of course offerings in
several social science departments or apprenticeship training in their affiliated social
science research institutes. Students trained in professional schools, particularly at the
master’s level, often trade off opportunities for intensive technical training for
substantive knowledge in a particular program area and the benefits of professional
certification. The obvious remedy is either undertaking further graduate work or seizing
opportunities for additional learning of technical skills while pursuing an evaluation
career.
We hold no brief for one route over the other; each has its advantages and liabilities.
Increasingly, it appears that professional schools are becoming the major suppliers of
evaluators, at least in part because of the reluctance of graduate social science
departments to develop appropriate applied research programs. But these professional
schools are far from homogeneous in what they teach, particularly in the methods of
evaluation they emphasize—thus the continued diversity of the field.
Consequences of Diversity in Origins
The existence of many educational pathways to becoming an evaluator contributes to
the lack of coherence in the field. It accounts, at least in part, for the differences in the
very definition of evaluation, and the different outlooks regarding the appropriate way
to evaluate a particular social program. Of course, other factors contribute to this
diversity, including social and political ideologies of evaluators.
Some of the differences are related to whether the evaluator is educated in a
professional school or a social science department. For example, evaluators who come
out of professional schools such as social work or education are much more likely than
those trained in, say, sociology to see themselves as part of the program staff and to give
priority to tasks that help program managers. Thus, they are likely to stress formative
evaluations that are designed to improve the day-to-day operations of programs,
whereas the more social-science minded are more likely to be primarily concerned with
effectiveness and efficiency issues.
The diversity is also related to differences among social science departments and
among professional schools. Evaluators trained as political scientists frequently are
oriented to policy analysis, an activity designed to aid legislators and high-level
executives, particularly government administrators. Anthropologists, as one might
expect, are predisposed to qualitative approaches and are unusually attentive to target
populations’ interests in evaluation outcomes. Psychologists, in keeping with their
discipline’s emphasis on small-scale experiments, often are concerned more with the
validity of the causal inference in their evaluations than the generalizability to program
practice. In contrast, sociologists are often more concerned with the potential for
generalization and are more willing to forsake some degree of rigor in the causal
conclusions to achieve it. Economists are likely to work in still different ways, depending on the body of microeconomic theory to guide their evaluation designs.
Similar diversity can be found among those educated in different professional
schools. Evaluators trained in schools of education may focus on educational
competency tests in measuring the outcome of early-childhood education programs,
whereas social work graduates may focus on caseworker ratings of children’s emotional
status and parental reports of their behavior. Persons coming from schools of public
health may be most interested in preventive practices, those from medical care
administration programs in frequency of physician encounters and duration of
hospitalization, and so on.
It is easy to exaggerate the distinctive outlook that each discipline and profession
manifests in approaching the design and conduct of evaluations, and there are many
exceptions to the preferences and tendencies just described. Indeed, a favorite game
among evaluation buffs is to guess an author’s disciplinary background from the content
of an article he or she has written. Nevertheless, disciplinary and professional diversity
has produced a fair degree of conflict within the field of evaluation. Evaluators hold
divided views on topics ranging from epistemology to the choice of methods and the
major goals of evaluation. Some of the major divisions are described briefly below.
Epistemological differences. The “culture wars” being waged in the humanities and some of the social sciences have touched evaluation as well.
Postmodern theories of knowledge are reflected in evaluation with claims that social
problems are social constructions and that knowledge is not absolute but, rather, that
there are different “truths,” each valid for the perspective from which it derives.
Postmodernists tend to favor qualitative research methods that produce rich
“naturalistic” data and evaluation perspectives favoring those of the program personnel
and target populations. (See Guba and Lincoln, 1989, for a foremost exponent of
postmodern evaluation.)
Those who oppose the postmodern position are not homogeneous in their beliefs on
the nature of knowledge. Nevertheless, among the opponents of postmodernism there is strong consensus that truth is not entirely relativistic. For example, while most
believe that the definition of poverty is a social construction, they are also convinced
that the distribution of annual incomes can be described through research operations on
which most social scientists can agree. That is, whether a given income level is
regarded as poverty is a matter of social judgment, but the number of households at that
income level can be estimated with a known sampling error. This position implies that
disagreements among researchers on empirical findings are mainly matters of method or
measurement error rather than matters involving different truths.
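To make the point about sampling error concrete, the following sketch (in Python, with purely hypothetical survey numbers) shows how the proportion of households below a chosen income line, and the uncertainty attached to that estimate, might be computed; whether that line constitutes poverty remains a matter of social judgment.

```python
import math

# Hypothetical survey result: 1,200 of 8,000 sampled households report
# annual incomes below a chosen income line.
n = 8000
below_line = 1200

p_hat = below_line / n                     # estimated proportion of households
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error of that proportion

# Approximate 95% confidence interval for the estimate
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"Estimated proportion: {p_hat:.3f} "
      f"(95% CI roughly {low:.3f} to {high:.3f})")
```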
Our own position, as exemplified throughout this book, is clearly not postmodern.
We believe that there are close matches between methods and evaluation problems. For
given research questions, there are better methods and poorer methods. Indeed, the
major concern in this book is how to choose the method for a given research question
that is likely to produce the most credible findings.
The qualitative-quantitative division. Coinciding with some of the divisions within the
evaluation community is the division between those who advocate qualitative methods
and those who argue for quantitative ones. A sometimes pointless literature has
developed around this “issue.” On one side, the advocates of qualitative approaches
stress the need for intimate knowledge and acquaintance with a program’s concrete
manifestations in attaining valid knowledge about the program’s effects. Qualitative
evaluators tend to be oriented toward formative evaluation, that is, making a program
work better by feeding information on the program to its managers. In contrast,
quantitatively oriented evaluators often view the field as being primarily concerned
with impact assessments or summative evaluation. They focus on developing measures
of program characteristics, processes, and impact that allow program effectiveness to
be assessed with relatively high credibility.
Often the polemics obscure the critical point—namely, that each approach has
utility, and the choice of approaches depends on the evaluation question at hand. We
have tried in this book to identify the appropriate applications of each viewpoint. As we
have stressed, qualitative approaches can play critical roles in program design and are
important means of monitoring programs. In contrast, quantitative approaches are much
more appropriate in estimates of impact as well as in assessments of the efficiency of
social program efforts. (For a balanced discussion of the qualitative-quantitative debate, see Reichardt and Rallis, 1994.)
Thus, it is fruitless to raise the issue of which is the better approach without
specifying the evaluation questions to be studied. Fitting the approach to the research
purposes is the critical issue; to pit one approach against the other in the abstract results
in a pointless dichotomization of the field. Even the most avid proponents of one
approach or the other recognize the contribution each makes to social program
evaluations (Cronbach, 1982; Patton, 1997). Indeed, the use of multiple methods, often
referred to as triangulation, can strengthen the validity of findings if results produced
by different methods are congruent. Using multiple methods is a means of offsetting
different kinds of bias and measurement error (for an extended discussion of this point,
see Greene and Caracelli, 1997).
The problem, as we see it, is both philosophical and strategic. Evaluations are undertaken primarily as contributions to policy and program formulation and modification activities that, as we have stressed, have a strong political dimension. As Chelimsky (1987) has observed, “It is rarely prudent to enter a burning political debate armed only with a case study” (p. 27).
The diversity of the evaluation field is also manifest in the variety of settings and
bureaucratic structures in which evaluators work. First, there are two contradictory
theses about working arrangements, or what might be called the insider-outsider debate.
One position is that evaluators are best off when their positions are as secure and
independent as possible from the influence of project management and staff. The other is
that sustained contact with the policy and program staff enhances evaluators’ work by
providing a better understanding of the organization’s objectives and activities while
inspiring trust and thus increasing the evaluator’s influence.
Second, there are ambiguities surrounding the role of the evaluator vis-à-vis
program staff and groups of stakeholders regardless of whether the evaluator is an
insider or outsider. The extent to which relations between staff members should resemble the structures typical of corporations or the collegial model that supposedly characterizes academia is one such issue, but it is only one dimension of the challenge of structuring appropriate working relationships that confronts the evaluator.
Third, there is the concern on the part of evaluators with the “standing” of the
organizations with which they are affiliated. Like universities, the settings in which
evaluators work can be ranked and rated along a number of dimensions, and a relatively small number of large evaluation organizations constitute a recognized elite subset of workplaces.
Whether it is better to be a small fish in a big pond or vice versa is an issue in the
evaluation field.
The discussion that follows, it bears emphasis, is based more on impressions of the
authors of this text than on empirical research findings. Our impressions may be faulty,
but it is a fact that debates surrounding these issues are commonplace whenever a
critical mass of evaluators congregates.
Inside Versus Outside Evaluations
In the past, some experienced evaluators went so far as to state categorically that
evaluations should never be undertaken within the organization responsible for the
administration of a project, but should always be conducted by an outside group. One
reason “outsider” evaluations may have seemed the desired option is that there were
differences in the levels of training and presumed competence of insider and outsider
evaluation staffs. These differences have narrowed. The career of an evaluation
researcher has typically taken one of three forms. Until the 1960s, a large proportion of
evaluation research was done by either university-affiliated researchers or research
firms. Since the late 1960s, public service agencies in various program areas have been
hiring researchers for staff positions to conduct more in-house evaluations. Also, the
proportion of evaluations done by private, for-profit research groups has increased
markedly. As research positions in both types of organizations have increased and the
academic job market has declined, more persons who are well trained in the social and
behavioral sciences have gravitated toward research jobs in public agencies and for-
profit firms.
The current evidence is far from clear regarding whether inside or outside
evaluations are more likely to be of higher technical quality. But technical quality is not
the only criterion; utility may be just as important. A study in the Netherlands of external
and internal evaluations suggests that internal evaluations may have a higher rate of
impact on organizational decisions. According to van de Vall and Bolas (1981), the important question is not which category of researchers excels at influencing social policy but which variables are responsible for the higher rate of utilization of internal researchers’ findings. The answer, they suggest, lies partly in a higher rate of communication
between inside researchers and policymakers, accompanied by greater consensus, and
partly in a balance between standards of epistemological and implemental validity: “In
operational terms, this means that social policy researchers should seek equilibrium
between time devoted to methodological perfection and translating results into policy
measures” (p. 479). Their data suggest that, at present, in-house social researchers are in
a more favorable position than external researchers for achieving these instrumental
goals.
Given the increased competence of staff and the visibility and scrutiny of the
evaluation enterprise, there is no reason now to favor one organizational arrangement
over another. Nevertheless, there remain many critical points during an evaluation when
there are opportunities for work to be misdirected and consequently misused
irrespective of the locus of the evaluators. The important issue, therefore, is that any
evaluation strikes an appropriate balance between technical quality and utility for its
purposes, recognizing that those purposes may often be different for internal evaluations
than for external ones.
Organizational Roles
Whether evaluators are insiders or outsiders, they need to cultivate clear
understandings of their roles with sponsors and program staff. Evaluators’ full
comprehension of their roles and responsibilities is one major element in the successful
conduct of an evaluation effort.
Again, the heterogeneity of the field makes it difficult to generalize about the best ways to develop and maintain appropriate working relations. One common mechanism is
to have in place advisory groups or one or more consultants to oversee evaluations and
provide some aura of authenticity to the findings. The ways such advisory groups or
consultants work depend on whether an inside or an outside evaluation is involved and
on the sophistication of both the evaluator and the program staff. For example, large-
scale evaluations undertaken by federal agencies and major foundations often have
advisory groups that meet regularly and assess the quality, quantity, and direction of the
work. Some public and private health and welfare organizations with small evaluation
units have consultants who provide technical advice to the evaluators or advise agency
directors on the appropriateness of the evaluation units’ activities, or both.
Sometimes advisory groups and consultants are mere window dressing; we do not
recommend their use if that is their only function. When members are actively engaged,
however, advisory groups can be particularly useful in fostering interdisciplinary
evaluation approaches, in adjudicating disputes between program and evaluation staffs,
and in defending evaluation findings in the face of concerted attacks by those whose
interests are threatened.
C. Integrity/honesty: Evaluators ensure the honesty and integrity of the entire evaluation
process.
D. Respect for people: Evaluators respect the security, dignity, and self-worth of the
respondents, program participants, clients, and other stakeholders with whom they
interact.
E. Responsibilities for general and public welfare: Evaluators articulate and take into
account the diversity of interests and values that may be related to the general and
public welfare.
These five principles are elaborated and discussed in the Guiding Principles,
although not to the detailed extent found in the Joint Committee’s work. Just how useful
such general principles may be is problematic. An evaluator who has a specific ethical
problem will likely find very little guidance in any one of them. (See Shadish, Newman,
et al., 1995, for critical appraisals of the Guiding Principles.)
We expect that developing a set of practice standards and ethical principles that can
provide pointed advice to evaluators will take some time. The diversity of evaluation
styles will make it difficult to adopt standards because any practice so designated may
contradict what some group may consider good practice. The development of standards
would be considerably advanced by the existence of case law, the accumulation of
adjudicated specific instances in which the principles have been applied. However,
neither the Joint Committee’s Standards nor the American Evaluation Association’s
Guiding Principles has any mode of enforcement, the usual institutional mechanism for
the development of case law.
Exhibit 12-E
The American Evaluation Association’s Guiding Principles for Evaluators
2. Evaluators should explore with the client the shortcomings and strengths both of
the various evaluation questions it might be productive to ask and the various
approaches that might be used for answering those questions.
3. When presenting their work, evaluators should communicate their methods and
approaches accurately and in sufficient detail to allow others to understand,
interpret, and critique their work. They should make clear the limitations of an
evaluation and its results. Evaluators should discuss in a contextually appropriate
way those values, assumptions, theories, methods, results, and analyses that
significantly affect the interpretation of the evaluative findings. These statements
apply to all aspects of the evaluation, from its initial conceptualization to the
eventual use of findings.
1. Evaluators should possess (or, here and elsewhere as appropriate, ensure that the
evaluation team possesses) the education, abilities, skills, and experience
appropriate to undertake the tasks proposed in the evaluation.
2. Evaluators should practice within the limits of their professional training and
competence and should decline to conduct evaluations that fall substantially outside
those limits. When declining the commission or request is not feasible or
appropriate, evaluators should make clear any significant limitations on the
evaluation that might result. Evaluators should make every effort to gain the
competence directly or through the assistance of others who possess the required
expertise.
2. Evaluators should record all changes made in the originally negotiated project
plans, and the reasons why the changes were made. If those changes would
significantly affect the scope and likely results of the evaluation, the evaluator
should inform the client and other important stakeholders in a timely fashion
(barring good reason to the contrary, before proceeding with further work) of the
changes and their likely impact.
7. Barring compelling reason to the contrary, evaluators should disclose all sources
of financial support for an evaluation, and the source of the request for the
evaluation.
D. Respect for people: Evaluators respect the security, dignity, and self-worth of
the respondents, program participants, clients, and other stakeholders with whom
they interact.
3. Knowing that evaluations often will negatively affect the interests of some
stakeholders, evaluators should conduct the evaluation and communicate its results
in a way that clearly respects the stakeholders’ dignity and self-worth.
4. Where feasible, evaluators should attempt to foster the social equity of the
evaluation, so that those who give to the evaluation can receive some benefits in
return. For example, evaluators should seek to ensure that those who bear the
burdens of contributing data and incurring any risks are doing so willingly and that
they have full knowledge of, and maximum feasible opportunity to obtain, any
benefits that may be produced from the evaluation. When it would not endanger the
integrity of the evaluation, respondents or program participants should be informed
if and how they can receive services to which they are otherwise entitled without
participating in the evaluation.
E. Responsibilities for general and public welfare: Evaluators articulate and take
into account the diversity of interests and values that may be related to the
general and public welfare.
2. Evaluators should consider not only the immediate operations and outcomes of
whatever is being evaluated but also the broad assumptions, implications, and
potential side effects of it.
4. Evaluators should maintain a balance between client needs and other needs.
Evaluators necessarily have a special relationship with the client who funds or
requests the evaluation. By virtue of that relationship, evaluators must strive to meet
legitimate client needs whenever it is feasible and appropriate to do so. However,
that relationship can also place evaluators in difficult dilemmas when client
interests conflict with other interests, or when client interests conflict with the
obligation of evaluators for systematic inquiry, competence, integrity, and respect
for people. In these cases, evaluators should explicitly identify and discuss the
conflicts with the client and relevant stakeholders, resolve them when possible,
determine whether continued work on the evaluation is advisable if the conflicts
cannot be resolved, and make clear any significant limitations on the evaluation that
might result if the conflict is not resolved.
5. Evaluators have obligations that encompass the public interest and good. These
obligations are especially important when evaluators are supported by publicly
generated funds, but clear threats to the public good should never be ignored in any
evaluation. Because the public interest and good are rarely the same as the interests
of any particular group (including those of the client or funding agency), evaluators
will usually have to go beyond an analysis of particular stakeholder interests when
considering the welfare of society as a whole.
Until such evaluation standards and ethical rules are established, evaluators will
have to rely on such general principles as the profession appears to be currently willing
to endorse. A useful discussion of the many issues of applied ethics for program
evaluation can be found in Newman and Brown (1996).
Evaluators should understand that the Guiding Principles do not supersede ethical
standards imposed by most human services agencies and universities. Most social
research centers and almost all universities have standing committees that deal with
research involving humans, and most require that research plans be submitted in
advance for approval. Almost all such reviews focus on informed consent, upholding the principle that research subjects should, in most cases, be informed about the research in which they are asked to participate and the risks to which they may be exposed, and should consent to becoming research subjects. In addition, most professional associations
(e.g., the American Sociological Association, the American Psychological Association)
have ethics codes that are applicable and may provide useful guidance on professional issues such as proper acknowledgment of collaborators, avoiding exploitation of research assistants, and so on.
How to apply such guidelines in pursuing evaluations is both easy and difficult. It is
easy in the sense that the guidelines uphold general ethical standards that anyone would
follow in all situations but difficult in cases when the demands of the research might
appear to conflict with a standard. For example, an evaluator in need of business might
be tempted to bid on an evaluation that called for using methods with which he is not
familiar, an action that might be in conflict with one of the Guiding Principles. In
another case, an evaluator might worry whether the procedures she intends to use
provide sufficient information for participants to understand that there are risks to
participation. In such cases, our advice to the evaluator is to consult other experienced evaluators and, in any event, to avoid taking actions that might appear to conflict with the guidelines.
No doubt every evaluator has had moments of glorious dreams in which a grateful
world receives with adulation the findings of his or her evaluation and puts the results
immediately and directly to use. Most of our dreams must remain dreams. We would argue, however, that the conceptual utilization of evaluations often provides important inputs into policy and program development and should not be regarded as finishing
the race in second place. Conceptual utilization may not be as visible to peers or
sponsors, yet this use of evaluations deeply affects the community as a whole or critical
segments of it.
“Conceptual use” includes the variety of ways in which evaluations indirectly have
an impact on policies, programs, and procedures. This impact ranges from sensitizing
persons and groups to current and emerging social problems to influencing future
program and policy development by contributing to the cumulative results of a series of
evaluations.
Evaluations perform a sensitizing role by documenting the incidence, prevalence,
and distinguishing features of social problems. Diagnostic evaluation activities,
described in Chapter 4, have provided clearer and more precise understanding of
changes occurring in the family system, critical information on the location and
distribution of unemployed persons, and other meaningful descriptions of the social
world.
Impact assessments, too, have conceptual utility. A specific example is the current
concern with “notch” groups in the development of medical care policy. Evaluations of
programs to provide medical care to the poor have found that the very poor, those who
are eligible for public programs such as Medicaid, often are adequately provided with
health services. Those just above them—in the “notch” group that is not eligible for
public programs—tend to fall in the cracks between public assistance and being able to
provide for their own care. They have decidedly more difficulty receiving services,
and, when seriously ill, represent a major burden on community hospitals, which cannot
turn them away yet receive reimbursement from neither the patients nor the
government. Concern with the near-poor, or notch group, is increasing because of their
exclusion from a wide range of health, mental health, and social service programs.
An interesting example of a study that had considerable long-term impact is the now
classic Coleman report on educational opportunity (Coleman et al., 1966). The initial
impetus for this study was a 1964 congressional mandate to the (then) Office of
Education to provide information on the quality of educational opportunities provided to
minority students in the United States. Its actual effect was much more far-reaching: The
report changed the conventional wisdom about the characteristics of good and bad
educational settings, turning policy and program interest away from problems of fiscal
support to ways of improving teaching methods (Moynihan, 1991).
The conceptual use of evaluation results creeps into the policy and program worlds
by a variety of routes, usually circuitous, that are difficult to trace. For example, Coleman’s report did not become a Government Printing Office best-seller; it
is unlikely that more than a few hundred people actually read it cover to cover. In 1967,
a year after his report had been published by the Government Printing Office, Coleman
was convinced that it had been buried in the National Archives and would never emerge
again. But journalists wrote about it, essayists summarized its arguments, and major
editorial writers mentioned it. Through these communication brokers, the findings
became known to policymakers in the education field and to politicians at all levels of
government.
Eventually, the findings in one form or another reached a wide and influential
audience. Indeed, by the time Caplan and his associates (Caplan and Nelson, 1973)
questioned influential political figures in Washington about which social scientists had
influenced them, Coleman’s name was among the most prominently and consistently
mentioned.
Some of the conceptual utilizations of evaluations may be described simply as
consciousness-raising. For example, the development of early-childhood education
programs was stimulated by the evaluation findings resulting from an impact assessment
of Sesame Street. The evaluation found that although the program did have an effect on
young children’s educational skills, the magnitude of the effect was not as large as the
program staff and sponsors imagined it would be. Prior to the evaluation, some
educators were convinced that the program represented the “ultimate” solution and that they could turn their attention to other educational problems. The evaluation findings led
to the conviction that early-childhood education was in need of further research and
development.
As in the case of direct utilization, evaluators have an obligation to do their work in
ways that maximize conceptual utilization. In a sense, however, efforts at maximizing
conceptual utilization are more difficult to devise than ones to optimize direct use. To
the extent that evaluators are hired guns and turn to new ventures after completing an
evaluation, they may not be around or have the resources to follow through on promoting
conceptual utilization. Sponsors of evaluations and other stakeholders who more
consistently maintain a commitment to particular social policy and social problem areas
must assume at least some of the responsibility, if not the major portion, for maximizing
the conceptual use of evaluations. Often these parties are in a position to perform the
broker function alluded to earlier.
Relevance
Communication between researchers and users
Information processing by users
Plausibility of research results
User involvement or advocacy
Exhibit 12-F
Truth Tests and Utility Tests
Utility tests: Does the research provide direction? Does it yield guidance either
for immediate action or for considering alternative approaches to problems? The
two specific components are
1. Action orientation: Does the research show how to make feasible changes in
things that can feasibly be changed?
2. Challenge to the status quo: Does the research challenge current philosophy,
program, or practice? Does it offer new perspectives?
Together with relevance (i.e., the match between the topic of the research and
the person’s job responsibilities), the four components listed above constitute the
frames of reference by which decisionmakers assess social science research.
Research quality and conformity to user expectations form a single truth test in that
their effects are contingent on each other: Research quality is less important for the
usefulness of a study when results are congruent with officials’ prior knowledge
than when results are unexpected or counterintuitive. Action orientation and
challenge to the status quo represent alternative functions that a study can serve.
They constitute a utility test, since the kind of explicit and practical direction
captured by the action orientation frame is more important for a study’s usefulness
when the study provides little criticism or reorientation (challenge to the status quo)
than it is when challenge is high. Conversely, the criticisms of programs and the new
perspectives embedded in challenge to the status quo add more to usefulness when a
study lacks prescriptions for implementation.
2. Evaluation results must be timely and available when needed. Evaluators must,
therefore, balance thoroughness and completeness of analysis with timing and
accessibility of findings. In doing so, they may have to risk criticism from some of their
academic colleagues, whose concepts of scholarship cannot always be met because of
the need for rapid results and crisp reporting.
Although these guidelines are relevant to the utilization of all program evaluations,
the roles of evaluation consumers do differ. Clearly, these differing roles affect the uses
to which information is put and, consequently, the choice of mechanisms for maximizing
utility. For example, if evaluations are to influence federal legislation and policies, they
must be conducted and “packaged” in ways that meet the needs of congressional
staff. For the case of educational evaluation and legislation, Florio, Behrmann, and Goltz
(1979) furnished a useful summary of requirements that rings as true today as when it
was compiled (see Exhibit 12-G).
Exhibit 12-G
Educational Evaluation: The Unmet Potential
The timing of study reports and their relevance to questions before the Congress
were major barriers repeatedly mentioned by congressional staff. A senior policy
analyst for the Assistant Secretary of Education compared the policy process to a
moving train. She suggested that information providers have the obligation to know
the policy cycle and meet it on its own terms. The credibility problem is also one
that plagues social inquiry. The Deputy Director of the White House Domestic
Policy staff said that all social science suffers from the perception that it is
unreliable and not policy-relevant. His comments were reflected by several of the
staffers interviewed; for example, “Research rarely provides definitive
conclusions,” or “For every finding, others negate it,” or “Educational research can
rarely be replicated and there are few standards that can be applied to assess the
research products.” One went so far as to call project evaluations lies, then
reconsidered and called them embellishments.
It must be pointed out that the distinctions among different types of inquiry (research, evaluation, data collection, and so on) are rarely made by the recipients of
knowledge and information. If project evaluations are viewed as fabrications, it
reflects negatively on the entire educational inquiry community. Even when policy-
relevant research is presented in time to meet the moving train, staffers complain of
having too much information that cannot be easily assimilated, or that studies are
poorly packaged, contain too much technical jargon, and are too self-serving.
Several said that researchers write for other researchers and rarely, except in
congressionally mandated studies, tailor their language to the decision-making
audiences in the legislative process.
Summary
Evaluation is purposeful, applied social research. In contrast to basic research,
evaluation is undertaken to solve practical problems. Its practitioners must be
conversant with methods from several disciplines and able to apply them to many types
of problems. Furthermore, the criteria for judging the work include its utilization and
hence its impact on programs and the human condition.
Because the value of their work depends on its utilization by others, evaluators
must understand the social ecology of the arena in which they work.
Evaluators must put a high priority on deliberately planning for the dissemination
of the results of their work. In particular, they need to become “secondary
disseminators” who package their findings in ways that are geared to the needs and
competencies of a broad range of relevant stakeholders.
Two significant strains that result from the political nature of evaluation are (1)
the different requirements of political time and evaluation time and (2) the need for
evaluations to have policy-making relevance and significance. With respect to both of
these sets of issues, evaluators must look beyond considerations of technical excellence
and pure science, mindful of the larger context in which they are working and the
purposes being served by the evaluation.
A small group of elite evaluation organizations and their staffs occupy a strategic
position in the field and account for most large-scale evaluations. As their own methods
and standards improve, these organizations are contributing to the movement toward
professionalization of the field.
Evaluative studies are worthwhile only if they are used. Three types of utilization
are direct (instrumental), conceptual, and persuasive. Although in the past considerable doubt has been cast on the direct utility of evaluations, there is reason to believe they
do have an impact on program development and modification. At least as important, the
conceptual utilization of evaluations appears to have a definite effect on policy and
program development, as well as social priorities, albeit one that is not always easy to
trace.
KEY CONCEPTS
Conceptual utilization
Long-term, indirect utilization of the ideas and findings of an evaluation.
Direct (instrumental) utilization
Explicit utilization of specific ideas and findings of an evaluation by decisionmakers
and other stakeholders.
Policy significance
The significance of an evaluation’s findings for policy and program development (as
opposed to their statistical significance).
Policy space
The set of policy alternatives that are within the bounds of acceptability to policymakers
at a given point in time.
Primary dissemination
Dissemination of the detailed findings of an evaluation to sponsors and technical
audiences.
Secondary dissemination
Dissemination of summarized, often simplified, findings of evaluations to audiences
composed of stakeholders.
Accessibility
The extent to which the structural and organizational arrangements facilitate
participation in the program.
Accountability
The responsibility of program staff to provide evidence to stakeholders and sponsors
that a program is effective and in conformity with its coverage, service, legal, and fiscal
requirements.
Accounting perspectives
Perspectives underlying decisions on which categories of goods and services to include
as costs or benefits in an efficiency analysis.
Administrative standards
Stipulated achievement levels set by program administrators or other responsible
parties, for example, intake for 90% of the referrals within one month. These levels may
be set on the basis of past experience, the performance of comparable programs, or
professional judgment.
Attrition
The loss of outcome data measured on targets assigned to control or intervention groups,
usually because targets cannot be located or refuse to contribute data.
Benefits
Positive program outcomes, usually translated into monetary terms in cost-benefit
analysis or compared with costs in cost-effectiveness analysis. Benefits may include
both direct and indirect outcomes.
Bias
As applied to program coverage, the extent to which subgroups of a target population
are reached unequally by a program.
Catchment area
The geographic area served by a program.
Conceptual utilization
Long-term, indirect utilization of the ideas and findings of an evaluation.
Control group
A group of targets that do not receive the program intervention and that is compared on
outcome measures with one or more groups that do receive the intervention. Compare
intervention group.
Cost-benefit analysis
Analytical procedure for determining the economic efficiency of a program, expressed
as the relationship between costs and outcomes, usually measured in monetary terms.
Cost-effectiveness analysis
Analytical procedure for determining the efficacy of a program in achieving given
intervention outcomes in relation to the program costs.
Costs
Inputs, both direct and indirect, required to produce an intervention.
Coverage
The extent to which a program reaches its intended target population.
Discounting
The treatment of time in valuing costs and benefits of a program in efficiency analyses,
that is, the adjustment of costs and benefits to their present values, requiring a choice of
discount rate and time frame.
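As an illustration of discounting, the sketch below (in Python, with hypothetical figures and an arbitrarily chosen discount rate) converts yearly streams of program costs and benefits to present values; the difference between the two discounted totals corresponds to the net benefits defined later in this glossary.

```python
def present_value(amounts, rate):
    """Discount a stream of yearly amounts (year 0 first) to present value."""
    return sum(a / (1 + rate) ** t for t, a in enumerate(amounts))

# Hypothetical program: costs concentrated early, benefits arriving later.
costs    = [100_000, 20_000, 20_000, 20_000]   # years 0-3
benefits = [0, 60_000, 70_000, 80_000]         # years 0-3
rate = 0.05                                    # chosen discount rate

net_benefits = present_value(benefits, rate) - present_value(costs, rate)
print(f"Net discounted benefits: {net_benefits:,.0f}")
```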
Distributional effects
Effects of programs that result in a redistribution of resources in the general population.
Efficiency assessment
An evaluative study that answers questions about program costs in comparison to either
the monetary value of its benefits or its effectiveness in terms of the changes brought
about in the social conditions it addresses.
Empowerment evaluation
A participatory or collaborative evaluation in which the evaluator’s role includes
consultation and facilitation directed toward the development of the capabilities of the
participating stakeholders to conduct evaluation on their own, to use it effectively for
advocacy and change, and to have some influence on a program that affects their lives.
Evaluability assessment
Negotiation and investigation undertaken jointly by the evaluator, the evaluation
sponsor, and possibly other stakeholders to determine whether a program meets the
preconditions for evaluation and, if so, how the evaluation should be designed to ensure
maximum utility.
Evaluation questions
A set of questions developed by the evaluator, evaluation sponsor, and other
stakeholders; the questions define the issues the evaluation will investigate and are
stated in terms such that they can be answered using methods available to the evaluator
in a way useful to stakeholders.
Evaluation sponsor
The person, group, or organization that requests or requires the evaluation and provides
the resources to conduct it.
Focus group
A small panel of persons selected for their knowledge or perspective on a topic of
interest that is convened to discuss the topic with the assistance of a facilitator. The
discussion is used to identify important themes or to construct descriptive summaries of
views and experiences on the focal topic.
Formative evaluation
Evaluative activities undertaken to furnish information that will guide program
improvement.
Impact
See program effect.
Impact assessment
An evaluative study that answers questions about program outcomes and impact on the
social conditions it is intended to ameliorate. Also known as an impact evaluation or an
outcome evaluation.
Impact theory
A causal theory describing cause-and-effect sequences in which certain program
activities are the instigating causes and certain social benefits are the effects they
eventually produce.
Implementation failure
The program does not adequately perform the activities specified in the program design
that are assumed to be necessary for bringing about the intended social improvements. It
includes situations in which no service, not enough service, or the wrong service is
delivered, or the service varies excessively across the target population.
Incidence
The number of new cases of a particular problem or condition that arise in a specified
area during a specified period of time. Compare prevalence.
Independent evaluation
An evaluation in which the evaluator has the primary responsibility for developing the
evaluation plan, conducting the evaluation, and disseminating the results.
Intervention group
A group of targets that receive an intervention and whose outcome measures are
compared with those of one or more control groups. Compare control group.
Key informants
Persons whose personal or professional position gives them a knowledgeable
perspective on the nature and scope of a social problem or a target population and
whose views are obtained during a needs assessment.
Matching
Constructing a control group by selecting targets (individually or as aggregates) that are
identical on specified characteristics to those in an intervention group except for receipt
of the intervention.
Mediator variable
In an impact assessment, a proximal outcome that changes as a result of exposure to the
program and then, in turn, influences a more distal outcome. The mediator is thus an
intervening variable that provides a link in the causal sequence through which the
program brings about change in the distal outcome.
Meta-analysis
An analysis of effect size statistics derived from the quantitative results of multiple
studies of the same or similar interventions for the purpose of summarizing and
comparing the findings of that set of studies.
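The sketch below illustrates the core computation of a simple fixed-effect meta-analysis, pooling hypothetical effect sizes by inverse-variance weighting; actual meta-analyses involve many additional steps (coding studies, testing homogeneity of effects, and so on).

```python
# Hypothetical effect sizes (standardized mean differences) and their
# variances from several studies of similar interventions.
effects   = [0.30, 0.10, 0.45, 0.22]
variances = [0.02, 0.05, 0.04, 0.01]

weights = [1 / v for v in variances]            # inverse-variance weights
pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"Pooled effect size: {pooled:.2f} (SE {pooled_se:.2f})")
```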
Moderator variable
In an impact assessment, a variable, such as gender or age, that characterizes subgroups
for which program effects may differ.
Needs assessment
An evaluative study that answers questions about the social conditions a program is
intended to address and the need for the program.
Net benefits
The total discounted benefits minus the total discounted costs. Also called net rate of
return.
Odds ratio
An effect size statistic that expresses the odds of a successful outcome for the
intervention group relative to that of the control group.
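A minimal illustration, using hypothetical counts of successful and unsuccessful outcomes in each group:

```python
# Hypothetical 2 x 2 outcome table (counts of successes and failures).
intervention_success, intervention_failure = 60, 40
control_success, control_failure = 45, 55

odds_intervention = intervention_success / intervention_failure
odds_control = control_success / control_failure
odds_ratio = odds_intervention / odds_control   # > 1 favors the intervention

print(f"Odds ratio: {odds_ratio:.2f}")
```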
Opportunity costs
The value of opportunities forgone because of an intervention program.
Organizational plan
Assumptions and expectations about what the program must do to bring about the
transactions between the target population and the program that will produce the
intended changes in social conditions. The program’s organizational plan is articulated
from the perspective of program management and encompasses both the functions and
activities the program is expected to perform and the human, financial, and physical
resources required for that performance.
Outcome
The state of the target population or the social conditions that a program is expected to
have changed.
Outcome change
The difference between outcome levels at different points in time. See also outcome
level.
Outcome level
The status of an outcome at some point in time. See also outcome.
Outcome monitoring
The continual measurement and reporting of indicators of the status of the social
conditions a program is accountable for improving.
Performance criterion
Policy significance
The significance of an evaluation’s findings for policy and program development (as
opposed to their statistical significance).
Policy space
The set of policy alternatives that are within the bounds of acceptability to policymakers
at a given point in time.
Population at risk
The individuals or units in a specified area with characteristics indicating that they have
a significant probability of having or developing a particular condition.
Population in need
The individuals or units in a specified area that currently manifest a particular
problematic condition.
Pre-post design
A reflexive control design in which only one measure is taken before and after the
intervention.
Prevalence
The total number of existing cases with a particular condition in a specified area at a
specified time. Compare incidence.
Primary dissemination
Dissemination of the detailed findings of an evaluation to sponsors and technical
audiences.
Process evaluation
Process theory
The combination of the program’s organizational plan and its service utilization plan
into an overall description of the assumptions and expectations about how the program
is supposed to operate.
Program effect
That portion of an outcome change that can be attributed uniquely to a program, that is,
with the influence of other sources controlled or removed; also termed the program’s
impact. See also outcome change.
Program evaluation
The use of social research methods to systematically investigate the effectiveness of
social intervention programs in ways that are adapted to their political and
organizational environments and are designed to inform social action in ways that
improve social conditions.
Program goal
A statement, usually general and abstract, of a desired state toward which a program is
directed. Compare with program objectives.
Program monitoring
The systematic documentation of aspects of program performance that are indicative of
whether the program is functioning as intended or according to some appropriate
standard. Monitoring generally involves program performance related to program
process, program outcomes, or both.
Program objectives
Specific statements detailing the desired accomplishments of a program together with
one or more measurable criteria of success.
Program theory
The set of assumptions about the manner in which a program relates to the social
benefits it is expected to produce and the strategy and tactics the program has adopted to
achieve its goals and objectives. Within program theory we can distinguish impact
theory, relating to the nature of the change in social conditions brought about by
program action, and process theory, which depicts the program’s organizational plan
and service utilization plan.
Quasi-experiment
An impact research design in which intervention and control groups are formed by a
procedure other than random assignment.
Randomization
Assignment of potential targets to intervention and control groups on the basis of chance
so that every unit in a target population has the same probability as any other to be
selected for either group.
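A minimal sketch of chance-based assignment, using hypothetical case identifiers and a fixed random seed so that the allocation can be reproduced:

```python
import random

# Hypothetical pool of eligible targets, each with the same chance of
# ending up in either group.
targets = [f"case-{i:03d}" for i in range(1, 101)]

random.seed(42)                  # fixed seed so the assignment is reproducible
random.shuffle(targets)
intervention_group = targets[:50]
control_group = targets[50:]

print(len(intervention_group), len(control_group))
```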
Rate
The occurrence or existence of a particular condition expressed as a proportion of units
in the relevant population (e.g., deaths per 1,000 adults).
Reflexive controls
Measures of an outcome variable taken on participating targets before intervention and
used as control observations. See also pre-post design; time-series design.
Regression-discontinuity design
Reliability
The extent to which a measure produces the same results when used repeatedly to
measure the same thing.
Sample survey
A survey administered to a sample of units in the population. The results are
extrapolated to the entire population of interest by statistical projections.
Secondary dissemination
Dissemination of summarized, often simplified, findings of evaluations to audiences
composed of stakeholders.
Secondary effects
Effects of a program that impose costs on persons or groups who are not targets.
Selection bias
Systematic under- or overestimation of program effects that results from uncontrolled
differences between the intervention and control groups that would result in differences
on the outcome if neither group received the intervention.
Selection modeling
Creation of a multivariate statistical model to “predict” the probability of selection into
intervention or control groups in a nonequivalent comparison design. The results of this
analysis are used to configure a control variable for selection bias to be incorporated
into a second-stage statistical model that estimates the effect of intervention on an
outcome.
Sensitivity
Shadow prices
Imputed or estimated costs of goods and services not valued accurately in the
marketplace. Shadow prices also are used when market prices are inappropriate due to
regulation or externalities. Also known as accounting prices.
Snowball sampling
A nonprobability sampling method in which each person interviewed is asked to suggest
additional knowledgeable people for interviewing. The process continues until no new
names are suggested.
Social indicator
Periodic measurements designed to track the course of a social condition over time.
Stakeholders
Individuals, groups, or organizations having a significant interest in how well a program
functions, for instance, those with decision-making authority over the program, funders
and sponsors, administrators and personnel, and clients or intended beneficiaries.
Statistical controls
The use of statistical techniques to adjust estimates of program effects for bias resulting
from differences between intervention and control groups that are related to the
outcome. The differences to be controlled by these techniques must be represented in
measured variables that can be included in the statistical analysis.
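The sketch below illustrates the idea with simulated data: a baseline score that differs between groups is included as a measured control variable in a regression, which adjusts the estimated program effect for that pre-existing difference. The variable names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a baseline score that differs between groups and also
# predicts the outcome, plus a true program effect of 5 points.
n = 200
group = rng.integers(0, 2, n)                       # 1 = intervention, 0 = control
baseline = 50 + 5 * group + rng.normal(0, 10, n)    # groups differ at baseline
outcome = 10 + 0.8 * baseline + 5 * group + rng.normal(0, 5, n)

# Including baseline as a measured control variable adjusts the estimate
# of the group effect for the pre-existing difference between groups.
X = np.column_stack([np.ones(n), group, baseline])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"Adjusted program effect estimate: {coefs[1]:.2f}")
```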
Statistical power
The probability that an observed program effect will be statistically significant when, in
fact, it represents a real effect. If a real effect is not found to be statistically significant,
a Type II error results. Thus, statistical power is one minus the probability of a Type II
error. See also Type II error.
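As an illustration, the sketch below approximates the power of a two-sided comparison of two group means from a standardized effect size and a per-group sample size, using a normal approximation rather than an exact procedure; the design values are hypothetical.

```python
from scipy.stats import norm

def power_two_groups(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means,
    for a standardized effect size, using the normal approximation."""
    se = (2 / n_per_group) ** 0.5          # SE of the difference in std. units
    z_crit = norm.ppf(1 - alpha / 2)
    z = effect_size / se
    return 1 - norm.cdf(z_crit - z) + norm.cdf(-z_crit - z)

# Hypothetical design: standardized effect of 0.4, 100 targets per group.
print(f"Power: {power_two_groups(0.4, 100):.2f}")   # roughly 0.8
```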
Summative evaluation
Evaluative activities undertaken to render a summary judgment on certain critical
aspects of the program’s performance, for instance, to determine if specific goals and
objectives were met.
Target
The unit (individual, family, community, etc.) to which a program intervention is
directed. All such units within the area served by a program comprise its target
population.
Theory failure
The program is implemented as planned but its services do not produce the immediate
effects on the participants that are expected or the ultimate social benefits that are
intended, or both.
Type I error
A statistical conclusion error in which a program effect estimate is found to be
statistically significant when, in fact, the program has no effect on the target population.
Type II error
A statistical conclusion error in which a program effect estimate is not found to be
statistically significant when, in fact, the program does have an effect on the target
population.
Units of analysis
The units on which outcome measures are taken in an impact assessment and,
correspondingly, the units on which data are available for analysis. The units of analysis
may be individual persons but can also be families, neighborhoods, communities,
organizations, political jurisdictions, geographic areas, or any other such entities.
Utilization of evaluation
The use of the concepts and findings of an evaluation by decisionmakers and other
stakeholders, whether at the day-to-day management level or at broader funding or
policy levels.
Validity
The extent to which a measure actually measures what it is intended to measure.
References
Affholter, D. P.
1994 “Outcome Monitoring.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 96-118). San Francisco: Jossey-Bass.
Ards, S.
1989 “Estimating Local Child Abuse.” Evaluation Review 13(5):484-515.
Averch, H. A.
1994 “The Systematic Use of Expert Judgment.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 293-309). San Francisco: Jossey-Bass.
Berkowitz, S.
1996 “Using Qualitative and Mixed-Method Approaches.” In R. Reviere, S.
Berkowitz, C. C. Carter, and C. G. Ferguson (eds.), Needs Assessment: A
Creative and Practical Guide for Social Scientists (pp. 121-146). Washington,
DC: Taylor & Francis.
Besharov, D. (ed.)
2003 Child Well-Being After Welfare Reform. New Brunswick, NJ: Transaction
Books.
Bickman, L. (ed.)
1987 “Using Program Theory in Evaluation.” New Directions for Program
Evaluation, no. 33. San Francisco: Jossey-Bass.
1990 “Advances in Program Theory.” New Directions for Program Evaluation, no. 47. San Francisco: Jossey-Bass.
Boruch, R. F.
1997 Randomized Experiments for Planning and Evaluation: A Practical Guide.
Thousand Oaks, CA: Sage.
Boruch, R. F., M. Dennis, and K. Carter-Greer
1988 “Lessons From the Rockefeller Foundation’s Experiments on the Minority
Female Single Parent Program.” Evaluation Review 12(4):396-426.
Bremner, R.
1956 From the Depths: The Discovery of Poverty in America. New York: New York
University Press.
Broder, I. E.
1988 “A Study of the Birth and Death of a Regulatory Agenda: The Case of the EPA
Noise Program.” Evaluation Review 12(3):291-309.
Bulmer, M.
1982 The Uses of Social Research. London: Allen & Unwin.
Campbell, D. T.
1969 “Reforms as Experiments.” American Psychologist 24 (April): 409-429.
1991 “Methods for the Experimenting Society,” Evaluation Practice 12(3):223-260.
1996 “Regression Artifacts in Time-Series and Longitudinal Data.” Evaluation and
Program Planning 19(4):377-389.
Chelimsky, E.
1987 “The Politics of Program Evaluation.” Society 25(1):24-32.
1991 “On the Social Science Contribution to Governmental Decision-Making.”
Science 254 (October): 226-230.
1997 “The Coming Transformations in Evaluation.” In E. Chelimsky and W. R.
Shadish (eds.), Evaluation for the 21st Century: A Handbook (pp. 1-26).
Thousand Oaks, CA: Sage.
Chen, H.-T.
1990 Theory-Driven Evaluations. Newbury Park, CA: Sage.
Cohen, J.
1988 Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ:
Lawrence Erlbaum.
Cordray, D. S.
1993 “Prospective Evaluation Syntheses: A Multi-Method Approach to Assisting Policy-Makers.” In M. Donker and J. Derks (eds.), Rekenschap: Evaluatie-onderzoek in Nederland, de stand van zaken (pp. 95-110). Utrecht, the Netherlands: Centrum Geestelijke Volksgezondheid.
Cronbach, L. J.
1982 Designing Evaluations of Educational and Social Programs. San Francisco:
Jossey-Bass.
Datta, L.
1977 “Does It Work When It Has Been Tried? And Half Full or Half Empty?” In M. Guttentag and S. Saar (eds.), Evaluation Studies Review Annual, vol. 2 (pp. 301-319). Beverly Hills, CA: Sage.
1980 “Interpreting Data: A Case Study From the Career Intern Program Evaluation.”
Evaluation Review 4 (August): 481-506.
Dean, D. L.
1994 “How to Use Focus Groups.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 338-349). San Francisco: Jossey-Bass.
Dennis, M. L.
1990 “Assessing the Validity of Randomized Field Experiments: An Example From
Drug Abuse Research.” Evaluation Review 14(4):347-373.
DeVellis, R. F.
2003 Scale Development: Theory and Applications, 2nd ed. Thousand Oaks, CA:
Sage.
Dibella, A.
1990 “The Research Manager’s Role in Encouraging Evaluation Use.” Evaluation
Practice 11(2):115-119.
Duckart, J. P.
1998 “An Evaluation of the Baltimore Community Lead Education and Reduction
Corps (CLEARCorps) Program.” Evaluation Review 22(3):373-402.
Dunford, F. W.
1990 “Random Assignment: Practical Considerations From Field Experiments.”
Evaluation and Program Planning 13(2):125-132.
Eddy, D. M.
1992 “Cost-Effectiveness Analysis: Is It Up to the Task?” Journal of the American
Medical Association 267:3342-3348.
Elmore, R. F.
1980 “Backward Mapping: Implementation Research and Policy Decisions.”
Political Science Quarterly 94(4):601-616.
Figlio, D. N.
1995 “The Effect of Drinking Age Laws and Alcohol-Related Crashes: Time-Series
Evidence From Wisconsin.” Journal of Policy Analysis and Management
14(4):555-566.
Fink, A.
1995 Evaluation for Education and Psychology. Thousand Oaks, CA: Sage.
Fournier, D. M.
1995 “Establishing Evaluative Conclusions: A Distinction Between General and
Working Logic.”New Directions for Evaluation, no. 68 (pp. 15-32). San
Francisco: Jossey-Bass.
Fowler, F. L.
1993 Survey Research Methods, 2nd ed. Newbury Park, CA: Sage.
Fraker, T. F., A. P. Martini, and J. C. Ohls
1995 “The Effect of Food Stamp Cashout on Food Expenditures: An Assessment of
the Findings From Four Demonstrations.” Journal of Human Resources
30(4):633-649.
Freeman, H. E.
1977 “The Present Status of Evaluation Research.” In M. A. Guttentag and S. Saar
(eds.), Evaluation Studies Review Annual, vol. 2 (pp. 17-51). Beverly Hills,
CA: Sage.
Gramblin, E. M.
1990 A Guide to Benefit-Cost Analysis. Englewood Cliffs, NJ: Prentice Hall.
Greene, J. C.
1988 “Stakeholder Participation and Utilization in Program Evaluation.” Evaluation
Review 12(2):91-116.
Greene, W. H.
1993 “Selection-Incidental Truncation.” In W. H. Greene, Econometric Analysis (pp.
706-715). New York: Macmillan.
Hamilton, J.
1994 Time Series Analysis. Princeton, NJ: Princeton University Press.
Hatry, H. P.
1994 “Collecting Data From Agency Records.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation. San Francisco: Jossey-Bass.
1999 Performance Measurement: Getting Results. Washington, DC: Urban Institute
Press.
Haveman, R. H.
1987 “Policy Analysis and Evaluation Research After Twenty Years.” Policy Studies
Journal 16(2):191-218.
Henry, G. T.
1990 Practical Sampling. Newbury Park, CA: Sage.
Hoch, C.
1990 “The Rhetoric of Applied Sociology: Studying Homelessness in Chicago.”
Journal of Applied Sociology 7:11-24.
Hsu, L. M.
1995 “Regression Toward the Mean Associated With Measurement Error and the Identification of Improvement and Deterioration in Psychotherapy.” Journal of Consulting & Clinical Psychology 63(1):141-144.
Jones-Lee, M. W.
1994 “Safety and the Saving of Life: The Economics of Safety and Physical Risk.” In R. Layard and S. Glaister (eds.), Cost-Benefit Analysis, 2nd ed. (pp. 290-318). Cambridge, UK: Cambridge University Press.
Kazdin, A. E.
1982 Single-Case Research Designs. New York: Oxford University Press.
Krueger, R. A.
1988 Focus Groups: A Practical Guide for Applied Research. Newbury Park, CA:
Sage.
LaLonde, R.
1986 “Evaluating the Econometric Evaluations of Training Programs.” American
Economic Review 76:604-620.
Landsberg, G.
1983 “Program Utilization and Service Utilization Studies: A Key Tool for
Evaluation.” New Directions for Program Evaluation, no. 20 (pp. 93-103). San
Francisco: Jossey-Bass.
Lipsey, M. W.
1990 Design Sensitivity: Statistical Power for Experimental Research. Newbury
Park, CA: Sage.
1993 “Theory as Method: Small Theories of Treatments.” New Directions for Program Evaluation, no. 57 (pp. 5-38). San Francisco: Jossey-Bass.
1997 “What Can You Build With Thousands of Bricks? Musings on the Cumulation of Knowledge in Program Evaluation.” New Directions for Evaluation, no. 76 (pp. 7-24). San Francisco: Jossey-Bass.
1998 “Design Sensitivity: Statistical Power for Applied Experimental Research.” In L. Bickman and D. J. Rog (eds.), Handbook of Applied Social Research Methods (pp. 39-68). Thousand Oaks, CA: Sage.
Loehlin, J. C.
1992 Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis. Hillsdale, NJ: Lawrence Erlbaum.
Luepker, R. V., C. L. Perry, S. M. McKinlay, P. R. Nader, G. S. Parcel, E. J. Stone, L. S. Webber, J. P. Elder, H. A. Feldman, C. C. Johnson, S. H. Kelder, and M. Wu
1996 “Outcomes of a Field Trial to Improve Children’s Dietary Patterns and Physical
Activity: The Child and Adolescent Trial for Cardiovascular Health (CATCH).”
Journal of the American Medical Association 275 (March): 768-776.
McFarlane, J.
1989 “Battering During Pregnancy: Tip of an Iceberg Revealed.” Women and Health
15(3):69-84.
McKillip, J.
1987 Need Analysis: Tools for the Human Services and Education. Newbury Park,
CA: Sage.
1998 “Need Analysis: Process and Techniques.” In L. Bickman and D. J. Rog (eds.),
Handbook of Applied Social Research Methods (pp. 261-284). Thousand Oaks,
CA: Sage.
McLaughlin, M. W.
1975 Evaluation and Reform: The Elementary and Secondary Education Act of
1965/Title I. Cambridge, MA: Ballinger.
Mercier, C.
1997 “Participation in Stakeholder-Based Evaluation: A Case Study.” Evaluation and
Program Planning 20(4):467-475.
Mitra, A.
1994 “Use of Focus Groups in the Design of Recreation Needs Assessment
Questionnaires.” Evaluation and Program Planning 17(2):133-140.
Mohr, L. B.
1995 Impact Analysis for Program Evaluation, 2nd ed. Thousand Oaks, CA: Sage.
Moynihan, D. P.
1991 “Educational Goals and Political Plans.” The Public Interest 102 (winter): 32-
48.
1996 Miles to Go: A Personal History of Social Policy. Cambridge, MA: Harvard University Press.
Murray, D.
1998 Design and Analysis of Group-Randomized Trials. New York: Oxford
University Press.
Murray, S.
1980 The National Evaluation of the PUSH for Excellence Project. Washington, DC:
American Institutes for Research.
Nas, T. F.
1996 Cost-Benefit Analysis: Theory and Application. Thousand Oaks, CA: Sage.
Nelson, R. H.
1987 “The Economics Profession and the Making of Public Policy.” Journal of Economic Literature 35(1):49-91.
Patton, M. Q.
1986 Utilization-Focused Evaluation, 2nd ed. Beverly Hills, CA: Sage.
1997 Utilization-Focused Evaluation: The New Century Text, 3rd ed. Thousand Oaks, CA: Sage.
Quinn, D. C.
1996 Formative Evaluation of Adapted Work Services for Alzheimer’s Disease Victims: A Framework for Practical Evaluation in Health Care. Doctoral dissertation, Vanderbilt University.
Reineke, R. A.
1991 “Stakeholder Involvement in Evaluation: Suggestions for Practice.” Evaluation
Practice 12(1):39-44.
Rich, R. F.
1977 “Uses of Social Science Information by Federal Bureaucrats.” In C. H. Weiss
(ed.), Using Social Research for Public Policy Making (pp. 199-211).
Lexington, MA: D.C. Heath.
Robertson, D. B.
1984 “Program Implementation versus Program Design.” Policy Studies Review
3:391-405.
Rog, D. J.
1994 “Constructing Natural ‘Experiments.’” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 119-132). San Francisco: Jossey-Bass.
Rossi, P. H.
1978 “Issues in the Evaluation of Human Services Delivery.” Evaluation Quarterly
2(4):573-599.
1987 “No Good Applied Research Goes Unpunished!” Social Science and Modern
Society 25(1):74-79.
1989 Down and Out in America: The Origins of Homelessness. Chicago: University
of Chicago Press.
1997 “Program Outcomes: Conceptual and Measurement Issues.” In E. J. Mullen and
J. Magnabosco (eds.), Outcome and Measurement in the Human Services:
Cross-Cutting Issues and Methods. Washington, DC: National Association of
Social Workers.
2001 Four Evaluations of Welfare Reform: What Will Be Learned? The Welfare
Reform Academy. College Park: University of Maryland, School of Public
Affairs.
Savaya, R.
1998 “The Potential and Utilization of an Integrated Information System at a Family
and Marriage Counselling Agency in Israel.” Evaluation and Program
Planning 21(1): 11-20.
Scheirer, M. A.
1994 “Designing and Using Process Evaluation.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 40-68). San Francisco: Jossey-Bass.
Schorr, L. B.
1997 Common Purpose: Strengthening Families and Neighborhoods to Rebuild
America. New York: Doubleday Anchor Books.
Scriven, M.
1991 Evaluation Thesaurus, 4th ed. Newbury Park, CA: Sage.
Smith, M. F.
1989 Evaluability Assessment: A Practical Approach. Norwell, MA: Kluwer
Academic Publishers.
Solomon, J.
1988 “Companies Try Measuring Cost Savings From New Types of Corporate
Benefits.” Wall Street Journal, December 29.
Soriano, F. I.
1995 Conducting Needs Assessments: A Multidisciplinary Approach. Thousand
Oaks, CA: Sage.
Suchman, E.
1967 Evaluative Research. New York: Russell Sage Foundation.
Terrie, E. W.
1996 “Assessing Child and Maternal Health: The First Step in the Design of Community-Based Interventions.” In R. Reviere, S. Berkowitz, C. C. Carter, and C. G. Ferguson (eds.), Needs Assessment: A Creative and Practical Guide for Social Scientists (pp. 121-146). Washington, DC: Taylor & Francis.
Thompson, M.
1980 Benefit-Cost Analysis for Program Evaluation. Beverly Hills, CA: Sage.
Trippe, C.
1995 “Rates Up: Trends in FSP Participation Rates: 1985-1992.” In D. Hall and M.
Stavrianos (eds.), Nutrition and Food Security in the Food Stamp Program.
Alexandria, VA: U.S. Department of Agriculture, Food and Consumer Service.
Trochim, W. M. K.
1984 Research Design for Program Evaluation: The Regression Discontinuity
Approach. Beverly Hills, CA: Sage.
Viscusi, W. K.
1985 “Cotton Dust Regulation: An OSHA Success Story?” Journal of Policy Analysis
and Management 4(3):325-343.
Weiss, C. H.
1972 Evaluation Research: Methods of Assessing Program Effectiveness.
Englewood Cliffs, NJ: Prentice Hall.
1988 “Evaluation for Decisions: Is Anybody There? Does Anybody Care?”
Evaluation Practice 9(1):5-19.
1993 “Where Politics and Evaluation Research Meet,” Evaluation Practice
14(1):93-106.
1997 “How Can Theory-Based Evaluation Make Greater Headway?” Evaluation
Review 21(4):501-524.
Wholey, J. S.
1979 Evaluation: Promise and Performance. Washington, DC: Urban Institute.
1981 “Using Evaluation to Improve Program Performance.” In R. A. Levine, M. A.
Solomon, and G. M. Hellstern (eds.), Evaluation Research and Practice:
Comparative and International Perspectives (pp. 92-106). Beverly Hills, CA:
Sage.
1987 “Evaluability Assessment: Developing Program Theory.” New Directions for
Program Evaluation, no. 33 (pp. 77-92). San Francisco: Jossey-Bass.
1994 “Assessing the Feasibility and Likely Usefulness of Evaluation.” In J. S. Wholey, H. P. Hatry, and K. E. Newcomer (eds.), Handbook of Practical Program Evaluation (pp. 15-39). San Francisco: Jossey-Bass.
Zerbe, R. O.
1998 “Is Cost-Benefit Analysis Legal? Three Rules.” Journal of Policy Analysis and
Management 17(3):419-456.
Author Index
La Chance, P. A., 35
Ladouceur, R., 38
LaLonde, R., 296
Landow, H., 35
Landsberg, G., 180
Larsen, C. R., 346, 350
Lenihan, K. J., 255, 258
Levin, H. M., 342, 344
Levine, A., 384
Levine, M., 384
Levine, R. A., 9, 138
Levings, D., 216
Leviton, L. C., 26, 27, 96, 370, 410, 411, 413
Lin, L.-H., 59
Lincoln, Y. S., 43, 372, 399
Lipsey, M. W., 26, 141, 296, 311, 325, 326, 390
Loehlin, J. C., 283
Longmire, L., 231
Lowe, R. A., 338
Luepker, R. V., 244
Lurie, P., 338
Lurigio, A. J., 45
Lyall, K., 246, 389
Lynn, L. E., Jr., 13
Ohls, J. C., 41
Olson, K. W., 346, 350
O’Malley, P. M., 164
Oman, R. C., 373
Osgood, D. W., 160
Subject Index
Government, 11
evaluation units in, 12
fiscal conservatism of, 14-15
policy analysis and, 12-15
See also Political process
GREAT (Gang Resistance Education and Training) program, 160
Great Depression, 11
Guiding Principles for Evaluators, 405-410, 406-410 (exhibit)
Hawthorne effect, 8
Head Start, 195
Health management programs, 111-112, 164, 176, 230, 244, 338
Healthy Start initiative, 111-112
Homeless programs, 85, 114, 115, 155, 186, 188, 223, 275
Targets, 33, 65
boundaries of, 120
definition/identification of, 118-121
direct/indirect targets, 119
diversity of perspective and, 120-121
populations, descriptions of, 121-124
specification of, 119-121
Task Force on Guiding Principles for Evaluators, 405-412, 406-410 (exhibit)
Teen mother parenting program, 95, 147
Theory failure, 79, 99
Time-series designs, 291-295, 293-294 (exhibits), 300
Title I, 191-192
Transitional Aid to Released Prisoners (TARP) study, 256-258, 388
Type I/type II errors, 308-312, 309 (exhibit), 312 (exhibit), 330
About the Authors
Mark W. Lipsey is the Director of the Center for Evaluation Research and
Methodology and a Senior Research Associate at the Vanderbilt Institute for Public
Policy Studies at Vanderbilt University. He received a Ph.D. in psychology from Johns
Hopkins University in 1972 following a B.S. in applied psychology from the Georgia
Institute of Technology in 1968. His professional interests are in the areas of public
policy, program evaluation research, social intervention, field research methodology,
and research synthesis (meta-analysis). The topics of his recent research have been risk
and intervention for juvenile delinquency and issues of methodological quality in
program evaluation research. Professor Lipsey serves on the editorial boards of the
American Journal of Evaluation, Evaluation and Program Planning, Psychological
Bulletin, and the American Journal of Community Psychology, and boards or
committees of, among others, the National Research Council, National Institutes of
Health, Campbell Collaboration, and Blueprints for Violence Prevention. He is a
recipient of the American Evaluation Association’s Paul Lazarsfeld Award and a
Fellow of the American Psychological Society.