How To Conduct Evaluation of
Extension Programs
Murari Suvedi
Kirk Heinze
Diane Ruonavaara
Introduction
Evaluation in extension once focused primarily on judging a program’s merit or
worth, and the methodology associated with these earlier evaluations was
portrayed as a basically quantitative activity. In today’s
increasingly complex and demanding world, evaluation must deal with issues
of accountability, good management, knowledge building and sharing,
organizational learning and development, problem identification and policy
formation. As the scope of evaluation expands, qualitative approaches and
multiple methods are becoming increasingly necessary. Concurrently, today’s
evaluator in extension finds that he or she needs to fulfill multiple roles and be
familiar with numerous methods. This manual is designed to cover the
expanding field of evaluation as it applies to extension and to provide you, the
evaluator, with a methodological toolbox containing a broad array of methods
and suggestions as to their appropriate use.
What is Evaluation?
Program evaluation is a continual and systematic process of assessing the
value or potential value of Extension programs to guide decision-making for
the program’s future.
When we evaluate...
o We examine the assumptions upon which an existing or
proposed program is based.
Why Evaluate?
Demands on Extension for program efficiency, program effectiveness and
for public accountability are increasing. Evaluation can help meet these
demands in various ways.
o Planning: to assess needs, to set priorities, to guide policy.
o Direct decision-making.
o Maintain accountability: to stakeholders, to funding sources.
o Advocate.
When to Evaluate
There are several basic questions to ask when deciding whether to carry out
an evaluation. If the answers to these questions are "No", this may not be
the time for an evaluation.
An Evaluator’s Credibility
An evaluator is judged by his or her competence and personal style.
Competence is developed through training and experience. Personal style
develops over time through a combination of training, experience and
personal characteristics.
Competence
Personal Style
o Communication skills.
o Confidence.
o Strong interpersonal skills.
o Ability to nurture trust and rapport.
o Sensitivity in reporting.
Steps to Evaluation
Program evaluation can be an overwhelming process. To make it less
intimidating and more manageable, it can be broken down into several
steps. The specifics of each step may vary,
depending on the nature, scope and complexity of the programs and the
resources available for conducting the evaluations. These steps will be
expanded upon in later sessions.
[Figure: a circular flowchart of the ten steps to evaluation, running from Step 1
through Step 10 and back to the beginning; labeled steps include identifying and
consulting key stakeholders and communicating findings.]
• What is the current level of knowledge, skills, attitudes and beliefs of our
audience?
• Are there sufficient human and monetary resources available to carry out
an evaluation?
• Is there enough time to complete the evaluation?
• The indicators and criteria that will be used to judge value or worth
of the program. When program objectives are clearly stated, the
indicators and criteria to judge merit or worth will be explicitly
stated.
Characteristics of indicators:
1. Is it measurable?
2. Is it relevant and easy to use?
3. Does it provide a representative picture?
4. Is it easy to interpret and does it show trends over time?
5. Is it responsive to changes?
6. Does it have a reference to compare it against so that users are able to
assess the significance of its values?
7. Can it be measured at a reasonable cost, and can it be updated?
Knowledge, attitude, skill and aspiration (KASA) change: Changes in participants’
knowledge, attitudes, skills and aspirations as a result of program participation.
Group Interview
• mapping
• qualitative analysis
Quality of Evidence
The validity and reliability of the data collection instrument determine the
quality of evidence for quantitative methods.
A large array of methods exists that can be used in evaluation. We will cover
the following:
Surveys
Surveys are a very popular method of collecting evaluation data and require a
carefully designed questionnaire administered by mail, telephone or personal
interviews. Surveys can be used to collect data on a participant’s knowledge,
attitudes, skills and aspirations, adoption of practices, and program benefits
and impacts. It is the responsibility of the evaluator to ensure that ethical
standards are maintained. This means that participation is voluntary and
survey results are made public in a way that maintains confidentiality.
Telephone Survey
A telephone survey consists of a written questionnaire that is read to a
selected group of people over the telephone. The survey sample is often
selected from a telephone directory or other lists. People on the list are
interviewed one at a time over the phone.
For each call attempt, the interviewer records the survey title, a result code and
any additional comments. Sample result codes:

Code  Explanation
05    Answered by nonresident
06    Household refusal
08    Temporarily disconnected
11    Contact only
17    Partial interview
18    Respondent contacted - completed interview
19    Other
When introducing the survey, the interviewer should be prepared to explain:
o Purpose of study.
o Size of survey.
o Identity of interviewer.
o Issues of confidentiality.
o How to get a copy of results.
Mail Survey
A mail survey is the most frequently used type of survey in evaluation of
Extension programs and requires the least resources.
Personal Surveys
Personal or face-to-face surveys are conducted by talking individually to
respondents and systematically recording their answers to each question.
Initiating contact:
Coverage error: The sampling frame does not include all elements of the
population. Remedy: redraw the list from which the sample is drawn to include
all elements of the population.
Sampling error: A subset or sample of all people in the population is studied
instead of conducting a census. Remedy: increase the size of the sample, use
random sampling and purge the list of duplication.
Group-administered Questionnaire
A group-administered questionnaire is handed directly to each participant in a
group at the end of a workshop, seminar or program. Respondents answer the
questions individually and hand them back to the person conducting the
evaluation.
Questionnaire Design
The overall aim of questionnaire design is to solicit quality participation.
Response quality depends on the trust the respondent feels for the survey, the
topic, the interviewer and the manner in which the questions are worded and
arranged. Consider whether the questionnaire is going to be mailed, given
directly to respondents, used in a telephone survey or used in personal
interviews. Before you begin it is essential to know what kind of evidence you
need for the evaluation and how the information will be used.
o Make a list of what you want to know and how the information
will be used.
o Check to make sure the information is not already available
somewhere else.
o Eliminate all but essential questions.
o As you write questions try to view them through the eyes of the
respondents.
• Make questions fit the page so that the respondent does not need to turn
the page to answer a question.
• Provide easy-to-follow directions on how to answer the questions.
• Arrange questions and answers in a vertical flow. Put answer choices
under rather than beside the questions so that respondents move down
the page rather than from side to side.
1st paragraph: Explains the purpose of the study and describes who will be
answering the questionnaire.
2nd paragraph: Assures the respondent that the study is useful.
3rd paragraph: Provides directions on how and when to return the questionnaire.
4th paragraph: Reemphasizes the study’s social usefulness.
Writing Questions
The questions used in a questionnaire are the basic components that
determine the effectiveness of your survey. Writing good questions is not easy
and usually takes more than one try. Consider what information to include,
how to structure the questions and whether people can answer the questions
accurately. Good survey questions are focused, clear, and to the point.
Every question should focus on a single, specific issue or topic.
Poor: When was the last time you went to the doctor
for a physical examination on your own or because
you had to?
A respondent may answer the first question ambiguously. For example, "I have
two boys and a girl. They are 5, 7, and 10 years old." It is not possible to
determine the ages of each child from this response.
Types of Information
Questions can be formulated to elicit four types of information: 1) knowledge,
2) beliefs, attitudes and opinions, 3) behavior and 4) attributes. Any one or a
combination of these types can be included in a questionnaire.
Knowledge questions ask about what people know and how well they
understand something. For example:
What is the major cause of accidental deaths among children inside the home?
Behavioral questions ask people about what they have done in the past, what
they do now or what they plan to do in the future.
1. Have you or members of your family ever taken classes at the Regional
Education Center in this county? _____Yes _____No
2. To what extent do you agree or disagree with the new zoning code?
1. Strongly disagree
2. Mildly disagree
3. Neither agree nor disagree
4. Strongly agree
Open-ended Questions
Open-ended questions allow respondents to answer in their own words rather
than select from predetermined answers. However, open-ended questions have
several drawbacks:
o Difficult to analyze.
o Require more time to answer.
o Depend on respondent recall.
o Require greater interviewing skill.
o Lack response categories to help clarify questions.
o Handwritten responses may be illegible.
Examples of open-ended questions:
1. How do you plan to use the information acquired during this training?
2. What do you think should be done to improve the 4-H program in this
county?
In pre-testing, we ask:
The focus group interview has several strengths:
o It is easy to set up.
o It is fast and relatively inexpensive.
o It can reduce the distance between project personnel and
intended beneficiaries.
o It stimulates dialogue.
o It can generate ideas for evaluation questions to be included in
other survey methods.
It also has limitations:
o It is easily misused.
o It requires special moderator skills.
o Data interpretation is tedious.
o Avoiding bias can be difficult.
o Capturing major issues that emerge can be difficult.
o Results may not be generalizable to the target population.
Preparing for a focus group interview includes the following steps:
o Arrange for a meeting room; check the seating and table arrangements.
o Identify a trained moderator and an assistant to conduct the focus group
interview. The moderator creates a friendly atmosphere and directs and keeps
the conversation flowing.
o Identify and contact potential participants by sending a personalized invitation.
Explain the purpose of the meeting to them and how their participation will
contribute.
o Arrange a meeting place that is neutral and non-threatening, convenient and
easy to find.
o Select a means to record the discussion (tape recorder, note taker, etc.).
A sample moderator introduction follows:
Good evening and welcome to our session tonight. Thank you for taking the
time to join our discussion of county educational services. My name is _______
and I represent ____________. Assisting me is _________ from _________.
We are attempting to gain information about educational opportunities in the
community. We have invited people who live in several parts of the county to
share their ideas.
You were selected because you have certain things in common that are of
particular interest to us. You are all employed outside the home and you live in
the suburban areas of the county. We are particularly interested in your views
because you are representative of others in the county.
Tonight we will be discussing non-formal educational issues in the community.
These include all the ways you gain new information about areas of interest to
you. There are no right or wrong answers but rather differing points of view.
Please feel free to share your point of view even if it differs from what others
have said.
Before we begin, let me remind you of some ground rules. Please speak up, but
only one person should talk at a time. We’re tape-recording the session because
we don’t want to miss any of your comments. If several are talking at the same
time, the tape will get garbled and we’ll miss your comments. We will be on a
first-name basis tonight, and in our later reports, there will not be any names
attached to comments. You may be assured of complete anonymity of
responses. Keep in mind that we’re just as interested in negative comments as
positive comments, and at times the negative comments are the most helpful.
Our session will last about an hour and we will not be taking a formal break.
Well, let’s begin.
Let’s find out some more about one another by going around the room one at a
time. Tell us your name and where you live.
"Why" questions can make people defensive and feel the need to
provide an answer.
When you ask "why," people usually respond with attributes or
influences.
Focus group questions typically move through several categories:
a. Opening questions.
b. Introductory questions.
c. Transition questions.
d. Key questions.
e. Ending questions.
(Cues are the hints or prompts that help participants recall specific
features or details.)
1. Let’s find out some more about one another by going around the room.
Tell us your name, where you live and what first comes to mind when
you hear the words "Michigan State University."
2. What are you hearing people say about Extension agriculture and natural
resources programs in your community? How has Extension’s work
changed in the recent past?
3. Think back to an experience you have had with MSU Extension that was
outstanding. Describe it.
4. Think back to an experience you have had with MSU Extension that was
disappointing. Describe it. How could Extension change its
programming?
5. Michigan State University Extension has adopted an Area of Expertise
(AOE) team approach to Extension work. Have you taken advantage of
the Area of Expertise (AOE) teams? What have been your impressions
of the AOE team performance during the past year?
6. How can MSU Extension’s field crop AOE team improve its future
program offerings? Could you suggest ways to improve Extension field
crop programs?
o Is low-cost.
o Requires little time.
o Can encourage local participation.
o Can decrease outsider bias.
o Can encourage participation of frequently overlooked groups.
o Offers flexibility in method selection.
Limitations and potential biases:
o Seasonal bias.
o Accessibility bias.
o Elite bias.
o Hypothesis confirming - selective attentiveness.
o Concreteness bias - confusing specificity with generality.
o Consistency bias - premature formation of coherence in data.
o May not be generalizable.
• Wealth ranking
• Preference ranking
• Matrices
Case Study
A case study is an in-depth analysis of a particular case – a program, a group
of participants, a single individual, or a specific site or location. Case studies
can be explanatory, descriptive or exploratory. An explanatory case study can
measure causal relationships; a descriptive case study can be used to describe
the context in which a program takes place and the program itself, and an
exploratory case study can help identify performance measures or pose
hypotheses for further evaluation. Case studies rely on multiple sources of
information and methods to provide as complete a picture as possible of the
particular case.
Participant Observation
Participant observation entails gathering information about behavioral actions
and reactions through direct observation, interviews with key informants, and
participation in the activities being evaluated. As used in evaluation, the
participant observer immerses him- or herself in the setting being studied with the intent
of understanding the world through the eyes of stakeholders. Participant
observation is useful in determining community conflicts or misunderstandings,
assessing community needs and problems, and/or identifying means to involve
local people in problem solving.
Benefit/Cost Analysis
Benefit/cost analysis is typically viewed as an alternative to program
evaluation. However, it can also be seen as an extension of the evaluation
process. As such, benefit/cost analysis provides a means to systematically
quantify and compare program inputs to program outcomes in monetary terms.
Valuing both benefits and costs in monetary terms allows them to be directly
compared to determine the net impact of the program, make comparisons
between alternative programs or projects, assist in program planning, advance
organizational accountability and /or expedite program support.
L = labor: The cost per hour for labor, including salary and fringe benefits.
Fringe benefits vary but normally fall within 22 to 35 percent of full salary.
With fringe benefits at 35 percent, the hourly labor cost formula is
L = (S + 0.35S) / 260 / 8, where S = annual salary, 260 = workdays per year
and 8 = hours per workday.
A simple worksheet for estimating program costs and benefits includes:
1. Labor hours.
2. Direct costs: rent, utilities, printed materials, furnishings, instructional
materials, travel miles.
3. Opportunity costs: child care, food, travel.
4. Indirect costs.
From these entries, compute total program costs, total program benefits and the
benefit/cost ratio.
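To make the worksheet arithmetic concrete, here is a minimal Python sketch (not part
of the manual) that applies the hourly labor formula above and computes a benefit/cost
ratio; every salary, cost and benefit figure is hypothetical.

# Sketch of the benefit/cost worksheet, using hypothetical figures.

def hourly_labor_cost(annual_salary, fringe_rate=0.35, workdays=260, hours_per_day=8):
    # L = (S + fringe benefits) / workdays per year / hours per workday
    return annual_salary * (1 + fringe_rate) / workdays / hours_per_day

labor_hours = 120                                     # hypothetical staff time on the program
labor_cost = labor_hours * hourly_labor_cost(40_000)  # hypothetical annual salary
direct_costs = 500 + 150 + 300                        # rent, utilities, printed materials
opportunity_costs = 200 + 100                         # child care, travel
indirect_costs = 0.10 * (labor_cost + direct_costs)   # hypothetical overhead rate

total_costs = labor_cost + direct_costs + opportunity_costs + indirect_costs
total_benefits = 12_000                               # hypothetical estimated value of outcomes

print(f"Total program costs:    {total_costs:,.2f}")
print(f"Total program benefits: {total_benefits:,.2f}")
print(f"Benefit/cost ratio:     {total_benefits / total_costs:.2f}")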
Sample Size
How large should a sample be? A sample size of 100 respondents is often cited
as a minimal number for a large population. The practical maximum size is
about 1000 respondents. Generally, a sample of fewer than 30 respondents will
not provide enough certainty to prove useful. However, several factors need to
be considered when determining actual sample size.
[Table: recommended sample size (s) for a given population size (n).]
Sampling Techniques
Random or probability sampling is based on random selection of
units from the identified population. Random sampling techniques include:
Simple random sample - all the individuals in the population have an equal and
independent chance of being selected as a member of the sample. A random
numbers table is sometimes used with a randomly selected starting point to
identify numbered subjects (see appendix).
Systematic sampling - all members in the population are placed on a list for
random selection and every nth person is chosen after a random starting place
is selected.
Cluster sample - the unit of sampling is not the individual but rather a naturally
occurring group of individuals such as a classroom, organization or community.
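The Python sketch below, a hypothetical illustration rather than part of the manual,
shows how the three techniques differ; the sampling frame of 500 member identifiers
and the cluster groupings are made up.

import random

population = [f"member_{i}" for i in range(1, 501)]    # hypothetical sampling frame of 500

# Simple random sample: every member has an equal, independent chance of selection.
simple_sample = random.sample(population, k=50)

# Systematic sample: choose a random starting place, then take every nth member.
n = len(population) // 50                              # sampling interval
start = random.randrange(n)
systematic_sample = population[start::n]

# Cluster sample: the sampling unit is a naturally occurring group, not the individual.
clusters = {f"community_{c}": population[c * 100:(c + 1) * 100] for c in range(5)}
chosen_clusters = random.sample(list(clusters), k=2)
cluster_sample = [member for c in chosen_clusters for member in clusters[c]]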
Content Analysis
Reliability
To make valid inferences from the text, it is important that the classification
procedure be reliable in the sense of being consistent: Different people should
code the same text in the same way. Reliability problems in content analysis
usually grow out of the ambiguity of word meanings, category definitions, or
coding rules. Three types of reliability are pertinent to content analytic analysis:
stability, reproducibility, and accuracy.
Stability refers to the extent to which the results of content classification are
invariant over time, i.e., whether content will be coded in the same way if it is
coded more than once by the same coder.
Validity
The classification procedure must also generate valid variables; that is, it must
measure or represent what the investigator intends it to measure. As with
reliability, validity problems also grow out of the ambiguity of word
meanings and category or variable definitions.
A measure has construct validity to the extent that it is correlated with another
measure of the same construct. Thus, construct validity entails the
generalizability of the construct across measures or methods. There is no
simple right way to do content analysis; investigators must judge what methods
are most appropriate for their purpose. Large portions of text, such as
paragraphs and complete texts, usually are more difficult to code as a unit than
smaller portions, such as words or phrases, because large units typically contain
more information and a greater diversity of topics. Hence, they are more likely
to present coders with conflicting cues.
1. Define the coding units: Words, word sense (code different senses of
words with multiple meanings or code phrases that constitute a semantic
unit), sentences (when interested in words or phrases that occur closely
together), themes, paragraphs, whole text.
2. Define the categories, which involves two decisions: First, whether the
categories are mutually exclusive, and second, how narrow or broad the
categories are to be.
3. Test coding on a sample of text.
4. Assess reliability.
5. Revise the coding rules.
6. Return to step three until coders achieve sufficient reliability.
7. Code all the text.
8. Assess achieved reliability. Coders are subject to fatigue and are likely to
make more mistakes as the coding proceeds. Also, as the text is coded,
their understanding of the coding rules may change in subtle ways that
lead to greater unreliability.
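To make the coding steps concrete, the following Python sketch codes open-ended
responses into keyword-based categories and checks simple percent agreement between
two coders; the categories, keywords, responses and coder decisions are all hypothetical.

# Hypothetical keyword lists standing in for written coding rules (step 2).
categories = {
    "program benefit": ["learned", "improved", "useful"],
    "program barrier": ["too far", "no time", "cost"],
}

responses = [
    "I learned new pruning techniques and the class was useful.",
    "The workshop site was too far away and I had no time to attend.",
]

def code_text(text):
    # Count how often each category's keywords appear in the text (step 7).
    text = text.lower()
    return {cat: sum(text.count(kw) for kw in kws) for cat, kws in categories.items()}

for response in responses:
    print(code_text(response))

# Simple reliability check (steps 4 and 8): percent agreement between two coders
# who independently coded the same four text units.
coder_a = ["benefit", "barrier", "benefit", "barrier"]
coder_b = ["benefit", "barrier", "barrier", "barrier"]
agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"Percent agreement: {agreement:.0%}")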
Scales of measurement refer to the type of variable being measured and the
way it is measured. Different statistics are appropriate for different scales of
measurement. Scales of measurement include nominal, ordinal, interval and ratio.
The mean is used for interval variables. It is the arithmetic average of all
observations. You calculate mean by totaling all observations (scores or
responses) and dividing by the number of observations. The mean is sensitive
to "outliers" or extreme values in the observations. When your data has a few
extremely small or large observations, the data are "skewed."
The median is most appropriate for ordinal variables. The median is the
middle observation. Half of the observations are larger and half are smaller.
The median is not as sensitive to the outliers as the mean.
The mode is used for nominal variables. It is the observation or category that
occurs most frequently. The mode can be used to show the most "popular"
observation or value. A distribution can be either unimodal or bimodal.
Distribution A              Distribution B
Value   Responses           Value   Responses
23      2                   33      1
45      6                   21      7
34      8                   61      21
25      11                  75      4
73      15                  66      3
83      18                  24      7
54      10                  74      10
66      12                  88      21
Distribution B is bimodal or has two modes, 61 and 88, with 21 responses each.
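Python's built-in statistics module can illustrate these measures; the sketch below
expands Distribution B's values by their response counts and reports the mean,
median and modes (the use of Python here is illustrative, not part of the manual).

from statistics import mean, median, multimode

# Distribution B from the table above: (value, number of responses).
distribution_b = [(33, 1), (21, 7), (61, 21), (75, 4),
                  (66, 3), (24, 7), (74, 10), (88, 21)]

# Expand each value by its response count into individual observations.
observations = [value for value, count in distribution_b for _ in range(count)]

print("Mean:  ", round(mean(observations), 1))
print("Median:", median(observations))
print("Modes: ", multimode(observations))   # [61, 88]: bimodal, as noted above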
t-Test is used to test the difference between two means even when the sample
sizes are small. The significance of the t statistics depends upon the hypothesis
the researcher plans to test. If you are interested in determining whether there is
a significant difference between two means, but you do not know which of the
means is greater, use the two-tailed test. If you are interested in testing the
specific hypothesis that one mean is greater than the other, use the one-tailed
test. Data should satisfy parametric assumptions: 1) the sample is selected from
populations that are normally distributed; 2) there is homogeneity of variance
-- i.e., the spread of the dependent variable within the group tested must be
statistically equal; and 3) data are of continuous form with equal intervals of
quantity measurement. Dependent variables must be interval or ratio-type data.
T-test for matched pairs: if both groups of data are contained in each data
record, the appropriate t-test is for matched pairs. An example of an appropriate
use of the t-test for matched pairs might be to compare pretest and posttest
scores where each person took a pretest (variable 1) and a posttest (variable 2).
Both values are contained in each data record.
T-test for independent groups: If each case in the data file is to be assigned to
one group or the other based on another variable, use the t-test for independent
groups. For example, to compare reading scores between males and females,
split the reading scores into two groups, depending on whether the person is
male or female (each record in the data file is assigned to one group or the
other).
Degrees of freedom: (This is not a complete description.) The degrees of freedom
(d.f.) reflect sample size. When two independent samples are being considered,
d.f. are equal to the sum of the two sample sizes minus 2; i.e., d.f. = n1 + n2 - 2.
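Assuming SciPy is available, the sketch below runs both forms of the t-test on
hypothetical data: matched pretest/posttest scores and reading scores split into two
independent groups. None of the numbers come from the manual.

from scipy import stats

# Matched pairs: each record holds the same person's pretest and posttest score.
pretest = [62, 55, 70, 48, 66, 59, 73, 51]
posttest = [68, 61, 72, 55, 70, 63, 78, 58]
paired = stats.ttest_rel(pretest, posttest)

# Independent groups: each score is assigned to one group or the other.
male_scores = [71, 64, 58, 69, 62, 75]
female_scores = [74, 68, 66, 72, 70, 77]
independent = stats.ttest_ind(male_scores, female_scores)

print(f"Matched pairs:      t = {paired.statistic:.2f}, p = {paired.pvalue:.3f}")
print(f"Independent groups: t = {independent.statistic:.2f}, p = {independent.pvalue:.3f}")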
Range is the difference between the largest and the smallest scores in a
distribution.
Example: Scores of 3, 6, 8, 10, 14, 17. The range is 14 points. The scores range
from 3 to 17.
Variance is the mean of the squares of the deviation scores. Calculate the
difference (deviation) between each score and the mean of the scores, square
the deviations, sum the squares and divide the sum by the number of scores
minus 1.
Standard deviation measures the spread of data about their mean and is an
essential part of any statistical test. It is calculated by taking the square root of
the variance. This transforms variance into the same unit of measurement as the
raw scores. Standard deviation is expressed in terms of "one standard deviation
above the mean" or the like. If the standard deviation is 11 and the score is 63,
then one standard deviation is above the mean is 74, two standard deviations is
85 and so forth. The value of this figure becomes apparent when we understand
the relationship between standard deviations and percentiles in a normal curve.
The area contained within +1 and - 1 standard deviations of the mean includes
approximately 69 percent of all scores on the distribution. Therefore, in our
earlier example 68 percent of all scores were between 52 and 74.
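A short Python sketch (illustrative only) computes these spread measures for the
range example's scores and repeats the standard-deviation illustration from the text.

from statistics import variance, stdev

scores = [3, 6, 8, 10, 14, 17]           # the range example above

print("Range:             ", max(scores) - min(scores))    # 14
print("Sample variance:   ", round(variance(scores), 2))   # sum of squared deviations / (n - 1)
print("Standard deviation:", round(stdev(scores), 2))      # square root of the variance

# The text's illustration: with a mean of 63 and a standard deviation of 11,
# one standard deviation above the mean is 74 and two are 85.
m, sd = 63, 11
print("One SD above the mean:", m + sd, "  Two SD above:", m + 2 * sd)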
[Summary: nominal and ordinal data call for non-parametric tests such as chi-square;
interval data permit tests such as ANOVA (for three or more groups).]
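Assuming SciPy is available, this sketch pairs each scale with one of the tests named
above: a chi-square test on a hypothetical 2 x 2 table of counts (nominal data) and a
one-way ANOVA across three hypothetical groups (interval data).

from scipy import stats

# Chi-square for nominal data: counts of adopters vs. non-adopters by gender (hypothetical).
observed = [[30, 20],    # male:   adopted, did not adopt
            [25, 35]]    # female: adopted, did not adopt
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square = {chi2:.2f}, p = {p:.3f}")

# One-way ANOVA for interval data across three or more groups (hypothetical scores).
group_a = [72, 68, 75, 70]
group_b = [65, 63, 69, 66]
group_c = [74, 78, 71, 77]
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA F = {f_stat:.2f}, p = {p_value:.3f}")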
Reporting plan: Developing a reporting plan with stakeholders can help clarify
how, when and to whom findings should be disseminated.
Reporting Tips
• Reports that are short, concise and to the point are the ones that get
attention.
• Craft the style and content of the evaluation report to fit the intended
audience.
• Avoid technical terms that your audience may not know.
• Use a conversational tone.
• Use a combination of long and short sentences.
• Read report aloud to check for confusing ideas and sentences.
• Write in an active voice.
• Use a logical structure for your documents.
• Allow sufficient time for writing drafts and getting feedback and
proofreading.
• Conceptual clarity - Is the evaluation well focused and the purpose, role, and
general approach clearly stated?
• Appropriate methods and analysis - Were the appropriate methods chosen for
the evaluation? Were they used correctly? Were data analyzed and interpreted
carefully?
• Explicit standards & criteria for judging the evaluation - Did the evaluation
contain an explicit listing and/or discussion of the criteria and standards used to
make judgments about the evaluation object?
Questions to consider when judging an evaluation include:
o To what extent and in what ways could the program be improved? To what
extent were informed, high-quality decisions made?
o What do intended users think about the evaluation? What is the evaluation’s
credibility, believability, relevance, accuracy and potential utility?
o Who was involved? To what extent were key stakeholders and primary decision
makers involved throughout?
o What data were gathered? What were the focus, the design and the analysis?
What happened in the evaluation?
o To what extent were resources for the evaluation sufficient and well managed?
Was there sufficient time to carry out the evaluation?
In summary, the steps to evaluation include:
o Identify the program to be evaluated, its objectives and stakeholders.
o Assess the feasibility of implementing an evaluation.
o Consult stakeholders to clarify indicators of program merit.
o Identify approaches to data collection and select data collection techniques.
o Identify the target population and select the sample.
o Decide who will collect the data.
o Decide how the data will be analyzed and interpreted.
o Decide how evaluation findings will be shared with stakeholders.
References
Archer, T. and Layman, J. (1991). "Focus Group Interview" Evaluation
Guide Sheet, Ohio Cooperative Extension Service.
Salant, P. and Dillman, D. A. (1994). How to conduct your own survey.
New York, NY: John Wiley & Sons, Inc.
Wholey, J. S.; Hatry, H. P. and Newcomer, K. E. (eds.). (1994). Handbook
of practical program evaluation. San Francisco: Jossey-Bass Publishers.