Assessment and Evaluation
According to Kizlik (2011), evaluation is the most complex and least understood of these terms.
Hopkins and Antes (1990) defined evaluation as a continuous inspection of all available
information in order to form a valid judgment of students’ learning and/or the
effectiveness of an education program.
Evaluation is based on two philosophies. One, the traditional philosophy, holds that the
ability to learn is randomly distributed in the general population. This gave birth to
norm-referenced measurement of intellectual abilities. In norm-referenced
measurement, an individual's score is interpreted by comparing it to the scores
of a defined group, often called the normative group. The comparison is relative
rather than absolute.
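A norm-referenced interpretation can be sketched numerically. The following is a minimal illustration (all scores are made-up data, not from the text): a raw score is expressed as a z-score, i.e. its distance from the normative group's mean in standard-deviation units.

```python
# Sketch of a norm-referenced interpretation: an individual's raw score is
# converted to a z-score relative to a normative group's mean and standard
# deviation. The norm-group scores below are invented illustration data.
from statistics import mean, pstdev

def z_score(raw, norm_group):
    """Express a raw score in standard-deviation units above/below the norm-group mean."""
    return (raw - mean(norm_group)) / pstdev(norm_group)

norm_group = [55, 60, 62, 65, 68, 70, 72, 75, 78, 85]
student = 75
z = z_score(student, norm_group)
print(round(z, 2))  # positive: the student is above the group average
```

A positive z-score means the student scored above the normative group's mean; the interpretation is relative to the group, not to any absolute standard.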
Concept of Measurement
Classroom Assessment
Hamidi (2010) developed a framework to answer the why, what, how and
when of assessment. This is helpful in understanding the true nature of this concept.
Why to Assess: Teachers have clear goals for instruction, and they assess
to ensure that these goals have been or are being met. If objectives are the
destination and instruction is the path to it, then assessment is a tool to keep the
efforts on track and to ensure that the path is right.
Who to Assess: Teachers should treat students as 'real learners', not as course
or unit coverers. They should also anticipate that some students are quick at learning and
some are slow. Therefore, classroom assessment calls for a prior realistic
appraisal of the performance of individuals.
Types of Assessment
Based upon the functions that it performs, assessment is generally divided into
three types: assessment for learning, assessment of learning and assessment as
learning.
a. Assessment for Learning (Formative Assessment)
Assessment for learning is a continuous, ongoing assessment that allows teachers to
monitor students on a day-to-day basis and modify their teaching based on what the students need to
be successful. This assessment provides students with the timely, specific feedback that they need to
enhance their learning.
The role of assessment for learning in the instructional process can best be understood with the
help of the following diagram.
c. Assessment as Learning
Assessment as learning means to use assessment to develop and support students'
metacognitive skills. This form of assessment is crucial in helping students become lifelong
learners. Students develop a sense of efficacy and critical thinking when they use teacher,
peer and self-assessment feedback to make adjustments, improvements and changes to what
they understand.
The term ‘assessment’ is derived from the Latin word ‘assidere’ which means
‘to sit beside’. In contrast to testing, the tone of the term assessment is non-
threatening indicating a partnership based on mutual trust and
understanding. This emphasizes that there should be a positive rather than a
negative association between assessment and the process of teaching and
learning in schools. In the broadest sense assessment is concerned with
children’s progress and achievement.
Approaches to Evaluation
1. Formative Evaluation
Formative evaluation serves purposes such as:
i) to identify the content (i.e. knowledge or skills) which has not been
mastered by the student;
iii) to specify the relationships between content and levels of cognitive abilities in the
instructional context.
2. Summative Evaluation
Summative evaluation is primarily concerned with the purposes, progress and
outcomes of the teaching-learning process. It attempts, as far as possible, to
determine to what extent the broad objectives of a programme have been
achieved. It is based on the following assumptions.
Incomplete Statement
The capital city of Pakistan is
A. Paris.
B. Lisbon.
C. Islamabad.
D. Rome.
Students can generally respond to these types of questions quite quickly. As a result,
they are often used to test students’ knowledge of a broad range of content. Creating
these questions can be time-consuming because it is often difficult to generate several
plausible distracters. However, they can be marked very quickly.
Advantages:
Multiple-choice test items are not a panacea. They have advantages and disadvantages
just as any other type of test item. Teachers need to be aware of these
characteristics in order to use multiple-choice items effectively.
Versatility
Multiple-choice test items are appropriate for use in many different subject-matter areas,
and can be used to measure a great variety of educational objectives. They are
adaptable to various levels of learning outcomes, from simple recall of knowledge to
more complex levels, such as the student’s ability to:
• Analyze phenomena
• Apply principles to new situations
• Comprehend concepts and principles
• Discriminate between fact and opinion
• Interpret cause-and-effect relationships
• Interpret charts and graphs
• Judge the relevance of information
• Make inferences from given data
• Solve problems
Validity
A student is able to answer many multiple-choice items in a short amount of time. This feature enables
the teacher using multiple-choice items to test a broader sample of course content in a
given amount of testing time. Consequently, the test scores will likely be more
representative of the students’ overall achievement in the course.
Reliability
Well-written multiple-choice test items compare favourably with other test item types on
the issue of reliability. They are less susceptible to guessing than are true-false test
items, and therefore capable of producing more reliable scores.
Efficiency
Multiple-choice items are amenable to rapid scoring, which is often done by scoring
machines. This expedites the reporting of test results to the student so that any follow-
up clarification of instruction may be done before the course has proceeded much
further. Essay questions, on the other hand, must be graded manually, one at a time.
Overall multiple choice tests are:
Very effective
Versatile at all levels
Minimum of writing for student
Guessing reduced
Can cover broad range of content
Disadvantages
Versatility
Since the student selects a response from a list of alternatives rather than supplying or
constructing a response, multiple-choice test items are not adaptable to measuring
certain learning outcomes, such as the student’s ability to:
• Articulate explanations
• Display thought processes
• Furnish information
• Organize personal thoughts.
• Produce original ideas
• Provide examples
Such learning outcomes are better measured by short answer or essay questions, or by
performance tests.
Reliability
Although they are less susceptible to guessing than are true-false test items, multiple-choice
items are still affected by guessing to a certain extent. This guessing factor reduces the
reliability of multiple-choice item scores somewhat, but increasing the number of items
on the test offsets this reduction in reliability.
Difficulty of Construction
Good multiple-choice test items are generally more difficult and time-consuming to write
than other types of test items.
2. True/False Questions
A True-False test item requires the student to determine whether a statement is true or
false. The chief disadvantage of this type is the opportunity for successful guessing.
Also known as a “binary-choice” item because there are only two options to select from.
These types of items are most effective for assessing knowledge, comprehension, and
application outcomes as defined in the cognitive domain of Bloom’s Taxonomy of
educational objectives.
Example
Directions: Circle the correct response to the following statements.
1. Allama Iqbal is the founder of Pakistan. T/F
Good for:
Knowledge level content
Evaluating student understanding of popular misconceptions
Concepts with two logical responses
Advantages:
Easily assess verbal knowledge
Easy to construct for the teacher
Easy to score for the examiner
Helpful for poor students
Can test large amounts of content
Disadvantages:
It is difficult to discriminate between students who know the material
and students who don't.
Need a large number of items for high reliability.
Fifty percent guessing factor.
Assess lower order thinking skills.
Poor representative of students' learning achievement.
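The fifty percent guessing factor above can be quantified. The sketch below uses the standard correction-for-guessing formula (right minus wrong divided by the number of options minus one), which is a common psychometric convention rather than something prescribed in this text; for true/false items it reduces to right minus wrong.

```python
# Expected score from blind guessing on a true/false test, and the common
# correction-for-guessing formula: corrected = right - wrong / (k - 1),
# where k is the number of options (k = 2 for true/false).
def expected_chance_score(num_items, num_options=2):
    """Expected number of correct answers from pure guessing."""
    return num_items / num_options

def corrected_score(right, wrong, num_options=2):
    """Score corrected for guessing; for true/false this is right - wrong."""
    return right - wrong / (num_options - 1)

print(expected_chance_score(50))   # a pure guesser expects 25.0 of 50 correct
print(corrected_score(40, 10))     # 40 right, 10 wrong on T/F -> 30.0
```

The same functions show why multiple-choice items suffer less from guessing: with four options, a pure guesser expects only a quarter of the items correct.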
Tips for Writing Good True/False items:
Avoid double negatives.
Avoid long/complex sentences.
Use only one central idea in each item.
Don't emphasize the trivial.
Use exact quantitative language.
Don't lift items straight from the book.
Make more false than true (60/40). (Students are more likely to answer
true.)
The desired method of marking true or false should be clearly explained
before students begin the test.
Construct statements that are definitely true or definitely false, without
additional qualifications. If opinion is used, attribute it to some source.
Avoid the following:
a. verbal clues, absolutes, and complex sentences;
b. broad general statements that are usually not true or false without
further qualifications;
c. terms denoting indefinite degree (e.g., large, long time, or regularly)
or absolutes (e.g., never, only, or always);
d. placing items in a systematic order (e.g., TTFF, TFTF, and so on);
e. taking statements directly from the text and presenting them out of
context.
3. Matching items
The matching items consist of two parallel columns. The column on the left contains the
questions to be answered, termed premises; the column on the right, the answers,
termed responses. The student is asked to associate each premise with a response to
form a matching pair.
For example:
Islamabad      Iran
Tehran         Spain
Istanbul       Portugal
Madrid         Pakistan
Jeddah         Turkey
Matching test items are used to test a student's ability to recognize relationships and to
make associations between terms, parts, words, phrases, clauses, or symbols in one
column with related alternatives in another column.
Good for:
Knowledge level
Some comprehension level, if appropriately constructed
Advantages:
The chief advantage of matching exercises is that a good deal of factual information can
be tested in minimal time, making the tests compact and efficient. They are especially
well suited to who, what, when and where types of subject matter. Further, students
frequently find the tests fun to take because they have puzzle-like qualities.
Maximum coverage at knowledge level in a minimum amount of
space/prep time
Valuable in content areas that have a lot of facts
Disadvantages:
The principal difficulty with matching exercises is that teachers often find that the
subject matter is insufficient in quantity or not well suited for matching terms. An
exercise should be confined to homogeneous items containing one type of subject
matter (for instance, authors-novels; inventions-inventors; major events-dates; terms-definitions;
rules-examples; and the like).
Time consuming for students
Not good for higher levels of learning
Tips for Writing Good Matching items:
Here are some suggestions for writing matching items:
Keep both the list of descriptions and the list of options fairly short and
homogeneous.
The list of descriptions on the left side should contain the longer
phrases or statements, whereas the options on the right side should
consist of short phrases, words or symbols.
Each description in the list should be numbered (each is an item), and
the list of options should be identified by letter.
Include more options than descriptions. If the option list is longer than
the description list, it is harder for students to eliminate options. If the
option list is shorter, some options must be used more than once.
Always include some options that do not match any of the descriptions,
or some that match more than one, or both.
Use 15 items or fewer.
Use items in response column more than once (reduces the effects of
guessing).
Put all items on a single page.
4. Completion Items
Like true-false items, completion items are relatively easy to write. These are also
known as “Gap-Fillers.” They are most effective for assessing knowledge and comprehension
learning outcomes but can be written for higher level outcomes, e.g.
The capital city of Pakistan is ________.
Suggestions for Writing Completion or Supply Items
Here are our suggestions for writing completion or supply items:
I. If at all possible, items should require a single-word answer or a
brief and definite statement. Avoid statements that are so indefinite
that they may be logically answered by several terms.
a. Poor item:
World War II ended in ________.
b. Better item:
World War II ended in the year ________.
II. Be sure the question or statement poses a problem to the
examinee. A direct question is often more desirable than an
incomplete statement because it provides more structure.
III. Be sure the answer that the student is required to produce is
factually correct.
IV. Omit only key words; don’t eliminate so many elements that the
sense of the content is impaired.
a. Poor item:
The ________ type of test item is usually more ________ than the ________ type.
b. Better item:
The supply type of test item is usually graded less objectively than the ________ type.
5. Short Answer
The student supplies a response to a question; this might consist of a single word or
phrase. These items are most effective for assessing knowledge and comprehension learning outcomes
but can be written for higher level outcomes. Short answer items are of two types.
Simple direct questions
Who was the first president of Pakistan?
Completion items
The name of the first president of Pakistan is ________.
Good for:
Knowledge and comprehension levels (though items can be written for higher levels)
Advantages:
Gronlund (1995) writes that short-answer items have a number of advantages.
They reduce the likelihood that a student will guess the correct answer.
They are relatively easy for a teacher to construct.
They are adapted to mathematics, the sciences, and foreign languages
where specific types of knowledge are to be tested (The formula for
ordinary table salt is--------).
They are consistent with the Socratic question and answer format
frequently employed in the elementary grades in teaching basic skills.
Disadvantages:
May overemphasize memorization of facts
Take care - questions may have more than one correct answer
Scoring is laborious
Tips for Writing Good Short Answer Items:
When using definitions: supply the term, not the definition, for a better
gauge of student knowledge.
For numbers, indicate the degree of precision/units expected.
Use direct questions, not an incomplete statement.
If you do use incomplete statements, don't use more than 2 blanks
within an item.
Arrange blanks to make scoring easy.
Try to phrase question so there is only one answer possible.
6. Essay
Essay questions are supply or constructed response type questions and can be the best
way to measure the students' higher order thinking skills, such as applying, organizing,
synthesizing, integrating, evaluating, or projecting while at the same time providing a
measure of writing skills. The student has to formulate and write a response, which may
be detailed and lengthy. The accuracy and quality of the response are judged by the
teacher.
Essay questions provide a complex prompt that requires written responses, which can
vary in length from a couple of paragraphs to many pages. Like short answer questions,
they provide students with an opportunity to explain their understanding and
demonstrate creativity, but make it hard for students to arrive at an acceptable answer
by bluffing. They can be constructed reasonably quickly and easily but marking these
questions can be time-consuming and grade agreement can be difficult.
Essay questions differ from short answer questions in that the essay questions are less
structured. This openness allows students to demonstrate that they can integrate the
course material in creative ways. As a result, essays are a favoured approach to test
higher levels of cognition including analysis, synthesis and evaluation. However, the
requirement that the students provide most of the structure increases the amount of
work required to respond effectively. Students often take longer to compose a five-paragraph
essay than they would to compose a paragraph answer to short answer
questions.
There are two major categories of essay questions: short response (also referred to as
restricted or brief) and extended response.
A. Restricted Response Essay Items
An essay item that poses a specific problem for which a student must recall proper
information, organize it in a suitable manner, derive a defensible conclusion, and
express it within the limits of the posed problem, or within a page or time limit, is called a
restricted response essay item. The statement of the problem specifies response
limitations that guide the student in responding and provide evaluation criteria for
scoring.
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and
Faisalabad.
When Should Restricted Response Essay Items be used?
Restricted response essay items are usually used to:
Analyze relationships
Compare and contrast positions
State necessary assumptions
Identify appropriate conclusions
Explain cause-effect relationship
Organize data to support a viewpoint
B. Extended Response Essay Type Items
An essay type item that allows the student to determine the length and complexity of the
response is called an extended-response essay item. This type of essay is most useful
at the synthesis or evaluation levels of the cognitive domain. When we are interested in
determining whether students can organize, integrate, express, and evaluate
information, ideas, or pieces of knowledge, extended response items are used.
Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give
advantages and disadvantages of each. Your response will be graded on its accuracy,
comprehensiveness and practicality. Your response should be 8-10 pages in length and
it will be evaluated according to the RUBRIC (scoring criteria) already provided.
Overall, essay type items (both restricted response and extended response) are
Good for:
Application, synthesis and evaluation levels
Advantages:
Students less likely to guess
Easy to construct
Stimulates more study
Allows students to demonstrate ability to organize knowledge, express
opinions, show originality.
Disadvantages:
Can limit amount of material tested, therefore has decreased validity.
Subjective, potentially unreliable scoring.
Time consuming to score.
Tips for Writing Good Essay Items:
Provide reasonable time limits for thinking and writing.
Avoid giving students a choice of questions to answer. (You won't get a
good idea of the breadth of student achievement when they answer
only some of the questions.)
Give a definitive task to the student: compare, analyze, evaluate, etc.
Use a checklist point system to score, with a model answer: write an outline,
and determine how many points to assign to each part.
Score one question at a time across all students.
Types of Reliability
There are six general classes of reliability estimates, each of which estimates reliability
in a different way. They are:
i) Inter-Rater or Inter-Observer Reliability
To assess the degree to which different raters/observers give consistent estimates of
the same phenomenon. That is, if two teachers mark the same test and their results are
similar, this indicates inter-rater or inter-observer reliability.
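As an illustration of inter-rater reliability, the agreement between the two teachers is often summarized as a correlation between their marks. The sketch below computes a Pearson correlation in plain Python; the mark lists are invented illustration data, not from the text.

```python
# Inter-rater reliability sketch: the Pearson correlation between two
# teachers' marks for the same ten scripts (made-up data).
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

teacher_a = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
teacher_b = [13, 14, 10, 19, 13, 12, 15, 14, 9, 18]
r = pearson_r(teacher_a, teacher_b)
print(round(r, 2))  # a value close to 1.0 indicates strong inter-rater agreement
```

The same function can be reused for the other correlation-based reliability estimates in this section, such as test-retest and parallel-form reliability.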
ii) Test-Retest Reliability:
When the same test is administered twice and the results of both administrations are
similar, this constitutes test-retest reliability.
iii) Parallel-Form Reliability:
To assess the consistency of the results of two tests constructed in the same way from
the same content domain. Here the test designer develops two tests of a similar
kind; if the results after administration are similar, this indicates parallel-form
reliability.
1. Test Length:
As a rule, adding more homogeneous questions to a test will increase the test's
reliability.
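This rule of thumb is usually quantified by the Spearman-Brown prophecy formula, a standard psychometric result (not stated in this text itself), which predicts the reliability of a test lengthened by a factor n:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test
# lengthened by a factor n, given its current reliability r:
#     r_new = n * r / (1 + (n - 1) * r)
def spearman_brown(r, n):
    """Predicted reliability after changing test length by factor n."""
    return n * r / (1 + (n - 1) * r)

# Doubling a test whose current reliability is 0.60 (illustrative numbers):
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

The formula assumes the added items are homogeneous with the existing ones, which is exactly the condition stated in the rule above.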
2. Method Used to Estimate Reliability:
The reliability coefficient is an estimate that can change depending on the method used
to calculate it. The method chosen to estimate the reliability should fit the way in which
the test will be used.
3. Heterogeneity of Scores
Heterogeneity refers to the differences among the scores obtained from a class.
Increasing the heterogeneity of the examinee sample increases variability (individual
differences), and thus reliability increases.
4. Difficulty
A test that is too difficult or too easy reduces reliability (e.g., very few test-takers
answer correctly, or nearly all do). A moderate level of difficulty increases test
reliability.
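One way to see why moderate difficulty helps: for an item scored right or wrong, the variance of item scores is p(1 - p), where p is the proportion answering correctly, and this is largest at p = 0.5. A small sketch with illustrative values only:

```python
# For an item scored right/wrong, the item score variance is p * (1 - p),
# where p is the proportion answering correctly. Variance (and hence the
# item's contribution to score spread and reliability) peaks at p = 0.5.
def item_variance(p):
    """Variance of a dichotomously scored item with proportion-correct p."""
    return p * (1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, item_variance(p))
# very easy (p = 0.9) and very hard (p = 0.1) items contribute little variance
```

Items near 50% difficulty spread examinees out the most, which is what reliability coefficients reward.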
Validity
The validity of an assessment tool is the degree to which it measures what it is
designed to measure. For example, if a test is designed to measure the skill of adding
three-digit numbers in mathematics, but the problems are presented in language too
difficult for the ability level of the students, then it may not measure that addition
skill and consequently will not be a valid test. Many measurement experts
have defined this term; some of the definitions are given below.
According to Business Dictionary the “Validity is the degree to which an instrument,
selection process, statistical technique, or test measures what it is supposed to
measure.”
Overall we can say that in terms of assessment, validity is the extent to which a test
measures what it claims to measure. It is vital for a test to be valid in order for the
results to be accurately applied and interpreted.
2. Construct Validity
Construct validity is a test’s ability to measure factors which are relevant to the field of
study. Construct validity is thus an assessment of the quality of an instrument or
experimental design.
For Example - Integrity is a construct; it cannot be directly observed, yet it is useful for
understanding, describing, and predicting human behaviour.
3. Criterion Validity
It compares the test with other measures or outcomes (the criteria) already held to be
valid. For example, employee selection tests are often validated against measures of
job performance (the criterion), and IQ tests are often validated against measures of
academic performance (the criterion).
4. Concurrent Validity
Concurrent validity refers to the degree to which scores taken at one point
correlate with other measures (test, observation or interview) of the same construct
measured at the same time.
For example:
Consider assessing the validity of a diagnostic screening test. In this case the predictor (X) is the
test and the criterion (Y) is the clinical diagnosis. When the correlation is large, this
means that the predictor is useful as a diagnostic tool.
5. Predictive Validity
Predictive validity indicates how well the test predicts some future behaviour of the
examinee.
For example, a political poll intends to measure future voting intent. College entry tests
should have a high predictive validity with regard to final exam results. When the two
sets of scores are correlated, the coefficient that results is called the predictive validity
coefficient.
Transparency
In simple words, transparency is a process which requires teachers to maintain
objectivity and honesty in developing, administering, marking and reporting test
results. Transparency refers to the availability of clear, accurate information to students
about testing. It makes students part of the testing process.
Security
Most teachers feel that security is an issue only in large-scale, high-stakes testing.
However, security is part of both reliability and validity. If a teacher invests time and
energy in developing good tests that accurately reflect the course outcomes, then it is
desirable to be able to recycle the tests or similar materials. This is especially important
if analyses show that the items, distracters and test sections are valid and
discriminating. In some parts of the world, cultural attitudes towards “collaborative test-
taking” are a threat to test security and thus to reliability and validity. As a result, there is
a trade-off between letting tests into the public domain and giving students adequate
information about tests.
Objectivity
The objectivity of a test refers to the degree to which equally competent scorers
obtain the same results. Most standardized tests of aptitude and achievement are high
in objectivity. The test items are of the objective type (e.g., multiple choice),
and the resulting scores are not influenced by the scorer's judgment or opinion. In fact,
such tests are usually constructed so that they can be accurately scored by trained clerks
and scoring machines. When such highly objective procedures are used, the reliability of
the test results is not affected by the scoring procedures.
For classroom tests constructed by teachers, objectivity may play an important role
in obtaining reliable measures of achievement. In essay testing and various
observational procedures the results depend to a large extent on the person doing the
scoring. Different persons get different results, and even the same person may get
different results at different times. Such inconsistency in scoring has an adverse effect
on the reliability of the measures obtained, for the test scores now reflect the opinions
and biases of the scorer as well as the differences among pupils in the characteristic
being measured.
The solution is not to use only objective tests and to abandon all subjective methods of
evaluation, as this would have an adverse effect on validity, and as we noted earlier,
validity is the most important quality of evaluation results. A better solution is to select
the evaluation procedure most appropriate for the behaviour being evaluated and
then to make the evaluation procedure as objective as possible. In the use of essay
tests, for example, objectivity can be increased by careful phrasing of the questions.