
TEST CONSTRUCTION AND VALIDATION
TEST
 TEST is an instrument or tool to
measure any quality, ability, skill or
knowledge.
MEASUREMENT is
 the process of quantifying an
individual's achievement,
personality, attitudes among others
 the process of quantifying the
degree to which someone/
something possesses a given trait
EDUCATIONAL MEASUREMENT
 A process of gathering data that provides for a more precise and objective appraisal of learning outcomes than could be accomplished by less formal and systematic procedures.
ASSESSMENT
 Concerns itself with the totality of the
educational setting, and is the more
inclusive term, that is, it subsumes
measurement and evaluation.
 It focuses not only on the nature of the
learner, but also on what is to be learned
and how. Assessments are made
continuously in educational settings.
EVALUATION

 A process of systematic collection


and analysis of both qualitative and
quantitative data for the purpose of
making decisions or judgments.
WHY MEASUREMENT?
 Assistance in interpretation.
 Data reduction.
 Descriptive flexibility.
 Identification of patterns.
TESTS, MEASUREMENT, AND EVALUATION
Test: Vocabulary subtest of the Stanford Achievement Test
Measurement: Obtaining 62 correct answers on a 75-item teacher-made classroom test covering the history of ancient Egypt
Evaluation: Student is promoted to sixth grade
APPLICATION OF EDUCATIONAL MEASUREMENT DATA
 Selecting, appraising, and clarifying instructional objectives
 Determining and reporting pupil achievement of educational objectives
 Accountability and program evaluation
 Planning, directing, and improving learning experiences
 Counseling
 Selection
Types of Evaluation
 PLACEMENT = determines student placement or classification before instruction begins
 SUMMATIVE = determines the extent to which the objectives of instruction have been achieved
 FORMATIVE = monitors students' progress during the learning process
 DIAGNOSTIC = identifies specific strengths and weaknesses in the student's past and present learning
Purposes of Evaluation
 to determine achievement of curricular objectives
 to monitor the effectiveness of teaching and to identify individual learning problems
 to provide feedback to parents
 to provide feedback to students
Forms / Kinds of Assessment
 Traditional Assessment - refers to forced-choice measures
 Performance Assessment - a complex task, often involving the creation of a product
 Portfolio Assessment - a collection of many different indicators of student progress in support of curricular goals in a dynamic, ongoing, and collaborative process
 Authentic Assessment - is used to evaluate students' work by measuring the product according to real-life criteria
Principles of Evaluation
 integral part of teaching
 continuous process
 objectives of learning
 validity
 reliability
 diagnostic characteristic
 participative
 variety
 Validity
In evaluating learners, there must
be a close relationship between
what the test measures and what it
is supposed to measure.
Validity is shown when the
arrow hits its target
RELIABILITY
 Reliability refers to the consistency with which students perform on a test.
 It is the stability or repeatability of performance on a test.
This is repeatability but
not within the target.

This is being Valid & Reliable
Classification of Teacher-Made Tests

 Objective Test
– Supply Type
 short answer
 completion
– Selective Type
 true-false or alternative response
 Matching
 multiple choice
 Essay Test
– Extended response
– Restricted response
NON-TEST METHOD
– Observation of student work
– Group Evaluation Activities
 Class Discussion
 Homework
 Notebooks and Note-taking
 Reports, Themes and Research Papers
 Discussions and Debates
General Suggestions for Writing Test Items
– Use your test specification as a guide to item writing.
– Write more test items than needed.
– Write the test items well in advance of the
testing date.
– Write each test item so that the task to be
performed is clearly defined.
– Write each item at an appropriate difficulty
level.
– Write each item so that it does not provide
help in answering other items in the test.
– Write each test item so that the answer is
one that would be agreed upon by experts.
Short-Answer Items

 Make sure the required answer is brief and


specific.
 Avoid verbatim statements from the textbook
that encourage memorization.
 Word the item as a direct question, if possible.
 If fill-in-the-blank items are used, use only one
blank per statement and place it toward the
end of the statement.
 Omit the most important, not trivial, words
in completion items in order to assess
understanding of relevant concepts.
 Avoid unintended grammatical cues.

 Prepare a scoring key with anticipated


acceptable answers or model answers.
 Provide sufficient answer space, making all
blanks the same length to avoid providing
clues to the correct answer.
 In preparing short-answer items, if a statement is over-mutilated (has too many blanks), the meaning is likely to get lost and the pupils simply tend to guess the answer. How would you improve the item? “______ is anything that occupies _______ and has ________”.
As regards alternative-response items, double negatives can be particularly difficult since they add to the ambiguity of the statement. How may the following item be improved?
“Intelligence is not a non-hereditary trait.”
General Suggestions for
writing Matching Type
Items
Matching Type
 Use homogenous material in a given exercise.
A set of matching items must deal with the
same material. It is difficult to write matching
items across topics.
 Use more responses than premises, providing directions that responses may be used once, more than once, or not at all, to avoid giving away answers.
 Keep the lists brief, especially the list of responses that students have to scan for the correct one.
 List responses in a logical order.
 Indicate in the test direction the basis to
be used for matching premises to
responses.
 Place all items for one matching exercise
on the same page.
 Label the premises with numbers and the
responses with letters.
Multiple-Choice Items

Example:

 How could you handle a child who clings to


immature behavior?
a) Put him back to a lower grade.
b) Seek the assistance of the school
psychologist.
c) Help him to meet his needs in a more
mature manner.
d) Advise the parents to let him stop studying
for a while until he becomes more mature.
Parts of the Multiple-Choice Item
 Stem: How could you handle a child who clings to immature behavior?
 Correct answer: (c) Help him to meet his needs in a more mature manner.
 Foils/Distractors – the wrong choices
GENERAL SUGGESTIONS IN WRITING MULTIPLE-CHOICE ITEMS
 Make the alternatives grammatically
consistent with the stem to avoid providing
inadvertent clues to the correct answer.
 Write the stem of the item so that it is meaningful and presents a clear problem without the students having to look at the alternatives.
 Include as much of the item as possible in
the stem without providing irrelevant
material.
 Use negatively stated items rarely and only if absolutely necessary. If used, emphasize the negative using boldface type or capital letters.
 Make sure there is only one correct
or clearly best answer.
 Provide plausible foils to avoid giving
away the answer. Use foils
(distractors) that represent likely
mistakes of students to help
diagnose misconceptions or errors in
reasoning.
 Avoid verbal associations between
the stem and the answer that give
unintended clues.
 Make sure the length of the correct
alternative does not provide clues by
being either significantly longer or shorter
than the foils.
 Make each alternative position (A, B, C, D) the correct answer approximately an equal number of times. The position of the correct answer should be arranged randomly.
 Avoid using “none of the above” and “all
of the above” unless there are specific
reasons for doing so.
 Avoid requiring personal opinion, which
will lead to the possibility of more than
one correct answer.
 Avoid wording that is taken verbatim
from the textbook or other instructional
materials, as this encourages
memorization rather than
understanding.
 Avoid linking two or more items
together, except when writing
interpretative exercise. Items should be
independent and not provide clues to
other items.
Types of Validity

 Content Validity
 Face Validity
 Criterion-Related
Validity
Content Validity – involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. This is assured by a table of specification.
Face Validity – refers not to
what a test actually
measures, but to what it
appears superficially to
measure. Face validity
pertains to whether the test
“looks valid” to the
examinees who take it.
Criterion-Related Validity
– indicates the effectiveness
of a test in predicting an
individual’s behavior in
specific situations.
 CRITERION-RELATED VALIDITY – is
established statistically such that a set of
scores revealed by a test is correlated
with the scores obtained in an identified
criterion or measure.
– Concurrent Validity – describes the present
status of the individual by correlating the sets
of scores obtained from two measures given
concurrently.
– Predictive Validity – describes the future
performance of an individual by correlating
the sets of scores obtained from two
measures given at a longer time interval.
CONSTRUCT VALIDITY –
 involves the psychological meaningfulness of a test score, that is, the degree to which certain theoretical factors or constructs can account for item responses or performances.
Validation of Content Validity
The instrument exhibits validity when it measures what it is supposed to measure, and when it hits its target information.
 Instruments such as tests should show content validity. Content validity in tests, such as diagnostic tests, achievement tests, quarterly tests, etc., must be assured by a table of specification, which shows the distribution of items within the content scope of the test.
Table of Specification

Content        | Knowledge | Computation | Analysis | # of items | %
Addition       | Test I-1  | II-1        | III-1    | 3          | 21%
Subtraction    | I-2       | II-2, 3     | III-2    | 4          | 29%
Multiplication | I-3, 4    | II-4        | III-3    | 4          | 29%
Division       | I-5       | II-5        | III-4    | 3          | 21%
Total          | 5         | 5           | 4        | 14         | 100%
 Aside from the table of specification,
a test must come up with the indices
of difficulty and discrimination.
The difficulty index
 The difficulty index shows whether an item is acceptable or not, based on how difficult the students found it to answer.
The discrimination index
 The discrimination index shows how well the item discriminates between the high-scoring and low-scoring groups of students. It validates the performance of the high group against the low group: a high discrimination index means the item confirms the good performance of the high group compared to the low group.
Item analysis
Item analysis follows the given procedure:
 1. Dry run the test and score the papers.
 2. Arrange the papers from highest to lowest.
 3. Get the upper and lower 27% of the
papers. The upper 27% shall compose the
upper group while the lower 27%, the lower
group.
 4. Tally the answers of the upper and lower
group in each item.
 5. Compute necessary statistics to analyze
the items and the whole test.
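The five-step procedure above can be sketched in code. The following Python sketch is illustrative only: the function names, the data layout (a list of (score, answers) pairs), and the 27% cut are assumptions drawn from the steps listed, not part of the original material.

```python
# Illustrative sketch of steps 2-4 of the item-analysis procedure (names are hypothetical).

def split_upper_lower(papers, fraction=0.27):
    """papers: list of (score, answers) pairs; answers maps item number -> chosen option."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)  # step 2: highest to lowest
    n = round(len(ranked) * fraction)                          # step 3: upper/lower 27%
    return ranked[:n], ranked[-n:]

def tally_responses(group, item, options="abcd"):
    """Step 4: count how many students in a group chose each option of one item."""
    return {opt: sum(1 for _, answers in group if answers.get(item) == opt) for opt in options}

# Usage (hypothetical data):
# papers = [(62, {1: "c", 2: "d"}), (40, {1: "a", 2: "d"}), ...]
# upper, lower = split_upper_lower(papers)
# print(tally_responses(upper, 1), tally_responses(lower, 1))
```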
A Response Analysis Table

Item | Group | a  | b | c   | d
1    | Upper | 5  | 7 | 12* | 0
     | Lower | 10 | 6 | 11* | 0
2    | Upper | 0  | 2 | 2   | 15*
     | Lower | 7  | 5 | 4   | 11*

(* marks the keyed correct choice)

For item 1, c is the correct answer:
d = ineffective distracter (chosen by no one in either group)
a = good distracter (attracts more of the lower group than the upper group)
b = poor distracter (attracts about as many in the upper group as in the lower group)
 Difficulty Index = (Ru + Rl) / N

 Discrimination Index = (Ru - Rl) / (½N)

 Ru = number of correct responses in the upper group
 Rl = number of correct responses in the lower group
 N = total number of students in the upper and lower groups
 ½N = N divided by 2
 Based on the response analysis table, c is the correct response, thus:

 Difficulty Index = (12 + 11)/54 = .43

 Discrimination Index = (12 - 11)/27 = .04
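As a quick check of the arithmetic above, here is a minimal Python sketch of the two formulas; the function names are illustrative, and N = 54 follows the worked example.

```python
# Minimal sketch of the difficulty and discrimination index formulas defined above.

def difficulty_index(ru, rl, n):
    """ru, rl: correct responses in the upper and lower groups; n: students in both groups."""
    return (ru + rl) / n

def discrimination_index(ru, rl, n):
    return (ru - rl) / (n / 2)

# Item 1 of the response analysis table (correct option c), with N = 54 as in the example:
print(round(difficulty_index(12, 11, 54), 2))      # 0.43
print(round(discrimination_index(12, 11, 54), 2))  # 0.04
```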


 To judge the results as to acceptability, the item's difficulty index (expressed as the percentage of correct responses) is interpreted against its discrimination index (.1 to 1.0):

Difficulty index | Interpretation
19.5 and below   | Very difficult
19.60 – 44.49    | Difficult
44.50 – 74.50    | Optimum
74.51 – 89.50    | Easy
89.51 and above  | Very easy

The example item (difficulty .43, discrimination .04) falls in the Difficult band with very low discrimination.
Reliability
 The reliability of the test using the Kuder-Richardson formula 20 (KR-20) can also be computed using the data from the response analysis table by getting the total number of correct responses in both the upper and lower groups. Based on the table, 23 students got the correct answer (see difficulty index). The difficulty index is equal to p, which represents the proportion of correct responses over the total number of students in the upper and lower groups.
Reliability Computation

Item | p   | q = (1 - p) | pq
1    | .43 | .57         | .2451
2    | .48 | .52         | .2496
…    | …   | …           | …
k    | pk  | qk          | pkqk
                  Σpq (sum of the last column)
 KR20 = rtt = [k/(k - 1)] × [1 - (Σpq/σ²x)]

Where:
k = total number of items
σ²x = the variance of the total test
p = proportion of those who got the item correct
q = 1 - p
Σpq = the sum of the products of p and q
 Example:
A class of 54 took a ten-item test in Physics. Each item is worth 1 point. The upper 27% and lower 27% of the students were taken, and they composed the upper and lower groups, respectively. The response analysis table and the discrimination and difficulty indices were computed as shown.
 The scores of the upper and lower
groups on the test were recorded as
follows: upper group; 10, 10, 10, 9, 9,
9, 9, 9, 8, 8, 8, 8, 8, 7,7 and lower
group; 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1,
1, 1.
 Get the 2x ( variance or square of the

Standard Deviation) of the scores using


the calculator. Ans=9.53
 or the formula
  2 = √ x2/N .

 Where x2 = X2 – ( X)2/N.

X=score
X2= Square of the score
Illustrative Example of Item Analysis

Item | Group | a   | b   | c  | d   | Diff             | Disc
1    | Upper | 0   | 0   | 0  | 15* | (15+10)/30 = .83 | (15-10)/15 = .33
     | Lower | 2   | 1   | 1  | 10* |                  |
2    | Upper | 0   | 0   | 0  | 15* | .83              | .33
     | Lower | 2   | 1   | 1  | 10* |                  |
3    | Upper | 0   | 14* | 0  | 0   | .56              | .73
     | Lower | 2   | 3*  | 3  | 8   |                  |
4    | Upper | 0   | 0   | 0  | 15* | .63              | .73
     | Lower | 5   | 2   | 2  | 4*  |                  |
5    | Upper | 0   | 0   | 0  | 15* | .83              | .33
     | Lower | 1   | 2   | 2  | 10* |                  |
6    | Upper | 0   | 15* | 0  | 0   | .53              | .93
     | Lower | 4   | 1*  | 4  | 6   |                  |
7    | Upper | 0   | 10* | 5  | 0   | .37              | .60
     | Lower | 1   | 1*  | 10 | 3   |                  |
8    | Upper | 0   | 15* | 0  | 0   | .76              | .47
     | Lower | 2   | 8*  | 2  | 3   |                  |
9    | Upper | 0   | 14* | 0  | 1   | .53              | .80
     | Lower | 1   | 2*  | 9  | 2   |                  |
10   | Upper | 15* | 0   | 0  | 0   | .67              | .67
     | Lower | 5*  | 7   | 1  | 2   |                  |

(* marks the keyed correct choice)
 To judge the results as to acceptability, the items are plotted by difficulty band and discrimination index:

Difficulty band                 | Items (discrimination index)
Very difficult (19.5 and below) | –
Difficult (19.60 – 44.50)       | 7 (.60)
Optimum (44.50 – 74.50)         | 3, 4, 9 (.73 – .80); 10 (.67); 6 (.93)
Easy (74.60 – 89.50)            | 1, 2, 5 (.33); 8 (.47)
Very easy (89.60 and above)     | –
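The banding above can be expressed as a small classifier. The sketch below is an assumption-laden illustration: the band boundaries are taken from the table, and the item difficulty indices from the illustrative example.

```python
# Classify items into difficulty bands (difficulty index expressed as % correct).

def difficulty_band(diff_pct):
    if diff_pct <= 19.5:
        return "Very difficult"
    if diff_pct <= 44.50:
        return "Difficult"
    if diff_pct <= 74.50:
        return "Optimum"
    if diff_pct <= 89.50:
        return "Easy"
    return "Very easy"

difficulties = [.83, .83, .56, .63, .83, .53, .37, .76, .53, .67]  # items 1-10 from the example
for item, diff in enumerate(difficulties, start=1):
    print(item, difficulty_band(diff * 100))  # item 7 -> Difficult; 3,4,6,9,10 -> Optimum; rest -> Easy
```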
Compute pq for each item, using the difficulty index as p:

Item | p   | q = 1 - p | pq
1    | .83 | .17       | .1411
2    | .83 | .17       | .1411
3    | .57 | .43       | .2451
4    | .63 | .37       | .2331
5    | .83 | .17       | .1411
6    | .53 | .47       | .2491
7    | .37 | .63       | .2331
8    | .76 | .24       | .1824
9    | .53 | .47       | .2491
10   | .67 | .33       | .2211
              Σpq = 2.0363
The Reliability Coefficient

 KR20 = rtt = [k/(k - 1)] × [1 - (Σpq/σ²x)]
      = (10/9) × [1 - (2.0363/9.53)]
      = .87

This is a very high reliability coefficient.
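For readers who want to reproduce the computation, here is a hedged Python sketch of KR-20 using the item p values and the score variance from this example; the helper name kr20 and the data layout are illustrative, not the source's.

```python
# Sketch of the KR-20 computation from the worked example above.

def kr20(p_values, variance):
    """p_values: per-item proportion correct (difficulty index); variance: variance of total scores."""
    k = len(p_values)
    sum_pq = sum(p * (1 - p) for p in p_values)  # sum of pq over items (about 2.0363 here)
    return (k / (k - 1)) * (1 - sum_pq / variance)

p = [.83, .83, .57, .63, .83, .53, .37, .76, .53, .67]  # difficulty indices of the 10 items
scores = [10, 10, 10, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7,
          5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1]   # upper and lower groups
mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)  # about 9.53
print(round(kr20(p, variance), 2))                             # about 0.87
```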
Types of Reliability
 Test-Retest reliability
 Alternate Form reliability

 Split-Half Reliability

 Rational Equivalence Reliability

 Scorer Reliability
1. Test-Retest Reliability (coefficient of stability) – repeating a test on a second occasion using the same group of examinees. The two sets of scores are then correlated using Pearson's r. The computed value is the reliability coefficient.
Stud. | Test X | Retest Y | X²  | XY  | Y²
1     | 11     | 8        | 121 | 88  | 64
2     | 9      | 10       | 81  | 90  | 100
3     | 5      | 6        | 25  | 30  | 36
4     | 13     | 14       | 169 | 182 | 196
5     | 15     | 16       | 225 | 240 | 256
6     | 3      | 4        | 9   | 12  | 16
7     | 1      | 3        | 1   | 3   | 9
8     | 2      | 3        | 4   | 6   | 9
9     | 8      | 9        | 64  | 72  | 81
10    | 5      | 6        | 25  | 30  | 36
Σ     | 72     | 79       | 724 | 753 | 803

r = [NΣXY - ΣXΣY] / √{[NΣX² - (ΣX)²][NΣY² - (ΣY)²]}

r = (10×753 - 72×79) / √{[10×724 - (72)²][10×803 - (79)²]}

r = 0.9604. The test-retest reliability of the test is 96%.
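The same correlation can be computed directly in code. The sketch below is illustrative (the pearson_r helper is not from the source) and reproduces the r of about 0.96 above from the raw test and retest scores.

```python
# Sketch of the Pearson r used for test-retest reliability, applied to the table above.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

test   = [11, 9, 5, 13, 15, 3, 1, 2, 8, 5]
retest = [8, 10, 6, 14, 16, 4, 3, 3, 9, 6]
print(round(pearson_r(test, retest), 4))   # about 0.9604
```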
2. Alternate Form Reliability (coefficient of equivalence) – testing the same persons/individuals with one form of the test (Set A) on the first occasion and with another, comparable form (Set B) on the second occasion. The two sets of scores are then correlated using Pearson's r. The computed value is the reliability coefficient.
3. Split-Half Reliability – is determined by establishing the relationship between the scores on two equivalent halves of a test administered to a total group at one time. A common scheme used to establish the split half is to split the test into odd- and even-numbered items. Thus the scores of an individual will be divided into two: the scores on the odd items and the scores on the even items.
The two sets of scores are correlated using Pearson's r. Then the reliability of the total test (rtt) is solved using the Spearman-Brown prophecy formula:

rtt = 2r / (1 + r)
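A brief sketch of the Spearman-Brown step-up follows, assuming a hypothetical half-test correlation; in practice, the odd/even split and a Pearson r (as in the test-retest sketch) would supply r.

```python
# Sketch of the Spearman-Brown prophecy formula for split-half reliability.

def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between the two half-tests."""
    return (2 * r_half) / (1 + r_half)

# Hypothetical half-test correlation (e.g., odd vs. even item scores correlated with Pearson's r):
print(round(spearman_brown(0.80), 2))   # 0.89
```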
4. Rational Equivalence Reliability – is not established through correlation but rather estimates internal consistency by determining how all items on a test relate to all other items and to the total test. It is obtained through application of the Kuder-Richardson formulas, usually formula 20 or 21 (KR-20 or KR-21).
5. Scorer Reliability – can be estimated by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated to determine the reliability coefficient.
Factors Affecting Reliability
 LENGTH OF THE TEST – as a general rule, the longer
the test, the higher the reliability. A longer test provides a
more adequate sample of the behavior being measured
and is less distorted by chance factors like guessing.
 DIFFICULTY OF THE TEST – ideally, achievement tests should be constructed such that the average difficulty is from 45 to 75%. The bigger the spread of the scores, the more reliable the measured difference is likely to be.
 OBJECTIVITY – can be obtained by eliminating the bias, opinions or judgments of the person who checks the test.
Kinds of Tests
INTELLIGENCE TEST
PERSONALITY TEST
APTITUDE TEST
ACHIEVEMENT TEST
PROGNOSTIC TEST
PERFORMANCE TEST – Example: TESDA Trade Skills Test
DIAGNOSTIC TEST
 PREFERENCE TEST- vocational or avocational interest or
aesthetic judgments
 ACCOMPLISHMENT TEST – a measure of achievement usually for
individual subjects in the curriculum.
 SCALE TEST – a series of items arranged in order of difficulty
 Example: Binet-Simon Scale
 SPEED TEST –a series of items arranged in the order of difficulty,
measures the speed and accuracy of the examinee within the time
limits imposed.
 POWER TEST – made up of a series of items graded in difficulty,
from the easiest to the most difficult, the score begins with the
level of difficulty the examinee is able to cope with.
 NORM-REFERENCED TEST – determines the student's level of achievement relative to the performance of other students in the class.
 CRITERION-REFERENCED TEST – a test which determines the extent to which a student has met the criteria or the well-defined objectives of a subject or course that were spelled out in advance.
 STANDARDIZED TEST – provides exact procedures in
controlling the method of administration and scoring
and with norms and data concerning the reliability and
validity of the test.
 TEACHER-MADE-TEST – a test constructed by
teachers but not as carefully prepared as the
standardized test.
 PLACEMENT TEST – measures the type of job an applicant should fill; or a test used to determine the grade or year level in which the pupil or student should be enrolled after being away from school.
 SURVEY TEST – a test that serves a broad range
of objectives.
 MASTERY TEST – a test that covers specific
learning objective.
 OBJECTIVE TEST – a test that is unaffected by the corrector's biases.
 SUBJECTIVE TEST – a test that is affected by the scorer's personal biases.
 VERBAL TEST – a test that uses words.
 NON-VERBAL TEST - a test that uses pictures or
symbols.
Assessment using Rubric
A rubric is a scoring guide that seeks
to evaluate a student's performance
based on the sum of a full range of
criteria rather than a single
numerical score.
Kinds
A Holistic Rubric – describes the
overall quality of a performance or
product
 An Analytic Rubric – describes the
quality of a performance or product
in relation to a specific criterion.
Example 1: A Holistic Rubric for a bar graph of a household's electric consumption in the last 5 months

4 – Excellent, such that the work satisfies all of the following criteria:
 Presents complete information
 Is neatly done
 Uses indigenous materials

3 – Very Satisfactory, such that the work satisfies only 2 of the given criteria
2 – Satisfactory, such that the work satisfies only 1 of the given criteria
1 – Needs Improvement, such that the work fails to satisfy any of the criteria
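The holistic rule above maps the number of satisfied criteria to a single score. A minimal sketch follows; the criterion names and the function name are illustrative, not from the source.

```python
# Minimal sketch of the holistic rubric: the level depends only on how many criteria are met.

def holistic_score(complete_info, neatly_done, indigenous_materials):
    satisfied = sum([complete_info, neatly_done, indigenous_materials])
    if satisfied == 3:
        return 4   # Excellent
    if satisfied == 2:
        return 3   # Very Satisfactory
    if satisfied == 1:
        return 2   # Satisfactory
    return 1       # Needs Improvement

print(holistic_score(True, True, False))   # 3 (Very Satisfactory)
```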
Using Measures of Central Tendency and Variability or Dispersion

[Slide figures comparing the relative positions of the mean and median in differently shaped score distributions]
Thank You
