
Republic of the Philippines

Province of Northern Samar

Municipality of Las Navas

Colegio De Las Navas

(Community College)

Assessment of Learning
Module 6
(Item Analysis and Validation)

Submitted by:

Rey Jhon Anthony C. Dela Torre

BEED-III English

Submitted to:

Dr. Felicula G. Quilicol

Professor
Exercises:

A. Find the Index of Difficulty in each of the following situations:

The difficulty of an item, or item difficulty, is defined as the number of students who are able to
answer the item correctly divided by the total number of students. Thus:

Item Difficulty = number of students with the correct answer / total number of students

In terms of the upper and lower groups:

Item Difficulty = (RU + RL) / T

The item difficulty is usually expressed as a percentage. (In Exercise A below, the same ratio is
computed from the numbers of wrong answers, which gives the proportion of students who
missed the item: the higher the proportion, the more difficult the item.)

RU - the number in the upper group who answered the item correctly

RL - the number in the lower group who answered the item correctly

T - the total number who tried the item

1. N=60, number of wrong answers: upper 25% = 2, Lower 25% = 6

(2 + 6) / 60 = 8 / 60 = 0.13

2. N=80, number of wrong answers: upper 25% = 2, Lower 25% = 9

(2 + 9) / 80 = 11 / 80 = 0.14

3. N=30, number of wrong answers: upper 25% = 1, Lower 25% = 6

(1 + 6) / 30 = 7 / 30 = 0.23

4. N=50, number of wrong answers: upper 25% = 3, Lower 25% = 8


(3 + 8) / 50 = 11 / 50 = 0.22

5. N=70, number of wrong answers: upper 25% = 4, Lower 25% = 10

(4 + 10) / 70 = 14 / 70 = 0.20
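
As a quick check, the five answers above can be reproduced with a short script. This is only a sketch of the procedure used in this exercise (proportion of wrong answers over the whole group); the function and variable names are my own.

```python
# Difficulty as computed in Exercise A: proportion of wrong answers,
# (wrong in upper 25% + wrong in lower 25%) / N.

def proportion_wrong(n, wrong_upper, wrong_lower):
    return (wrong_upper + wrong_lower) / n

# The five items: (N, wrong answers in upper 25%, wrong answers in lower 25%)
items = [(60, 2, 6), (80, 2, 9), (30, 1, 6), (50, 3, 8), (70, 4, 10)]
for i, (n, wu, wl) in enumerate(items, start=1):
    print(f"Item {i}: {proportion_wrong(n, wu, wl):.2f}")
# Item 1: 0.13, Item 2: 0.14, Item 3: 0.23, Item 4: 0.22, Item 5: 0.20
```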

B. Which of the items in Exercise A are found to be the most difficult?

1. N=60, number of wrong answers: upper 25% = 2, Lower 25% = 6

0.25 × 60 = 15 (students in each 25% group)

15 - 2 = 13 (correct answers in the upper group)

15 - 6 = 9 (correct answers in the lower group)

(13 + 9) / 60 = 22 / 60 = 0.37 or 37%

2. N=80, number of wrong answers: upper 25% = 2, Lower 25% = 9

0.25 × 80 = 20

20 - 2 = 18

20 - 9 = 11

(18 + 11) / 80 = 29 / 80 = 0.36 or 36%

3. N=30, number of wrong answers: upper 25% = 1, Lower 25% = 6

0.25 × 30 = 7.5 ≈ 8

8 - 1 = 7

8 - 6 = 2

(7 + 2) / 30 = 9 / 30 = 0.30 or 30%

4. N=50, number of wrong answers: upper 25% = 3, Lower 25% = 8

0.25 × 50 = 12.5 ≈ 13

13 - 3 = 10

13 - 8 = 5

(10 + 5) / 50 = 15 / 50 = 0.30 or 30%

5. N=70, number of wrong answers: upper 25% = 4, Lower 25% = 10

0.25 × 70 = 17.5 ≈ 18

18 - 4 = 14

18 - 10 = 8

(14 + 8) / 70 = 22 / 70 = 0.31 or 31%

To determine the difficulty level of test items, a measure called the Difficulty Index is used.
This measure asks teachers to calculate the proportion of students who answered the test item
accurately.

To find out which items are the most difficult, the results of the computation are interpreted
using the table below. Based on my computations, the table gives the result and interpretation
for every item.

Item                                                   Difficulty Index       Interpretation
                                                       (Computation Result)

1. N=60, wrong answers: upper 25% = 2, lower 25% = 6         0.37             Right difficulty

2. N=80, wrong answers: upper 25% = 2, lower 25% = 9         0.36             Right difficulty

3. N=30, wrong answers: upper 25% = 1, lower 25% = 6         0.30             Right difficulty

4. N=50, wrong answers: upper 25% = 3, lower 25% = 8         0.30             Right difficulty

5. N=70, wrong answers: upper 25% = 4, lower 25% = 10        0.31             Right difficulty

The table above shows that all five items fall within the "right difficulty" range. Among them, items 3
and 4 (index 0.30) are the most difficult, since the smallest proportion of students answered them
correctly, followed closely by item 5 (0.31); items 1 and 2 are the least difficult of the set.
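
The classification above can also be scripted. This is a sketch, assuming the usual textbook ranges (0-0.25 difficult, 0.26-0.75 right difficulty, 0.76 and above easy) and rounding the 25% group size half up (e.g. 12.5 → 13), as in the worked solutions; all names are my own.

```python
# Difficulty as computed in Exercise B: proportion of correct answers
# in the combined upper and lower 25% groups, over N.

def difficulty_index(n, wrong_upper, wrong_lower):
    group = int(0.25 * n + 0.5)        # size of each 25% group, rounded half up
    right_upper = group - wrong_upper  # correct answers in the upper group
    right_lower = group - wrong_lower  # correct answers in the lower group
    return (right_upper + right_lower) / n

def interpret(index):                  # assumed textbook ranges
    if index <= 0.25:
        return "difficult"
    if index <= 0.75:
        return "right difficulty"
    return "easy"

items = [(60, 2, 6), (80, 2, 9), (30, 1, 6), (50, 3, 8), (70, 4, 10)]
for i, (n, wu, wl) in enumerate(items, start=1):
    d = difficulty_index(n, wu, wl)
    print(f"Item {i}: {d:.2f} ({interpret(d)})")
# 0.37, 0.36, 0.30, 0.30, 0.31 -- all within the "right difficulty" range
```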

C. Compute the Discrimination Index for each of the items in Exercise A.

1. N=60, number of wrong answers: upper 25% = 2, Lower 25% = 6

(6 - 2) / 30 = 4 / 30 = 0.13

2. N=80, number of wrong answers: upper 25% = 2, Lower 25% = 9


(9 - 2) / 40 = 7 / 40 = 0.18

3. N=30, number of wrong answers: upper 25% = 1, Lower 25% = 6

(6 - 1) / 15 = 5 / 15 = 0.33

4. N=50, number of wrong answers: upper 25% = 3, Lower 25% = 8

(8 - 3) / 25 = 5 / 25 = 0.20

5. N=70, number of wrong answers: upper 25% = 4, Lower 25% = 10

(10 - 4) / 35 = 6 / 35 = 0.17
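
The computation used throughout this section (wrong answers in the lower group minus wrong answers in the upper group, divided by half of N) can be sketched as follows; the names are my own.

```python
def discrimination_index(n, wrong_upper, wrong_lower):
    # More wrong answers in the lower group than in the upper group
    # means the item discriminates positively.
    return (wrong_lower - wrong_upper) / (n / 2)

items = [(60, 2, 6), (80, 2, 9), (30, 1, 6), (50, 3, 8), (70, 4, 10)]
for i, (n, wu, wl) in enumerate(items, start=1):
    print(f"Item {i}: {discrimination_index(n, wu, wl):.3f}")
# 0.133, 0.175, 0.333, 0.200, 0.171 -- rounded half up, these match the
# answers above: 0.13, 0.18, 0.33, 0.20, 0.17
```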

D. Answer the following

1. A teacher constructed a test which would measure the students' ability to apply previous
knowledge to certain situations. In particular, the evidence that a student is able to apply
previous knowledge is that he or she can:

 Draw correct conclusions based on the information given;
Classroom tests can also be categorized based on what they are intended to measure.
Traditional paper-and-pencil classroom tests (e.g. multiple-choice, matching, true-false) are
best used to measure knowledge. They are typically objectively scored (a computer with an
answer key could score it). Performance-based tests, sometimes called authentic or alternative
tests, are best used to assess student skill or ability. They are typically subjectively scored (a
teacher must apply some degree of opinion in evaluating the quality of a response).
Performance-based tests are discussed in a separate area on this website.

Tests designed to measure knowledge are usually made up of a set of individual questions.
Questions can be of two types: a) selection (or select) items, which allow students to select a
correct answer from a list of possible correct answers (e.g. multiple-choice, matching) and b)
supply items, which require students to supply the correct answer (e.g. fill-in-the-blank, short
answer). Scoring selection items is usually quicker and objective. Scoring supply items tends to
take more time and is usually more subjective. Sometimes teachers decide to use selection
items when they are interested in measuring basic, lower levels of understanding (at the
knowledge or comprehension level in a Bloom's taxonomy sense; Bloom et al., 1956) and use
supply items if they are interested in higher levels of understanding, but a well-written selection
item can still get at higher levels of understanding.
Objective types of test include: a) true/false items, b) multiple-choice items, c) matching items,
d) enumeration, e) completion, and f) essays.

Teacher-made tests can also be distinguished by when they are given and how the results
are used. Tests given at the end of a unit or semester, or after learning has occurred, are called
summative tests. Their purpose is to assess learning and performance and usually affects a
student's class grade. Tests can also be given while learning is occurring, and these are called
formative tests. Their purpose is to provide feedback, so students can adjust how they are
learning or teachers can adjust how they are teaching. Usually these tests do not affect student
grades.

 Identify one or more logical implications that follow from a given point of view;

The concept of logical implication


In the previous section we introduced the idea of logical implication to capture the idea of
support between premises and conclusion. But what exactly does it mean for premises to
imply a conclusion? We'll define the term in a moment, but let's first look at an example that
will make the definition easier to understand. Consider the following reasoning:

Chihuahuas are really vicious. My aunt has one named Zeppo that will attack you for no
reason at all.

You'll notice that in this example neither the word "because" nor any other reasoning
indicator exists to help you distinguish the premises from the conclusion. Even so, it's pretty
clear that its author intends this as reasoning in support of the conclusion that Chihuahuas are
vicious dogs.

Premise 1: There is at least one Chihuahua that attacks people without provocation.

Conclusion: Chihuahuas are vicious.

Notice that in our reconstruction we did not use all of the same words that occur in the
example. The point of the reconstruction is to make the logical relationships clear. We will be
working on this quite a bit later.

Now, if you were to be critical of this reasoning, you might say something like this:

Just because one Chihuahua is vicious doesn't mean they all are.

The important thing to notice about this criticism is that in making it you do not deny the truth
of the premises. You essentially took it for granted that Zeppo is a Chihuahua and that Zeppo
is vicious. What you have pointed out is that this information does not logically imply the
conclusion.

This example brings out the essential feature of the concept of logical implication, which is that
logical implication has nothing whatsoever to do with the actual truth of the premises or the
conclusion. In fact, whenever we try to determine whether premises imply a conclusion, we
simply assume that the premises are true, even when they are clearly false. On the basis of this
assumption we then ask whether the conclusion must be true, or (what amounts to the same
thing) whether it is still possible for the conclusion to be false. If it is not possible for the
premises to be true and the conclusion to be false, then we say that the premises logically
imply the conclusion.

Now here are three equivalent definitions of logical implication.

"The premises logically imply the conclusion" means "Whenever the premises are true, the
conclusion is also true."
"The premises logically imply the conclusion" means "If the premises are true, then the
conclusion must be true."

"The premises logically imply the conclusion" means "It is not possible for the premises to be
true and the conclusion to be false."

You can use any of these definitions, but it is important to see that they mean exactly the same
thing.

Notice that when we say that logical implication has nothing to do with the actual truth of the
premises, the term "actual" is key. Logical implication is a truth-preserving relation between
premises and conclusion: if you put truth in (the premises), you get truth out (of the
conclusion). But whether or not the relation exists has nothing to do with the quality of
information you actually put in. Logical implication is like a perfect recipe that depends on
good ingredients. If you use good ingredients, then a good result is guaranteed. If you use
inferior ingredients, then the dish may turn out poorly. But that's not the recipe's fault.
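
For simple propositional arguments, the "not possible for the premises to be true and the conclusion to be false" test can be checked mechanically by enumerating every assignment of truth values. The sketch below is my own illustration and not part of the module.

```python
# Premises imply the conclusion iff no assignment of truth values makes
# every premise true while the conclusion is false.

from itertools import product

def implies(premises, conclusion, num_vars):
    """premises and conclusion are functions of a tuple of booleans."""
    for values in product([True, False], repeat=num_vars):
        if all(p(values) for p in premises) and not conclusion(values):
            return False  # found a counterexample
    return True

# Valid: p together with (p -> q) implies q.
print(implies([lambda v: v[0], lambda v: (not v[0]) or v[1]],
              lambda v: v[1], num_vars=2))  # True

# Invalid: p alone does not imply q -- like the Chihuahua example,
# where one vicious instance does not imply the general claim.
print(implies([lambda v: v[0]], lambda v: v[1], num_vars=2))  # False
```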

 State whether two ideas are identical, just similar, unrelated or contradictory.

The following is an explanation of a sample question of this type from a practice test. The
strategy illustrated below, stating whether two ideas are identical, just similar, unrelated, or
contradictory, has helped many test takers identify the correct or most appropriate answer.

1. The words PERCEIVE and DISCERN have ________ meanings.

A. similar

B. contradictory

C. unrelated

Similar/contradictory words meanings on the test.

The test will give you two words with no context and ask you whether the two words are similar
(the same), contradictory (the opposite) or unrelated (the two words mean neither the same
nor the opposite).

The task is simply to decide how the two words are related.
If you do not know what a word means, solve by associating the word with a familiar context. In
many cases, you might not be able to actually define each word given, but you have heard the
words used before. Try to think about a time you heard the word used in the past.

PERCEIVE: I know that I have heard the word PERCEIVE before. A sentence that sounds correct
to me or that I remember hearing is “he could not perceive the small item with the naked eye”.
I assume that PERCEIVE means “to see” – especially in a careful, scrutinizing manner.

DISCERN: I have heard people use the word DISCERN when talking about being able to identify
a slight difference in two things. A sentence that would make sense is “I could not discern any
difference between the twins”.

I now can assume that these words are similar since they are both about being able to see or
identify small details.

ANSWER: A.

 Write test items using the multiple choice type of test that would cover these
concerns of the teacher. Show your test to an expert and ask him or her to judge
whether the items indeed address these concerns.

1. The eleventh month of the year is:

A. January B. November

C. October D. May

2. A shop owner bought some shovels for $5,500. The shovels were sold for $7,300, with a
profit of $50 per shovel. How many shovels were involved?

A. 18 B. 36

C. 55 D. 73

3. How many states are there in the U.S.A?

A. 20 B. 30
C. 40 D. 50

4. There are three times as many elephants as giraffes in the safari. If there is a total of 88
elephants and giraffes, how many elephants are there in the safari?
A. 22 B. 31

C. 43 D. 59

5. What is the next number in the series?

5, 15, 10, 13, 15, 11, ___

A. 9 B. 13

C. 10 D. 20

6. A car travels at a speed of 85 miles per hour. How far will it travel in 15 minutes?

A. 9 miles B. 11 miles

C. 13.25 miles D. 21.25 miles

7. A car dealership sells used cars for $7,000 and new cars for $16,000. If a total of 17 cars
were sold for $191,000, how many of the cars sold were used?

A. 2 B. 3

C. 5 D. 8

8. Joey got a 25% raise to his salary. If his original salary was $1,200, how much was it after
the raise was implemented?

A. $1225 B. $1500

C. $1350 D. $1450

E. $1800 F. None of these


9. ________ is to JUICE as WHEAT is to BREAD

A. Water B. Pitcher

C. Orange D. Soda

10. ATTRACTION is the opposite of:

A. Disgust B. Inclination

C. Fascination D. Inducement

11. DEVIATE AGITATE - The meanings of these words are:


A. Similar B. Contradictory

C. Neither similar nor contradictory D. None of these

12. START is the opposite of...

A. Source B. Inception

C. Conclusion D. Initiation

13. IMPERIOUS is the opposite of...

A. Arrogant B. Moody

C. Subservient D. Quiet

14. An oven can bake 8 thin-crust pizza pies per hour, or 2 deep-dish pizza pies per hour.
How many hours will it take to prepare an order of 16 thin-crust pies and 4 deep-dish pies?

A. 4 hours B. 2.5 hours

C. 2 hours D. 1 hour

15. A motorcyclist rode between Flibbertown and Guinevill at a steady pace of 60 miles per
hour for 2 hours. After two hours she accelerated to 65 miles per hour for another full hour.
What is the distance between those cities?

A. 150 miles B. 180 miles

C. 165 miles D. None of the above

Perhaps the most important ongoing concern about multiple-choice tests is their propensity to
measure surface knowledge, those facts and details that can be memorized without much (or
any) understanding of their meaning or significance. This article documents studies showing
that students’ preference for multiple-choice exams derives from their perception that these
exams are easier. Moreover, that perception results in students’ using studying strategies
associated with superficial learning: flashcards with a term on one side and the definition on
the back, reviewing notes by recopying them, and so on. Students also prefer multiple-choice
questions because they allow guessing. If there are four answer options and two of them can be
ruled out, there’s a 50 percent chance the student will get the answer right. So students get
credit for answers they didn’t know, leaving the teacher to wonder how many right answers
indicate knowledge and understanding the student does not have.
In one of the article's best sections, the authors share a number of strategies teachers can use
to make multiple-choice questions more about thinking and less about memorizing. They start
with the simplest one: the wording of the directions. If the directions spell out that students
should select "the best answer," "the main reason" or the "most likely" solution, then some of
the answer options can be correct but not as correct as the keyed answer, which means those
questions require more and deeper thinking.

2. What is an expectancy table? Describe the process of constructing an expectancy table.
When do we use an expectancy table?

The expectancy table is a simple and practical means of expressing criterion-related evidence of
validity and is especially useful for making predictions from test scores. The expectancy table is
simply a twofold chart with the test scores (the predictor) arranged in categories down the left
side of the table and the measure to be predicted (the criterion) arranged in categories across
the top of the table. For each category of scores, the table indicates the percentage of
individuals who fall within each category of the criterion.

Apart from the use of the correlation coefficient in measuring criterion-related validity, Gronlund
suggested using the so-called expectancy table. This table is easy to construct and consists of
test (predictor) categories listed on the left-hand side and criterion categories listed
horizontally along the top of the chart. For example, suppose that a mathematics achievement
test is constructed and the scores are categorized as high, average, and low. The criterion
measure used is the final average grade of the students in high school: Very Good, Good, and
Needs Improvement. The two-way table lists the number of students falling under each of the
possible (test score, grade) pairs, as shown below:

Test Score    Very Good    Good    Needs Improvement

High              20          10            5

Average           10          25            5

Low                1          10           14

The expectancy table shows that there were 20 students getting high test scores who were
subsequently rated Very Good in terms of their final grades; 25 students got average scores and
were subsequently rated Good in their finals; and finally, 14 students obtained low test scores
and were later graded as Needs Improvement. The evidence for this particular test tends to
indicate that students getting high scores on it will be rated Very Good later; students getting
average scores will be rated Good later; and students getting low scores on the test will be
graded as needing improvement later.
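
Constructing an expectancy table is mechanical enough to script: tally the (test category, criterion category) pairs and convert each row to percentages. The sketch below reproduces the counts above; the function and variable names are my own.

```python
from collections import Counter

def expectancy_table(pairs, score_levels, criterion_levels):
    counts = Counter(pairs)
    table = {}
    for score in score_levels:
        row_total = sum(counts[(score, c)] for c in criterion_levels)
        # percentage of students in each criterion category, per score row
        table[score] = {c: round(100 * counts[(score, c)] / row_total)
                        for c in criterion_levels}
    return table

# Rebuild the (test score, grade) pairs from the counts in the example.
data = ([("High", "Very Good")] * 20 + [("High", "Good")] * 10 +
        [("High", "Needs Improvement")] * 5 +
        [("Average", "Very Good")] * 10 + [("Average", "Good")] * 25 +
        [("Average", "Needs Improvement")] * 5 +
        [("Low", "Very Good")] * 1 + [("Low", "Good")] * 10 +
        [("Low", "Needs Improvement")] * 14)

table = expectancy_table(data, ["High", "Average", "Low"],
                         ["Very Good", "Good", "Needs Improvement"])
print(table["High"])  # {'Very Good': 57, 'Good': 29, 'Needs Improvement': 14}
```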

We will not be able to discuss the measurement of construct-related validity in this book,
since the methods to be used require sophisticated statistical techniques falling in the category
of factor analysis.

3. Enumerate the three types of validity evidence. Which of these types of validity is the most
difficult to measure? Why?

There are essentially three main types of evidence that may be collected: content-related
evidence of validity, criterion-related evidence of validity, and construct-related evidence of
validity.

 Content-related evidence of validity refers to the content and format of the instrument.
How appropriate is the content? How comprehensive? Does it logically get at the
intended variable? How adequately does the sample of items or questions represent the
content to be assessed?
Content validity indicates the extent to which items adequately measure or represent
the content of the property or trait that the researcher wishes to measure. Subject-matter
expert review is often a good first step in instrument development to assess
content validity in relation to the area or field you are studying.
 Criterion-related evidence of validity refers to the relationship between scores
obtained using the instrument and scores obtained using one or more other tests (often
called the criterion). How strong is this relationship? How well do such scores estimate
present performance or predict future performance of a certain type?
Criterion-related validity indicates the extent to which the instrument's scores correlate
with an external criterion (i.e., usually another measurement from a different
instrument) either at present (concurrent validity) or in the future (predictive validity). A
common measurement of this type of validity is the correlation coefficient between the
two measures (see the sketch after this list).
Oftentimes, when developing, modifying, and interpreting the validity of a given
instrument, rather than view or test each type of validity individually, researchers and
evaluators test for evidence of several different forms of validity collectively (e.g., see
Samuel Messick's work regarding validity).
 Construct-related evidence of validity refers to the nature of the psychological construct
or characteristic being measured by the test. How well does a measure of the construct
explain differences in the behaviors of individuals or their performance on a certain
task?
Construct validity indicates the extent to which a measurement method accurately
represents a construct (e.g., a latent variable or phenomenon that can't be measured
directly, such as a person's attitude or belief) and produces an observation distinct from
that which is produced by a measure of another construct. Common methods to assess
construct validity include, but are not limited to, factor analysis, correlation tests, and
item response theory models (including the Rasch model).
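
As noted in the criterion-related bullet above, the usual summary statistic is a correlation coefficient between the predictor and the criterion. A minimal sketch; the scores below are invented for illustration only.

```python
from statistics import correlation  # Python 3.10+

test_scores = [78, 85, 62, 90, 70, 88, 55, 95]   # hypothetical predictor scores
final_grades = [74, 88, 60, 92, 65, 85, 58, 97]  # hypothetical criterion

print(f"validity coefficient r = {correlation(test_scores, final_grades):.2f}")
```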

This criterion-related validity type, which is the most difficult to establish, is predictive in
nature and connects past performance with future performance.

Criterion validity (or criterion-related validity) measures how well one measure predicts an
outcome for another measure. A test has this type of validity if it is useful for predicting
performance or behavior in another situation (past, present, or future). For example:

A job applicant takes a performance test during the interview process. If this test accurately
predicts how well the employee will perform on the job, the test is said to have criterion
validity.

A graduate student takes the GRE. The GRE has been shown to be an effective tool (i.e., it has
criterion validity) for predicting how well a student will perform in graduate studies.

The first measure (in the above examples, the job performance test and the GRE) is sometimes
called the predictor variable or the estimator. The second measure is called the criterion
variable as long as the measure is known to be a valid tool for predicting outcomes.

One major problem with criterion validity, especially when used in the social sciences, is that
relevant criterion variables can be hard to come by.

4. What is the relationship between validity and reliability? Can a test be reliable and yet not
valid? Illustrate.

The relationship between reliability and validity is that they are both essential for any study or
research project. Without them, the results will not be fit for purpose.

As the previous educators have pointed out, reliability is all about getting consistent results
from a study. In contrast, validity is the extent to which your study actually does what it sets
out to do.
In terms of a relationship, what is interesting to note about reliability and validity is that they
are not mutually exclusive. In other words, you can have a study which is reliable but not valid
and, equally, you can have a study that is valid but lacks reliability.

There are ways to boost both validity and reliability. For example, if I am designing an
assessment for my history students, I would want an assessment that tests exactly what I have
taught in previous lessons. Every student should be given the same amount of revision time and
the same set of instructions on how to complete the test, including how long they have to
complete the test. When it comes to marking the tests, I will need to use the same mark
scheme for every student. This will give me high reliability and high validity.

However, this relationship is further complicated by the fact that there are different types of
validity and reliability, and all of these various measures need to be carefully considered when
designing an experiment or study. Please see the reference link for more information on these.

For any researcher, then, the goal is to have a study that is both high in validity and high in
reliability because this will provide a set of high-quality results from which to draw conclusions
and make an analysis.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how
well a method, technique or test measures something. Reliability is about the consistency of a
measure, and validity is about the accuracy of a measure.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it
probably isn’t valid.

Validity is harder to assess than reliability, but it is even more important. To obtain useful
results, the methods you use to collect your data must be valid: the research must be
measuring what it claims to measure. This ensures that your discussion of the data and the
conclusions you draw are also valid.

Reliability can be estimated by comparing different versions of the same measurement. Validity
is harder to assess, but it can be estimated by comparing the results to other relevant data or
theory. Methods of estimating reliability and validity are usually split up into different types.

It’s important to consider reliability and validity when you are creating your research design,
planning your methods, and writing up your results, especially in quantitative research.

A test can be reliable, meaning that the test-takers will get the same score no matter when or
where they take it, within reason of course. But that doesn't mean that it is valid or measuring
what it is supposed to measure. A test can be reliable without being valid. However, a test
cannot be valid unless it is reliable.
5. Discuss the different measures of reliability. Justify the use of each measure in the context
of measuring reliability.

Reliability refers to the consistency of test scores from one measurement to another. Because
of ever-present measurement error, we can expect a certain amount of variation in test
performance from one time to another.

Test-Retest Method. The test-retest method requires administering the same form of the test
to the same group after some time interval. The time between the two administrations may be
just a few days or several years. The length of the time interval should fit the type of
interpretation to be made from the results. Thus, if we are interested in using test scores only
to group students for more effective learning, short-term stability may be sufficient. On the
other hand, if we are attempting to predict vocational success or make some other long-range
prediction, we would desire evidence of stability over a longer period of time.

Test-retest reliability coefficients are influenced both by errors within the measurement
procedure and by the day-to-day stability of the students' responses. Thus, a longer time period
between testings will result in a lower reliability coefficient, due to changes in the students. In
reporting test-retest reliability coefficients, then, it is important to include the time interval.
For example, a report might state: "The stability of test scores obtained on the same form over
a three-month period was .90." This makes it possible to determine the extent to which the
reliability data are significant for a particular interpretation.

Equivalent-Forms Method. With this method, two equivalent forms of a test (also called
alternate forms or parallel forms) are administered to the same group during the same testing
session. The test forms are equivalent in the sense that they are built to measure the same
abilities (that is, they are built to the same set of specifications), but for determining reliability
it is also important that they be constructed independently. When this is the case, a high
reliability coefficient would indicate that the two independent samples of items are apparently
measuring the same thing. A low reliability coefficient, of course, would indicate that the two
forms are measuring different behavior and that therefore both samples of items are
questionable.

Reliability coefficients determined by this method take into account errors within the
measurement procedure and consistency over different samples of items, but they do not
include the day-to-day stability of the students' responses.
Test-Retest Method with Equivalent Forms. This is a combination of both methods. Here, two
different forms of the same test are administered with time intervening. This is the most
demanding estimate of reliability, since it takes into account all possible sources of variation.
The reliability coefficient reflects errors within the testing procedure, consistency over different
samples of items, and the day-to-day stability of the students' responses. For most purposes,
this is probably the most useful type of reliability, since it enables us to estimate how
generalizable the test results are over the various conditions. A high reliability coefficient
obtained by this method would indicate that a test score represents not only present test
performance but also what test performance is likely to be at another time or on a different
sample of equivalent items.

Internal-Consistency Methods. These methods require only a single administration of a test.
One procedure, the split-half method, involves scoring the odd items and the even items
separately and correlating the two sets of scores. This correlation coefficient indicates the
degree to which the two arbitrarily selected halves of the test provide the same results. Thus, it
reports on the internal consistency of the test. Like the equivalent-forms method, this
procedure takes into account errors within the testing procedure and consistency over different
samples of items, but it omits the day-to-day stability of the students' responses.

Since the correlation coefficient based on the odd and even items indicates the relationship
between two halves of the test, the reliability coefficient for the total test is determined by
applying the Spearman-Brown prophecy formula. A simplified version of this formula is as
follows:

Reliability of total test = (2 × reliability for ½ test) / (1 + reliability for ½ test)

Thus, if we obtained a correlation coefficient of .60 for two halves of a test, the reliability for
the total test would be computed as follows:

Reliability of total test = (2 × .60) / (1 + .60) = 1.20 / 1.60 = .75
This application of the Spearman-Brown formula makes clear a useful principle of test reliability:
the reliability of a test can be increased by lengthening it. The formula shows how much
reliability will increase when the length of the test is doubled. Application of the formula,
however, assumes that the test is lengthened by adding items like those already in the test.
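
A sketch of the split-half procedure with the Spearman-Brown step-up, reproducing the worked example above (.60 becomes .75); the odd/even subscores are hypothetical.

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_half):
    """Estimate full-test reliability from a half-test reliability."""
    return (2 * r_half) / (1 + r_half)

print(round(spearman_brown(0.60), 2))  # 0.75, as in the worked example

# With raw scores, r_half comes from correlating odd-item and even-item
# subscores (hypothetical data):
odd_scores = [10, 12, 8, 15, 9, 14]
even_scores = [11, 11, 9, 14, 8, 15]
print(round(spearman_brown(correlation(odd_scores, even_scores)), 2))
```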

Standard Error of Measurement. The standard error of measurement is an especially useful
way of expressing test reliability because it indicates the amount of error to allow for when
interpreting individual test scores. The standard error is derived from a reliability coefficient by
means of the following formula:

Standard error of measurement = s √(1 - r)

where s = the standard deviation and r = the reliability coefficient. Applying this formula to a
reliability estimate of .60 obtained for a test where s = 4.5, the following result would be
obtained:

Standard error of measurement = 4.5 √(1 - .60) = 4.5 √.40 = 4.5 × .63 = 2.8

The standard error of measurement shows how many points we must add to, and subtract
from, an individual's test score in order to obtain "reasonable limits" for estimating that
individual's true score (that is, a score free of error). The standard errors of test scores provide
a means of allowing for error during test interpretation. If we view test performance in terms of
score bands (also called confidence bands), we are not likely to overinterpret small differences
between test scores.
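
The same computation in code, reproducing the worked example (s = 4.5, r = .60) and showing a score band; the observed score of 35 is hypothetical.

```python
import math

def standard_error(sd, reliability):
    return sd * math.sqrt(1 - reliability)

error = standard_error(4.5, 0.60)
print(round(error, 1))  # 2.8

score = 35  # hypothetical observed score
print(f"score band: {score - error:.1f} to {score + error:.1f}")  # 32.2 to 37.8
```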

For the test user, the standard error of measurement is probably more useful than the reliability
coefficient. Although reliability coefficients can be used in evaluating the quality of a test and in
comparing the relative merits of different tests, the standard error of measurement is directly
applicable to the interpretation of individual test scores.

Reliability of Criterion-Referenced Mastery Tests. As noted earlier, the traditional methods for
computing reliability require score variability (that is, a spread of scores) and are therefore
useful mainly with norm-referenced tests. When used with criterion-referenced tests, they are
likely to provide misleading results. Since criterion-referenced tests are not designed to
emphasize differences among individuals, they typically have limited score variability. This
restricted spread of scores will result in low correlation estimates of reliability, even if the
consistency of our test results is adequate for the use to be made of them.

When a criterion-referenced test is used to determine mastery, our primary concern is with
how consistently our test classifies masters and nonmasters. If we administered two
equivalent forms of a test to the same group of students, for example, we would like the results
of both forms to identify the same students as having mastered the material. Such perfect
agreement is unrealistic, of course, since some students near the cutoff score are likely to
shift from one category to the other on the basis of errors of measurement (due to such factors
as lucky guesses or lapses of memory). However, if too many students demonstrated mastery
on one form but nonmastery on the other, our decisions concerning who mastered the material
would be hopelessly confused. Thus, the reliability of mastery tests can be determined by
computing the percentage of consistent mastery-nonmastery decisions over the two forms of
the test.
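
A sketch of that percentage-of-consistent-decisions computation for two equivalent forms; the scores and the cutoff of 70 are hypothetical.

```python
def mastery_consistency(form_a, form_b, cutoff):
    # Count students classified the same way (master/nonmaster) by both forms.
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(form_a, form_b))
    return 100 * same / len(form_a)

form_a = [82, 75, 60, 90, 68, 74, 58, 85]  # hypothetical scores, form A
form_b = [80, 78, 64, 88, 72, 69, 55, 83]  # same students, form B
print(f"{mastery_consistency(form_a, form_b, cutoff=70):.0f}% consistent")  # 75%
```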
