
Testing the Tests

Tests are everywhere these days. They are required for admission to schools and for admission to
professional and occupational groups, and they are increasingly being required of applicants for
jobs. Throughout North America, the educational accountability movement has created a tidal
wave of jurisdiction-wide testing in schools.

So I thought a brief non-technical review of the characteristics of good tests would be helpful.
Well, I'll try to be as brief and non-technical as I can be.

Tests are often described as standardized. In the beginning all that meant was that they were
standard, but over the years the term standardized has taken on a narrower and more useful 
meaning. It is now used by many people in testing to describe a test whose scores have been set
by comparison with the performance of a norm group – that is, a group considered to be similar
to the people for whom the test is intended. Scores, then, for a standardized test of academic
achievement would be set by comparison with a group of students of an appropriate age who had
taken the test. These scores often take the form of percentiles, which are simply statements of the
percentage of the norm group who answered fewer questions correctly. So if you're told that you
scored at the seventieth percentile, that means you got more questions right than did 70% of the
norm group.
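
A minimal sketch, in Python, of how such a percentile rank could be computed; the norm-group scores and the helper name percentile_rank are invented for illustration, not taken from any real testing program.

# Sketch: computing a percentile rank against a norm group.
# norm_group_scores is invented data standing in for a real norm group.
norm_group_scores = [12, 15, 18, 18, 20, 22, 23, 25, 27, 30]

def percentile_rank(score, norm_scores):
    """Percentage of the norm group who answered fewer questions correctly."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

print(percentile_rank(24, norm_group_scores))  # 70.0 -> "the seventieth percentile"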

A couple of important considerations about percentiles are worth mentioning. The first is that differences in percentile scores usually mean less the closer you get to the average score. There are normally more scores near the average than away from it, so a difference between the 50th and 55th percentiles, say, usually represents a smaller difference in the number of questions correct than does the difference between the 80th and 85th percentiles. The second consideration is
important in academic testing. Tests of academic achievement usually produce grade-equivalent
scores. A grade-equivalent score of 6.3 means that your score is equivalent to that of a student in
the third month of grade 6. Testing companies, however, do not test children in every month of
every grade. They simply interpolate the scores by drawing lines between the performances in
the grades and months in which they did test children.
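
A rough sketch of that interpolation, assuming made-up anchor points for the grades and months in which the norm group was actually tested; real grade-equivalent tables are built from many such segments, and the helper grade_equivalent is hypothetical.

# Sketch: reading off a grade-equivalent score by linear interpolation between
# the points at which the norm group was actually tested. Anchor values are invented.
# (grade equivalent, median raw score of the norm group tested at that point)
anchors = [(5.7, 31.0), (6.7, 38.0)]  # e.g. tested in month 7 of grades 5 and 6

def grade_equivalent(raw_score, anchors):
    (g0, s0), (g1, s1) = anchors
    # A straight line between the two tested points; real tables chain many segments.
    return g0 + (raw_score - s0) * (g1 - g0) / (s1 - s0)

print(round(grade_equivalent(35.2, anchors), 1))  # -> 6.3, "third month of grade 6"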

Standardized tests have been widely criticized, often with good reason and not all that
infrequently with bad. The most serious question about standardized testing is its value for
assessing individuals. If an individual's score on a test varies widely with repeated testing, either
the test is not accurate or it is attempting to measure something which is not stable. Either way
its scores are useless. Tests of intelligence usually claim to be highly stable, but the proof of the
pudding is in the eating. Other tests, such as tests of academic achievement, although useful in
other ways, are rarely accurate enough for a single assessment to be dependable. If a decision
about an individual is made with a test which does not provide accurate individual assessments,
then you're likely to get an incorrect decision. Research in Ontario has shown that teachers tend to use individual academic achievement scores as additional sources of information rather than as wholly reliable scores, which is a responsible way to use them.

The chief competitor of standardized testing these days is performance assessment, also known
as authentic testing. The idea of performance assessment is simply that the best way to assess
someone's mastery of a skill is to have them perform the skill. For example, people aren't
allowed to drive on the public highways until they've passed the driver's test – we wouldn't let
them go on the road just because they passed the written test.

The problem with performance assessment is that it's difficult to devise dependable tests of this
type – you need spend only a little while in traffic to realize that. Research on educational
performance assessments has shown that tests of similar topics often produce dissimilar results.
In fact, standardized test results have been found to be more closely related to performance
assessment scores than the performance assessment scores are to each other. That isn't surprising,
since any individual performance assessment samples only a limited range of behaviour (the
driving test, for example, doesn't assess performance when a government agent is not in the car).

Performance assessments are often useful when you can make a lot of performance assessments.
Schoolteachers, for example, make lots of performance assessments and keep records of them.
They also compare performance against a criterion rather than a norm-referenced standard,
which is probably also an important factor in the successful use of performance assessments.

Tests are scored more or less objectively. In testing, objectivity means simply universal
agreement. A test is objectively scored if everyone scoring it arrives at the same score. Tests like
essay examinations, though, will obviously fall short of this ideal. Proper training of scorers followed by proper monitoring of scoring can produce essay examinations which approach total objectivity. Nevertheless, they will never reach it completely, and the interpretation of essay test results, or of the results of any test which is less than completely objective, should include consideration of a statistical analysis of its objectivity.
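
One simple form such a statistical analysis could take, offered purely as an illustration: have two trained scorers mark the same set of essays independently and correlate their marks. The scores below are invented.

# Sketch: a simple statistical check on scoring objectivity.
# Two scorers mark the same ten essays independently; the scores are invented.
from statistics import correlation  # Python 3.10+

scorer_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
scorer_b = [4, 3, 4, 2, 5, 3, 5, 2, 3, 2]

# A coefficient near 1 means the scorers largely agree; a low one means the
# interpretation of results must allow for scorer disagreement.
print("inter-scorer agreement:", round(correlation(scorer_a, scorer_b), 2))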

Holistic scoring is scoring which involves consideration of the entire performance of the tested
person. That is, a person performs some required tasks (answering questions, for example) or
creates a product, and the holistic score is arrived at by consideration of the performance or
product as a whole, and not by individual ratings of different aspects of the performance
(individual questions, for example, or the ability to make left turns) or product. Objectivity will
of course be an especially important consideration with this type of scoring.

Two words often trotted out in discussions of tests are reliability and validity. Reliability is
simply consistency. Different types of reliability have been defined. For example, internal
consistency is the extent to which different parts of a test produce the same assessment.
Obviously, the more internally consistent a test is, the more accurate it can be. Stability is the
degree of similarity between scores on repeated administrations of the test in circumstances where the scores ought to be similar (the idea of stability assumes that what is being measured does not change between administrations; after all, you wouldn't have much faith in a ruler that kept giving you the same estimate of the height of a growing child). Equivalence is the degree of similarity between
scores on different forms of the test. If you're going to do repeated testing you need different
forms of the test so that familiarity does not increase scores on the second administration. The
forms, of course, should produce similar results.
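
As an illustration of internal consistency, here is a sketch of one common estimate, split-half reliability with the Spearman-Brown correction. The item data are invented, and this is offered as an example of the general idea rather than as any particular test publisher's procedure.

# Sketch: split-half reliability. Split the items into two halves (odd vs. even),
# correlate the half-scores across test-takers, then step the correlation up to
# full test length with the Spearman-Brown formula. Data are invented.
from statistics import correlation  # Python 3.10+

# item_scores[i][j] = 1 if test-taker i answered item j correctly, else 0
item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

odd_totals = [sum(row[0::2]) for row in item_scores]   # 1st, 3rd, 5th, ... items
even_totals = [sum(row[1::2]) for row in item_scores]  # 2nd, 4th, 6th, ... items

r_half = correlation(odd_totals, even_totals)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
print(f"half-test correlation {r_half:.2f}, split-half reliability {r_full:.2f}")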

The reliability of a test is often expressed in various reliability coefficients which have a
maximum value of 1. The absolute minimum standard for any of these coefficients is .71 (an
argument can be made that one type of coefficient, the split-half, should be at least .83). Any
lower figure, for good mathematical reasons, means that the test is simply inadequate. A
coefficient of .71 can be taken as representing an improvement of 50% over complete
inconsistency, a coefficient of .80 as representing an improvement of 64%, and a coefficient of .90 as representing an improvement of 81%.
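
Those improvement figures appear to be the squared coefficients, which are often read as the proportion of score variance the test accounts for; assuming that reading, the arithmetic is easy to check.

# A quick check of the improvement figures quoted above, read as squared
# reliability coefficients (proportion of observed-score variance explained).
for r in (0.71, 0.80, 0.90):
    print(f"coefficient {r:.2f} -> improvement of about {r * r:.0%}")
# coefficient 0.71 -> improvement of about 50%
# coefficient 0.80 -> improvement of about 64%
# coefficient 0.90 -> improvement of about 81%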

Reliability should be assessed at every administration of the test. Tests are used if they have been
reliable in the past, but they are only useful to you if they are reliable when you use them.
Groups of people tested can differ in many important ways, some of which can affect reliability.
It is not unusual to find that a highly acclaimed test fails to live up to its history of reliability
when you use it. Usually this is not a reflection on the quality of the test (or of you as a test
administrator), but simply a reflection of the facts of life – no test is appropriate for everyone.
For example, the reliability of many tests varies markedly with the age of the people taking them.

The validity of a test is simply its relevance. A driving test, for example, is valid if it predicts
ability to drive. Ability to drive can be measured in a number of ways – number of traffic tickets,
for example.

Recent years have seen a revival of the reputation of content or face validity, a revival I consider
wholly unwarranted. A test has content validity if its items refer to the area of knowledge or to
the skills which the test is intended to assess. Obviously, you want a math test to have questions
about math, and a driving test to have something to do with driving. However, content validity is
no guarantee that the test will have any value. For example, if the items on a math test are too
difficult everyone will end up with low scores and it will be difficult to assess differences in
mathematical ability. In the extreme case, if everyone gets zero then you know nothing about their relative mathematical abilities, even though the test has content validity. These days, though,
people often talk as if content validity is an adequate substitute for predictive or concurrent
validity.

Predictive validity is a test's ability to predict how a person will perform at a later date on a
different assessment of ability – performance in school or on a job, for example. Concurrent
validity assesses how well a test agrees with a concurrent assessment of a different type – this
type of validity is important if you want to use the test as a substitute for a less convenient
measure. Construct validity is a more complicated concept which involves explaining test scores
as the results of psychological concepts – intelligence, for example, or motivation. It is chiefly
useful as a guide to research, and interpreting the results of research in construct validation
requires some statistical knowledge.

The results of the assessment of predictive and concurrent validity are expressed as correlation
coefficients, and again the absolute minimum standard is .71. Often evidence of predictive
validity is impossible to obtain. For example, university entrance examinations notoriously fail to
predict success in university. The reason, though, is not necessarily the inadequacy of the
examinations but the inadequacy of the sample. When you compare scores on the entrance
examination with success in university, you're looking at the success only of the highest-scoring
students on the entrance examination. If people with middling or low scores on the entrance
examination were admitted to university then you could well find a relationship between the
entrance examination results and success in university.
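
A small simulation makes the range-restriction point concrete. Everything below is invented for illustration: the strength of the simulated relationship between exam score and later success is an assumption, not a claim about any real examination.

# Sketch: restriction of range. The underlying relationship is simulated, so the
# exact numbers are illustrative only.
import random
from statistics import correlation  # Python 3.10+

random.seed(1)
entrance, success = [], []
for _ in range(5000):
    ability = random.gauss(0, 1)                     # a common underlying ability
    entrance.append(ability + random.gauss(0, 0.8))  # entrance examination score
    success.append(ability + random.gauss(0, 0.8))   # later success in university

# Across the whole applicant pool the examination clearly predicts success.
print("all applicants:", round(correlation(entrance, success), 2))

# Admit only the top quarter of entrance scores, then correlate within that group.
cutoff = sorted(entrance)[int(0.75 * len(entrance))]
admitted = [(e, s) for e, s in zip(entrance, success) if e >= cutoff]
print("admitted only:", round(correlation(*zip(*admitted)), 2))
# The second coefficient is much smaller, even though the examination "works".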

That's my brief non-technical guide. If you've found it too brief or too technical, send me some e-mail.
