
Testing the Tests

Tests are everywhere these days. They are required for admission to schools and for admission to
professional and occupational groups, and they are increasingly being required of applicants for
jobs. Throughout North America, the educational accountability movement has created a tidal
wave of jurisdiction-wide testing in schools.

So I thought a brief non-technical review of the characteristics of good tests would be helpful.
Well, I'll try to be as brief and non-technical as I can be.

Tests are often described as standardized. In the beginning all that meant was that they were
standard, but over the years the term standardized has taken on a narrower and more useful 
meaning. It is now used by many people in testing to describe a test whose scores have been set
by comparison with the performance of a norm group – that is, a group considered to be similar
to the people for whom the test is intended. Scores, then, for a standardized test of academic
achievement would be set by comparison with a group of students of an appropriate age who had
taken the test. These scores often take the form of percentiles, which are simply statements of the
percentage of the norm group who answered fewer questions correctly. So if you're told that you
scored at the seventieth percentile, that means you got more questions right than did 70% of the
norm group.
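
A minimal sketch, in Python, of how such a percentile rank could be computed; the norm-group scores and the helper name percentile_rank are invented for illustration, not taken from any real testing program.

# Sketch: computing a percentile rank against a norm group.
# norm_group_scores is invented data standing in for a real norm group.
norm_group_scores = [12, 15, 18, 18, 20, 22, 23, 25, 27, 30]

def percentile_rank(score, norm_scores):
    """Percentage of the norm group who answered fewer questions correctly."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

print(percentile_rank(24, norm_group_scores))  # 70.0 -> "the seventieth percentile"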

A couple of important considerations about percentiles are worth mentioning. The first is that differences in percentile scores usually mean less the closer you get to the average score. There are normally more scores near the average than away from it, so a difference between the 50th and 55th percentiles, say, usually represents a smaller difference in the number of questions correct than does the difference between the 80th and 85th percentiles. The second consideration is
important in academic testing. Tests of academic achievement usually produce grade-equivalent
scores. A grade-equivalent score of 6.3 means that your score is equivalent to that of a student in
the third month of grade 6. Testing companies, however, do not test children in every month of
every grade. They simply interpolate the scores by drawing lines between the performances in
the grades and months in which they did test children.
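
A rough sketch of that interpolation, assuming made-up anchor points for the grades and months in which the norm group was actually tested; real grade-equivalent tables are built from many such segments, and the helper grade_equivalent is hypothetical.

# Sketch: reading off a grade-equivalent score by linear interpolation between
# the points at which the norm group was actually tested. Anchor values are invented.
# (grade equivalent, median raw score of the norm group tested at that point)
anchors = [(5.7, 31.0), (6.7, 38.0)]  # e.g. tested in month 7 of grades 5 and 6

def grade_equivalent(raw_score, anchors):
    (g0, s0), (g1, s1) = anchors
    # A straight line between the two tested points; real tables chain many segments.
    return g0 + (raw_score - s0) * (g1 - g0) / (s1 - s0)

print(round(grade_equivalent(35.2, anchors), 1))  # -> 6.3, "third month of grade 6"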

Standardized tests have been widely criticized, often with good reason and not all that
infrequently with bad. The most serious question about standardized testing is its value for
assessing individuals. If an individual's score on a test varies widely with repeated testing, either
the test is not accurate or it is attempting to measure something which is not stable. Either way
its scores are useless. Tests of intelligence usually claim to be highly stable, but the proof of the
pudding is in the eating. Other tests, such as tests of academic achievement, although useful in
other ways, are rarely accurate enough for a single assessment to be dependable. If a decision
about an individual is made with a test which does not provide accurate individual assessments,
then you're likely to get an incorrect decision. Research in Ontario has shown that teachers tend to use individual academic achievement scores as additional sources of information rather than as wholly reliable scores, which is a responsible way to use them.

The chief competitor of standardized testing these days is performance assessment, also known
as authentic testing. The idea of performance assessment is simply that the best way to assess
someone's mastery of a skill is to have them perform the skill. For example, people aren't
allowed to drive on the public highways until they've passed the driver's test – we wouldn't let
them go on the road just because they passed the written test.

The problem with performance assessment is that it's difficult to devise dependable tests of this
type – you need spend only a little while in traffic to realize that. Research on educational
performance assessments has shown that tests of similar topics often produce dissimilar results.
In fact, standardized test results have been found to be more closely related to performance
assessment scores than the performance assessment scores are to each other. That isn't surprising,
since any individual performance assessment samples only a limited range of behaviour (the
driving test, for example, doesn't assess performance when a government agent is not in the car).

Performance assessments are often useful when you can make a lot of performance assessments.
Schoolteachers, for example, make lots of performance assessments and keep records of them.
They also compare performance against a criterion rather than a norm-referenced standard,
which is probably also an important factor in the successful use of performance assessments.

Tests are scored more or less objectively. In testing, objectivity means simply universal
agreement. A test is objectively scored if everyone scoring it arrives at the same score. Tests like
essay examinations, though, will obviously fall short of this ideal. Proper training of scorers followed by proper monitoring of scoring can produce essay examinations which approach total objectivity. Nevertheless, they will never reach it completely, and the interpretation of essay test results, or of the results of any test which is less than completely objective, should include consideration of a statistical analysis of its objectivity.
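
One simple form such a statistical analysis could take, offered purely as an illustration: have two trained scorers mark the same set of essays independently and correlate their marks. The scores below are invented.

# Sketch: a simple statistical check on scoring objectivity.
# Two scorers mark the same ten essays independently; the scores are invented.
from statistics import correlation  # Python 3.10+

scorer_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
scorer_b = [4, 3, 4, 2, 5, 3, 5, 2, 3, 2]

# A coefficient near 1 means the scorers largely agree; a low one means the
# interpretation of results must allow for scorer disagreement.
print("inter-scorer agreement:", round(correlation(scorer_a, scorer_b), 2))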

Holistic scoring is scoring which involves consideration of the entire performance of the tested
person. That is, a person performs some required tasks (answering questions, for example) or
creates a product, and the holistic score is arrived at by consideration of the performance or
product as a whole, and not by individual ratings of different aspects of the performance
(individual questions, for example, or the ability to make left turns) or product. Objectivity will
of course be an especially important consideration with this type of scoring.

Two words often trotted out in discussions of tests are reliability and validity. Reliability is
simply consistency. Different types of reliability have been defined. For example, internal
consistency is the extent to which different parts of a test produce the same assessment.
Obviously, the more internally consistent a test is, the more accurate it can be. Stability is the
degree of similarity between scores on repeated administrations of the test in circumstances where the scores ought to be similar (the idea of stability assumes that what is being measured does not change between administrations; after all, you wouldn't have much faith in a ruler that kept giving you the same estimate of the height of a growing child). Equivalence is the degree of similarity between
scores on different forms of the test. If you're going to do repeated testing you need different
forms of the test so that familiarity does not increase scores on the second administration. The
forms, of course, should produce similar results.
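
As an illustration of internal consistency, here is a sketch of one common estimate, split-half reliability with the Spearman-Brown correction. The item data are invented, and this is offered as an example of the general idea rather than as any particular test publisher's procedure.

# Sketch: split-half reliability. Split the items into two halves (odd vs. even),
# correlate the half-scores across test-takers, then step the correlation up to
# full test length with the Spearman-Brown formula. Data are invented.
from statistics import correlation  # Python 3.10+

# item_scores[i][j] = 1 if test-taker i answered item j correctly, else 0
item_scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

odd_totals = [sum(row[0::2]) for row in item_scores]   # 1st, 3rd, 5th, ... items
even_totals = [sum(row[1::2]) for row in item_scores]  # 2nd, 4th, 6th, ... items

r_half = correlation(odd_totals, even_totals)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
print(f"half-test correlation {r_half:.2f}, split-half reliability {r_full:.2f}")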

The reliability of a test is often expressed in various reliability coefficients which have a
maximum value of 1. The absolute minimum standard for any of these coefficients is .71 (an
argument can be made that one type of coefficient, the split-half, should be at least .83). Any
lower figure, for good mathematical reasons, means that the test is simply inadequate. A
coefficient of .71 can be taken as representing an improvement of 50% over complete
inconsistency, a coefficient of .80 as representing an improvement of 64%, and a coefficient of .90 as representing an improvement of 81%.
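
Those improvement figures appear to be the squared coefficients, which are often read as the proportion of score variance the test accounts for; assuming that reading, the arithmetic is easy to check.

# A quick check of the improvement figures quoted above, read as squared
# reliability coefficients (proportion of observed-score variance explained).
for r in (0.71, 0.80, 0.90):
    print(f"coefficient {r:.2f} -> improvement of about {r * r:.0%}")
# coefficient 0.71 -> improvement of about 50%
# coefficient 0.80 -> improvement of about 64%
# coefficient 0.90 -> improvement of about 81%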

Reliability should be assessed at every administration of the test. Tests are used if they have been
reliable in the past, but they are only useful to you if they are reliable when you use them.
Groups of people tested can differ in many important ways, some of which can affect reliability.
It is not unusual to find that a highly acclaimed test fails to live up to its history of reliability
when you use it. Usually this is not a reflection on the quality of the test (or of you as a test
administrator), but simply a reflection of the facts of life – no test is appropriate for everyone.
For example, the reliability of many tests varies markedly with the age of the people taking them.

The validity of a test is simply its relevance. A driving test, for example, is valid if it predicts
ability to drive. Ability to drive can be measured in a number of ways – number of traffic tickets,
for example.

Recent years have seen a revival of the reputation of content or face validity, a revival I consider
wholly unwarranted. A test has content validity if its items refer to the area of knowledge or to
the skills which the test is intended to assess. Obviously, you want a math test to have questions
about math, and a driving test to have something to do with driving. However, content validity is
no guarantee that the test will have any value. For example, if the items on a math test are too
difficult everyone will end up with low scores and it will be difficult to assess differences in
mathematical ability. In the extreme case, if everyone gets zero then you know nothing about their relative mathematical abilities, even though the test has content validity. These days, though,
people often talk as if content validity is an adequate substitute for predictive or concurrent
validity.

Predictive validity is a test's ability to predict how a person will perform at a later date on a
different assessment of ability – performance in school or on a job, for example. Concurrent
validity assesses how well a test agrees with a concurrent assessment of a different type – this
type of validity is important if you want to use the test as a substitute for a less convenient
measure. Construct validity is a more complicated concept which involves explaining test scores
as the results of psychological concepts – intelligence, for example, or motivation. It is chiefly
useful as a guide to research, and interpreting the results of research in construct validation
requires some statistical knowledge.

The results of the assessment of predictive and concurrent validity are expressed as correlation
coefficients, and again the absolute minimum standard is .71. Often evidence of predictive
validity is impossible to obtain. For example, university entrance examinations notoriously fail to
predict success in university. The reason, though, is not necessarily the inadequacy of the
examinations but the inadequacy of the sample. When you compare scores on the entrance
examination with success in university, you're looking at the success only of the highest-scoring
students on the entrance examination. If people with middling or low scores on the entrance
examination were admitted to university then you could well find a relationship between the
entrance examination results and success in university.
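
A small simulation makes the range-restriction point concrete. Everything below is invented for illustration: the strength of the simulated relationship between exam score and later success is an assumption, not a claim about any real examination.

# Sketch: restriction of range. The underlying relationship is simulated, so the
# exact numbers are illustrative only.
import random
from statistics import correlation  # Python 3.10+

random.seed(1)
entrance, success = [], []
for _ in range(5000):
    ability = random.gauss(0, 1)                     # a common underlying ability
    entrance.append(ability + random.gauss(0, 0.8))  # entrance examination score
    success.append(ability + random.gauss(0, 0.8))   # later success in university

# Across the whole applicant pool the examination clearly predicts success.
print("all applicants:", round(correlation(entrance, success), 2))

# Admit only the top quarter of entrance scores, then correlate within that group.
cutoff = sorted(entrance)[int(0.75 * len(entrance))]
admitted = [(e, s) for e, s in zip(entrance, success) if e >= cutoff]
print("admitted only:", round(correlation(*zip(*admitted)), 2))
# The second coefficient is much smaller, even though the examination "works".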

That's my brief non-technical guide. If you've found it too brief or too technical, send me some e-mail.
