Week V & VI

Testing Principles

Practicality
 Is not excessively expensive
 Stays within appropriate time constraints
 Is relatively easy to administer
 Has a scoring/evaluation procedure that is specific and time-efficient
 Uses items that can be replicated in terms of the resources needed, e.g. time, materials, people
 Can be administered, graded, and its results interpreted with reasonable ease
Reliability
 A reliable test is consistent and dependable.
 Reliability relates to accuracy, dependability and consistency, e.g. 20°C here today and 20°C in northern Italy – are they the same?
 According to Henning [1987], reliability is "a measure of accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination", e.g. 75% on a test today but 83% tomorrow signals a reliability problem.
Reliability
 Student-related reliability: the deviation of an observed score from one's true score because of temporary illness, fatigue, anxiety, a bad day, etc.
 Rater reliability: two or more raters yield inconsistent scores for the same test because of lack of attention to scoring criteria, inexperience, inattention, or preconceived bias (a correlation sketch follows this list).
 Administration reliability: unreliable results because of the testing environment, such as noise, poor-quality audio equipment, etc.
 Test reliability: measurement errors arise because the test is too long.
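Rater reliability is often checked empirically by correlating two raters' scores on the same set of performances. A minimal sketch in Python; the function and score data are illustrative, not from the source:

```python
# Minimal sketch: rater reliability as the Pearson correlation
# between two raters' scores on the same five performances.
# All scores are hypothetical illustration data.
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rater_a = [75, 82, 60, 90, 70]   # hypothetical scores, rater A
rater_b = [78, 80, 65, 88, 74]   # hypothetical scores, rater B
print(f"inter-rater correlation: {pearson(rater_a, rater_b):.2f}")
```

A correlation near 1 suggests the two raters are applying the scoring criteria consistently.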
To Make a Test More Reliable
 Take a large enough sample of behaviour
 Exclude items which do not discriminate well between weaker and stronger students (see the discrimination-index sketch after this list)
 Do not allow candidates too much freedom
 Provide clear and explicit instructions
 Make sure that tests are well laid out and legible
 Make candidates familiar with the format and testing techniques
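One common way to operationalize "discriminates well" is the discrimination index: the proportion of high scorers who answered an item correctly minus the proportion of low scorers who did. A minimal sketch with invented numbers:

```python
# Minimal sketch of an item discrimination index
# D = p_upper - p_lower, where p_upper and p_lower are the
# proportions of the top and bottom scoring groups answering
# the item correctly. All data below are hypothetical.

def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D ranges from -1 to +1; near-zero items discriminate poorly."""
    return upper_correct / upper_n - lower_correct / lower_n

# Hypothetically: 24 of the 27 strongest students got the item
# right, but only 9 of the 27 weakest did.
print(f"D = {discrimination_index(24, 27, 9, 27):.2f}")  # D = 0.56
```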
To Make a Test More Reliable
 Provide uniform and non-distracting conditions of administration
 Use items that permit objective scoring
 Provide a detailed scoring key
 Train scorers
 Identify candidates by number, not by name
 Employ multiple, independent scoring
Measuring Reliability

 Test-retest reliability: administer whatever test is involved two times to the same group.
 Equivalent-forms/parallel-forms reliability: administer two different but equal tests to a single group of students (e.g. Forms A and B).
 Internal consistency reliability: estimate the consistency of a test using only information internal to the test, available from one administration of a single test. This procedure is called the split-half method.
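A minimal sketch of the split-half procedure: correlate scores on the odd- and even-numbered items, then correct for full test length with the Spearman-Brown formula r_full = 2r / (1 + r). The 0/1 response matrix is invented for illustration (statistics.correlation requires Python 3.10+):

```python
# Minimal sketch of split-half reliability with the
# Spearman-Brown correction. Item responses are hypothetical.
from statistics import correlation  # Python 3.10+

# Rows = test takers, columns = items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

odd  = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7
even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, 8

r_half = correlation(odd, even)
r_full = 2 * r_half / (1 + r_half)             # Spearman-Brown
print(f"half-test r = {r_half:.2f}, split-half reliability = {r_full:.2f}")
```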
Validity
 Criterion-related validity: the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidates' ability.
 Construct validity: the degree to which a test measures a construct, i.e. any theory, hypothesis, or model that attempts to explain observed phenomena in our universe and perception. Proficiency and communicative competence are linguistic constructs; self-esteem and motivation are psychological constructs.
Reliability Coefficient
 The reliability coefficient is used to compare the reliability of different tests.
 Lado: vocabulary, structure, reading (0.90–0.99); auditory comprehension (0.80–0.89); oral production (0.70–0.79).
 Standard error: how far an individual test taker's actual score is likely to diverge from their true score.
 Classical analysis gives us a single estimate for all test takers.
 Item response theory gives an estimate for each individual, basing this estimate on that individual's performance.
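Under classical analysis, the standard error of measurement (SEM) is commonly computed from the test's standard deviation and its reliability coefficient. A worked example with hypothetical values (SD = 10, r = 0.91):

```latex
\mathrm{SEM} = SD\,\sqrt{1 - r_{xx}} = 10 \times \sqrt{1 - 0.91} = 3
```

So an observed score of, say, 75 most likely places the true score within roughly 72–78 (plus or minus one SEM).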
The Item Response Theory
 The item response theory (IRT), also known as latent response theory, refers to a family of mathematical models that attempt to explain the relationship between latent traits (unobservable characteristics or attributes) and their manifestations (i.e. observed outcomes, responses, or performance).
 These models establish a link between the properties of items on an instrument, the individuals responding to these items, and the underlying trait being measured.
The Item Response Theory

 IRT assumes that the latent construct (e.g. stress, knowledge, attitudes) and the items of a measure are organized along an unobservable continuum.
 Therefore, its main purpose focuses on establishing the individual's position on that continuum.
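A minimal sketch of the simplest IRT model, the one-parameter (Rasch) model, in which the probability of a correct response depends only on the distance between the test taker's ability theta and the item's difficulty b on that continuum. All values are hypothetical:

```python
# Minimal sketch of a one-parameter (Rasch) IRT model: the chance
# that a test taker with ability theta answers an item of
# difficulty b correctly. Both parameters live on the same
# unobservable continuum described above.
from math import exp

def p_correct(theta: float, b: float) -> float:
    """Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + exp(-(theta - b)))

for theta in (-1.0, 0.0, 1.0):
    print(f"ability {theta:+.1f} vs item difficulty 0.0: "
          f"P = {p_correct(theta, 0.0):.2f}")
# ability -1.0 -> P = 0.27; 0.0 -> 0.50; +1.0 -> 0.73
```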
Classical Test Theory
 Classical Test Theory [Spearman, 1904; Novick, 1966] focuses on the same objective; before the conceptualization of IRT, it was (and still is) used to predict an individual's latent trait based on an observed total score on an instrument.
 In CTT, the true score represents the level of the latent variable, and the observed score is the true score plus measurement error.
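The CTT decomposition is usually written as follows, with reliability interpretable as the share of observed-score variance attributable to true scores:

```latex
X = T + E, \qquad
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```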
Validity
 The extent to which the inferences made from
assessment results are appropriate, meaningful and
useful in terms of the purpose of the assessment.
 Content validity: requires the test taker to perform the behaviour that is being measured.
 Its content constitutes a representative sample of the language skills, structures, etc. that it is meant to measure.
Validity
 Consequential validity: encompasses accuracy in measuring intended criteria, the test's impact on the preparation of test takers, its effects on the learner, and the social consequences of test interpretation and use.
 Face validity: the degree to which the test looks right and appears to measure the knowledge and ability it claims to measure, based on the subjective judgement of the examinees who take it, the administrative personnel who decide on its use, and other psychometric observers.
Validity
Response validity [internal]
 the extent to which test takers respond in the way expected by the test developers
Concurrent validity [external]
 the extent to which test takers' scores on one test relate to those on another externally recognised test or measure
Predictive validity [external]
 the extent to which scores on test Y predict test takers' ability to do X, e.g. IELTS scores and subsequent success in academic studies at university
Validity
 'Validity is not a characteristic of a test, but a
feature of the inferences made on the basis of test
scores and the uses to which a test is put.'
 To make a test more valid:
1) Write explicit test specifications
2) Use direct testing
3) Relate the scoring of responses directly to what is being tested
4) Make the test reliable
Washback

 The quality of the relationship between a test and associated teaching.
 The effect can be positive or negative.
 A test is considered valid when it has good washback.
 Students should have ready access to discuss the feedback and evaluation you have given.
Washback
 The effect of testing on teaching and learning
 The effect of a test on instruction, in terms of how students prepare for the test
 A formative test provides washback in the form of information to the learner on progress toward goals, while a summative test is always the beginning of further pursuits, more learning, more goals
 To improve washback: use direct testing, use criterion-referenced testing, base achievement tests on objectives, and make sure that the tests are understood by students and teachers
Evaluation of Classroom Tests

 Are the test procedures practical?
 Is the test reliable?
 Does the procedure demonstrate content validity?
 Are the test tasks as authentic as possible?
 Does the test give beneficial washback?
Norm-Referenced Tests

 Norm-referenced refers to standardized tests that are designed to compare and rank test takers in relation to one another.
 Norm-referenced tests report whether test takers performed better or worse than a hypothetical average student, which is determined by comparing scores against the performance results of a statistically selected group of test takers, typically of the same age or grade level, who have already taken the exam.
Norm-Referenced Tests

 Calculating norm-referenced scores is called the "norming process," and the comparison group is known as the "norming group." Norming groups typically comprise only a small subset of previous test takers, not all or even most previous test takers.
 Test developers use a variety of statistical methods to select norming groups, interpret raw scores, and determine performance levels.
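A minimal sketch of one such method: interpret a raw score against the norming group's mean and standard deviation by converting it to a z-score and an approximate percentile. The norming-group statistics are invented:

```python
# Minimal sketch of interpreting a raw score against a norming
# group: convert to a z-score, then to an approximate percentile
# via the normal CDF. All numbers are hypothetical.
from statistics import NormalDist

norm_mean, norm_sd = 52.0, 8.0   # hypothetical norming-group stats
raw = 60.0

z = (raw - norm_mean) / norm_sd
percentile = NormalDist().cdf(z) * 100

print(f"z = {z:.2f}, roughly the {percentile:.0f}th percentile")
# z = 1.00, roughly the 84th percentile
```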
NRT

 An NRT is designed to measure global language abilities such as overall English proficiency, academic listening ability, reading comprehension, and so on.
 Each student's score on such a test is interpreted relative to the scores of all other students who took the test, with reference to the normal distribution.
CRT

 A criterion-referenced test is usually produced to measure well-defined and fairly specific instructional objectives.
 The interpretation of a CRT is considered absolute, in the sense that each student's score is meaningful without reference to the other students' scores.
Criterion-referenced test results

 They are often based on the number of correct answers provided by students, and scores might be expressed as a percentage of the total possible number of correct answers.
 On a norm-referenced exam, however, the score would reflect how many more or fewer correct answers a student gave in comparison to other students.
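The contrast can be made concrete with a small sketch: the criterion-referenced report is an absolute percentage of the possible answers, while the norm-referenced report locates the same raw score among peers. All numbers are invented:

```python
# Minimal sketch contrasting the two score reports described above.
# Data are hypothetical.

def crt_score(correct: int, total: int) -> float:
    """Criterion-referenced: percent of total possible correct."""
    return 100.0 * correct / total

def nrt_rank(raw: int, peer_scores: list[int]) -> float:
    """Norm-referenced: percent of peers the raw score beats."""
    return 100.0 * sum(s < raw for s in peer_scores) / len(peer_scores)

peers = [28, 35, 40, 41, 44, 47, 50, 52, 55, 58]
print(f"CRT report: {crt_score(45, 60):.0f}% correct")        # 75%
print(f"NRT report: outscores {nrt_rank(45, peers):.0f}% of peers")  # 50%
```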

Criterion-referenced test results

 Hypothetically, if all the students who took a norm-referenced test performed poorly, the least-poor results would rank students in the highest percentile. Similarly, if all students performed extraordinarily well, the least-strong performance would rank students in the lowest percentile.
Norm-Referenced vs. Criterion-Referenced Tests
 Norm-referenced tests are specifically designed to
rank test takers on a “bell curve,” or a distribution
of scores that resembles, when graphed, the outline
of a bell—i.e., a small percentage of students
performing well, most performing average, and a
small percentage performing poorly.
 To produce a bell curve each time, test questions
are carefully designed to accentuate performance
differences among test takers, not to determine if
students have achieved specified learning
standards, learned certain material, or acquired
specific skills and knowledge.
 Tests that measure performance against a fixed set
of standards or criteria are called criterion-
referenced tests.
Characteristics of NRT and CRT
 Type of interpretation: NRT relative; CRT absolute
 Type of measurement: NRT measures general language abilities; CRT measures specific objective-based language points
 Purpose of testing: NRT spreads students out along a continuum of general abilities or proficiencies; CRT assesses the amount of material known or learned by each student
 Distribution of scores: NRT normal distribution; CRT varies, often non-normal
 Test structure: NRT a few relatively long subtests with a variety of item content; CRT a series of short, well-defined subtests with similar item content
 Knowledge of questions: NRT students have little or no idea of what content to expect in test items; CRT students know exactly what content to expect in test items
Test and Decision Purposes
Types of decision: norm-referenced (Proficiency, Placement) and criterion-referenced (Achievement, Diagnostic)
 Detail of information: Proficiency very general; Placement general; Achievement specific; Diagnostic very specific
 Focus: Proficiency general skills prerequisite to entry; Placement all levels and skills of the program; Achievement terminal objectives of the course; Diagnostic terminal and enabling objectives
 Purpose of decision: Proficiency to compare individual with individual; Placement to find each student's appropriate level; Achievement to determine the degree of learning for advancement or graduation; Diagnostic to inform students and teachers of weaker objectives
 Relationship to program: Proficiency comparisons with other institutions; Placement comparisons within the program; Achievement directly related to objectives; Diagnostic related to objectives needing more work
 When administered: Proficiency before entry and at exit; Placement beginning of the program; Achievement end of courses; Diagnostic beginning and/or middle of courses
 Score interpretation: Proficiency wide spread of …; Placement spread of …; Achievement overall number …; Diagnostic percentage of …
Characteristics of communicative tests
 Communicative test setting requirements:
1) Meaningful communication
2) Authentic situation
3) Unpredictable language input
4) Creative language output
5) All language skills
 Bases for ratings
1) Success in getting meaning across
2) Use focus rather than usage
3) New components to be rated
Components of Communicative Competence
 Grammatical competence (phonology,
orthography, vocabulary, word formation,
sentence formation)
 Sociolinguistic competence (social
meanings, grammatical forms in different
sociolinguistic contexts)
Components of Communicative Competence
 Discourse competence (cohesion in different genres, coherence in different genres)
 Strategic competence (grammatical
difficulties, sociolinguistic difficulties,
discourse difficulties, performance
factors)
Discrete-point/Integrative Issue

 Discrete-point tests measure the small bits and pieces of a language, as in a multiple-choice test made up of questions constructed to measure students' knowledge of different structures.
 Integrative tests measure several skills at one time, as in dictation.
Practical Issues

 Fairness issue: a test treats every student the same
 The cost issue
 Ease of test construction
 Ease of test administration
 Ease of test scoring
 Interactions of theoretical issues
General Guidelines for Item Formats
 Match the format correctly to the purpose and content of the item
 Check that there is only one correct answer
 Write at the students' level of proficiency
 Avoid ambiguous terms and statements
 Avoid negatives and double negatives
General Guidelines for Item Formats
 Avoid giving clues that could be used in answering other items
 Keep all parts of the item on the same page
 Present only relevant information
 Avoid bias of race, gender and nationality
 Have another person look over the items
Stem
 A multiple-choice item consists of a problem, known as the stem, and a list of suggested solutions, known as alternatives.
 The alternatives consist of one correct or best alternative, which is the answer, and incorrect or inferior alternatives, known as distractors.
Stem
 The stem should not contain irrelevant material, which can decrease the reliability and the validity of the test scores (Haladyna and Downing 1989).
 The stem should be negatively stated only when significant learning outcomes require it.
 Consider whether the stem:
 presents a clearly defined problem or task to the student,
 contains unnecessary information,
 could be more simply, clearly, or concisely stated.
Versatility

 Multiple-choice test items can be written to assess various levels of learning outcomes, from basic recall to application, analysis, and evaluation.
 Because students are choosing from a set of potential answers, however, there are obvious limits on what can be tested with multiple-choice items.
 For example, they are not an effective way to test students' ability to organize thoughts or articulate explanations or creative ideas.
Consider the Following When Reviewing Multiple-Choice Questions
 Consider whether the alternatives:
 are parallel in structure,
 fit logically and grammatically with the stem,
 could be more simply, clearly, or concisely stated,
 are so inclusive that they logically eliminate any other option from being a possible answer.
Consider the Following When Reviewing Multiple-Choice Questions
 Consider whether the distractors:
 contain one or more options a student could reasonably consider a correct answer,
 are plausible enough to be attractive to students who are low achievers,
 contain one or more that call attention to the key.
More than one correct answer
 The apple is located on or around
A) a table   C) the table
B) an table   D) table
- Flaws: two correct answers (A and C); the stem is wordy ("on or around"); the word "table" is repeated inefficiently in every option.
Multiple Choice
 Do you see the chair and table? The apple is on _____ table.
a) a   c) the
b) an   d) (no article)
 Option d (no article) will be easily detected as a wrong option, so it is not a good distractor.
Guidelines for Writing Multiple-Choice Test Items
 The following are some guidelines that you should use when preparing multiple-choice test items.
 The entire stem must always precede the alternatives, and it should contain the problem and any clarifications.
 Avoid negatively stated stems.
Guidelines for Writing Multiple-Choice Test Items
 If an omission occurs in the stem, it should appear near the end of the stem, not at the beginning.
 Use only correct grammar in the stem and alternatives.
 Avoid repeating words between the stem and the key. You can, however, repeat words in distractors to make them more attractive.
Guidelines for Writing Multiple-Choice Test Items
 Avoid wording directly from a reading passage or
use of stereotyped phrasing in the key.
 Try to avoid “all of the above” as the last option. If
a student can eliminate any of the other choices, this
choice can be automatically eliminated as well.
Guidelines for Writing Multiple-Choice Test Items
 To test understanding of a term or concept,
present the term in the stem followed by
definitions or descriptions in the alternatives.
 Do not use “none of the above” as the last option
when the correct answer is simply the best
answer among the choices offered.
 Avoid terms such as “always” or “never,” as they
generally signal incorrect choices.
True-False

 According to the passage, antidisestablishmentarianism diverges fundamentally from the conventional proceedings and traditions of the Church of England. (T / F)
- Flaw: the vocabulary is too difficult.
Ambiguous Word

 Why are statistical studies inaccessible to language teachers in Brazil, according to the reading passage?
 Inaccessible (hard to understand): language teachers get very little training in mathematics and/or such teachers are averse (strongly disliking or opposed) to numbers.
 Inaccessible (hard to obtain): the libraries may be far away.
Double negatives
 One theory that is not unassociated with Noam Chomsky is:
 A. Transformational generative grammar
 B. Case grammar
 C. Non-universal phonology
 D. Acoustic phonology
- Use one negative only.
- Emphasize it by underlining, upper case, or boldface, for example: not, NEVER, inconsistent.
Receptive response items

 True-False
1) The statement is worded carefully enough that it can be judged without ambiguity
2) Absoluteness clues are avoided
 Matching
1) More options than premises
2) Options shorter than premises, to reduce reading
3) Option and premise lists related to one central theme
Multiple Choice

 Unintentional clues are avoided
 The distractors are plausible
 Needless redundancy in the options is avoided
 Ordering of the options is carefully considered
 The correct answers are randomly assigned
True-False

 Items should be worded carefully enough that they can be judged without ambiguity
 Avoid absoluteness clues
 This book is always crystal clear in all its explanations: T F
- Absoluteness clues allow the students to answer correctly without knowing the correct response.
- Absoluteness clues: all, always, absolutely, never, rarely, most often
Guidelines for good alternatives

 Use a logical sequence for alternatives (e.g., temporal (relating to time) sequence, length of the choice).
 If two alternatives are very similar (cognitively or visually), they should be placed next to one another to allow students to compare them more easily.
Guidelines for good alternatives

 Make all incorrect alternatives (i.e., distractors) plausible and attractive. It is often useful to use popular misconceptions and frequent mistakes as distractors.
 Make all alternatives grammatically consistent with the stem.
 Item distractors should include only correct forms and vocabulary that actually exist in the language.
Guidelines for good alternatives

 Use 4 or 5 alternatives in each item.
 If one or more alternatives are partially correct, ask for the "best" answer.
 Alternatives should not overlap in meaning or be synonymous with one another.
Guidelines for good alternatives

 All alternatives should be homogeneous in content, form, and grammatical structure.
 The length, explicitness, and technical information in each alternative should be parallel so as not to give away the correct answer.
Multiple Choice

 Avoid unintentional clues
 The fruit that Adam ate in the Bible was an ____
A. pear   C. apple
B. banana   D. papaya
 Unintentional clues can be grammatical, phonological, morphological, etc.; here the article "an" points to the only vowel-initial option, apple.
Multiple Choice

 Are all distractors plausible?
 Adam ate _______
A. an apple   C. an apricot
B. a banana   D. a tire
 "A tire" is not plausible, so it is a wasted distractor.
Multiple Choice

 Avoid needless redundancy
 The boy on his way to the store, walking down the street, when he stepped on a piece of cold wet ice and
A. fell flat on his face
B. fall flat on his face
C. felled flat on his face
D. falled flat on his face
Multiple Choice

 More effective:
The boy stepped on a piece of ice and ______ flat on his face.
A. fell
B. fall
C. felled
D. falled
Multiple Choice

 Correct answers should be randomly assigned
 Distractors like "none of the above", "A and B only", "all of the above" should be avoided
Matching

 Present the students with two columns of information; the students then must find and identify matches between the two sets of information.
 An entry in the left-hand column is called a matching-item premise.
 An entry in the right-hand column is called an option.
Matching

 More options should be supplied than premises, so that students cannot narrow down the choices as they progress through the test simply by keeping track of the options they have used.
 Options should be shorter than premises, because most students will read a premise and then search through the options.
 The options and premises should relate to one central theme that is obvious to the students.
Fill in Items
 The required response should be concise
 Bad item:
 John walked down the street ________
(slowly, quickly, angrily, carefully, etc.)
 Good item:
 John stepped onto the ice and immediately
____ down hard (fell)
Fill in Items

 There should be sufficient context to convey the intent of the question to the students.
 The blanks should be standard in length.
 The main body of the question should precede the blank.
 Develop a list of acceptable responses.
Short Response

 Items that the students can answer in a few phrases or sentences.
 The item should be formatted so that only one relatively concise answer is possible.
 The item should be framed as a clear and direct question.
 E.g. According to the reading passage, what are the three steps in doing research?
Task Items
 A task item is any of a group of fairly open-ended item types that require students to perform a task in the language being tested.
 The task should be clearly defined
 The task should be sufficiently narrow for the time
available.
 A scoring procedure should be worked out in advance
in regard to the approach that will be used.
Task Items
 A scoring procedure should be worked out
in advance in regard to the categories of
language that will be rated.
 The scoring procedure should be clearly
defined in terms of what each score within
each category means.
 The scoring should be anonymous
Analytic Score for Rating Composition Tasks

Holistic Version of the Scale for Rating Composition Tasks
 Content
 Organization
 Language Use
 Vocabulary
 Mechanics
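An analytic scale like this one typically combines the category ratings into a single composition score using per-category weights. A minimal sketch; the weights and ratings below are hypothetical, not taken from the source:

```python
# Minimal sketch of combining analytic category ratings into one
# composition score. The five categories come from the scale above;
# the weights and the sample ratings are invented.

WEIGHTS = {
    "content": 0.30,        # hypothetical weight
    "organization": 0.20,
    "language_use": 0.25,
    "vocabulary": 0.20,
    "mechanics": 0.05,
}

def composite(ratings: dict[str, float]) -> float:
    """Weighted average of 0-100 category ratings."""
    return sum(WEIGHTS[c] * r for c, r in ratings.items())

essay = {"content": 80, "organization": 70, "language_use": 75,
         "vocabulary": 85, "mechanics": 90}
print(f"composite score: {composite(essay):.1f}")   # 78.2
```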
Personal Response Items

 The response allows the students to communicate in ways and about things that are interesting to them personally.
 Personal responses include: self-assessment, conferences, portfolios.
Self-Assessment
 Decide on a scoring type
 Decide what aspects of the students' language performance they will be assessing
 Develop a written rating scale for the learners
 The rating scale should describe concrete language and behaviours in simple terms
 Plan the logistics of how the students will assess themselves
 The students should learn the self-scoring procedures
 Have another student/teacher do the same scoring
Conferences
 Introduce and explain conferences to the students
 Give the students the sense that they are in control of
the conference
 Focus the discussion on the students’ views concerning
the learning process
 Work with the students concerning self-image issues
 Elicit performances on specific skills that need to be
reviewed.
 The conferences should be scheduled regularly
Portfolios

 Explain the portfolios to the students
 Decide who will take responsibility for what
 Select and collect meaningful work
 The students periodically reflect in writing on their portfolios
 Have other students, teachers, and outsiders periodically examine the portfolios