Week 1 & 2: Testing

The document discusses the distinctions between testing, assessment, and evaluation in educational contexts, emphasizing that assessment is typically conducted by others while evaluation is self-reflective. It outlines various types of evaluations, including formative and summative, and explains the importance of reliability, validity, and practicality in testing. Additionally, it covers different testing methodologies and types, highlighting how they can be used to measure learners' progress and inform teaching practices.


Testing, Assessment and Evaluation
Defining Terms
• When we are talking about giving people a test and recording
scores etc., we would normally refer to this as an assessment
procedure.
• If, on the other hand, we are talking about looking back over a
course or a lesson and deciding what went well, what was learnt
and how people responded, we would prefer the term 'evaluate'
as it seems to describe a wider variety of data input (testing, but
also talking to people and recording impressions and so on).
EVALUATION

• Evaluation doesn't have to be very elaborate.
• The term could cover anything from nodding to accept an answer in class up to formal examinations set by international testing bodies, but at that end of the cline we are more likely to talk about assessment and examining.
Defining Terms
• Another difference in use is that when we measure success for
ourselves (as in teaching a lesson) we are conducting evaluation;
when someone else does it, it's called assessment.
• In what follows, therefore, the terms are used to mean the same
thing but the choice of which term to use will be made to be
appropriate to what we are discussing.
• How about 'testing'? In this guide 'testing' is seen as a form of
assessment but, as we shall see, testing comes in all shapes and
sizes. Look at it this way:
Testing, Assessment and Evaluation
• As you see, testing sits uncomfortably between evaluation and assessment. If
testing is informal and classroom based, it forms part of evaluation. A
bi-weekly progress test is part of evaluation although learners may see it as
assessment. When testing is formal and externally administered, it's
usually called examining.
• Testing can be anything in between. For example, an institution's end-of-
course test is formal testing (not examining) and a concept-check question to
see if a learner has grasped a point is informal testing and part of evaluating
the learning process in a lesson.
Matching Exercise
1. Formal summative assessment .....
2. External assessment / examining .....
3. Formal evaluation ....
4. Informal evaluation ....
5. Informal testing ....

A. Chatting to students to ask if they think they benefited from the lesson
B. Setting a test mid-lesson to see where to go from here
C. A learner taking an IELTS examination
D. Listening to learners in the free stage of the lesson to see if they are using the target language
E. An end-of-course written test set by the institution
Why evaluate, assess or test?
• It's not enough to be clear about what you want people to learn and to design a teaching programme to achieve the objectives. We must also have some way of knowing whether the objectives have been achieved.
• That's called testing.
• "If you can't measure it, you can't improve it." (Peter Drucker)
Types of evaluation, assessment and testing
• Initial vs. Formative vs. Summative evaluation
• Initial testing is often one of two things in ELT: a diagnostic test to help
formulate a syllabus and course plan or a placement test to put learners into
the right class for their level.
• Formative testing is used to enhance and adapt the learning
programme. Such tests help both teachers and learners to see what has
been learned and how well and to help set targets. It has been called
educational testing.
• Formative evaluation may refer to adjusting the programme or
helping people see where they are. In other words, it may be
targeted at teaching or learning (or both).
Types of evaluation, assessment and testing
• Summative tests, on the other hand, seek to measure how well a set
of learning objectives has been achieved at the end of a period of
instruction.
• Robert Stake describes the difference this way: When the cook tastes
the soup, that's formative. When the guests taste the soup, that's
summative. (cited in Scrivener, 1991:169).
• There is more on the distinctions and arguments surrounding formative
and summative testing below.
Informal vs. Formal evaluation
• Formal evaluation usually implies some kind of written document (although it may be an oral test) and some kind of scoring system. It could be a written test, an interview, an on-line test, a piece of homework or a number of other things.
• Informal evaluation may include some kind of document but there's unlikely to be a scoring system as such, and evaluation might include, for example, simply observing the learner(s), listening to them and responding, giving them checklists, peer- and self-evaluation and a number of other procedures.
Objective vs. Subjective assessment
• Objective assessment (or, more usually, testing) is characterized by tasks in
which there is only one right answer. It may be a multiple-choice test, a
True/False test or any other kind of test where the result can readily be seen and
is not subject to the marker's judgement.
• Subjective tests are those in which questions are open ended and the
marker's judgement is important.
• Of course, there are various levels of test on the subjective-objective scale.
Criterion-referencing vs. Norm-referencing in tests
• Criterion-referenced tests are those in which the result is measured against a scale (e.g., by grades from A to E or by a score out of 100).
• The object is to judge how well someone did against a set of objective criteria, independently of any other factors.
• A good example is a driving test.
Norm-Referencing
Norm-referencing is a way of measuring students against
each other.
For example, if 10% of a class are going to enter the next
class up, a norm-referenced test will not judge how well they
achieved a task in a test but how well they did against the
other students in the group.
Some universities apply norm-referenced tests to select
undergraduates.
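To make the contrast concrete, here is a minimal Python sketch; the names and scores are invented for illustration, and only the 10% quota comes from the slide above.

```python
# Norm-referencing: rank the group and take the top 10%, regardless of
# what absolute marks they got. Criterion-referencing: apply a fixed
# pass mark, regardless of how anyone else did.
scores = {"Ana": 71, "Burak": 58, "Chen": 86, "Dilek": 64, "Eva": 77,
          "Filip": 49, "Gul": 90, "Hana": 62, "Ivan": 55, "Jale": 68}

quota = max(1, round(len(scores) * 0.10))               # top 10% of the group
ranked = sorted(scores, key=scores.get, reverse=True)
norm_referenced_pass = ranked[:quota]                   # relative standing only

criterion_referenced_pass = [n for n, s in scores.items() if s >= 75]  # fixed pass mark (invented)

print(norm_referenced_pass)        # ['Gul']
print(criterion_referenced_pass)   # ['Chen', 'Eva', 'Gul']
```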
Matching Exercise
1. Formal assessment .....
2. Criterion referencing ....
3. Summative evaluation ....
4. Subjective testing ....
5. Norm referencing ....
6. Objective testing ....
7. Initial evaluation ....
8. Formative evaluation ....
9. Informal evaluation ....

A. Assessing teaching on a course such as CELTA
B. A multiple-choice test with one right answer only
C. This test (designed to tell you if you need to go back and look again)
D. Asking students if they need to revise the area
E. An externally set and marked examination
F. An end-of-course test
G. Finding the best and worst students in the class by giving them a test
H. A placement test
I. An essay set for homework on a subject of the learners' choice, marked out of 10 by how well you think they write
Washback Effect
The term 'backwash' or, sometimes, 'washback', is used to
describe the effect that knowledge of the format of a test
or examination has on teaching.
 For example, if we are preparing people for a particular style
of examination, some (perhaps nearly all) of the teaching will
be focused on training learners to perform well in that test
format.
It is averred that formative assessment:
• makes teaching more effective because it provides data in an ongoing way on which the teacher can act and to which the learners can react
• has a positive effect on achievement because goals are set rationally and realistically and are achievable in small steps
• empowers learners and encourages them to take more responsibility for their own learning and progress
It is averred that formative assessment:
• acts as a corrective to prevent misunderstandings of what is required because feedback on learning success is specific and targeted (unlike a grade on an examination certificate)
• is cooperative because learners can involve each other in assessing their own progress
• encourages autonomy because learners can acquire the skills of self-assessment which can be transferred to other settings
Types of tests
• As far as day-to-day classroom use is concerned, teachers are mostly involved in writing and administering achievement tests as a way of telling the learners how successfully what has been taught has been learned.
Types of test items

• Alternate response
• This sort of item is probably most familiar to language teachers as a True /
False test. (Technically, only two possibilities are allowed. If you have a True /
False / Don't know test, then it's really a multiple-choice test.)
• Multiple-choice
• This is sometimes called a fixed-response test. Typically, the correct answer
must be chosen from three or four alternatives. The 'wrong' items are called
the distractors.
Types of Test Items
• Structured response
• In tests of this sort, the subject is given a structure in which to form the answer.
Sentence completion items of the sort which require the subject to expand a sentence such as
He / come/ my house / yesterday / 9 o'clock into He came to my house at 9 o'clock yesterday
are tests of this sort as are writing tests in which the test-taker is constrained to include a
list of items in the response.
• Free response
• In these tests, no guidance is given other than the rubric and the subjects are free to write or
say what they like.
• A hybrid form of this and a structured response item is one where the subject is given a
list of things to include in the response but that is usually called a structured response test,
especially when the list of things to include covers most of the writing and little is left to the
test-taker's imagination.
TEST TYPES: WHAT THE TESTS ARE INTENDED TO DO, WITH EXAMPLES

Aptitude tests
Intended to do: test a learner's general ability to learn a language rather than the ability to use a particular language
Example: The Modern Language Aptitude Test (US Army) and its successors

Achievement tests
Intended to do: measure students' performance at the end of a period of study to evaluate the effectiveness of the programme
Example: an end-of-course or end-of-week etc. test (even a mid-lesson test)

Diagnostic tests
Intended to do: discover learners' strengths and weaknesses for planning purposes
Example: a test set early in a programme to plan the syllabus

Proficiency tests
Intended to do: test a learner's ability in the language regardless of any course they may have taken
Example: public examinations and placement tests

Barrier tests
Intended to do: a special type of test designed to discover if someone is ready to take a course
Example: a pre-course test which assesses the learner's current level with respect to the intended course content
WAYS OF TESTING AND MARKING

Direct testing
Description: testing a particular skill by getting the student to perform that skill
Example: testing whether someone can write a discursive essay by asking them to write one
Comments: the argument is that this kind of test is more reliable because it tests the outcomes, not just the individual skills and knowledge that the test-taker needs to deploy

Indirect testing
Description: trying to test the abilities which underlie the skills we are interested in
Example: testing whether someone can write a discursive essay by testing their ability to use contrastive markers, modality, hedging etc.
Comments: although this kind of test is less reliable in testing whether the individual skills can be combined, it is easier to mark objectively

Discrete-point testing
Description: a test format with many items requiring short answers which each target a defined area
Example: placement tests are usually of this sort, with multiple-choice items focused on vocabulary, grammar, functional language etc.
Comments: these sorts of tests can be very objectively marked and need no judgement on the part of the markers

Integrative testing
Description: combining many language elements to do the task
Example: public examinations contain a good deal of this sort of testing, with marks awarded for various elements: accuracy, range, communicative success etc.
Comments: although the task is integrative, the marking scheme is designed to make the marking non-judgemental by breaking down the assessment into discrete parts

Subjective marking
Description: the marks awarded depend on someone's opinion or judgement
Example: marking an essay on the basis of how well you think it achieved the task
Comments: subjective marking has the great disadvantage of requiring markers to be very carefully monitored and standardised to ensure that they all apply the same strictness of judgement consistently

Objective marking
Description: marking where only one answer is possible – right or wrong
Example: machine marking a multiple-choice test completed by filling in a machine-readable mark sheet
Comments: this obviously makes the marking very reliable, but it is not always easy to break language knowledge and skills down into digital, right-wrong elements

Analytic marking
Description: the separate marking of the constituent parts that make up the overall performance
Example: breaking down a task into parts and marking each bit separately (see integrative testing, above)
Comments: this is very similar to integrative testing, but care has to be taken to ensure that the breakdown is really into equivalent and usefully targeted areas

Holistic marking
Description: different activities are included in the overall description to produce a multi-activity scale
Example: marking an essay on the basis of how well it achieves its aims (see subjective marking, above)
Comments: the term holistic refers to seeing the whole picture; such marking has the same drawbacks as subjective marking, requiring monitoring and standardisation of markers
• Naturally, these types of testing and marking can be combined in any
assessment procedure and often are.
• For example, a piece of writing in answer to a structured response test
item can be marked by awarding points for mentioning each required
element (objective) and then given more points for overall effect on the
reader (subjective).
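A rough sketch of how such a combined mark might be totalled follows; the required elements, the point weights and the 0-5 band are invented for illustration and are not taken from any real marking scheme.

```python
# Combined marking sketch: objective points for each required element the
# script mentions, plus a subjective band for overall effect on the reader.
REQUIRED_ELEMENTS = ("greeting", "reason for writing", "request", "closing")

def mark_structured_response(elements_mentioned, overall_effect_band):
    """elements_mentioned: set of required elements found in the script.
    overall_effect_band: the marker's subjective judgement on a 0-5 scale."""
    objective = sum(2 for e in REQUIRED_ELEMENTS if e in elements_mentioned)
    subjective = max(0, min(5, overall_effect_band))
    return objective + subjective      # out of 13 with these invented weights

print(mark_structured_response({"greeting", "request", "closing"}, 4))  # 10
```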
Three fundamental concepts:
• Reliability, Validity and Practicality
Reliability
• This refers, oddly, to how reliable the test is. It answers this
question:
• Would a candidate get the same result whether they took the test
in London or Kuala Lumpur or if they took it on Monday or
Tuesday?
• This is sometimes referred to as the test-retest method.
• A reliable test is one which will produce the same result if it is
administered again.
• Statisticians reading this will immediately understand that it is
the correlation between the two test results that measures
reliability.
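As the last bullet notes, reliability is measured by the correlation between the two sets of results. A minimal sketch of that calculation, with invented scores (statistics.correlation needs Python 3.10 or later):

```python
# Test-retest reliability sketch: correlate the scores candidates got on the
# first sitting with the scores they got on a second sitting of the same test.
from statistics import correlation   # Pearson's r; available from Python 3.10

first_sitting  = [54, 61, 73, 80, 45, 66, 70, 58]
second_sitting = [56, 59, 75, 78, 48, 64, 71, 60]

r = correlation(first_sitting, second_sitting)
print(f"test-retest reliability r = {r:.2f}")   # close to 1.0 means highly reliable
```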
Validity
• Two questions here:
• Does the test measure what we say it measures?
• For example, if we set out to test someone's ability to participate in
informal spoken transactions, do the test items we use actually test
that ability or something else?
• Does the test contain a relevant and representative sample of what it is
testing?
• For example, if we are testing someone's ability to write a formal email,
are we getting them to deploy the sorts of language they actually need
to do that?
Practicality
• Is the test deliverable in practice?
• Does it take hours to do and hours to mark, or is it quite reasonable in this regard?
Matching Activity
1. A measure of reliability ....
2. Alternate response item ....
3. A measure of validity ....
4. Diagnostic test ....
5. Free response test ....
6. Multiple-choice test ....
7. Proficiency test ....
8. Achievement test ....

A. Does this test measure what we want it to?
B. Please write 500 words about yourself.
C. How good are you at English?
D. Tick box A or B only.
E. If we give the test again, will we get the same result?
F. This test
G. Where are your weaknesses in English?
H. How well have we done?
Reliability vs. Validity
• If you are writing a test for your own class or an individual
learner or group of students for whom you want to plan a
course, or see how a course is going, then validity is most
important for you.
• You will only be running the test once and it isn't important that
the results are correlated to other tests. All you want to ensure
is that the test is testing what you think it's testing so
the results will be meaningful.
• There are five different sorts of validity to consider. Here they are:
VALIDITY
Construct validity
• A construct is something that happens in your brain and is not, here, to do with constructing a test.
• To have high construct validity, a test-maker must succinctly and consistently answer the question: What exactly are you testing?
• If you cannot closely and accurately describe what you are testing, you will not be able to construct a good test.
Construct Validity
• It is not enough to answer with something like "I am testing writing ability", because that begs more questions:
• At what level?
• Concerning what topics?
• For which audiences?
• In what style?
• In what register?
• In what length of text?
and so on.
Concurrent validity
• If, for example, you have a well established proficiency test, such as
one administered by experienced examination boards, you may feel
that you would be better served with a shorter test that gave you the
same sort of data.
• This may be less important to you but if your test predicts well how
learners perform in the examination proper, it will tell you more than
if it doesn't.
Concurrent Validity
• Concurrent validity is a method of assessing validity that involves comparing a new test with an already existing test, or an already established criterion.
Concurrent Validity
• To establish concurrent validity, you need to administer both tests to as large a group as possible and then carefully compare the results.
• Parallel results are a sign of good concurrent validity and you may be able to dispense with the longer test altogether.
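The comparison is the same kind of calculation as test-retest reliability: give one group both tests and correlate the two sets of scores. A sketch with invented figures:

```python
# Concurrent validity sketch: the same group takes both the new short test
# and the established longer test; closely parallel results (a high
# correlation) suggest the shorter test could stand in for the longer one.
from statistics import correlation   # Python 3.10+

short_test = [12, 18, 9, 15, 20, 11, 16]     # scores out of 20 on the new test
long_test  = [58, 83, 41, 70, 95, 52, 77]    # scores out of 100 on the established test

r = correlation(short_test, long_test)
print(f"concurrent validity r = {r:.2f}")
```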
Predictive Validity
• Predictive validity refers to whether scores on one test are associated with performance on a given criterion.
• That is, can a person's score on the test predict their performance on the criterion?
Predictive validity
• Equally, your test should tell you how well your learners will perform in the
tasks you set and the lessons you design to help them prepare for the
examination.
• For example, if you want to construct a barrier test to see if people are able
successfully to follow a course leading to an examination, you will want the test
to have good predictive validity.
Predictive Validity
This is not easy to achieve because, until at least one cohort of
learners have taken the examination, you cannot know how well
the barrier test has worked. Worse, you need to administer the
barrier test to a wide range of learners and compare the results of
the test with the examination results they actually achieved.
This will mean that the barrier test cannot be used to screen out
learners until it has been shown to have good predictive validity so
the endeavour may take months to come to fruition.
Content validity
• If you are planning a course to prepare students for a particular
examination, for example, you want your test to represent the
sorts of things you need to teach to help them succeed.
• A test which is intended to measure achievement at the end of a course also needs to contain only what has been taught, and not include any extraneous (not directly connected with or related to something) material which has not been the focus of teaching.
• Coverage plays a role here, too, because the more that has been
taught, the longer and more comprehensive the test has to be.
Face validity
• Students won't perform at their best in a test they don't trust is really
assessing properly what they can do. For example, a quick chat in a corridor
may tell you lots about a learner's communicative ability but the learner
won't feel he/she has been fairly assessed (or assessed at all).
• The environment matters, too. Most learners expect a test to be quite a
formal event held in silence with no cooperation between test-takers. If the
test is not conducted in this way, some learners may not take it as seriously
as others and perform less well than they are able to in other environments.
Make rubrics clear
• Any misunderstanding of what's required undermines
reliability.
• Learners vary in their familiarity with certain types of task
and some may, for example, instantly recognize what they
need to do from a glance at the task. Others may need
more explicit direction and even teaching. Making the
rubric clear contributes to levelling the playing field.
Finally, having considered all this, you need to
construct your test. How would you go about that?
