Language Assessment Module
1.0 OVERVIEW OF ASSESSMENT: CONTEXT, ISSUES AND TRENDS
SYNOPSIS
LEARNING OUTCOMES
By the end of this topic, you will be able to:
1.
2.
3.

1.2 FRAMEWORK OF TOPICS
CONTENT
SESSION ONE (3 hours)
1.3 INTRODUCTION

1.4
1.4.1 Test
The four terms above are frequently used interchangeably in academic discussions. A test is a subset of assessment intended to measure a test-taker's language proficiency, knowledge, performance or skills. Testing is a type of assessment technique. It is a systematically prepared procedure that happens at a point in time when a test-taker gathers all his abilities to achieve ultimate performance because he knows that his responses are being evaluated and measured. A test is, first, a method of measuring a test-taker's ability, knowledge or performance in a given area; and second, it must measure.

Bachman (1990), who was also quoted by Brown, defined a test as a process of quantifying a test-taker's performance according to explicit procedures or rules.
1.4.2 Assessment
Assessment is every so often a misunderstood term. Assessment is a comprehensive process of planning, collecting, analysing, reporting, and using information on students over time (Gottlieb, 2006, p. 86). Mousavi (2009) is of the opinion that assessment is appraising or estimating the level of magnitude of some attribute of a person. Assessment is an important aspect of the fields of language testing and educational measurement and perhaps the most challenging part of it. It is an ongoing process in educational practice, which involves a multitude of methodological techniques. It can consist of tests, projects, portfolios, anecdotal information and student self-reflection. Performance may be assessed formally or informally, subconsciously or consciously, as well as incidentally or intentionally by an appraiser.
1.4.3 Evaluation
Evaluation is another confusing term. Many are confused between evaluation and testing. Evaluation does not necessarily entail testing. In reality, evaluation is involved when the results of a test (or other assessment procedure) are used for decision-making (Bachman, 1990, pp. 22-23). Evaluation involves the interpretation of information. If a teacher simply records numbers or makes check marks on a chart, it does not constitute evaluation. When a tester or marker evaluates, s/he values the results in such a way that the worth of the performance is conveyed to the test-taker. This is usually done with some reference to the consequences, either good or bad, of the performance. This is commonly practised in applied linguistics research, where the focus is often on describing processes, individuals, and groups, and the relationships among language use, the language use situation, and language ability.

Test scores are an example of measurement, and conveying the meaning of those scores is evaluation. However, evaluation can occur without measurement. For example, if a teacher appraises a student's correct oral response with words like "Excellent insight, Lilly!", it is evaluation.
1.4.4 Measurement
Measurement is the assigning of numbers to certain attributes of objects, events, or people according to a rule-governed system. For our purposes of language testing, we will limit the discussion to unobservable abilities or attributes, sometimes referred to as traits, such as grammatical knowledge, strategic competence or language aptitude. Similar to other types of assessment, measurement must be conducted according to explicit rules and procedures as spelled out in test specifications, criteria, and procedures for scoring. Measurement could be interpreted as the process of quantifying the observed performance of classroom learners. Bachman (1990) cautioned us to distinguish between quantitative and qualitative descriptions. Simply put, the former involves assigning numbers (including rankings and letter grades) to observed performance, while the latter consists of written descriptions, oral feedback, and non-quantifiable reports.
The relationships among test, measurement, assessment, and their
uses are illustrated in Figure 1.
a) research methodology;
b) practical advances;
c)
d)
e)
3.0
Pre-Independence
Examinations were conducted according to the needs of schools or based on overseas examinations such as the Overseas School Certificate.

Implementation of the Razak Report (1956)
The Razak Report gave birth to the National Education Policy and the creation of the Examination Syndicate (LP). LP conducted examinations such as the Cambridge School Certificate, the Malayan Secondary School Entrance Examination (MSSEE), and the Lower Certificate of Education (LCE) Examination.

Implementation of the Rahman Talib Report (1960)
The Rahman Talib Report recommended the following actions:
1. Extend the schooling age to 15 years old.
2. Automatic promotion to higher classes.
3. Multi-stream education (Aneka Jurusan).
The following changes in examinations were made:
- The entry of elective subjects in the LCE and SRP.
- The introduction of the Standard 5 Evaluation Examination.
- The introduction of Malaysia's Vocational Education Examination.
- The introduction of the Standard 3 Diagnostic Test (UDT).

Implementation of the Cabinet Report (1979)
The implementation of the Cabinet Report resulted in the evolution of the education system to its present state, especially with the KBSR and KBSM. Adjustments were made in examinations to fulfill the new curriculum's needs and to ensure they are in line with the National Education Philosophy.

Implementation of the Malaysia Education Blueprint (2013-2025)
The emphasis is on School-Based Assessment (SBA), which was first introduced in 2002. It is a new system of assessment and one of the new areas where teachers are directly involved. The national examination and school-based assessments are being revamped in stages, whereby by 2016 at least 40% of the questions in the Ujian Penilaian Sekolah Rendah (UPSR) and 50% of those in the Sijil Pelajaran Malaysia (SPM) are higher-order thinking skills questions.
TOPIC 2
ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING
Tutorial question
Examine the contributing factors to the changing trends of
language assessment.
Create and present findings using graphic organisers.
TOPIC 3
level in each individual skill area. Elsewhere, placement test scores are used
to determine if a student needs any further instruction in the language or could
matriculate directly into an academic programme.
Discuss the extent to which tests or assessment tasks serve their purpose.
CONTENT
Formative and Summative
Objective and Subjective

Definition
- CRT: An approach that provides information on students' mastery based on a criterion specified by the teacher.

Purpose
- NRT: Determine performance differences among individuals and groups.
- CRT: Determine learning mastery based on a specified criterion and standard.

Test item
- NRT: From easy to difficult level, and able to discriminate examinees' ability.
- CRT: Guided by minimum achievement in the related objectives.

Frequency
- NRT: Continuous assessment.
- CRT: Continuous assessment in the classroom.

Appropriateness
- NRT: Summative evaluation.
- CRT: Formative evaluation.

Example
- NRT: Public exams: UPSR, PMR, SPM and STPM.
- CRT: Mastery tests: monthly tests, coursework, projects and exercises in the classroom.

Table 3: The differences between Norm-Referenced Test (NRT) and Criterion-Referenced Test (CRT)
3.5 Formative Test

Formative test or assessment, as the name implies, is a kind of assessment that evaluates students while they are still forming their competencies and skills, in order to help them to get it right. This can take place when teachers examine the results of achievement and progress tests. Based on the results of a formative test or assessment, teachers can suggest changes to the focus of the curriculum or the emphasis on some specific lesson elements. On the other hand, students may also need to change and improve. Due to the demanding nature of formative testing, numerous teachers prefer not to adopt it, although giving back any assessed homework or achievement test presents both teachers and students with healthy and valuable learning opportunities.
3.6 Summative Test

Summative test or assessment, on the other hand, refers to the kind of measurement that summarises what the student has learnt or gives a one-off measurement. In other words, summative assessment is assessment of student learning. Students are more likely to experience assessment carried out individually, where they are expected to reproduce discrete language items from memory. The results are then used to produce a school report and to determine what students know and do not know. It does not necessarily provide a clear picture of an individual's overall progress or even his/her full potential, especially if s/he is hindered by the fear factor of physically sitting for a test, but it may provide straightforward and invaluable results for teachers to analyse. It is given at a point in time to measure student achievement in relation to a clearly defined set of standards, but it does not necessarily show the way to future progress. It is given after learning is supposed to occur. End-of-year tests in a course and other general proficiency or public exams are some examples of summative tests or assessment. Table 3.1 shows formative and summative assessments that are common in schools.
Formative Assessment          Summative Assessment
Anecdotal records             Final exams
Quizzes and essays            National exams (UPSR, PMR, SPM, STPM)
Diagnostic tests              Entrance exams

Table 3.1: Common formative and summative assessments in schools
3.7 Objective Test

According to BBC Teaching English, an objective test is a test that has right or wrong answers and so can be marked objectively. Objective test items include:
ii. True-false items/questions;
iii. Matching items/questions; and
iv.
1.
2. Stem
Every multiple-choice item consists of a stem (the body of the item that presents the problem or question).
3. Options or alternatives
These are known as the list of possible responses to a test item. There are usually between three and five options/alternatives to choose from.
4. Key
This is the correct response. The response can be either the correct one or the best one. Usually, for a good item, the correct answer is not obvious compared with the distractors.
5. Distractors
A distractor is included to draw students away from selecting the correct answer. An excellent distractor is almost the same as the correct answer, but it is not correct.
ii.
iii. Make certain that the intended answer is clearly the only correct one;
iv.
3.8 Subjective Test

Contrary to an objective test, a subjective test is evaluated by giving an opinion, usually based on agreed criteria. Subjective tests include essay, short-answer, vocabulary, and take-home tests. Some students become very anxious about these tests because they feel their writing skills are not up to par. In reality, a subjective test provides more opportunity for test-takers to show/demonstrate their understanding and/or in-depth knowledge and skills in the subject matter. In this case, test-takers might provide some acceptable, alternative responses that the tester, teacher or test developer did not predict. Generally, subjective tests will test the higher skills of analysis, synthesis, and evaluation. In short, subjective tests enable students to be more creative and critical. Table 3.2 shows various types of objective and subjective assessments.
Objective Assessments          Subjective Assessments
True/False items               Extended-response items
Multiple-choice items          Restricted-response items
Multiple-response items        Essay
Matching items

Table 3.2: Various types of objective and subjective assessments
Some have argued that the distinction between objective and subjective
assessments is neither useful nor accurate because, in reality, there is no such
thing as objective assessment. In fact, all assessments are created with
inherent biases built into decisions about relevant subject matter and content,
as well as cultural (class, ethnic, and gender) biases.
Reflection
1.
Objective test items are items that have only one answer or correct
response. Describe in-depth the multiple-choice test item.
2.
Discussion
1. Identify at least three differences between formative and summative assessment.
2. What are the strengths of multiple-choice items compared to essay
items?
3. Informal assessments are often unreliable, yet they are still
important in classrooms. Explain why this is the case, and defend
your explanation with examples.
4. Compare and contrast Norm-Referenced Tests with Criterion-Referenced Tests.
TOPIC 4
4.0
SYNOPSIS
LEARNING OUTCOMES
By the end of this topic, you will be able to:
1.
2.
3.
4.2 FRAMEWORK OF TOPICS
- Reliability
- Validity
- Practicality
- Authenticity
- Objectivity
- Washback Effect
- Interpretability
- Types of Tests

CONTENT
SESSION FOUR (3 hours)

4.3 INTRODUCTION
Assessment is a complex, iterative process requiring skills,

4.4 RELIABILITY (consistency)
the raters agree 8 out of 10 times, the test has an 80% inter-rater reliability rate. Rater reliability is assessed by having two or more independent judges score the test. The scores are then compared to determine the consistency of the raters' estimates.
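As a rough illustration of the "agree 8 out of 10 times" idea, the short Python sketch below counts how often two raters award the same score; the rater scores are invented purely for the example.

    # Percent agreement between two raters (minimal sketch; invented scores).
    rater_a = [3, 4, 2, 5, 4, 3, 2, 4, 5, 3]
    rater_b = [3, 4, 2, 4, 4, 3, 2, 4, 5, 2]

    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    agreement_rate = agreements / len(rater_a)
    print(f"Inter-rater agreement: {agreement_rate:.0%}")  # 8 of 10 pairs match -> 80%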
Intra-rater reliability is an internal factor. In intra-rater reliability, the main aim is consistency within the rater. For example, if a rater (teacher) has many examination papers to mark and does not have enough time to mark them, s/he might take much more care with the first, say, ten papers than with the rest. This inconsistency will affect the students' scores; the first ten might get higher scores. In other words, while inter-rater reliability involves two or more raters, intra-rater reliability is the consistency of grading by a single rater. Scores on a test are rated by a single rater/judge at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. Some papers that are graded during the day may get our full and careful attention, while others that are graded towards the end of the day are very quickly
marked in a hurry, resulting in inconsistent scoring by the same rater (Clark, 1979).
4.4.2 Test Administration Reliability
There are a number of factors which influence test administration reliability. Unreliability occurs due to outside interference like noise, variations in photocopying, temperature variations, the amount of light in various parts of the room, and even the condition of desks and chairs. Brown (2010) stated that he once witnessed the administration of a test of aural comprehension in which an audio player was used to deliver items for comprehension, but due to street noise outside the building, test-takers sitting next to open windows could not hear the stimuli clearly. According to him, that was a clear case of unreliability caused by the conditions of the test administration.
b. Teacher-Student factors
In most tests, it is normal for teachers to construct and

c. Environment factors
Because students' grades are dependent on the way tests are administered, test administrators should strive to provide clear and accurate instructions, sufficient time and careful monitoring of tests to improve the reliability of their tests. A test-retest technique can be used to determine test reliability.
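One common way to quantify such a test-retest check is to correlate the scores from the two administrations. The sketch below computes a Pearson correlation with the Python standard library only; the scores are hypothetical and simply illustrate the procedure.

    # Test-retest reliability estimated as a Pearson correlation (illustrative data).
    import math

    first_sitting = [55, 62, 70, 48, 66, 59]
    second_sitting = [53, 65, 72, 50, 63, 60]

    n = len(first_sitting)
    mean1 = sum(first_sitting) / n
    mean2 = sum(second_sitting) / n
    cov = sum((x - mean1) * (y - mean2) for x, y in zip(first_sitting, second_sitting))
    sd1 = math.sqrt(sum((x - mean1) ** 2 for x in first_sitting))
    sd2 = math.sqrt(sum((y - mean2) ** 2 for y in second_sitting))
    print(f"Test-retest reliability (r) = {cov / (sd1 * sd2):.2f}")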
e. Marking factors
4.5 VALIDITY

Validity refers to the evidence base that can be provided about the inferences (conclusions) that testers would like to make on the basis of obtained scores. Clearly, we have to evaluate the whole assessment process and its constituent (component) parts by how soundly (thoroughly) we can defend the consequences that arise from the inferences and decisions we make. Validity, in other words, is not a characteristic of a test or assessment, but a judgment, which can have varying degrees of strength.
So, the second characteristic of good tests is validity, which refers to
whether the test is actually measuring what it claims to measure. This is
important for us as we do not want to make claims concerning what a student
can or cannot do based on a test when the test is actually measuring
something else. Validity is usually determined logically although several types
of validity may use correlation coefficients.
According to Brown (2010), a valid test of reading ability actually
measures reading ability and not 20/20 vision, or previous knowledge of a
subject, or some other variables of questionable relevance. To measure
writing ability, one might ask students to write as many words as they can in 15
minutes, then simply count the words for the final score. Such a test is
practical (easy to administer) and the scoring quite dependable (reliable).
However, it would not constitute (represent) a valid test of writing ability
without taking into account its comprehensibility (clarity), rhetorical discourse
elements, and the organisation of ideas.
The following are the different types of validity:
Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?
Construct validity: Are you measuring what you think you're measuring?
Is the test based on the best available theory of language and language
use?
Concurrent (parallel) validity: Can you use the current test score to
estimate scores of other criteria? Does the test correlate with other existing
measures?
the criteria (concepts, skills and knowledge) relevant to the purpose of the
examination. The important notion here is the purpose.
http://www.2dix.com/pdf-2011/testing-and-evaluation-in-esl-pdf.php
4.5.6 Practicality

Although practicality is an important characteristic of tests, it is often a limiting factor in testing. There will be situations in which, after we have already determined what we consider to be the most valid test, we need to reconsider the format purely because of practicality issues. A valid test of spoken interaction, for example, would require that the examinees be relaxed, interact with peers and speak on topics that they are familiar and comfortable with. This sounds like the kind of conversations that people have with their friends while sipping afternoon tea by the roadside stalls. Of course, such a situation would be a highly valid measure of spoken interaction if we could set it up. Imagine if we even tried to do so. It would require hidden cameras as well as a lot of telephone calls and money.
Therefore, a more practical form of the test especially if it is to be
administered at the national level as a standardised test, is to have a
short interview session of about fifteen minutes using perhaps a picture
or reading stimulus that the examinees would describe or discuss.
Therefore, practicality issues, although limiting in a sense, cannot be
dismissed if we are to come up with a useful assessment of language
ability. Practicality issues can involve economics or costs, administration
considerations such as time and scoring procedures, as well as the
ease of interpretation. Tests are only as good as how well they are
interpreted. Therefore tests that cannot be easily interpreted will
definitely cause many problems.
4.5.7 Objectivity

The objectivity of a test refers to the consistency with which teachers/examiners mark the answer scripts. Objectivity refers to the extent to which an examiner awards the same score to the same answer script. The test is said to have high objectivity when the examiner is able to give the same score to similar answers, guided by the mark scheme. An objective test is a test that has the highest level of objectivity because the scoring is not influenced by the examiner's skills and emotions. Meanwhile, a subjective test is said to have the lowest objectivity. Based on various studies, different examiners tend to award different scores to an essay test. It is also possible that the same examiner would give different scores to the same essay if s/he re-checks it at different times.
4.5.8 Washback effect

The term 'washback' or 'backwash' (Hughes, 2003, p. 1) refers to the impact that tests have on teaching and learning. Such impact is usually seen as being negative: tests are said to force teachers to do things they do not necessarily wish to do. However, some have argued that tests are potentially also 'levers for change' in language education: the argument being that if a bad test has negative impact, a good test should or could have positive washback (Alderson, 1986b; Pearson, 1988).

Cheng, Watanabe, and Curtis (2004) devoted an entire anthology to the issue of washback, while Spratt (2005) challenged teachers to become agents of beneficial washback in their language classrooms. Brown (2010) discusses the factors that provide beneficial washback in a test. He mentions that such a test positively influences what and how teachers teach and students learn; offers learners a chance to adequately prepare; gives learners feedback that enhances their language development; is more formative in nature than summative; and provides conditions for peak performance by the learners.
In large-scale assessment, washback often refers to the effects that tests have on instruction in terms of how students prepare for the test. In classroom-based assessment, washback can have a number of positive manifestations, ranging from the benefit of preparing and reviewing for a test to the learning that accrues from feedback on one's performance. Teachers can provide information that washes back to students in the form of useful diagnoses of strengths and weaknesses.

The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become a platform for further improvements.
On the other hand, their correct responses need to be complimented, especially when they represent accomplishments in a student's developing competence. Teachers can use various strategies for providing guidance or coaching. Washback enhances a number of basic principles of language acquisition, namely intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
Washback is generally said to be either positive or negative.
Unfortunately, students and teachers tend to think of the negative
effects of testing such as test-driven curricula and only studying and
learning what they need to know for the test. Positive washback, or
what we prefer to call guided washback can benefit teachers, students
and administrators. Positive washback assumes that testing and
curriculum design are both based on clear course outcomes, which are
known to both students and teachers/testers. If students perceive that
tests are markers of their progress towards achieving these outcomes,
they have a sense of accomplishment. In short, tests must be part of
learning experiences for all involved. Positive washback occurs when a
test encourages good teaching practice.
Washback is particularly obvious when the tests or examinations in question are regarded as being very important and having a definite impact on the students' or test-takers' future. We would expect, for example, that national standardised examinations would have stronger washback effects compared to school-based or classroom-based tests.
4.5.9 Authenticity

Another major principle of language testing is authenticity. It is a concept that is difficult to define, particularly within the art and science of evaluating and designing tests. Citing Bachman and Palmer (1996), Brown (2010) describes authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task" (p. 23), and then suggests an agenda for identifying those target language tasks and for transforming them into valid test items.
Language learners are motivated to perform when they are faced
with tasks that reflect real world situations and contexts. Good testing
or assessment strives to use formats and tasks that reflect the types of
situation in which students would authentically use the target language.
Whenever possible, teachers should attempt to use authentic materials
in testing language skills.
4.6.0 Interpretability
Test interpretation encompasses all the ways that meaning is
assigned to the scores. Proper interpretation requires knowledge
about the test, which can be obtained by studying its manual and other
materials along with current research literature with respect to its
use; no one should undertake the interpretation of scores on any test
without such study. In any test interpretation, the following
considerations should be taken into account.
A. Consider Reliability: Reliability is important because it is a
prerequisite to validity and because the degree to which a score may
vary due to measurement error is an important factor in its
interpretation.
B. Consider Validity: Proper test interpretation requires knowledge of
the validity evidence available for the intended use of the test. Its
Discussion
1. Discuss the importance of authenticity in testing.
2. Based on samples of formative and summative assessments, discuss the aspects of reliability/validity that must be considered in these assessments.
3. Discuss measures that a teacher can take to ensure the high validity of language assessment for the primary classroom.

TOPIC 5

5.0 SYNOPSIS
Topic 5 exposes the stages of test construction, the preparing of the test blueprint/test specifications, the elements in a Test Specifications Guidelines, and the importance of following the guidelines for constructing test items. Then we look at the various test formats that are appropriate for language assessment.
5.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1.
2.
3.
4.
5.
6.
7. validity
8. identify the elements in a Test Specifications Guidelines; and
9. demonstrate an understanding of the importance of following the guidelines for constructing test items.
5.2 FRAMEWORK OF TOPICS

CONTENT
SESSION FIVE (3 hours)

5.3 Stages of Test Construction
i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating
5.3.1 Determining
The essential first step in testing is to make oneself perfectly
clear about what it is one wants to know and for what purpose. When
we start to construct a test, the following questions have to be
answered.
5.3.2 Planning
The first form that the solution takes is a set of specifications for the test. This will include information on: content, format and timing, criteria, levels of performance, and scoring procedures.

In this stage, the test constructor has to determine the content by addressing the following:
Describing the purpose of the test;
Describing the characteristics of the test takers, the nature of the
population of the examinees for whom the test is being designed.
Defining the nature of the ability we want to measure;
Developing a plan for evaluating the qualities of test usefulness, which is the degree to which a test is useful for teachers and students; it includes six qualities: reliability, validity, authenticity, practicality, interactiveness, and impact;
Identifying resources and developing a plan for their allocation and
management;
Determining format and timing of the test;
Determining levels of performance;
Determining scoring procedures
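One informal way to picture the outcome of the planning stage is as a structured record that answers the points above. The Python sketch below is only an illustration; the field names and sample values are assumptions, not an official specification template.

    # A test specification captured as a simple data structure (illustrative only).
    test_specification = {
        "purpose": "Diagnose Year 5 pupils' reading comprehension",         # hypothetical
        "test_takers": "Year 5 pupils in a Malaysian primary classroom",     # hypothetical
        "ability_measured": "Reading comprehension of short factual texts",
        "usefulness_qualities": ["reliability", "validity", "authenticity",
                                 "practicality", "interactiveness", "impact"],
        "format_and_timing": {"sections": ["multiple choice", "short answer"],
                              "duration_minutes": 60},
        "levels_of_performance": ["below mastery", "mastery", "above mastery"],
        "scoring_procedures": "Answer key for objective items; rubric for short answers",
    }

    for field, value in test_specification.items():
        print(f"{field}: {value}")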
5.3.3 Writing
Although writing items is time-consuming, writing good items is an art.
No one can expect to be able consistently to produce perfect items.
Some items will have to be rejected, others reworked. The best way to
identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop to
items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal relations
are a desirable quality in any test writing team.
Test item writers should possess the following characteristics:
5.3.4 Preparing
One has to understand the major principles, techniques and experience of preparing test items. Not every teacher can make a good tester. To construct different kinds of tests, the tester should observe some principles. In the production-type tests, we have to bear in mind that no comments are necessary. Test writers should also try to avoid test items which can be answered through test-wiseness.
5.3.5 Reviewing
Principles for reviewing test items:
The test should not be reviewed immediately after its construction,
but after some considerable time.
Other teachers or testers should review it. In a language test, it is
preferable if native speakers are available to review the test.
5.3.6 Pre-testing
After reviewing the test, it should be submitted to pre-testing. The tester should administer the newly developed test to a group of examinees similar to the target group; the purpose is to analyse every individual item as well as the whole test. Numerical data (test results) should be collected to check the efficiency of the items; this should include item facility and discrimination.
5.3.7 Validating

Item Facility (IF) shows to what extent the item is easy or difficult. Items should be neither too easy nor too difficult. To measure the facility or easiness of an item, the following formula is used:

IF = number of correct responses (c) / total number of candidates (N)

And to measure item difficulty:

Item difficulty = number of wrong responses (w) / total number of candidates (N)

The results of such equations range from 0 to 1. An item with a facility index of 0 is too difficult, and one with an index of 1 is too easy. The ideal item is one with a value of 0.5, and the acceptable range for item facility is between 0.37 and 0.63; that is, less than 0.37 is difficult and above 0.63 is easy.
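The item facility formula can be applied directly, as in the short Python sketch below; the response counts are invented, and the 0.37-0.63 band is the acceptability range quoted above.

    # Item facility: IF = correct responses (c) / total candidates (N) (minimal sketch).
    correct_counts = {"item_1": 30, "item_2": 12, "item_3": 55}  # invented counts
    total_candidates = 60

    for item, correct in correct_counts.items():
        facility = correct / total_candidates
        if facility < 0.37:
            verdict = "difficult"
        elif facility > 0.63:
            verdict = "easy"
        else:
            verdict = "acceptable"
        print(f"{item}: IF = {facility:.2f} ({verdict})")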
Thus, tests which are too easy or too difficult for a given sample population often show low reliability. As noted in Topic 4, reliability is one of the complementary aspects of measurement.
5.4
skills to be included
are your guiding plan for designing an instrument that effectively fulfils
your desired principles, especially validity.
It is vital to note that for large-scale standardised tests like the Test of English as a Foreign Language (TOEFL), the International English Language Testing System (IELTS), the Michigan English Language Assessment Battery (MELAB), and the like, which are intended to be widely distributed and thus broadly generalised, test specifications are much more formal and detailed (Spaan, 2006). They are also usually confidential so that the institution that is designing the test can ensure the validity of subsequent forms of a test.
Many language teachers claim that it is difficult to construct an item. In reality, it is rather easy to develop an item if we are committed to the planning of the measuring instruments to evaluate students' achievement.
Taxonomy by allowing these two aspects, the noun and the verb, to form separate dimensions, the noun providing the basis for the Knowledge dimension and the verb forming the basis for the Cognitive Process dimension.
Level 1 (C1): Remember - Retrieve knowledge from long-term memory
- Recognising (also: Identifying): Locating knowledge in long-term memory that is consistent with presented material.
- Recalling (also: Retrieving): Retrieving relevant knowledge from long-term memory.

Level 2 (C2): Understand - Construct meaning from instructional messages, including oral, written and graphic communication
- Interpreting (also: Clarifying, Paraphrasing, Representing, Translating)
- Exemplifying (also: Illustrating, Instantiating)
- Classifying (also: Categorising, Subsuming)
- Summarising (also: Abstracting, Generalising)
- Inferring (also: Concluding, Extrapolating, Interpolating, Predicting)
- Comparing (also: Contrasting, Mapping, Matching)
- Explaining (also: Constructing models): Constructing a cause-and-effect model of a system.

Level 3 (C3): Apply
- Executing (also: Carrying out): Applying a procedure to a familiar task.
- Implementing (also: Using): Applying a procedure to an unfamiliar task.

Level 4 (C4): Analyse
- Differentiating: Distinguishing relevant from irrelevant parts, or important from unimportant parts, of presented material.
- Organising: Determining how elements fit or function within a structure.
- Attributing: Determining a point of view, bias, values, or intent underlying presented material.

Level 5 (C5): Evaluate - Make judgments based on criteria and standards
- Checking (also: Coordinating, Detecting, Monitoring, Testing): Detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented.
- Critiquing (also: Judging): Detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem.

Level 6 (C6): Create - Putting elements together to form a coherent or functional whole; reorganising elements into a new pattern or structure
- Generating (also: Hypothesising): Coming up with alternative hypotheses based on criteria.
- Planning (also: Designing)
- Producing: Inventing a product.

The Knowledge Domain
- Factual Knowledge: The basic elements students must know to be acquainted with a discipline or solve problems in it.
- Conceptual Knowledge
- Procedural Knowledge
- Metacognitive Knowledge
1982; Hattie & Brown, 2004). In their later research into multimodal learning, Biggs & Collis noted that there was an increase in the structural complexity of their (the students') responses (1991:64).

It may be useful to view the SOLO taxonomy as an integrated strategy, to be used in lesson design, in task guidance and in formative and summative assessment (Smith & Colby, 2007; Black & Wiliam, 2009; Hattie, 2009; Smith, 2011). The structure of the taxonomy encourages viewing learning as an ongoing process, moving from simple recall of facts towards a deeper understanding; learning is a series of interconnected webs that can be built upon and extended. Nückles et al. (2009:261) elaborate:
Cognitive strategies such as organization and elaboration are at the
heart of meaningful learning because they enable the learner to
organize learning into a coherent structure and integrate new
information with existing knowledge, thereby enabling deep
understanding and long-term retention.
This would help to develop Smith's (2011:92) self-regulating, self-evaluating learners who were well motivated by learning.

A range of SOLO-based techniques exists to assist teachers and students. Use of constructive alignment (Biggs & Tang, 2009) encourages teachers to be more explicit when creating learning objectives, focusing on what the student should be able to do and at which level. This is essential for a student to make progress and allows for the creation of rubrics, for use in class (Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make the process explicit to the student. HOTS (viz. Higher Order Thinking Skills) maps (Hook & Mills, 2011) can be used in English to scaffold in-depth discussion, encouraging students to:
Develop interpretations, use research and critical thinking
effectively to develop their own answers, and write essays that
engage with the critical conversation of the field (Linkon, 2005:247,
cited in Allen, 2011).
5.6
clearly written questions that do not attempt to trick or confuse them into incorrect responses. The following presents the major characteristics of well-written test items.
5.6.1 Aim of the test
Test item development is a critical step in building a test that properly
meets certain standards. A good test is only as good as the quality of the test
items. If the individual test items are not appropriate and do not perform well,
how can the test scores be meaningful? The topic to be evaluated (construct)
and where the evaluation is done (title/context) must be part of the curriculum.
If it is evaluated outside the curriculum, the curricular validity of the item can
be disputed. Therefore, test items must be developed to precisely measure the
objectives prescribed by the blueprint and meet quality standards.
5.6.2 Range of the topics to be tested
A test must measure the test-takers' ability or proficiency in applying the
knowledge and principles on the topics that they have learnt. Ample
opportunity must be given to students to learn the topics that are to be
evaluated. This opportunity would include the availability of language
teachers, well-equipped facilities, and the expertise of the language teachers
in conducting the lessons and providing the skills and knowledge that would be
evaluated to the test-takers or students.
5.6.3 Range of skills to be tested
Test item writers should always attempt to write test items that measure
higher levels of cognitive processing. This is not an easy task. It should be a
http://books.google.com.my/books/about/Constructing_Test_Items.html?id=Ia3SGDfbaV
6.0 Test format

What is the difference between test format and test type? For example, when you want to introduce a new kind of test, for example a reading test, which is organised a little differently from the existing test items, what do you say: test format or test type? Test format refers to the layout of questions on a test. For example, the format of a test could be two essay questions, 50 multiple-choice questions, etc. For the sake of brevity, we will only provide the outlines of some large-scale standardised tests.
UPSR
The Primary School Evaluation Test, also known as the Ujian Penilaian Sekolah Rendah (commonly abbreviated as UPSR), is a national examination taken by all pupils in our country at the end of their sixth year in primary school, before they leave for secondary school. It is prepared and examined by the Malaysian Examinations Syndicate. This test consists of two papers, namely Paper 1 and Paper 2. Multiple-choice questions in Paper 1 are answered on a standardised optical answer sheet that uses optical mark recognition for detecting answers, while Paper 2 comprises three sections, namely Sections A, B, and C.
TOPIC 6
6.0
SYNOPSIS
Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language skills
and language content. It also provides teachers with suggestions on
ways a teacher can assess the listening, speaking, reading and writing
skills in a classroom. It also discusses the concepts of and differences between discrete-point tests, integrative tests and communicative tests.
6.1
LEARNING OUTCOMES
At the end of Topic 6, teachers will be able to:
6.2
FRAMEWORK OF TOPICS
CONTENT
b.
Speaking
In the assessment of oral production, both discrete feature
objective tests and integrative task-based tests are used. The first
type tests such skills as pronunciation, knowledge of what
language is appropriate in different situations, language required in
doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils can
perform different tasks using spoken language that is appropriate
for the purpose and the context. Task-based activities involve
describing scenes shown in a picture, participating in a discussion
about a given topic, narrating a story, etc. As in the listening
performance assessment tasks, Brown (2010) cited four categories
for oral assessment.
1.
2.
A.
B.
C.
concern.
2. Intensive (controlled). Beyond the fundamentals of imitative writing are skills in producing appropriate vocabulary within a context, collocation and idioms, and correct grammatical features up to the length of a sentence. Meaning and context are important in determining correctness and appropriateness, but most assessment tasks are more concerned with a focus on form.
3.
4.
It is not by accident that we find there are few, if any, test formats that are
either supply type and objective or select type and subjective. Select type
tests tend to be objective while supply type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested three broad categories to differentiate tests according to how students are
expected to respond. These categories are the selected response tests, the
constructed response tests, and the personal response tests. Examples of
each of these types of tests are given in Table 6.1.
Selected response        Constructed response      Personal response
True-false               Fill-in                   Conferences
Matching                 Short answer              Portfolios
Multiple choice          Performance test

Table 6.1: Examples of selected-response, constructed-response and personal-response assessments
Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not surprising
that communicative tests have also been given prominence.
In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree of
unpredictability which is typical of any language interaction situation.
These tests would also take the communicative purpose of the
interaction into consideration and require the student to interact with
language that is actual and unsimplified for the learner. Fulcher finally
points out that in a communicative test, the only real criterion of
success is the behavioural outcome, or whether the learner was able
to achieve the intended communicative effect (p. 493). It is obvious
from this description that the communicative test may not be so easily
developed and implemented. Practical reasons may hinder some of the
demands listed. Nevertheless, a solution to this problem has to be
found in the near future in order to have valid language tests that are
purposeful and can stimulate positive washback in teaching and
learning.
Exercise 1
TOPIC 7

7.0 SYNOPSIS
Topic 7 focuses on scoring, grading and assessment criteria. It provides teachers with brief descriptions of the different approaches to scoring, namely objective, holistic and analytic.

7.1 LEARNING OUTCOMES
1.
2.

7.2 FRAMEWORK OF TOPICS

CONTENT
SESSION SEVEN (3 hours)
7.2.1 Objective approach

One type of scoring approach is the objective scoring approach. This approach relies on quantified methods of evaluating students' writing. A sample of how objective scoring is conducted is given by Bailey (1999) as follows:

Criteria
The 6-point scale above includes broad descriptors of what a student's essay reflects for each band. It is quite apparent that graders using this scale are expected to pay attention to vocabulary, meaning, organisation, topic development and communication. Mechanics such as punctuation are secondary to communication.
Bailey also describes another type of scoring related to the holistic approach
which she refers to as primary trait scoring. In primary trait scoring, a particular
functional focus is selected which is based on the purpose of the writing and
grading is based on how well the student is able to express that function. For
example, if the function is to persuade, scoring would be on how well the
author has been able to persuade the grader rather than how well organised
the ideas were, or how grammatical the structures in the essay were. This
approach to grading emphasises functional and communicative ability rather
than discrete linguistic ability and accuracy.
7.2.3 Analytic approach

Components        Weight
Content           30 points
Organisation      20 points
Vocabulary        20 points
Language Use      25 points
Mechanics         5 points
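To illustrate how such component weights might be combined into a single mark, here is a small Python sketch; the individual component scores are hypothetical, while the maxima follow the weighting shown in the table above.

    # Totalling an analytic score from weighted components (illustrative sketch).
    maximums = {"Content": 30, "Organisation": 20, "Vocabulary": 20,
                "Language Use": 25, "Mechanics": 5}
    # One rater's marks for a single essay, each out of the component maximum (invented).
    marks = {"Content": 24, "Organisation": 16, "Vocabulary": 15,
             "Language Use": 20, "Mechanics": 4}

    total = sum(marks[component] for component in maximums)
    print(f"Total analytic score: {total} / {sum(maximums.values())}")  # 79 / 100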
Advantages
- Quickly graded
- Provides a public standard that is understood by teachers and students alike
- Relatively higher degree of rater reliability
- Applicable to the assessment of many different topics
- Emphasises the students' strengths rather than their weaknesses
- It provides clear guidelines in

Disadvantages
Analytical
Objective
EXERCISE
1.
TOPIC 8
8.0
SYNOPSIS
Topic 8 focuses on item analysis and interpretation. It provides teachers with
brief descriptions on basic statistics terminologies such as mode, median, mean,
standard deviation, standard score and interpretation of data. It will also look at
some item analysis that deals with item difficulty and item discrimination.
Teachers will also be introduced to distractor analysis in language assessment.
FRAMEWORK OF TOPICS
CONTENT
SESSION EIGHT (6 hours)
MEDIAN
MEAN
8.2.2 Standard deviation

Standard deviation refers to how much the scores deviate from the mean. There are two methods of calculating standard deviation, namely the deviation method and the raw-score method, which are illustrated by the following formulae. To illustrate this, we will use the scores 20, 25 and 30. Using the deviation method, we come up with the following table:
Table 8.1:Calculating the Standard Deviation Using the Deviation Method
79
Using the raw score method, we can come up with the following:
Table 8.2: Calculating the Standard Deviation Using the Raw Score Method
Both methods result in the same final value of 5. If you are calculating
standard deviation with a calculator, it is suggested that the deviation
method be used when there are only a few scores and the raw score
method be used when there are many scores. This is because when
there are many scores, it will be tedious to calculate the square of the
deviations and their sum.
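Both methods can also be verified with a few lines of Python. The sketch below uses the module's sample scores (20, 25 and 30) and the n - 1 divisor, which is the assumption under which both methods give the value of 5 quoted above.

    # Standard deviation by the deviation method and the raw-score method (sketch).
    import math

    scores = [20, 25, 30]
    n = len(scores)
    mean = sum(scores) / n

    # Deviation method: sum the squared deviations from the mean.
    deviation_sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

    # Raw-score method: work from the sum of squares and the squared sum of scores.
    raw_score_sd = math.sqrt((sum(x ** 2 for x in scores) - sum(scores) ** 2 / n) / (n - 1))

    print(deviation_sd, raw_score_sd)  # both print 5.0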
8.2.3 Standard score
Standardised scores are necessary when we want to make
comparisons across tests and measurements. Z scores and T scores
are the more common forms of standardised scores although you
may come up with your own standardised score. A standardised score
can be computed for every raw score in a set of scores for a test.
i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form as other computations of standardised scores must first
calculate the Z score. The formula used to calculate the Z score is as
follows:

Z = (raw score - mean) / standard deviation

Z score values are very small and usually range only from -2 to 2.
Such small values make it inappropriate for score reporting especially
for those unaccustomed to the concept. Imagine what a parent may
say if his child comes home with a report card with a Z score of 0.47 in
English Language! Fortunately, there is another form of standardised
score - the T score with values that are more palatable to the
relevant parties.
ii. The T score

The T score is a standardised score which can be computed using the formula 10(Z) + 50. As such, the T scores for students A, B, C, and D in Table 4.3 are 10(-1.28) + 50; 10(-0.23) + 50; 10(0.47) + 50; and 10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively. These values seem perfectly appropriate compared to the Z score. The T score average or mean is always 50 (i.e. a Z score of 0), which connotes average ability and the midpoint of a 100-point scale.
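The Z and T score formulas can be expressed as a couple of small functions. The Python sketch below reproduces the module's worked T scores from the four Z values quoted above; the raw score in the last line is an assumed value, while the mean of 42 and standard deviation of 7 come from Test 1 in the scenario that follows.

    # Z score: (raw - mean) / standard deviation; T score: 10Z + 50 (minimal sketch).
    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    def t_score(z):
        return 10 * z + 50

    for z in (-1.28, -0.23, 0.47, 1.04):
        print(f"Z = {z:+.2f}  ->  T = {t_score(z):.1f}")
    # Prints T scores of 37.2, 47.7, 54.7 and 60.4, matching the worked example.

    # Example: an assumed raw score of 49 on a test with mean 42 and SD 7.
    print(z_score(49, 42, 7))  # 1.0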
8.2.4 Interpretation of data
The standardised score is actually a very important score if we want to
compare performance across tests and between students. Let us take the
following scenario as an example:
How can En. Abu solve this problem? He would have to have
standardised scores in order to decide. This would require the following
information:
Test 1: mean = 42, standard deviation = 7
Based on Table 8.4, both Ali and Chong have a negative Z score as their total score for both tests. However, Chong has a higher Z score total (i.e. -1.07 compared to -1.34) and therefore performed better when we take the performance of all the other students into consideration.
Item analysis
a. Item difficulty

Item difficulty refers to how easy or difficult an item is. The formula used to measure item difficulty is quite straightforward. It involves finding out how many students answered an item correctly and dividing that by the number of students who took the test. The formula is therefore:

Item difficulty = number of students who answered the item correctly / number of students who took the test
discrimination with the better students performing much better than the
weaker ones as is to be expected.
Let's use the following instance as an example. Suppose you have just conducted a twenty-item test and obtained the following results:
As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and lower group will each consist
of 4 students each. Based on their total scores, the upper group would
consist of students L, A, E, and G while the lower group would consist of
students J, H, D and I.
86
We now need to look at the performance of these students for each item
in order to find the item discrimination index of each item.
For item 1, all four students in the upper group (L, A, E, and G)
answered correctly while only student H in the lower group answered
correctly. Using the formula described earlier, we can plug in the
numbers as follows:
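A minimal Python sketch of the computation is given below. It uses the conventional discrimination formula, (correct in upper group - correct in lower group) / group size, which is stated here as an assumption since the module's own formula falls on an earlier page; the figures follow the worked example (4 of 4 upper-group and 1 of 4 lower-group students correct).

    # Item difficulty and item discrimination for a single item (minimal sketch).
    def item_difficulty(correct, total):
        return correct / total

    def discrimination_index(upper_correct, lower_correct, group_size):
        return (upper_correct - lower_correct) / group_size

    # Item 1 from the worked example: 4 upper-group and 1 lower-group students correct.
    print(discrimination_index(4, 1, 4))   # 0.75
    # Difficulty needs the whole class; if, say, 8 of the 12 pupils were correct (assumed):
    print(item_difficulty(8, 12))          # about 0.67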
Distractor analysis
Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test takers select
the correct answer, but how the distractors were able to function
effectively by drawing the test takers away from the correct answer. The
number of times each distractor is selected is noted in order to
determine the effectiveness of the distractor. We would expect that the
distractor is selected by enough candidates for it to be a viable
distractor.
What exactly is an acceptable value? This depends to a large extent on
Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in their
role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.
Therefore, the ideal situation would be for each of the three distractors to be
selected by an equal number of all students who did not get the answer
correct, i.e. in this case 10 students. Therefore the effectiveness of each
distractor can be quantified as 10/100 or 0.1 where 10 is the number of
students who selected the item and 100 is the total number of students
who took the test. This technique is similar to a difficulty index although the
result does not indicate the difficulty of each item, but rather the
effectiveness of the distractor. In the first situation described in this
paragraph, options A, B, C and D would have a difficulty index of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, then the indices
would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an
item, the value of the difficulty index formula for the distractors must be
interpreted in relation to the indices for the other distractors.
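The distractor-effectiveness indices described in this paragraph can be reproduced with a short Python sketch; the counts below follow the scenario of 100 test takers, 70 of whom choose the key A while the rest split evenly over the distractors.

    # Distractor analysis: proportion of all test takers choosing each option (sketch).
    option_counts = {"A": 70, "B": 10, "C": 10, "D": 10}  # A is the key
    total_takers = sum(option_counts.values())

    for option, count in option_counts.items():
        print(f"Option {option}: {count / total_takers:.2f}")
    # Gives 0.70 for the key and 0.10 for each distractor - the 'equally effective' case.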
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly from
the regular item discrimination that we have looked at earlier. Instead of
          Distractor A   Distractor B   Distractor C   Distractor D
Item 1    8*
Item 2    8*
Item 3    8*
Item 4    8*
Item 5    7*

* indicates key

d.
For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one student
from the lower group did so. If we assume that the three remaining students
from the lower group all selected distractor B, then the discrimination index for
item 1, distractor B will be:
This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This result
is to be expected of a distractor and a value of -1 to 0 is preferred.
EXERCISE
1. Calculate the mean, mode, median and range of the following set of
scores:
23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.
2. What is a normal curve and what does this show? Does the final
result always show a normal curve and how does this relate to
standardised tests?
TOPIC 9
9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with brief
descriptions on the purposes of reporting and the reporting methods.
9.1 LEARNING OUTCOMES
By the end of Topic 9, teachers will be able to:
CONTENT
SESSION NINE (3 hours)
9.2.2
Reporting methods
Student achievement progress can be reported by comparing:
i. Norm-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to other students.
ii. Criterion-Referenced Assessment and Reporting
TOPIC 10
10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in the
Malaysian primary schools. It will look at how assessment is viewed and used
in Malaysia.
10.1 LEARNING OUTCOMES
By the end of Topic 10, teachers will be able to:
CONTENT
SESSION TEN (3 hours)
10.3
Exam-oriented System
10.4
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation
Knowledge
Recalling memorized information. May involve remembering a wide range of
material from specific facts to complete theories, but all that is required is the
bringing to mind of the appropriate information. Represents the lowest level of
learning outcomes in the cognitive domain.
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when? where?
what?
Comprehension
The ability to grasp the meaning of material. Translating material from one
form to another (words to numbers), interpreting material (explaining or
summarizing), estimating future trends (predicting consequences or effects).
Goes one step beyond the simple remembering of material, and represent the
lowest level of understanding.
Learning objectives at this level: understand facts and principles, interpret
verbal material, interpret charts and graphs, translate verbal material to
mathematical formulae, estimate the future consequences implied in data,
justify methods and procedures.
Question verbs: Explain, predict, interpret, infer, summarize, convert, translate, give example, account for, paraphrase x.
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning outcomes
in this area require a higher level of understanding than those under
comprehension.
Learning objectives at this level: apply concepts and principles to new
situations, apply laws and theories to practical situations, solve mathematical problems.
Analysis
The ability to break down material into its component parts. Identifying parts,
analysis of relationships between parts, recognition of the organizational
principles involved. Learning outcomes here represent a higher intellectual
level than comprehension and application because they require an
understanding of both the content and the structural form of the material.
Learning objectives at this level: recognize unstated assumptions, recognizes
logical fallacies in reasoning, distinguish between facts and inferences,
evaluate the relevancy of data, analyze the organizational structure of a work
(art, music, writing).
Question verbs: Differentiate, compare / contrast, distinguish x from y, how
does x affect or relate to y? why? how? What piece of x is missing / needed?
Synthesis
(By definition, synthesis cannot be assessed with multiple-choice questions. It
appears here to complete Bloom's taxonomy.)
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structure.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose a
plan for an experiment, integrate learning from different areas into a plan for
solving a problem, formulate a new scheme for classifying objects (or events,
or ideas).
Question verbs: Design, construct, develop, formulate, imagine, create,
change, write a short story and label the following elements:
Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them. Learning
outcomes in this area are highest in the cognitive hierarchy because they
contain elements of all the other categories, plus conscious value judgments
based on clearly defined criteria.
Learning objectives at this level: judge the logical consistency of written
material, judge the adequacy with which conclusions are supported by data,
judge the value of a work (art, music, writing) by the use of internal criteria,
judge the value of a work (art, music, writing) by use of external standards of
excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given criteria.
Which option would be better/preferable to party y?
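To make the lists above easier to work with when writing test items, the question verbs can be organised as a simple lookup table. The Python sketch below is illustrative only and not part of the module: the level names and most verbs are taken from the lists above, the Application verbs are assumed because the module's list is truncated, and the function name and first-word matching rule are simplifications introduced for illustration.

# Illustrative sketch only: question verbs from the lists above as a lookup table.
# The Application verbs are assumed (the module's own list is truncated), and the
# first-word matching rule is a simplification, not a standard item-writing tool.
BLOOM_QUESTION_VERBS = {
    "Knowledge": ["define", "list", "state", "identify", "label", "name"],
    "Comprehension": ["explain", "predict", "interpret", "infer", "summarize",
                      "convert", "translate", "paraphrase"],
    "Application": ["apply", "use", "solve", "demonstrate"],  # assumed verbs
    "Analysis": ["differentiate", "compare", "contrast", "distinguish"],
    "Synthesis": ["design", "construct", "develop", "formulate", "imagine",
                  "create", "change"],
    "Evaluation": ["justify", "appraise", "evaluate", "judge"],
}

def likely_level(question: str) -> str:
    """Guess which cognitive level a question targets from its opening verb."""
    first_word = question.lower().split()[0].strip(",.?;:")
    for level, verbs in BLOOM_QUESTION_VERBS.items():
        if first_word in verbs:
            return level
    return "Unclassified"

print(likely_level("Define the term 'washback'."))             # Knowledge
print(likely_level("Compare formative and summative tests."))  # Analysis

A teacher could run a check of this kind over a draft question bank to see whether items cluster too heavily at the Knowledge and Comprehension levels.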
10.5
School-based Assessment
The traditional system of assessment no longer satisfies the educational
and social needs of the third millennium. In the past few decades, many
countries have made profound reforms to their assessment systems, and
several educational systems have introduced school-based assessment as
part of, or instead of, external assessment in their certification.
While examination bodies acknowledge the immense potential of
school-based assessment in terms of validity and flexibility, they also
have to guard against or deal with difficulties related to reliability,
quality control and quality assurance. In the debate on school-based
assessment, the question of why it should be adopted has been widely
written about, and there is general agreement on the validity of this
form of assessment.
Izard (2001) as well as Raivoce and Pongi (2001) explain that school-based
assessment (SBA) is often perceived as the process put in place
to collect evidence of what students have achieved, especially in
Academic:
Non-academic:
Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, timelines and procedures prepared by the LP.
Monitoring and moderation are conducted by the PBS Committee at the school,
the District and State Education Departments, and the LP.
School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards.
Teachers plan the assessment, prepare the instruments and administer the
assessment during the teaching and learning process.
Teachers mark pupils' responses and report their progress continuously.
10.6
Alternative Assessment
Traditional tests                Alternative assessment
One-shot tests                   -
Indirect tests                   Direct tests
Inauthentic tests                Authentic assessment
Individual projects              Group projects
No feedback to learners          -
Speeded exams                    Power exams
-                                Classroom-based tests
Summative                        Formative
Product of instruction           Process of instruction
Intrusive                        Integrated
Judgmental                       Developmental
Teacher proof                    Teacher mediated
Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know / what I want to know / what I've learned) charts
Dialogue journals
Checklists
Teacher-pupil conferences
Interviews
Performance tasks
Portfolios
Self assessment
Peer assessment
Portfolios
A well known and commonly uses alternative assessment is the portfolio
assessment. The contents of the portfolio become evidence of abilities
much like how we would use a test to measure the abilities of our
students.
Bailey (1998, p. 218) describes a portfolio as containing four primary
elements.
First, it should have an introduction to the portfolio itself,
which provides an overview of its contents. Bailey even suggests
that this section include a reflective essay by the student to help
express the student's thoughts and feelings about the portfolio,
perhaps explaining strengths and possible weaknesses, as well as why
certain pieces are included.
Introductory Section
Overview
Reflective Essay
Personal Section
Assessment Section
Evaluation by peers
Self-evaluation
Journals
Score reports
Photographs
Personal items
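As a rough illustration of how the sections listed above could be tracked in practice, the Python sketch below models the portfolio as a checklist. The grouping of items under sections is an assumption based on their names (the outline above does not make the nesting explicit), and the data structure and function are illustrative, not part of Bailey's model.

# Illustrative sketch only: the portfolio sections above as a checklist.
# The grouping of items under sections is assumed from their names.
PORTFOLIO_SECTIONS = {
    "Introductory Section": ["Overview", "Reflective Essay"],
    "Assessment Section": ["Evaluation by peers", "Self-evaluation",
                           "Journals", "Score reports"],
    "Personal Section": ["Photographs", "Personal items"],
}

def missing_items(submitted):
    """Return, per section, the expected items a pupil has not yet submitted."""
    return {
        section: [item for item in expected
                  if item not in submitted.get(section, [])]
        for section, expected in PORTFOLIO_SECTIONS.items()
    }

pupil = {"Introductory Section": ["Overview"],
         "Personal Section": ["Photographs"]}
print(missing_items(pupil))

A checklist of this kind could help a teacher see at a glance which pieces of evidence are still missing before the portfolio is reviewed.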
Self-assessment scales ask pupils to rate themselves from 4 down to 1 using
descriptors such as "I have difficulty with some questions, but I generally
get the meaning"; such ratings also stimulate meta-cognition.
EXERCISE
In your opinion, what are the advantages of using portfolios as
a form of alternative assessment?
REFERENCES
Allen, I. J. (2011). Reprivileging reading: The negotiation of uncertainty.
Pedagogy: Critical Approaches to Teaching Literature, Language, Composition,
and Culture, 12(1), pp. 97-120. Available at:
http://pedagogy.dukejournals.org/cgi/doi/10.1215/153142001416540
(Retrieved September 26, 2013).
Alderson, J. C. (1986b). Innovations in language testing? In M. Portal (Ed.),
Innovations in language testing (pp. 93-105). Windsor: NFER/Nelson.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction
and evaluation. Cambridge: Cambridge University Press.
Anderson, L. W. (Ed.), Krathwohl, D. R. (Ed.), Airasian, P. W., Cruikshank, K. A.,
Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (2001). A taxonomy
for learning, teaching, and assessing: A revision of Bloom's Taxonomy of
Educational Objectives (Complete edition). New York: Longman.
Anderson, K. M. (2007). Differentiating instruction to include all students.
Preventing School Failure, 51(3), pp. 49-54.
Bachman, L. F. (2004). Statistical analyses for language assessment
(pp. 22-23). Cambridge, UK: Cambridge University Press.
Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning:
The SOLO taxonomy. New York, NY: Academic Press.
Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the quality of
intelligent behaviour. In H. Rowe (Ed.), Intelligence: Reconceptualization
and measurement (pp. 57-75). Hillsdale, NJ: Lawrence Erlbaum.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S., Miller, J.,
& Newton, D. (2005). Frameworks for thinking: A handbook for teaching and
learning. Cambridge: Cambridge University Press.
Mousavi, S. A. (2009). An encyclopedic dictionary of language testing
(4th ed.). Tehran: Rahnama Publications.
Norleha Ibrahim. (2009). Management of measurement and evaluation module.
Selangor: Open University Malaysia.
Nückles, M., Hübner, S., & Renkl, A. (2009). Enhancing self-regulated learning
by writing learning protocols. Learning and Instruction, 19(3), pp. 259-271.
Available at: http://linkinghub.elsevier.com/retrieve/pii/S0959475208000558
(Retrieved March 26, 2013).
Oller, J. W. (1979). Language tests at school: A pragmatic
approach. London: Longman.
Pearson, I. (1988). Tests as levers for change. In D. Chamberlain &
R. Baumgardner (Eds.), ESP in the classroom: Practice and evaluation
(Vol. 128, pp. 98-107). London: Modern English Publications.
Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New York, NY:
Harcourt, Brace & World.
Shepard, L. A. (2000). The role of assessment in a learning culture. Paper
presented at the Annual Meeting of the American Educational Research
Association. Available at:
http://www.aera.net/meeting/am2000/wrap/praddr01.htm (Retrieved 10.8.2013).
Smith, A. (2011). High performers: The secrets of successful schools.
Carmarthen: Crown House Publishing.
Smith, T. W., & Colby, S. A. (2007). Teaching for deep learning. The Clearing
House, 80(5), pp. 205-211.
Spaan, M. (2006). Test and item specifications development. Language
Assessment Quarterly, 3, pp. 71-79.
Spratt, M. (2005). Washback and the classroom: The implications
for teaching and learning of studies of washback from exams.
Language Teaching Research, 19, 5-29.
Stansfield, C., & Reed, D. (2004). The story behind the Modern Language
Aptitude Test: An interview with John B. Carroll (1916-2003). Language
Assessment Quarterly, 1, pp. 43-56.
Websites
http://www.catforms.com/pages/Introduction-to-Test-Items.html (Retrieved 9.8.2013)
http://myenglishpages.com/blog/summative-formativeassessment/ (Retrieved 10.8.2013)
http://www.teachingenglish.org.uk/knowledge-database/objectivetest (Retrieved 12.8.2013)
http://assessment.tki.org.nz/Using-evidence-for-learning/Concepts/Concept/Reliability-and-validity
NAME
NURLIZA BT OTHMAN
[email protected]
QUALIFICATIONS: