
“Development of Large-Scale Student Assessment Test”
(Chapter 13)
Large-Scale Student Assessment
 LSA has indeed come a long way in being used for different purposes, with improvement of student performance never failing to rank first in importance.
 It is a measurement of student learning designed to describe the achievement of students in particular areas of learning across an education system.
Review of Classroom Test Development Process
 Planning the test, which specifies:
1. Purpose of testing
2. Learning outcomes
3. Test blueprint: test format, number of items
 Item construction, which is performed by the classroom teacher following a table of specifications.
 Review and revision for item improvement
1. Judgmental approaches (before and after test administration)
a. By the teacher/peers, to ensure the accuracy and alignment of test content to learning outcomes
b. By the students, to ensure comprehensibility of the items and test instructions
2. Empirical approaches (after test administration)
a. Obtain item statistics in the form of quality indices.
 A teacher at whatever level defines to himself/herself, even in the simplest way, why s/he is going to prepare a test, what s/he will assess, and how s/he will test it. This is the design or planning phase from which every assessment tool, for whatever purpose it will be used, has to be drawn. The item construction which follows gives flesh to the test. To less informed or less conscientious teachers, test construction is item construction, full stop! To them, what comes before and after item construction, which is reviewing the items, is of little consequence. Hopefully, the present course on assessment will bring about changes in the way you view test development as a process so that testing results are maximized. Score-based inferences on student performance can only be made appropriately if the tests from which they are derived have been constructed properly.
DEVELOPMENT PROCESS FOR LARGE-SCALE TESTS
 Changing the context from classroom to system-wide testing, there are other significant considerations that must be addressed in addition to what is required of a teacher-made test. With an understanding of the nature of large-scale student assessment, more questions must be addressed in the development process concerning the purpose of the test, its coverage, the length of the test, the review of items for quality and fairness, and such technical merits as validity and reliability, among others.
 Divide the class into four conversation groups. Discuss the concerns that must be raised if a large-scale assessment test is to be developed.
 As a class, discuss these questions and classify them into the phases of work they will fall under.
What do you see as common steps between developing classroom tests and large-scale tests?
 They both need a test framework specifying the purpose of the test, what is to be measured, to whom the test will be administered, what test format to use, the length of the test, etc.
 They both need to prepare a test blueprint, or table of specifications, that specifies the content, knowledge, and skills to be covered and the number of items to be prepared for each learning outcome (see the sketch after this list).
 There is a need to review the items to ensure that the items measure the intended outcomes, that the problems are unambiguous, that the distracters are plausible, and that the keyed option is correct.
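One convenient way to picture a test blueprint is as a mapping from content areas and cognitive levels to the number of items planned for each cell. The minimal Python sketch below is purely illustrative; the topics, cognitive levels, and item counts are hypothetical and not taken from the chapter.

```python
# Hypothetical table of specifications: content area x cognitive level -> number of items.
# Topics, levels, and counts are illustrative only.
blueprint = {
    "Fractions":   {"Remembering": 4, "Applying": 6, "Analyzing": 2},
    "Decimals":    {"Remembering": 3, "Applying": 5, "Analyzing": 2},
    "Measurement": {"Remembering": 3, "Applying": 4, "Analyzing": 1},
}

# Total test length and per-area weights follow directly from the blueprint.
total_items = sum(sum(levels.values()) for levels in blueprint.values())
for area, levels in blueprint.items():
    area_items = sum(levels.values())
    print(f"{area}: {area_items} items ({area_items / total_items:.0%} of the test)")
print(f"Total items: {total_items}")
```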
[Diagram: Standard Test Development Process, a cycle comprising Create Framework, Make Blueprint, Write Questions, Content Review, Fairness Review, Editorial Review, Stakeholder's Review, Pilot Testing, Statistical Review, and Ready for Use]
 The LSA process, however, spends much more time and effort in carrying out multiple checks and balances. The various types of review to be undertaken, i.e., content, fairness, editorial, stakeholder's, and statistical review, also suggest the involvement of several committees or experts such as curriculum experts, teachers, item developers, testing experts, language specialists, sociologists, psychometricians and statisticians, and large-database specialists.
 These are reflected in two steps which apparently are not done with classroom tests: pilot testing of the test on sample groups whose characteristics are similar to the target population, and the statistical review that establishes the psychometric integrity of the items and of the test as a whole, in terms of gathering empirical evidence for the validity of its score interpretation and its reliability, understood as the consistency of scores obtained across versions of the test.
KEY STEPS IN LARGE-SCALE TEST DEVELOPMENT
 The test development process is basically influenced by the Standards for Educational and Psychological Testing developed by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (1985). While they are regarded as criteria for evaluating tests, they serve as the foundation for the process. Given these standards, ETS has developed its stringent guidelines contained in the 2014 ETS Standards for Quality and Fairness, with specific standards on "Validity," "Scoring," and "Reporting Test Results," in addition to "Test Design and Development."
Steps in Test Development by ETS
Step 1: Defining Objectives
 Who will take the test and for what purpose?
 What skills and/or areas of knowledge should be tested?
 How should test takers be able to use their knowledge?
 What kinds of questions should be included? How many of each kind?
 How long should the test be?
 How difficult should the test be?
Step 2: Item Development Committees, who will be responsible for:
 Defining test objectives and specifications
 Helping ensure test questions are unbiased
 Determining test format (e.g., multiple-choice, essay, constructed-response, etc.)
 Considering supplemental test materials
 Reviewing test questions, or test items, written before
 Writing test questions
Step 3: Writing and Reviewing Questions
Item developers and reviewers must see to it that each item:
 Has only one correct answer among the options provided in the test
 Conforms to the style rules used throughout the test
Step 4: The Pretest
Items are pretested on a sample group similar to the population to be tested. The results should determine the following (a sketch of how such item statistics can be computed appears after this list):
 The difficulty of each question
 Whether questions are ambiguous or misleading
 Whether questions should be revised or eliminated
 Whether incorrect alternative answers should be revised or replaced
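The difficulty index (the proportion of pretest examinees answering an item correctly) and a simple upper-lower discrimination index are the usual first statistics computed from pretest data. A minimal Python sketch follows, assuming dichotomously scored (0/1) responses; the response matrix and the flagging thresholds are hypothetical.

```python
# Minimal sketch: item difficulty and upper-lower discrimination from pilot data.
# "responses" is a list of examinees, each a list of 0/1 item scores (illustrative data).
responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]

n_items = len(responses[0])
n_examinees = len(responses)

# Difficulty index p: proportion of examinees who answered each item correctly.
difficulty = [sum(person[i] for person in responses) / n_examinees for i in range(n_items)]

# Discrimination: compare the top and bottom thirds ranked by total score.
ranked = sorted(responses, key=sum, reverse=True)
k = max(1, n_examinees // 3)
upper, lower = ranked[:k], ranked[-k:]
discrimination = [
    sum(p[i] for p in upper) / k - sum(p[i] for p in lower) / k
    for i in range(n_items)
]

for i, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
    flag = " <- review" if p < 0.2 or p > 0.9 or d < 0.2 else ""
    print(f"Item {i}: difficulty={p:.2f}, discrimination={d:.2f}{flag}")
```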
Step 5: Detecting and Removing Unfair Questions
After pretesting, test reviewers re-examine the items:
 Are there any test questions whose language, symbols, words, or phrases are inappropriate or offensive to any subgroup of the population?
 Are there questions on which one group consistently performs better than other groups? (A simple screening sketch follows this list.)
 Which items need further revision or removal before the final version is made?
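A first-pass empirical screen for the second question is to compare each item's difficulty across subgroups and flag large gaps for the fairness committee. The sketch below is a simplified illustration with hypothetical groups, data, and threshold; operational programs use more formal differential item functioning (DIF) methods such as Mantel-Haenszel.

```python
# Minimal sketch: flag items whose difficulty differs sharply between two subgroups.
# Group labels, responses, and the flag threshold are purely illustrative.
group_a = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]]
group_b = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0], [0, 0, 1, 1]]

def item_difficulty(scores):
    """Proportion correct per item for one subgroup."""
    n = len(scores)
    return [sum(person[i] for person in scores) / n for i in range(len(scores[0]))]

p_a, p_b = item_difficulty(group_a), item_difficulty(group_b)
for i, (pa, pb) in enumerate(zip(p_a, p_b), start=1):
    gap = pa - pb
    flag = " <- refer to fairness review" if abs(gap) > 0.15 else ""
    print(f"Item {i}: group A={pa:.2f}, group B={pb:.2f}, gap={gap:+.2f}{flag}")
```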
Step 6: Assembling the Test
After the test is assembled, item reviewers prepare a list of correct answers and compare it with the existing answer keys:
 Are the intended answers indeed the correct answers?
Step 7: Making Sure the Test Questions Are Functioning Properly
After test administration, statisticians analyze the results to find out if the test is working as intended:
 Is the test valid? Are the score interpretations supported by empirical evidence?
 Is the test reliable? Can performance on one version of the test predict performance on any other version of the test?
 What corrective actions need to be taken when problems are detected, before final scoring is done?
The Definition of Validity and Reliability
 Validity is regarded as the basic requirement of every test: the degree to which a test measures what it is intended to measure. Can the test perform its intended function? This is the business of validity, and it is the view adopted by the classical model of validity. There are three conventional types of validity according to this model: content validity, criterion-related validity, and construct validity.
 Reliability is related to the concept of error of measurement, which indicates the degree of fluctuation likely to occur in an individual's score as a result of irrelevant, chance factors, which Anastasi and Urbina (1997) call error variance. This occurs when the differences between scores are not attributable to the construct being measured but are simply due to chance, something which cannot be controlled.
3 Conventional Types of Validity

Construct validity refers to whether a scale or test measures the construct adequately. An example is the measurement of a construct of the human mind, such as intelligence, level of emotion, proficiency, or ability. It involves examination of the psychological construct hypothetically assumed to be measured by the test, and it can be established by doing a factor analysis of the test items to bring out what defines the overall construct; the analysis shows whether the test measures a unitary construct or a multi-dimensional construct, as indicated by the resultant factors.
Content validity involves examination of the test content to determine whether the items adequately represent the domain of knowledge and skills the test is intended to measure. These "validities" have for a while been what educational and psychological tests are required to establish.
Criterion validity (or criterion-related validity) measures how well one
measure predicts an outcome for another measure. A test has this type of
validity if it is useful for predicting performance or behavior in another
situation (past, present, or future).
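In practice, criterion-related validity is usually reported as a correlation coefficient between test scores and scores on the criterion measure, for example an entrance test correlated with later grades. A minimal Python sketch, with purely hypothetical score lists:

```python
# Minimal sketch: criterion-related validity as the correlation between test scores
# and an external criterion measure. Both score lists are hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

test_scores = [32, 45, 38, 50, 41, 36, 48, 29]          # entrance test scores (hypothetical)
criterion = [2.1, 3.4, 2.8, 3.7, 3.0, 2.5, 3.6, 2.0]    # later GPA (hypothetical)
print(f"Criterion-related validity coefficient: {pearson_r(test_scores, criterion):.2f}")
```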
5 categories of evidence supporting a score interpretation, which have brought about other forms of validity:
 Evidence based on test content
 Evidence based on response processes
 Evidence based on internal structure
 Evidence based on relations to other variables
 Evidence based on consequences of testing
 There are ways of estimating the reliability of a test, and they are grouped according to the number of times the test is administered to the group of students. With two test sessions, there is test-retest reliability, where the same test is given twice with a time interval not exceeding six months, and alternate-form reliability, where two comparable versions of the test are administered to the same individuals. Administration of the two forms can be done immediately, one after the other, or delayed with an interval not exceeding six months. This is also widely known as parallel-form reliability since the two forms emerge from the same table of specifications. The nature and strength of the relationship or correspondence between the two sets of scores is then established using the coefficient of correlation (Anastasi, 1976). This value ranges from -1.0 to +1.0; the closer it gets to +1.0, the more consistent are the scores obtained from the two test trials. To obtain the reliability coefficient in these two types, the Pearson Product Moment Correlation is used to get the coefficient of correlation (r) with this well-known formula:

r = [NΣXY - (ΣX)(ΣY)] / √{[NΣX² - (ΣX)²][NΣY² - (ΣY)²]}
With only a single administration, split-half reliability is workable. This divides the test into two halves using the odd-even split: all the odd-numbered items make up Form A while the even-numbered items compose Form B. The coefficient of correlation between the two half tests is obtained using the Pearson Product Moment Correlation, and the Spearman-Brown formula is then applied to estimate the reliability of the whole test:

r_tt = 2 r_11 / (1 + r_11)

with r_tt as the reliability coefficient for the total test and r_11 as the coefficient of correlation between the two half tests.
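A minimal sketch of the odd-even split-half procedure just described: total the odd- and even-numbered items separately for each examinee, correlate the two half-test scores with Pearson r, then apply the Spearman-Brown formula to estimate the reliability of the full-length test. The 0/1 response matrix below is hypothetical.

```python
# Minimal sketch: odd-even split-half reliability with the Spearman-Brown correction.
# "scores" holds 0/1 item responses per examinee; the data are hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
]

# Form A: odd-numbered items (1st, 3rd, ...); Form B: even-numbered items.
odd_totals = [sum(person[0::2]) for person in scores]
even_totals = [sum(person[1::2]) for person in scores]

r_half = pearson_r(odd_totals, even_totals)   # r_11: correlation between the two half tests
r_total = (2 * r_half) / (1 + r_half)         # Spearman-Brown: r_tt for the full test
print(f"Half-test correlation r_11 = {r_half:.2f}, split-half reliability r_tt = {r_total:.2f}")
```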
Inter-rater reliability assesses the degree to which different judges or raters agree in their assessment decisions. This is quite useful for avoiding doubts about the scoring procedure of tests with non-objective items. The sets of scores obtained on the test from two raters can also be subjected to Pearson r to get the reliability coefficient. Another type of reliability looks at the internal consistency of responses to all items. With the assumption that all items in the test are measures of the same construct, there will be inter-item consistency in the responses of the test takers. The procedure requires knowing how each individual performs (i.e., pass/fail) on each item. The Kuder-Richardson Formula 20 (KR-20) is then applied to estimate the reliability coefficient.
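KR-20 can be computed directly from a 0/1 response matrix as KR-20 = (k / (k - 1)) * (1 - Σ p_i q_i / σ²), where k is the number of items, p_i the proportion passing item i, q_i = 1 - p_i, and σ² the variance of the total scores. A minimal Python sketch with hypothetical data:

```python
# Minimal sketch of Kuder-Richardson Formula 20 (KR-20) for 0/1 item scores.
# The response matrix is hypothetical.
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
]

k = len(scores[0])                     # number of items
n = len(scores)                        # number of examinees
totals = [sum(person) for person in scores]

# Population variance of the total scores.
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n

# Sum of p*q across items, where p is the proportion passing each item.
sum_pq = 0.0
for i in range(k):
    p = sum(person[i] for person in scores) / n
    sum_pq += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(f"KR-20 reliability estimate: {kr20:.2f}")
```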
 Establishing the validity and estimating the reliability of tests are given attention in this last chapter to emphasize their significance in the development process of large-scale tests. Test documentation must include how reliability is estimated, and this need not be limited to only one type: the more evidence there is of the test's reliability, the more credible the test becomes in its fidelity to measurement consistency. In terms of validity, supporting evidence for the possible score interpretations and the actions recommended should be effectively reported. These two technical merits speak well of the test's usability for its recommended uses. With large-scale student assessments now growing in acceptance all over the world, it is important that the integrity of the development process be upheld.
Thank You
&
God Bless !!!
Answer the Following Questions
1. What are the three conventional types of validity?
2. This is a measurement of student learning designed to describe the achievement of students in particular areas of learning across an education system. (2 points)
3. What are the 5 categories of evidence supporting a score interpretation, which have brought about other forms of validity?
4. In your own words, what is the difference between validity and reliability? (4 points)
5. What are the steps in test development by ETS?

Overall Total: 15