EDU431 Short Notes
3. Evaluation
Evaluation is the process of making a value judgment against intended learning outcomes
and behavior, to decide the quality and extent of learning.
Evaluation is always related to your purpose: you align your purpose of teaching with what
students achieved at the end, in terms of the quality and quantity of their learning.
If from these measures you decide whether their learning is getting better or not, how much
students learned of what you taught them, and what the quality or standard of that learning is, then it is
EVALUATION.
Assessment and Evaluation don't exist in a hierarchy; they are parallel and differ in
purpose. Measurement is the source from which we move towards assessment and evaluation, because it
provides the base and evidence to quantify the teaching-learning process. The quantified number has no
meaning until we do assessment or evaluation. Assessment's purpose is to make the teaching-learning
process better so that student learning improves, and evaluation's purpose is to align the learning
with the purpose of teaching.
1. By nature of assessment
i. Maximum Performance Assessment
ii. Typical Performance Assessment
i. Maximum Performance Assessment
The first category, maximum performance assessment, determines what individuals can do when
performing at their best.
Achievement test: through an achievement test we can determine their best abilities. It is
designed to indicate the degree of success in some past learning activity.
Aptitude test: we measure through an aptitude test when we want to predict success in a future
learning activity, e.g. when we want to see the interest of a student in a particular field
like medicine, sport or teaching. We know that different abilities are used for entering different
professions; we make a test depending on these abilities and then try to assess in which abilities
the students perform well.
ii. Typical Performance Assessment
The second category, typical performance assessment, determines what an individual will do
under natural conditions. This type of assessment includes attitude, interest and personality
inventories, observational techniques, and peer appraisal. Here the emphasis is on what students
will do rather than what they can do.
2. By format of assessment
i. Fixed Choice Assessment
ii. Complex Performance Assessment
i. Fixed Choice Assessment
Fixed-choice assessment is used to measure people's knowledge and skills efficiently (that is,
to measure more skills in less time), and for this we usually use fixed-choice items, i.e. multiple-choice
questions, matching exercises, fill-in-the-blanks and true-false. It is called fixed choice because the
person attempting the paper does not need to write the answer, only to choose it.
From these items we can assess students' lower-level learning abilities. This type of
assessment includes standardized multiple-choice tests.
ii. Complex Performance Assessment
Complex performance assessment is used to measure performance in contexts and on
problems valued in their own right. This includes hands-on laboratory experiments, projects,
essays and oral presentations.
E.g. if we want to measure a student's ability to write an essay, this cannot be judged by
fixed-response items.
Topic 4: Use of Assessment in Classroom Instruction
Placement and Diagnostic
In this session students will learn the classification of assessment in terms of its uses in
classroom instruction.
3. Use in classroom instruction
i. Use of Placement Assessment
ii. Use of Diagnostic Assessment
iii. Formative Assessment
iv. Summative Assessment
i. Placement Assessment
Placement assessment determines prerequisite skills, degree of mastery of course goals and
mode of learning. Placement assessment is used when we want to assess a student's prior
knowledge so that we can decide what the student's level is. It is associated with a student's
entry-level performance, to know whether the student has the knowledge required for a particular
course. Through placement assessment, the teacher can know where students should be placed
according to their present knowledge or skills. It determines the level of students' knowledge at
the beginning of a session and helps the teacher plan lessons accordingly. In the classroom, the
teacher can use placement assessment to assess the level of students' knowledge and skills and
then make lesson plans keeping in mind the level and needs of the students.
It also determines the interest and aptitude of students regarding a subject and helps in selecting
the correct path for the future.
Examples
Readiness test: a test used to determine students' knowledge or concepts relating to a
particular course of instruction, i.e. the level of the students.
Aptitude test: used for admission to a particular program.
Pretest: made according to the course objectives; it determines students' present knowledge
of them.
Self-report inventories: determine the student's level through interview or discussion.
ii. Diagnostic Assessment
Diagnostic assessment determines the causes (intellectual, physical, emotional,
environmental) of persistent learning difficulties. E.g. if you have a headache, first you try to
cure it yourself by taking a medicine; if you get relief, fine, but if not, you either change your
medicine or go to a physician. At first the doctor prescribes medicines; if you still have the
headache you go back, and the doctor suggests tests, i.e. blood test, urine test, etc. Finally, by
seeing the test reports, the doctor is able to recognize the cause of the headache, and knowing
the root of your headache, prescribes medicine for that cause. This is diagnosis.
Diagnosis doesn't start on the first day; it is for constant or continuous problems. E.g. if a
student continues to experience failure in reading, mathematics or any other subject despite
the use of prescribed alternative methods, then a diagnosis is indicated. Teachers try to find out
the root of the student's failure.
i. Norm-referenced Assessment
Norm-referenced assessment describes student performance in terms of relative position
within some known group. Norm-referenced tests include standardized aptitude and achievement
tests, teacher-made survey tests, interest inventories and adjustment inventories.
ii. Criterion-referenced Assessment
Criterion-referenced assessment describes student performance according to a specified
domain of clearly defined learning tasks, e.g. adds single-digit whole numbers. Here you don't
compare a student's performance with other students; rather you compare the performance of all
students with a criterion (in our case, that criterion is our learning outcomes). It is most commonly
used in schools to report the achievement of learning outcomes against set goals rather than against
other students. It grades students against pre-defined criteria, and students' grades represent their
mastery of the content. Students with the same level of expertise achieve the same grades. A cut
point is determined to distinguish between failed and successful students, regardless of the scores
of the highest and lowest achievers. It consists of teacher-made tests, custom-made tests from test
publishers, and observational techniques.
In this unit, students will learn about the link between curriculum and assessment. For this purpose,
we proceed with our discussion in reference to the National Curriculum of Pakistan 2006.
Competency
Standards
Benchmarks
Student learning outcomes (SLOs)
Competency
It is a key learning area, for example algebra, arithmetic and geometry in mathematics, or
vocabulary, grammar and composition in English.
Standards
These define the competency by specifying broadly the knowledge, skills and attitudes that
students will acquire, should know and be able to do in a particular key learning area during
twelve years of schooling.
Benchmarks
The benchmarks further elaborate the standards, indicating what the students will accomplish at
the end of each of the five developmental levels in order to meet the standard.
Student Learning Outcomes (SLOs)
These are built on the descriptions of the benchmarks and describe what students will accomplish
at the end of each grade. It is the lowest level of the hierarchy.
SLOs thus sit at the bottom: all the SLOs combine to make a benchmark, benchmarks build up
into standards, and standards define a competency.
Example:
Standard 1: All students will search for, discover and understand a variety of text types through
tasks which require multiple reading and thinking strategies for comprehension, fluency and
enjoyment.
Example:
6.1.1. periodic/formative assessment through homework, quizzes, class tests and group
discussions.
- A clear statement of the specific purpose(s) for which the assessment is being carried
out.
- A wide variety of assessment tools and techniques to measure students' ability to use
language effectively.
- Criteria to be used for determining performance levels for the SLOs for each grade level.
- Procedures for interpretation and use of assessment results to evaluate the learning
outcomes.
- MCQs
- Constructed response
o Restricted response
o Extended response
- Performance tasks
© Copyright Virtual University of Pakistan 17
Test Development and Evaluation-EDU431 VU
Lecture 3
1. A model for how students represent knowledge and develop competence in the subject
domain
2. Tasks or situations that allow the examiner to observe the students' performance
Popular Taxonomies
The taxonomy of Structure of Observed Learning Outcomes (SOLO) was initially developed by
Biggs and Collis in 1982, and then further described by Biggs and Tang in 2007. It defines five
different levels of learner competency:
1. Pre-structural
2. Uni-structural
3. Multi-structural
4. Relational
5. Extended Abstract
DOK (Depth of Knowledge) was presented by Webb in 1997, giving four levels of learning
activities
1. Recall
2. Skill/Concept
3. Strategic Thinking
4. Extended Thinking
Bloom's Taxonomy was presented by Benjamin Bloom in 1956. It consists of a framework
covering the most common objectives of classroom instruction.
These deal with three different domains, with further subcategories in each domain:
1. Cognitive
2. Affective
3. Psychomotor
Cognitive Domain
i. Knowledge
ii. Comprehension
iii. Application
iv. Analysis
v. Synthesis
vi. Evaluation
Affective Domain
i. Receiving
ii. Responding
iii. Valuing
iv. Organization
v. Characterization
Psychomotor Domain
i. Perception
ii. Set
iii. Guided Response
iv. Mechanism
v. Complex Overt Response
vi. Adaptation
vii. Origination
Levels of SOLO
1. Pre-structural
2. Uni-Structural
3. Multi-structural
4. Relational
5. Extended Abstract
1. Pre-structural
Students are simply able to acquire bits of unconnected information and respond to a question in
meaningless way. Example of pre-structural level:
2. Uni Structural
The student shows concrete understanding of the topic, but at this level is only able to respond
to one relevant element from the stimulus or item that is provided.
3. Multi- Structural
The student can understand several components, but the understanding of each remains discrete.
A number of connections are made, but the significance of the whole is not determined. Ideas
and concepts around an issue are disorganized and are not related to one another.
4. Relational
The student can indicate connections between facts and theory, action and purpose, and shows
understanding of several components which are integrated conceptually, showing
understanding of how the parts contribute to the whole. Indicative verbs: compare/contrast,
explain causes, integrate, analyze, relate, apply.
5. Extended Abstract
A student at this level is able to think hypothetically and can synthesize material logically.
The student makes connections not only within the given subject area; the understanding is
transferable and generalizable to different areas. Indicative verbs: theorize, generalize,
hypothesize, reflect, generate.
Levels of DOK
1. Recall
2. Skill/concept
3. Strategic Thinking
4. Extended Thinking
DOK measures the degree to which the knowledge elicited from students on assessments is
as complex as what students are expected to know and do as stated in the curriculum.
Recall
Recall of a fact, information, or procedure. The subject matter at this particular level usually
involves working with facts, terms and/or properties of objects.
Skill/Concept
Use of information or conceptual knowledge that requires mental processing beyond recall,
typically involving two or more steps.
Strategic Thinking
Items falling in this category demand a short-term use of higher order thinking processes, such as
analysis and evaluation, to solve real-world problems with predictable outcomes.
Extended Thinking
Learning outcomes at this level demand extended use of higher order thinking processes such as
synthesis, reflection, assessment and adjustment of plans over time.
There are three main domains of learning and all teachers should know about them and use them
to construct lessons.
• Cognitive Domain
• Affective Domain
• Psychomotor Domain
In 2000-01 revisions to the cognitive taxonomy were spearheaded by one of Bloom's former
students, Lorin Anderson, and Bloom's original partner in defining and publishing the cognitive
domain, David Krathwohl. One of the major changes that occurred between the old and the
newer updated version is that the two highest forms of cognition have been reversed.
Knowledge:
It is defined as the remembering of previously learned material. This may involve the recall of a
wide range of facts, principles and generalizations, and the recall of procedures and
processes.
Sample Question: Define the 6 levels of Bloom's taxonomy of the cognitive domain.
Comprehension:
It is defined as the ability to grasp the meaning of material. The individual can make use of the
content or idea being communicated without necessarily relating it to other content or seeing its
fullest implications. Sample Question: Explain the purpose of Bloom's taxonomy of the cognitive
domain.
Application:
It refers to the ability to use previously learned material in new and concrete situations. The
abstractions may be in the shape of universal ideas, rules or methods. Sample Question: Write an
instructional objective for each level of Bloom's taxonomy.
Analysis:
The breakdown of a concept into its constituent parts such that the relative hierarchy of the
concept is made easy to understand, or the relation between the parts of the concept is elaborated.
Sample Question: Compare and contrast the cognitive and affective domains.
Synthesis:
It refers to the ability to put parts together to form a new whole, such as a unique
communication, a plan of operations, or a set of abstract relations.
Evaluation:
It is concerned with the ability to judge the value of material for a given purpose. Judgments
are made against definite criteria. Sample Question: How far are the different BISEs and
universities developing papers using Bloom's taxonomy? Support your answer with arguments.
Levels
Remembering:
Exhibit memory of previously learned material by recalling facts, terms, basic concepts, and
answers.
Key verbs:
Choose, Define, Find, How, Label, List, Match, Name, Omit, Recall, Relate, Select, Show, Spell,
Tell, What, When, Where, Which, Who, Why
Understanding:
Constructing meaning from different types of functions, be they written or graphic messages, or
activities.
Key verbs: Classify, Compare, Contrast, Demonstrate, Explain, Extend, Illustrate, Infer,
Interpret, Outline, Relate, Rephrase, Show, Summarize, Translate
Applying:
Solve problems in new situations by applying acquired knowledge, facts, techniques and rules in
a different way.
Key verbs: Apply, Build, Choose, Construct, Develop, Experiment with, Identify, Interview,
Make use of, Model, Organize, Plan, Select, Solve, Utilize.
Analyzing:
Breaking materials or concepts into parts, determining how the parts relate to one another, or
how the parts relate to an overall structure or purpose.
Key verbs: Analyze, Assume, Categorize, Classify, Compare, Conclusion, Contrast, Discover,
Dissect, Distinguish, Divide, Examine, Function, Inference, Inspect
Evaluating:
Making judgments based on criteria and standards through checking and critiquing.
Key verbs: Agree, Appraise, Assess, Award, Choose, Compare, Conclude, Criteria, Criticize,
Decide, Deduct, Defend, Determine, Disprove, Estimate
Creating:
Putting elements together to form a coherent or functional whole; reorganizing elements into a
new pattern or structure through generating, planning, or producing.
Key verbs: Adapt, Build, Change, Choose, Combine, Compile, Compose, Construct, Create,
Delete, Design, Develop, Discuss, Elaborate, Estimate, Formulate
These categories range from simple to complex and from concrete to abstract levels of student
learning. It is assumed that the taxonomy represents a cumulative hierarchy, so that mastery of
each simpler category is considered a prerequisite to mastery of the next, more complex one.
2. General objectives
3. Specific objectives
When viewing instructional objectives in terms of learning outcomes, we are concerned with the
products rather than the process of learning.
• Methods books
• Year books
• Curriculum Frameworks
• Test manuals
1. Completeness
2. Appropriateness
3. Soundness
4. Feasibility
General Objectives
• Objectives should be specific enough to provide direction for instruction but not so
specific that instruction is reduced to training
• By stating objectives in general terms, we provide for the integration of specific
facts and skills into complex responses
• General statements give teachers freedom in selecting the methods and materials of
instruction
• Understands concepts
• Interprets graphs
3. State each general objective to include only one general learning outcome
5. Keep each general objective sufficiently free of course content so it can be used with
various units of study
Each general objective must be defined by a sample of specific learning outcomes to clarify how
students can demonstrate that they have achieved the general objective. Until the general
objectives are further defined in this manner, they will not provide adequate direction for assessment.
1. List beneath each general objective a representative sample of specific learning outcomes
that describe the terminal performance students are expected to demonstrate
2. Begin each specific learning outcome with an action verb that specifies observable
performance
3. Make sure that each specific learning outcome is relevant to the general objective it
describes
4. Include enough SLOs to describe adequately the performances of students who have
attained the objectives.
5. Keep the SLOs sufficiently free of course content so that the list can be used with various
units of study
6. Consult reference materials for the specific components of those complex outcomes that
are difficult to define
Why Test?
In the classroom, decisions are constantly being made; teachers face huge numbers of dilemmas
every day. These decisions can be of the following nature:
• Instructional
• Grading
• Diagnostic
• Selecting
• Placement
• Program or Curriculum
• Administrative
These types of decisions are taken at different levels. Some are decided at board/administrative
level, some are taken at school management level, and others are taken in classrooms by
teachers.
Instructional Decisions
Instructional decisions are the nuts-and-bolts decisions made in the classroom by teachers.
These are the most frequently made decisions. Such decisions include:
• Instructional plans
Grading Decisions
Educational decisions based on grades are also made by the classroom teacher, but much less
frequently than instructional decisions. For most students, grading decisions are the most
influential decisions made about them.
Diagnostic Decisions
Diagnostic decisions are those made about a student's strengths and weaknesses and the reasons
behind them. Teachers make diagnostic decisions based on information yielded by informal
teacher-made tests.
Decisions of a diagnostic nature can also be made with the help of standardized tests (to be
discussed in a later session).
Selection Decisions
Selection decisions involve test data used, in part, for accepting or rejecting applicants for
admission into a group, program, or institution.
Placement Decisions
Placement decisions are made after an individual has been accepted into a program. They involve
determining where in the program someone is best suited to begin.
Counseling and Guidance Decisions
Counseling and guidance decisions involve the use of test data to help recommend programs of
study that are likely to be appropriate for the students.
Program or Curriculum Decisions
This type of decision is taken at policy level, where it is decided whether a lesson, unit or subject
will be continued or abandoned for the next academic session according to the national objectives
of education.
Administrative Decisions
Administrative policy decisions may be made at school, district, state or national level.
How to Measure
In classroom assessment, different forms of assessment are utilized. Each form of test has its
own benefits and disadvantages. The most common type of assessment used in classrooms is
written assessment. Tests can be classified as:
• Verbal
• Non-verbal
• Objective
• Subjective
• Teacher Made
• Standardized
• Power
• Speed
Verbal
Emphasize reading, writing, or speaking. Most tests in education are verbal tests.
Non-verbal
Does not require reading, writing or speaking ability; tests composed of numerals or drawings are
examples.
Objective
Refers to the scoring of tests: when two or more scorers can easily agree on whether an answer is
correct or incorrect, the test is an objective one. True-false, multiple-choice and matching tests
are examples.
Subjective
Also refers to scoring: when it is difficult for two scorers to agree on whether an item is correct
or incorrect, the test is a subjective one. Essay tests are an example.
Teacher Made
Constructed solely by the teacher for use in his/her own classroom. This type of test is
custom designed according to the needs and issues of a specific class.
Standardized
Tests constructed by measurement experts over a period of years. They are designed to measure
broad national objectives and have a uniform set of instructions that are adhered to during each
administration.
Most also have tables of norms, against which a student's performance may be compared to
determine where the student stands in relation to a national sample of students of the same age or grade.
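To make the idea of a norms table concrete, here is a minimal sketch in Python of converting a
raw score into a percentile rank against a norm group. The norm sample and the raw score are
invented for illustration; real norms tables are built from large national samples.

from bisect import bisect_right

# Hypothetical norm sample of raw scores for one age/grade group (illustration only).
norm_sample = sorted([12, 15, 18, 18, 20, 22, 22, 23, 25, 27, 28, 30, 31, 33, 35])

def percentile_rank(raw_score, sample):
    # Percentage of the norm group scoring at or below the raw score.
    at_or_below = bisect_right(sample, raw_score)
    return 100.0 * at_or_below / len(sample)

print(percentile_rank(25, norm_sample))  # 60.0 -> outscores about 60% of the norm group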
Power
Tests with liberal time limits that allow each student to attempt each item. Items tend to be
difficult.
Speed
Tests with time limits so strict that no one is expected to complete all items. Items tend to be
easy.
Why Test?
The general purpose of assessment is to gather information to make better and more informed
decisions. The utility of that information is what differentiates types of assessments. In an
earlier session, classification of assessment by method of interpreting results was discussed. This
session will further unpack the complexity of norm- and criterion-referenced assessment.
NRT
A type of test which tells us where a student stands compared to other students. It helps
determine a student's place or rank among a group of similar students. Such a test is
called a norm-referenced test (NRT).
Dimensions
• NRT tends to be general. It measures a variety of skills at the same time but fails to measure
them thoroughly.
• It is hard to make decisions regarding a student's mastery of skills in a subject.
It provides an estimate of ability in a variety of skills in a much shorter time. NRT items are
quite difficult for students to solve: on average, only 50% of students are able to get an item right.
A second type of test tells us about a student's level of proficiency in, or mastery of, some skill
or set of skills. This is achieved by comparing a student's performance to a standard of mastery
called a criterion. A test that yields such information is called a criterion-referenced test (CRT).
Dimensions
• CRT tends to be specific. It measures a particular set of skills at one time and focuses on the
level of achievement of that skill. CRT gives a clear picture of a student's mastery of skills in
the subject.
• It measures skills more thoroughly, so naturally it takes more time than NRT to measure
the mastery of the said skill.
• Items included in CRT are relatively easier. Around 80% of the students are expected to
respond to an item correctly in the test (a short sketch below shows how this proportion is
computed).
Dimensions
• Sampled content in CRT is much more comprehensive; usually three or more items are
used to cover a single objective.
• The meaning of the score does not depend upon comparison with other scores.
• It flows directly from the connection between the items and the criterion.
• Items are chosen to reflect the criterion behavior. Emphasis is placed upon the domain of
relevant responses.
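The 50% and 80% figures above are statements about the item difficulty index p, the proportion
of examinees answering an item correctly. A minimal sketch in Python, with an invented
response matrix, shows how p is computed per item:

# Rows = students, columns = items; 1 = correct, 0 = incorrect (invented data).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]

n_students = len(responses)
for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / n_students
    print("item", item + 1, "p =", round(p, 2))

# NRT items are written so that p is near 0.50 (maximum spread of examinees);
# CRT items are expected to reach p around 0.80 because mastery is the goal.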
NRT and CRT can be compared along the following dimensions:
• Basis of comparison
• Comparison targets
• Selection of items
• Meaning of success
• Average item difficulty
• Score distribution
• Reported scores
Comparison targets
In NRT, a student's score is compared with the scores of the other students in the group; in CRT,
it is compared with a pre-defined criterion of mastery.
Selection of items
Items included in CRT are of a specific nature, designed for the student skilled in a particular
subject. In NRT, items are of a general-knowledge nature: the student should be able to answer
them, but superficial knowledge is sufficient to respond to the item correctly.
Meaning of success and average item difficulty
In CRT, the average item difficulty is fairly high (i.e. the proportion answering correctly is high,
so items are relatively easy), because examinees are expected to show mastery. In NRT, the
average item difficulty is lower, so the test is able to spread out the examinees and provide a
reliable ranking.
Score Distributions
In CRT, a plot of the resulting score distribution will show most of the scores clustering near the
high end of the score scale. In NRT, a broader spread of scores is expected, with a few examinees
earning very low or very high scores and many earning medium scores.
Reported Scores
In CRT, scores are reported in terms of mastery of the defined skills; in NRT, scores are reported
in terms of the student's relative standing or rank within the group.
In an earlier session, classification of assessment by use in classroom instruction was discussed.
This session will further unpack the complexity of formative and summative assessment.
Formative Assessment
Formative assessment provides feedback and information during the instructional process, while
learning is taking place. Formative assessment measures student progress, but it can also assess
your own progress as an instructor.
• Question and answer sessions, both formal (planned) and informal (spontaneous)
• Conferences between the instructor and student at various points in the semester
Summative Assessment
Summative assessment takes place after the learning has been completed and provides
information and feedback that sums up the teaching and learning process. Typically, no more
formal learning is taking place at this stage, other than incidental learning which might take place
through the completion of projects and assignments.
Summative assessment is more product-oriented and assesses the final product, whereas
formative assessment focuses on the process toward completing the product. Once the project is
completed, no further revisions can be made.
If students are allowed to make revisions, the assessment becomes formative.
• Term papers (drafts submitted during the semester would be a formative assessment)
• Performances
Table of specification
One of the tools used by teachers to develop a blueprint for a test is called a "Table of
Specification"; in other words, Table of Specification is the technical name for the blueprint of a
test. It is the first formal step in developing a test.
Concept of Table of Specification
It helps a teacher allot questions to different content areas and Bloom's learning categories in a
systematic manner.
The blueprint is meant to ensure content validity. Content validity is the most important
factor in constructing an achievement test (to be discussed in a later unit).
A unit test or comprehensive exam is based on several lessons and/or chapters in a book,
supposedly reflecting a balance between content areas and learning levels (objectives).
Two way Table of Specification
A Table of Specifications consists of a two-way chart or grid relating instructional objectives to
the instructional content.
A table of specification performs two important functions:
1. It ensures balance and proper emphasis across all content areas covered by the teacher.
2. It ensures the inclusion of items at each level of the cognitive domain of Bloom's
Taxonomy.
Carey (1988) listed six major elements that should be attended to in developing a Table of
Specifications for a comprehensive end of unit exam:
1. Balance among the goals selected for the exam (weighing objectives)
2. Balance among the levels of learning (higher order and lower order)
5. The number of test items for each goal and level of learning
2. Do the specifications indicate the nature and limits of the achievement domain?
5. Is the number of test items indicated for the total test and for each subdivision?
6. Are the types of items to be used appropriate for the outcomes to be measured?
7. Is the difficulty of the items appropriate for the types of interpretation to be made?
Topic 31: Balance among Learning Objectives and their Weight in table of specification
In developing a test blueprint, it is first necessary to select some learning objectives.
Among this list of learning objectives, some objectives are more important, in the sense that more
instructional time is spent on them, while others are less important in terms of the time spent on
them in the classroom. So, in developing a table of specification, balance among these learning
objectives is important; for this purpose we need to weigh the learning objectives and calculate
their relative weightage in the test.
Let us assume the total marks of the test are 100 and objective/content/theme 1 carries 25% of
the instructional weight. Then 25 marks should be allocated to questions related to it.
Step 3
Check each allocation against the intended weight:
(point total of questions for the objective / total points on the examination) * 100 = % of examination value.
For the example above: (25/100)*100 = 25%.
It can be a bit tricky if the total marks of the test are 50: then 25% of 50 is 12.5 marks, which
has to be rounded to a whole number of marks in practice.
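The computation above is small enough to script. A minimal sketch in Python, using the weights
and totals from this example, checks the allocation both ways:

def marks_for_objective(weight_percent, total_marks):
    # Marks to allot to an objective, given its intended weight in percent.
    return weight_percent / 100.0 * total_marks

def percent_of_exam(points_for_objective, total_points):
    # (point total of questions for objective / total points on examination) * 100
    return points_for_objective / total_points * 100.0

print(marks_for_objective(25, 100))  # 25.0 marks on a 100-mark test
print(marks_for_objective(25, 50))   # 12.5 marks -> round to 12 or 13 in practice
print(percent_of_exam(25, 100))      # 25.0 (% of examination value)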
Topic 32: Balance among the Levels of Learning Objectives in Table of Specification
We have learnt to give weightage to content areas in a table of specification. Now we look at an
example of developing a table of specification practically. Following is a table of specification
comprising the topics to be covered in the test and their weightage, which represents the
percentage of marks for each topic.
Pakistan Movement
Time: (100/500)*100 = 20%
Geography of Pakistan
Time: (150/500)*100 = 30%
Climate Change
Time: (150/500)*100 = 30%
Industries
Time: (50/500)*100 = 10%
Economy
Time: (50/500)*100 = 10%
Let's consider that we have to develop a test of 50 marks according to the above discussed table
of specification; the distribution of marks for each topic is then as under.
Pakistan Movement 10 (20%)
Geography of Pakistan 15 (30%)
Climate Change 15 (30%)
Industries 5 (10%)
Economy 5 (10%)
Then we have to consider the importance of each topic for the cognitive level of questions
according to Bloom's Taxonomy.
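A minimal sketch in Python of the resulting two-way allocation. The topic weights are those of
the worked example; the Bloom-level split (40% knowledge, 40% comprehension, 20%
application) is assumed purely for illustration, since the notes do not fix one:

topic_weights = {                     # percent of the 50-mark test per topic
    "Pakistan Movement": 20, "Geography of Pakistan": 30,
    "Climate Change": 30, "Industries": 10, "Economy": 10,
}
level_weights = {"Knowledge": 40, "Comprehension": 40, "Application": 20}  # assumed
total_marks = 50

for topic, t_w in topic_weights.items():
    row = {level: round(total_marks * t_w / 100 * l_w / 100, 1)
           for level, l_w in level_weights.items()}
    print(topic, row)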
Published tests supplement and complement informal classroom tests, and aid in many
instructional decisions.
Published tests are designed and conducted in such a manner that each and every characteristic
is pre-planned and known.
There are many published tests available for school use. The two of most value to the
instructional program are:
1. Achievement tests
2. Aptitude tests
There are hundreds of tests available of each type, and selecting the most appropriate one is an
important task. In some cases published tests are used by teachers, but more frequently they are
used by provincial or national testing programs.
In classrooms, the most used published tests are:
1. Achievement tests
2. Reading test
Published tests commonly used by provincial or national testing programs are:
1. Aptitude tests
2. Readiness tests
3. Placement tests
Topic 34: Standards for selecting appropriate test
In this session students will learn:
1. Evaluate the procedures used by test developers to avoid potentially insensitive content or
language
2. Review the performance of test takers of different races, genders, and ethnic groups when
samples of sufficient size are available.
3. Evaluate the extent to which performance differences may have been caused by
inappropriate characteristics of the test.
4. Use appropriately modified forms of tests or administration procedures for test takers
with handicapping conditions.
Whatever the type of assessment and however the results are to be used, all assessments should
possess certain characteristics. The most essential of these are:
Validity
Reliability
Usability
Validity
Reliability
Reliability vs Validity
Reliability of measurement is needed to obtain valid results, but we can have reliability
without validity. Reliability is a necessary but not sufficient condition for validity.
Usability
In addition to validity and reliability, an assessment procedure must meet certain practical
requirements, which include feasibility, administration environment and availability of results
for decision makers.
1. Nature of validity.
Validity is referred to as "validity of a test", but it is in fact the validity of the interpretation and
use to be made of the results.
It does not exist on an all-or-none basis. It is best considered in terms of categories that specify
degree, such as high, moderate or low validity.
No assessment is valid for all purposes. An arithmetic test may have a high degree of validity for
computational skill and a low degree for arithmetical reasoning.
Validity does not have different types. It is viewed as a unitary concept based on different kinds
of evidence.
1. Evidence of validity
2. Concept of content validity
3. Procedure to find content validity
4. Method of ensuring content validity.
Content
Construct
Criterion
Meaning
How well the sample of assessment tasks represents the domain of the tasks to be measured.
Procedure
It compares the assessment tasks to the specifications describing the task domain under
consideration
Method
Experts in the subject matter judge how well the assessment tasks match the specifications
describing the task domain.
Meaning
How well a test measures up to its claims. A test designed to measure depression must only
measure that particular construct, not closely related ideas such as anxiety or stress.
Procedure
Example: specification of the parts of an essay.
Introduction Paragraph: It introduces the main idea, captures the interest of the reader and tells
why the topic is important.
1. A single sentence called the thesis statement is written.
2. Background information about your topic is provided.
3. Definitions of important terms are written.
Supporting Paragraphs: Supporting paragraphs make up the main body of your essay.
1. List the points about the main idea of the essay.
2. Write a separate paragraph for each supporting point.
3. Develop each supporting point with facts, details, and examples.
Method
1. Expert judgment
Experts of the field are consulted. For the above example, people who are expert in essay
writing would be asked to assess the construct validity of the table, and the table would be
revised under their guidance.
2. Factor analysis
Here we group the questions by keeping in view the responses of respondents to them.
Meaning
Demonstrates the degree of accuracy of a test by comparing it with another test, measure or
procedure which has already been demonstrated to be valid.
Concurrent validity
This approach shows that a test is valid by comparing it with an already valid test.
Predictive
It involves testing a group of subjects on a certain construct, and then comparing them with
results obtained at some point in the future.
Procedure
Compare assessment results with another measure of performance obtained at a later date (for
prediction) or with another measure of performance obtained concurrently (for estimating
present status)
Method
The degree of relationship can be described more precisely by statistically correlating the two
sets of scores. The resulting correlation coefficient provides a numerical summary of the
relationship.
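A minimal sketch in Python of that correlation step; the two score lists are invented. A
coefficient near +1 would count as strong criterion-related evidence:

import math

assessment = [55, 60, 62, 70, 75, 80, 85]   # scores on the new test
criterion  = [50, 58, 65, 68, 72, 83, 88]   # criterion measure (later or concurrent)

def pearson_r(x, y):
    # Pearson product-moment correlation between two score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r(assessment, criterion), 3))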
Meaning
How well the use of assessment results accomplishes intended purposes and avoids unintended effects.
Procedure
Evaluate the effects of the use of assessment results on teachers and students. Both the intended
positive effects (e.g., increased learning) and possible unintended negative effects (e.g., dropping
out of school) need to be evaluated.
Considerations
• Unclear directions
• Ambiguity
• Overemphasis on easy-to-assess aspects of the domain at the expense of important, but
hard-to-assess, aspects
1. Nature of reliability.
1. Reliability refers to the results obtained with an assessment instrument and not to the
instrument itself.
2. An estimate of reliability always refers to a particular type of consistency (stability,
equivalence, internal consistency).
3. Reliability is a necessary but not sufficient condition for validity.
4. Reliability is primarily statistical: coefficients range between -1 and +1.
Characteristics
1. Stability:
2. Equivalence:
3. Internal consistency:
In determining reliability, it would be desirable to obtain two sets of measures under identical
conditions and then to compare the results.
The reliability coefficient resulting from each method must be interpreted according to the type
of consistency being investigated.
• Test-Retest (stability)
1. Test-retest method
• It gives the same test twice to the same group, with a time interval between tests that can
range from several minutes to several years.
Test-Retest
September 25 (Form A)          October 15 (Form A)
1. Item a  yes                 1. Item a  yes
2. Item b  no                  2. Item b  no
3. Item c  yes                 3. Item c  yes
• A very long interval will influence results through instability and actual changes in students
over time.
2. Equivalent-forms method
• It gives two forms of the test to the same group in close succession.
Equivalent Forms
September 25 (Form A)          September 25 (Form B)
1. Item a                      1. Item d
2. Item b                      2. Item e
3. Item c                      3. Item f
Score = 82                     Score = 78
3. Test-retest with equivalent forms
• It gives two forms of the test to the same group with an increased time interval between forms.
Test-Retest with Equivalent Forms
September 25 (Form A)          October 15 (Form B)
1. Item a                      1. Item d
2. Item b                      2. Item e
3. Item c                      3. Item f
Score = 82                     Score = 74
4. Split-half method
• It gives the test once, scores two equivalent halves of the test (e.g. odd-numbered and
even-numbered items), and corrects the correlation between the halves to fit the whole test using
the Spearman-Brown formula: reliability of whole test = 2 x r_half / (1 + r_half).
Split-half reliabilities tend to be higher than equivalent-forms reliabilities because the split-half
method is based on the administration of a single assessment.
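A minimal sketch in Python of the split-half procedure with the Spearman-Brown correction;
the odd- and even-half scores are invented:

import math

odd_half  = [10, 12, 15, 9, 14, 11, 13]   # each student's score on odd items
even_half = [11, 12, 14, 8, 15, 10, 12]   # the same students' scores on even items

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_half = pearson_r(odd_half, even_half)
r_whole = 2 * r_half / (1 + r_half)   # Spearman-Brown correction to full length
print(round(r_half, 3), round(r_whole, 3))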
5. Kuder-Richardson methods
• It gives the test once, scores the total test, and applies a Kuder-Richardson formula.
As with the split-half method, these formulas provide an index of internal consistency but do not
require splitting the assessment in half for scoring purposes.
One formula, KR20, is applicable only when student responses are scored dichotomously (0 or 1).
It is most useful with traditional test items scored correct or incorrect.
The generalization of KR20 for assessments that have more than dichotomous, right-wrong
scores is called Coefficient Alpha.
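A minimal sketch in Python of KR20 for dichotomously scored items, using the standard
formula KR20 = k/(k-1) * (1 - sum(p*q) / variance of total scores). The response matrix is
invented, and population variance is used, one common convention:

from statistics import pvariance

responses = [        # rows = students, columns = items; 1 = correct
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
]

n = len(responses)                   # number of students
k = len(responses[0])                # number of items
totals = [sum(row) for row in responses]

sum_pq = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / n   # item difficulty
    sum_pq += p * (1 - p)                      # p*q for each item

kr20 = (k / (k - 1)) * (1 - sum_pq / pvariance(totals))
print(round(kr20, 3))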
Inter-Rater Method
• It gives a set of student responses requiring judgmental scoring to two or more raters and
has them independently score the responses.
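As a minimal illustration in Python, two raters' scores (invented) can be compared by percent
exact agreement; a correlation coefficient between the two sets of ratings serves the same purpose:

rater_1 = [4, 3, 5, 2, 4, 3, 5, 1]   # first rater's scores for eight responses
rater_2 = [4, 3, 4, 2, 4, 2, 5, 1]   # second rater's independent scores

agreements = sum(a == b for a, b in zip(rater_1, rater_2))
print(100 * agreements / len(rater_1), "% exact agreement")   # 75.0 % exact agreement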
Many outcomes in the cognitive domain, such as those pertaining to knowledge, understanding,
and thinking skills, can be measured by paper pencil tests. But there are still many learning
outcomes that require informal observation of natural interactions.
1. Observing students as they perform and describing or judging that behavior (anecdotal
record).
2. Asking their peers about them and assessing social relationships (Peer appraisal).
3. Questioning them directly and assessing expressed interests (Self-appraisal).
4. Measuring progress by recorded work (portfolio).
Anecdotal records
Impressions gained through observation are apt to provide an incomplete and biased picture
unless we keep an accurate record of our observations. The method for doing so is called
anecdotal records.
Anecdotal records are factual descriptions of meaningful incidents and events that the teacher
observes.
One should keep in mind the following points to use anecdotal records effectively.
1. Peer appraisal
Peer appraisal
In this procedure students rate their peers on the same rating device used by their teacher.
It depends on greatly simplified procedures.
The guess-who technique is based on the nomination method of obtaining peer ratings and is
scored by simply counting the number of mentions each student receives on each description.
Sociometric technique
This form is used to measure students' acceptance as seating companions, work companions
and play companions.
1. The choices should be real choices that are the natural part of classroom activities.
2. The basis for the choice and restriction on the choosing should be made clear.
3. All students should be equally free to participate in the activity or situation.
4. The choices each student makes must be kept confidential.
5. The choices should actually be used to organize or rearrange the group.
1. Portfolio
2. Weakness and strengths of portfolio
Portfolio
Systematic collection of students' work into portfolios can serve a variety of instructional and
assessment purposes. The value of portfolios depends heavily on the clarity of purpose, the
guidelines for the inclusion of materials, and the criteria to be used in evaluating portfolios.
1. Specify purpose.
2. Provide guidelines for selecting portfolio entries.
3. Define the student's role in selection and self-evaluation.
4. Specify evaluation criteria.
5. Use portfolios in instruction and communication.
Strengths of portfolios
Weaknesses of portfolios
1. Purpose of Portfolio
2. Guidelines for portfolio entries.
Purposes of portfolios
Fundamentally, there are two global purposes for creating portfolios of student work:
assessment and instruction. A portfolio can be used to showcase students' accomplishments and
document their progress.
Instructional purposes:
When the primary purpose is instruction, the portfolio might be used as a means of:
Assessment purposes:
When the focus is on accomplishments, portfolios are usually limited to finished work and may
cover only a relatively small period of time.
When the focus is on demonstrating growth and development, the time frame is longer. The
portfolio will include multiple versions of the same work over time to measure progress.
It contains student-selected entries; it demonstrates the student's ability to choose his best work,
which demonstrates his ability to do a task.
It implies that the work is complete for a specific audience, a job-application portfolio for
example; it is a finished product for a specific audience.