
Module 3: Guiding Principles for the Assessment of Student Learning

Lesson 2. Principles of Validity and Reliability

OVERVIEW

In the previous module, you learned three principles of assessment: clarity and appropriateness of learning targets, appropriateness of methods, and balance. This lesson deals with the principle of validity and the principle of reliability. Validity and reliability ensure high-quality assessment. Both are considered whenever tests are used to measure learning outcomes.

LEARNING OUTCOMES

At the end of this module, you are expected to:


1. identify the ways of establishing validity;
2. identify the sources of reliability evidence; and
3. cite evidence of validity and reliability in teacher-made tests.

LET US EXPLORE

Activity 1

It is not unusual to receive comments or complaints from students regarding the tests administered by their teachers. Was there an instance in your studies when your teacher gave you a test about topics that were not part of the lesson discussed in class? What other test issues did you experience in your studies? Share your experiences here.

___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________


CONTENT FOCUS
Principle 4: VALIDITY
Validity – is the degree to which the assessment instrument measures what it intends to measure. It also refers to the usefulness of the instrument for a given purpose. It is the most important criterion of a good assessment instrument.
Validity – describes the accuracy and soundness of evaluations made regarding
student performance.
An assessment is valid if it measures a student’s actual knowledge and performance with
respect to the intended outcomes, and not something else.
For example, an assessment purportedly measuring the arithmetic skills of grade 4 pupils is invalid if used for grade 1 pupils because of issues with content (test content evidence) and level of performance (response process evidence).
Another example: a test that measures recall of math formulas is invalid if it is supposed to assess problem solving.
The above examples are validity problems particularly on content-related evidence.
CONTENT-RELATED EVIDENCE FOR VALIDITY pertains to the extent to which the test
covers the entire domain of content. If a summative test covers a unit with four topics, then
the assessment should contain items from each topic. This is done through adequate sampling
of content. A student’s performance in the test may be used as an indicator of his/her
content knowledge. For example, if a grade 4 pupil was able to correctly answer 80% of the
items in a Science test about matter, the teacher may infer that the pupil knows 80% of the
content area.
A test that appears to adequately measure the learning outcomes and content is said to
possess face validity.
Another consideration related to content validity is instructional validity, the extent to which an assessment is systematically sensitive to the nature of the instruction offered. This is closely related to instructional sensitivity, which is defined by Popham (2006) as the "degree to which students' performance on a test accurately reflects the quality of instruction to promote students' mastery of what is being assessed".
An instructionally valid test is one that registers differences in the amount and kind of
instruction to which students have been exposed (Yoon and Resnick, 1998).
Let us consider the Grade 10 curriculum in Araling Panlipunan. In the first grading period, the class will cover three economic issues: unemployment, globalization, and sustainable development. Suppose only two were discussed in class but the assessment covered all three issues. Although these were all identified in the curriculum guide and may even be found in a textbook, the question remains as to whether the topics were all actually taught.


Inclusion of items that were not taken up in class reduces validity because students had no opportunity to learn the knowledge or skill being assessed.
To improve the validity of assessment, it is recommended that the teacher construct a two-dimensional grid called a Table of Specifications (ToS).
A ToS is prepared before developing the test. It is a test blueprint that identifies the
content area and describes the learning outcomes at each level of the cognitive domain.
Validity suffers if the test is too short to sufficiently measure behavior and cover the
content. Adding more items to the test may increase its validity. However, an excessively
long test may be taxing to the students. Regardless of the trade-off, teachers must construct
tests that the students can finish within a reasonable time. It would be helpful if teachers also
provide students with tips on how to manage their time.
CRITERION-RELATED EVIDENCE FOR VALIDITY refers to the degree to which test scores agree with an external criterion. It examines the relationship between an assessment
and another measure of the same trait.
CRITERION-RELATED EVIDENCE has two types: Concurrent validity and Predictive
validity.
Concurrent validity – provides an estimate of a student’s current performance in relation to
a previously validated or established measure. For example, a school has developed a new
intelligence quotient (IQ) test. Results from this test are statistically correlated to the
results from a standard IQ test. If the statistical analysis reveals a strong correlation
between the two sets of scores, then there is a high criterion validity. It is important to
mention that data from the two measures are obtained at about the same time.
Predictive validity – pertains to the power or usefulness of the test to predict future
performance. For example, can scores in the admission test (predictor) be used to predict
college success (criterion)? If there is a significantly high correlation between
admission score and the first year grade point average (GPA) assessed a year later,
then there is predictive validity.
In testing correlations between two data sets for both concurrent and predictive validity, the
Pearson correlation coefficient (r) or Spearman’s rank order correlation may be used.
The square of the correlation coefficient (r²) is called the coefficient of determination. In the previous example, where the admission test is the predictor and the college GPA is the criterion, a coefficient of determination r² = 0.25 means that 25% of students' college grades may be explained by their scores in the admission test. It also implies that there
are other factors that contribute to the criterion variable. Teachers may then look into other
variables like study habits and motivation.
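As a concrete illustration, here is a minimal Python sketch of how the Pearson r and the coefficient of determination described above could be computed. The admission scores and GPA values are invented for illustration only and are not data from the text.

```python
import numpy as np

# Hypothetical data: admission test scores (predictor) and first-year GPA (criterion)
admission = np.array([78, 85, 92, 66, 74, 88, 81, 70, 95, 60])
gpa = np.array([2.4, 2.9, 3.4, 2.0, 2.3, 3.1, 2.8, 2.2, 3.6, 1.9])

# Pearson correlation coefficient between predictor and criterion
r = np.corrcoef(admission, gpa)[0, 1]

# Coefficient of determination: proportion of variation in the criterion
# that can be explained by the predictor
r_squared = r ** 2

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```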
CONSTRUCT-RELATED EVIDENCE - is an assessment of the quality of the instrument
used. It measures the extent to which the assessment is a meaningful measure of an
unobservable trait or characteristic (McMillan, 2007).
A construct is an individual characteristic that explains some aspect of behavior (Miller, Linn
& Gronlund, 2009). Construct validity can take the form of a differentiated groups study. For
instance, a test in problem-solving strategies is given to two groups of students – those
specializing in Math and those specializing in Social Science. If the Math group performs
better than the other group, then there is evidence of construct validity.
There are two methods of establishing construct validity: convergent and divergent.


Convergent validity occurs when measures of constructs that are related are in fact
observed to be related.
Divergent (or discriminant) validity occurs when constructs that are unrelated are in fact observed not to be related.
Let us consider a test administered to measure knowledge and skills in geometrical reasoning, which is far different from a reading construct. Hence, a comparison of test results from these two constructs will show a lack of commonality. This is what we mean by divergent or discriminant validity.
Convergent validity is obtained when test scores in geometrical reasoning turn out to be
highly correlated to scores from another geometry test measuring the same construct. These
forms of construct validity evidence rely on statistical procedures.
Much has been discussed about the validity of traditional assessment (paper-and-pencil tests). What about other methods such as PERFORMANCE ASSESSMENTS? The same validity concepts apply.
Developing performance assessments involves THREE STEPS:
1. define the purpose (what essential skills students need to develop and what content is worthy of understanding)
To acquire validity evidence in terms of content, performance assessment should be
reviewed by qualified content experts.
2. choose the activity
In choosing the activity, Moskal (2003) laid down five recommendations:
a. the selected performance should reflect a valued activity;
b. the completion of performance should provide a valuable learning experience;
c. the statement of goals and objectives should be aligned with measurable
outcomes of the performance activity;
d. the task should not examine extraneous or unintended variables; and
e. performance assessment should be fair and free from bias.
3. develop criteria for scoring
In scoring, a rubric or rating scale has to be created. Teachers must exercise caution because distracting factors like a student's handwriting and the neatness of the product can affect rating. Additionally, personal idiosyncrasies may infringe on the objectivity of the teacher/rater, which lowers the validity of the performance assessment.

In controlled conditions, ORAL QUESTIONING has high validity. Nonetheless, a checklist defining the outcomes to be covered and the standards/criteria to be achieved can ensure the validity of the assessment. It is recommended, though, that for summative purposes there be a structured or standard list of questions.

For OBSERVATIONS, operational and response definitions should accurately describe the behavior of interest. Observation is highly valid if evidence is properly recorded and interpreted. Additionally, validity is stronger if additional assessment strategies are used alongside observation, such as interviews, surveys, and quantitative methods like tests. In qualitative research, this is called TRIANGULATION – a technique to validate results through cross-verification from two or more sources.


Validity in SELF-ASSESSMENT is described by Ross (2006) as the agreement between self-assessment ratings and teacher judgments or peer ratings. Studies on the validity of self-assessments have mixed results. In many cases, student self-assessment scores are higher than teacher ratings. This is especially true for younger learners because of cognitive biases and wishful thinking that lead to distorted judgments. Self-assessment ratings also tend to be inflated when they contribute to students' final grades.
To increase validity, students should be informed of the domain in which the task is
embedded. They should be taught how to objectively assess their work based on clearly
defined criteria and to set aside any self-interest bias. Informing them that their self-assessment ratings will be compared to those made by their teacher and peers may also induce them to make accurate assessments of their own performance.
No single type of assessment method can assess the vast array of learning and
development outcomes (Miller, Linn & Gronlund, 2009). For this reason, teachers should
use a wide range of assessment tools to build a complete profile of students’ strengths,
weaknesses and intellectual achievements. Teachers may opt to give multiple choice
questions to assess knowledge, understanding and application of theories. However,
selected-response tests do not provide students the opportunity to practice and demonstrate their writing skills. Hence, these should be balanced with other forms of assessment like essays. Additionally, direct methods of assessment should be coupled with indirect methods such as student surveys, interviews, and focus group discussions. While direct assessment examines actual evidence of student outcomes, indirect assessment gathers perception data or feedback from students or other persons who may have relevant information about the quality of the learning process.

THREATS TO VALIDITY
Miller, Linn and Gronlund (2009) identified ten factors that affect the validity of assessment results. These factors are defects in the construction of assessment tasks that would render assessment inferences inaccurate. The first four factors apply to both traditional tests and performance assessments. The remaining factors concern brief constructed-response and selected-response items.
1. Unclear test directions
2. Complicated vocabulary and sentence structure
3. Ambiguous statements
4. Inadequate time limits
5. Inappropriate level of difficulty of test items
6. Poorly constructed test items
7. Inappropriate test items for outcomes being measured
8. Short test
9. Improper arrangement of items
10. Identifiable pattern of answer
SUGGESTIONS FOR ENHANCING VALIDITY (McMillan, 2007)
1. Ask others to judge the clarity of what you are asking.
2. Check to see if different ways of assessing the same thing give the same result.
3. Sample a sufficient number of examples of what is being assessed.
4. Prepare a detailed table of specifications.

5. Ask others to judge the match between the assessment items and the objectives of
the assessment.
6. Compare groups known to differ on what is being assessed.
7. Compare scores taken before to those taken after instruction.
8. Compare predicted consequences to actual consequences.
9. Compare scores on similar but different traits.
10. Provide adequate time to complete the assessment.
11. Ensure appropriate vocabulary, sentence structure and item difficulty.
12. Ask easy questions first.
13. Use different methods to assess the same thing.
14. Use only for intended purpose.

WAYS IN ESTABLISHING VALIDITY


1. FACE VALIDITY – is done by examining the physical appearance of the instrument
to make it readable and understandable. As the term suggests, it looks at the
superficial face value of the instrument. It is based on the subjective opinion of the
one reviewing it. Hence, it is considered non-systematic and non-scientific.
2. CONTENT VALIDITY – is done through a careful and critical examination of the
objectives of assessment to reflect the curricular objectives.
3. CRITERION-RELATED VALIDITY – is established statistically by correlating the set of scores from the measuring instrument with the scores obtained on another external predictor or measure. It has two forms: concurrent and predictive.
a. Concurrent validity – describes the present status of the individual by correlating the sets of scores from two measures given at a close interval.
b. Predictive validity – describes the future performance of an individual by correlating the sets of scores obtained from two measures given at a longer time interval.
4. CONSTRUCT VALIDITY- is established statistically by comparing psychological
traits or factors that theoretically influence scores in a test.
a. Convergent validity – is established if the instrument correlates with another measure of a trait similar to the one it intends to measure (e.g., a Critical Thinking Test may be correlated with a Creative Thinking Test).
b. Divergent validity – is established if the instrument describes only the intended trait and does not correlate with measures of other traits (e.g., a Critical Thinking Test may not be correlated with a Reading Comprehension Test).

Principle 5: RELIABILITY
Reliability – refers to the consistency of scores obtained by the same person when retested using the same or an equivalent instrument.
Reliability talks about reproducibility and consistency in methods and criteria. An
assessment is said to be reliable if it produces the same results when given to an examinee on two occasions. It is important, then, to stress that reliability pertains to the obtained
assessment results and not to the test or any other instrument. Another point is that reliability
is unlikely to turn out 100% because no two tests will consistently produce identical results.
Even the same test administered to the same group of students after a day or two will have
some differences. There are environmental factors like lighting and noise that affect reliability. Student error and physical well-being of examinees also affect consistency of assessment results.
For a test to be valid, it has to be reliable. Let us look at an analogous situation. For
instance, a weighing scale is off by 6 pounds. You weighed a dumbbell for seven
consecutive days. The scale revealed the same measurement, hence the results are
reliable. However, the scale did not provide an accurate measure and therefore is not valid.
From the foregoing, reliability is a necessary condition for validity but not a sufficient one.
Similarly, a test can be found reliable, but this does not imply that the test measures what it purports to measure.
Reliability is expressed as a correlation coefficient. A high reliability coefficient denotes that if a similar test is re-administered to the same group of students, test results from the first and second testing will be comparable.
Types of Reliability
There are two types of reliability: internal and external reliability. Internal reliability
assesses the consistency of results across items within a test whereas external reliability
gauges the extent to which a measure varies from one use to another.
Sources of Reliability Evidence
1. Stability
The test-retest reliability correlates scores obtained from two administrations of the
same test over a period of time. It is used to determine the stability of test results over time. It assumes that there is no considerable change in the construct between the first and the second testing. Hence, timing is critical because characteristics may change if the time interval is too long. A short gap between the testing sessions is also not recommended
because subjects may still recall their responses. Typically, test-retest reliability coefficients
for standardized achievement and aptitude tests are between 0.80 and 0.90 when the
interval between testing is 3 to 6 months (Nitko & Brookhart, 2011). Note that reliability
coefficient is an index of reliability.
2. Equivalence
Parallel-forms reliability ascertains the equivalence of forms. In this method, two different versions of an assessment tool are administered to the same group of individuals. However, the items are parallel, i.e., they probe the same construct, base knowledge, or skill.
The two sets of scores are then correlated in order to evaluate the consistency of results
across alternate versions.
3. Internal consistency
Internal consistency implies that a student who has attained mastery will answer all or most of the items correctly, while a student who knows little or nothing about the subject matter will get all or most of the items wrong. To check for internal consistency, the split-half method can be used. The split-half method is done by dividing the test into two halves, either the first half and the second half of the test or the odd- and even-numbered items, and then correlating the results of the two halves. The Spearman-Brown formula is then applied; it is a statistical correction that estimates the reliability of the whole test rather than each half of the test.
Whole test reliability = (2 × reliability of half test) / (1 + reliability of half test)


For instance, assume that the correlation between the two half scores is 0.70. Let us estimate the whole-test reliability using the formula above.

Whole test reliability = 2(0.70) / (1 + 0.70) = 1.40 / 1.70 ≈ 0.82
The split-half method is effective for long questionnaires or tests with several items measuring the same construct.
To improve the reliability of a test using this method, items with low correlations are either removed or modified.
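A minimal Python sketch of the split-half procedure with the Spearman-Brown correction is shown below. The odd/even split mirrors the description above; the score matrix layout (rows = students, columns = items) is an assumption made for illustration.

```python
import numpy as np

def split_half_reliability(scores):
    """Estimate whole-test reliability from an odd/even split of items,
    corrected with the Spearman-Brown formula."""
    scores = np.asarray(scores, dtype=float)   # rows = students, columns = items
    odd_half = scores[:, 0::2].sum(axis=1)     # totals on odd-numbered items
    even_half = scores[:, 1::2].sum(axis=1)    # totals on even-numbered items
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return (2 * r_half) / (1 + r_half)         # Spearman-Brown correction

# The correction alone, using the worked example above (half-test r = 0.70)
print((2 * 0.70) / (1 + 0.70))   # about 0.82
```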
There are two other ways to establish internal consistency: Cronbach's alpha and the Kuder-Richardson (KR) 20/21 formulas. Cronbach's alpha is a better measure than split-half because it gives the average of all possible split-half reliabilities. It measures how well items in a scale (e.g., 1 = strongly disagree to 5 = strongly agree) correlate with one another. The Kuder-Richardson 20/21 formulas are applicable to dichotomous items (0/1): an item is scored 1 if answered correctly and zero otherwise.
For internal consistency, reliability coefficients are commonly interpreted as follows: less than 0.50 – low reliability; between 0.50 and 0.80 – moderate reliability; and greater than 0.80 – high reliability.
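Below is a rough Python sketch of Cronbach's alpha computed directly from its definition (for 0/1 items this is equivalent to KR-20). The six-student, five-item quiz data are hypothetical and only illustrate the calculation.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a students-by-items score matrix.
    For dichotomous (0/1) items this is equivalent to KR-20."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]                                  # number of items
    item_variances = X.var(axis=0, ddof=1).sum()    # sum of item variances
    total_variance = X.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-item quiz scored 1 (correct) / 0 (wrong) for six students
quiz = [[1, 1, 1, 0, 1],
        [1, 0, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 0]]
print(round(cronbach_alpha(quiz), 2))   # compare against the low/moderate/high ranges above
```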
4. Scorer or Rater consistency
People do not necessarily rate in a similar way. They may have disagreements as to
how responses or materials truly reflect or demonstrate knowledge of the construct or skill
being assessed. Moreover, certain characteristics of the raters contribute to errors such as bias, the halo effect, mood, and fatigue. Bias is partiality or a display of prejudice in favor of or against a student or group. Hence, a teacher who plays favorites in class may give high marks to students based on preference or liking, and not on students' performance. The halo effect is a cognitive bias that allows first impressions to color one's judgment of another person's specific traits. For instance, when a teacher finds a student personable or well behaved, this may lead the teacher to assume that the student is also diligent and participative before he/she can actually observe or gauge this objectively. The halo effect can be traced back to Edward Thorndike's 1920 study entitled "A Constant Error in Psychological Ratings". The problem with the halo effect is that even teachers who are aware of it may not realize that it is already occurring in their judgments.
Just as several items can improve the reliability of a standardized test, having
multiple raters can increase reliability. Inter-rater reliability is the degree to which different
raters, observers or judges agree in their assessment decisions. To mitigate rating errors, a
wise selection and training of good judges and use of applicable statistical techniques are
suggested.
Inter-rater reliability is useful when grading essays, writing samples, performance assessments, and portfolios. However, certain conditions must be met: there should be a good variation of products to be judged, the scoring criteria must be clear, and the raters must be knowledgeable or trained in how to use the observation instrument.

To estimate inter-rater reliability, Spearman's rho or Cohen's kappa may be used to calculate the coefficient of agreement between or among the ratings. The first is used for ordinal data, while the other is used for nominal and discrete data.
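The sketch below illustrates one way Cohen's kappa could be computed for two raters. It is written from the textbook formula (observed agreement corrected for chance agreement) rather than any particular statistics library, and the ten "pass"/"fail" essay ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n           # observed agreement
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    expected = sum((counts1[c] / n) * (counts2[c] / n) for c in categories)  # chance agreement
    return (observed - expected) / (1 - expected)

# Hypothetical: two teachers rate the same ten essays as "pass" or "fail"
teacher_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
teacher_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohen_kappa(teacher_a, teacher_b), 2))
```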
5. Decision Consistency
Decision consistency describes how consistent the classification decisions are rather than how consistent the scores are. It is seen in situations where teachers decide who will receive a passing or failing mark, or who is considered to possess mastery or not.
Let us consider the levels of proficiency adopted in the Philippines at the onset of the K to 12 program. In reporting the grades of K to 12 learners at the end of each quarter, the performance of the students is described based on the following levels of proficiency: Beginning (74% and below); Developing (75%-79%); Approaching Proficiency (80%-84%); Proficient (85%-89%); and Advanced (90% and above). Suppose two students receive marks of 80 and 84; then they are both regarded as having "approaching proficiency" in the subject. Despite the numerical difference in their grades, and even though 84 is just a percentage point shy of 85, the teacher, on the basis of the foregoing classification, can infer that both students have not reached the proficient or advanced levels. If decisions are
consistent, then these students would still be classified in the same level regardless of the
type of assessment method used.
McMillan (2007) gives a similar explanation. Matching of classifications is done by comparing scores from two administrations of the same test. Suppose students are classified as beginning, proficient, or advanced. In a class of 20 students, results of the first testing showed that 5 are at the beginning level, 10 are at the proficient level, and 5 are advanced. In the second testing involving the same students, 3 of the 5 students previously evaluated as beginners are still at that level, all 10 proficient students remain at their level, and 4 of the 5 advanced students retain their level. Hence, there are 17 matches, which is equivalent to an 85% consistency level.
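The decision-consistency computation in McMillan's example is simple enough to express directly. The sketch below uses hypothetical labels mirroring the 20-student example above and counts matching classifications across the two administrations.

```python
def decision_consistency(first, second):
    """Percentage of students placed in the same proficiency level
    on two administrations of the same test."""
    matches = sum(a == b for a, b in zip(first, second))
    return 100 * matches / len(first)

# Hypothetical classifications mirroring the example: 3 of 5 beginners,
# all 10 proficient, and 4 of 5 advanced students keep their level
first = ["beginning"] * 5 + ["proficient"] * 10 + ["advanced"] * 5
second = (["beginning"] * 3 + ["proficient"] * 2
          + ["proficient"] * 10
          + ["advanced"] * 4 + ["beginning"])
print(decision_consistency(first, second))   # 85.0
```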

Measurement Errors
An observation is composed of the true value plus measurement error. Measurement
errors can be caused by examinee-specific factors like fatigue, boredom, lack of motivation,
momentary lapses of memory and carelessness. Consider a student who lacked sleep for
one reason or another. He/she may perform poorly in examination. His/her physical
condition during the assessment does not truly reflect what he/she knows and can do. The
same can be said for another student who may have attained a high score in the same
examination and yet most of his/her responses were made through guessing. Test-specific
factors are also causes of measurement errors. Teachers who provide poor or insufficient
directions may likely cause students to answer test items differently and inaccurately.
Ambiguous questions would only elicit vague and varied responses. Trick questions
purposely intended to deceive test takers will leave students perplexed rather than
enlightened. Aside from examinee and test specific factors, errors can arise due to scoring
factors. Inconsistent grading systems, carelessness and computational errors lead to
imprecise or erroneous student evaluations.
Reliability indicates the extent to which scores are free from measurement errors. As pointed out, lengthening or increasing the number of items in a test can increase reliability. Teacher-made summative tests and standardized tests are more reliable than informal observations conducted over a limited period of time. To increase reliability, an ample amount of observation is needed to detect patterns of behavior.

Reliability of Assessment Methods


A well-constructed objective test has better reliability than a performance assessment (Miller, Linn & Gronlund, 2009; Harris, 1997). Performance assessment is said
to have low reliability because of judgmental scoring. Inconsistent scores may be obtained
depending on the raters. This may be due to inadequate training of raters or inadequate
specification of the scoring rubrics (Harris, 1997). Additionally, in a performance
assessment, there is a limited sampling of course content. But as Harris (1997) explained,
constraining the domain coverage or structuring the responses may raise the reliability of
performance assessment. Reliable scoring of performance assessments can be enhanced by the use of analytic and topic-specific rubrics complemented with exemplars and/or rater training (Jonsson & Svingby, 2007).
As for oral questioning, suggestions for improving the reliability of written tests may also be extended to oral examinations, like increasing the number of questions, the response time, and the number of examiners, and using a rubric or marking guide that contains the criteria and standards.
To reliably measure student behavior, observation instruments must be comprehensive enough to adequately sample occurrences and non-occurrences of behavior but still manageable to conduct. Direct observation data, according to Hintze (2005), can be enhanced through intra-observer and inter-observer reliability. The first pertains to the consistency of observation data collected on a behavior multiple times by a single teacher or observer, while the second refers to agreement between different observers.
Self-assessments, according to Ross (2006), have high consistency across tasks,
across items and over short time periods. This is especially true if self-assessments are
done by students who have been trained on how to evaluate their work. Greater variations
are expected with younger children.
WAYS TO IMPROVE RELIABILITY OF ASSESSMENT RESULTS (Nitko & Brookhart,
2011)
1. Lengthen the assessment procedure by providing more time, more questions and
more observation whenever practical.
2. Broaden the scope of the procedure by assessing all the significant aspects of the target learning performance.
3. Improve objectivity by using a systematic and more formal procedure for scoring
student performance. A scoring scheme or rubric would prove useful.
4. Use multiple markers by employing inter-rater reliability.
5. Combine results from several assessments especially when making crucial
educational decisions.
6. Provide sufficient time to students in completing the assessment procedure.
7. Teach students how to perform their best by providing practice and training to
students and motivating them.
8. Match the assessment difficulty to the students’ ability levels by providing tasks that
are neither too easy nor too difficult, and tailoring the assessment to each student’s
ability level when possible.
9. Differentiate among students by selecting assessment tasks that distinguish or
discriminate the best from the least able students.

Note that some of the suggestions on improving reliability overlap with those concerning
validity. This is because reliability is a precursor to validity. However, it is important to
note that high reliability DOES NOT ensure a high degree of validity.

METHOD | TYPE OF RELIABILITY MEASURE | PROCEDURE | STATISTICAL MEASURE
Test-retest | Measure of stability | Give a test twice to the same learners, with any time interval between tests from several minutes to several years. | Pearson r
Equivalent forms | Measure of equivalence | Give parallel forms of the test with a close time interval between forms. | Pearson r
Test-retest with equivalent forms | Measure of stability and equivalence | Give parallel forms of the test with an increased time interval between forms. | Pearson r
Split-half | Measure of internal consistency | Give a test once; obtain scores for equivalent halves of the test, e.g., odd- and even-numbered items. | Pearson r and Spearman-Brown formula
Kuder-Richardson | Measure of internal consistency | Give the test once, then correlate the proportion/percentage of students passing and not passing a given item. | Kuder-Richardson Formula 20 and 21

Activity 2
Directions: For each of the following situations, identify the type of validity. State your
answer in one or two sentences.
1. Test constructors in a secondary school designed a new measurement procedure to
measure intellectual ability. Compared to well-established measures of intellectual
ability, the new test is shorter to reduce the arduous effect of a long test on students.
To determine its effectiveness, a sample of students accomplished two tests – a
standardized intelligence test and the new test with only a few days interval. Results
from both assessments revealed high correlation.
_________________________________________________________________________
_________________________________________________________________________

2. After the review sessions, a simulated examination was given to graduating students
a few months before the Licensure Examination for Teachers (LET). When the
results of the LET came out, the review coordinator found that the scores in the
simulated (mock) examination are not significantly correlated with the LET scores.
_________________________________________________________________________
_________________________________________________________________________

Activity 3

Directions: For each of the following situations, identify the source of reliability evidence.
State your answer in one or two sentences.
3. For a sample of 150 grade 10 students, a science test on "Living Things and Their Environment" was tested for reliability by comparing the scores obtained on odd-numbered and even-numbered items.
_________________________________________________________________________
_________________________________________________________________________

4. Scores from 100 grade 7 students were obtained from a November 2019
administration of a test in Filipino about “panghalip na panao”. These were compared
to the scores of the same group from a September 2019 administration of the same
test.
_________________________________________________________________________
_________________________________________________________________________

LET US WRAP UP

Activity 4
To sum up what you have learned in this module, describe in your own words
the following:

Principle of validity
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________

Principle of reliability
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________

Activity 5
Directions: Assuming you are a head teacher/principal, what action would you take on the
following scenarios. Provide answers to the question in each scenario based on the principle
of validity and reliability. State your answer in 3 to 5 sentences. (2 points each)

1. In a geometry class, the learners have to calculate perimeters and areas of plane
figures like triangles, quadrilaterals, and circles. The teacher decided to use
alternative assessment rather than tests. Students came up with mathematics
portfolios containing their writings about geometry. What would you tell the teacher?
Why?
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________

2. Mr. Roa taught the different elements and principles of art. After instruction, he
administered a test about prominent painters and sculptors in the 20th century.
Would you recommend revisions? Why?
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________

3. There are two available assessment instruments to measure English skills on grammar and vocabulary. Test A has high validity but no information concerning its reliability. Test B was tested to have a high reliability index but no information about its validity. Which one would you recommend?
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________

LET US ASSESS

Directions: Choose the correct answer from the given choices. Encircle the letter of
your answer.
1. Mr. Cruz asked other social studies teachers in his high school to review his
periodical test to ensure that the test items represent his learning targets.
Which type of evidence of validity did he use?
a. construct-related c. content-related
b. criterion- related d. instructional-related
2. What evidence of validity is obtained by gathering test scores and criterion
scores at nearly the same time?
a. concurrent validity c. construct validity
b. content validity d. predictive validity
3. The school’s guidance counselor administered two forms of the same
personality test to the students. Which type of reliability does she like to
determine?
a. equivalence c. internal consistency
b. rater consistency d. stability
4. Which of the following statements is true about validity and reliability?
a. For an instrument to be valid, it must be reliable.
b. For an instrument to be reliable, it must be valid.
c. Both a and b
d. None of these
5. Which of the following is NOT a method to obtain a reliability coefficient?
a. test-retest c. internal consistency
b. equivalent forms d. concurrent forms

ANSWER KEY

1. C
2. A
3. A
4. A
5. D

REFERENCES

De Guzman and Adamos (2015). Assessment of Learning 1. Adriana Publishing Company, Inc., Philippines.

Kubiszyn and Borich (2003). Educational Testing and Measurement: Classroom Application and Practice, 7th Edition. John Wiley and Sons (Asia) Pte. Ltd.

Rico (2011). Assessment of Students' Learning: A Practical Approach. Anvil Publishing, Inc., Philippines.
