Reliability
Reliability is the extent to which an experiment, test, or any measuring procedure yields the same
result on repeated trials. Without the agreement of independent observers able to replicate
research procedures, or the ability to use research tools and procedures that yield consistent
measurements, researchers would be unable to satisfactorily draw conclusions, formulate
theories, or make claims about the generalizability of their research. In addition to its important
role in research, reliability is critical for many parts of our lives, including manufacturing,
medicine, and sports.
Reliability is such an important concept that it has been defined in terms of its application to a
wide range of activities. For researchers, four key types of reliability are:
Equivalency Reliability
Equivalency reliability is the extent to which two items measure identical concepts at an identical
level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one
another to highlight the degree of relationship or association. In quantitative studies and
particularly in experimental studies, a correlation coefficient, statistically referred to as r, is used
to show the strength of the correlation between a dependent variable (the subject under study)
and one or more independent variables, which are manipulated to determine effects on the
dependent variable. An important consideration is that equivalency reliability is concerned with
correlational, not causal, relationships.
For example, a researcher studying university English students happened to notice that when
some students were studying for finals, their holiday shopping began. Intrigued by this, the
researcher attempted to observe how often, or to what degree, these two behaviors co-
occurred throughout the academic year. The researcher used the results of the observations to
assess the correlation between studying throughout the academic year and shopping for gifts. The
researcher concluded there was poor equivalency reliability between the two actions. In other
words, studying was not a reliable predictor of shopping for gifts.
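As a minimal sketch of the underlying computation, the correlation coefficient r between two sets of scores can be computed directly in Python. The pearson_r helper and the score lists are hypothetical illustrations, not data from any actual study:

```python
# Minimal sketch: computing the correlation coefficient r between two
# sets of test scores. All data here are hypothetical.
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Scores of the same students on two forms of a test (hypothetical).
form_a = [72, 85, 91, 64, 78, 88]
form_b = [70, 83, 94, 61, 75, 90]

r = pearson_r(form_a, form_b)
print(f"r = {r:.2f}")  # values near 1.0 suggest strong equivalency reliability
```

Values of r near zero, as in the studying-versus-shopping example, would indicate poor equivalency reliability between the two measures.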
Stability Reliability
Stability reliability (sometimes called test-retest reliability) is the agreement of measuring
instruments over time. To determine stability, a measure or test is repeated on the same subjects
at a future date. Results are compared and correlated with the initial test to give a measure of
stability.
An example of stability reliability would be the method of maintaining weights used by the U.S.
Bureau of Standards. Platinum objects of fixed weight (one kilogram, one pound, etc.) are kept
locked away. Once a year they are taken out and weighed, allowing scales to be reset so they are
"weighing" accurately. Keeping track of how much the scales are off from year to year
establishes a stability reliability for these instruments. In this instance, the platinum weights
themselves are assumed to have a perfectly fixed stability reliability.
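A test-retest estimate is simply the correlation between the two administrations. A minimal sketch, with hypothetical measurements, assuming Python 3.10+ for statistics.correlation:

```python
# Sketch of test-retest (stability) reliability: the same subjects are
# measured twice, and the two sets of results are correlated.
# Hypothetical data; requires Python 3.10+ for statistics.correlation.
from statistics import correlation

time_1 = [12.0, 15.5, 9.8, 14.2, 11.1]   # initial measurements
time_2 = [12.3, 15.1, 10.0, 14.0, 11.4]  # same subjects at a later date

stability = correlation(time_1, time_2)
print(f"test-retest r = {stability:.2f}")
```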
Internal Consistency
Internal consistency is the extent to which tests or procedures assess the same characteristic, skill
or quality. It is a measure of the precision between the observers or of the measuring instruments
used in a study. This type of reliability often helps researchers interpret data and predict the value
of scores and the limits of the relationship among variables.
For example, a researcher designs a questionnaire to find out about college students'
dissatisfaction with a particular textbook. Analyzing the internal consistency of the survey items
dealing with dissatisfaction will reveal the extent to which items on the questionnaire focus on
the notion of dissatisfaction.
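One widely used internal-consistency statistic is Cronbach's alpha; the text does not name a specific statistic, so this choice, and the response data below, are illustrative assumptions:

```python
# Sketch of Cronbach's alpha for hypothetical questionnaire responses.
# Rows are respondents; columns are survey items scored on the same scale.
from statistics import pvariance

def cronbach_alpha(rows):
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

responses = [
    [4, 5, 4, 4],   # respondent 1's answers to four dissatisfaction items
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

An alpha near 1.0 suggests the items are measuring the same underlying notion; an alpha near zero suggests they are not.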
Interrater Reliability
Interrater reliability is the extent to which two or more individuals (coders or raters) agree.
Interrater reliability addresses the consistency of the implementation of a rating system.
A test of interrater reliability would be the following scenario: Two or more researchers are
observing a high school classroom. The class is discussing a movie that they have just viewed as
a group. The researchers have a sliding rating scale (1 being most positive, 5 being most
negative) with which they are rating the students' oral responses. Interrater reliability assesses the
consistency of how the rating system is implemented. For example, if one researcher gives a "1"
to a student response, while another researcher gives a "5," obviously the interrater reliability
would be inconsistent. Interrater reliability is dependent upon the ability of two or more
individuals to be consistent. Training, education, and monitoring can enhance interrater
reliability.
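A common way to quantify agreement between two raters is Cohen's kappa, which corrects raw agreement for chance; the statistic and the ratings below are illustrative assumptions, not prescribed by the text:

```python
# Sketch of interrater agreement between two raters using Cohen's kappa.
# Ratings use the 1-5 scale described above; the data are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

rater_1 = [1, 2, 2, 3, 5, 1, 4, 2]
rater_2 = [1, 2, 3, 3, 5, 1, 4, 2]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

A kappa near 1.0 indicates the rating system is being implemented consistently; a kappa near zero indicates agreement no better than chance.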
--
Validity
Validity refers to the degree to which a study accurately reflects or assesses the specific concept
that the researcher is attempting to measure. While reliability is concerned with the accuracy of
the actual measuring instrument or procedure, validity is concerned with the study's success at
measuring what the researchers set out to measure.
Researchers should be concerned with both external and internal validity. External validity refers
to the extent to which the results of a study are generalizable or transferable. (Most discussions
of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include
a reference here to transferability because many qualitative research studies are not designed to
be generalized.)
Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's
design, the care taken to conduct measurements, and decisions concerning what was and wasn't
measured) and (2) the extent to which the designers of a study have taken into account
alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do
not explore causal relationships, only the first of these definitions should be considered when
assessing internal validity.
Scholars discuss several types of internal validity. Brief discussions of several of these types
appear below:
Face Validity
Face validity is concerned with how a measure or procedure appears. Does it seem like a
reasonable way to gain the information the researchers are attempting to obtain? Does it seem
well designed? Does it seem as though it will work reliably? Unlike content validity, face
validity does not depend on established theories for support (Fink, 1995).
Criterion Related Validity
Criterion related validity, also referred to as instrumental validity, is used to demonstrate the
accuracy of a measure or procedure by comparing it with another measure or procedure which
has been demonstrated to be valid.
For example, imagine a hands-on driving test has been shown to be an accurate test of driving
skills. A written driving test can then be validated using a criterion related strategy: scores on
the written test are compared with scores on the hands-on test, which serves as the established
criterion.
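In practice, the criterion-related strategy reduces to correlating the new measure with the established one. A minimal sketch, with hypothetical scores, assuming Python 3.10+ for statistics.correlation:

```python
# Sketch of criterion-related validation: a new written driving test is
# correlated against a hands-on test already shown to be valid.
# Hypothetical scores; requires Python 3.10+ for statistics.correlation.
from statistics import correlation

written  = [78, 85, 62, 90, 71, 88]   # new measure being validated
hands_on = [74, 88, 65, 92, 70, 85]   # established criterion measure

print(f"criterion validity r = {correlation(written, hands_on):.2f}")
```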
Construct Validity
Construct validity seeks agreement between a theoretical concept and a specific measuring
device or procedure. For example, a researcher inventing a new IQ test might spend a great deal
of time attempting to "define" intelligence in order to reach an acceptable level of construct
validity.
Construct validity can be broken down into two sub-categories: convergent validity and
discriminant validity. Convergent validity is the general agreement, among ratings gathered
independently of one another, between measures that should be theoretically related.
Discriminant validity is the lack of a relationship among measures which theoretically should not
be related.
To understand whether a piece of research has construct validity, three steps should be followed.
First, the theoretical relationships must be specified. Second, the empirical relationships between
the measures of the concepts must be examined. Third, the empirical evidence must be
interpreted in terms of how it clarifies the construct validity of the particular measure being
tested (Carmines & Zeller, 1991, p. 23).
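A minimal numeric sketch of the convergent/discriminant distinction, with hypothetical measures (the IQ scores and the deliberately unrelated trait are invented for illustration; requires Python 3.10+):

```python
# Sketch of the convergent/discriminant check: measures that theory says
# are related should correlate strongly; measures that theory says are
# unrelated should not. All data are hypothetical.
from statistics import correlation

iq_test_a = [98, 112, 105, 121, 89, 107]   # new IQ measure
iq_test_b = [101, 110, 108, 118, 92, 104]  # established IQ measure
shoe_size = [9, 11, 8, 10, 12, 9]          # theoretically unrelated trait

convergent   = correlation(iq_test_a, iq_test_b)   # expect high
discriminant = correlation(iq_test_a, shoe_size)   # expect near zero

print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```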
Content Validity
Content Validity is based on the extent to which a measurement reflects the specific intended
domain of content (Carmines & Zeller, 1991, p. 20).
Consider the following example: researchers aiming to study mathematical learning create a
survey to test for mathematical skill. If these researchers only tested for multiplication and then
drew conclusions from that survey, their study would not show content validity, because it
excludes other mathematical functions. Although establishing content validity for
placement-type exams seems relatively straightforward, the process becomes more complex as it
moves into the more abstract domain of socio-cultural studies. For
example, a researcher needing to measure an attitude like self-esteem must decide what
constitutes a relevant domain of content for that attitude. For socio-cultural studies, content
validity forces the researchers to define the very domains they are attempting to study.
--
In order for assessments to be sound, they must be free of bias and distortion. Reliability and
validity are two concepts that are important for defining and measuring bias and distortion.
Reliability refers to the extent to which assessments are consistent. Just as we enjoy having
reliable cars (cars that start every time we need them), we strive to have reliable, consistent
instruments to measure student achievement. Another way to think of reliability is to imagine a
kitchen scale. If you weigh five pounds of potatoes in the morning, and the scale is reliable, the
same scale should register five pounds for the potatoes an hour later (unless, of course, you
peeled and cooked them). Likewise, instruments such as classroom tests and national
standardized exams should be reliable -- it should not make any difference whether a student
takes the assessment in the morning or afternoon; one day or the next.
Another measure of reliability is the internal consistency of the items. For example, if you create
a quiz to measure students’ ability to solve quadratic equations, you should be able to assume
that if a student gets an item correct, he or she will also get other, similar items correct. Three
common measures of reliability are test-retest, alternate (parallel) forms, and internal consistency.
The values for reliability coefficients range from 0 to 1.0. A coefficient of 0 means no reliability
and 1.0 means perfect reliability. Since all tests have some error, reliability coefficients never
reach 1.0. Generally, if the reliability of a standardized test is above .80, it is said to have very
good reliability; if it is below .50, it would not be considered a very reliable test.
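One way to estimate internal consistency for a quiz like the quadratic-equations example above is the split-half method with the Spearman-Brown correction; this choice of method and the item matrix below are illustrative assumptions (requires Python 3.10+):

```python
# Sketch of a split-half reliability estimate: scores on the odd items
# are correlated with scores on the even items, and the Spearman-Brown
# correction adjusts for the halved test length. Hypothetical data
# (1 = item answered correctly, 0 = incorrectly).
from statistics import correlation

items = [  # each row is one student's results on an eight-item quiz
    [1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 0],
]
odd_half  = [sum(row[0::2]) for row in items]
even_half = [sum(row[1::2]) for row in items]

r_half = correlation(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)   # Spearman-Brown correction
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```

Judged against the thresholds above, a corrected coefficient over .80 would indicate very good reliability for this quiz.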
Validity refers to the accuracy of an assessment -- whether or not it measures what it is supposed
to measure. Even if a test is reliable, it may not provide a valid measure. Let’s imagine a
bathroom scale that consistently tells you that you weigh 130 pounds. The reliability
(consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh
145 pounds (perhaps you re-set the scale in a weak moment)! Since teachers, parents, and school
districts make decisions about students based on assessments (such as grades, promotions, and
graduation), the validity of the inferences drawn from those assessments is essential -- even more
crucial than the reliability. Also, if a test is valid, it is almost always reliable.
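A short numeric sketch of the bathroom-scale point, with hypothetical readings: the measurements barely vary (reliable) but center far from the true value (not valid).

```python
# A measure can be highly reliable (consistent) yet invalid
# (systematically wrong). All values are hypothetical.
from statistics import mean, stdev

true_weight = 145
readings = [130.1, 129.8, 130.0, 130.2, 129.9]  # repeated weighings

print(f"spread = {stdev(readings):.2f} lb  (tiny spread: very reliable)")
print(f"bias   = {mean(readings) - true_weight:.1f} lb  (large bias: not valid)")
```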
There are three ways in which validity can be measured. In order to have confidence that a test is
valid (and therefore the inferences we make based on the test scores are valid), all three kinds of
validity evidence should be considered.
Content validity: the extent to which the content of the test matches the instructional objectives.
Example/non-example: a semester or quarter exam that only includes content covered during the
last six weeks is not a valid measure of the course's overall objectives -- it has very low content
validity.

Criterion validity: the extent to which scores on the test are in agreement with (concurrent
validity) or predict (predictive validity) an external criterion. Example: if the end-of-year math
tests in 4th grade correlate highly with the statewide math tests, they would have high concurrent
validity.

Construct validity: the extent to which an assessment corresponds to other variables, as predicted
by some rationale or theory. Example: if you can correctly hypothesize that ESOL students will
perform differently on a reading test than English-speaking students (because of theory), the
assessment may have construct validity.
So, does all this talk about validity and reliability mean you need to conduct statistical analyses
on your classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of
your peers to verify the content validity of your major assessments.) However, you should be
aware of the basic tenets of validity and reliability as you construct your classroom assessments,
and you should be able to help parents interpret scores for the standardized exams.
--