Reviewer
Clinical Setting. Used to help screen for or diagnose behavior problems. The hallmark of testing in clinical settings is that the test is employed with only one individual at a time. Group testing is used primarily for screening, that is, for identifying those individuals who require further diagnostic evaluation.

Counseling Setting. Regardless of the tools used, the ultimate objective of many such assessments is the improvement of the assessee in terms of adjustment, productivity, or some related variable.

Geriatric Setting. This relates to older people, especially with regard to their healthcare. Such assessments gauge the extent to which assessees are enjoying as good a quality of life as possible. Quality of life covers variables related to perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social supports.

Business and Military Setting. Most notable in selection and decision making about the careers of personnel.

Governmental & Organizational Setting. One of the applications is in governmental licensing, certification, or general credentialing of professionals. Before one is legally entitled to practice something, they must pass an examination. Members of some professions have formed organizations with requirements for membership that go beyond licensing or certification.

Academic Research Setting & Other Settings. Many different kinds of measurement procedures find application in a wide variety of settings.

WHY IS THERE A NEED FOR PSYCHOLOGICAL ASSESSMENT?
Psychological tests and assessments allow a psychologist to understand the nature of the problem and to figure out the best way to go about addressing it. Psychologists use tests and other assessment tools to measure and observe a client's behavior to arrive at a diagnosis and guide treatment.

Test users have obligations before, during, and after a test is administered.

Pre-Test Obligations:

During-Test Obligations:
Rapport between the examiner and the examinee can be critically important. Rapport is the working relationship between the examiner and the examinee.

After-Test Obligations:
Safeguarding the test protocols and conveying the test results in a clearly understandable fashion are among the obligations of the test user. Test users who have responsibility for interpreting scores or other test results have an obligation to do so in accordance with established procedures and ethical guidelines.

Assessment of people with disabilities can be done through accommodation and alternate assessment. Accommodation is the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs. Alternate assessment is an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods designed to measure the same variable(s).

WHERE TO GO FOR AUTHORITATIVE INFORMATION: REFERENCE SOURCES
Test catalogues are one of the most accessible sources of information, distributed by the publisher of the test. They can be tapped by a simple telephone call, e-mail, or note.
Test manuals have detailed information concerning the development of a particular test and technical information relating to it.
Reference volumes
Journal articles may contain reviews of the test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either a research or an applied context.
Online databases
CHAPTER 2: HISTORICAL AND CULTURAL PERSPECTIVE OF ASSESSMENT

HISTORICAL PERSPECTIVE OF ASSESSMENT

EARLY ANTECEDENTS
Tests and testing programs first came into being in China as early as 2200 B.C.E. Testing was used as a means of selecting who, of many applicants, would obtain government jobs. Every third year in China, oral examinations were given to help determine work evaluations and promotion decisions.

Han Dynasty (206 B.C.E. to 220 C.E.). Test batteries (two or more tests used in conjunction) were used. Topics were civil law, military affairs, agriculture, revenue, and geography.

Ming Dynasty (1368-1644 C.E.). National multistage testing programs involved local and regional testing centers equipped with special testing booths. Only those who passed this third set of tests were eligible for public office.

Reports by British missionaries and diplomats encouraged the English East India Company in 1832 to copy the Chinese system as a method of selecting employees for overseas duty. Because testing programs worked well for the company, the British government adopted a similar system of testing for its civil service in 1855. After the British endorsement of a civil testing system, the French and German governments followed. In 1883, the US government established the American Civil Service Commission, which developed and administered competitive examinations for certain government jobs.

Charles Darwin suggested that to develop a measuring device, we must understand what we want to measure.
Sir Francis Galton continued Darwin's work and initiated a search for knowledge concerning human individual differences, which is now one of the most important domains in scientific psychology.
Galton's work was extended by James McKeen Cattell, who coined the term mental test. Cattell perpetuated and stimulated the forces that ultimately led to the development of modern tests.

EXPERIMENTAL PSYCHOLOGY AND PSYCHOPHYSICAL MEASUREMENT
J.E. Herbart used mathematical models as the basis for educational theories that strongly influenced 19th-century educational practices.
E.H. Weber followed and attempted to demonstrate the existence of a psychological threshold: the minimum stimulus necessary to activate a sensory system.
G.T. Fechner devised the law that the strength of a sensation grows as the logarithm of the stimulus intensity.
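In symbols, Fechner's law is commonly written as follows (a standard textbook rendering, not taken from these notes; k is a modality-dependent constant and I_0 the threshold intensity at which sensation is zero):

```latex
% Fechner's law: sensation magnitude S grows with the log of stimulus intensity I
S = k \log \frac{I}{I_0}
```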
Wilhelm Wundt set up a laboratory at the University of Leipzig in 1879 and is credited with founding the science of psychology.
E.B. Titchener succeeded the work of Wundt and founded structuralism.
G. Whipple, a student of Titchener, provided the basis for immense changes in the field of testing by conducting a seminar at the Carnegie Institute in 1919. The seminar came up with the Carnegie Interest Inventory and later the Strong Vocational Interest Blank.

Psychological testing developed from at least two lines of inquiry:
One based on the work of Darwin, Galton, and Cattell on the measurement of individual differences; and
The other (more theoretically relevant and probably stronger) based on the work of the German psychophysicists Herbart, Weber, Fechner, and Wundt (experimental psychology developed from this).

There are also tests that arose in response to important needs, such as classifying and identifying the mentally and emotionally handicapped.
The Seguin Form Board Test was developed in an effort to educate and evaluate the mentally disabled.
Kraepelin devised a series of examinations for evaluating emotionally impaired people.

THE EVOLUTION OF INTELLIGENCE AND STANDARDIZED ACHIEVEMENT TESTS
The Binet-Simon Scale had its first version published in 1905. The instrument contained 30 items of increasing difficulty and was designed to identify intellectually subnormal individuals; specifically, it was designed to help identify Paris schoolchildren with intellectual disability. The 1908 Binet-Simon Scale improved on the original and introduced the significant concept of a child's mental age.

L.M. Terman of Stanford University revised the Binet test for use in the US. The Stanford-Binet Intelligence Scale was the only American version of the Binet test that flourished.

Robert Yerkes was requested for assistance by the army. He headed a committee that developed two structured group tests of human abilities:
1. The Army Alpha was a verbal test, measuring such skills as the ability to follow directions; it required reading ability.
2. The Army Beta presented nonverbal problems to illiterate subjects and recent immigrants who were not proficient in English. It measured the intelligence of illiterate adults.

Standardized Achievement Tests provide multiple-choice questions that are standardized on a large sample to produce norms against which the results of new examinees can be compared. They became popular because of the ease of administration and scoring and the lack of subjectivity/favoritism that can occur in essay or other written tests.

Two years after the 1937 revision of the Stanford-Binet Test, David Wechsler published the Wechsler-Bellevue Intelligence Scale, which contained several interesting innovations in intelligence testing. It yielded several scores, permitting an analysis of an individual's pattern or combination of abilities, and it could produce a performance IQ score.

Personality Tests measure presumably stable characteristics or traits that theoretically underlie behavior.
The first structured personality test was the Woodworth Personal Data Sheet, which was developed during WWI and was published in final form just after the war.
Projective tests were also developed; these provide an ambiguous stimulus and an unclear response requirement (the Rorschach Inkblot Test and the Thematic Apperception Test are the most famous).
The Minnesota Multiphasic Personality Inventory (MMPI), published in 1943, began a new era for structured personality tests. It is currently the most widely used and referenced personality test.
Factor analysis is a method of finding the minimum number of dimensions, called factors, to account for a large number of variables. R.B. Cattell introduced the Sixteen Personality Factor Questionnaire (16PF), which remains one of the most well-constructed structured personality tests and an important example of a test developed with the aid of factor analysis.
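As a rough illustration of what factor analysis does (not Cattell's actual procedure; all data, names, and the two-factor choice below are invented), this sketch simulates six items driven by two latent factors and recovers the loadings with scikit-learn:

```python
# Minimal factor-analysis sketch: reduce 6 observed items to 2 underlying factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 300
# Simulate 6 observed items driven by 2 latent factors plus noise.
factors = rng.normal(size=(n, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],   # items loading on factor 1
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])  # items loading on factor 2
items = factors @ loadings.T + rng.normal(scale=0.3, size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(fa.components_.round(2))  # estimated loadings: 2 factors x 6 items
```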
SOME ISSUES REGARDING CULTURE AND ASSESSMENT
An individualist culture is characterized by value being placed on traits such as self-reliance, autonomy, independence, uniqueness, and competitiveness. In a collectivist culture, value is placed on traits such as conformity, cooperation, interdependence, and striving toward group goals. Culture-specific tests are designed for use with people from one culture but not from another.

CHAPTER 3: A STATISTICS REFRESHER

SAMPLING AND SAMPLING TECHNIQUES

A population is the set of all individuals of interest in a particular study. A sample is a set of individuals selected from a population, usually intended to represent the population in a research study.

Sampling is the process of selecting observations to provide an adequate description of, and inferences about, the population. Probability sampling is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling. (A sketch of the first few schemes follows the list.)
Simple random sampling is a sampling method wherein all elements of the population have the same probability of inclusion in the selected sample (e.g., draw lots, fishbowl method).
Stratified sampling is a probability method where we divide the population into nonoverlapping subpopulations or strata (groups whose members share characteristics), and then randomly pick samples from each stratum (population – strata – random selection – sample).
Cluster sampling divides the population into nonoverlapping groups or clusters and then randomly selects clusters (groups whose members are not of the same characteristics).
Systematic sampling is where researchers select members of the population at a regular interval (or k) determined in advance.
Multistage sampling draws a sample from a population using smaller and smaller groups (units) at each stage. It is often used to collect data from a large, geographically spread group of people in national surveys.
Multistage cluster sampling is where the researcher divides the population into groups at various stages for better data collection, management, and interpretation. The groups are called clusters.
Multistage random sampling is where the researcher chooses the samples randomly at each stage. The researcher does not create clusters but narrows down the sample by applying random sampling.
In convenience sampling, units are selected for inclusion in the sample because they are the easiest for the researcher to access.
In snowball sampling, new samples are recruited by other samples to form part of the sample. This can be a useful way to conduct research about people with specific traits who might otherwise be difficult to identify.
Purposive sampling/judgment sampling involves the researcher using their expertise to select a sample that is most useful to the purpose of the research. An effective purposive sample must have clear criteria and a rationale for inclusion.
Quota sampling is done until a specific number of samples for various subpopulations has been selected.
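A minimal sketch of the first few schemes, assuming a toy sampling frame of 100 numbered people (all values illustrative):

```python
# Sketch of three probability-sampling schemes on made-up data.
import random

population = list(range(1, 101))          # a hypothetical sampling frame of 100 people

# Simple random sampling: every element has an equal chance of selection.
srs = random.sample(population, k=10)

# Systematic sampling: pick every k-th element after a random start.
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: random draws within non-overlapping strata.
strata = {"freshman": population[:50], "senior": population[50:]}
stratified = [person for group in strata.values()
              for person in random.sample(group, k=5)]
```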
Descriptive statistics are methods used to provide a concise description of a collection of quantitative information. Inferential statistics are methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population.

The levels of measurement are:
Nominal scale - categorical data; numbers that are simply used as identifiers or names represent a nominal scale of measurement (e.g., gender).
Ordinal scale - represents an ordered series of relationships or rank order (e.g., Likert-type scale, rank in a contest).
Interval scale - represents quantity and has equal units, but zero represents simply an additional point of measurement. Zero does not represent the absolute lowest value; rather, it is a point on the scale with numbers both above and below it.
Ratio scale - similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below zero).

A frequency table is an ordered listing of the number of individuals having each of the different values for a particular variable. A frequency distribution shows the pattern of frequencies over the various values. They can be illustrated through a histogram, bar graph, or frequency polygon.

Skewness is the nature and extent to which symmetry is absent. A skewed distribution is a distribution in which the scores pile up on one side of the middle and are spread out on the other side. It can be positively skewed or negatively skewed.

A floor effect is a situation in which many scores pile up at the low end of a distribution (creating skewness to the right; a positively skewed distribution) because it is not possible to have any lower score. A ceiling effect is a situation in which many scores pile up at the high end of a distribution (creating skewness to the left; a negatively skewed distribution) because it is not possible to have a higher score.

The term testing professionals use to refer to the steepness of a distribution in its center is kurtosis - its peakedness/flatness. A distribution can be platykurtic (relatively flat), leptokurtic (relatively peaked), or mesokurtic (somewhere in the middle).

The central tendency of a distribution refers to the middle of the group of scores.
Mean is the sum of the scores divided by the number of scores.
Mode is the value with the greatest frequency in a distribution.
Median is the middle score.

Variability is an indication of how scores in a distribution are scattered or dispersed (a worked example of these and the central-tendency measures follows the list).
Range indicates the distance between the two most extreme scores in a distribution (range = highest score – lowest score).
Variance is the average of each score's squared difference from the mean.
Standard deviation is simply the square root of the variance. It indicates the average deviation from the mean, the consistency in the scores, and how far scores are spread out around the mean.
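A quick worked example of these measures using Python's statistics module (the scores are made up; population formulas are used, matching the definitions above):

```python
# Computing central tendency and variability for a small score set.
import statistics

scores = [10, 12, 12, 15, 18, 20, 25]

mean = statistics.mean(scores)            # sum of scores / number of scores
median = statistics.median(scores)        # middle score
mode = statistics.mode(scores)            # most frequent score
score_range = max(scores) - min(scores)   # highest - lowest

variance = statistics.pvariance(scores)   # average squared deviation from the mean
sd = statistics.pstdev(scores)            # square root of the variance
print(mean, median, mode, score_range, variance, sd)
```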
A standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.
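For instance, the familiar z score sets the new mean to 0 and SD to 1, and the T-score convention rescales to mean 50 and SD 10; the raw score, mean, and SD below are invented for illustration:

```python
# Converting a raw score to standard scores (illustrative numbers).
raw, mean, sd = 75, 50, 15

z = (raw - mean) / sd        # z score: mean 0, SD 1
t = 50 + 10 * z              # T score: mean 50, SD 10 (a common arbitrary rescaling)
print(round(z, 3), round(t, 2))   # 1.667  66.67
```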
Correlation is an expression of the degree and direction of correspondence between two things. The coefficient of correlation is the numerical index that expresses this relationship: it tells us the extent to which X and Y are correlated. Two variables can be positively or negatively correlated.
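A minimal sketch of computing the coefficient of correlation (the data are invented; SciPy's pearsonr and spearmanr give the two coefficients named later in these notes):

```python
# Correlating two sets of paired observations (made-up data).
from scipy.stats import pearsonr, spearmanr

x = [2, 4, 5, 7, 9]
y = [1, 3, 6, 8, 9]

r, p = pearsonr(x, y)         # coefficient of correlation and its p-value
rho, p_rho = spearmanr(x, y)  # rank-order alternative for ordinal data
print(round(r, 3), round(rho, 3))
```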
Code of ethics: Beneficence & Nonmaleficence, Fidelity & Responsibility, Integrity, Justice, and Respect for People's Rights & Dignity.

CHAPTER 5: RELIABILITY

Reliability – consistency in measurement; the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test.

In a psychological assessment context, a reliable test gives dependable and consistent results even when it is used again and again, although error will always be present. Error implies that there will always be some inaccuracy in our measurement. One of the goals in psychological assessment is to minimize, as much as possible, the presence of error in testing.

Reliability can be measured through the reliability coefficient – an index of the ratio between the true score variance on a test and the total variance. Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.

Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.

[observed score = true ability + random error]

According to Classical Test Theory, each person has a true score that would be obtained if there were no errors in measurement. The difference between the true score and the observed score results from measurement error. The true score of an individual will not change with repeated applications of the same test.
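In the standard classical-test-theory notation (consistent with the bracketed identity above and the reliability-coefficient definition earlier):

```latex
% Observed score = true score + random error:
X = T + E
% With random error, variances add:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
% Reliability = proportion of total variance that is true variance:
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2}
```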
Whereas random error limits a score's consistency, systematic error affects its validity. Typical causes of systematic error include observational error, imperfect instrument calibration, and sampling bias.

SOURCES OF ERROR VARIANCE
Aside from measurement error, we also have error variance (also called residual error, residual variance, or unexplained variance). It is the element of variability in a score that is produced by extraneous factors, such as measurement imprecision, and is not attributable to the independent variable or other controlled experimental manipulation. There are three sources of error variance:
1. Construction
An example is item sampling or content sampling, which refer to variation among items within a test as well as to variation among items between tests.
2. Administration
Can be affected by (1) the test environment (room temperature, level of lighting, and the amount of ventilation and noise); (2) testtaker variables (emotional problems, physical discomfort, lack of sleep); and (3) examiner-related variables (the examiner's physical appearance and demeanor – even the presence or absence of an examiner).
3. Scoring and Interpretation
Despite the rigorous scoring criteria set forth in many better-known tests of intelligence, the scorer (or rater) can be a source of error variance. If subjectivity is involved in scoring, then the scorer can be a source of error variance.

RELIABILITY ESTIMATES/MEASURES
1. Test-Retest Reliability (Time-Sampling Reliability)
It uses the same instrument to measure the same thing/factor at two points in time. It is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. It is appropriate when evaluating a test that purports to measure something that is relatively stable over time (e.g., personality traits). Poor test-retest correlations do not always mean that a test is unreliable; they may suggest that the characteristic under study has itself changed. Pearson r or Spearman rho is usually used to measure test-retest reliability, where 1 indicates perfect reliability, 0.5 or below indicates poor reliability, and 0 indicates no reliability.

3. Internal Consistency
Assesses the correlation between multiple items in a test that are intended to measure the same construct. It measures how related the items are in a test that measures a certain construct or characteristic. You can calculate internal consistency without repeating the test or involving other researchers, so it is a good way of assessing reliability when you only have one data set. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. There are several ways to compute internal consistency:

A. Split-half Reliability
Correlating two pairs of scores obtained from equivalent halves of a single test administered once. This is useful when it is impractical to assess reliability with two tests or to administer a test twice. Results of one half of the test are then compared with the results of the other.
Steps in performing split-half reliability:
(1) Divide the test into equivalent halves. Do not divide it in the middle because that would lower the reliability; instead use the odd-even system, where one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.
(2) Calculate the Pearson r between scores on the two halves of the test.
(3) Adjust the half-test reliability using the Spearman-Brown formula (shown below). This allows a test developer or user to estimate internal consistency reliability from a correlation of two halves. It can also be used to determine the number of items needed to attain a desired level of reliability.
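The Spearman-Brown formula referred to in step (3) is standard; its two common forms are:

```latex
% Full-test reliability from the half-test correlation r_hh:
r_{SB} = \frac{2\,r_{hh}}{1 + r_{hh}}
% General form, for a test lengthened (or shortened) by a factor n:
r_{SB} = \frac{n\,r_{xy}}{1 + (n - 1)\,r_{xy}}
```

For example, a half-test correlation of .70 adjusts to 2(.70)/(1 + .70) ≈ .82 for the full-length test.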
B. Kuder-Richardson Formula 20 (KR-20)
A measure of internal consistency reliability for measures with dichotomous choices. It is a special case of Cronbach's alpha, computed for dichotomous scores. It is often claimed that a high KR-20 coefficient (e.g., > 0.90) indicates a homogeneous test. However, as with Cronbach's alpha, homogeneity (that is, unidimensionality) is actually an assumption, not a conclusion, of reliability coefficients.
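The usual statement of KR-20 (standard notation, not reproduced from these notes):

```latex
% k = number of dichotomous items, p_j = proportion passing item j,
% q_j = 1 - p_j, sigma^2 = variance of total test scores.
KR_{20} = \frac{k}{k - 1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2}\right)
```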
C. Cronbach's Alpha/Coefficient Alpha
The most famous and commonly used among reliability coefficients because it requires only one administration of the test. Cronbach's alpha is mathematically equivalent to the average of all possible split-half estimates. The general rule of thumb is that a Cronbach's alpha of .70 and above is good, .80 and above is better, and .90 and above is best. It is calculated to help answer questions about how similar sets of data are.
It does have some limitations: scores that have a lower number of items associated with them tend to have lower reliability, and sample size can also influence the results. A minimal computation is sketched below.
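A minimal sketch, assuming scores are arranged as a persons-by-items matrix; the data are invented:

```python
# Coefficient alpha from a persons-by-items score matrix.
import numpy as np

scores = np.array([[3, 4, 3, 5],     # each row = one testtaker,
                   [2, 2, 3, 3],     # each column = one item
                   [4, 5, 4, 5],
                   [1, 2, 2, 2],
                   [3, 3, 4, 4]])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))
```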
D. Average Proportional Distance (APD)
Rather than focusing on similarity between scores on items of a test, the APD is a measure that focuses on the degree of difference that exists between item scores. One advantage of the APD over Cronbach's alpha is that the APD index is not connected to the number of items on a measure, unlike the latter method.

Internal consistency at a glance –
Purpose: To evaluate the extent to which items on a scale relate to one another
Typical use: When evaluating the homogeneity of a measure
Number of testing sessions: 1
Sources of error variance: Test construction
Statistical procedure: Pearson r with Spearman-Brown / KR-20 / Cronbach's alpha / APD
4. Inter-scorer Reliability
Inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. It is often used when coding nonverbal behavior.
Purpose: To evaluate the level of agreement between raters on a measure
Typical use: Interviews or coding of behavior. Used when researchers need to show that there is consensus in the way that different raters view a particular behavior pattern (and hence no observer bias)
Number of testing sessions: 1
Sources of error variance: Scoring and interpretation
Statistical procedure: Cohen's kappa / Pearson r / Spearman rho
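A small illustration of chance-corrected inter-rater agreement with Cohen's kappa (the behavior codes and ratings are made up; scikit-learn's cohen_kappa_score does the computation):

```python
# Agreement between two raters, corrected for chance agreement.
from sklearn.metrics import cohen_kappa_score

rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
rater_b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task"]

kappa = cohen_kappa_score(rater_a, rater_b)  # 1 = perfect agreement, 0 = chance level
print(round(kappa, 3))
```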
USING AND INTERPRETING A COEFFICIENT OF RELIABILITY
As a rule of thumb, a coefficient in the .90s rates a grade of A, the .80s rates a B, and anywhere from .65 through the .70s rates a weak grade that borders on failing.

THE PURPOSE OF THE RELIABILITY COEFFICIENT. If the purpose of determining reliability is to break down the error variance into its parts, then a number of reliability coefficients would have to be calculated.

NATURE OF TESTS
1. Homogeneous vs Heterogeneous Items
Homogeneous items (equal difficulties and equal intercorrelations) have high reliability. The more homogeneous a test is, the more inter-item consistency (the degree of correlation among all the items on a scale) it can be expected to have.
2. Dynamic vs Static Characteristics
Dynamic characteristics are traits, states, and/or abilities presumed to be ever-changing as a function of situational and cognitive experiences. Static characteristics, on the other hand, are traits, states, and abilities that are relatively unchanging.
3. Restriction or Inflation of Range
Restricted variance results in a correlation coefficient that tends to be lower. Inflated variance results in a correlation coefficient that tends to be higher.
4. Speed Tests vs Power Tests
A power test has a time limit that is long enough to allow testtakers to attempt all items, and some items have a higher degree of difficulty. A speed test has items of a uniform level of difficulty, and when given generous time limits, all testtakers should be able to complete all the test items correctly (scores are based on performance speed).
5. Criterion-referenced Tests
Designed to provide an indication of where a testtaker stands with respect to some variable or criterion. Scores on these tests tend to be interpreted in pass-fail terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

The SEM (standard error of measurement) is the tool used to estimate or infer the extent to which an observed score deviates from a true score. It is the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests: an index of the extent to which one individual's scores vary over tests presumed to be parallel.

The SEM is useful in establishing a confidence interval – a range or band of test scores that is likely to contain the true score. The standard error of the difference is the statistical measure that can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant.
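The standard formulas (textbook conventions; sigma is the SD of the test scores and r_xx the reliability coefficient):

```latex
% Standard error of measurement:
SEM = \sigma\sqrt{1 - r_{xx}}
% A 95% confidence interval around an observed score X:
X \pm 1.96 \times SEM
% Standard error of the difference between two scores (reliabilities r_11, r_22):
\sigma_{diff} = \sigma\sqrt{2 - r_{11} - r_{22}}
```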
What to do about low reliability?
1. Increase the number of items. The larger the sample of items, the more likely the test will represent the true characteristic.
2. Factor and item analysis. Tests are more reliable if they are unidimensional: measuring a single ability, attribute, construct, or skill. Examine the correlation between each item and the total score for the test.
A modification of the BCG (Brogden-Cronbach-Gleser) formula is the productivity gain, which exists for researchers who prefer their findings in terms of productivity gains (estimated increases in work output) rather than financial ones.
METHODS FOR SETTING CUT SCORES

ANGOFF METHOD. Setting fixed cut scores can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability. The judgments of the experts are averaged to yield cut scores for the test. (A toy computation follows the IRT-based methods below.)

KNOWN GROUPS METHOD. Collecting data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest. Based on an analysis of these data, a cut score is set on the test that best discriminates the two groups' test performance.

IRT-BASED METHODS. Each item is associated with a particular level of difficulty, and testtakers must answer items that are deemed to be above some minimum level of difficulty to pass the test, which is determined by experts and serves as the cut score. Techniques include the item-mapping method and the bookmark method.
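A toy Angoff computation (judges, items, and probabilities all invented): each judge estimates, per item, the probability that a minimally competent testtaker answers correctly; judge totals are averaged to give the cut score.

```python
# Angoff-style cut score from three judges' per-item probability ratings
# for a 5-item test (all numbers illustrative).
ratings = [
    [0.9, 0.7, 0.6, 0.8, 0.5],   # judge 1: one probability per item
    [0.8, 0.6, 0.7, 0.9, 0.4],   # judge 2
    [0.9, 0.8, 0.5, 0.8, 0.6],   # judge 3
]

per_judge_totals = [sum(judge) for judge in ratings]      # expected score per judge
cut_score = sum(per_judge_totals) / len(per_judge_totals)
print(cut_score)   # expected raw score a minimally competent testtaker should reach
```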
CHAPTER 6: VALIDITY

A measurement can be reliable but not valid; however, if a measure is valid, then it is also reliable.

Validity – as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
Judgment is based on evidence about the appropriateness of inferences drawn from test scores. The validity of a test must be shown from time to time to account for culture and advancement.
Validation is the process of gathering and evaluating evidence about validity. Test user and testtaker both have roles in the validation of a test.
Test users can conduct (1) validation studies, to yield insights regarding a particular population of testtakers as compared to the norming sample, or (2) local validation studies, which are absolutely necessary when a test user plans to alter elements of the test.

CHAPTER 8: TEST DEVELOPMENT

The creation of a good test is the product of the thoughtful and sound application of established principles of test development – an umbrella term for all that goes into the process of creating a test. The process of developing a test occurs in five stages:
1. TEST CONCEPTUALIZATION
Ideas for a test.
2. TEST CONSTRUCTION
The stage in the process of test development that entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.
3. TEST TRYOUT
Once a preliminary form of the test has been developed, it is administered to a representative sample of testtakers under conditions that simulate the conditions under which the final version of the test will be administered.
4. ITEM ANALYSIS
Statistical procedures are employed to assist in making judgments about which items are good as they are, which need revision, and which should be discarded.
5. TEST REVISION
The test is modified on the basis of the item analysis to improve its effectiveness.
TEST CONSTRUCTION
SCALING. The process by which a measuring device is designed and calibrated and by which numbers (or other indices) – scale values – are assigned to different amounts of the trait, attribute, or characteristic being measured.

L.L. Thurstone introduced the notion of absolute scaling – a procedure for obtaining a measure of item difficulty across samples of testtakers who vary in ability.

Scaling methods are as follows:
Rating scale. Can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. The use of rating scales of any type results in ordinal-level data.
Method of paired comparisons. Testtakers are presented with pairs of stimuli, which they are asked to compare; they must select one.
Guttman scale. Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with the milder statements. The resulting data are then analyzed by means of scalogram analysis.

WRITING ITEMS. It is usually advisable that the first draft contain approximately twice the number of items that the final version of the test will contain. An item pool is the reservoir or well from which items will or will not be drawn for the final version of the test.

Items presented in a selected-response format require testtakers to select a response from a set of alternative responses.
Multiple-choice formats have three elements: (1) a stem, (2) a correct alternative/option, and (3) several distractors/foils.
Matching items have two columns: premises on the left and responses on the right.
True-false items require the testtaker to indicate whether a statement is or is not a fact.
Items presented in a constructed-response format require testtakers to supply or create the correct answer, not merely to select it:
Completion item
Short-answer item
Essay item

Computerized Adaptive Testing (CAT) refers to an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items. It tends to reduce floor effects and ceiling effects. It also has the ability to tailor the content and order of presentation of test items on the basis of responses to previous items, referred to as item branching.
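A deliberately simplified sketch of an item-branching rule (real CAT systems typically use IRT-based ability estimates; the step-up/step-down rule and the items here are invented):

```python
# Toy item-branching rule: the next item's difficulty depends on the last response.
# Items are (difficulty, label) pairs, sorted easy -> hard; all illustrative.
items = sorted([(0.2, "easy item"), (0.4, "easier item"),
                (0.6, "harder item"), (0.8, "hard item")])

def next_item_index(current: int, answered_correctly: bool) -> int:
    """Step up after a correct answer, down after an incorrect one."""
    step = 1 if answered_correctly else -1
    return min(max(current + step, 0), len(items) - 1)

idx = 1                                  # start near the middle of the pool
for correct in [True, True, False]:      # a hypothetical response pattern
    idx = next_item_index(idx, correct)
    print("administer:", items[idx][1])
```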
SCORING ITEMS. Many different test scoring models have been devised, and some of them are as follows:
Cumulative model – the higher the score on the test, the higher the test taker is on
the ability, trait, or other characteristic that the test purports to measure.
Class/Category scoring – test taker responses earn credit toward placement in a
particular class or category with other test takers whose pattern of responses is
presumably similar in some ways.
Ipsative scoring – comparing a test taker’s score on one scale within a test to
another scale within that same test.
TEST TRYOUT
The test should be tried out on people who are similar in critical respects to the people for
whom the test was designed. A good test item is reliable and valid, and it helps to
discriminate test takers.
ITEM ANALYSIS
ITEM-DIFFICULTY INDEX. It is obtained by calculating the proportion of the total number
of test takers who answered the item correctly. The result can range from 0 to 1, and the
larger the item-difficulty index, the easier the item. For example, if 50 out of 100 students got the item correct, then p = 50/100 = .5.
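The computation as a one-liner, with the worked example from above:

```python
# Item-difficulty index p: proportion of testtakers answering the item correctly.
def item_difficulty(num_correct: int, num_testtakers: int) -> float:
    return num_correct / num_testtakers

p = item_difficulty(50, 100)
print(p)   # 0.5 -- the larger p is, the easier the item
# An often-cited target for a four-option multiple-choice item is the midpoint
# between chance success and 1.00: (0.25 + 1.00) / 2 = 0.625.
```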