
What is psychological testing?

Psychological testing is the basis for mental health treatment. These tools are often used to
measure and observe a person’s behaviors, emotions, and thoughts.

Tests are performed by a psychologist who will evaluate the results to determine the cause,
severity, and duration of your symptoms. This will guide them in creating a treatment plan that
meets your needs.

Tests can either be objective or projective:

● Objective testing involves answering questions with set responses like yes/no or
true/false.
● Projective testing evaluates responses to ambiguous stimuli in the hopes of uncovering
hidden emotions and internal conflicts.

Both provide valuable insight into your symptoms and help psychologists see your overall level
of functioning and distress.

Psychological tests can include formal, or “norm-referenced,” tests to measure your ability to
comprehend different concepts. They can come in the form of checklists and questionnaires.

Each measure is constructed to have the reliability, validity, and objectivity necessary to avoid bias in
scoring or interpreting your results.

During a psychological evaluation, assessments may also be used to help diagnose and treat
mental health conditions. Assessments include standardized tests as well as informal tests, such
as:

● surveys

● clinical interviews

● observational data

● medical exams

● previous educational and medical history


Why is psychological testing needed or recommended?

Psychological testing can be recommended for a number of reasons including diagnosing mental
health conditions and identifying troubling behavior.

According to the American Psychiatric Association, the following symptoms may indicate that
psychological testing is needed:

● increased social withdrawal

● nervousness

● changes in mood

● difficulty completing normal tasks

● a dramatic change in sleep and eating habits

● problems with concentration

If you or a loved one are experiencing any of these, consider reaching out to a mental health
professional.

Psychological testing may also be used by researchers or scientists to test hypotheses about a
variety of topics that aren’t always related to psychology.

Other uses for psychological testing include:

● screening job applicants

● organizational development

● academic placement

What types of psychological tests are there?

Psychologists use testing to examine a variety of factors, including emotional intelligence,
personality, mental aptitude, and neurological functioning.

Here’s a more in-depth look at the types of testing available and the most commonly used tests
for each category.
● Personality tests: measure behaviors, emotions, attitudes, and behavioral and environmental
characteristics. Test names: Basic Personality Inventory (BPI), 16 Personality Factor
Questionnaire.

● Achievement tests: measure respondents’ intellectual interests, achievements, and cognitive
abilities. Test names: Woodcock-Johnson Psychoeducational Battery, Kaufman Test of
Educational Achievement (K-TEA).

● Attitude tests: measure respondents’ views based on how much they agree or disagree with a
statement. Test names: Likert Scale, Thurstone Scale.

● Aptitude tests: measure capabilities, skill sets, and projection of future success. Test names:
Visual Reasoning Test, Abstract Reasoning Test.

● Emotional intelligence tests: measure emotional responses such as anger, sadness, happiness,
and impulsivity. Test names: Emotional and Social Competence Inventory.

● Intelligence tests: measure mental/developmental learning disabilities. Test names: Wechsler
Individual Achievement Test, Wechsler Adult Intelligence Scale, Universal Nonverbal
Intelligence Test.

● Neuropsychological tests: measure cognitive abilities like memory, language, and executive
functioning. Test names: QNST, Ammons Quick Test.

● Projective tests: measure feedback from external influences to identify unrealized
emotions/conflicts. Test names: Rorschach Inkblot Test, Thematic Apperception Test (TAT).

● Observation (direct) tests: measure immediate observable behavior; performed in a laboratory,
clinical, or natural setting.
Test Construction
LEARNING OBJECTIVES

● learn the steps to construct a test for any social science concept that needs to be measured
● understand the concepts and techniques to make the constructed test psychometrically robust
● master the method to translate an existing psychological test to create an equally valid scale in
another language.
There will be no clinician working in the broad area of mental health who has not used a
psychological test/rating scale, either for research or in their clinical practice. The need also
arises in the field of social sciences either to develop a new scale or to translate a scale into the
local language for use in a particular population. For any scale to be used within a specific
sample, it has to be valid in that group – hence, it will require validation prior to use. The process
of developing and validating an instrument is in large part focused on reducing error in the
measurement process. In this article, we take the reader through easy-to-follow practical steps
involved in three processes: developing a test, validating it and translating it.

Why construct a psychological test?


Understanding and assessing a person's behaviour usually requires in-depth interaction with or
observation of the person over a period of time. We can also get information about the person
from those in constant contact with them or from records about them. A quicker method is to get
the answers from them directly, through self-report. They can answer self-report questions on the
basis of their past behaviour or their possible response to various imagined situations. Put
together, these questions can give a sample of the person's behaviour.

This direct method of self-report has several advantages, in that it is quick, can be obtained from
several people simultaneously, is comparable, and is directly obtained from the person rather
than based on inferences made by an observer. But it also has several potential problems. For
example, failure to elicit authentic information about feelings and behaviour if the person
wrongly perceives themselves or the question; falsification of responses to impress;
forgetfulness; surface replies because of lack of involvement; and, most important, factors
relating to the test items, such as their relevance, the response options given and also how they
are worded (Ackroyd & Hughes 1981). The same question worded
differently and asked by different people can elicit different answers. For example, the following
three questions are aimed at finding out whether a person is depressed: ‘Are you terribly
depressed?’; ‘Considering your family's complaints about your constantly low mood, would you
say you are depressed?’; ‘On a scale of 1–10, with 10 being very high, can you rate your level of
depression?’. The first question might elicit a ‘Yes’ or ‘No’ response, but no indication of the
severity of the depression; the second might make the respondent defensive and, in rebellion,
they might falsely respond ‘No’; the third item, however, if well understood by the respondent,
might show us not only whether the person feels depressed, but also how severely.

A well-constructed test that taps all aspects of a concept or situation in a scientific way and has
been confirmed to be consistent can offset many of these problems and be a quick and accurate
tool. Psychological tests are being increasingly used in research, clinical practice and educational
settings, because of their reliability and the ease and speed of gathering, comparing and
differentiating data. A new scale is usually created when instruments or tests to measure the
construct of interest are not readily available or existing tests do not fully satisfy the requirement
or are not in the required language.

A psychological test is a systematic procedure for obtaining samples of behaviour relevant to
cognitive or affective functioning, and for scoring and evaluating those samples according to
standards (Urbina 2004). Psychological tests are useful for making decisions in
clinical and career settings, for understanding the self and others, and as personal growth
indicators. They are also useful in research to understand psychological phenomena and
individual differences. They can be ability tests, achievement tests, attitude and aptitude tests,
personality tests, intelligence tests or neuropsychological tests. Strictly speaking, the term ‘test’
should be used only where the individual's responses are evaluated, on the basis of their quality
or correctness, as an indicator of some aspect of their cognitive functioning, knowledge, skills or
abilities. Instruments whose responses are neither evaluated nor scored as right/wrong or
pass/fail are called inventories, questionnaires, surveys, checklists, schedules or projective
techniques, and are usually grouped under the rubric of personality tests (Urbina 2004).

The construction of a test


Test construction usually involves three main procedures:

● item generation and selection

● item analysis

● checking reliability and validity.
Item generation and selection

Draft item creation


The initial step in creating draft items involves understanding the concept thoroughly through
reading related literature, talking to experts, sharpening one's own thinking and observation.
Attention must also be paid to whether it is a uniform concept or if there are sub-aspects to it
(Irvine & Kyllonen 2002). For illustrative purposes, let us take the example of
anxiety. Anxiety involves not only the feeling aspects of being fearful, feeling discouraged and
anxious, but also cognitive aspects such as constantly thinking of failures, problems and negative
consequences, as well as physiological aspects such as tremors, palpitations, feeling faint, cold
clammy hands, dry mouth and upset stomach. A pool of items representing all the sub-aspects of
the concept must be prepared by the test constructor, ideally with a team of experts.
Shortcomings at this stage with regard to the relevance and wording of the test items can
contribute to the error variance. Error variance is variance due to random, irrelevant sources, as
opposed to the true variance, which is variance due to true difference (Cohen & Swerdlik 2009).

Test format
The test format also has to be decided. Will the test be open-ended (also called free response or
constructed response) or will it be closed (objective or forced choice)? Open-ended responses
may involve writing samples (e.g. in an ability test), free oral responses (e.g. in projective tests)
or even performance of some kind of task. They are good for exploring
concepts/situations/phenomena and provide a wider range of possibilities, richer samples of the
individual's behaviour and allow for their unique characteristics to emerge. However, scoring is
more complex and time-consuming, and the reliability and validity of such tests are lower than in
closed-response (forced-choice) tests (Urbina 2004).

The forced-choice format presents a limited number of alternatives from which the respondent
must choose. The choice can be on the basis of multiple choice, true/false statement, rating,
ranking and matching, as well as rearrangement of the options provided. Well-known among
them is the Likert scale, where the items are listed and the respondent expresses their degree of
agreement or acceptance or frequency of occurrence by choosing one of an odd number (3, 5, 7
or even 9) of response options, with the midpoint usually signifying a neutral or middle-of-the-
road position (Dawes 2008). The sum of the responses to all the items is taken
as the score. The 5-point Likert scale is particularly popular, with response options such as
‘strongly agree’, ‘agree’, ‘neither agree nor disagree’, ‘disagree’ and ‘strongly disagree’ (Derrick
& White 2017). Seven or more response options are used if there is
just a single item for measurement, as in the Net Promoter Score, a single-item customer loyalty
measure where customers are asked how likely it is that they will recommend a company or
product to others (Sauro & Dumas 2009).

The forced-choice format is the most frequently used option for test items because of the ease
and speed of scoring. The objectivity of the responses makes it more reliable. It can be applied in
individual or group testing. However, errors are possible if the respondent guesses in ability
tests, or answers manipulatively or carelessly in personality tests. These can reduce the reliability
and validity of the test. Also, preparing a forced-choice test requires thorough familiarity with
the concept and specialised test development and item-writing skills. Most of this article focuses
on this type of test construction.

Item analysis

Qualitative item analysis


The order, type and wording of the measurement items, the introduction, the instructions, the
transitions and the closure/ending of a test ideally should encourage respondents to provide
accurate and adequate information. They should also discourage them from discontinuation or
refusing to answer specific questions. Test item wording also has to be sensitive to the target
group. The following tips help meet these aims.

● Use simple language, short sentences and terms that the participant understands (e.g. in the item
‘I dread panic attacks’, the respondent might not know what ‘panic attacks’ are).
● Ensure that all possible response alternatives are provided for forced-choice (closed-response)
items (e.g. in asking about ‘Frequency of panic attacks’, if the response options range from ‘once
in a year’ to ‘several times a day’, someone who has never experienced a panic attack is forced to
choose an option that does not apply to them).
● Ensure that the wording of the item matches the response options (e.g. if the item reads ‘Do you
feel fearful for no reason?’, the response options ‘strongly agree’, ‘agree’, ‘disagree’, ‘strongly
disagree’ will not be suitable, as they are meant for a statement and not a question).
● Ensure that all the items are relevant and needed.
● Avoid double-barrelled items (e.g. ‘Do you have digestive problems and shortness of breath?’ –
if the patient has only one and not the other, how will they answer?).
● Avoid leading, biased wording (e.g. ‘Anxiety is often accompanied by irritability; rate your
level of irritability on a scale of 1 to 10’).
● Avoid ambiguous words with broad or abstract meaning. For example, words such as
‘generally’, ‘on average’ and ‘regularly’ can mean different things to different people.
● Avoid items that require effort and distant recall (e.g. ‘Did you experience anxiety attacks 5
years ago?’).
● Avoid asking for ego-threatening, embarrassing or private information if it is not essential (e.g.
income, moral stand, private habits).
A test maker should also be aware of ‘response set’ while creating the final form of the test. The
response a person makes to a test is a function not just of the item content but also of the form of
the items and the test directions. Favouring speed over accuracy, opting for the neutral category,
guessing when in doubt, marking extreme categories, tending to agree, and responding in a
socially desirable way are all examples of an individual's response set when completing a test. A
person's response set is the consistency in the manner of their response to the test's form rather
than its specific content (Cronbach 1950). For example, if all the items of the test refer to anxiety symptoms such as
trembling, fearfulness, thoughts of failure, this might induce a temporary preference to respond
in a set way to all the items. To avoid this response set, some of the items could be worded in the
reverse direction (‘reverse wording’), with corresponding reverse scoring. For example, ‘I am
usually calm most of the time’ instead of ‘I am often fearful’.
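
As a brief illustration (not taken from the source), a reverse-worded item of this kind is reverse-scored before the items are summed. The item names, ratings and 5-point scale below are hypothetical:

```python
# Illustrative sketch only: reverse-scoring a negatively worded Likert item
# before summing, so that a higher total always means more anxiety.
# Item names and ratings are hypothetical; a 5-point scale is assumed.
import pandas as pd

data = pd.DataFrame({
    "often_fearful":  [4, 2, 5],   # positively keyed items
    "thinks_failure": [3, 1, 4],
    "usually_calm":   [2, 5, 1],   # reverse-worded item
})

reverse_items = ["usually_calm"]
data[reverse_items] = 6 - data[reverse_items]   # maps 1<->5 and 2<->4 on a 5-point scale

data["anxiety_total"] = data[["often_fearful", "thinks_failure", "usually_calm"]].sum(axis=1)
print(data["anxiety_total"].tolist())   # [11, 4, 14]
```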

Administering the draft version of the test to a small sample of 15–30 respondents, representative
of the population the test is intended for, can help to ensure that the test items are understandable
and answerable. Thus in this qualitative item analysis phase, the content coverage, wording and
sentence structure of the item pool are fine tuned by the test constructor, then submitted to
reviewers for their comments and for further revision, if any.

After the refinement at these three levels by the test constructor, using information from the pilot
study and the reviewers’ comments, the test is ready for the important stage of quantitative item
analysis, which is carried out on a much larger sample of 200+ respondents.

Quantitative item analysis


This involves a variety of statistical procedures used for the final selection of the items of a test
on the basis of the responses obtained from the samples used in the process of test development.
Two important concepts used for selecting items from the pool are their difficulty level and
discriminative power.

Difficulty level is a measure of the proportion or percentage of respondents who answered the
item correctly; for this reason it is frequently called the P-value (Anastasi 1954; Urbina 2004).
It can range between 0.0 and 1.0, where 1.0
indicates that 100% of examinees responded to the item correctly, indicating that it is an easy
item. In criterion-referenced tests, where each respondent's performance is compared directly
with the standard, without considering how others perform on it (e.g. classroom-achievement
tests), the item difficulty is kept in the range 0.7–0.8, as it has to be within the reach of the
majority of respondents. On the other hand, norm-referenced tests, where the purpose is usually
to rank and compare respondents, are designed to be harder and to spread out the examinees’
scores. To achieve this purpose, an item difficulty index between 0.4 and 0.6 is chosen for such
tests (Urbina 2004). Difficulty level is critical in ability tests, as it differentiates
between people of high and low ability, but in personality tests, difficulty level is applicable only
to ensure that the respondents are in a position to understand and respond as required.
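
For illustration only (this is not part of the procedure described above), the difficulty index can be computed in a few lines of Python; the response matrix below is invented:

```python
# Illustrative sketch only: the difficulty index (P-value) of each item is the
# proportion of respondents answering it correctly. The 0/1 response matrix
# below is invented (rows = respondents, columns = items).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

p_values = responses.mean(axis=0)
print(p_values)   # [0.75 0.75 0.25 0.75] - the third item is the hardest
```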

Discriminative power refers to the extent to which items elicit responses that accurately
differentiate test takers in terms of the behaviours, knowledge or other characteristics that the test
is designed to evaluate (Urbina 2004). For every type of test, be it an ability
test, an achievement test, an attitude or a personality test, this is one of the most important, as
well as basic, qualities an item must possess: individuals high on what the test aims to measure
should score high on the item and those low on that variable should score low on the item.

A simple way to compute the index of discrimination (D) using the classical test theory (CTT)
approach is to arrange the respondents’ total scores (sum or average of all the items) for the test
in descending order and classify the respondents into three distinct groups: those scoring the
highest 27% of marks, those scoring lowest 27% and those in the middle. For each item, the
percentage of respondents in the upper and lower groups who answer correctly or answer in the
intended direction is calculated. The difference is one measure of item discrimination. The
formula is:

(1) D = P_U − P_L

where P_U and P_L are the proportions of respondents in the upper and lower groups,
respectively, who answer the item correctly or in the intended direction. The possible range of
the discrimination index is −1.0 to 1.0. A negative discrimination index may indicate that the
item is measuring something other than what the rest of the test is measuring. More often, it is a
sign that the item score has been wrongly entered. This can also happen because of carelessness
or when items are worded in the reverse direction.
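
The same calculation can be sketched in code. The function name and demonstration data below are hypothetical; the 27% cut-off follows the convention described above:

```python
# Illustrative sketch only: index of discrimination D as the difference between
# the proportions of correct answers in the top 27% and bottom 27% of scorers.
import numpy as np

def discrimination_index(responses):
    """responses: 2-D array of 0/1 item scores, rows = respondents."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)                  # respondents sorted by total score
    k = max(1, int(round(0.27 * len(totals))))  # size of each extreme group
    lower = responses[order[:k]]                # bottom 27%
    upper = responses[order[-k:]]               # top 27%
    return upper.mean(axis=0) - lower.mean(axis=0)   # D for every item, range -1 to 1

demo = (np.random.default_rng(0).random((100, 5)) > 0.4).astype(int)   # fake 0/1 answers
print(discrimination_index(demo))
```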

Another method of determining discriminative power for questions with a right and wrong
answer is the point-biserial correlation (PBC), which measures the correlation between the
correct answer (viewed as 1 = right and 0 = wrong) on an item and the total test score of all
students (Wright 1992). The PBC is sometimes preferred because it identifies
items that correctly discriminate between high- and low-scoring groups, as defined by the test as
a whole, instead of the upper and lower 27% of a group.
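
For illustration, the point-biserial correlation for a single item can be obtained with SciPy; the item scores and totals below are invented:

```python
# Illustrative sketch only: point-biserial correlation between one 0/1-scored
# item and the total test score, using SciPy.
import numpy as np
from scipy import stats

item = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # 1 = correct, 0 = wrong
total = np.array([38, 22, 41, 35, 25, 40, 28, 33])   # total test scores

r_pb, p = stats.pointbiserialr(item, total)
print(round(r_pb, 2))   # clearly positive values suggest the item discriminates well
```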

Summary
Item analysis will thus help the test constructor to decide on the items to select for the final test
by choosing those with levels of difficulty and discriminative power suited to their purpose.

Pilot testing
The revised version of the newly constructed test should be pilot tested on a small sample (15–
40) from the target population, as well as evaluated by a subject expert, to assess its acceptability
and comprehensibility. The respondents should be asked whether they understand the meaning of
the items and to describe them in their own words (Ntourantonis et al 2017).

Checking the reliability and validity of the test


Just as we would expect a weighing scale to display what we actually weigh (validity) and to
show the correct weight every time we use it (reliability), the same trustworthiness is expected in
psychological testing, even though the concepts being measured are not tangible. Reliability is an
indicator of how consistent or stable a test score is over time and validity is an indicator that the
test measures what it is intended to measure. Interestingly, a measure can be reliable without
being valid, but a measure cannot be valid without being reliable (Eldridge 2017; Kimberlin &
Winterstein 2008). This section discusses how the reliability and validity of the constructed test
can be measured and ensured.

Item analysis, both qualitative and quantitative, aims to increase the reliability by taking care of
the errors that can occur due to lack of clarity of test items and instructions and also ensuring the
inclusion of only relevant and discriminating items. Estimating the reliability and validity is
aimed at making the scale even more robust.

Reliability
Reliability (stability or consistency) of a constructed test can be checked using a variety of
methods.

Types of reliability used in test checking

● Test–retest reliability
● Parallel forms reliability
● Split-half reliability
● Internal consistency

Test–retest reliability
The most straightforward method is to administer the test once and then a second time to the
same or a similar group after a suitable gap (not so short that respondents remember the items,
nor so long that they could have changed with respect to the variable being
measured). This is called test–retest reliability, and it measures temporal stability.
correlation of the scores between the two administrations will give the test–retest reliability.
Scores above 0.70 indicate reasonable reliability.
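
A minimal illustration, with invented scores, of computing test–retest reliability as a Pearson correlation:

```python
# Illustrative sketch only: test-retest reliability as the Pearson correlation
# between scores from two administrations (invented data).
import numpy as np
from scipy import stats

time1 = np.array([22, 31, 18, 27, 35, 24, 29])
time2 = np.array([24, 30, 20, 26, 33, 25, 31])

r, _ = stats.pearsonr(time1, time2)
print(round(r, 2))   # values above 0.70 are usually taken as acceptable
```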

Parallel (or alternate) forms reliability


To overcome the possible practice effect of the test–retest method, instead of the same test, a
similar version of the test is administered to the same group. This is called parallel or alternate
forms reliability (it is also called equivalence). A high correlation of above 0.70 between the two
test scores indicates reliability.
Split-half reliability
A third type of reliability, split-half reliability, requires only a single test administration: the test
is split into two comparable halves and the correlation between the scores on the two halves is
calculated.
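
The sketch below illustrates one common way of doing this, correlating odd- and even-numbered items. The Spearman–Brown correction shown is a standard adjustment for the halved test length, although it is not spelled out in the text above; the responses are simulated, so the value itself is meaningless:

```python
# Illustrative sketch only: split-half reliability from a single administration,
# correlating odd- and even-numbered items, then applying the Spearman-Brown
# correction for the halved test length.
import numpy as np
from scipy import stats

responses = np.random.default_rng(1).integers(1, 6, size=(50, 10))  # 10 simulated Likert items

odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

r_half, _ = stats.pearsonr(odd_half, even_half)
corrected = (2 * r_half) / (1 + r_half)   # Spearman-Brown corrected estimate
print(round(corrected, 2))
```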

Internal consistency reliability


A single administration is also enough to determine internal consistency reliability, which
measures how well each item measures the content or construct under consideration. The logic
here is that different items that measure the same construct should display high relation with
each other. There are a variety of internal consistency measures. Usually, they involve
determining how highly these items are correlated and how well they predict each other.
Cronbach's alpha is one commonly used measure if the variable is unidimensional, and it can be
quickly calculated using statistical software.
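
As an illustration, Cronbach's alpha can also be computed directly from its definition; the small response matrix below is invented:

```python
# Illustrative sketch only: Cronbach's alpha computed from its definition,
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score).
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

demo = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
])
print(round(cronbach_alpha(demo), 2))
```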

Test length
It has traditionally been believed that the longer a test is, the more reliable it is, because more
items reduce the error of measurement. Indeed, a sufficient number of items must be included to
cover the content areas tested; however, the quality of the items also contributes to how
efficiently a test measures and separates respondents' ability. Thus, if the quality of the items is
high, a shorter test can have higher reliability than a longer one (Urbina 2004).

Validity
The validity of a test can likewise be checked in several ways.

Types of validity used in test checking


● Content validity
● Face validity
● Construct validity: convergent and divergent (discriminant)
● Criterion validity: concurrent and predictive

Content validity
Content validity is the extent to which test items are relevant to and representative of the concept
being measured (Urbina 2004). Psychological tests designed to aid in the
diagnosis of psychiatric disorders often include, or may even be entirely composed of, items that
reflect critical symptomatic aspects of the syndromes they are designed to diagnose. Here again,
the relevance and representativeness of the items sampled by these instruments are of crucial
importance in determining their usefulness for diagnostic purposes. One way to provide evidence
for content validity is by using subject matter experts to review a measure for any construct
deficiency or contamination. For example, a test of anxiety can be relevant if it adequately
covers the emotional, cognitive and behavioural symptoms of anxiety and also does not
inadvertently cover symptoms of depression, which is another concept.

Criterion validity
Criterion validity is assessed by examining the correlation between the score obtained from the
test and a known outcome indicator (i.e. criterion) of the construct being tested. Criterion
validity can be demonstrated through concurrent validity and predictive validity. Concurrent
validity refers to the extent to which the score (of the test being validated) is related to an
outcome indicator score measured at the same time, whereas predictive validity refers to how
well the score can predict a criterion obtained later (Hubley & Zumbo 2013; Messick 1995). For
example, the scores of a newly constructed test of intelligence are matched to students' current
grades in class (concurrent) and to their final grade point average a year later (predictive).

Construct validity
Construct validity is ‘the degree to which a test measures what it claims, or purports, to be
measuring’ (Cronbach & Meehl 1955). In the classic model of test validity,
construct validity is one of three main types of validity evidence, alongside content validity and
criterion validity. Modern validity theory defines construct validity as the overarching concern of
validity research, subsuming all other types of validity evidence (Messick 1998). Key to
construct validity are the theoretical ideas behind the concept under
consideration. There is no single way to measure it: construct validity should be demonstrated
from a number of perspectives, by an accumulation of evidence (Brown 1996).

Convergent validity and discriminant validity are two subtypes of construct validity. Convergent
validity represents the correlation between the score obtained from the target test (e.g. a newly
developed scale for anxiety) and the score on a test for an existing highly related construct (e.g.
emotional stability) or the score derived from another well-validated test that measures the same
construct (e.g. another scale to measure anxiety). Discriminant validity indicates the relationship
between the scores of anxiety and of an unrelated construct (e.g. intelligence).

Face validity
Face validity refers to the superficial appearance of what a test measures, from the perspective of
a test taker or any other naive observer. Improving face validity does not improve the test's
objective validity, but it is needed for cooperation and public relations – for test users, for those
who select and decide which tests to use, for the observer, and for judicial and legislative
purposes. For example, in measuring the numerical aptitude of engineers, having test items that
deal with machines and tools might elicit more involvement than items about flowers and
oranges.

Factor analysis
The newly constructed test, with its original items and revised (or reduced) items, should be
subjected to exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) to
investigate and verify its factor structure. This method is used to further check its construct
validity. The principal goal of factor analysis is to reduce the number of dimensions needed to
describe data derived from a large number of measures, and in this case it is to explore the
dimensions underlying the variable being measured. Although theorists differ in the use of both
EFA and CFA, one view is that EFA can be used to establish a preliminary construct validity and
CFA can be used if there is theory to support the factor structure (Urbina 2004). The following
sections briefly explain EFA and CFA.

Exploratory factor analysis (EFA)


The first step is to collect a new set of data using the test with the items that survived the pilot
test. There is no clear-cut requirement for sample size, but a sample of over 300 or at least 10
observations per item of the draft scale is recommended to achieve a stable result. The data are
then submitted to EFA to examine the underlying theoretical structure of the test. EFA starts with
finding the correlations among the scores obtained by the respondents to the test items, and the
end product is a factor matrix that lists the loadings of each test item on the factors extracted
from the analyses. The quality of the test is assessed by three elements: factor structure (i.e. the
number of factors to be retained), factor loading and total explained variance. A brief account of
evaluating the three elements is provided below. A detailed explanation of conducting an EFA is
beyond the scope of this article. Interested readers can refer to Fabrigar et al (1999), Osborne et
al (2008) and Tabachnick & Fidell (2013).

Factor structure (factors to be retained)


Scree plots and eigenvalues are the two widely used indicators to determine the number of
factors to be retained. A scree plot is a curve that shows the eigenvalues in a downward direction
(Fig. 1). The number of factors to be retained is determined by referring to the left of the point
where the ‘elbow’ of the graph seems to level off. Factors with eigenvalue greater than 1.0
should be retained. For instance, if the EFA shows five factors but only the first three factors’
eigenvalues exceed 1, then a three-factor, instead of five-factor, solution is recommended.
FIG 1 A scree plot shows the eigenvalues for a 12-item test. Factors sitting to the left of the point
where the ‘elbow’ of the graph levels off are those to be retained. Here, three factors should be
retained.
It is important to note that the eigenvalue method is not without limitations. Researchers can
fix the number of factors according to the underlying theory of the test.
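
For illustration, the eigenvalue rule can be applied programmatically. The sketch below assumes the third-party Python package factor_analyzer and uses simulated responses, so the output carries no substantive meaning:

```python
# Illustrative sketch only: counting eigenvalues above 1.0 before settling on
# the number of factors to retain. Assumes the factor_analyzer package.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(0)
items_df = pd.DataFrame(rng.integers(1, 6, size=(300, 12)),
                        columns=[f"item{i + 1}" for i in range(12)])

fa = FactorAnalyzer(rotation=None)
fa.fit(items_df)

eigenvalues, _ = fa.get_eigenvalues()        # eigenvalues of the correlation matrix
n_retain = int((eigenvalues > 1.0).sum())
print(eigenvalues.round(2), n_retain)
```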

Factor loading
After identifying the number of factors to be retained, the next step
is to examine the loading pattern of the items. Loading shows the
variance explained by the item on that particular factor.
Theoretically speaking, items that are designed for a specific factor
should load onto the target factor. Moreover, such items are
expected to demonstrate (a) a high factor loading (e.g. ≥0.40) on the
target factor and (b) lower factor loading on non-target factors. For
example, in the anxiety scale, items that measure biological
responses such as hand tremors and palpitations should have a high
factor loading (e.g. ≥0.40) on the physiological response factor and
lower factor loading on the emotional and cognitive factors.

If the factor-loading results are different from the expected, for example the factor loading of an
item is consistently low on all factors, researchers can remove such item(s) and submit the
remaining items to another EFA. The procedure can be repeated until all the items show
satisfactory factor loading. When more than one item qualifies for deletion, it is
advisable to remove them one item at a time (starting with the item with the lowest factor
loading) and then to re-run EFA. The sequence of removing items does matter and should be
reported. In addition, it is possible that an item may have acceptable factor loading on both target
and non-target factors. Two solutions are suggested to deal with this cross-loading problem.
Some researchers suggest removing such items to enhance clarity of the structure and ease
interpretation of results; others argue that cross-loading is not uncommon and should be allowed,
especially when it is theoretically adequate.
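
The inspection of loadings can likewise be sketched in code. This again assumes the factor_analyzer package, an arbitrary three-factor solution and simulated data; the 0.40 threshold follows the rule of thumb mentioned above:

```python
# Illustrative sketch only: inspecting the loading pattern after an EFA and
# flagging items whose highest loading is below 0.40 as candidates for removal
# (one at a time, re-running the EFA after each deletion).
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(1)
items_df = pd.DataFrame(rng.normal(size=(300, 12)),
                        columns=[f"item{i + 1}" for i in range(12)])

fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(items_df)

loadings = pd.DataFrame(fa.loadings_, index=items_df.columns,
                        columns=["F1", "F2", "F3"])
weak_items = loadings[loadings.abs().max(axis=1) < 0.40].index.tolist()
print(loadings.round(2))
print("Candidates for removal:", weak_items)
```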

Explained variance
Finally, after identifying a satisfactory factor structure with acceptable factor loadings, it is
critical to examine whether the structure has a high percentage of total explained variance, which
is the part of the model's total variance that is explained by factors that are actually present (e.g.
>50%). Higher percentages of explained variance indicate low discrepancy between a structure
(or model) and actual data (i.e. low error variance) and hence better predictions can be made
(Rosenthal 2011).

Summary
Taken together, a satisfactory EFA result consists of an interpretable factor structure, clear-cut
factor loadings and adequate total explained variance. Note, however, that EFA may generate three types of
result: exactly identical to the structure of the original version of the test, slightly different or
totally different from the original version. Regardless of the results, it is necessary to collect
another set of data and further examine the qualities of the test using CFA.

Confirmatory factor analysis (CFA)


In scale or test development, the purpose of CFA is to examine the fitness of the structure of the
test (also called a model in CFA) or how closely observed data match the relationships specified
in a hypothesized model. Having been through EFA, the test should next be submitted to CFA
(using a new data-set) to further examine whether the structure is supported.

More than a dozen different model fit statistics have been suggested for examining the fitness of
a model and the options continue to increase. Although each statistic has its pros and cons, some
are more widely used than others. For example, Kline (2015) recommends
reporting the χ2-test (and ratio of the χ2-value to degrees of freedom), the root mean square error
of approximation (RMSEA), standardised root mean square residual (SRMR) and comparative
fit index (CFI). The χ2-test, RMSEA and SRMR belong to the category of absolute fit indices,
which examine how far the model being tested is from perfect fit. Put differently, these measures
indicate how bad the model is: a larger value represents a poorer fit. By contrast, the CFI is
a type of relative fit index, which indicates the goodness of a model. The larger the value, the
better the model is.

Experts have suggested cut-off points for these tests. Specifically, a good-fit model shows a ratio
of χ2-values to the degrees of freedom <3, RMSEA ≤ 0.05, SRMR < 0.08
and CFI > 0.95 (Hu & Bentler 1999; Tabachnick & Fidell
2013). If the model shows poor fit, researchers can refer to the
modification indices for recommendations, such as adding error covariance to improve fitness.
However, strictly speaking, modification should be avoided. Recall that the purpose of CFA is to
verify the model suggested by EFA. If the model is modified, its results should be interpreted as
exploratory instead of confirmatory. Moreover, researchers should collect a new data-set and
conduct another CFA to verify the modified model.
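
As a small illustration, the cut-offs quoted above can be applied to the fit statistics reported by whatever CFA/SEM software is used; the values passed in below are invented:

```python
# Illustrative sketch only: applying the quoted cut-offs to reported fit statistics.
def acceptable_fit(chi2, df, rmsea, srmr, cfi):
    return (chi2 / df) < 3 and rmsea <= 0.05 and srmr < 0.08 and cfi > 0.95

print(acceptable_fit(chi2=84.2, df=41, rmsea=0.045, srmr=0.05, cfi=0.97))   # True
```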
A detailed description of conducting CFA and model comparison is beyond the scope of this
article. Interested readers are referred to Byrne (2016), Brown & Moore (2012) and Kline
(2015).

CFA for translated tests


CFA is also conducted for tests that have been translated into another language to ensure that the
translated version is true to the original (translation is discussed in more detail below). It is
noteworthy that the best model of a translated test that is supported by both EFA and CFA could
be the same or different from the structure of the parent version. When the same structure is
found, it implies that translation has had very little impact, if any, on the structure of the test.
Therefore, one can confidently believe that the same respondents would perceive items of the
parent and translated versions equally. On the other hand, minor or major differences imply
either that translation has distorted the meaning of the items or that the theoretical concept or
construct of the test is cross-culturally different. In the latter case, researchers may want to
examine the target population's perception of the construct assessed by the test using a qualitative
approach. Alternatively, if possible, researchers may collect data using both the parent and the
translated versions and evaluate the variance of the structure of the two tests to identify the
sources (e.g. different structure, factor loading) of any differences.

Preparing the test manual


Once the test has proven to be psychometrically sound, with high reliability and validity, a
manual is created that summarises the test-making procedure as well as giving instructions on
how to use the test. The manual should thus cover: (a) the concept being measured; (b) the target
group and purpose of the test; (c) how the test items were developed and selected; (d) the
reliability and validity methods (including factorial structure obtained through EFA) used and
their values; (e) how the test is to be conducted; and (f) how to score and interpret the results.

Translating a rating scale/questionnaire


In certain situations, standardised rating scales/questionnaires might not be available in the
language desired. Rather than create a new test in the required language, an existing test can be
translated using the process described below, to ensure that the psychometric qualities of the
original are not diluted or tampered with. This involves forward and backward translation,
followed by an expert check (Box 3).

BOX 3 The steps in translating a test into another language


1. Forward translation by at least two translators working separately
2. Backward translation by at least two translators working separately
3. Check by a committee of experts
Forward translation
The original questionnaire should be translated into the required language by at least two
independent translators working separately to produce two translations. The translators should be
fluent in the original and target languages and cultures to be able to understand the nuances of
the concept being measured and the test items (Beaton et al 2000). Best results are obtained if
one translator is a language expert and the other
translator is a subject expert, to ensure that language complexity as well as subject matter
intricacies are not missed. The two versions should be compared and discrepancies between them
discussed and resolved between the translators, preferably with input from the researcher or
another unbiased bilingual translator not involved in the previous translations.

Backward translation
After a single agreed version in the required language is created, it should be translated back into
the original language to ascertain the accuracy of translation. This should be done by at least two
translators not involved in the previous exercise, to avoid the influence of familiarity with the
original test. During this phase too, the translators should be fluent in both languages and,
ideally, one should be a subject expert and the other a language expert.

Check by experts
After checking that the backward translation matches the original test, an expert committee
familiar with the concept being measured, as well as language experts (previous translators can be
included) and the researchers, should review the translations to reach a consensus on all items so
as to produce a final version of the translated test that is equivalent in meaning and metric to the
original (Tsang et al 2017). As discussed in the previous section,
fresh data can be collected and a CFA can be run on the translated test to see whether it
maintains the same structure as the original.

Conclusions
We have outlined the steps (Box 4) as well as guidelines for developing, validating and
translating a psychological test. The development of a psychological test requires the test
constructor's thorough understanding of the concept or construct being measured, familiarity
with test formats to select the best format for the test and creation of items appropriate to the
population to be tested. After the test development stage, a pilot study is conducted to see
whether the items are being understood correctly by the intended respondents. The next stage of
validation involves a series of procedures to gauge the test's reliability and validity to make it
psychometrically sound. The process of translating a test involves forward and backward
translation and review of the translations by an expert committee.
