Alva's Logic Test - Technical Manual
September 2020
Contents

1 Background
  1.1 Intelligence
  1.2 General Mental Ability
  1.3 Logical ability
  1.4 GMA and job performance
2 Administration
  2.1 Adaptive testing
  2.2 Calculation of scores
  2.3 Interpretation of results
  2.4 Validation tests
3 Psychometric methods
  3.1 Measurement theory
    3.1.1 Introduction to the measurement model
    3.1.2 Measurement model
    3.1.3 Graphical model
  3.2 Parameter estimation
    3.2.1 Implementation
  3.3 Scoring
  3.4 Item selection
4 Test Development
  4.1 Item design
  4.2 Standardization
  4.3 Calibration
5 Test quality
  5.1 Construct validity
  5.2 Reliability
    5.2.1 Temporal stability
    5.2.2 Information
6 Adverse Impact
  6.1 Sample
  6.2 Level of education
  6.3 Age
  6.4 Gender
7 Description of data
  7.1 Linear test sample
  7.2 Standardization sample
  7.3 Raven’s SPM+ norm group
8 References

List of Tables

Table 4.1: Percentile score distributions for two norm groups of SPM+ and Alva’s logic test
1 Background
Alva’s logic test is a computerized adaptive test aimed for professional assessments
in recruitment settings. This technical manual describes the theory and empirical
evidence behind it.
Alva’s logic test assesses logical ability, i.e., the ability to process complex information and draw accurate conclusions from it. This is an important part of General Mental Ability (GMA), which a large body of research has shown to predict job performance across a wide variety of roles and industries. GMA has been shown to be the single assessment tool with the strongest predictive power for performance. Logical ability is related to the capacity to solve problems, interpret information, learn new things, and make decisions. The more complex the position, the greater the impact of mental ability.
Alva’s logic test is a non-verbal, figures-based test, a format that is widely used in both research and practice. This format is useful because it minimizes the role of previous experience or domain-specific knowledge. The test requires the test taker to identify patterns and relationships in abstract material. Test takers’ ability to do so indicates how proficient they are at solving tasks that involve complex or incomplete information.
The fact that the test is non-verbal also makes it less sensitive to differing levels of
language proficiency compared to other types of GMA tests.
1.1 Intelligence
Many terms are in circulation when discussing logical ability and intelligence, and it can be difficult to compare tests on the market when vendors use different terms to describe them.
Let’s start with the oldest and most widely used term: intelligence. The origins of measuring intelligence with standardized tests go back more than 100 years. Most famous is the endeavour by Alfred Binet and Theodore Simon to develop a standardized measurement for assessing students’ intellectual development. The goal was to identify children who were falling behind developmentally and in need of help. This work resulted in the Binet-Simon intelligence scale, which was later revised into the famous Stanford-Binet scale in 1916 (still in use today).
The term Intelligence Quotient (IQ), coined by William Stern, was originally intended
to describe the ratio between a student’s mental and chronological age. Students
who had a mental ability above their peers would get a high IQ while those who were
behind got a low IQ.
Today, IQ is instead defined in relation to a population. An IQ score between 85 and 115 is average, while scores below 85 or above 115 indicate intelligence below or above average, respectively. The distribution of intelligence in a broad population is often assumed to be Gaussian, or normal, which has a strong foundation in both empirical observations and statistical theory (i.e., the Central Limit Theorem).
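As an illustration, the shares of a Gaussian IQ distribution (mean 100, standard deviation 15) that fall below, inside and above the average band can be computed directly from the normal CDF. This is a sketch using Python's standard library, not part of the test itself:

```python
from statistics import NormalDist

# IQ is conventionally scaled to a Gaussian with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

# Share of the population in the "average" band (85-115, i.e. within one SD).
average = iq.cdf(115) - iq.cdf(85)
below = iq.cdf(85)
above = 1 - iq.cdf(115)

print(f"average: {average:.1%}, below: {below:.1%}, above: {above:.1%}")
```

Roughly 68% of the population falls in the average band, with about 16% on each side.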
[Figure: the normal distribution of intelligence; probability density over below-average, average and above-average ranges.]
theories include a hierarchy of specific abilities below g, and some even challenge the assumption of one overarching factor. The idea has taken hold, however, making the measurement of General Intelligence the goal of most IQ tests today.
Taking all of this into account, the definition of intelligence we adopt at Alva is in-
spired by David Wechsler, Howard Gardner, Linda Gottfredson and Robert Sternberg
to name a few:
Intelligence is the global ability to process information, think rationally, solve
problems, deal with complexity and learn efficiently from experience.
While General Intelligence and IQ are commonly used terms in academia, clinical psy-
chology and in day-to-day conversations, it is more common to talk about General
Mental Ability (GMA) in organizational and industrial psychology.
Our understanding is that the different terminology is not due to IQ and GMA being
different theoretical constructs, nor that the measurements are different in nature.
Rather, it’s due to the fact that the goal of measurement is different. While IQ tests aim
to produce a score that describes individuals’ intelligence in relation to the general
population, GMA tests are mainly used for ranking candidates in recruitment settings.
The reference population is therefore narrower, focusing for example only on the working population. When a part of the adult population is excluded, in this case non-working adults who can be expected to have a lower level of GMA on average, the ability scale is shifted.
A consequence of this shift is that many individuals get a lower score on a GMA test than on an IQ test. This makes sense, since in recruitment settings it is often of interest to differentiate between individuals in the higher ranges of ability. Therefore, the tasks will be more difficult and the scale won’t cover the very lowest ability range. This will of course depend on the test in question, the quality of the tasks, the method for calculating scores, and the data collected during test development, among other things.
Another consequence is the frequent use of norm groups in tests used for recruitment.
Instead of committing to build a single scale that accurately reflects the population
of interest, many test publishers resort to providing many different scales and leaving
the choice of reference sample (or “norm group”) to the test administrator. In our view,
this has caused a lot of confusion regarding the meaning of scores in GMA tests. Our
ambition is to provide one well designed and properly calibrated scale that can be
used across all types of recruitments.
Alva’s logic test measures one central aspect of GMA, which is also referred to as
abstract reasoning, figure reasoning, matrix reasoning and fluid intelligence in some
contexts. It is inspired by an established test format first introduced by John Raven
in 1938, which can be found in modern versions of Raven’s Standard Progressive Matrices (SPM) and Advanced Progressive Matrices (APM), among other contemporary logical ability tests.
In the Cattell-Horn theory of intelligence (Horn & Cattell, 1966), General Intelligence
is divided into fluid and crystallized ability. Raven’s matrices and Alva’s logic test
are both designed to measure fluid ability (gf), which is the ability to reason and solve
novel problems without relying on previously acquired knowledge and skills. It can be
described as the source of intelligence that an individual uses when he or she doesn’t
already know what to do. In contrast, crystallized ability (gc) stems from learned experience and is reflected in tests of knowledge, general information, vocabulary
and other acquired skills. While gf is relatively stable over time, gc often increases
with age due to the accumulation of knowledge (for a more nuanced discussion, see
Hartshorne and Germine, 2015 and Deary, 2012).
While logical ability tests do not fully capture GMA or g, they have consistently been shown to have a high correlation with - or a high “loading” on - general intelligence.
This, together with the practical advantages of being independent of language and
other crystallized abilities, makes them ideal for efficiently approximating GMA in
many settings.
The value of using GMA tests for identifying high potential candidates in recruitment
is well known. The utility and predictive validity of GMA tests has been studied since
the beginning of the 20th century. Research shows conclusively that people with high
GMA are more likely to be top performers than people with low GMA. In the words of
Sternberg and Hedlund (2002):
The so-called general factor (g) successfully predicts performance in virtually all
jobs (Schmidt & Hunter, 1998). We do not believe there are any dissenters to this
view. […] The issue is resolved, and it is not clear that further research will do any-
thing more than to replicate what has already been replicated many times over.
It is also clear that the effect of GMA on job performance grows stronger as the complexity of the job increases. For unskilled jobs, the correlation has been estimated at r = .39 (still a strong relationship in the context of psychological science), and for professional and managerial jobs at r = .74 (Hunter et al., 2006).
These findings are, however, always discussed at an aggregated level, and the effects are calculated for groups of people. Yet, as managers, recruiters, candidates and employees, we are mostly concerned with single individuals. What does it mean for one individual to have a high GMA?
A common theory is that GMA is related to job performance through learning. That is,
a person with high GMA is likely to accumulate relevant skills and knowledge, which
in turn makes them more likely to perform well.
Figure 1.3: The mediating role of knowledge on the relationship between GMA and
job performance
Looking at the schematic above, it is clear that there are more things at play than
intelligence when it comes to job performance. Even the most intelligent person will
not acquire any knowledge without putting in time and effort. And without relevant
knowledge, no amount of GMA will make a top performer.
High intelligence can be seen as a competitive advantage, much like being tall in a
team of basketball players. It certainly helps, but there’s a lot more to being a good
basketball player than being tall. And there’s also a lot more to being a top performer
than having a high GMA.
We believe that GMA should be evaluated in relation to the demands of the job and weighed together with other information about the individual. Personality, previous experience and existing knowledge are all relevant data that should be taken into account. What an individual lacks in one area can be compensated for by strengths in other areas.
2 Administration
2.1 Adaptive testing
In Alva’s adaptive logic test, test takers use their problem solving and abstract reasoning skills to solve logical tasks. They are presented with 20 tasks, to be solved one at a time.
In each task, the objective is to identify the missing piece in a matrix of figures. The
tasks are selected for each individual, based on previous responses, in order to be
challenging but not impossible.
There are six options to choose from. There is one, and only one correct answer to
each task. The correct answer follows a logical pattern that is applied from top to
bottom down the columns, and from left to right across the rows.
There is a time limit of two minutes for each task. Most people complete the test in
15-20 minutes.
Before starting the actual test, test takers get two sample tasks for practice and get-
ting used to the interface.
2.2 Calculation of scores
Scores are calculated in real time after each recorded answer, as an estimation of the test taker’s logical ability, using an algorithm called EAP (eq. 3.9). Before any answers have been recorded, an average level of logical ability is assumed, along with a wide uncertainty range. This assumption is encoded in a prior distribution. After recording the first answer, a score is calculated taking both the prior distribution and the likelihood of the observed response (whether the answer was correct or not) into account (eq. 3.11). This is
repeated throughout the test, and as more answers are recorded, the certainty of the
score increases.
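This update loop can be sketched with a simple grid approximation over the ability scale. The item parameters and responses below are made up for illustration; they are not taken from Alva's item bank:

```python
import numpy as np

def irf(theta, a, b, c):
    """3PL item response function (eq. 3.1): probability of a correct answer."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

grid = np.linspace(-4, 4, 401)        # grid over the ability scale (z-scale)
posterior = np.exp(-grid**2 / 2)      # standard normal prior, unnormalized

# Hypothetical item parameters (a, b, c) and responses (1 = correct, 0 = incorrect).
items = [(1.2, -1.0, 1/6), (1.0, 0.0, 1/6), (1.5, 0.5, 1/6)]
answers = [1, 1, 0]

for (a, b, c), x in zip(items, answers):
    p = irf(grid, a, b, c)
    posterior *= p if x == 1 else 1 - p            # multiply in the likelihood
    w = posterior / posterior.sum()
    eap = (w * grid).sum()                         # current score (EAP)
    psd = np.sqrt((w * (grid - eap) ** 2).sum())   # current uncertainty (PSD)
    print(f"EAP = {eap:+.2f}, PSD = {psd:.2f}")    # PSD typically shrinks per answer
```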
The first task is selected randomly from the easiest tasks in the item bank, and the fol-
lowing two are selected to be increasingly challenging. This applies to all test takers,
and the purpose is to help them “warm up” to the test before getting more challeng-
ing tasks. After the warm-up tasks, subsequent tasks are selected randomly from the
10 most informative tasks in the item bank given test takers’ estimated logical ability
(eq. 3.13).
The test is completed when answers to 20 tasks have been recorded. This includes
both tasks where the test taker provided an answer and tasks where the time ran out
(in which case the answer will be treated as incorrect).
2.3 Interpretation of results
The standard score is an estimation of logical ability, relative to the adult working population. The scale is commonly referred to as the Standard Ten (STEN) scale, and it has a mean of 5.5 and a standard deviation of 2.
The most common standard scores are 5 and 6. The percentile ranges for these scores are wide because they cover a large part of the population. Standard scores above 9 or below 2 are much less common, resulting in narrow percentile ranges.
[Figure: the Standard Ten scale (1-10), with probability density bands labelled below, slightly below, average, slightly above and above.]
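The share of the population covered by each STEN score follows directly from the scale definition (mean 5.5, SD 2 on the z-scale). The sketch below assumes rounding to the nearest integer with clipping to the 1-10 range for extreme z-values:

```python
from statistics import NormalDist

z = NormalDist()  # logical ability on the standard normal z-scale

def sten_share(s):
    """Share of the population whose STEN score round(2z + 5.5),
    clipped to 1-10, equals s."""
    lo = -float("inf") if s == 1 else (s - 6) / 2
    hi = float("inf") if s == 10 else (s - 5) / 2
    return z.cdf(hi) - z.cdf(lo)

for s in range(1, 11):
    print(f"STEN {s:2d}: {sten_share(s):5.1%}")
```

STEN 5 and 6 each cover about 19% of the population, while the extreme scores cover only a few percent each, matching the percentile-range pattern described above.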
The test is adaptive, which means that the tasks are uniquely selected for each in-
dividual to match their ability level. A person with a logical ability above average,
for example, will get questions that are above average in difficulty. This is why two
people with the same number of correct answers can get different standard scores.
2.4 Validation tests
Organizations have the possibility to validate candidates’ results from the adaptive logic test. This can only be done once per candidate, and Alva strongly recommends administering the validation test on-site under supervision. Best practice is to incorporate the validation test late in the recruitment process, for example when the candidate visits the office for an interview.
The validation test consists of 20 tasks with a time limit of 2 minutes per task, just like
the original test. It takes between 15 and 20 minutes to complete. The candidate will
get new tasks that they have not been exposed to in the original test.
After the candidate completes the validation test, the administrator gets a side-by-
side comparison of the results along with guidance on how to interpret potential dif-
ferences. Reasons for the differing results include, but are not limited to:
Even with different scores from the original and validation test, it is unlikely that the
underlying construct of logical ability changes significantly in the short term.
Please note that the candidate will not receive their own results from the validation
test. The purpose is for you as an organization to validate candidates’ original score
and determine their true score.
3 Psychometric methods
Alva’s tests are built on second generation psychometrics, also called Item Response
Theory (IRT; van der Linden & Hambleton, 1997) or Latent Trait Theory. The measure-
ment theories behind psychometric testing have evolved over the years and while
most psychometric tests still rely on first generation psychometrics, or Classical Test
Theory (CTT), there are a number of clear advantages in second generation psycho-
metrics:
• IRT supports adaptive testing, where the most informative questions are presented based on previous responses. In this way, candidates are presented with the questions most relevant for them, making testing more efficient.
• IRT supports item banking, which means that the pool of questions in the test
can be continuously developed and increased.
• IRT scoring increases the accuracy of the results, by taking item characteristics
into account. This means that fewer questions are needed to get accurate re-
sults.
• While CTT deals with a fixed number of questions to form a test scale, IRT deals
with each question, or item, separately. This makes the process of continuously
improving a test simpler - adding one question doesn’t change the entire scale.
The statistical model used in Alva’s logic test is called the Three-Parameter Logistic
(3PL) model (Birnbaum, 1968). This model captures three characteristics in which tasks
can vary - difficulty, discrimination and guessing.
The difficulty of a task defines its location on the latent ability scale. Setting guessing aside, an individual who attempts to solve a task that is perfectly matched with his or her logical ability has a 50% chance of finding the correct answer.
The discrimination of a task is defined as the “steepness” of the slope separating indi-
viduals with higher logical ability from individuals with lower logical ability. The slope
is highest at the difficulty level of the task - that’s where the task is most informative
about individuals’ logical ability.
The pseudo-guessing parameter of a task is defined as the probability of guessing
the correct answer randomly. This is especially relevant for individuals with an abil-
ity far below the difficulty of a task - even though the probability of finding the cor-
rect answer by logical reasoning is very low, there are only 6 options to choose from.
Therefore, one can expect a “lucky guess” once every 6 tries (which translates to a
probability of around 17%).
By taking these characteristics into account, logical ability can be estimated with
higher precision than in classical tests.
The model specifies a statistical relationship between individuals’ logical ability, denoted by the Greek symbol 𝜃, and the probability of solving a given task correctly.
We define 𝑋𝑖,𝑗 as the outcome for person 𝑖 ∈ {1, ..., 𝑁 } for item 𝑗 ∈ {1, ..., 𝑀 }.
The probability of a correct response for person 𝑖 on a given item 𝑗 is given by the
following Item Response Function (IRF):
𝑝(𝑋𝑖,𝑗 = 1 | 𝜃𝑖, 𝛼𝑗, 𝛽𝑗, 𝛾𝑗) = 𝛾𝑗 + (1 − 𝛾𝑗) / (1 + 𝑒^(−𝛼𝑗(𝜃𝑖 − 𝛽𝑗))) (3.1)
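Eq. 3.1 translates directly into code. The parameter values below (𝛼 = 1.7, 𝛽 = 0.5, 𝛾 = 1/6) are illustrative:

```python
import math

def irf(theta, alpha, beta, gamma):
    """Eq. 3.1: probability of a correct response under the 3PL model."""
    return gamma + (1 - gamma) / (1 + math.exp(-alpha * (theta - beta)))

# At theta == beta the logistic term equals 1/2, so p = gamma + (1 - gamma) / 2.
p_match = irf(theta=0.5, alpha=1.7, beta=0.5, gamma=1/6)
print(round(p_match, 3))  # 0.583, i.e. 7/12

# Far below the difficulty, p approaches the guessing floor gamma = 1/6.
p_low = irf(theta=-4.0, alpha=1.7, beta=0.5, gamma=1/6)
print(round(p_low, 3))
```

Note that with a guessing floor the probability at 𝜃 = 𝛽 is slightly above 50%, which is why the 50% statement above applies to reasoning alone.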
[Figure: item response function 𝑝(𝑋 = 1) for a task with 𝛼 = 1.7, 𝛽 = 0.5 and 𝛾 = 0.17, plotted over 𝜃 from −2 to 2.]
An alternative way to represent the 3PL model is by the equivalent graphical model. In this notation, a circle represents a latent random variable (𝜃, 𝛼, 𝛽 and 𝛾) and a shaded node represents an observed variable (𝑋).
3.2 Parameter estimation
To be able to use an IRT model for scoring, the model parameters need to be trained using observed data. This term is common in statistics and machine learning, and it means that numerical optimization is used to estimate the parameters in the model. The result of this process is the set of parameter values that best explain the observed data.
A fully observed dataset with 𝑁 individuals and 𝑀 items would lead to 𝑁 + 3𝑀 parameters and 𝑁𝑀 observations. This is quite a complex model, with many parameters relative to the number of observations, which means that the risk of overfitting is high. Overfitting is an issue when parameter estimates vary a lot depending on the particular dataset used for estimation, making them unreliable for inferences outside the observed data (generalization). In scenarios like this, it is best practice to apply regularization to reduce the variance in the model and prevent overfitting (Hastie, Tibshirani & Friedman, 2009).
At Alva, we use Bayesian inference for parameter estimation (Fox, 2010). We apply Gaussian priors (probability distributions assigned to model parameters before observing any data), which is equivalent to L2 regularization (Koller & Friedman, 2009), to control model complexity and generate stable results from the optimization.
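The equivalence between a Gaussian prior and L2 regularization can be seen from the negative log density of the prior, which adds a quadratic penalty to the loss being optimized. A minimal numerical check:

```python
import math

def neg_log_gaussian(theta, sigma=1.0):
    """Negative log density of a zero-mean Gaussian prior on theta."""
    return 0.5 * (theta / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

# Up to an additive constant, the contribution is (1 / (2 sigma^2)) * theta^2:
# a quadratic (L2 / ridge) penalty whose strength grows as the prior narrows.
lam = 1 / (2 * 1.0**2)
theta = 1.3
penalty = neg_log_gaussian(theta) - neg_log_gaussian(0.0)
print(penalty, lam * theta**2)  # identical
```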
Our models are implemented in the probabilistic programming language PyMC3 (Salvatier, Wiecki & Fonnesbeck, 2016), which enables efficient optimization through state-of-the-art sampling methods (Hamiltonian Monte Carlo with the NUTS sampler) and variational inference. We were inspired by Luo & Jiao (2017), who have provided code for implementing the 3PL model in a Bayesian framework using similar software (Stan; Carpenter et al., 2017).
New questions and a constantly growing database of users enable us to continuously analyze and calibrate model parameters. With more data, we learn more about question characteristics and the distribution of logical ability in the population.
3.2.1 Implementation
We set a standard normal prior probability distribution for both the ability and the
difficulty parameters:
𝜃0 ∼ 𝒩(0, 1) (3.2)
𝛽0 ∼ 𝒩(0, 1) (3.3)
Normed results from Raven’s Standard Progressive Matrices - Plus version (Raven, Raven & Court, 1998) were used as an informative prior, to calibrate the model toward this gold standard measurement of logical ability.
Then we set an informative normal prior for the discrimination parameter, regularizing
the model towards the Rasch model:
𝛼>0 (3.6)
The guessing parameter is based on the a priori probability of randomly selecting the correct option, which for the logic test is 1/6. The beta distribution is appropriate, since it is bounded to the interval [0, 1]:
𝛾0 ∼ Beta(1, 5) (3.7)
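A Beta(a, b) distribution has mean a / (a + b), so the Beta(1, 5) prior in eq. 3.7 is centered on 1/6, matching the a priori chance of picking the correct option out of six at random. A quick Monte Carlo check:

```python
import random

random.seed(0)

# Draws from the Beta(1, 5) prior on the pseudo-guessing parameter (eq. 3.7).
gamma0 = [random.betavariate(1, 5) for _ in range(100_000)]
mean_gamma = sum(gamma0) / len(gamma0)

# The Beta(1, 5) mean is 1 / (1 + 5) = 1/6, the chance of a lucky guess.
print(round(mean_gamma, 3))
```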
Finally, the responses 𝑋 are modelled as Bernoulli trials, with the probability of success given by eq. 3.1:
𝑋𝑖,𝑗 ∼ Bernoulli(𝑝(𝑋𝑖,𝑗 = 1 | 𝜃𝑖, 𝛼𝑗, 𝛽𝑗, 𝛾𝑗)) (3.8)
3.3 Scoring
In CTT, scoring consists of two steps - first, a raw score is calculated (typically a sum over the responses), and second, the raw score is transformed into a normed score (typically using a norm table). The normed score depends directly on the sample (“norm group”) used to generate the norm table. It also requires responses to all
questions in the scale, making adaptive testing either impossible or dependent on creative tricks from the test developer.
In IRT, scoring is a statistical estimation of the construct of interest, 𝜃. The result-
ing score is often on a standard normal scale, often referred to as the z-scale (see
eq. 3.2).
The scoring process only depends on the observed responses and the parameters of
the administered questions, not any specific sample. There are no concepts of raw
scores or norm tables in IRT.
The question parameters, however, depend on the sample(s) used in the parameter
estimation process. One can argue that the parameter estimation process is compa-
rable to the norming process in CTT, but with the advantage of being more flexible in
including prior information and aggregating over multiple data sources.
In Alva’s tests, the Expected A Posteriori (EAP) algorithm is used for scoring (van der
Linden & Pashley, 2010; van der Linden & Glas, 2000). This is an application of Bayes’
theorem, where EAP is defined as the expectation of the latent trait over the posterior
distribution.
𝜃̂𝐸𝐴𝑃 ≡ ∫ 𝜃 𝑔(𝜃 | 𝑥1, …, 𝑥𝑗) 𝑑𝜃 (3.9)
The Posterior Standard Deviation (PSD) is used to quantify the uncertainty of the score.
This is roughly comparable to the Standard Error of Measurement (SEM) in CTT, but
with the advantage of being specific to each individual. The SEM, by contrast, is as-
sumed to be constant for all values of 𝜃. PSD is defined as the standard deviation of
the posterior distribution.
𝑃𝑆𝐷 ≡ (∫ [𝜃 − 𝜃̂𝐸𝐴𝑃]² 𝑔(𝜃 | 𝑥1, …, 𝑥𝑗) 𝑑𝜃)^(1/2) (3.10)
First, the unnormalized posterior distribution of the latent trait given the observed
responses is calculated as a product of the prior given by eq. 3.2 and the likelihood
function given by eq. 3.1.
𝑔(𝜃 | 𝑥1, …, 𝑥𝑗) = 𝑝(𝜃0) ∏_{𝑗=1}^{𝑚} 𝑝(𝑋𝑗 = 𝑥𝑗 ∣ 𝜃, 𝛼𝑗, 𝛽𝑗, 𝛾𝑗) (3.11)
Second, the posterior expectation and posterior variance are estimated using the quadrature method for approximating integrals. This is repeated for every administered question, until a satisfactory posterior variance has been reached. This approach is described in detail by de Ayala (2008).
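A minimal numerical sketch of eqs. 3.9-3.11, using a simple grid in place of the quadrature nodes; the item parameters and response pattern are illustrative:

```python
import numpy as np

def irf(theta, a, b, c):
    """Eq. 3.1: 3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def eap_psd(responses, items, grid=np.linspace(-4, 4, 201)):
    """Grid approximation of eqs. 3.9-3.11: build the unnormalized
    posterior, then take its mean (EAP) and standard deviation (PSD)."""
    post = np.exp(-grid**2 / 2)                # standard normal prior (eq. 3.2)
    for x, (a, b, c) in zip(responses, items):
        p = irf(grid, a, b, c)
        post *= p if x == 1 else 1 - p         # likelihood of each response
    w = post / post.sum()                      # normalize on the grid
    eap = (w * grid).sum()                     # eq. 3.9
    psd = np.sqrt((w * (grid - eap) ** 2).sum())  # eq. 3.10
    return eap, psd

# Illustrative items (a, b, c) and responses (1 = correct, 0 = incorrect).
items = [(1.7, -0.5, 1/6), (1.4, 0.0, 1/6), (1.7, 0.5, 1/6), (1.2, 1.0, 1/6)]
eap, psd = eap_psd([1, 1, 1, 0], items)
print(f"EAP = {eap:.2f}, PSD = {psd:.2f}")
```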
Finally, the scores are transformed from the z-scale to the Standard Ten scale using
a simple transformation and rounded to the nearest integer.
𝑆𝑇𝐸𝑁 = 2𝜃̂𝐸𝐴𝑃 + 5.5 (3.12)
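Eq. 3.12 as a function; clipping to the 1-10 range for extreme z-values is assumed here, since the source states only the rounding:

```python
def sten(theta_eap):
    """Eq. 3.12: transform a z-scale score to the Standard Ten scale,
    rounded to the nearest integer (clipping to 1-10 assumed)."""
    return max(1, min(10, round(2 * theta_eap + 5.5)))

print(sten(1.2))  # -> 8
```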
3.4 Item selection
Throughout the test, the score (EAP) and the uncertainty of the score (PSD) are updated after each recorded answer. New tasks are selected adaptively based on how much information they will provide to the next update. In general, the tasks with the most information are those whose difficulty level is closely matched to the logical ability of the test taker.
To select the next task throughout the test, the Maximum Posterior Weighted Informa-
tion criterion (van der Linden & Pashley, 2010) is used.
That is, the information function 𝐼(𝜃) is calculated for all remaining tasks in the item bank and multiplied with the posterior distribution of 𝜃 (eq. 3.11). The posterior weighted information is then estimated by integrating out 𝜃. Finally, the 10 tasks with the highest posterior weighted information are identified, and the next question is chosen randomly from them.
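The selection step can be sketched as follows. The item bank here is randomly generated for illustration, and the posterior is a Gaussian stand-in for the one produced by eq. 3.11; the Fisher information formula is the standard one for the 3PL model:

```python
import numpy as np

def irf(theta, a, b, c):
    """Eq. 3.1: 3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def information(theta, a, b, c):
    """Standard Fisher information of a 3PL item at ability theta."""
    p = irf(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

rng = np.random.default_rng(0)
grid = np.linspace(-4, 4, 201)

# Posterior over theta after some responses (narrowed Gaussian stand-in).
posterior = np.exp(-((grid - 0.8) ** 2) / (2 * 0.5**2))
posterior /= posterior.sum()

# A made-up item bank: (a, b, c) per task.
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-2.5, 2.5), 1/6) for _ in range(50)]

# Posterior-weighted information per item, integrating theta out on the grid.
pwi = np.array([(information(grid, a, b, c) * posterior).sum() for a, b, c in bank])

top10 = np.argsort(pwi)[-10:]          # the 10 most informative tasks
next_item = bank[rng.choice(top10)]    # choose the next task at random
print(f"next task difficulty: {next_item[1]:.2f}")
```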
4 Test Development
Alva’s logic test has been developed according to European standards, set by the Eu-
ropean Federation of Psychologists’ Associations (EFPA). Their test review model is the
framework used by most certification agencies in Europe. It contains, among other
things, a guide for evaluating the quality of the documentation, norms, reliability, va-
lidity and result reports.
4.1 Item design
A set of 45 tasks was designed to measure logical ability in a fixed, linear testing format. The tasks followed a structure similar to other established tests on the market, such as Raven’s Matrices and the Figure Reasoning Test (FRT).
In a small pilot study, results on this test were compared to FRT version A, a test used
for admission to Mensa in Sweden. In a sample of 39 participants, the correlation was
r=0.77, which is considered Excellent by EFPA.
4.2 Standardization
In 2019, the adaptive version of Alva’s logic test was developed entirely using Item
Response Theory (IRT). The data used in the process consisted of responses to the
existing 45 tasks from the linear test by 2,295 Alva users and responses to 50 newly
designed tasks by an additional standardization sample consisting of 286 partici-
pants.
As a first step, we estimated parameters for the 45 tasks in the linear test using data
from Alva’s platform. The number of observations per task ranged between 1,743 and
2,295, with a mean of 2,133 and a standard deviation of 252. Tasks that were not
reached due to participants running out of time were filtered out.
In the second step, tasks were divided into 3 parallel sets, so that there was an overlap
of 15 tasks between the sets. This is common practice to be able to link observations
across parallel sets. Each set consisted of 39 tasks, both old and new. Using Ama-
zon Mechanical Turk, 286 participants were recruited and assigned to one of the sets
randomly.
Parameter estimation for the entire bank of tasks was performed using item param-
eters from the first step as priors in the Bayesian model. This, together with the fact
that some tasks were overlapping, ensures that the parameters for the new tasks are
on the same scale as the parameters for the old tasks. Normed results from Raven’s Standard Progressive Matrices - Plus version (SPM+) for 134 participants were used as priors for the ability parameter, to anchor the scale at appropriate values. This procedure
of simultaneous estimation with informative priors means that all available informa-
tion - from the large sample of Alva users and previous testing with SPM+ - is explicitly
taken into account in the final parameter estimates.
4.3 Calibration
Table 4.1: Percentile score distributions for two norm groups of SPM+ and Alva’s logic
test
The second calibration study consisted of 55 members of Mensa Sweden who com-
pleted the adaptive version of Alva’s logic test. Mensa is an organization for highly
intelligent individuals and only those with a measured IQ above 130 are admitted.
Since Alva’s logic test measures an ability that is closely related to IQ, Mensa members should achieve results above average. Specifically, given some measurement error in the Mensa admission process, the average result for Mensa members should be close to 9.1, with a standard deviation close to 1.2.¹ The observed average score was 8.9 and the observed standard deviation was 1.2, which is comparable to the expected values. This indicates that Alva’s logic test is well calibrated for a general adult population.
¹ These figures are based on a simulation, where 100,000 samples were drawn from the IQ score distribution (Gaussian with a mean of 100 and a standard deviation of 15) to represent true ability levels, and random noise was added (Gaussian with a mean of 0 and a standard deviation of 9) to represent measurement error. Results above 130 were labeled “admitted”, and the mean and standard deviation of the true ability scores of the admitted group were calculated and transformed to the Standard Ten scale (Gaussian with a mean of 5.5 and a standard deviation of 2).
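The simulation in the footnote can be reproduced in a few lines; exact figures will vary slightly with the random draw:

```python
import random

random.seed(1)
N = 100_000

# True ability on the IQ scale, plus measurement error in the admission test.
true_iq = [random.gauss(100, 15) for _ in range(N)]
observed = [t + random.gauss(0, 9) for t in true_iq]

# Members are admitted on the *observed* score; we look at their *true* ability.
admitted_true = [t for t, o in zip(true_iq, observed) if o > 130]

mean_iq = sum(admitted_true) / len(admitted_true)
var_iq = sum((t - mean_iq) ** 2 for t in admitted_true) / len(admitted_true)

# Transform from the IQ scale (mean 100, SD 15) to Standard Ten (mean 5.5, SD 2).
mean_sten = 5.5 + 2 * (mean_iq - 100) / 15
sd_sten = 2 * var_iq**0.5 / 15
print(f"mean STEN = {mean_sten:.1f}, SD = {sd_sten:.1f}")
```

Because admission selects on a noisy observed score, the true ability of the admitted group regresses slightly toward the mean, which is why the expected STEN average is around 9.1 rather than higher.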
Most test publishers work according to the waterfall model - they develop content for a new test, collect data, implement and launch. They then do not touch the content until it is time for the next version. Once the new version is out, it is treated as an entirely new product, and the difference from the previous version may be very large. Results from the new version are often not comparable to those from the old version.
At Alva, we started out with a best-in-class logic test, powered by best practices in machine learning and modern test theory. We are now also adopting an agile model in the way we develop and iterate on our tests. Instead of waiting up to five years to launch an entirely new version of our logic test (which is common in the industry), we continuously collect data and introduce new tasks. This way, we make sure that the test stays ahead of the curve and performs even better than before.
4.4.1 What we do
5 Test quality
The usefulness of any measurement depends on the quality of the measurement process. For psychological tests, this comes down to the validity of the test results. The most widely accepted definition of validity is:
the degree to which evidence and theory support the interpretations of test
scores for proposed uses of tests (AERA et al. 2014)
In the case of Alva’s logic test, the proposed use is selection in recruitment settings.
In earlier chapters, we have presented existing theory and research supporting the
use of GMA tests to predict job performance. This chapter presents evidence from
our own studies.
5.1 Construct validity

When using a psychological test, it is of great interest to both the administrator and the respondent that the test actually measures what it claims to measure. This is what construct validity is all about.
A construct is a theoretical and statistical concept. It is the “hidden truth” about
individuals that we assume causes the responses to the questions in a test. With this
assumption, we can use the test results to calculate estimates of the construct.
In the case of Alva’s logic test, the construct of interest is the logical ability of individuals. The tasks in the tests make up the measurement.
There are several ways to collect evidence for the construct validity of a psychological test. The most common method is to collect data from two tests that are designed to measure the same or similar constructs and calculate the correlation between the scores of the two tests. This is often referred to as the convergent validity of a test. The hypothesis is that if one and the same construct causes the results of two separate tests, then the scores of the tests should correlate highly. According to European standards from EFPA, convergent validity coefficients above 0.75 are deemed Excellent.
The construct validity of Alva’s logic test was estimated using a sample of 134 participants who completed both Alva’s test and Raven’s Standard Progressive Matrices Plus (SPM+). The observed validity coefficient was r=0.83.
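A convergent validity coefficient of this kind is simply the Pearson correlation between the paired scores from the two instruments. A minimal sketch, using made-up score vectors (the study data are not public):

```python
import numpy as np

# Hypothetical paired standard scores for eight participants
alva_scores = np.array([4.0, 6.5, 5.0, 8.0, 7.0, 3.5, 9.0, 6.0])
spm_scores = np.array([4.5, 6.0, 5.5, 7.5, 7.5, 3.0, 8.5, 6.5])

r = np.corrcoef(alva_scores, spm_scores)[0, 1]  # convergent validity coefficient
print(round(r, 2))
```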
Sample
The sample consisted of 58% males and 42% females. The age ranged from 20 to 63
with an average of 34.9 and a standard deviation of 9.3. The education level of the
participants was high. 63% had completed a Bachelor’s degree or equivalent, 7% a
Master’s degree or equivalent, 29% secondary education or high school and 1% other
educational background.
Instruments
Alva’s logic test is a computerized adaptive test aimed at professional assessments in recruitment settings. Raven’s SPM+ is an extended version of the original matrix test, in use since 1938 to measure abstract reasoning.
Study
Participants were recruited using Amazon Mechanical Turk, an online crowdsourcing platform. They were asked to complete 39 tasks from Alva’s item bank and 36 tasks from Raven’s SPM+ in two separate sessions, one week apart. All participants received compensation.
The standard score for Alva’s test was calculated using the Expected A Posteriori (EAP)
method, with the same model, item parameters and prior distribution as in the live
test. The total score was used to represent results on Raven’s SPM+. Tasks from sets A
and B from SPM+ were left out, due to being overly simplistic, and the total score was
estimated using table 35 from the SPM+ manual (Raven, Raven & Court, 1998).
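As an illustration of EAP scoring, the sketch below computes the posterior mean of ability on a quadrature grid for a two-parameter logistic (2PL) model with a standard normal prior. The item parameters, the response pattern and the 2PL form are assumptions for the example; the live test uses its own model, parameters and prior (see chapter 3):

```python
import numpy as np

def eap_score(responses, a, b, prior_mean=0.0, prior_sd=1.0, n_grid=161):
    """Expected A Posteriori ability estimate under a 2PL model."""
    theta = np.linspace(-4.0, 4.0, n_grid)                  # quadrature grid
    prior = np.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
    # 2PL probability of a correct response for each item at each grid point
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    resp = np.asarray(responses)[:, None]
    likelihood = np.prod(np.where(resp == 1, p, 1.0 - p), axis=0)
    posterior = likelihood * prior
    posterior /= posterior.sum()
    eap = float((theta * posterior).sum())                  # posterior mean
    psd = float(np.sqrt(((theta - eap) ** 2 * posterior).sum()))  # posterior SD
    return eap, psd

# Hypothetical item parameters and a response pattern (1 = correct)
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([-0.5, 0.0, 0.5, 1.0])
print(eap_score([1, 1, 0, 1], a, b))
```

The posterior standard deviation (PSD) returned here is the per-session uncertainty estimate discussed in section 5.2.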
5.2 Reliability
Reliability refers to the overall consistency of a measurement. A test with high reliability produces similar results under similar conditions.
While a high degree of validity implies high reliability of a measurement, the inverse is not true. A test can have high reliability but measure something completely irrelevant.
In the framework of Classical Test Theory, reliability is defined as the ratio of true score variance to the total variance of test scores. Since the true score is impossible to observe directly, reliability is instead estimated using methods such as test-retest reliability, internal consistency and parallel-test reliability. From the reliability coefficient, a Standard Error of Measurement (SEM) can be estimated, which provides an uncertainty range for test scores.
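The SEM follows directly from the reliability coefficient and the score standard deviation. A minimal sketch, using the sten scale's standard deviation of 2 and the test-retest coefficient of 0.81 reported later in this chapter:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), from Classical Test Theory."""
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=2.0, reliability=0.81)
print(round(sem, 2))  # → 0.87
```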
In the framework of Item Response Theory, reliability has a less prominent role. On the
one hand, reliability can reasonably be estimated from the Test Information Function
(see section below) which is an extension of the concept of reliability. On the other
hand, the uncertainty of a measurement is more commonly estimated for each test
session using methods like PSD (eq. 3.10).
In this section we will nonetheless focus on the more familiar concept of reliability, to be consistent with technical manuals for classical tests.
5.2.1 Temporal stability

Test-retest reliability is estimated by administering the test twice to the same respondents and correlating their responses to the questions. It is common practice to administer the test with an interval of two weeks, which was also used in this study.
According to European standards set by EFPA, a test-retest reliability coefficient
above 0.8 is deemed Good and above 0.9 Excellent.
The test-retest reliability of Alva’s logic test was estimated using a sample of 117 participants who completed the test twice with 14-15 days in between. The observed correlation was r=0.81.
Sample
The sample consisted of 58% males and 42% females. The age ranged from 20 to 63
with an average of 35.1 and a standard deviation of 9.4. The education level of the
participants was high. 63% had completed a Bachelor’s degree or equivalent, 6%
a Master’s degree or equivalent, 29% secondary education or high school and 2%
other educational background.
Study
Participants were recruited using Amazon Mechanical Turk. They were asked to com-
plete a set of 39 tasks from Alva’s item bank twice, with two weeks in between.
5.2.2 Information
It is well known, even to classical test theorists, that measurement precision is not
uniform across the scale of measurement. Tests tend to distinguish better for test-
takers with moderate trait levels and worse in the higher and lower score ranges. In
IRT, the concept of reliability is extended from a single number to a function called
the Test Information Function.
In statistics, information (or, more specifically, Fisher Information) refers to how strong the relationship is between data and the parameters of a statistical model. High levels of information mean that parameters can be estimated efficiently. In Alva’s logic test, the Test Information Function tells us how well the logical ability of individuals can be estimated using the tasks in the item bank (eq. 3.14).
𝑟 = 1 − 1/𝐼 (5.1)
According to EFPA standards, an average information across the scale of measurement above 10 is deemed Excellent and above 5 is deemed Good. This is equivalent to a reliability coefficient above 0.9 and 0.8, respectively.
The average information across the score range for Alva’s logic test is 7.4, with a maximum of 8.5 at the higher end of the range and a minimum of 4.8 at the lowest end. This reflects the fact that the test currently contains more medium-hard tasks than easy tasks. Using eq. 5.1, this translates to an average reliability of r=0.86, with a maximum of 0.88 and a minimum of 0.79.
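The conversion in eq. 5.1 is straightforward to verify:

```python
def information_to_reliability(information):
    """Reliability coefficient from test information, per eq. 5.1."""
    return 1.0 - 1.0 / information

for info in (7.4, 8.5, 4.8):
    print(round(information_to_reliability(info), 2))  # → 0.86, 0.88, 0.79
```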
[Figure: Test information (top panel) and reliability (bottom panel) plotted against the standard score (1-10), with the EFPA benchmark bands Inadequate, Adequate, Good and Excellent.]
6 Adverse Impact
6.1 Sample
Alva’s logic test is developed to represent the “working population”, with test results normalized on a STEN scale, which assumes a normal distribution 𝑋 ∼ 𝒩(5.5, 2). Many of Alva’s customers operate in industries with job positions that are generally considered “complex”, and samples from Alva’s database are thus expected to have a mean slightly above 5.5 and a standard deviation close to 2.
Furthermore, it is important to note that a sample from Alva’s database is not representative of the “working population”, since it is highly dependent on the current customer portfolio. For example, if Alva’s customers were primarily hiring for PhD positions, the sample would consist of candidates with relatively high age and presumably high logical ability, which in turn would not be representative of the general “working population”. The job positions offered by Alva’s customers, also in terms of social norms (for example, “bus driver” applicants are predominantly men), shape the sample. This means that we look at the statistics and try to explain the variance by examining various factors.
This analysis was performed on 2020-07-16 on a sample of 20,827 people who took Alva’s adaptive logic test between 2019-12-09 and 2020-07-16. The characteristics of the sample are described below.
The sample mean is 6.52, significantly higher than the expected mean of 5.5 for a working population. This indicates that the sample on average has higher logical ability than the working population, most likely because many of Alva’s customers offer more complex and competitive jobs. The sample standard deviation of 2.26 is close to the expected value. The skewness of -0.54 indicates that the sample is moderately skewed towards the upper bound of the scale, which is also consistent with the underlying assumptions about the sample.
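The sample characteristics above (mean, standard deviation and skewness) can be computed as follows. Since the underlying data are not public, this sketch uses a synthetic sten-like sample; clipping to the 1-10 scale produces the kind of negative skew described above:

```python
import numpy as np

def sample_characteristics(scores):
    """Mean, sample standard deviation and moment-based skewness."""
    x = np.asarray(scores, dtype=float)
    mean = x.mean()
    sd = x.std(ddof=1)
    skewness = np.mean(((x - mean) / sd) ** 3)
    return mean, sd, skewness

# Synthetic sten-like sample, shifted upwards like the observed data
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(6.5, 2.3, 20_000), 1, 10)
print(sample_characteristics(scores))
```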
6.2 Education

Following from our definition of GMA, we expect a relationship between highest attained education level and results on the logic test. Given the connection between GMA and effective learning, individuals with higher GMA should be more likely both to initiate and to successfully finish higher education.
By splitting the sample between different levels of education, one observes sample means between 5.15 and 6.98. The different sub-samples all have a sample standard deviation slightly above 2 and a skewness between -0.59 and -0.25. The sub-sample characteristics are presented in the table below.
The group with the highest sample mean is the group with “Post-secondary education,
3 or more years”, which is expected. Generally, groups with higher education receive
higher test results.
6.3 Age
We expect a slightly negative correlation between age and logical ability, since logical ability is closely related to the concept of Fluid Intelligence (see, for example, Hartshorne and Germine, 2015, and Deary, 2012).
By splitting the sample into different age categories, we observe that the group aged 20-29 receives slightly higher test results than the total sample. Characteristics of the different sub-samples are presented in the table below.
Overall, one observes a slightly negative correlation between age and results on the logic test, which is in line with expectations. However, it is worth noting that there are large differences in the sizes of the sub-samples. Furthermore, Alva has a significant number of customers running internship programs, which typically attract high-performing young adults. On the other hand, there are also customers with more “blue-collar” jobs, which are less prestigious and could potentially attract an older population. Before drawing any conclusions from this result, one should investigate whether some of the variance might be explained by differences in jobs and their candidate pools.
6.4 Gender
We do not expect any significant relationship between gender and logical ability. We believe that differences observed in some studies on related tests of fluid intelligence are due to methodological flaws (see, for example, Blinkhorn, 2005) and/or sample limitations (see, for example, Dykiert, Gale & Deary, 2008).

In the demographic data, the vast majority (99.9%) selected “male” (52.2%) or “female” (47.2%), making these the only two sub-samples relevant to investigate. The sub-sample characteristics are presented in the table below.
Comparing the sub-samples, one can see that the sample mean for men is slightly higher than the sample mean for women.
The difference in sample means of 0.62 points on the STEN scale between men and women is not dramatic, but worth investigating. One hypothesis is that some of Alva’s customers offer complex jobs that attract more men than women, and vice versa. One example is Engineering roles, which typically require a higher level of logical ability but also generally have more male than female candidates. Such characteristics would skew the sample results.
To investigate whether the variance might be explained by job characteristics, the differences between logic test results for male and female candidates were computed for each job position with more than 3 female and 3 male applicants. The stochastic variable 𝑋𝑖 was defined as the difference between the average logic test results for men and women for job 𝑖. The analysis then treats 𝑋 as a normally distributed variable and compares the hypothesis 𝐻0 ∶ 𝜇 = 0 to 𝐻1 ∶ 𝜇 ≠ 0. Rejecting 𝐻0 would indicate that there most likely is a systematic difference in logic test results between men and women. The characteristics of 𝑋 are described in the table below.
Table 6.5: t-test of the observed score differences between men and women in 425
job positions
The p-value for the two-tailed t-test was 0.13, which means that we cannot reject (𝛼 = 0.05) the hypothesis that the average difference between logic test results for men and women is 0. This result is consistent with the different sample means for men and women being explained by the characteristics of the jobs on the Alva platform.
In conclusion, when controlling for the type of job position, one cannot identify any meaningful differences in logic test scores between male and female candidates.
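The analysis above is a one-sample t-test of the per-job mean differences. A sketch with hypothetical per-job differences follows; the p-value uses a normal approximation to the t distribution, which is adequate for the 425 jobs in the study:

```python
import math

def one_sample_t_test(x):
    """Two-tailed one-sample t-test of H0: mu = 0.

    Uses a normal approximation to the t distribution for the p-value,
    which is adequate for large samples.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    t = mean / math.sqrt(var / n)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
    return t, p

# Hypothetical per-job differences X_i (male mean minus female mean, sten points)
differences = [0.4, -0.3, 0.1, -0.5, 0.2, 0.0, -0.2, 0.3, -0.1, 0.1]
t, p = one_sample_t_test(differences)
print(round(t, 2), round(p, 2))
```

A large p-value, as in the study, means the per-job differences are centered close enough to zero that no systematic gender difference can be inferred.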
7 Description of data
In the development of Alva’s logic test, the following samples were used:
• Linear test sample: used to train the initial parameters of the model.
• Standardization sample: used to extend the item bank with an additional 50
items.
• SPM+ norm group: used to calibrate the model to a general adult population.
Table 7.1: Characteristics of the linear test sample

                        n      %
Gender
  Male                  1,485  65%
  Female                780    34%
  Other/Rather not say  30     1%
Age
  20-34                 1,808  79%
  35-44                 303    13%
  45-54                 119    5%
  55-64                 18     1%
  Rather not say        47     2%
Total                   2,295
Table 7.2: Characteristics of the standardization sample

                                      n    %
Gender
  Male                                190  66%
  Female                              95   33%
  Other/Rather not say                1    0%
Age
  20-34                               182  64%
  35-44                               65   23%
  45-54                               24   8%
  55-64                               13   5%
  Rather not say                      2    1%
Education level
  Lower secondary                     2    1%
  Upper secondary                     76   27%
  Post-secondary, less than 3 years¹  -    -
Table 7.3: Characteristics of the SPM+ norm group

         n    %
Age
  19-34  83   40%
  35-44  35   17%
  45-54  32   16%
  55-64  56   27%
Total    206
[…] The adults were tested individually. A quota sampling procedure was employed. Testers were asked to seek out respondents of appropriate age, sex, place of residence (e.g. large city, small town, village), and education to yield […]

¹ Not among the options in data collection.
8 References
Hartshorne, J. K., & Germine, L. T. (2015). When Does Cognitive Functioning Peak? The Asynchronous Rise and Fall of Different Cognitive Abilities Across the Life Span. Psychological Science, 26:4, 433-443.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction (Second Edition). Springer.
Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and crystallized general intelligences. Journal of Educational Psychology, 57, 253-270.
Hunter, J. E., Schmidt, F. L., & Le, H. (2006). Implications of direct and indirect range
restriction for meta-analysis methods and findings. Journal of Applied Psychology,
91, 594-612.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge.
Luo, Y., & Jiao, H. (2018). Using the Stan Program for Bayesian Item Response Theory. Educational and Psychological Measurement, 78:3, 384-408. DOI: 10.1177/0013164417693666
Raven, J., Raven, J. C., & Court, J. H. (1998). Manual for Raven’s progressive matrices
and vocabulary scales (1998 edition). Oxford: Oxford Psychologists Press.
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:55. DOI: 10.7717/peerj-cs.55
Spearman, C. (1904). “General Intelligence”, Objectively Determined and Measured.
The American Journal of Psychology, 15:2, 201-292.
Sternberg, R. J. & Hedlund, J. (2002). Practical Intelligence, g, and Work Psychology.
Human Performance, 15, 143-160.
van der Linden, W. J., & Hambleton, R. K. (Eds.) (1997). Handbook of Modern Item
Response Theory. New York: Springer.
van der Linden, W. J., & Glas C. A. W. (Eds.) (2000). Computerized Adaptive Testing:
Theory and Practice. Netherlands: Springer.
van der Linden, W. J., & Pashley, P. J. (2010). Item Selection and Ability Estimation in Adaptive Testing. In: van der Linden, W. J., & Glas, C. A. W. (Eds.), Elements of Adaptive Testing. New York: Springer. DOI: 10.1007/978-0-387-85461-8_1